CONTAINS and CONTAINSTABLE

Release 1.3 - ...

Introduction

CONTAINS returns exact or literal matches for a search phrase. Queries for the word run, for instance, will only match content containing the exact word run and not runs or running. Only content containing character-for-character matches is returned. However, you can select the degree of imprecision, closeness, or fuzziness in your search using additional query string options. Taken to its extreme, you can make CONTAINS functionally equivalent to FREETEXT.

As with the FREETEXT search, the CONTAINS predicate supports the ability to specify whether to search a single column, multiple columns, or all columns. The CONTAINS predicate supports several types of modifiers, including the following:

  • Boolean
  • Phrase
  • Prefix
  • Generational
  • Proximity
  • Weighted

Boolean Searches

You can use Boolean operators in your search condition (such as AND, AND NOT, and OR for each term of your search condition, shortcut equivalents resp.: & &! | ). Boolean operators allow you to search for combinations of multiple search tokens and phrases that might not be contiguous (right next to each other). The results include documents that contain both the words sword and shield, regardless of where they occur in the document content.

SQL CopyCode image Copy Code
/* Examples of a Boolean Search */
SELECT Nr, Title FROM vwtsContentPUB WITH (NOEXPAND)
WHERE CONTAINS ( *, N'nieuw & (product | produkt) &! lelijk', LANGUAGE 1043 );

SELECT Nr, Title FROM vwtsContentPUB WITH (NOEXPAND)
WHERE CONTAINS ( *, N'mooi AND (helicopter OR helikopter) AND NOT kapot', LANGUAGE 1043 );

Phrase Searches

You can search for phrases as opposed to a single word. To specify a phrase in a CONTAINS search, you have to wrap the entire phrase in double quotes. Example: "honden en katten".

If you look at the Messages tab in SSMS after running this query, you’ll notice that SQL Server returned an informational warning message because the search phrase contains the noise word en if you have used the Dutch language (LCID 1043).

The issue of full-text search conditions that contain noise words can be problematic when one token in a Boolean search is a stopword.

SQL CopyCode image Copy Code
/* Phrase search example (may return results) */
SELECT Nr, Title FROM vwtsContentPUB
WHERE CONTAINS ( *, N'"honden en katten" AND huisdier', LANGUAGE 1043 );

/* This query may return no results */
SELECT Nr, Title FROM vwtsContentPUB
WHERE CONTAINS ( *, N'honden & en & katten', LANGUAGE 1043 );

By default this query returns no results, even though there are documents that contain the word honden and katten. This is because the word en is on the Dutch stoplist, so no documents will ever match the Boolean condition "honden" and "en". You have three possible options to get around this behavior, as listed following:

  1. Strip stopwords out of search conditions before submitting them to the server. This could be done in a client application prior to performing the query.
  2. Remove stopwords you want to include in searches from the stopword list. In SQL Server 2008, you could create a custom stoplist (possibly based on an existing system stoplist), remove the stopwords you want to include in queries, and assign the newly created stoplist to a full-text index.
  3. Enable stopword transformations via server settings. Note that SQL Server 2008 still refers to this as "noise word" transformations. You can change this setting via sp_configure:

SQL CopyCode image Copy Code
/* Replace stopwords in search queries with wildcards */
EXECUTE sp_configure 'show advanced options', 1;
RECONFIGURE WITH OVERRIDE;
GO

EXECUTE sp_configure 'transform noise words', 1;
RECONFIGURE WITH OVERRIDE;
GO

After you change the transform noise words server option, SQL Server replaces stopwords with a wildcard asterisk (*) in search conditions. This means the search condition shown is transformed to '"honden" and "*" and "katten"', and will return results.

Prefix Searches

The CONTAINS predicate also allows you to do basic wildcard prefix searches. This is indicated by the wildcard asterisk (*) operator. A wildcard prefix search for the term run* returns documents containing the words run, runs, runt, running, runner: in short, anything that matches the first part of the search argument up to the *. No stemming is done in a prefix search, so that a search on mice* doesn’t return content containing the word mouse, for instance.

For wildcard searches to work, you must wrap the terms in double quotes. If you don’t include the search term in double quotes, SQL Server treats the trailing asterisk as part of the search term and attempts to match the search term exactly, trailing asterisk included. Also note that if your search phrase includes multiple words, all of the words in the phrase are treated as wildcard prefixes. That is, a search for "al* anon" performs a prefix search for words that begin with al* followed immediately by words that begin with anon*, including Al Anon, alcoholics anonymous, Allan Anonymous, and many others.

One commonly requested wildcard feature is suffix searches: using the wildcard character at the start of a search term like *ing to match all words ending in the suffix ing. SQL iFTS doesn’t allow this type of search. If you absolutely need this type of search functionality, one approach is to store all of your content in reverse, so “The rain in Spain stays mainly in the plain” could be stored as “nialp eht ni ylniam syats niapS ni niar ehT”. Then if you wanted to search for words ending in ain, you could simply reverse the search suffix and append the wildcard character to the end like this: nia*. It’s not clear why the FTS engine doesn’t support suffix-based wildcard searching, as some other RDBMS search components do (including both DB2 and Oracle).

Generational Searches

Generational searches are searches in which either, or all, of the following occur:

  • Words are stemmed, which simply means that additional forms of the words are generated and matched, including plural nouns and verb forms. This is known as inflectional term generation.
  • Search terms are replaced via language-specific thesaurus replacement sets. This is included in thesaurus term generation.
  • Search terms are expanded via language-specific thesaurus expansion sets. This is also part of thesaurus term generation.

Though the FREETEXT will automatically perform word stemming and thesaurus expansions and replacements, the CONTAINS predicate does not. In order to perform generational searches with CONTAINS, you have to use the FORMSOF operator and indicate whether you want to use INFLECTIONAL or THESAURUS forms.

SQL CopyCode image Copy Code
/* Inflectional and Thesaurus Term Generation examples */
SELECT Nr, Title FROM vwtsContentPUB WITH (NOEXPAND)
WHERE CONTAINS ( *, N'FORMSOF(INFLECTIONAL, punch)', LANGUAGE 1033 );

SELECT Nr, Title FROM vwtsContentPUB WITH (NOEXPAND)
WHERE CONTAINS ( *, N'FORMSOF(THESAURUS, punch)', LANGUAGE 1033 );

Proximity Searches

SQL Server iFTS also allows you to perform searches that require search terms to be in close proximity to one another. You can use proximity searches, via the NEAR operator, to return more relevant results. The theory behind proximity search is that documents with search terms that are close to one another are probably more relevant to the search topic. The NEAR operator returns a higher RANK for matching documents when the search terms are in close proximity to one another, and a lower RANK when the search terms are far apart. When the search terms are separated by more than 50 words, the RANK drops to 0. The query below performs a proximity search and restricts the results to those with a RANK greater than 0.

SQL CopyCode image Copy Code
/* Proximity search example */
SELECT c.Nr, c.Title, k.[RANK]
FROM vwtsContentPUB c WITH (NOEXPAND)
INNER JOIN CONTAINSTABLE (vwTsContentPUB, *, N'smartsite NEAR iXperion') k ON c.Nr = k.[KEY]
WHERE k.[RANK] > 0;

The NEAR operator is internally mapped to the AND operator. The NEAR operator is basically useless with the iFTS CONTAINS and FREETEXT predicates (it’s no different from using the AND operator). To get a true proximity search, you need to use CONTAINSTABLE or FREETEXTTABLE and restrict the results with a WHERE clause. Your WHERE clause predicate should specify that the RANK column returned by the iFTS function needs to be greater than some value for a higher quality match.

At the time of this writing, there’s no way to change the definition of “closeness” that proximity search uses. This feature may or may not be included in a service pack or in the next version of SQL Server.

If you use the same search term twice in a proximity search, as in "africa" NEAR "africa", iFTS looks for two distinct instances of the search term (in this case "africa") in close proximity to one another in the searched content. Documents containing only one instance of the search term are not returned by the proximity search.

Weighted Searches

It is possible to maximize the impact of the term you want to include while minimizing the contribution of the term you want to exclude. In iFTS terms, what you want is a weighted search. In this type of search, you apply weights to your tokens to make some tokens more relevant than others.

The WEIGHT value ranges from 0 to 1, with 1 being the highest. The weights you assign to your search terms also affect the result rankings generated by the CONTAINSTABLE function.

SQL CopyCode image Copy Code
/* Weighted search example */
SELECT c.Nr, c.Title, k.[RANK]
FROM vwtsContentPUB c WITH (NOEXPAND)
INNER JOIN CONTAINSTABLE (vwtsContentPUB, *, N'ISABOUT(smartsite WEIGHT(1), iXperion WEIGHT(.2))') k ON c.Nr = k.[KEY]
ORDER BY k.[RANK] DESC;