Multilinguality, stemmers and tokens

Release 1.3 - ...

Multilingual sites

The LANGUAGE parameter mentioned before has nothing to do with limiting the results on texts in a particular language. SQL Server is not clever enough to determine the language of text.

As an example, if you search for gift, you’ll get rows that contain the words gifted, gift, gifts, and gifting. Gift in German, however, means poison. This means you’ll get rows with English content containing the word gift, but also rows in German that refer to the German word for poison. If you have searched for gift in German language, however, SQL Server will also find German stemmed words for gift, while ignoring words like gifted, gift, gifts, and gifting.

In most languages (but not in all) capitalization is unimportant for word-breaking purposes.

Compound words and hyphenation

The Dutch language treats hyphenation differently than most other languages. In most languages, a hyphen implies that a word should be treated as a single unit, and the unit will sometimes appear unhyphenated. The English word data-base is indexed as data, base, and database. In Dutch, hyphenation is for the most part preserved, so that kop-van-jut is indexed as three separate words: kop, van, and jut. There are some exceptions such as the word kop-hals-rompboerderij, which is indexed as kop-hals-rompboerderij, kop-hals-romp, boerderij, and kophalsrompboerderij.

SQL CopyCode image Copy Code
/* iFTS (SQL Server 2008) parser test examples - compare these results */
SELECT * FROM sys.dm_fts_parser(
    N'FORMSOF(FREETEXT, kop-hals-romp)'
  , 1043 /* Dutch, use 1033 for English */
  , NULL
  , 0
);
SELECT * FROM sys.dm_fts_parser(
    N'FORMSOF(FREETEXT, kop-hals-rompboerderij)'
  , 1043
  , NULL
  , 0 
);

Because of differences in how various word breakers handle hyphenation and index hyphenated words, do not strip hyphens out of search queries. You should specify the same language settings you used to create the index. If you indexed your content in Dutch and you search for the word data-base using the English language word breaker, the search will attempt to locate the words data and base; you won’t find content containing the word database. If your content was indexed using the US English word breaker, however, you would find content containing the word database.

When iFTS indexes content, it doesn’t store any language-specific metadata with the tokens it adds to the full-text index. This means that there’s no mechanism for applying language-specific rules to indexed content after it’s already been indexed. In fact, language-specific rules are applied at only two distinct times:

  1. A limited set of language-specific rules (whitespace handling, hyphenation, and so forth) are applied during the word breaking portion of the population process.
  2. The full set of language-specific rules, possibly including stemming and thesaurus expansions and replacements, are applied to your search phrase at query time.

Each full-text indexed column can only be designated with a single LCID. To hold documents and content from multiple languages with different language-specific indexing rules and different character sets, the most recommended approach is to store an LCID with the content. This LCID can be used to narrow your full-text searches to only the content that’s relevant to the language you’re interested in.

SQL CopyCode image Copy Code
/* Example of querying multilingual table columns */
SELECT Nr, Title, QueryData, Column_LCID /* Just an example column */
FROM vwtsContentPUB 
WHERE FREETEXT
(
      (Title, Body, CTSpecificText1)
    , N'gift' LANGUAGE 1033 /* English */
) AND Column_LCID = 1033
UNION ALL
SELECT Nr, Title, QueryData, Column_LCID /* Just an example column */
FROM vwtsContentPUB
WHERE FREETEXT
(
      (Title, Body, CTSpecificText3)
    , N'gift' LANGUAGE 1043 /* Dutch */
) AND Column_LCID = 1043;

An alternative approach to handle multilingual content is to separate them into multiple Search Groups, resulting to language-specific indexed views like vwtsContentPUB_NL and vwtsContentPUB_EN.

SQL CopyCode image Copy Code
/* Example of querying multiple single-language indexed views */
SELECT Nr, Title, QueryData
FROM vwtsContentPUB_EN 
WHERE FREETEXT
(
      *
    , N'gift'
)
UNION ALL
SELECT Nr, Title, QueryData
FROM vwtsContentPUB_NL
WHERE FREETEXT
(
      *
    , N'gift'
);