FTS/iFTS native thesauri

Release 1.3 - ...

Introduction

SQL Server provides powerful full-text search thesauruses that can be used for customizing searches for relevant text. Using thesauruses, you can define two types of custom patternbased rules:

  • Expansions that can automatically expand the scope of your searches to include additional relevant search terms.
  • Replacements that substitute your search terms with other specified terms. Thesaurus files are language-specific and can be modified to suit your specific needs.

See this item for integrating FTS thesaurus feature into Smartsite Indexed Search.

Thesaurus Files

SQL Server uses language-specific XML thesaurus files stored in the file system. These files are stored in the MSSQL\FTData subdirectory under the SQL Server installation directory. Each thesaurus is named using the format ts<language>.xml, where is a three-character language code or global for the global thesaurus, e.g. tsNLD.xml for the Dutch thesaurus. Note the sample thesaurus file beneath.

XML CopyCode image Copy Code
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>

    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>

    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>

    <expansion>
      <sub>run</sub>
      <sub>jog</sub>
    </expansion>
  </thesaurus>
</XML>
The root <XML> element has an ID attribute that is set to Microsoft Search Thesaurus. The ID attribute is not mandatory, and doesn’t appear to affect the functionality of the thesaurus files if you change it to another value or remove it altogether. Subsequent rootlevel elements after the first <XML> element are ignored.

The <thesaurus> element must have its XML namespace set to x-schema:tsSchema.xml. This is mandatory: not setting the XML namespace for this element, or setting it to an incorrect URI, will result in your thesaurus file being ignored and having no effect on your queries.

You can edit XML thesaurus files using a simple text editor such as Notepad or a more specialized XML editor such as Altova XMLSpy. We can expect thesaurus files to be incorporated completely into the database in a future release.

The thesaurus files must be saved in Unicode format with the byte order mark. The byte order mark should be automatically added to the file by your editor when you save it in Unicode format.

Expansion Sets

An expansion set in the thesaurus file is defined by <sub> tags contained within an <expansion> tag. If any of the words contained in the <sub> tags are encountered, your search is automatically expanded to include all other words defined in the expansion set. You can include as many expansion sets as you want in a SQL Server 2008 XML thesaurus file, but each expansion set word is limited to 512 Unicode characters. Through expansion sets, you gain the ability to search for multiple synonyms of any given search term.

SQL CopyCode image Copy Code
/* Example of Thesaurus Expansion */
-- Original query:
SELECT Nr, Title FROM dbo.vwTsContent WHERE CONTAINS ( Body, N'FORMSOF(THESAURUS, IE)' );
-- Will be internally expanded to the equivalent query:
SELECT Nr, Title FROM dbo.vwTsContent WHERE CONTAINS ( Body, N'"Internet Explorer" OR IE OR IE5' );

Note that Expansion Sets can be used for multilingual search queries (without usage of foreign language stemmers).

Replacement Sets

Replacement sets provide an alternative to expansion sets. Like expansion sets, replacement sets are defined using XML in the thesaurus files. Replacement sets also have the same 512-character limitation on terms. You can define replacement sets with <pat> and <sub> tags contained within a <replacement> tag. Replacement sets simply replace your search term with another search term per your definition. Where expansion sets are commonly used to define synonyms for words, replacement sets are useful for redefining miss-spellings, acronyms and abbreviations in your searches.

SQL CopyCode image Copy Code
/* Example of Thesaurus Replacement */
-- Original query:
SELECT Nr, Title FROM dbo.vwTsContent WHERE CONTAINS ( Body, N'FORMSOF(THESAURUS, NT5)' );
-- Will be internally replaced to the equivalent query:
SELECT Nr, Title FROM dbo.vwTsContent WHERE CONTAINS ( Body, N'"Windows 2000"' );

Note that Replacement Sets can be used for miss-spelled or phonetically spelled search queries as well as translations.

(Re)loading thesaurus files

In versions prior to SQL Server 2008, a change to a thesaurus file requires the full-text search service to be restarted in order to pick up the changes. SQL Server 2008 introduces the new sys.sp_fulltext_load_thesaurus_file procedure which loads the file automatically causing SQL Server to pick up the changes. This procedure accepts up to two parameters:

  • A mandatory int LCID parameter indicating which file should be loaded. The LCID is mapped to a language-specific thesaurus file; for instance, LCID 1043 is mapped to the Dutch language thesaurus file tsNLD.xml.
  • An optional bit parameter indicating whether the thesaurus file should be loaded if it has been previously loaded. If you set this parameter to 0 (the default), the thesaurus file is reloaded whether it was previously loaded or not. A value of 1 will cause the thesaurus file to be loaded only if it wasn’t previously loaded.

SQL CopyCode image Copy Code
/* (Re)loading the Dutch language thesaurus file */
EXEC sys.sp_fulltext_load_thesaurus_file 1043;
You can modify which XML file is used by changing the entry
for SQL Server 2008 in HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSearch\Language\nld ➤ TsaurusFile
for SQL Server 2005 in HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\Language\nld ➤ TsaurusFile

Global and Local Thesauruses

SQL Server 2005 had the neutral language thesaurus (tsneu.xml), which was used when you specified an LCID of 0. SQL Server 2008 replaces the neutral language thesaurus with the global thesaurus (tsglobal.xml). Like the neutral language thesaurus, the global thesaurus is used when you specify an LCID of 0. Unlike the neutral language thesaurus, the global thesaurus now works in tandem with local language thesauruses no matter which language you specify in your query. When you specify a full-text search with thesaurus expansion, the global thesaurus is consulted in addition to the local language thesaurus, regardless of which language you indicate.

Ambiguous rules

When ambiguous rules are encountered, expansion rules take precedence over replacement rules. An example of ambiguity occurs when a thesaurus file contains one replacement rule and two expansion rules for the same term. The expansion rules take precedence over the replacement rule, so the replacement rule is ignored. If more than one expansion rules mention the same term, the one that’s defined first is used; the other rule is ignored. When you try to load an ambiguous thesaurus file, SQL Server returns only a generic warning message; it’s up to you to manually locate and fix the offending rules.

Overlapping rules

When overlapping rules occur between the global and a local thesaurus, the local thesaurus rules take precedence over the global thesaurus.

When overlapping rules occur between terms like "IE 5" and "IE", the longest term takes precedence over the other(s).

Upgrading to SQL Server 2008

The thesaurus technology is hardly changed in SQL Server 2008 compared to SQL Server 2005. During the upgrade process, existing thesaurus files are copied to the MSSQL\FTData\FTNoiseThesaurusBak subdirectory of your SQL Server installation directory. You can simply copy your existing SQL Server 2005 thesaurus files from this subdirectory to the MSSQL\FTData subdirectory to start using them with SQL Server 2008.