Stoplists and noise word lists

Release 1.3 - ...

Introduction

All versions of SQL Server that have supported full-text search have included an option to ignore certain words that are considered unimportant for search purposes. In SQL Server 2005 and earlier, this option was known as the noise word list. Noise word lists were language-specific text files stored in the file system. In SQL Server 2008, noise word lists have been replaced by stoplists, which are likewise composed of words, known as stopwords, that are unimportant for search. Unlike noise word lists, stoplists are stored in the database rather than in the file system. SQL Server 2008 also provides T-SQL language improvements designed to make stoplists more flexible than noise word lists and to facilitate and ease their management.

The reasons for creating or using a stoplist include the following:

  • To prevent your full-text indexes from becoming bloated with unnecessary and unimportant words
  • To improve the quality of full-text search results by eliminating unimportant word matches from the results
  • To increase the efficiency of full-text searches by eliminating searching and stemming of irrelevant words

SQL Server 2005 and prior

In SQL Server 2005 and prior versions, FTS utilizes noise word lists to indicate words that should be ignored during full-text searches. These noise word lists are stored as plain text files under SQL Server’s MSSQL\FTDATA directory with names like noiseXXX.txt, where the XXX indicates a three-letter language code such as ENU for U.S. English (LCID 1033) and NLD for Dutch (LCID 1043). In prior versions of SQL Server, you were allowed a single noise word list per supported language.

SQL Server 2008 (and beyond)

SQL Server 2008 changes this with stoplists. Stoplists are stored in the database instead of the file system, and you can create as many stoplists as you like for any supported language. The only limitation is that you can assign only one stoplist to any given full-text index.

System Stoplists

By default, SQL Server 2008 provides system stoplists for most languages that iFTS supports. You can view the system stoplists by querying the sys.fulltext_system_stopwords catalog view. This view returns a column with the stopword entries from the system stoplists and a language_id column with the LCID of each entry.

SQL CopyCode image Copy Code
/* Query the Dutch Language System Stoplist */

SELECT *
FROM sys.fulltext_system_stopwords
WHERE language_id = 1043;
You can specify that a full-text index use the system stoplist by adding the WITH STOPLIST = SYSTEM clause to your CREATE FULLTEXT INDEX or ALTER FULLTEXT INDEX statements. If you don’t specify a stoplist when you create a full-text index, SQL Server defaults to the systemstoplist. If you don’t want your full-text index to use a stoplist at all, use the WITH STOPLIST = OFF clause when you create it.

Creating Custom Stoplists

You can create and manage custom stoplists in SQL Server 2008, using either the SSMS GUI or T-SQL statements. When you create a custom stoplist, you can either create an empty stoplist or base it on a system - or an existing full-text stoplist. To create a custom stoplist in the SSMS GUI, right-click on Your database (e.g. EmptySix) ➤ Storage ➤ Full Text Stoplists in the Object Explorer. Then select New Full-Text Stoplist... and follow the wizard. The owner of the stoplist is most likely dbo.

SQL CopyCode image Copy Code
/* Create custom stoplist using T-SQL */

CREATE FULLTEXT STOPLIST EmptySix_Stoplist
FROM SYSTEM STOPLIST /* or an existing custom stoplist; omit this line to create an empty stoplist  */
AUTHORIZATION dbo;

Managing Custom Stoplists

After you’ve created a custom stoplist, you can add or remove stopwords from the stoplist by right-clicking the name of the stoplist in the Object Explorer and selecting Properties from the pop-up context menu. Note: you can manage stopwords for any specific language (or for all languages) within a single full-text stoplist per database.

SQL CopyCode image Copy Code
/* Manage custom stoplist using T-SQL */ 
ALTER FULLTEXT STOPLIST EmptySix_Stoplist ADD 'mijn_stopwoord' LANGUAGE 1043; 
ALTER FULLTEXT STOPLIST EmptySix_Stoplist ADD 'my_stopword' LANGUAGE 1033; 
ALTER FULLTEXT STOPLIST EmptySix_Stoplist DROP 'mijn_stopwoord' LANGUAGE 1043;

Once you’ve created a custom stoplist, you can use the SSMS Create Full-Text Index wizard to create a new full-text index that uses it. You can also assign it to an existing full-text index by right-clicking on the full-text indexed table and selecting Full-Text index ➤ Properties. In the Full-Text Index Properties window, you can assign a custom stoplist to the full-text index. Note: if your full-text index is set for automatic change tracking, changing the stoplist will kick off a full population.

SQL CopyCode image Copy Code
/* Assign custom stoplist to an existing full-text index using T-SQL */ 
ALTER FULLTEXT INDEX ON vwtsContentPUB SET STOPLIST = EmptySix_Stoplist;
SQL CopyCode image Copy Code
/* Unassign any stoplist to an existing full-text index using T-SQL */
ALTER FULLTEXT INDEX ON vwtsContentPUB SET STOPLIST = OFF;

SQL Server 2008 provides three catalog views that are useful for retrieving the contents of system and custom stoplists:

SQL CopyCode image Copy Code
SELECT * FROM sys.fulltext_system_stopwords;
SELECT * FROM sys.fulltext_stoplists;
SELECT * FROM sys.fulltext_stopwords;

Determining Stoplist

In order to determine which stoplist (if any) is used for a full-text index, one could execute the following query (SQL Server 2008 only):

SQL CopyCode image Copy Code
SELECT object_name(object_id) AS IndexedObject,
CASE ISNULL(fi.stoplist_id, -1)
 WHEN 0 THEN 'System Stoplist'
 WHEN -1 THEN 'No Stoplist'
 ELSE fs.name
END AS UsedStoplist
FROM sys.fulltext_indexes fi
LEFT JOIN sys.fulltext_stoplists fs ON fi.stoplist_id = fs.stoplist_id;

Upgrading Noise Word Lists to Stoplists

When you upgrade a full-text catalog from a SQL Server 2005 database, SQL Server doesn’t upgrade your old noise word lists automatically. It will copy the noise word lists to the MSSQL\FTData\FTNoiseThesaurusBak subdirectory under your SQL Server 2008 installation directory, but that’s about it. If you only used the default system noise word lists in SQL Server 2005, you won’t need to upgrade your old noise word lists, since the SQL Server 2008 system stoplists have equivalent content.

If you customized your old noise word lists, however, you’ll need to create a custom stoplist and import the old noise words into the stoplist. SQL Server doesn’t provide a standard utility to perform this function, but it’s easy enough to create one with the OPENROWSET rowset provider.

Stoplists and Indexing

When you create or populate a full-text index, SQL Server uses the stoplist associated with the index to ignore instances of the stopwords during full-text indexing. Basically, iFTS won’t index any stopwords it encounters during the indexing process. SQL Server will only ignore exact matches of the stopwords, however.

For example, when you add fish as a stopword, a search query on words like fishes, fished or fishing will still yield results, while only ignoring the literal word fish.

Noise file

The location of the specified noise file is set in the registry: HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\Language\nld ➤ NoiseFile.

After editing the noise file, rebuild the catalog (*Note, if you have a large catalog, do this off-hours as it will impact performance):

SQL CopyCode image Copy Code
/* Rebuild the catalog after editing noise file(s) */
ALTER FULLTEXT CATALOG SmsSearch
REBUILD WITH ACCENT_SENSITIVITY = OFF;