Diacritics and Unicode

Release 1.3 - ...

ISO/IEC 8859-1 and Unicode

ISO/IEC 8859-1 describes an 8-bit single-byte coded graphic character set, informally referred to as Latin-1. It is generally intended for "Western European" languages. When you use a Latin-1 based collation for your SQL Server database, all characters defined in this set are saved in the database natively, that is, without conversion to entities. There are three exceptions within this set: &, < and > are converted by Smartsite to &amp;, &lt; and &gt; respectively.

The set without these characters contains: !"#$%'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
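The character list above can be reproduced from the Latin-1 code points. The following is a minimal Python sketch, not Smartsite code: the function names are hypothetical, and it assumes that the conversion of the three exceptions behaves like ordinary HTML escaping of &, < and > only.

```python
import html


def latin1_printable_without_markup() -> str:
    """Visible Latin-1 characters, minus the three that are stored as entities."""
    # 0x21-0x7E is visible ASCII; 0xA1-0xFF is the upper half of Latin-1.
    # The soft hyphen (0xAD) is skipped because it is normally invisible.
    codes = [c for c in range(0x21, 0x100)
             if c <= 0x7E or (c >= 0xA1 and c != 0xAD)]
    return "".join(chr(c) for c in codes if chr(c) not in "&<>")


def escape_markup(text: str) -> str:
    """Escape only &, < and >; quotes and all other characters stay as they are."""
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(latin1_printable_without_markup())
    print(escape_markup("R&D <é>"))
```

Under these assumptions, a Latin-1 character such as é passes through unchanged, while the three markup characters are replaced by their entities.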

When you use FTS search queries, the handling of diacritics depends on the way the full-text catalog is indexed. If no ACCENT_SENSITIVITY option is specified for the catalog, the handling of diacritics depends on the collation in use. The same rule applies to case sensitivity. Make sure that collation choices, such as the use of CS or AS, are made deliberately.

When the catalog is accent insensitive, diacritics are internally replaced with the "plain" character. For example, a search for coëfficient or coefficient would return words (including misspelled ones) like coefficient, coefficiënt, cöéffîcìêñt, etc.
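The effect of accent-insensitive matching can be approximated outside the database. The sketch below is an illustration only: it uses Unicode decomposition from the Python standard library, which is an assumption about the general idea and not a description of how SQL Server folds diacritics internally. The function names are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Replace accented letters with their "plain" base letter."""
    # NFD splits a letter such as "ë" into "e" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, word: str) -> bool:
    """Accent-insensitive and case-insensitive comparison of two words."""
    return fold_accents(query).casefold() == fold_accents(word).casefold()


if __name__ == "__main__":
    for word in ("coefficient", "coefficiënt", "cöéffîcìêñt"):
        print(word, matches("coëfficient", word))
```

With this folding, all three example words compare equal to the query coëfficient, which mirrors the search behaviour described above.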

Even if everything is set to case insensitive, iFTS may still notice differences in casing when they are linguistically relevant. Note that c# will probably return no results, because # is considered a word breaker and c is probably in the stoplist. However, C# may return results, because C# is a well-known programming language.

More information on ISO/IEC 8859-1

Note that any character not covered by ISO 8859-1 is translated into an entity before it is stored in the database. However, since Smartsite iXperion's Faceted Search supports Unicode characters by converting these entities into an nvarchar-typed view column, you do not need to take these internal conversions into account.
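The round trip described above can be sketched in Python. This is an illustration under stated assumptions, not Smartsite's implementation: it assumes that characters outside Latin-1 are stored as decimal numeric entities (such as &#321;), and the names to_stored_form and from_stored_form are hypothetical.

```python
import html


def to_stored_form(text: str) -> str:
    """Approximate the stored form: escape &, < and >, then replace
    every character outside Latin-1 with a numeric entity."""
    escaped = html.escape(text, quote=False)
    # The "xmlcharrefreplace" error handler emits &#NNN; for characters
    # that cannot be encoded as Latin-1; Latin-1 characters are kept as is.
    return escaped.encode("latin-1", errors="xmlcharrefreplace").decode("latin-1")


def from_stored_form(stored: str) -> str:
    """Turn entities back into Unicode characters, as a Unicode (nvarchar)
    column would need to hold them."""
    return html.unescape(stored)


if __name__ == "__main__":
    original = "Łódź & <b>"
    stored = to_stored_form(original)
    print(stored)
    print(from_stored_form(stored) == original)
```

In this sketch the Latin-1 character ó is stored natively, while Ł and ź become entities; decoding restores the original text.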
