SQL database

Smartsite 7.9 - ...

Purpose

A SQL database provider supports to use a database as a source of Enterprise Search documents. This for example allows to use a database table such that each table row results in an ES document.

The database provider currently supports SQL Server, database version SQL Server 2008 R2 and higher.

Configuration

Common configuration

Part of the configuration is common to all sources. Refer to All providers.

Database

Select a database connection name from the list. Database connection strings are maintained in configuration file Smartsite.Data.config, which resides on the webserver that runs the Smartsite manager. For security reasons it is not possible to interactively enter a connection string.

Discovery query

The discovery query yields the set of documents for the source. Each query row results in one Enterprise Search document. A database application may delete a particular row. Enterprise Search deletes the ES document if the discovery query no longer yields that row.

The following is a discovery query for table Products of the Microsoft Northwind example database. Column Modified was added to the table for example purposes.

SELECT
'Product-'+CAST(ProductID AS nvarchar) AS SystemLocation,
Modified AS SystemLocationModified
FROM Products

Columns:

Column name Description
SystemLocation This column is mandatory. A value uniquely identifies the document within the source. Just ProductID is also a unique key and would also be suitable. Prefix 'Product-' is included such that additional tables can be added to this source, using an UNION query.
SystemLocationModified

This column is optional, although recommended. A value indicates when the document was modified last, at the source. The provider records the value, at the ES side. The provider assumes the document is not modified if both values are available and the values are equal. The provider skips reindexing in that case, resulting in a performance gain. The provider reindexes the document if either value is not available or if the values differ. Reindexing occurs every discovery cyle in that case.

After completing a discovery and after the configured discovery interval of for example 60 minutes Enterprise Search runs a next discovery.

Primary query

The primary query as well as the secondary queries yield fields for one ES document. Enterprise Search runs these queries whenever it is necessary to index or reindex the document. ES supplies the above SystemLocation value as @Key, allowing to select the appropriate database data. Example:

SELECT
ProductName AS DocTitle,
'Northwind' AS DocAuthors,
'http://localhost/products/'+CAST(ProductID AS nvarchar) AS DocUrl,
Modified AS DocModified,
Modified AS SystemLocationModified
FROM Products
WHERE ProductID=RIGHT(@Key, LEN(@Key)-LEN('Product-'));

All columns are opional. Columns used in the example:

Column name Description
DocTitle The product name is used as document title. Column DocTitle corresponds to document field doc_title. For standard fields refer to standard fields.
DocAuthors A fixed value Northwind is used as author for all documents obtained from this source.
DocUrl This URL will apear in the search result for a found document. The user can navigate to the page addressed by the URL. The URL could for example be http://localhost/products/123. The page should present document details and/or the document itself.
DocModified The product modified timestamp is used as the date time the document is modified last.
SystemLocationModified The product modified timestamp is also used for the mechanism to skip reindexing if the timestamp at the source is equal to the timestamp recorded by ES. If available it is recommended to include this timestamp, for performance.

Secondary queries

The secondary queries as well as the primary query yield fields for one ES document. Example:

SELECT
c.CategoryName AS DocKeywords
FROM Products p JOIN Categories c ON p.CategoryID=c.CategoryID
WHERE ProductID=RIGHT(@Key, LEN(@Key)-LEN('Product-'));

All columns are opional. Columns used in the example:

Column name Description
DocKeywords The product category is used as a keyword for the document. The primary query and secondary queries can return multiple rows. For a multivalue field such as doc_keywords ES will use each row to add one keyword to the field.

Primary query versus secondary queries

The primary query and secondary queries behave similarly:

  • Both types of queries yield fields for the ES document identified by @Key.
  • Both types of queries can return no rows, one row or multiple rows.

The primary query and secondary queries differ as follows:

  • ES considers the document deleted if the primary query yields no rows for key @Key. Secondary queries may yield no rows without causing a delete of the ES document.
  • The primary query can yield a SystemLocationModified and ES willl use it. ES ignores any SystemLocationModified in the result of secondary queries.

Document data accumulation

Enterprise search builds-up document data using the primary and secondary queries. Both the primary and secondary queries may yield multiple rows. As a result there may be multiple values for a particular document field: provenient from more than one query, and/or provenient from more than one row. ES accumulates field data as follows.

Column name Field name Description
DocAuthors doc_authors Add. This field can contain multiple values, and ES adds the value to the collection of values being built up for the document.
DocBody doc_body Merge. This field contains text. ES merges values by concatenating texts, separated by a space.
DocCreated doc_created Replace. ES replaces the current value by the next value. Effectively the last value applies.
DocDescription doc_description Merge.
- doc_fileformat Not supported.
DocIdentifier doc_identifier Merge.
DocKeywords doc_keywords Add.
DocLanguage doc_language Replace.
DocModified doc_modified Replace.
DocPublisher doc_publisher Merge.
DocTitle doc_title Merge.
DocUrl doc_url Replace.
ExtraAbc: extra field with a value of type string extra_abc Merge.
ExtraAbc: extra field with a value of other type than string extra_abc Replace.
SystemLocationModified   Replace. This applies to the timestamps returned by rows of the primary query only.

Special behaviour:

  • Merge: it is possible to request the replace behaviour instead. ES performs a replace rather than a merge if the next text starts with _replace_, underscores included, comma excluded.

Maintenance of documents

New documents

The sql database provider uses the discovery query during a discovery cycle, resulting in a set of rows for new and existing ES documents. A new row results in a new ES document. A row is new if the SystemLocation in the row is new given the system locations of the ES documents currently recorded for the source.

Having completed a discovery the provider waits the discovery interval, for example configured as one hour, and then starts a next discovery.

Recording a new ES document schedules the document for indexing. Whether indexing occurs immediately or with a delay depends on the number of ES documents currently queued for processing.

Modified documents

The provider detects a modified document during a discovery cycle. The discovery interval may for example be configured as one hour. The provider schedules reindexing of the document. Whether indexing occurs immediately or with a delay depends on the number of ES documents currently queued for processing.

Removed documents

The provider detects a removed document during an aged information check cycle. This is a check if the age of the information recorded for a document exceeds a configured age of for example 48 hours. The provider marks the document as dirty if exceeded, scheduling a recheck of the document. The recheck causes removal of the document. Whether the recheck occurs immediately or with a delay depends on the number of ES documents currently queued for processing.