Including files within the indexed views

Release 1.3 - ...

By using SQL Server full-text indexing, Faceted Search can also search within Microsoft Office documents, PDF documents and any other (binary) file format for which an IFilter is installed. The latest Microsoft Office Filter Packs even include filters for OpenDocument formats.

Documents stored in the file system and accessible to Smartsite can also be included in the search by using FileLink items (items of contenttype FLK) in conjunction with SyncFileLinks. The process of mirroring a folder on disc to a folder item within the cms, including a FileLink item for each file, can even be fully automated. This article describes how.

SyncFileLinks

The SyncFileLinks macro is the first step in this process. This macro compares the specified folder on disc with the specified folder item within the cms (recursively). When it encounters folders and/or files on disc which do not have a corresponding item within the cms, a piece of import xml is generated. 

For each item within the cms which no longer has a corresponding file or folder on disc, an import xml instruction node is generated to delete the item.

These pieces of xml are written hierarchically to one import xml file. After this file has been imported, the cms folder item will completely mirror the folder on disc.
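The comparison described above can be pictured as a set difference between the paths found on disc and the paths already present in the cms. The following is a minimal sketch of that idea only, not the SyncFileLinks macro itself: the function names `list_disc_paths` and `plan_sync` are hypothetical, and the sketch returns plain lists of relative paths instead of generating import xml, since the import xml format is not shown in this article.

```python
import os


def list_disc_paths(root):
    """Recursively collect folder and file paths under root, relative to root."""
    found = []
    for current, dirs, files in os.walk(root):
        for name in dirs + files:
            rel = os.path.relpath(os.path.join(current, name), root)
            # Normalise separators so paths compare equal on every platform.
            found.append(rel.replace(os.sep, "/"))
    return found


def plan_sync(disc_paths, cms_paths):
    """Return (to_add, to_delete) for mirroring the disc folder in the cms.

    Hypothetical helper: both arguments are iterables of relative paths.
    """
    disc, cms = set(disc_paths), set(cms_paths)
    to_add = sorted(disc - cms)      # on disc, no item in the cms yet
    to_delete = sorted(cms - disc)   # item in the cms, gone from disc
    return to_add, to_delete
```

Because the results are sorted, a parent folder is always listed before its children, which matches the hierarchical order in which the instructions are written.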

Asynchronous Import

For a fully automated process, the generated import xml file should be imported automatically as well. This is where the Asynchronous Import module, introduced in 1.3 and discussed in the Advanced Topics section of the Developer Guide, comes into play. This module imports the specified xml file asynchronously. To make things even more flexible, the asynchronous import can be invoked with a simple webservice call.

SyncFileLinks Configuration item

The SyncFileLinks Configuration item (added to the database when installing 1.3 build 3) contains xml with various configuration settings, such as the target cms folder item and the folder on disc which must be mirrored. It also contains account information which will be used when calling the webservice.

<syncfolders username="username" password="password">
 <item id="ixs" targetfolder="FS_MIRROR">indexedcontent</item>
</syncfolders>
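To illustrate how settings could be looked up by the id of an item node, here is a minimal sketch that parses the configuration xml shown above. It is not how Smartsite itself reads the item: the function name `lookup_sync_settings` and the returned key names are hypothetical, and the sketch assumes that the `targetfolder` attribute names the cms folder item while the element text names the folder on disc.

```python
import xml.etree.ElementTree as ET

CONFIG_XML = """<syncfolders username="username" password="password">
 <item id="ixs" targetfolder="FS_MIRROR">indexedcontent</item>
</syncfolders>"""


def lookup_sync_settings(config_xml, sync_id):
    """Return the settings of the item node whose id equals sync_id."""
    root = ET.fromstring(config_xml)
    for item in root.findall("item"):
        if item.get("id") == sync_id:
            return {
                # Account information lives on the root node.
                "username": root.get("username"),
                "password": root.get("password"),
                # Assumption: attribute = cms folder item, text = folder on disc.
                "targetfolder": item.get("targetfolder"),
                "discfolder": (item.text or "").strip(),
            }
    raise KeyError(f"no item node with id {sync_id!r}")
```

Calling `lookup_sync_settings(CONFIG_XML, "ixs")` returns the settings for the single item node in the example, while an unknown id raises a `KeyError`.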

The Sync FileLinks application page and the syncfilelinks command line script (described below) just require the id of an item node to acquire the appropriate configuration settings.

Sync FileLinks Application Page

The Sync FileLinks Application Page (item with code: FS_SYNCFILELINKS) is included within the Faceted Search content added to the database when installing 1.3 build 3. This application page uses the two building blocks described above (the SyncFileLinks macro and a webservice call to the Asynchronous Importer) and also contains logic to check the status of the asynchronous import job.

Simply put, you only need to call this application page a couple of times with the appropriate (querystring) parameters to complete the process of synchronizing the cms folder item with the contents on disc.

Command line script file

Finally, the repeated calls to the Sync FileLinks application page should be automated as well. This is implemented in the syncfilelinks.cmd command line script file. If you have installed 1.3 build 3 including the Faceted Search module, this file can be found within the system folder of your site.

This command line script uses signalsmartsite to render the Sync FileLinks application page a couple of times. The first time it checks if the cms folder item is in sync with the folder on disc. If not, an import xml file is generated and an asynchronous import job is started (using a webservice call). The guid of this import job is returned.

Using this guid, the script then renders the Sync FileLinks page again to check the status of the import job. When this job returns status Ready, signalsmartsite is called once again to start the FileLinkCopy job.

For each item of contenttype FLK, the FileLinkCopy job copies the binary content of the referenced file to a database field (usually CTSpecificBinary1), so SQL Server will be able to index it. Note that the FileLinkCopy job configuration within the smartsite.config file usually contains a maxfilesize parameter, which prevents large files from being written to the database.

The only input this command line script requires is a syncid, which corresponds to the id of an item node within the SyncFileLinks Configuration item.
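The sequence the script walks through (check, wait for the import job, then start FileLinkCopy) can be sketched as a small polling loop. This is an illustration of the control flow only, not the contents of syncfilelinks.cmd: `run_sync` and its three callables are hypothetical stand-ins for the signalsmartsite calls, any status value other than Ready is invented for the example, and the sketch assumes that nothing further happens when the folder is already in sync, which this article does not state.

```python
import time


def run_sync(sync_id, check_sync, get_status, start_filelink_copy,
             poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Run one synchronisation pass and return the import job guid, if any.

    check_sync(sync_id)   -> guid of the started import job, or None if in sync
    get_status(guid)      -> status string of the import job
    start_filelink_copy() -> starts the FileLinkCopy job
    All three are hypothetical callables supplied by the caller.
    """
    job_guid = check_sync(sync_id)
    if job_guid is None:
        # Assumption: already in sync, so there is nothing to wait for.
        return None

    for _ in range(max_polls):
        if get_status(job_guid) == "Ready":
            break
        sleep(poll_seconds)
    else:
        # The loop ran out of attempts without seeing status Ready.
        raise TimeoutError(f"import job {job_guid} did not become Ready")

    start_filelink_copy()
    return job_guid
```

Passing the three steps in as callables keeps the loop independent of how the application page is actually rendered, and the `max_polls` limit prevents a scheduled run from waiting forever on a job that never finishes.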

Scheduled Task

Well, that's it. Just create a Windows scheduled task which calls the syncfilelinks.cmd script file and you're set. Schedule it to run daily at, for example, 02:00. As the command-line parameter, specify the appropriate syncid.