File system

Smartsite 7.9 - ...

Purpose

The file system provider makes folders on disk available as a source of Enterprise Search documents. This makes it possible, for example, to index and search PDF documents stored on disk.

Configuration

Common configuration

Part of the configuration is common to all sources. Refer to All providers.

File system

Specify:

  • Folders to include.
  • Folders to exclude. Exclusions take precedence over inclusions.
  • File extensions to include. At least one extension is required; without one, no files are selected.

The provider discovers documents according to this specification and indexes them. Changes to the specification take effect with a delay:

  • The changed specification may cover new folders and files. New files are detected during the next discovery cycle, which runs at a configured interval of, for example, 1 hour.
  • The changed specification may no longer cover certain folders and files. These abandoned files are detected during the next aged information check cycle, which runs at a configured interval of, for example, 48 hours.
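The selection rule described above can be illustrated with a minimal sketch. It is not the provider's implementation: the function name is_selected, the argument names, and the convention that extensions are written with a leading dot are all assumptions made for this example.

```python
from pathlib import PureWindowsPath


def is_selected(path, include_folders, exclude_folders, extensions):
    """Return True if `path` matches the specification (hypothetical helper)."""
    p = PureWindowsPath(path)

    def inside(folder):
        # True if p lies in `folder` or in one of its subfolders.
        # PureWindowsPath compares case-insensitively, as Windows does.
        return PureWindowsPath(folder) in p.parents

    if not extensions:
        return False  # at least one extension is required to select files
    # Assumption: extensions are given with a leading dot, e.g. ".pdf".
    if p.suffix.lower() not in {e.lower() for e in extensions}:
        return False
    if any(inside(f) for f in exclude_folders):
        return False  # exclusion takes precedence over inclusion
    return any(inside(f) for f in include_folders)
```

With Docs included and Docs\Old excluded, a PDF in Docs is selected while a PDF in Docs\Old is not, even though Docs\Old lies inside an included folder.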

Accessing files outside the site folder

For security reasons, not every folder can be specified as a folder to include. Only folders within the site folder qualify, that is, the subfolders of, for example, E:\Sites\EsDemoSite.
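The restriction can be pictured as a simple containment check. The sketch below is an assumption about how such a check might look, not the product's actual validation; the function name is_within_site is hypothetical, and rejecting the site folder itself as well as paths containing ".." are choices made for this example.

```python
from pathlib import PureWindowsPath


def is_within_site(folder, site_folder):
    """Return True if `folder` is a subfolder of `site_folder` (hypothetical check)."""
    f = PureWindowsPath(folder)
    s = PureWindowsPath(site_folder)
    # Pure paths do not collapse "..", so reject such paths outright
    # to prevent escaping the site folder (assumption for this sketch).
    if ".." in f.parts:
        return False
    # Only true subfolders qualify; the site folder itself does not.
    return s in f.parents
```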

The Windows mklink command can be used to create a symbolic link to a folder on the same server or on another server, so that the linked folder appears as a local folder within the site. For example, to create a symbolic link to the network folder \\nwsh01\Docs, run:

mklink /D E:\Sites\EsDemoSite\LinkedDocs \\nwsh01\Docs

The remote folder Docs then appears as the folder LinkedDocs within the site, so that it can be selected for search purposes.

Deleting the folder LinkedDocs deletes the symbolic link, not the linked folder.

Handling of document fields

Fallbacks:

  • Document title, field doc_title. The file system provider attempts to obtain a document title from the document metadata. If no title can be obtained, it uses the file name, including the extension.
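The title fallback can be sketched as follows. The function name resolve_doc_title is hypothetical, and treating a whitespace-only metadata title as missing is an assumption of this example, not documented behavior.

```python
from pathlib import PureWindowsPath


def resolve_doc_title(metadata_title, file_path):
    """Pick a value for doc_title (hypothetical helper)."""
    # Assumption: an empty or whitespace-only title counts as "no title".
    title = (metadata_title or "").strip()
    if title:
        return title
    # Fallback: the file name, extension included.
    return PureWindowsPath(file_path).name
```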

Maintenance of documents

New documents

The file system provider detects new files during a discovery cycle. It recursively looks for folders that match the inclusion and exclusion criteria, and within those folders it looks for files that match the file extension criteria. A new file results in a new ES document. A file is considered new if its file path does not occur among the SystemLocation values of the ES documents currently recorded for the source.

After completing a discovery, the provider waits for the discovery interval, for example one hour, and then starts the next discovery.
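The "new file" rule amounts to comparing the discovered file paths with the recorded SystemLocation values. A minimal sketch, assuming the recorded locations are available as plain strings and that paths are compared case-insensitively (both are assumptions; find_new_files is a hypothetical name):

```python
def find_new_files(discovered_paths, recorded_locations):
    """Return the discovered paths that have no recorded ES document yet."""
    # Assumption: Windows paths are compared without regard to case.
    known = {location.lower() for location in recorded_locations}
    return [path for path in discovered_paths if path.lower() not in known]
```

Each path returned would result in a new ES document in the discovery cycle described above.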

In addition, the file system provider uses a file system watcher to detect a new document in real time when a file is created. This detection also works for files in folders linked with the mklink approach described above.

Recording a new ES document schedules the document for indexing. Whether indexing occurs immediately or with a delay depends on the number of ES documents currently queued for processing.

Modified documents

The provider detects a modified document during a discovery cycle. The provider considers a document modified if the last write time of the file differs from the SystemLocationModified timestamp recorded for the ES document.

In addition, the file system provider uses the file system watcher to detect a modified document in real time when a file is modified.
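The comparison made during discovery can be sketched as follows. This is an illustration only: is_modified is a hypothetical name, and representing the recorded SystemLocationModified value as a timezone-aware UTC datetime is an assumption.

```python
import os
from datetime import datetime, timezone


def is_modified(file_path, recorded_modified):
    """Return True if the file's last write time differs from the recorded timestamp."""
    last_write = datetime.fromtimestamp(os.path.getmtime(file_path), tz=timezone.utc)
    # Any difference counts, not only a newer time: a file restored
    # from backup with an older timestamp is also treated as modified.
    return last_write != recorded_modified
```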

In both cases the provider schedules the document for reindexing. Whether indexing occurs immediately or with a delay depends on the number of ES documents currently queued for processing.

Removed documents

The provider detects a removed document during an aged information check cycle. This cycle checks whether the age of the information recorded for a document exceeds a configured age of, for example, 48 hours. If it does, the provider marks the document as dirty, which schedules a recheck of the document. The recheck finds that the file no longer exists and removes the document. Whether the recheck occurs immediately or with a delay depends on the number of ES documents currently queued for processing.
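The age check can be illustrated with a short sketch. It assumes that a "last checked" time is kept per document, keyed here by system location; the function name find_aged_documents and that data layout are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone


def find_aged_documents(last_checked_by_location, now, max_age=timedelta(hours=48)):
    """Return the locations whose recorded information is older than max_age."""
    # Documents returned here would be marked dirty and scheduled for a recheck.
    return [
        location
        for location, checked in last_checked_by_location.items()
        if now - checked > max_age
    ]
```

For example, with a check time of 12:00 on 10 January and the default age of 48 hours, a document last checked on 7 January is returned, while one last checked on 9 January is not.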

In addition, the file system provider uses the file system watcher to detect a removed document in real time when a file is deleted. It then immediately deletes the information recorded for the document and removes the document from the Elastic Search index.