Load a source

Purpose

An Enterprise Search source (hereafter referred to as the to-source) can be loaded with documents from another source, the from-source. The from-source can reside

  • in the index of the local Elastic Search instance that also contains the to-source,
  • in another index of the local Elastic Search instance,
  • in the index of an Elastic Search instance at a foreign endpoint.

Purposes of the load include:

  • When upgrading the version of Elastic Search, the source in the higher version can be loaded from the source in the lower version. Elastic Search offers upgrade paths at index level; some have no limitations, others are increasingly limited depending on whether a minor or major version upgrade is performed. Loading a source works in more cases and, for example, also supports a version downgrade.
  • Changing the document field mapping requires recreating the index. Such changes include adding or removing fields, changing the type of fields (text, keyword, datatype) and more. Recreating an index requires repopulating it, which in turn requires revisiting the original web sites and other sources. Loading a source makes it possible to populate a source from in-house data instead.
  • In a DTAP setup (development, test, acceptance, production; OTAP in Dutch) it may be useful to load a source in one of the D, T, A or P systems from any of the other systems.

Database documents versus Elastic Search documents

An Enterprise Search document consists of a DB document and an ES document.

  • A DB document resides in the database, as a row of table esDocuments. The DB document serves administrative purposes.
  • An ES document resides in the Elastic Search index. It includes the document text and document metadata. The ES document serves storage, indexing and search purposes.

DB documents are normally leading: adding a DB document causes the indexer to add an ES document; removing a DB document causes removal of the ES document from the Elastic Search index.

In contrast, loading the to-source causes the to-source to be reconstructed from the from-source. ES documents are obtained from the from-source and are used to create both DB documents and ES documents in the to-source.

Prerequisites for the to-source

Loading the to-source is possible if the following prerequisites are met.

  • The to-source must initially be empty. In the manager it is possible to empty the source, causing removal of all DB and ES documents of the source.
  • The provider for the to-source should sufficiently match the provider used for the from-source. For example, both providers should be web crawlers, and the sources should be configured with a largely corresponding set of websites to crawl. Loaded documents are marked as dirty, causing the local provider to check loaded documents against the to-source. A mismatch between the providers or between the configurations of the sources would cause removal of most or all of the documents just loaded.

There is no need for an exact match of the document field mappings:

  • From-fields may no longer exist as to-fields. Field content will be ignored during the load.
  • New to-fields may not have a corresponding from-field. These to-fields remain empty during the load.
  • The type (text, keyword, datatype) of the to-field and from-field may differ. Field content is reindexed during the load.
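
The first two rules above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name project_source and the field names title, body, legacy_rank and language are hypothetical and chosen for illustration only. Type differences (the third rule) are not modelled here, since the text says they are handled by reindexing the field content.

```python
from typing import Any, Dict, Iterable


def project_source(from_source: Dict[str, Any], to_fields: Iterable[str]) -> Dict[str, Any]:
    """Keep only from-fields that still exist as to-fields.

    From-fields without a matching to-field are ignored. To-fields without
    a matching from-field are simply absent from the result, so they
    remain empty in the loaded document.
    """
    allowed = set(to_fields)
    return {name: value for name, value in from_source.items() if name in allowed}


# Hypothetical example data: "legacy_rank" no longer exists in the to-source,
# and "language" is a new to-field without a from-field.
from_doc = {"title": "Home", "body": "Welcome", "legacy_rank": 7}
to_fields = ["title", "body", "language"]

loaded = project_source(from_doc, to_fields)
print(loaded)
```

In this sketch legacy_rank is dropped and language stays unset, while title and body are carried over unchanged.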

A load uses the document text and document metadata as stored in the from-index (for information: using the Elastic Search _source field). A load does not require revisiting web sites, databases, file systems and other sources originally visited.
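
As an illustration of what using the _source field means, the sketch below pulls the stored _source out of a search response shaped like the standard Elastic Search search result (hits.hits[]._source). The helper name iter_sources, the index name from-index and the sample documents are assumptions made for this example; how the product actually queries the from-index is not described here.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_sources(search_response: Dict[str, Any]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (document id, stored _source) for every hit in a search response."""
    for hit in search_response.get("hits", {}).get("hits", []):
        yield hit["_id"], hit.get("_source", {})


# Hand-written sample response; the index name and fields are hypothetical.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "from-index", "_id": "1", "_source": {"title": "Home", "body": "Welcome"}},
            {"_index": "from-index", "_id": "2", "_source": {"title": "Contact", "body": "Call us"}},
        ]
    }
}

for doc_id, source in iter_sources(sample_response):
    print(doc_id, source["title"])
```

Because the text and metadata come from _source, nothing in this step touches the originally visited web sites or databases.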

Manager action

Loading a source can be initiated from the list of sources of a silo. Select an empty source as to-source. Further specification of the from-source depends on whether the local ES or a foreign ES is selected.

  • For the local ES, select a silo, and within the silo select a source as from-source.
  • A foreign ES can be selected if foreign ES endpoints are configured in Web.config (see below). If a foreign ES is selected, enter the name of the foreign index and the code of the foreign from-source.

Configuration in Web.config

Loading a source from a foreign ES requires an Elastic Search endpoint address to retrieve documents from. File Web.config specifies endpoint addresses, for example:

<configuration>
    <appSettings>
        <add key="enterprisesearch.elastic.endpoint" value="http://localhost:9200"/>
        <add key="enterprisesearch.elastic.endpoint.test" value="http://localhost:9202"/>
    </appSettings>
</configuration>

  • Key enterprisesearch.elastic.endpoint specifies the regular endpoint address for the local Elastic Search instance.
  • Key enterprisesearch.elastic.endpoint.test specifies an endpoint address named test.

The endpoint address itself cannot be entered in the Smartsite manager, for security reasons. Instead use the endpoint name test.
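
The relation between key suffix and endpoint name can be sketched as follows. This is an illustrative Python reading of the Web.config fragment above, not the product's own lookup code; the function names named_endpoints and resolve are hypothetical, and the rule that the name is the part of the key after enterprisesearch.elastic.endpoint. is inferred from the example.

```python
import xml.etree.ElementTree as ET

# Same fragment as in the Web.config example.
WEB_CONFIG = """<configuration>
    <appSettings>
        <add key="enterprisesearch.elastic.endpoint" value="http://localhost:9200"/>
        <add key="enterprisesearch.elastic.endpoint.test" value="http://localhost:9202"/>
    </appSettings>
</configuration>"""

PREFIX = "enterprisesearch.elastic.endpoint"


def named_endpoints(config_xml: str) -> dict:
    """Map endpoint names (the key suffix after the prefix) to addresses."""
    root = ET.fromstring(config_xml)
    endpoints = {}
    for setting in root.findall("./appSettings/add"):
        key = setting.get("key", "")
        if key.startswith(PREFIX + "."):
            endpoints[key[len(PREFIX) + 1:]] = setting.get("value")
    return endpoints


def resolve(config_xml: str, name: str) -> str:
    """Return the address for a configured endpoint name; unknown names raise KeyError."""
    return named_endpoints(config_xml)[name]


print(resolve(WEB_CONFIG, "test"))
```

Only names that appear in the configuration can be resolved, which mirrors the restriction that an arbitrary address cannot be entered in the manager.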