Sitemap xml
Purpose
The sitemap xml provider is the web crawler provider configured to discover and index Enterprise Search (ES) documents that are available in xml format over HTTP.
Rather than configuring the web crawler with one or more URLs of HTML start pages, configure it with one or more URLs of xml pages. Such an xml page may be in one of the following formats:
- ES document data. This is an xml containing elements for the fields of one ES document, for example a title in element <doc_title>...</doc_title>.
- ES URL set. This is an xml containing a set of URLs addressing the above ES document xmls. The URL set can be considered a sitemap. The sitemap provider can be configured with the URL of the sitemap xml, which in turn addresses relevant document xmls.
- ES sitemap index. This is an xml addressing a number of the above URL set xmls. The sitemap index can be considered a sitemap of sitemaps. The sitemap provider can be configured with the URL of the sitemap index xml, which in turn addresses URL sets, which in turn address ES document xmls.
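The three levels above can be sketched as a small recursive walk. This is only an illustration, not the provider's implementation: the root element names `sitemapindex` and `urlset` are borrowed from the public sitemap protocol and are an assumption here, as are the `example.test` URLs, the `<document>` root element and the helper names `classify` and `collect_document_urls`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def classify(xml_text):
    """Return (kind, locs); kind is 'sitemap_index', 'url_set' or 'document'."""
    root = ET.fromstring(xml_text)
    name = local_name(root.tag)
    locs = [el.text.strip() for el in root.iter()
            if local_name(el.tag) == "loc" and el.text]
    if name == "sitemapindex":      # assumed root element of a sitemap index
        return "sitemap_index", locs
    if name == "urlset":            # assumed root element of a URL set
        return "url_set", locs
    return "document", []           # anything else is treated as document data


def collect_document_urls(start_url, fetch):
    """Follow index -> URL set -> document; fetch(url) returns the xml text."""
    kind, locs = classify(fetch(start_url))
    if kind == "document":
        return [start_url]
    found = []
    for loc in locs:
        found.extend(collect_document_urls(loc, fetch))
    return found


# Hypothetical in-memory source standing in for HTTP.
PAGES = {
    "http://example.test/index.xml":
        "<sitemapindex><sitemap><loc>http://example.test/set1.xml</loc>"
        "</sitemap></sitemapindex>",
    "http://example.test/set1.xml":
        "<urlset><url><loc>http://example.test/doc1.xml</loc></url>"
        "<url><loc>http://example.test/doc2.xml</loc></url></urlset>",
    "http://example.test/doc1.xml":
        "<document><doc_title>First</doc_title></document>",
    "http://example.test/doc2.xml":
        "<document><doc_title>Second</doc_title></document>",
}

document_urls = collect_document_urls("http://example.test/index.xml",
                                      PAGES.__getitem__)
```

Whichever of the three kinds of URL is used as the starting point, the walk ends at the same document xmls.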
When configuring the web crawler for sitemap usage, uncheck Discover urls. The sitemap index, sitemaps and documents are still visited, but the crawler no longer looks for hyperlinks in their content.
Maintenance of documents
New documents
The sitemap provider discovers URL entries in the URL set, for example:
<url>
<loc>http://kennis/esdocumentdata.xml?nrcontent=10079</loc>
<lastmod>2018-02-06T11:04:51</lastmod>
</url>
A new URL in a URL set leads to a new ES document. An ES document xml may also be specified directly in the web crawler. In that case the URL of the ES document xml itself is used and, if new, leads to a new ES document.
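As a rough illustration of how new URLs could be detected, the sketch below parses URL entries like the one shown above and compares them with the URLs already known. The `<urlset>` wrapper, the second entry (`nrcontent=10080`) and the function names are hypothetical, and the sketch assumes the xml carries no namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical URL set; the first entry is the example from the text.
URL_SET = """<urlset>
  <url>
    <loc>http://kennis/esdocumentdata.xml?nrcontent=10079</loc>
    <lastmod>2018-02-06T11:04:51</lastmod>
  </url>
  <url>
    <loc>http://kennis/esdocumentdata.xml?nrcontent=10080</loc>
  </url>
</urlset>"""


def parse_url_entries(xml_text):
    """Return a list of (loc, lastmod) pairs; lastmod is None when absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.iter("url"):
        loc = url.findtext("loc")
        if loc:
            entries.append((loc.strip(), url.findtext("lastmod")))
    return entries


def new_urls(entries, known_urls):
    """Return the locations that have not been indexed before, in order."""
    return [loc for loc, _lastmod in entries if loc not in known_urls]


known = {"http://kennis/esdocumentdata.xml?nrcontent=10079"}
to_index = new_urls(parse_url_entries(URL_SET), known)
```

Here `to_index` holds only the second URL, since the first one is already known.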
Having completed a discovery cycle, the provider starts a new one after an interval of, for example, one hour.
Modified documents
The above URL entry includes a last modified date and time. This timestamp is for information only; the provider ignores it and performs an HTTP If-Modified-Since check instead:
- Having indexed the document for the first time, the provider records the current date and time as a timestamp for the document.
- In the next discovery cycle the provider sends this timestamp in an HTTP If-Modified-Since header.
- The source may establish that the document has not been modified since this date and time, and respond with an HTTP 304 Not Modified. The document is not transferred and ES skips reindexing it, which improves performance.
- Or the source may respond with the updated document, in which case ES reindexes the document and updates the recorded timestamp.
Note that the source should implement the If-Modified-Since / 304 Not Modified handshake. Otherwise the document is simply fetched and reindexed in every discovery cycle.
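The handshake can be sketched from the client side as follows. This is a minimal illustration, not the provider's code: it uses the standard HTTP request header `If-Modified-Since`, and the function names and the two offline stand-in sources are assumptions so the example runs without a network.

```python
import io
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def build_request(url, last_indexed):
    """Build a GET request carrying the recorded timestamp, if there is one."""
    request = urllib.request.Request(url)
    if last_indexed is not None:
        # HTTP dates are expressed in GMT; last_indexed must be a UTC datetime.
        request.add_header("If-Modified-Since",
                           format_datetime(last_indexed, usegmt=True))
    return request


def refresh(url, last_indexed, opener=urllib.request.urlopen, now=None):
    """Return (body, timestamp); body is None when the source answers 304."""
    try:
        with opener(build_request(url, last_indexed)) as response:
            body = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, last_indexed   # unchanged: skip reindexing
        raise
    # Changed (or first fetch): reindex and record a fresh timestamp.
    return body, now or datetime.now(timezone.utc)


# Offline stand-ins for a real source (hypothetical).
def source_not_modified(request):
    raise urllib.error.HTTPError(request.full_url, 304, "Not Modified",
                                 None, None)


def source_modified(request):
    return io.BytesIO(b"<document><doc_title>Updated</doc_title></document>")
```

A source that ignores the header simply returns the document every time, so `refresh` always takes the reindex path, which matches the behaviour described in the note above.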
Removed documents
The source of ES document xmls does not need to signal the removal of a document explicitly. It is sufficient to omit the document from the URL set.
When a document has not been reindexed or seen for more than, for example, 48 hours, the provider checks whether it still exists by requesting the document xml at its known URL. For a removed document the source should respond with an HTTP 404 Not Found, after which the provider deletes the document.
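The removal check can be pictured as below. This is a sketch under stated assumptions: the index and the last-seen bookkeeping are plain dictionaries, `status_of` is a hypothetical callable returning the HTTP status for a URL, and the 48 hours is the example value from the text.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)  # example threshold from the text


def stale_urls(last_seen, now, stale_after=STALE_AFTER):
    """Return URLs not reindexed or seen for longer than stale_after."""
    return [url for url, seen in last_seen.items() if now - seen > stale_after]


def purge_removed(index, last_seen, now, status_of):
    """Delete stale documents whose URL answers 404; return the deleted URLs."""
    deleted = []
    for url in stale_urls(last_seen, now):
        if status_of(url) == 404:
            index.pop(url, None)
            last_seen.pop(url, None)
            deleted.append(url)
    return deleted


# Hypothetical state: one fresh, one stale but present, one stale and removed.
now = datetime(2018, 2, 10, 12, 0, tzinfo=timezone.utc)
index = {"doc-fresh": "...", "doc-stale": "...", "doc-gone": "..."}
last_seen = {
    "doc-fresh": now - timedelta(hours=1),
    "doc-stale": now - timedelta(hours=72),
    "doc-gone": now - timedelta(hours=72),
}
statuses = {"doc-stale": 200, "doc-gone": 404}

deleted = purge_removed(index, last_seen, now, statuses.get)
```

Only `doc-gone` is deleted: `doc-fresh` is never probed, and `doc-stale` still exists at the source.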