Web crawler

Smartsite 7.9 - ...

Purpose

The web crawler provider supports using a web site as a source of Enterprise Search documents. It crawls configured web site pages and the pages they link to, recursively. One page typically results in one ES document.

The web crawler as described here handles HTML pages. The web crawler is also used for other web crawling tasks, described separately: sitemaps and xml document data, and RSS feeds.

Configuration

Application settings

Application setting enterprisesearch.querystringfilter.exclude resides in the Web.config and cannot currently be set as part of the configuration of the web crawler.

The web crawler includes a mechanism to prevent endless visiting of dynamically generated pages with URLs that differ by query string parameters only, for example .../calendar?year=2018&month=11&day=8. Following a link such as Tomorrow would dynamically yield the next page, and so on.

Visited pages may however also be produced by an underlying content management system, for example resulting in URLs such as .../item?id=12345. Parameter id is relevant and value 12345 must be honoured.

Query parameter id is honoured by default. Additional parameters such as nr may be specified using the above setting, which accepts a comma separated list.
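
For example, a Web.config entry along the following lines would cause parameter nr to be honoured in addition to id. The appSettings element is shown in isolation; its exact location depends on the installation.

<appSettings>
  <add key="enterprisesearch.querystringfilter.exclude" value="nr" />
</appSettings>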

Common configuration

Part of the configuration is common to all sources. Refer to All providers.

Source

Specify one or more urls of web pages at which to start crawling.

The provider visits these start urls and, recursively, urls found on the visited web pages. The provider stays within the configured web sites by restricting the set of visited urls: a visited url must match one of the configured start urls, as sketched after the list below.

  • The host must match; for example, a start url on host doc.seneca.nl restricts crawling to doc.seneca.nl.
  • The scheme must match; for example, a start url using https restricts crawling to https urls.
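
A minimal sketch of this restriction, assuming it reduces to comparing the scheme and host of a candidate url against the configured start urls. The function name is illustrative and not the provider's actual implementation.

from urllib.parse import urlparse

def within_configured_sites(candidate, start_urls):
    # A candidate url is only visited if its scheme and host match one of the start urls.
    c = urlparse(candidate)
    return any(
        c.scheme == urlparse(start).scheme and c.netloc.lower() == urlparse(start).netloc.lower()
        for start in start_urls
    )

# Example: only https urls on doc.seneca.nl are followed.
within_configured_sites("https://doc.seneca.nl/page", ["https://doc.seneca.nl/"])   # True
within_configured_sites("http://doc.seneca.nl/page", ["https://doc.seneca.nl/"])    # False: scheme differs
within_configured_sites("https://www.seneca.nl/page", ["https://doc.seneca.nl/"])   # False: host differs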

Enable or disable the above discovery of urls.

  • Enable url discovery for regular web crawling. This for example allows configuring a home page and letting the provider visit landing pages and follow-up pages.
  • Disable url discovery if the web crawler is configured for special web crawling tasks such as sitemap and xml document data crawling, or reading from RSS feeds.

Optionally specify Uri must contain substrings.

  • The provider limits visited urls to urls containing at least one of the substrings. It looks for the substrings in the entire url, including the url path but excluding bookmarks (introduced with #).

Optionally specify Uri must not contain substrings.

  • The provider excludes urls containing one of the substrings from visiting. As with the Uri must contain substrings, it looks for the substrings in the entire url, excluding bookmarks. A sketch of both filters follows below.
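
A minimal sketch of the two filters, assuming they come down to plain substring tests on the url with the bookmark stripped. The names are illustrative.

def strip_bookmark(url):
    # Bookmarks (introduced with #) are excluded from substring matching.
    return url.split("#", 1)[0]

def url_passes_filters(url, must_contain, must_not_contain):
    u = strip_bookmark(url)
    if must_contain and not any(s in u for s in must_contain):
        return False  # none of the required substrings is present
    if any(s in u for s in must_not_contain):
        return False  # an excluded substring is present
    return True

# Example: limit crawling to /news pages, but skip the archive.
url_passes_filters("https://doc.seneca.nl/news/item#top", ["/news"], ["/archive"])  # True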

Optionally specify file extensions to ignore.

  • The provider ignores urls specifying a file having one of the configured extensions. This is based on the textual presence of the extension in the url. For example an url having the extension .mp4 can be ignored without visiting the url.
  • An url may be extensionless. The provider will visit the url in this case, obtaining the mime type of the resource, for example video/mp4. The provider will not retrieve the video itself at this point. The provider skips the resource if there is an association between the mime type video/mp4 and the extension .mp4. The administrator can maintain associations in the Smartsite Manager, using action Mime Types to maintain mime types and using action File Types to maintain extensions and to associate an extension with a mime type.
  • The provider supplies an initial list of extensions:
    • The list consists of extensions found in optional application setting enterprisesearch.providers.webcrawler.defaultignoredextensions of the Web.config. The setting is a comma (,) or semicolon (;) separated list of extensions.
    • The list consists of a default list if defaultignoredextensions is omitted. The default list is hardcoded and currently consists of the following extensions: 7z, asf, avi, bat, bmp, cmd, com, cpl, css, dll, eot, exe, flv, gif, hta, ico, inf, iso, jar, jpeg, jpg, js, json, mp3, mp4, mpeg, mpg, msi, msp, ogg, pif, png, ps1, qt, rar, reg, scr, svg, tiff, ttf, vb, vbe, vbs, vob, wav, wmv, woff, woff2, xml and zip.
    • The list is completed in both cases using optional application setting enterprisesearch.providers.webcrawler.extraignoredextensions.
  • This list can be edited as necessary. Settings defaultignoredextensions and extraignoredextensions will not be reapplied after their initial use.
  • Setting enterprisesearch.providers.webcrawler.alwaysignoredextensions can be used to specify extensions that are always ignored. Extensions in this list need not be respecified for the source. See the example after this list.
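
An illustrative Web.config fragment combining the optional settings. The extension values are examples only, and the separator conventions are assumed to match those of defaultignoredextensions.

<appSettings>
  <add key="enterprisesearch.providers.webcrawler.extraignoredextensions" value="log;tmp" />
  <add key="enterprisesearch.providers.webcrawler.alwaysignoredextensions" value="exe,dll" />
</appSettings>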

Optionally specify html elements to remove because they do not contribute to the search result.

  • Elements can be specified by name. Defaults to header and footer. This causes html elements <header> and <footer> to be skipped when building the text for the document to be sent to Elastic Search for indexing. Skipping the element includes skipping nested content: text is skipped from, for example, <header> up to the closing </header>. If a skipped element contains hyperlinks by means of the <a> element and "href" attribute, these hyperlinks are still found and used to find additional web pages. See the example after this list.
  • Elements can be specified by class, using attribute "class". Defaults to "noindex". The presence of this class name identifies an html element to remove, as above.
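
For example, given the fragment below, the texts Site navigation and Cookie notice are not included in the indexed text, but the hyperlink to /news is still followed:

<header>
  <a href="/news">Site navigation</a>
</header>
<div class="noindex">Cookie notice</div>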

Optionally set a sleep time after load.

  • The provider will sleep the configured number of milliseconds after loading a resource from an url. The purpose is to reduce the load put on the crawled web site. Note that multiple threads are doing the work of reading from a configured source, building ES documents, and sending these documents to the Elastic Search index. The number of threads may for example be 2. At a given time both threads may be working on the same web site, doubling the load.

Discovery of urls

The web crawler parses a visited HTML page, looking for hyperlinks to additional HTML pages, if url discovery is enabled. It adds new urls as newly discovered documents, to be indexed as soon as possible.

The web crawler looks for HTML href attributes of <a> elements and uses the href value as hyperlink. These attributes may appear anywhere in the page, whether in the head section or the body section. The web crawler however skips script blocks between <script ...> and </script>.
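
A minimal sketch of this link discovery, using Python's standard html.parser purely for illustration; the provider's actual parser is not shown here.

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collects href values of <a> elements anywhere in the page, skipping <script> blocks.
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif tag == "a" and not self.in_script:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

# Usage: collector = LinkCollector(); collector.feed(html); collector.links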

Handling of document fields

Fallbacks:

  • Document title, field doc_title. The web crawler provider attempts to obtain a document title from the <head> element of the HTML page, or from document metadata if a document is visited. If no title can be obtained it uses the last segment of the url, which normally captures the page's friendly name. It performs some cleanup of the segment, including replacement of "-" and "_" by a space, as sketched below.
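
A minimal sketch of the fallback, assuming the cleanup is limited to the documented replacements plus dropping an extension if present; the extension handling is an assumption.

from urllib.parse import urlparse

def fallback_title(url):
    # Take the last segment of the url path as the friendly name.
    segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    segment = segment.split(".", 1)[0]  # drop an extension if present (assumption)
    # Documented cleanup: replace "-" and "_" by a space.
    return segment.replace("-", " ").replace("_", " ").strip()

fallback_title("https://doc.seneca.nl/news/press-release_2018.html")  # "press release 2018"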

Maintenance of documents

New documents

The web crawler visits configured start urls during the discovery cycle. It detects additional web pages while processing the ES documents queued for indexing. A new web page results in a new ES document. A web page is considered new if its complete url, including for example url parameters, is new. The provider records the url as the system location for the ES document.

Having completed a discovery the provider waits the discovery interval, for example configured as one hour, and then starts a next discovery.

Recording ES documents for the start urls and recording new ES documents for found new urls schedules the documents for indexing. Whether indexing occurs immediately or with a delay depends on the number of ES documents currently queued for processing.

Modified documents

The provider detects a modified web page during an aged information check. It revisits the web page if the age of the information recorded for the ES document exceeds the configured age, for example 48 hours.

The provider records the current date time as system location modified timestamp whenever it indexes a web page. When revisiting the web page it sends the timestamp as part of an If-Modified-Since handshake.

  • The web site should respond with HTTP status 304 - Not Modified if the page has not been modified since the sent timestamp. The provider will skip reindexing in that case.
  • The web site responds with 200 - OK and the web page if modifications occurred. The provider will schedule the page for reindexing in that case.

For web sites not implementing the If-Modified-Since handshake the provider will reindex the web pages unconditionally at each aged information check, for example every 48 hours.
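
A minimal sketch of the handshake, using Python's standard library purely for illustration; the provider's own HTTP handling is not shown here.

import urllib.error
import urllib.request

def revisit(url, last_indexed):
    # last_indexed is the recorded timestamp, formatted as an HTTP date.
    request = urllib.request.Request(url, headers={"If-Modified-Since": last_indexed})
    try:
        with urllib.request.urlopen(request) as response:
            return "reindex", response.read()      # 200 - OK: the page was modified
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return "skip", None                    # 304 - Not Modified: keep the existing document
        raise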

The provider schedules reindexing of the modified web pages. Whether indexing occurs immediately or with a delay depends on the number of ES documents currently queued for processing.

Removed documents

The provider detects a removed document during the aged information check. It attempts to visit the known url. The web site should respond with a 404 - Not Found. The provider will delete the ES document if this HTTP status is returned, and also on other failures to obtain the web page.

An ES document may require removal, even if the web page still exists. The document link may no longer apply, given a change of the configuration: for example given an added Uri must not contain substring. This is detected during a manual reindex, or automatically during a reindex caused by the aged index check. ES detects a document link that no longer applies, and removes the document from ES.

Metadata mapping

A web page may contain metadata in the <head> section.

<meta name="DCTERMS.title" content="Example title">
<meta property="og:title" content="Example title">
<meta name="category" content="News">
  • Attributes name and property both result in metadata.
  • Known metadata results in populating of document fields, for example the Dublin Core title metadata is used for field doc_title. See Document fields.

Unanticipated metadata is lost unless mapped to a document field, using an extra field.

  • Configure the extra field at the silo, see Silo wide fields. For example configure field extra_category, using ES data type keyword.
  • Add a mapping from metadata category to field extra_category, in the configuration of the web crawler.

Notes:

  • The format per mapping is <metadata name>:<extra field name> (see the example after these notes).
  • Multiple metadata fields may be mapped to the same document field. Metadata values will be accumulated.
  • Metadata mapping is currently supported by the web crawler provider and not by other provider types.
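
For example, the mapping below sends the value News of the category metadata shown above to field extra_category:

category:extra_category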

Noindex, nofollow

A web page may contain metadata such as

<meta name="robots" content="noindex, nofollow" />
  • noindex requests to skip indexing the page. The page is not sent to Elastic Search and the page will not appear in search results.
  • nofollow requests to skip following hyperlinks on the page in order to discover additional documents. Note that these additional documents may still be discovered if hyperlinked and followed from other pages. 

Notes:

  • The comparison of robots, noindex and nofollow is case insensitive.
  • The web crawler honours the request.
  • Specifying only noindex requests to skip indexing, while hyperlinks are still followed.
  • Specifying only nofollow requests to skip following hyperlinks, while the page is still indexed.
  • Omitting the metadata results in the default to index and to follow.
  • Using index or follow is the default and has no effect, for example "noindex, follow" is the same as "noindex".
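
A minimal sketch of this interpretation of the robots content value; the function name is illustrative.

def robots_directives(content):
    # Case insensitive; the defaults are index and follow.
    tokens = [token.strip().lower() for token in content.split(",")]
    index = "noindex" not in tokens
    follow = "nofollow" not in tokens
    return index, follow

robots_directives("noindex, nofollow")  # (False, False)
robots_directives("Noindex")            # (False, True): hyperlinks are still followed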