RSS feed
Purpose
The Really Simple Syndication (RSS) feed provider is the web crawler provider configured to discover and index resources that are referenced in RSS feeds and available over HTTP.
Rather than configuring the web crawler with one or more URLs of HTML start pages, configure it with one or more URLs of RSS feeds. A feed may be in one of the following formats:
- RSS version 1.0
- RSS version 2.0
- Atom version 1.0
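The three formats can be told apart by the root element of the feed document. The sketch below is only an illustration of that idea, not the provider's own detection logic; the function name detect_feed_format is hypothetical. It relies on the root elements defined by the respective specifications: rdf:RDF for RSS 1.0, rss with version="2.0" for RSS 2.0, and feed in the Atom namespace for Atom 1.0.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as defined by the Atom 1.0 and RDF specifications.
ATOM_NS = "http://www.w3.org/2005/Atom"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_feed_format(xml_text: str) -> str:
    """Return a label for the feed format based on the root element.

    Hypothetical helper for illustration; anything unrecognised is
    reported as "unknown".
    """
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{RDF_NS}}}RDF":
        return "RSS 1.0"
    if root.tag == "rss" and root.get("version") == "2.0":
        return "RSS 2.0"
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "Atom 1.0"
    return "unknown"
```

A document whose root is none of these, for example an HTML page, is reported as "unknown" in this sketch.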
When configuring the web crawler for RSS usage, uncheck Discover urls. The crawler still visits the RSS feeds and the resources they reference, but it no longer looks for hyperlinks in the content of those resources.
Maintenance of documents
New documents
The RSS feed provider discovers links in the items of the RSS feed, for example:
<item>
<link>https://www.volkskrant.nl/buitenland/sarkozy-aangehouden-en-verhoord~a4582560/</link>
<pubDate>Tue, 20 Mar 2018 08:33:00 GMT</pubDate>
</item>
A new link in the feed leads to a new ES document.
The feed item may contain additional elements, for example a <title> element. The provider ignores these elements and builds the ES document entirely from the resource addressed by the link.
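As a rough illustration of what the provider takes from a feed item, the following sketch reads only the <link> and <pubDate> of each item and ignores everything else. It is a minimal example under stated assumptions, not the provider's implementation: the function name extract_items is hypothetical, and the sketch handles only RSS 2.0 feeds, where <item> elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def extract_items(feed_xml: str):
    """Return a list of (link, publication datetime or None) per item.

    Hypothetical helper: other item elements such as <title> are
    deliberately not read, mirroring the behaviour described above.
    """
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # an item without a link cannot be indexed
        pub = item.findtext("pubDate")
        # pubDate uses the RFC 822 date format, which email.utils parses.
        published = parsedate_to_datetime(pub.strip()) if pub else None
        items.append((link.strip(), published))
    return items
```

Applied to a feed containing the item shown above, the function returns one pair holding the volkskrant.nl link and the parsed publication date.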
After completing a discovery cycle, the provider waits, for example, one hour before starting a new discovery cycle.
A feed typically implements a sliding window, for example by listing only the 10 newest items. The provider does not delete ES documents for links that are no longer listed in the feed.
Modified documents
The item above includes a publication date and time (<pubDate>). The provider reindexes a rediscovered document if this value differs from the registered one, regardless of whether it is earlier or later.
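The rules for new and modified documents can be summarised in a small decision function. This is a sketch only: the name classify_items and the in-memory registry (a dict mapping each known link to its registered publication date) are assumptions made for illustration and say nothing about how the provider stores its state.

```python
from datetime import datetime


def classify_items(registry: dict[str, datetime],
                   items: list[tuple[str, datetime]]) -> dict[str, str]:
    """Decide per feed item whether to index, reindex or skip it.

    registry: hypothetical map of known link -> registered publication date.
    items: (link, publication date) pairs read from the current feed.
    """
    actions = {}
    for link, published in items:
        if link not in registry:
            actions[link] = "index"      # new link: new ES document
        elif registry[link] != published:
            actions[link] = "reindex"    # date differs, earlier or later
        else:
            actions[link] = "skip"       # unchanged
    # Links in the registry that are absent from the feed get no action:
    # falling out of the sliding window does not delete a document.
    return actions
```

Note that the comparison is a plain inequality, so a publication date that moves backwards triggers a reindex just as a later one does.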
Removed documents
If a document has not been reindexed or seen for more than, for example, 48 hours, the provider checks whether it still exists by requesting the resource at its known link. If the resource is gone, the source should respond with an HTTP 404 (Not Found) error, and the provider then deletes the document.
This means that the RSS feed is in control of the lifetime of the ES document: as long as the resource is online, the ES document continues to exist.
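The removal check can be sketched as follows. The names should_delete, http_status and STALE_AFTER are hypothetical, the 48-hour threshold is the example value from the text, and the use of a plain GET request is an assumption; the text only states that the provider requests the resource.

```python
import urllib.error
import urllib.request
from datetime import datetime, timedelta
from typing import Callable

STALE_AFTER = timedelta(hours=48)  # example value taken from the text


def http_status(url: str) -> int:
    """Request the resource and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urlopen raises on 4xx/5xx; the code is the status


def should_delete(last_seen: datetime, now: datetime,
                  fetch_status: Callable[[], int]) -> bool:
    """Return True only for a stale document whose resource answers 404."""
    if now - last_seen <= STALE_AFTER:
        return False  # seen or reindexed recently: no existence check
    return fetch_status() == 404
```

In use, the status callable would wrap the known resource link, for example should_delete(last_seen, now, lambda: http_status(link)). Passing the status lookup as a parameter keeps the decision logic testable without network access.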