Document counts

Purpose

The manager reports various document counts. The counts are explained below.

DB documents versus ES documents

The manager distinguishes DB documents from ES documents.

  • The database table esDocuments contains one row per document, holding sufficient information to manage the document. This row is called the DB document.
  • The actual document text and metadata are stored in the Elasticsearch index. This is called the ES document.

A document always involves a DB document. The document may or may not have an associated ES document.

  • DB document and ES document: this is the regular case, for an actual document.
  • DB document only, without associated ES document: the DB document is an administrative document. Example usages include:
    • A sitemap XML is recorded as a DB document only. The sitemap does not yield an actual document; it refers to actual documents.
    • A document URL may be a redirection to a target URL for the actual HTML document. The redirection URL is recorded as a DB document only, and the target URL is recorded as an actual document.

The DB document is authoritative. It includes an indicator of whether an ES document exists or should exist.
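
The distinction can be sketched in Python as follows. This is only an illustration: the class and field names (DbDocument, has_es_document) and the example URLs are hypothetical, since the actual columns of esDocuments are not listed here.

```python
from dataclasses import dataclass
from enum import Enum


class DocumentKind(Enum):
    ACTUAL = "actual"                  # DB document with an associated ES document
    ADMINISTRATIVE = "administrative"  # DB document only (sitemap, redirection)


@dataclass
class DbDocument:
    # Hypothetical field names; the real esDocuments columns may differ.
    url: str
    has_es_document: bool  # indicator: there is, or should be, an ES document


def kind_of(doc: DbDocument) -> DocumentKind:
    """Classify a DB document using only its ES indicator."""
    if doc.has_es_document:
        return DocumentKind.ACTUAL
    return DocumentKind.ADMINISTRATIVE


docs = [
    DbDocument("https://example.org/sitemap.xml", has_es_document=False),  # sitemap
    DbDocument("https://example.org/old-page", has_es_document=False),     # redirection
    DbDocument("https://example.org/new-page", has_es_document=True),      # redirection target
]
kinds = [kind_of(d).value for d in docs]
```

In this sketch the sitemap and the redirection URL are administrative, and only the redirection target counts as an actual document.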

Background information on the counts

A document may or may not be marked as dirty.

  • A document marked as dirty requires first-time indexing or reindexing.
  • A document marked as not dirty currently requires no work.

A document may or may not be in error.

  • A document for which an error occurred during processing is marked as in error, and an error message is included.
  • A document that is successfully processed is marked as not in error.
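
The two markers can be pictured as independent flags on a document. The sketch below is a minimal illustration, not the manager's implementation: the names DocumentState, record_success and record_failure are hypothetical. It also assumes that successful processing clears the dirty marker and that a failure leaves the dirty marker unchanged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentState:
    dirty: bool = True                   # a new document requires first-time indexing
    in_error: bool = False
    error_message: Optional[str] = None  # set only while the document is in error


def record_success(state: DocumentState) -> None:
    """Mark the document as successfully processed."""
    state.in_error = False
    state.error_message = None
    state.dirty = False  # assumption: successful indexing clears the dirty marker


def record_failure(state: DocumentState, message: str) -> None:
    """Mark the document as in error and keep the error message."""
    state.in_error = True
    state.error_message = message
    # assumption: the dirty marker is left unchanged on failure


state = DocumentState()
record_failure(state, "connection timed out")
after_failure = (state.dirty, state.in_error, state.error_message)
record_success(state)
after_success = (state.dirty, state.in_error, state.error_message)
```

Keeping the flags separate allows a document to be both dirty and in error at the same time.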

Counts in the manager

The manager reports counts for actual documents: DB documents with an associated ES document.

List of sources:

  • Document count: the number of actual documents for the source. The count may be lower while indexing of the source is still in progress. Whether the documents are marked as dirty is not considered. The count excludes documents in error.

Indexing status:

  • Indexed documents: the number of actual documents for the source, identical to the document count in the list of sources.
  • Dirty documents: the number of actual documents marked as dirty. The count should decrease as a result of indexing. The count excludes documents in error.
  • Documents in error: the number of actual documents marked as in error.
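
The indexing status counts can be expressed as filters over the DB documents of a source. The following is a sketch under assumptions: the Doc class, its field names and the indexing_status function are hypothetical and only mirror the definitions above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Doc:
    # Hypothetical fields mirroring the markers described in this document.
    source: str
    has_es_document: bool  # False for administrative documents (sitemap, redirection)
    dirty: bool
    in_error: bool


def indexing_status(docs: Iterable[Doc], source: str) -> Dict[str, int]:
    """Compute the indexing status counts for one source."""
    # Only actual documents are counted; administrative documents are skipped.
    actual = [d for d in docs if d.source == source and d.has_es_document]
    return {
        # Same value as the document count in the list of sources.
        "indexed": sum(1 for d in actual if not d.in_error),
        # Dirty documents, excluding documents in error.
        "dirty": sum(1 for d in actual if d.dirty and not d.in_error),
        "in_error": sum(1 for d in actual if d.in_error),
    }


docs = [
    Doc("site", has_es_document=True, dirty=False, in_error=False),
    Doc("site", has_es_document=True, dirty=True, in_error=False),
    Doc("site", has_es_document=True, dirty=True, in_error=True),
    Doc("site", has_es_document=False, dirty=False, in_error=False),  # sitemap
    Doc("other", has_es_document=True, dirty=False, in_error=False),
]
status = indexing_status(docs, "site")
```

For the sample data the source "site" has 2 indexed documents, 1 dirty document and 1 document in error. The sitemap row and the document of the other source are not counted.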

Facet counts in the list of documents:

  • Count per source: the total number of actual documents for the source, both successfully indexed and in error. Whether the documents are marked as dirty is not considered.
  • Success count after filtering on the source: the number of actual documents for the source, successfully indexed.
  • Not indexed count after filtering on the source: the number of actual documents for the source, marked as in error.
  • Dirty count after filtering on the source: the total number of actual documents for the source marked as dirty, both successfully indexed and in error.
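
The facet counts differ from the indexing status counts in that documents in error are included in the count per source and in the dirty count. A minimal sketch, assuming a hypothetical Doc class and facet_counts function that mirror the definitions above:

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Doc:
    # Hypothetical fields mirroring the markers described in this document.
    source: str
    has_es_document: bool  # False for administrative documents
    dirty: bool
    in_error: bool


def facet_counts(docs: Iterable[Doc], source: str) -> Dict[str, int]:
    """Compute the facet counts shown after filtering on one source."""
    actual = [d for d in docs if d.source == source and d.has_es_document]
    return {
        # Successfully indexed and in error together; dirty is not considered.
        "count_per_source": len(actual),
        "success": sum(1 for d in actual if not d.in_error),
        "not_indexed": sum(1 for d in actual if d.in_error),
        # Unlike the indexing status, dirty includes documents in error.
        "dirty": sum(1 for d in actual if d.dirty),
    }


docs = [
    Doc("site", has_es_document=True, dirty=False, in_error=False),
    Doc("site", has_es_document=True, dirty=True, in_error=False),
    Doc("site", has_es_document=True, dirty=True, in_error=True),
    Doc("site", has_es_document=False, dirty=False, in_error=False),  # sitemap
]
facets = facet_counts(docs, "site")
```

For the sample data the count per source is 3, the success count is 2, the not indexed count is 1 and the dirty count is 2. The success count and the not indexed count always add up to the count per source.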