_search/*/_*_document

Smartsite 8.0 - ...

Purpose

Performs operations on documents. An Enterprise Search / Elastic Search document typically corresponds to a physical document such as a PDF, a web page, or a database object.

Invalidate document

Invalidating a document, also called marking the document as dirty, causes the document to be reindexed. Typically this occurs within 10s.

POST _search/<index names>/_invalidate_document
{
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../..."
}

Result:

{
  "index": "test",
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../...",
  "system_number": 12345,
  "count": 1
}

Request notes:

  • <index names>. Comma separated list of index names. Usually this is one index name; the index that contains the document to invalidate.
  • "source". Code of the source, as specified in the Smartsite Manager. This is the source that contains the document to invalidate. The source belongs to a silo, and the silo code corresponds to an index name. A mismatch between this index name and the index name in the path results in an error with HTTP status code Status400BadRequest.
  • "system_location". Identification of the document, unique within the source. The format is provider specific. For the web crawler provider this is a URL as it occurs in web page links, with any # anchor part removed. System locations can be found in the Smartsite Manager, tab Enterprise Search, Documents, details.
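
For the web crawler provider, the anchor removal described above can be sketched in Python as follows. This is a minimal illustration, not Smartsite code: the function name to_system_location is hypothetical, and it assumes that stripping the # part is the only normalization needed, which the crawler may not guarantee.

```python
from urllib.parse import urldefrag


def to_system_location(link_url: str) -> str:
    """Return the link URL without its '#' anchor part (hypothetical helper)."""
    # urldefrag splits off the fragment; .url is everything before the '#'.
    return urldefrag(link_url).url


if __name__ == "__main__":
    # Hypothetical example URL.
    print(to_system_location("https://www.example.com/docs/page.html#section-2"))
```

The value shown in the Smartsite Manager remains the authoritative system location; the sketch is only a convenient starting point when the link URL is all that is known.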

Result notes:

  • Http status code.
    • Status200OK and "count": 1. The document is found and invalidated, regardless of whether it was already marked as dirty.
    • Status404NotFound and "count": 0. The document is not found.
    • Status400BadRequest if there is any issue with the request, such as source not specified or not found, system location not specified, or the index name mismatch described in the request notes.
    • Status401Unauthorized if the user has no access to the index.
  • "index". Name of the index corresponding to the silo of the source.
  • "source". Equal to the value in the request.
  • "system_location". Equal to the value in the request.
  • "system_number". Document number in the Elastic Search index.
  • "count". Number of invalidated documents: 0 if the document is not found, 1 if found.

Dry run mode is supported (by adding ?dry=true to the path). The result JSON is returned with the count omitted, and no attempt is made to invalidate the document.
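
As an illustration, the request can be assembled in Python with the standard library only. This is a sketch under assumptions: BASE_URL is a hypothetical site address, the _search path is assumed to sit directly under the site root, the Content-Type header is assumed, and authentication is left out because it is not described here. The function only builds the request; it does not send it.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Hypothetical base URL of the Smartsite instance; replace with your own.
BASE_URL = "https://www.example.com"


def build_invalidate_request(index_names, source, system_location, dry=False):
    """Build (but do not send) a POST request for _invalidate_document."""
    # <index names> is a comma separated list in the path.
    names = ",".join(quote(name, safe="") for name in index_names)
    url = "{}/_search/{}/_invalidate_document".format(BASE_URL, names)
    if dry:
        # Dry run mode: the result is returned without invalidating anything.
        url += "?dry=true"
    body = json.dumps({"source": source, "system_location": system_location})
    return Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


if __name__ == "__main__":
    req = build_invalidate_request(
        ["test"], "CODE-OF-THE-SOURCE", "https://www.example.com/page.html", dry=True
    )
    print(req.get_method(), req.full_url)
```

The request could then be sent with urllib.request.urlopen(req), after adding whatever credentials the installation requires.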

Remove document

Removes both the Elastic Search document and the document registration in the database. This occurs immediately. Note that discovery may cause the document to be recreated. Removing is therefore only useful if it is known that the document will not be discovered anymore. For the web crawler this means that no web page should be found containing a hyperlink to the page or physical document corresponding to the removed document.

POST _search/<index names>/_remove_document
{
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../..."
}

Result:

{
  "index": "test",
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../...",
  "system_number": 12345,
  "count": 1
}

Request notes:

  • As for _invalidate_document.

Result notes:

  • As for _invalidate_document.
    • Status200OK and "count": 1. The document is found in the database, and this registration is removed. Whether the document was also removed from the Elastic Search index does not affect this result.
    • Status404NotFound and "count": 0. The document registration is not found in the database.

Dry run mode is supported (by adding ?dry=true to the path). The result JSON is returned with the count omitted, and no attempt is made to remove the document.
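
The status codes and counts listed above could be interpreted by a caller roughly as follows. This is a sketch, not part of the API: the function name and the returned labels are hypothetical, and treating a result without "count" as a dry run is an assumption based on the dry run description.

```python
def describe_remove_result(status_code, result):
    """Map an HTTP status code and parsed result JSON to a short label.

    'result' is the parsed JSON body as a dict (may be empty on errors).
    """
    if status_code == 400:
        return "bad request"    # e.g. source missing or index name mismatch
    if status_code == 401:
        return "unauthorized"   # the user has no access to the index
    count = result.get("count")
    if count is None:
        return "dry run"        # assumption: count is omitted only in dry run mode
    if status_code == 200 and count == 1:
        return "removed"        # database registration found and removed
    if status_code == 404 and count == 0:
        return "not found"      # no registration in the database
    return "unexpected"


if __name__ == "__main__":
    print(describe_remove_result(200, {"index": "test", "count": 1}))
```

Note that "removed" only reflects the database registration; as stated above, whether the document was also removed from the Elastic Search index does not affect the result.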

Add document

Adds a document system location, for example a URL, which causes the document to be found and indexed. Typically this occurs within 10s.

POST _search/<index names>/_add_document
{
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../..."
}

Result:

{
  "index": "test",
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../...",
  "system_number": 12345,
  "count": 1
}

Request notes:

  • As for _invalidate_document.

Result notes:

  • As for _invalidate_document.
    • Status200OK and "count": 1. The document system location is new. It is added, and indexing will occur soon.
    • Status200OK and "count": 0. The document system location exists. The document is invalidated, and re-indexing will occur soon.

Dry run mode is supported (by adding ?dry=true to the path). The result JSON is returned with the count omitted, and no attempt is made to add the document.
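
Because both outcomes of _add_document return Status200OK, a caller has to look at "count" to tell them apart. A minimal sketch of that check, with a hypothetical function name and hypothetical labels, and assuming that a missing "count" means the request was a dry run:

```python
def describe_add_result(status_code, result):
    """Distinguish a newly added location from an existing, invalidated one.

    'result' is the parsed JSON body as a dict.
    """
    if status_code != 200:
        return "failed"          # e.g. Status400BadRequest or Status401Unauthorized
    count = result.get("count")
    if count == 1:
        return "added"           # new system location, indexing will occur soon
    if count == 0:
        return "invalidated"     # existing location, re-indexing will occur soon
    return "dry run"             # assumption: count is omitted in dry run mode


if __name__ == "__main__":
    print(describe_add_result(200, {"index": "test", "count": 0}))
```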

Get document

Gets document details.

POST _search/<index names>/_get_document
{
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../..."
}

Result:

{
  "index": "test",
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../...",
  "system_number": 12345,
  "count": 1,
  "_source": {
    "doc_title": "...",
    "doc_authors": [
      "..."
    ]
  }
}

Request notes:

  • As for _invalidate_document.

Result notes:

  • As for _invalidate_document.
    • Status200OK and "count": 1. The document registration is found in the database.
    • Status404NotFound and "count": 0. The document registration is not found in the database.
  • "_source". This field is present if the document registration is found in the database and the document details are found in the Elastic Search index. It contains document fields, extra fields and system fields. The format is equal to the format in the search result.

Dry run mode is supported (by adding ?dry=true to the path). The result JSON is returned with "count" and "_source" omitted, and no attempt is made to get the document.
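
Since "_source" may be absent even when the registration is found, a caller should read it defensively. The following sketch parses a result body with the standard library; the function name is hypothetical, and SAMPLE_RESULT simply repeats the example result above, including its elided "..." values.

```python
import json

# Example result as shown above; the "..." values are elided in the documentation.
SAMPLE_RESULT = """
{
  "index": "test",
  "source": "CODE-OF-THE-SOURCE",
  "system_location": "https://.../...",
  "system_number": 12345,
  "count": 1,
  "_source": {
    "doc_title": "...",
    "doc_authors": ["..."]
  }
}
"""


def read_get_document_result(response_text):
    """Return (found, fields) from a _get_document result body.

    found  -- True if the registration was found in the database (count is 1)
    fields -- the "_source" dict, or None if no details were returned
    """
    result = json.loads(response_text)
    found = result.get("count") == 1
    fields = result.get("_source")
    return found, fields


if __name__ == "__main__":
    found, fields = read_get_document_result(SAMPLE_RESULT)
    print(found, fields.get("doc_title") if fields else None)
```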