Document data XML
Purpose
The document data XML contains one element per field of a single Enterprise Search (ES) document, for example the title in element <doc_title>...</doc_title>.
Example
Example content:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<documentdata xmlns="http://smartsite.nl/namespaces/documentdata/1.0">
  <version>1.0</version>
  <meta>
    <dependencies>
      <item type="CONTENT" nr="9636" />
      <item type="CONTENT" nr="8725" />
      <item type="TERM" nr="1034" />
      <item type="TERM" nr="1152" />
    </dependencies>
  </meta>
  <esdocument>
    <doc_title>Asbest</doc_title>
    <doc_url>http://Kennis/Kennissysteem-Woningcorperatie/Seneca-Woont/Asbest.html</doc_url>
    <doc_body type="html">
      ...
    </doc_body>
    <extra_acties_title>
      <value>Wanneer gevaarlijk?</value>
      <value>Waar vind je asbest?</value>
    </extra_acties_title>
    <extra_begrippen_title>
      <value>Hechtgebonden asbest</value>
    </extra_begrippen_title>
    <extra_trefwoorden_title>
      <value>Asbest</value>
      <value>Hechtgebonden asbest</value>
    </extra_trefwoorden_title>
  </esdocument>
</documentdata>
XML specification
Namespace
The namespace is as specified in the example.
Version
The version must be 1.0; currently this is the only possible version.
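As a minimal sketch of how a consumer might check these two requirements, the Python snippet below parses a document data XML string with the standard library and verifies the root element, its namespace, and the version. The helper name `check_document_data` and its error messages are illustrative assumptions, not part of Enterprise Search.

```python
import xml.etree.ElementTree as ET

# Namespace as given in the example above.
NS = "http://smartsite.nl/namespaces/documentdata/1.0"


def check_document_data(xml_text: str) -> ET.Element:
    """Parse the XML and verify root element, namespace and version (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ElementTree exposes namespaced tags as "{namespace}localname".
    if root.tag != f"{{{NS}}}documentdata":
        raise ValueError(f"unexpected root element: {root.tag}")
    version = root.findtext(f"{{{NS}}}version")
    if version is None or version.strip() != "1.0":
        raise ValueError(f"unsupported version: {version!r}")
    return root
```

A document with a different namespace or a version other than 1.0 raises `ValueError` in this sketch; how ES itself reacts to such input is not specified here.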
Dependencies
Dependencies are part of the mechanism for real-time reindexing: when a document depends on a number of items and one of those items changes, the document is reindexed.
Dependencies are optional.
The source of the document in the above example specifies:
- The document depends on two content items with numbers 9636 and 8725.
- The document depends on two thesaurus terms with numbers 1034 and 1152.
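The dependency list above can be read with a few lines of Python. This is a sketch only: the helper name `read_dependencies` and the `dd` prefix are assumptions chosen for illustration, and the sample is a trimmed copy of the example document.

```python
import xml.etree.ElementTree as ET

# Prefix "dd" is an arbitrary local alias for the document data namespace.
NS = {"dd": "http://smartsite.nl/namespaces/documentdata/1.0"}


def read_dependencies(root: ET.Element) -> list[tuple[str, int]]:
    """Return (type, nr) pairs; the list is empty when no dependencies are given."""
    return [
        (item.get("type"), int(item.get("nr")))
        for item in root.findall("dd:meta/dd:dependencies/dd:item", NS)
    ]


SAMPLE = """<documentdata xmlns="http://smartsite.nl/namespaces/documentdata/1.0">
  <version>1.0</version>
  <meta>
    <dependencies>
      <item type="CONTENT" nr="9636" />
      <item type="CONTENT" nr="8725" />
      <item type="TERM" nr="1034" />
      <item type="TERM" nr="1152" />
    </dependencies>
  </meta>
</documentdata>"""

print(read_dependencies(ET.fromstring(SAMPLE)))
# [('CONTENT', 9636), ('CONTENT', 8725), ('TERM', 1034), ('TERM', 1152)]
```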
The provider stores dependencies for the document (in table esDependencies, along with the document registration in table esDocuments).
Real-time reindexing requires an additional mechanism that informs Enterprise Search when an item has changed. This allows ES to mark dependent documents as dirty, which in turn causes those documents to be reindexed.
One such mechanism, in the Smartsite manager, intercepts changes to content items and thesaurus terms and marks dependent documents as dirty. For this to work, the source of the documents and the receiving ES must be in the same environment, so that content item numbers and thesaurus term numbers match.
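The idea of marking dependent documents as dirty can be sketched with an in-memory lookup. The class below is purely illustrative: it stands in for the bookkeeping in tables esDependencies and esDocuments, and its name, methods, and the document identifiers `asbest` and `other` are assumptions, not the actual ES implementation.

```python
from collections import defaultdict


class DependencyIndex:
    """In-memory stand-in for dependency bookkeeping (illustrative only)."""

    def __init__(self) -> None:
        # Maps (item type, item number) to the documents that depend on it.
        self._dependents: dict[tuple[str, int], set[str]] = defaultdict(set)
        self.dirty: set[str] = set()

    def register(self, doc_id: str, dependencies: list[tuple[str, int]]) -> None:
        """Record the dependencies supplied with a document."""
        for dep in dependencies:
            self._dependents[dep].add(doc_id)

    def item_changed(self, item_type: str, nr: int) -> set[str]:
        """Mark every document that depends on the changed item as dirty."""
        affected = set(self._dependents.get((item_type, nr), ()))
        self.dirty |= affected
        return affected


index = DependencyIndex()
index.register("asbest", [("CONTENT", 9636), ("TERM", 1034)])
index.register("other", [("CONTENT", 8725)])
index.item_changed("TERM", 1034)
print(index.dirty)  # {'asbest'}
```

Because the lookup key is the item number, the sketch also shows why source and receiving ES must share an environment: a change to term 1034 in a different environment would refer to a different term.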
ES document
The table below specifies the elements that can be used to populate an ES document.
- For details per standard field refer to the page about standard fields.
- An element name in round brackets indicates that the element is not supported although the field with that name exists.
- An indicator 1 specifies an element, such as doc_title, that contains a single value.
- An indicator n specifies an element, such as doc_keywords, that contains value subelements.
Element name | Type | 1 or n | Description |
---|---|---|---|
doc_authors | String | n | Authors. |
doc_body | String | 1 | Main text of the document. |
doc_created | DateTime | 1 | Creation date. |
doc_description | String | 1 | Description. |
(doc_fileformat) | FileFormat | 1 | Currently not supported. |
doc_identifier | String | 1 | Identifier. |
doc_keywords | String | n | Keywords. |
doc_language | DocumentLanguage | 1 | Language. Several values are accepted, including dutch, english, french, german, nl, en, fr, de; case insensitive. |
doc_modified | DateTime | 1 | Modification date. |
doc_publisher | String | 1 | Publisher. |
doc_title | String | 1 | Title. |
doc_url | Uri | 1 | Uniform resource locator used to open the original document. |
extra_* | * | 1_n | See the page about extra fields. |
(system_autocomplete) | String | n | Handled by the system. |
(system_document_size) | LongInteger | 1 | Handled by the system. |
(system_guid) | Guid | 1 | Handled by the system. |
(system_location) | String | 1 | Handled by the system. |
(system_number) | LongInteger | 1 | Handled by the system. |
(system_phrase_suggester) | String | 1 | Handled by the system. |
(system_provider_code) | String | 1 | Handled by the system. |
(system_source_code) | String | 1 | Handled by the system. |
system_usergroups | String | n | Codes of authorization groups that are allowed to find and view the document. |
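To illustrate the difference between indicator 1 and indicator n, the sketch below builds an esdocument element in Python: a string becomes a single-value element, a list becomes an element with value subelements. The helper name `build_esdocument` and the sample field values are assumptions for illustration; the sketch does not check field names against the table.

```python
import xml.etree.ElementTree as ET

NS = "http://smartsite.nl/namespaces/documentdata/1.0"


def build_esdocument(fields: dict) -> ET.Element:
    """Build <esdocument>; str values map to indicator 1, lists to indicator n."""
    esdoc = ET.Element(f"{{{NS}}}esdocument")
    for name, value in fields.items():
        element = ET.SubElement(esdoc, f"{{{NS}}}{name}")
        if isinstance(value, str):
            element.text = value  # single value, e.g. doc_title
        else:
            for item in value:  # multivalue, e.g. doc_keywords
                ET.SubElement(element, f"{{{NS}}}value").text = item
    return esdoc


# Serialize with the document data namespace as the default namespace.
ET.register_namespace("", NS)
doc = build_esdocument(
    {"doc_title": "Asbest", "doc_keywords": ["Asbest", "Hechtgebonden asbest"]}
)
print(ET.tostring(doc, encoding="unicode"))
```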
Text type
An element containing HTML should have the attribute type="html". This requests that HTML tags be stripped from the text and that HTML character entities be replaced by their actual characters before the field is indexed. This prevents HTML element names and character entity names from ending up in the Elasticsearch index and polluting the search results.
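The effect of this cleanup can be approximated with Python's standard `html.parser` module. This is a sketch of the general technique only; the exact algorithm used by ES is not described here, and the names `_TextExtractor` and `html_to_text` are illustrative.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and drops all tags."""

    def __init__(self) -> None:
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(html_text: str) -> str:
    """Strip tags, decode character entities and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(html_to_text("<p>Caf&eacute; &amp; <b>asbest</b></p>"))  # Café & asbest
```

Without such a step, tokens like `p`, `b`, `eacute` and `amp` could be indexed alongside the real words.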
Valid values for the type attribute are:
- html
- text (default)
Applied to a single value:
<doc_body type="html">
  ...
</doc_body>
Applied to all values in a multivalue:
<doc_authors type="html">
  <value>...</value>
  <value>...</value>
</doc_authors>
or applied to some values in a multivalue:
<doc_authors>
  <value type="html">...</value>
  <value>...</value>
</doc_authors>
or, overriding the parent:
<doc_authors type="html">
  <value>...</value>
  <value type="text">...</value>
</doc_authors>
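The four variants above follow one rule: a value uses its own type attribute if present, otherwise the type of its parent element, otherwise the default text. A minimal Python sketch of that resolution follows; the helper name `effective_types` is an assumption, and the namespace is left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET


def effective_types(field: ET.Element) -> list[tuple[str, str]]:
    """Return (effective type, text) for each value of a field element."""
    parent_type = field.get("type", "text")  # "text" is the default
    values = field.findall("value")
    if not values:
        # Single-value element: the type applies to the element's own text.
        return [(parent_type, field.text or "")]
    # Multivalue element: a value's own type overrides the parent's type.
    return [(v.get("type", parent_type), v.text or "") for v in values]


field = ET.fromstring(
    '<doc_authors type="html"><value>A</value><value type="text">B</value></doc_authors>'
)
print(effective_types(field))  # [('html', 'A'), ('text', 'B')]
```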