Document data xml

Smartsite 7.9 - ...

Purpose

The document data xml contains elements for the fields of one Enterprise Search (ES) document, for example a title in the element <doc_title>...</doc_title>.

Example

Example content:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<documentdata xmlns="http://smartsite.nl/namespaces/documentdata/1.0">
    <version>1.0</version>
    <meta>
        <dependencies>
            <item type="CONTENT" nr="9636" />
            <item type="CONTENT" nr="8725" />
            <item type="TERM" nr="1034" />
            <item type="TERM" nr="1152" />
        </dependencies>
    </meta>
    <esdocument>
        <doc_title>Asbest</doc_title>
        <doc_url>http://Kennis/Kennissysteem-Woningcorperatie/Seneca-Woont/Asbest.html</doc_url>
        <doc_body type="html">
            ...
        </doc_body>
        <extra_acties_title>
            <value>Wanneer gevaarlijk?</value>
            <value>Waar vind je asbest?</value>
        </extra_acties_title>
        <extra_begrippen_title>
            <value>Hechtgebonden asbest</value>
        </extra_begrippen_title>
        <extra_trefwoorden_title>
            <value>Asbest</value>
            <value>Hechtgebonden asbest</value>
        </extra_trefwoorden_title>
    </esdocument>
</documentdata>

Xml specification

Namespace

The namespace must be http://smartsite.nl/namespaces/documentdata/1.0, as in the example.

Version

The version must be 1.0; currently this is the only possible version.

Dependencies

Dependencies are part of the mechanism for real-time reindexing: when a document depends on a number of items and one of those items changes, the document is reindexed.

Dependencies are optional.

The source of the document in the above example specifies:

  • The document depends on two content items with numbers 9636 and 8725.
  • The document depends on two thesaurus terms with numbers 1034 and 1152.
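
As an illustration, the dependency list can be read back with any namespace-aware XML parser. The following is a minimal sketch in Python, using only the standard library; the function name read_dependencies and the shortened sample are assumptions made for this example, not part of Smartsite.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example document.
NS = {"dd": "http://smartsite.nl/namespaces/documentdata/1.0"}

# Shortened copy of the example document (hypothetical test input).
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<documentdata xmlns="http://smartsite.nl/namespaces/documentdata/1.0">
    <version>1.0</version>
    <meta>
        <dependencies>
            <item type="CONTENT" nr="9636" />
            <item type="CONTENT" nr="8725" />
            <item type="TERM" nr="1034" />
            <item type="TERM" nr="1152" />
        </dependencies>
    </meta>
    <esdocument>
        <doc_title>Asbest</doc_title>
    </esdocument>
</documentdata>
"""


def read_dependencies(xml_text):
    """Return the (type, number) pairs listed under meta/dependencies."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    items = root.findall("dd:meta/dd:dependencies/dd:item", NS)
    return [(item.get("type"), int(item.get("nr"))) for item in items]


dependencies = read_dependencies(SAMPLE)
print(dependencies)
```

Because dependencies are optional, the function returns an empty list when the document has none.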

The provider stores dependencies for the document (in table esDependencies, along with the document registration in table esDocuments).

Real-time reindexing requires an additional mechanism that informs Enterprise Search that an item has changed. This allows ES to mark dependent documents as dirty, which in turn causes those documents to be reindexed.

One such mechanism, in the Smartsite manager, intercepts changes to content items and thesaurus terms and marks dependent documents as dirty. For this to work, the source of the documents and the receiving ES must be in the same environment, so that content item numbers and thesaurus term numbers match.
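
The idea of marking documents as dirty can be sketched as a reverse lookup from item to documents. The Python sketch below is purely illustrative: it keeps everything in memory, and the names dependents, dirty, register and item_changed as well as the document ids are hypothetical. It does not model the actual tables or the Smartsite manager.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the dependency registrations;
# the real storage (table esDependencies) is not modelled here.
dependents = defaultdict(set)   # (item type, item number) -> document ids
dirty = set()                   # document ids waiting to be reindexed


def register(document_id, dependencies):
    """Record which items a document depends on."""
    for item in dependencies:
        dependents[item].add(document_id)


def item_changed(item_type, item_number):
    """Mark every document that depends on the changed item as dirty."""
    affected = dependents.get((item_type, item_number), set())
    dirty.update(affected)
    return affected


register("doc-asbest", [("CONTENT", 9636), ("CONTENT", 8725), ("TERM", 1034), ("TERM", 1152)])
register("doc-other", [("TERM", 1034)])

item_changed("CONTENT", 9636)   # only doc-asbest depends on this item
item_changed("TERM", 9999)      # no document depends on this item
print(sorted(dirty))
```

The lookup only works when both sides use the same item numbers, which is why the source and the receiving ES must share one environment.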

ES document

The table below specifies the elements that can be used to populate an ES document.

  • For details per standard field refer to the page about standard fields.
  • An element name in round brackets indicates that the element is not supported, although a field with that name exists.
  • An indicator 1 specifies an element, such as doc_title, that contains a single value.
  • An indicator n specifies an element, such as doc_keywords, that contains value subelements.

Element name               Type              1/n     Description
-------------------------  ----------------  ------  -----------
doc_authors                String            n       Authors.
doc_body                   String            1       Main text of the document.
doc_created                DateTime          1       Creation date.
doc_description            String            1       Description.
(doc_fileformat)           FileFormat        1       Currently not supported.
doc_identifier             String            1       Identifier.
doc_keywords               String            n       Keywords.
doc_language               DocumentLanguage  1       Language. Several values are accepted, including dutch, english, french, german, nl, en, fr, de; case insensitive.
doc_modified               DateTime          1       Modification date.
doc_publisher              String            1       Publisher.
doc_title                  String            1       Title.
doc_url                    Uri               1       Uniform resource locator used to open the original document.
extra_*                    *                 1 or n  See the page about extra fields.
(system_autocomplete)      String            n       Handled by the system.
(system_document_size)     LongInteger       1       Handled by the system.
(system_guid)              Guid              1       Handled by the system.
(system_location)          String            1       Handled by the system.
(system_number)            LongInteger       1       Handled by the system.
(system_phrase_suggester)  String            1       Handled by the system.
(system_provider_code)     String            1       Handled by the system.
(system_source_code)       String            1       Handled by the system.
system_usergroups          String            n       Codes of authorization groups that are allowed to find and view the document.
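
To illustrate the difference between indicator 1 and indicator n, the sketch below builds an esdocument element in Python with the standard library. The function name build_esdocument and the sample values are assumptions for this example. A real document also needs the surrounding documentdata and version elements shown in the example above.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://smartsite.nl/namespaces/documentdata/1.0"


def build_esdocument(single_fields, multi_fields):
    """Build an esdocument element from single-value and multi-value fields."""
    ET.register_namespace("", NAMESPACE)  # serialize with a default namespace

    def tag(name):
        return "{%s}%s" % (NAMESPACE, name)

    doc = ET.Element(tag("esdocument"))
    # Indicator 1: the text goes directly into the element.
    for name, value in single_fields.items():
        ET.SubElement(doc, tag(name)).text = value
    # Indicator n: one <value> subelement per entry.
    for name, values in multi_fields.items():
        field = ET.SubElement(doc, tag(name))
        for value in values:
            ET.SubElement(field, tag("value")).text = value
    return doc


esdocument = build_esdocument(
    {"doc_title": "Asbest", "doc_language": "nl"},
    {"doc_keywords": ["Asbest", "Hechtgebonden asbest"]},
)
xml_text = ET.tostring(esdocument, encoding="unicode")
print(xml_text)
```

Elements shown in round brackets in the table are not supported, so they should not be passed to a builder like this.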

Text type

An element containing HTML should have the attribute type="html". This requests that HTML tags be stripped from the text and that HTML character entities be replaced by their actual characters before the field is indexed. This prevents HTML element names and character entity names from ending up in the Elasticsearch index and polluting the search results.

Valid values for the type attribute are:

  • html
  • text (default)
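
The effect requested by type="html" can be approximated as follows. This Python sketch uses the standard library HTMLParser and only illustrates the idea of stripping tags and replacing entities; it is not the implementation used by Enterprise Search, and the names _TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content and drop the tags around it."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html_text):
    """Return html_text without tags and with entities replaced."""
    extractor = _TextExtractor()
    extractor.feed(html_text)
    extractor.close()  # flush any text that is still buffered
    return "".join(extractor.parts)


print(html_to_text("<p>Asbest &amp; veiligheid: <b>Wanneer gevaarlijk?</b></p>"))
# prints: Asbest & veiligheid: Wanneer gevaarlijk?
```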

Applied to a single value:

<doc_body type="html">
...
</doc_body>

Applied to all values in a multi-value element:

<doc_authors type="html">
<value>...</value>
<value>...</value>
</doc_authors>

or applied to some values in a multi-value element:

<doc_authors>
<value type="html">...</value>
<value>...</value>
</doc_authors>

or, overriding the parent:

<doc_authors type="html">
<value>...</value>
<value type="text">...</value>
</doc_authors>
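
The rules shown in these examples (text is the default, a type on the parent applies to all values, and a type on a value overrides the parent) can be summarized in a few lines. The Python sketch below is an interpretation of those examples, not the actual ES code. The function name effective_types and the placeholder values are assumptions, and the namespace is left out to keep the fragment short.

```python
import xml.etree.ElementTree as ET


def effective_types(field):
    """Return (text, type) for each value of a multi-value field element."""
    parent_type = field.get("type", "text")           # text is the default
    return [
        (value.text, value.get("type", parent_type))  # a value's own type wins
        for value in field.findall("value")
    ]


# Placeholder content; the parent requests html, the second value overrides it.
field = ET.fromstring(
    '<doc_authors type="html">'
    "<value>first</value>"
    '<value type="text">second</value>'
    "</doc_authors>"
)
print(effective_types(field))
# prints: [('first', 'html'), ('second', 'text')]
```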