Elasticsearch GraphDB Connector

compared with
Current by Milena Yankova
on Oct 16, 2015 14:51.

In general, this corresponds to _INSERT adds or modifies data_ and _SELECT queries existing data_.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Elasticsearch GraphDB Connector, this is [http://www.ontotext.com/connectors/elasticsearch#]. Each command or predicate executed by the connector uses this prefix, e.g., [http://www.ontotext.com/connectors/elasticsearch#createConnector] to create a connector instance for Elasticsearch.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Elasticsearch, the instance prefix is [http://www.ontotext.com/connectors/elasticsearch/instance#].
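
For illustration, the two prefixes might be declared at the top of a SPARQL request as sketched here. This is only a query prologue, not a complete query, and the prefix labels (the empty prefix and {{inst}}) are arbitrary choices made for this sketch, not names required by the connector.

{div:style=width: 70em}{noformat}
# Connector commands and predicates, e.g. :createConnector
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
# Names of connector instances, e.g. inst:my_index
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
{noformat}{div}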

h3. Sample data

All examples use the following sample data, which describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.

{div:style=width: 70em}{noformat}
The Elasticsearch GraphDB Connector requires the [GraphDB Enterprise|GraphDB-Enterprise] edition. If you only have [GraphDB SE|GraphDB-SE], please check out the [Lucene GraphDB Connector] instead.

The connector works at a lower level than the cluster synchronisation and thus requires a transactional entity pool (to ensure that entity IDs are consistent within the cluster). The default entity pool is non-transactional. Please refer to [GraphDB-SE Entity Pool|GraphDB-SE Entity Pool] to enable a transactional entity pool.

{note}
The above command creates a new Elasticsearch connector instance that connects to the Elasticsearch instance accessible at port 9300 on the localhost as specified by the "elasticsearchUrl" key.

The "types" key defines the RDF type of the entities to synchronise. In the example, these are only entities of the type <[http://www.ontotext.com/example/wine#Wine]> (and its subtypes). The "fields" key defines the mapping from RDF to Elasticsearch. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, three pieces of information are mapped: the grape the wines are made of, the sugar content, and the year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.

|| field name || Elasticsearch config ||
| \_graphdb_id | "type":"long", "index":"not_analyzed", "store":"yes" |
| \_chains | "type":"long", "index":"not_analyzed", "store":"no" |
| grape | "type":"string", "index":"analyzed", "store":"yes" |
| sugar | "type":"string", "index":"analyzed", "store":"yes" |
| year | "type":"integer", "index":"analyzed", "store":"yes" |

\_graphdb_id and \_chains are used internally by GraphDB and are always required.

h2. Dropping a connector instance
{noformat}{div}

*?cntUri* is bound to the prefixed URI of the connector instance that was used during creation, e.g., <[http://www.ontotext.com/connectors/elasticsearch/instance#my_index]>, while *?cntStr* is bound to a string, representing the part after the prefix, e.g., "my_index".
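
The bindings described above could be retrieved with a query along these lines. This is a sketch that assumes the connector's {{:listConnectors}} predicate; the prefix label is an arbitrary choice.

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>

# ?cntUri receives the instance URI, ?cntStr the name after the instance prefix
# (assumes the :listConnectors predicate)
SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}
{noformat}{div}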

h2. Instance status check

# Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector instance URI. X is bound to a query instance of the connector instance.
# Assign a query to the query instance by using the system predicate :query.
# Request the matching entities through the :entities predicate.
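
The three steps above can be sketched as a single query. The instance name {{my_index}}, the prefix labels and the query string are assumptions made for illustration, and this is not necessarily the query that produced the results shown in this section.

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
  # Step 1: ?search becomes a query instance of the connector instance my_index
  ?search a inst:my_index ;
  # Step 2: assign a query (hypothetical Elasticsearch query string)
          :query "grape:cabernet" ;
  # Step 3: request the matching entities
          :entities ?entity .
}
{noformat}{div}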

The result looks like this:

|| ?entity || ?grape || ?year ||
| :Yoyowine | :CabernetSauvignon | 2013 |
| :Franvino | :Merlo | 2012 |

{anchor:sorting}

h2. Sorting

It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the *orderBy* predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:
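
A query of roughly this shape would sort the matching wines by year in descending order. It is a sketch: the instance name {{my_index}}, the prefix labels and the query string are assumptions made for illustration.

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
          :query "grape:cabernet" ;   # hypothetical query string
          :orderBy "-year" ;          # leading minus = descending order
          :entities ?entity .         # results are requested last
}
{noformat}{div}

The sort option is placed before {{:entities}} because options that shape the query need to be specified before the results are fetched.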

{div:style=width: 70em}{noformat}
{noformat}{div}

The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino will be second in the list:

|| entity ||
h4. defaultValue (string), optional, specifies a default value for the field

The defaultValue option specifies the value to use for the field when the property chain has no matching values in GraphDB. The value can be a plain literal, a literal with a datatype (the xsd: prefix is supported), a literal with a language, or a URI. The option itself has no default.
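
As an illustration, a field definition with a plain-literal default might look like the fragment below. It is a sketch rather than a complete connector configuration: the value "unknown" is made up, and the madeFromGrape URI assumes that the property lives in the same namespace as the Wine type.

{div:style=width: 70em}{noformat}
{
  "fieldName": "grape",
  "propertyChain": [
    "http://www.ontotext.com/example/wine#madeFromGrape",
    "http://www.w3.org/2000/01/rdf-schema#label"
  ],
  "defaultValue": "unknown"
}
{noformat}{div}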

h4. indexed (boolean), optional, default true

{note:title=Limitation}
Physical fields cannot be combined with parent() as their values come from different property chains. If you really need to filter the same parent level, you can rewrite {nf}parent(?myField) in (<urn:x>, <urn:y>){nf} as {nf}parent(?myField/1) in (<urn:x>, <urn:y>) \|\| parent(?myField/2) in (<urn:x>, <urn:y>) \|\| parent(?myField/3) ...{nf} and surround it with parentheses if it is part of a bigger expression.
{note}

| ?var not in (_value1_, _value2_, ...) | The negated version of the in-operator. | {nf}?status not in ("archived"){nf} |
| bound(?var) | Tests if the field _var_ has a valid value. This can be used to make the field compulsory. | bound(?name) |
| _expr1_ \|\| _expr2_ | Logical disjunction of expressions _expr1_ and _expr2_. | {nf}bound(?name) \|\| bound(?company){nf} |
| _expr1_ && _expr2_ | Logical conjunction of expressions _expr1_ and _expr2_. | {nf}bound(?status) && ?status in ("active", "new"){nf} |
| \!_expr_ | Logical negation of expression _expr_. | {nf}\!bound(?company){nf} |
| ( expr ) | Grouping of expressions | {nf}(bound(?name) \|\| bound(?company)) && bound(?address){nf} |

{note}
h4. Accessing an element beyond the chain

The construction *?var \-> _uri_* (alternatively *?var o _uri_* or just *?var _uri_*) is used to access additional values that are accessible through the property _uri_. In essence, this construction corresponds to the triple pattern ?value _uri_ ?effectiveValue, where ?value is a value bound by the field _var_. The effective value of ?var \-> _uri_ can be used with the *in* or *not in* operator like this: {nf}?company \-> rdf:type in (<urn:c>, <urn:d>){nf}. It can be combined with parent() like this: {nf}parent(?company) \-> rdf:type in (<urn:c>, <urn:d>){nf}. The same construction can be applied to the *bound* operator like this: {nf}bound(?company \-> <urn:hasBranch>){nf}, or even combined with parent() like this: {nf}bound(parent(?company) \-> <urn:hasGroup>){nf}.

The URI parameter can be a full URI within < > or the special string _rdf:type_ (alternatively just _type_), which will be expanded to [http://www.w3.org/1999/02/22-rdf-syntax-ns#type].

h4. Filtering by RDF graph

The construction *graph(?var)* is used to access the RDF graph of a field's value. The typical use case is to sync only explicit values: {nf}graph(?a) not in (<[http://www.ontotext.com/implicit]>){nf}. The construction can be combined with *parent()* like this: {nf}graph(parent(?a)) in (<urn:a>){nf}.

h4. Entity filters and default values
{noformat}{div}

Now, if you map this data to Elasticsearch so that the property *:taggedWith* _x_ is mapped to the separate fields *taggedWithPerson* and *taggedWithLocation* according to the type of _x_ (we are not interested in events), you can map taggedWith twice to different fields and then use an entity filter to get the desired values:
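
The relevant part of such a configuration might look like the fragment below. It is a sketch, not a complete connector definition: the key name "entityFilter" and the URIs for Article, Person, Location and taggedWith are assumptions based on the example2 namespace used in this section.

{div:style=width: 70em}{noformat}
{
  "types": ["http://www.ontotext.com/example2#Article"],
  "fields": [
    {
      "fieldName": "taggedWithPerson",
      "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
    },
    {
      "fieldName": "taggedWithLocation",
      "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
    }
  ],
  "entityFilter": "?taggedWithPerson type in (<http://www.ontotext.com/example2#Person>) && ?taggedWithLocation type in (<http://www.ontotext.com/example2#Location>)"
}
{noformat}{div}

Both fields share the same property chain; only the filter on the type of each value decides which field a value ends up in.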

{div:style=width: 70em}{noformat}

{note}
Note that *type* is the short way to write <[http://www.w3.org/1999/02/22-rdf-syntax-ns#type]>.
{note}

|| Article URI || Entity mapped? || Value in taggedWithPerson || Value in taggedWithLocation || Explanation ||
| :Article1 | yes | :Einstein | :Berlin | :taggedWith has the values :Einstein, :Berlin and :Cannes-FF. The filter leaves only the correct values in the respective fields. The value :Cannes-FF is ignored as it does not match the filter. |
| :Article2 | yes | | :Berlin | :taggedWith has the value :Berlin. After the filter is applied, only taggedWithLocation is populated. |
| :Article3 | yes | :Mozart | | :taggedWith has the value :Mozart. After the filter is applied, only taggedWithPerson is populated. |
| :Article4 | yes | :Mozart | :Berlin | :taggedWith has the values :Berlin and :Mozart. The filter leaves only the correct values in the respective fields. |
| :Article5 | yes | | | :taggedWith has no values. The filter is not relevant. |
| :Article6 | yes | | | :taggedWith has the value :Cannes-FF. The filter removes it as it does not match. |

This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:
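
A faceted search of this kind could be written roughly as follows. The sketch assumes the connector's facet predicates ({{:facetFields}}, {{:facets}}, {{:facetName}}, {{:facetValue}} and {{:facetCount}}) and uses a hypothetical instance name, {{my_index}}, for the connector created from this data.

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?facetName ?facetValue ?facetCount {
  # Facet fields are specified before the facets are requested
  ?search a inst:my_index ;
          :facetFields "taggedWithLocation,taggedWithPerson" ;
          :facets _:f .
  # Each facet result carries the field name, a value and its count
  _:f :facetName ?facetName ;
      :facetValue ?facetValue ;
      :facetCount ?facetCount .
}
{noformat}{div}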

|| ?facetName || ?facetValue || ?facetCount ||
| taggedWithLocation | [http://www.ontotext.com/example2#Berlin] | 3 |
| taggedWithPerson | [http://www.ontotext.com/example2#Mozart] | 2 |
| taggedWithPerson | [http://www.ontotext.com/example2#Einstein] | 1 |

h1. Overview of connector predicates
h2. Order of control

Even though SPARQL per se is not sensitive to the order of triple patterns, the Elasticsearch GraphDB Connector expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.

The diagram in [#Overview of connector predicates] provides a quick overview of the predicates.

h1. Upgrading from previous versions
|| field || note ||
| firstName | produced if the option "index" was true; used explicitly in queries |
| \_facet_firstName | produced if the option "facet" was true; used implicitly for facet search |
| \_sort_firstName | produced if the option "sort" was true; used implicitly for ordering connector results |

The current version always produces a single Elasticsearch field per field definition in the configuration. This means that you have to create all appropriate fields based on your needs. See more under [#List of creation parameters].

{tip}
To mimic the functionality of the old \_facet_fieldName fields, you can either create a non-analysed [copy field|#Copy fields] (for textual fields) or just use the normal field (for non-textual fields).
{tip}

{tip}
To mimic the functionality of the old \_sort_fieldName fields, you can either create a non-analysed [copy field|#Copy fields] (for textual fields) or just use the normal field (for non-textual fields).
{tip}


h2. The option manageExternalIndex

Prior to 6.2, the option _manageExternalIndex_ could be used to control the management of both the mapping and the index. In the current implementation there are separate options, _manageMapping_ and _manageIndex_. For more information, see [#Mapping and index management].