Creating a connector instance is done by sending a SPARQL query with the following configuration data:
* the name of the connector instance (e.g., my_index);
* an Elasticsearch instance to synchronise to;
* classes to synchronise;
* properties to synchronise.
The configuration data has to be provided as a JSON string representation and passed together with the create command.
{tip:title=What we recommend}
Use the GraphDB Connectors management interface provided by the GraphDB Workbench as it lets you create the configuration easily, and then create the connector instance directly or copy the configuration and execute it elsewhere.
{tip}
The create command is triggered by a SPARQL *INSERT* with the *createConnector* predicate. For example, the following update creates a connector instance called *my_index*, which synchronises the wines from the sample data above:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
inst:my_index :createConnector '''
{
"elasticsearchNode": "localhost:9300",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
]
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
]
}
]
}
''' .
}
{noformat}{div}
The above command creates a new Elasticsearch connector instance that connects to the Elasticsearch instance accessible at port 9300 on localhost, as specified by the "elasticsearchNode" key.
The "types" key defines the RDF type of the entities to synchronise and, in the example, it is only entities of the type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Elasticsearch. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, three bits of information are mapped: the grape the wines are made of, sugar content, and year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.
Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.
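To check which values a property chain would pick up, you can follow the same path with a plain SPARQL query against GraphDB. The sketch below only illustrates how the grape chain is traversed; it uses no connector predicates and assumes the sample data lives in the wine namespace shown in the configuration:
{div:style=width: 70em}{noformat}
PREFIX wine: <http://www.ontotext.com/example/wine#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?wine ?grapeLabel {
    # first link of the chain: wine -> grape instance
    ?wine a wine:Wine ;
        wine:madeFromGrape ?grape .
    # second link of the chain: grape instance -> its label
    ?grape rdfs:label ?grapeLabel .
}
{noformat}{div}
Each ?grapeLabel binding corresponds to a value that the connector would be expected to place in the "grape" field of the document for ?wine.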
h4. Mapping and index management
By default, GraphDB manages (creates, deletes or updates, as needed) the Elasticsearch index and the Elasticsearch mapping. This makes it easier to use Elasticsearch as everything is done automatically. This behaviour can be changed by the following options:
* _manageIndex_: if true, GraphDB manages the index. True by default.
* _manageMapping_: if true, GraphDB manages the mapping. True by default.
{note}
Note that if either of the options is set to false, you have to create, update or remove the index/mapping yourself and, if Elasticsearch is misconfigured, the connector instance will not function correctly.
{note}
h5. Using a non-managed schema
The present version provides no support for changing some advanced options, such as stopwords, on a per field basis. The recommended way to do that for now is to manage the mapping yourself and tell the connector to just sync the object values in the appropriate fields. Here is an example:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
inst:my_index :createConnector '''
{
"elasticsearchNode": "localhost:9300",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
]
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
]
}
],
"manageMapping": false
}
''' .
}
{noformat}{div}
This creates the same connector instance as above but it expects fields with the specified field names, as well as some internal GraphDB fields, to be already present in the index mapping. For the example, you must have the following fields:
|| field name || Elasticsearch config ||
| _graphdb_id | "type":"long", "index":"not_analyzed", "store":"yes" |
| _chains | "type":"long", "index":"not_analyzed", "store":"no" |
| grape | "type":"string", "index":"analyzed", "store":"yes" |
| sugar | "type":"string", "index":"analyzed", "store":"yes" |
| year | "type":"integer", "index":"analyzed", "store":"yes" |
_graphdb_id and _chains are used internally by GraphDB and are always required.
h2. Dropping a connector instance
Dropping a connector instance removes all references to its external store from GraphDB as well as the Elasticsearch index associated with it.
The drop command is triggered by a SPARQL *INSERT* with the *dropConnector* predicate, where the name of the connector instance has to be in the subject position. For example, the following update removes the connector *my_index*:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
inst:my_index :dropConnector "" .
}
{noformat}{div}
h2. Listing available connector instances
Listing connector instances returns all previously created instances. It is a *SELECT* query with the *listConnectors* predicate:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
SELECT ?cntUri ?cntStr {
?cntUri :listConnectors ?cntStr .
}
{noformat}{div}
*?cntUri* is bound to the prefixed URI of the connector instance that was used during creation, e.g., <http://www.ontotext.com/connectors/elasticsearch/instance#my_index>, while *?cntStr* is bound to a string, representing the part after the prefix, e.g., "my_index".
h2. Instance status check
The internal state of each connector instance can be queried using a *SELECT* query and the *connectorStatus* predicate:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
SELECT ?cntUri ?cntStatus {
?cntUri :connectorStatus ?cntStatus .
}
{noformat}{div}
*?cntUri* is bound to the prefixed URI of the connector instance, while *?cntStatus* is bound to a string representation of the status of the connector represented by this URI. The status is key-value based.
h1. Working with data
h2. Adding, updating and deleting data
From the user's point of view, all synchronisation happens transparently without using any additional predicates or naming a specific store explicitly, i.e., you should simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
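For example, a plain SPARQL update such as the one below should be enough to add a new wine to the index. This is a sketch: the wine wine:Nuevo is hypothetical, and the literal forms used for the sugar and year values are assumptions about how the sample data is modelled:
{div:style=width: 70em}{noformat}
PREFIX wine: <http://www.ontotext.com/example/wine#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT DATA {
    # hypothetical wine; no connector predicates are needed
    wine:Nuevo a wine:Wine ;
        wine:madeFromGrape wine:CabernetSauvignon ;
        wine:hasSugar "dry" ;
        wine:hasYear "2014"^^xsd:integer .
}
{noformat}{div}
Removing the same triples with DELETE DATA is expected to remove the corresponding document in the same transparent way.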
h2. Simple queries
Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a *SELECT* and providing the Elasticsearch query as the object of the *query* predicate:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
}
{noformat}{div}
The result binds *?entity* to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.
{note}
Note that you should use the field names you chose when you created the connector instance. They can be identical to the property URIs but you should escape any special characters according to what Elasticsearch expects.
{note}
The query above consists of three parts:
# Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector instance URI. X is bound to a query instance of the connector instance.
# Assign a query to the query instance by using the system predicate :query.
# Request the matching entities through the :entities predicate.
It is also possible to provide per query search options by using one or more option predicates. The option predicates are described in detail below.
h4. Raw queries
To access an Elasticsearch query parameter that is not exposed through a special predicate, use a raw query. Instead of providing a full text query in the :query part, specify raw Elasticsearch parameters. For example, to boost some parts of your full text query as described [here|http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html], execute the following query:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity {
?search a inst:my_index ;
:query '''
{
"query" : {
"bool" : {
"should" : [ {
"query_string" : {
"query" : "<full-text-query-not-boosted>"
}
}, {
"query_string" : {
"query" : "<full-text-query-boosted>",
"boost" : 4.0
}
} ]
}
}
}
''' ;
:entities ?entity .
}
{noformat}{div}
h3. Combining Elasticsearch results with GraphDB data
The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they were made:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>
SELECT ?entity ?grape ?year {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
?entity wine:madeFromGrape ?grape .
?entity wine:hasYear ?year
}
{noformat}{div}
The result looks like this:
|| ?entity || ?grape || ?year ||
| :Yoyowine | :CabernetSauvignon | 2013 |
| :Franvino | :Merlo | 2012 |
| :Franvino | :CabernetFranc | 2012 |
{note}
Note that :Franvino is returned twice because it is made from two different grapes, both of which are returned.
{note}
h3. Entity match score
It is possible to access the match score returned by Elasticsearch with the *score* predicate. As each entity has its own score, the predicate should come at the entity level. For example:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity ?score {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
?entity :score ?score
}
{noformat}{div}
The result looks like this but the actual score might be different as it depends on the specific Elasticsearch version:
|| ?entity || ?score ||
| :Yoyowine | 0.9442660212516785 |
| :Franvino | 0.7554128170013428 |
h2. Basic facet queries
Consider the sample wine data and the my_index connector instance described previously. You can also query facets using the same instance:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?facetName ?facetValue ?facetCount WHERE {
# note empty query is allowed and will just match all documents, hence no :query
?r a inst:my_index ;
:facetFields "year,sugar" ;
:facets _:f .
_:f :facetName ?facetName .
_:f :facetValue ?facetValue .
_:f :facetCount ?facetCount .
}
{noformat}{div}
It is important to specify the facet fields by using the *facetFields* predicate. Its value is a simple comma-delimited list of field names. In order to get the faceted results, use the facets predicate. As each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates *facetName*, *facetValue*, and *facetCount*.
The resulting bindings look like the following:
|| facetName || facetValue || facetCount ||
| year | 2012 | 3 |
| year | 2013 | 2 |
| sugar | dry | 3 |
| sugar | medium | 2 |
You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines as each facet is computed independently.
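Faceting can presumably be combined with a query so that the counts cover only the matching documents. The sketch below assumes that :query and :facetFields may be set on the same query instance; the comment about the empty query in the example above suggests this but does not state it outright:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?facetValue ?facetCount WHERE {
    # restrict the documents first, then facet the matches by year
    ?r a inst:my_index ;
        :query "sugar:dry" ;
        :facetFields "year" ;
        :facets _:f .
    _:f :facetValue ?facetValue ;
        :facetCount ?facetCount .
}
{noformat}{div}
If this works as assumed, the year counts refer only to the dry wines, which is one way to relate two fields that the independent facet counts cannot.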
{tip:title=Faceting of textual fields}
Faceting by an analysed textual field works but might produce unexpected results. Analysed textual fields are composed of tokens and faceting uses each token to create a faceting bucket. For example, "North America" and "Europe" produce three buckets: "north", "america" and "europe", corresponding to each token in the two values. If you need to facet by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed":false. For more information, see [#Copy fields].
{tip}
h2. Advanced facet and aggregation queries
While basic faceting allows for simple counting of documents based on the discrete values of a particular field, there are more complex faceted or aggregation searches in Elasticsearch. The Elasticsearch GraphDB Connector provides a mapping from Elasticsearch results to RDF results but no mechanism for specifying the queries other than executing a [raw query|#Raw queries].
h3. Supported Elasticsearch facets and aggregations
The Elasticsearch GraphDB Connector supports mapping of the following facets and aggregations:
* Facets: terms, histogram, date histogram;
* Aggregations: terms, histogram, date histogram, range, min, max, sum, avg, stats, extended stats, value count.
For aggregations, the connector also supports sub-aggregations.
{info}
For more information on each supported facet or aggregation type, please refer to the Elasticsearch documentation.
{info}
h3. RDF mapping of the results
The results are accessed through the predicate *aggregations* (much like the basic facets are accessed through *facets*). The predicate binds multiple blank nodes that each contains a single aggregation bucket. The individual bucket items can be accessed through these predicates:
|| predicate || meaning || Elasticsearch counterpart ||
| :name | Bucket name | getName() |
| :key | Key or value associated with the bucket | getValue() or getKey() |
| :count | Count of documents in the bucket | getDocCount(), getValue() |
| :from | Start of range | getFrom(), getFromAsDate() |
| :to | End of range (RangeFacet) | getTo(), getToAsDate() |
| :min | Minimum value | getMin(), getValue() |
| :max | Maximum value | getMax(), getValue() |
| :sum | Sum value | getSum(), getValue() |
| :avg | Average value | getAvg(), getValue() |
| :sum_of_squares | Sum of squares value | getSumOfSquares() |
| :variance | Variance value | getVariance() |
| :std_deviation | Standard deviation value | getStdDeviation() |
| :parent | Sub-aggregations: points to the parent (upper level) blank node | |
| :level | Sub-aggregations: level number where 1 is the uppermost level and the following levels are 2, 3 and so on | |
| :levelName | Sub-aggregations: level name | getKey() or getValue() |
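As a rough sketch of how these predicates fit together, the query below requests a terms aggregation on the "year" field and reads each bucket back through :name, :key and :count. Several things here are assumptions rather than documented behaviour: that the raw query accepts an "aggs" section in the usual Elasticsearch request format, that a raw query without a query part matches all documents, and the aggregation name wines_per_year, which is purely hypothetical:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?name ?key ?count {
    ?r a inst:my_index ;
        # raw query carrying the aggregation definition (assumed format)
        :query '''
        {
            "aggs" : {
                "wines_per_year" : {
                    "terms" : { "field" : "year" }
                }
            }
        }
        ''' ;
        :aggregations _:a .
    # one blank node per bucket
    _:a :name ?name ;
        :key ?key ;
        :count ?count .
}
{noformat}{div}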
{anchor:sorting}
h2. Sorting
It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the *orderBy* predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity {
?search a inst:my_index ;
:query "year:2013" ;
:orderBy "-sugar" ;
:entities ?entity .
}
{noformat}{div}
The result contains wines produced in 2013 sorted according to their sugar content in descending order:
|| entity ||
| Rozova |
| Yoyowine |
By default, entities are sorted according to their matching score in descending order.
{note}
Note that if you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble the order. To remedy this, use ORDER BY from SPARQL.
{note}
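A minimal sketch of that remedy, assuming the sample data stores the sugar content under wine:hasSugar as in the connector configuration: the connector still sorts on its side, and the SPARQL ORDER BY re-establishes the order after the join.
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>
SELECT ?entity ?sugar {
    ?search a inst:my_index ;
        :query "year:2013" ;
        :orderBy "-sugar" ;
        :entities ?entity .
    # join to GraphDB data; this may lose the connector's ordering
    ?entity wine:hasSugar ?sugar .
}
# restore the order explicitly on the SPARQL side
ORDER BY DESC(?sugar)
{noformat}{div}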
{tip:title=Sorting by textual fields}
Sorting by an analysed textual field works but might produce unexpected results. Analysed textual fields are composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, "North America" will be sorted before "Europe" because the token "america" is lexicographically smaller than the token "europe". If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed":false. For more information, see [#Copy fields].
{tip}
h2. Limit and offset
Limit and offset are supported on the Elasticsearch side of the query. This is achieved through the predicates *limit* and *offset*. Consider this example in which an offset of 1 and a limit of 1 are specified:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity {
?search a inst:my_index ;
:query "sugar:dry" ;
:offset "1" ;
:limit "1" ;
:entities ?entity .
}
{noformat}{div}
The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino will be second in the list:
|| entity ||
| Yoyowine |
| *Franvino* |
| Blanquito |
{note}
Note that the specific order in which GraphDB returns the results depends on how Elasticsearch returns the matches, unless sorting is specified.
{note}
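To make paging repeatable, you can add an explicit sort to the paged query. The sketch below assumes that :orderBy can be combined with :offset and :limit on the same query instance, which the descriptions of these predicates suggest but do not state explicitly:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity {
    ?search a inst:my_index ;
        :query "sugar:dry" ;
        # fix the order first so that each page is well defined
        :orderBy "year" ;
        :offset "1" ;
        :limit "1" ;
        :entities ?entity .
}
{noformat}{div}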
h2. Snippet extraction
Snippet extraction is used to extract highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate *snippets*. It binds a blank node that in turn provides the actual snippets via the predicates *snippetField* and *snippetText*. The predicate snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity ?snippetField ?snippetText {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
?entity :snippets _:s .
_:s :snippetField ?snippetField ;
:snippetText ?snippetText .
}
{noformat}{div}
The query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:
|| ?entity || ?snippetField || ?snippetText ||
| :Yoyowine | grape | <em>Cabernet</em> Sauvignon |
| :Franvino | grape | <em>Cabernet</em> Franc |
{note}
Note that the actual snippets might be different as this depends on the specific Elasticsearch implementation.
{note}
It is possible to tweak how the snippets are collected/composed by using the following option predicates:
* *:snippetSize* sets the maximum size of the extracted text fragment, 250 by default;
* *:snippetSpanOpen* text to insert before the highlighted text, <em> by default;
* *:snippetSpanClose* text to insert after the highlighted text, </em> by default.
The option predicates are set on the query instance, much like the :query predicate.
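As an illustration, the following sketch shortens the snippets and switches the highlighting tags. It assumes that the option values are passed as plain string literals, by analogy with how :limit and :offset are written elsewhere in this document; the values themselves are arbitrary:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?entity ?snippetText {
    ?search a inst:my_index ;
        :query "grape:cabernet" ;
        # snippet options go on the query instance, next to :query
        :snippetSize "100" ;
        :snippetSpanOpen "<b>" ;
        :snippetSpanClose "</b>" ;
        :entities ?entity .
    ?entity :snippets _:s .
    _:s :snippetText ?snippetText .
}
{noformat}{div}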
h2. Total hits
You can get the total number of hits by using the *totalHits* predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
SELECT ?totalHits {
?r a inst:my_index ;
:query "year:2012" ;
:totalHits ?totalHits .
}
{noformat}{div}
As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.
h1. List of creation parameters
The creation parameters define how a connector instance is created by the :createConnector predicate. Some are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.
h3. elasticsearchNode (string), required, Elasticsearch instance to sync to
As Elasticsearch is a third-party service, you have to specify the node where it is running. The node value has the form *hostname.domain:port*. There is no default value.
{note}
Note that Elasticsearch exposes two protocols, the native _transport_ protocol over port _9300_ and the _RESTful_ API over port _9200_. The Elasticsearch GraphDB Connector uses the transport protocol over port 9300.
{note}
h3. elasticsearchCluster (string), optional, Elasticsearch cluster name
This option sets the cluster name that the connector instance will connect to.
Every Elasticsearch node uses the cluster name to identify, discover and join other nodes of the same cluster. By default this is "elasticsearch" but it is advisable to change it. Please see [Configuration: Cluster name|https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name] in the Elasticsearch documentation.
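A minimal sketch of a create command that sets the cluster name, assuming a hypothetical Elasticsearch cluster called "my_cluster" reachable on localhost:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
    inst:my_index :createConnector '''
    {
        "elasticsearchNode": "localhost:9300",
        "elasticsearchCluster": "my_cluster",
        "types": [
            "http://www.ontotext.com/example/wine#Wine"
        ],
        "fields": [
            {
                "fieldName": "year",
                "propertyChain": [
                    "http://www.ontotext.com/example/wine#hasYear"
                ]
            }
        ]
    }
    ''' .
}
{noformat}{div}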
h3. indexCreateSettings (string), optional, settings for creating the Elasticsearch index
This option is passed directly to Elasticsearch when creating the index. It can be in JSON, YAML or properties format.
h3. types (list of URI), required, specifies the types of entities to sync
The RDF types of entities to sync are specified as a list of URIs. At least one type URI is required.
h3. languages (list of string), optional, valid languages for literals
RDF data is often multilingual but you can map only some of the languages represented in the literal values. This can be done by specifying a list of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges maps all existing literals that have matching language tags.
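For instance, the following sketch would keep only English grape labels together with labels that carry no language tag. It assumes that the sample grape labels are language-tagged, which the sample data may or may not do; the "en" range also matches more specific tags such as "en-GB" under basic filtering:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
    inst:my_index :createConnector '''
    {
        "elasticsearchNode": "localhost:9300",
        "types": [
            "http://www.ontotext.com/example/wine#Wine"
        ],
        "languages": ["en", ""],
        "fields": [
            {
                "fieldName": "grape",
                "propertyChain": [
                    "http://www.ontotext.com/example/wine#madeFromGrape",
                    "http://www.w3.org/2000/01/rdf-schema#label"
                ]
            }
        ]
    }
    ''' .
}
{noformat}{div}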
h3. fields (list of field object), required, defines the mapping from RDF to Elasticsearch
The fields define exactly what parts of each entity will be synchronised as well as the specific details on the connector side. The field is the smallest synchronisation unit and it maps a property chain from GraphDB to a field in Elasticsearch. The fields are specified as a list of field objects. At least one field object is required. Each field object has further keys that specify details.
h4. fieldName (string), required, name of the field in Elasticsearch
The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name but, to avoid unnecessary escaping (which depends on how Elasticsearch parses its queries), we recommend keeping the field names simple.
h4. propertyChain (list of URI), required, defines the property chain to reach the value
The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs where at least one URI must be provided.
The URI of the document will be synchronised to the special field "id" in Elasticsearch. You may use it to query Elasticsearch directly and retrieve the matching entity URI.