- Sample data
- Creating a connector
- Dropping a connector
- Listing available connectors
- Status check
- Adding, updating and deleting data
- Querying data
- Limit and offset
- Snippet extraction
- Total hits
- Creation parameters
- Overview of connector predicates
The GraphDB Connectors provide extremely fast normal and facet (aggregation) searches that are typically implemented by an external component or service such as Elasticsearch, but have the additional benefit of staying automatically up-to-date with the GraphDB repository data.
The Connectors provide synchronisation at the entity level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.
The main features of the GraphDB Connectors are:
- maintain an index that is always in sync with the data stored in GraphDB
- multiple independent instances per repository
- the entities to synchronise are defined by:
  - a list of fields (on the Elasticsearch side) and property chains (on the GraphDB side) whose values to sync
  - a list of rdf:types of entities to sync
  - a list of languages to sync (default is all languages)
  - additional filtering by property and value
- full-text search using native Elasticsearch queries
- snippet extraction: highlighting of search terms in the search result
- faceted search
- sorting by any preconfigured field
- paging of results using offset and limit
Each feature will be described in detail later on.
All examples below use the following sample data. It describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova, as well as the grape varieties needed to make those wines. The minimum ruleset level required in GraphDB is RDFS.
All interaction with the Elasticsearch GraphDB Connector is done through SPARQL queries.
There are three types of SPARQL queries:
- INSERT for creating and deleting connectors.
- SELECT for listing connectors and querying connector configuration parameters.
- INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.
In general, INSERT adds or modifies data and SELECT queries existing data.
Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Elasticsearch GraphDB Connector this is http://www.ontotext.com/connectors/elasticsearch#. Each command or predicate that will be executed by the connector uses this prefix, e.g. <http://www.ontotext.com/connectors/elasticsearch#createConnector> for creating a connector for Elasticsearch.
Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix in order not to clash with any of the command predicates. For Elasticsearch, the instance prefix is http://www.ontotext.com/connectors/elasticsearch/instance#.
Creating a connector is done by sending a SPARQL query with the following configuration data:
- Name of the connector (e.g. my_index)
- Classes to synchronise
- Properties to synchronise
The configuration data must be provided as a JSON string representation and passed together with the create command.
What we recommend:
Use the GraphDB Connectors management interface provided by the GraphDB Workbench. It will let you create the configuration easily and then create the connector directly or copy the configuration and execute it elsewhere.
The create command is triggered by a SPARQL INSERT with the createConnector predicate. For example, this will create a connector called my_index that will synchronise the wines from the sample data above:
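A minimal sketch of such a create command is given here. The wine namespace, the madeFromGrape property, the field names and the "elasticsearchUrl" key are taken from the surrounding description; the property names hasSugar and hasYear, the host:port form of the URL value and the choice of "sugar" as the sortable field are assumptions made for illustration.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# The configuration is passed as a JSON string, the object of :createConnector.
# hasSugar and hasYear are hypothetical property names.
INSERT DATA {
  inst:my_index :createConnector '''
{
  "elasticsearchUrl": "localhost:9500",
  "types": ["http://www.ontotext.com/example/wine#Wine"],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"],
      "sort": true
    },
    {
      "fieldName": "year",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"]
    }
  ]
}
''' .
}
```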
Note that one of the fields has "sort": true. This will be explained under sorting below.
The above command will create a new Elasticsearch connector that will connect to the Elasticsearch instance accessible at port 9500 on the localhost as specified by the "elasticsearchUrl" key.
The "types" key defines the RDF type of the entities to synchronise and in the example it is only entities of type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Elasticsearch. The basic building block is the property chain, i.e. a sequence of RDF properties where the object of each property is the subject of the following property. In the example we map three bits of information, the wine's grape, sugar content and year. Each chain is assigned a short and convenient field name: "grape", "sugar" and "year". The field names will be later used in the queries.
Grape is an example of a property chain composed of more than one property. First we take the wine's madeFromGrape property whose object is an instance of type Grape and then we take the rdfs:label of that instance. Sugar and year are both composed of a single property that links the value directly to the wine.
Dropping a connector removes all references to its external store from GraphDB as well as the Elasticsearch index associated with it. Dropping a connector is achieved through a SPARQL INSERT query with the following parameter:
- Name of the connector
The drop command is triggered by a SPARQL INSERT with the dropConnector predicate, e.g. this will remove the connector :my_index:
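As a sketch, the drop command might look like this. Since the command takes no configuration beyond the connector name, an empty string is used as the object; that choice of object is an assumption.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# The connector to drop is the subject; the empty object is an assumed placeholder.
INSERT DATA {
  inst:my_index :dropConnector "" .
}
```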
Listing connectors should return all previously created connectors. It is a SELECT query with the listConnectors predicate:
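A minimal sketch of the listing query, using the variable names ?cntUri and ?cntStr referred to in the description:

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>

# One result row per existing connector instance.
SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}
```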
?cntUri will be bound to the prefixed URI of the connector that was used during creation, e.g. <http://www.ontotext.com/connectors/elasticsearch/instance#my_index>, while ?cntStr will be bound to a string representing the part after the prefix, e.g. "my_index".
The internal state of each connector can be queried using a SELECT query and the connectorStatus predicate:
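A minimal sketch of the status query, using the variable names ?cntUri and ?cntStatus referred to in the description:

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>

# One result row per connector, with its status as a string.
SELECT ?cntUri ?cntStatus {
  ?cntUri :connectorStatus ?cntStatus .
}
```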
?cntUri is bound to the connector prefixed URI, while ?cntStatus is a string representation of the status for the connector represented by that URI. The status is key-value based.
From the user's point of view all synchronisation should happen transparently without using any additional predicates or naming a specific store explicitly, i.e. the user should simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
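To illustrate, an ordinary update such as the following sketch would be picked up by a connector that synchronises wines, without mentioning the connector at all. The wine Tintoretto, the grape Merlot and the property names hasSugar and hasYear are hypothetical.

```sparql
PREFIX wine: <http://www.ontotext.com/example/wine#>

# A plain SPARQL update: no connector predicates are involved.
# All resource and property names below except wine:Wine and
# wine:madeFromGrape are hypothetical.
INSERT DATA {
  wine:Tintoretto a wine:Wine ;
      wine:madeFromGrape wine:Merlot ;
      wine:hasSugar "dry" ;
      wine:hasYear 2014 .
}
```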
Once a connector has been created it should be possible to query data from it through SPARQL. For each matching abstract document, the connector returns the document's subject. In its simplest form querying is achieved by using a SELECT and providing the Elasticsearch query as the object of the :query predicate:
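A minimal sketch of such a query, assuming the my_index connector with its "grape" field and an Elasticsearch query string that searches that field for "cabernet":

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
  ?search a inst:my_index ;          # get an instance of the connector
          :query "grape:cabernet" ;  # assign the Elasticsearch query
          :entities ?entity .        # request the matching entities
}
```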
The result will bind ?entity to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.
Note that you must use the field names you chose when you created the connector. It is perfectly valid to have field names identical to the property URIs but then you are responsible for escaping any special characters according to what Elasticsearch expects.
First we get an instance of the requested connector by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector. X will be bound to an instance of that connector. Then we assign a query to that instance by using the system predicate :query. Finally we request the matching entities through the :entities predicate.
It is also possible to provide per query search options by using one or more option predicates. The option predicates will be described in detail further below.
The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB. For example to see the actual grapes in the matching wines as well as the year they were made:
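Such a combined query could be sketched as follows. The madeFromGrape property comes from the connector description; hasYear is a hypothetical property name for the year.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape ?year {
  # Connector part: find the matching wines.
  ?search a inst:my_index ;
          :query "grape:cabernet" ;
          :entities ?entity .
  # GraphDB part: fetch additional data for each match.
  ?entity wine:madeFromGrape ?grape .
  ?entity wine:hasYear ?year .       # hypothetical property name
}
```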
The result will look like:
Note that :Franvino is returned twice because it is made from two different grapes, which are both returned.
It is possible to access the match score returned by Elasticsearch with the :score predicate. Since each entity has its own score, the predicate must come at the entity level, for example:
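A minimal sketch, assuming the my_index connector and the same "cabernet" search as before:

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity ?score {
  ?search a inst:my_index ;
          :query "grape:cabernet" ;
          :entities ?entity .
  # The score is attached to each entity, not to the search instance.
  ?entity :score ?score .
}
```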
The result will look like this but the actual score might be different as it depends on the specific Elasticsearch version:
Consider the sample wine data and the my_index connector described previously. We can use the same connector to query facets too:
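A sketch of a facet query over the "year" and "sugar" fields of my_index, using the facetFields, facets, facetName, facetValue and facetCount predicates. The absence of a :query predicate here is an assumption; a query could be added to restrict the documents that are counted.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?facetName ?facetValue ?facetCount {
  ?search a inst:my_index ;
          :facetFields "year,sugar" ;   # comma-delimited list of field names
          :facets _:f .                 # one blank node per facet entry
  _:f :facetName ?facetName ;
      :facetValue ?facetValue ;
      :facetCount ?facetCount .
}
```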
It is important to specify the fields we want to facet by using the facetFields predicate. Its value must be a simple comma-delimited list of field names. In order to get the faceted results we have to use the facets predicate. Since each facet has three components (name, value and count), the facets predicate binds a blank node that in turn can be used to access the individual values for each component through the predicates facetName, facetValue and facetCount.
The result bindings will look like those in the table below:
We can easily see that there are three wines that were produced in 2012 and two in 2013. We also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines, as each facet is computed independently.
It is possible to sort the entities returned by a connector query according to one or more fields. In order to be able to use a certain field for sorting you have to specify that during the creation of the connector instance. Sorting is achieved through the orderBy predicate, whose value must be a comma-delimited list of fields to sort by. Each field may be prefixed with a minus to indicate sorting in descending order. For example:
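A minimal sketch, assuming that the "sugar" field of my_index was created with "sort": true and that "year:2013" is a valid Elasticsearch query string for the "year" field:

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
          :query "year:2013" ;
          :orderBy "-sugar" ;    # leading minus: descending order
          :entities ?entity .
}
```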
The result will contain wines produced in 2013 sorted according to their sugar content in descending order:
By default entities are sorted according to their matching score in descending order.
Note that GraphDB might scramble the order if you join the entity from the connector query to other triples stored in GraphDB. In order to remedy this use ORDER BY from SPARQL.
Limit and offset are supported on the Elasticsearch side of the query. This is achieved through the predicates limit and offset. Consider this example, in which we specify an offset of 1 and a limit of 1:
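A sketch of such a query is given here. The query string "sugar:dry" and the use of string literals for the limit and offset values are assumptions, so the entity it returns may differ from the one named in the text.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
          :query "sugar:dry" ;   # assumed query string
          :offset "1" ;          # skip the first match
          :limit "1" ;           # return at most one match
          :entities ?entity .
}
```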
The result will contain a single wine, Franvino, as it would be second in the list if we executed the query without the limit and offset:
Note that the specific order in which GraphDB returns the results depends on how Elasticsearch returns the matches, unless you specified sorting.
Snippet extraction is used to extract highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate :snippets, which binds a blank node that in turn provides the actual snippets via the predicates :snippetField and :snippetText. The predicate :snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:
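A minimal sketch, assuming the my_index connector and a search on its "grape" field:

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
          :query "grape:cabernet" ;
          :entities ?entity .
  # Snippets hang off the entity through a blank helper node.
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
```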
The query will return the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:
Note that the actual snippets might be somewhat different as this depends on the specific Elasticsearch implementation.
It is possible to tweak how the snippets are collected/composed by using the following option predicates:
- :snippetSize sets the maximum size of the extracted text fragment, 250 by default.
- :snippetSpanOpen sets the text to insert before the highlighted text, <em> by default.
- :snippetSpanClose sets the text to insert after the highlighted text, </em> by default.
The option predicates are set on the connector instance, much like the :query predicate.
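As a sketch, the options could be combined with a snippet query as follows. The specific values and the use of string literals for all three options are assumptions.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
          :query "grape:cabernet" ;
          :snippetSize "100" ;         # assumed literal form
          :snippetSpanOpen "<b>" ;     # replaces the default <em>
          :snippetSpanClose "</b>" ;   # replaces the default </em>
          :entities ?entity .
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
```

The options appear before :entities so that they are known to the connector before any results are fetched.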
You can get the total number of hits by using the :totalHits predicate, e.g. for the connector :my_index and a query that would retrieve all wines made in 2012:
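A minimal sketch, assuming "year:2012" is a valid Elasticsearch query string for the "year" field:

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?totalHits {
  ?search a inst:my_index ;
          :query "year:2012" ;
          :totalHits ?totalHits .   # attached to the search instance
}
```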
Since there are three wines made in 2012, the value 3 (of type xsd:long) will be bound to ?totalHits.
The creation parameters define how a connector instance is created by the :createConnector predicate. There are some required parameters and some that are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.
The compulsory parameters must be present in every connector. They are responsible for the core behaviour.
Since Elasticsearch is a third-party service, you have to specify the node where it is running. The node value has the form hostname.domain:port. There is no default value.
The RDF types of entities to sync are specified as a list of URIs. At least one type URI must be provided.
The fields define exactly what parts of each entity will be synchronised as well as the specific details on the connector side. The field is the smallest synchronisation unit and it maps a property chain from GraphDB to a field in Elasticsearch. The fields are specified as a list of field objects. At least one field object must be provided. Each field object has further keys that specify details.
The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name, but to avoid unnecessary escaping (which depends on how Elasticsearch parses its queries) we recommend keeping the field names simple.
The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs and at least one URI must be provided. If you need to store the entity URI in the connector you may map it by defining a property chain with a single special URI: $self. Only one field per connector may use the $self notation.
The default value (defaultValue) provides means for specifying a default value for the field when the property chain has no matching values in GraphDB. It has no default value. Currently only literals are supported.
Fields are indexed by default, but that can be changed with the Boolean option "index" (true by default). Fields that are not indexed will be unavailable for queries but may still be used for faceting or sorting if these are enabled.
Fields are synchronised for faceting by default, but that can be changed with the Boolean option "facet" (true by default). Fields that are not synchronised for faceting will not be available for faceted search.
Fields are not synchronised for sorting by default, but that can be changed with the Boolean option "sort" (false by default). Fields that are not synchronised for sorting will not be available for ordering the results.
When literal fields are indexed in Elasticsearch they are analysed according to the analyser settings. If you require that a given field is not analysed, you may use the Boolean option syncAsIs (false by default).
RDF data is often multilingual but you may want to map only some of the languages represented in the literal values. This can be done by specifying a list of language codes.
The entityFilter parameter is used to fine-tune the set of entities and/or individual values for configured fields, based on the field value. Entities and field values will be synchronised to Elasticsearch if and only if they pass the filter. The entity filter is a bit like a FILTER() inside a SPARQL query but not quite the same. Each configured field can be referred to in the entity filter by prefixing it with a "?" much like referring to a variable in SPARQL. Several operators are supported:
|Operator||Meaning||Example|
|?var in (value1, value2, ...)||Tests if the field var's value is one of the specified values. Values that do not match will be treated as if they were not present in the repository.||?status in ("active", "new")|
|?var not in (value1, value2, ...)||The negated version of the in-operator.||?status not in ("archived")|
|bound(?var)||Tests if the field var has a valid value. This can be used to make the field compulsory.||bound(?name)|
|expr1 || expr2||Logical disjunction of expressions expr1 and expr2.||bound(?name) || bound(?company)|
|expr1 && expr2||Logical conjunction of expressions expr1 and expr2.||bound(?status) && ?status in ("active", "new")|
|!expr||Logical negation of expression expr.||!bound(?company)|
|( expr )||Grouping of expressions||(bound(?name) || bound(?company)) && bound(?address)|
The atomic operators in, not in and bound accept either an operand that is a field name variable as in the examples above, or a special construction composed of a field name variable followed by a URI. The URI will be used as a property of the particular field value bound to the field name variable, fetched from GraphDB and then its value will be used to evaluate the entity filter expression. It can be illustrated with this SPARQL snippet:
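A sketch of that idea, using the variable names ?fieldName and ?evaluatedValue from the text. Both URIs are placeholders: the first stands for the field's property chain and the second for the URI written after the field name variable in the filter.

```sparql
SELECT ?entity ?evaluatedValue {
  # The field's property chain yields the ordinary field value.
  ?entity <http://example.com/propertyChain> ?fieldName .
  # The URI given in the filter is followed from that value.
  ?fieldName <http://example.com/filterProperty> ?evaluatedValue .
}
```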
Instead of using the values of ?fieldName, the values of ?evaluatedValue will be used.
In addition to full URIs within < > the filters support the shorthand form type, which stands for <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
Entity filters can be combined with default values in order to get more flexible behaviour.
A typical use-case for an entity filter is having soft deletes, i.e. instead of deleting an entity it is marked as deleted by the presence of a specific value for a given property.
For example, if we create a connector like this:
and then insert some entities:
We could create the index like this to specify a default value for city:
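A sketch of what such a connector definition could look like, combining the defaultValue and entityFilter parameters. The example namespace, the Person class, the "name" field and the exact filter expression are all assumptions; only the "city" field and the "London" value are taken from the text.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Quotes inside the filter are written as \\" so that the JSON parser
# receives \" after SPARQL has processed its own escapes.
# All http://www.example.com/ URIs are hypothetical.
INSERT DATA {
  inst:my_index :createConnector '''
{
  "elasticsearchUrl": "localhost:9500",
  "types": ["http://www.example.com/schema#Person"],
  "fields": [
    {
      "fieldName": "name",
      "propertyChain": ["http://www.example.com/schema#name"]
    },
    {
      "fieldName": "city",
      "propertyChain": ["http://www.example.com/schema#city"],
      "defaultValue": "London"
    }
  ],
  "entityFilter": "?city in (\\"London\\")"
}
''' .
}
```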
The default value will be used for entity:b as it has no value for city in the repository. Since the value is "London" the entity will be synchronised.
Sometimes data represented in RDF is not ideally suited to map directly to non-RDF. For example, if we have news articles and they can be tagged with different concepts (locations, persons, events, etc) one possible way to model that is a single property :taggedWith. Consider the following RDF data:
Now if we want to map this data to Elasticsearch such that the property :taggedWith x is mapped to separate fields taggedWithPerson and taggedWithLocation according to the type of x (we are not interested in events), we can map :taggedWith twice to different fields and then use an entity filter to get the desired values:
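A sketch of such a connector definition is given here. It maps the same property to two fields and filters each field's values by their type, using the "field variable followed by a URI" construction and the type shorthand. The example namespace, the class names and the way the two conditions are joined with && are assumptions.

```sparql
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# All http://www.example.com/ URIs are hypothetical.
INSERT DATA {
  inst:my_index :createConnector '''
{
  "elasticsearchUrl": "localhost:9500",
  "types": ["http://www.example.com/news#Article"],
  "fields": [
    {
      "fieldName": "taggedWithPerson",
      "propertyChain": ["http://www.example.com/news#taggedWith"]
    },
    {
      "fieldName": "taggedWithLocation",
      "propertyChain": ["http://www.example.com/news#taggedWith"]
    }
  ],
  "entityFilter": "?taggedWithPerson type in (<http://www.example.com/news#Person>) && ?taggedWithLocation type in (<http://www.example.com/news#Location>)"
}
''' .
}
```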
Note: type is the short way to write <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
The six articles in the RDF data above will be mapped as such:
|Article URI||Entity mapped?||Value in taggedWithPerson||Value in taggedWithLocation||Explanation|
|:Article1||yes||:Einstein||:Berlin||:taggedWith has the values :Einstein, :Berlin and :Cannes-FF. The filter leaves only the correct values in the respective fields. The value :Cannes-FF is ignored as it does not match the filter.|
|:Article2||yes||-||:Berlin||:taggedWith has the value :Berlin. After the filter is applied only taggedWithLocation is populated.|
|:Article3||yes||:Mozart||-||:taggedWith has the value :Mozart. After the filter is applied only taggedWithPerson is populated.|
|:Article4||yes||:Mozart||:Berlin||:taggedWith has the values :Berlin and :Mozart. The filter leaves only the correct values in the respective fields.|
|:Article5||yes||-||-||:taggedWith has no values. The filter is not relevant.|
|:Article6||yes||-||-||:taggedWith has the value :Cannes-FF. The filter removes it as it does not match.|
This can be checked by issuing a facet search for taggedWithLocation and taggedWithPerson:
If the filter was applied you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:
The following diagram presents a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference for what a particular predicate needs to be attached to. For example, to retrieve entities you need to use :entities on a search instance, and to retrieve snippets you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red and URIs in orange. The predicates are represented by labelled arrows.
Even though SPARQL per se is not sensitive to the order of triple patterns, the connectors expect to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results. Refer to the diagram in Overview of connector predicates for a quick overview of the predicates.
The Elasticsearch connector in this release is beta and does not fully support synchronisation in a GraphDB cluster. In essence, each worker in the cluster will try to synchronise the changes to the same Elasticsearch URI. There are two possible scenarios:
In the first scenario, each worker sees the same instance at the provided Elasticsearch URI, e.g. because the URI is unique within the network that connects the workers. Typically these are URIs that contain normal IP addresses or hostnames that resolve to one and the same IP address on all the workers. Whenever a worker receives updates, it connects to Elasticsearch and synchronises the data there. The process is identical for all workers, so each update on the Elasticsearch side is executed as many times as there are workers in the cluster. This should only impact update performance and should not lead to any inconsistency errors.
In the second scenario, the user provides a URI that is not unique within the cluster, e.g. a URI based on localhost or a hostname that resolves to a different IP on each worker. In other words, each worker sees a different instance of Elasticsearch, so when a worker sends an update to its Elasticsearch instance there are no redundant operations.