Solr GraphDB Connector

h1. Overview

The GraphDB Connectors provide extremely fast keyword and faceted (aggregation) searches that are typically implemented by an external component or service such as Solr, but have the additional benefit of staying automatically up-to-date with the GraphDB repository data.

The Connectors provide synchronisation at the _entity_ level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support _property chains_. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.

h1. Features
* multiple independent instances per repository
* the entities for synchronisation are defined by:
** a list of fields (on the Solr side) and property chains (on the GraphDB side) whose values are synchronised
** a list of rdf:type's of the entities for synchronisation
** a list of languages for synchronisation (the default is all languages)
** additional filtering by property and value
* full-text search using native Solr queries
* paging of results using _offset_ and _limit_

Each feature is described in detail below.

h1. Sample data

All examples use the following sample data describing five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova, as well as the grape varieties needed to make these wines. The minimum ruleset level needed in GraphDB is RDFS.

{code}
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:hasSugar "medium" ;
:hasYear "2013"^^xsd:integer .
{code}

h1. Usage

All interactions with the Solr GraphDB Connector are done through SPARQL queries.

There are three types of SPARQL queries:
In general this corresponds to _INSERT adds or modifies data_ and _SELECT queries existing data_.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Solr GraphDB Connector this is *http://www.ontotext.com/connectors/solr#*. Each command or predicate that is executed by the connector uses this prefix, e.g. <http://www.ontotext.com/connectors/solr#createConnector> for creating a connector for Solr.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Solr, the instance prefix is http://www.ontotext.com/connectors/solr/instance#.
h2. Creating a connector

Creating a connector is done by sending a SPARQL query with the following configuration data:

* the name of the connector (e.g. my_index),
* classes to synchronise,
* properties to synchronise.

The configuration data must be provided as a JSON string representation and passed together with the create command.

{tip:title=What we recommend}
Use the GraphDB Connectors management interface provided by the GraphDB Workbench. It lets you create the configuration easily and then create the connector directly or copy the configuration and execute it elsewhere.
{tip}

The create command is triggered by a SPARQL *INSERT* with the *createConnector* predicate, e.g. this creates a connector called *my_index* that synchronises the wines from the sample data above:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
''' .
}
{code}

Note that one of the fields has _"sort": true_. This is explained further under [Sorting|#sorting].

The above command creates a new Solr connector that connects to the Solr instance accessible at port 8983 on the localhost as specified by the "solrUrl" key.

The "types" key defines the RDF type of the entities to synchronise and, in the example, it is only entities of the type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Solr. The basic building block is the property chain, i.e. a sequence of RDF properties where the object of each property is the subject of the following property. In the example, we map three bits of information - the wine's grape, the sugar content, and the year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.
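The configuration JSON can also be assembled programmatically instead of being written by hand. The following is a minimal Python sketch under stated assumptions: the helper name build_create_connector is hypothetical and not part of the connector, and the function only produces the text of the SPARQL update. Sending it to a repository is left out.

```python
import json

CONNECTOR_NS = "http://www.ontotext.com/connectors/solr#"
INSTANCE_NS = "http://www.ontotext.com/connectors/solr/instance#"


def build_create_connector(name, config):
    """Return the text of a SPARQL update that creates connector `name`.

    `config` is a plain dict; it is serialised to the JSON string that is
    passed as the object of the :createConnector predicate.
    """
    body = json.dumps(config, indent=2)
    # The JSON is wrapped in a SPARQL long string literal ('''...''').
    # This sketch performs no escaping, so it rejects input that would need it.
    if "'''" in body or "\\" in body:
        raise ValueError("configuration needs escaping that this sketch does not do")
    return (
        f"PREFIX : <{CONNECTOR_NS}>\n"
        f"PREFIX inst: <{INSTANCE_NS}>\n"
        "\n"
        "INSERT DATA {\n"
        f"  inst:{name} :createConnector '''\n{body}\n''' .\n"
        "}\n"
    )


# The wine configuration described in the text: one two-step chain (grape)
# and two single-property chains (sugar, year).
wine_config = {
    "solrUrl": "http://localhost:8983/solr",
    "types": ["http://www.ontotext.com/example/wine#Wine"],
    "fields": [
        {
            "fieldName": "grape",
            "propertyChain": [
                "http://www.ontotext.com/example/wine#madeFromGrape",
                "http://www.w3.org/2000/01/rdf-schema#label",
            ],
        },
        {
            "fieldName": "sugar",
            "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"],
            "sort": True,
        },
        {
            "fieldName": "year",
            "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"],
        },
    ],
}

update = build_create_connector("my_index", wine_config)
print(update)
```

Keeping the configuration as a dict until the last moment means the JSON is always syntactically valid, which is the most common source of errors when the string is edited by hand.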


h4. Non-managed schemas
Currently there is no powerful interface for changing analyzers or stopwords on a per-field basis. For now, the recommended way to do that is to manage the schema yourself and tell the connector to only sync the object values into the appropriate fields. Here is an example:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
inst:my_index :createConnector '''
{
"solrUrl": "http://localhost:8983/solr",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"sort": true
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
]
}
],
"manageExternalIndex": "false"
}
''' .
}
{code}

This creates the same connector as above, but expects fields with the specified field names to already be present in the core.

{warning}
Currently the connector's chain tracking information is stored in Solr. This requires additional fields in the core's schema, and you are responsible for adding them. The recommended way to handle this is to:
* create the connector on a testing environment
* copy the generated schema
* change the schema as needed and deploy it in production
{warning}



h2. Dropping a connector

The drop command is triggered by a SPARQL *INSERT* with the *dropConnector* predicate, e.g. this will remove the connector *:my_index*:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
inst:my_index :dropConnector "" .
}
{code}

h2. Listing available connectors
Listing connectors returns all previously created connectors. It is a *SELECT* query with the *listConnectors* predicate:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>

SELECT ?cntUri ?cntStr {
?cntUri :listConnectors ?cntStr .
}
{code}

*?cntUri* is bound to the prefixed URI of the connector that was used during creation, e.g. <http://www.ontotext.com/connectors/solr/instance#my_index>, while *?cntStr* is bound to a string representing the part after the prefix, e.g. "my_index".
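The relationship between the two bindings can be reproduced on the client side, for example when only the URI is available. This is a small Python sketch; the function name connector_name is hypothetical and the code relies only on the instance prefix given above.

```python
INSTANCE_NS = "http://www.ontotext.com/connectors/solr/instance#"


def connector_name(connector_uri):
    """Return the part of a connector instance URI that follows the prefix."""
    uri = connector_uri.strip("<>")  # accept both <uri> and bare uri forms
    if not uri.startswith(INSTANCE_NS):
        raise ValueError(f"not a Solr connector instance URI: {connector_uri}")
    return uri[len(INSTANCE_NS):]


print(connector_name("<http://www.ontotext.com/connectors/solr/instance#my_index>"))
```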

h2. Status check
The internal state of each connector can be queried using a *SELECT* query and the *connectorStatus* predicate:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>

SELECT ?cntUri ?cntStatus {
?cntUri :connectorStatus ?cntStatus .
}
{code}

*?cntUri* is bound to the connector prefixed URI, while *?cntStatus* is a string representation of the status of the connector represented by that URI. The status is key-value based.

h2. Adding, updating and deleting data

From the user's point of view all synchronisation happens transparently without using any additional predicates or naming a specific store explicitly, i.e. the user should simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.

h2. Querying data

Once a connector has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector returns the document's subject. In its simplest form, querying is achieved by using a *SELECT* and providing the Solr query as the object of the *:query* predicate:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
:entities ?entity .
}
{code}

The result will bind ?entity to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.

Note that you must use the field names you chose when you created the connector. It is perfectly valid to have field names identical to the property URIs, but then you have to make sure you escape any special characters according to what Solr expects.

First, we get an instance of the requested connector by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector. X is bound to an instance of this connector. Then, we assign a query to that instance by using the system predicate *:query*. Finally, we request the matching entities through the *:entities* predicate.

It is also possible to provide per-query search options by using one or more option predicates. The option predicates are described in further detail below.

h4. Raw queries

If you want to use a Solr query parameter that is not exposed through a dedicated predicate, you can use the raw query mechanism. Instead of providing a full-text query as the value of :query, you specify the parameters for the Solr endpoint directly. For example, suppose you want to sort the returned facets in a different order, as described in [facet.sort|https://wiki.apache.org/solr/SimpleFacetParameters#facet.sort]. The following query does that:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
?search a inst:my_index ;
:query '''
{
"facet":"true",
"indent":"true",
"facet.sort":"index",
"q":"*:*",
"wt":"json"
}
''' ;
:entities ?entity .
}
{code}

You can obtain these parameters by running your query from the Solr admin interface, or from the response payload (where they are included). The query parameters of the Solr select endpoint are also supported, if you prefer that form. Here is an example:

{code}

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
?search a inst:my_index ;
:query '''q=*%3A*&wt=json&indent=true&facet=true&facet.sort=index''' ;
:entities ?entity .
}
{code}

Note that you have to specify *_q=_* as the first parameter because it is used to detect the raw query.
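When the select-endpoint form is generated by code, the ordering requirement is easy to enforce. The following Python sketch uses only the standard library; the helper name build_raw_query is hypothetical. Note that urlencode percent-encodes the asterisk as well (%2A), which is an equivalent encoding of the same query.

```python
from urllib.parse import urlencode


def build_raw_query(q, **params):
    """Return a URL-encoded Solr parameter string with q as the first parameter."""
    ordered = {"q": q}      # dicts preserve insertion order, so q stays first
    ordered.update(params)
    return urlencode(ordered)


# facet.sort is not a valid Python keyword argument name, so it is passed
# through dictionary unpacking.
raw = build_raw_query(
    "*:*", wt="json", indent="true", facet="true", **{"facet.sort": "index"}
)
print(raw)
```

The resulting string can then be used as the object of the *:query* predicate.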



h3. Combining Solr results with GraphDB data

The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB. For example, to see the actual grapes in the matching wines as well as the year they were made:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
?entity wine:hasYear ?year
}
{code}

The result looks like this:

|| ?entity || ?grape || ?sugar ||
h3. Entity match score

It is possible to access the match score returned by Solr with the *:score* predicate. As each entity has its own score, the predicate must come at the entity level. For example:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
?entity :score ?score
}
{code}

The result looks like this, but the actual score might be different as it depends on the specific Solr version:

|| ?entity || ?score ||
Consider the sample wine data and the my_index connector described previously. We can use the same connector to query facets too:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
_:f :facetCount ?facetCount .
}
{code}

It is important to specify the fields we want to facet by using the *facetFields* predicate. Its value must be a simple comma-delimited list of field names. In order to get the faceted results, we have to use the *facets* predicate. As each facet has three components (name, value and count), the *facets* predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates *facetName*, *facetValue*, and *facetCount*.

The resulting bindings look like the table below:

|| facetName || facetValue || facetCount ||
We can easily see that there are three wines produced in 2012 and two in 2013. We also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines as each facet is computed independently.
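Because each facet arrives as a flat (name, value, count) row, client code usually regroups the rows by facet name. The Python sketch below is an illustration only: the function name group_facets is hypothetical, and the rows are typed in by hand from the counts described above rather than fetched from a repository.

```python
from collections import defaultdict


def group_facets(rows):
    """Group (facetName, facetValue, facetCount) rows into {name: {value: count}}."""
    facets = defaultdict(dict)
    for name, value, count in rows:
        facets[name][value] = int(count)
    return dict(facets)


# Rows as they would be read from the ?facetName, ?facetValue and ?facetCount bindings.
rows = [
    ("year", "2012", "3"),
    ("year", "2013", "2"),
    ("sugar", "dry", "3"),
    ("sugar", "medium", "2"),
]

facets = group_facets(rows)
print(facets["year"])
print(facets["sugar"])
```

The grouping keeps the two facets separate, which mirrors the fact that each facet is computed independently.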

{anchor:sorting}
h2. Sorting

It is possible to sort the entities returned by a connector query according to one or more fields. In order to be able to use a certain field for sorting, you have to specify this at the time of creating the connector instance. Sorting is achieved by the *orderBy* predicate the value of which must be a comma-delimited list of fields. Each field may be prefixed with a minus to indicate sorting in descending order. For example:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
:entities ?entity .
}
{code}

The result will contain wines produced in 2013, sorted according to their sugar content in descending order:

|| entity ||
h2. Limit and offset

Limit and offset are supported on the Solr side of the query. This is achieved through the predicates *limit* and *offset*. Consider this example, in which we specify an offset of 1 and a limit of 1:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
:entities ?entity .
}
{code}

The result will contain a single wine, Franvino, as it would be second in the list, if we executed the query without the limit and offset:

|| entity ||
| Blanquito |

Note that the specific order in which GraphDB returns the results depends on how Solr returns the matches, unless you specify sorting.

h2. Snippet extraction

Snippet extraction is used to extract highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate *:snippets*, which binds a blank node that in turn provides the actual snippets via the predicates *:snippetField* and *:snippetText*. The predicate :snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
:snippetText ?snippetText .
}
{code}

The query will return the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:
You can get the total number of hits by using the *:totalHits* predicate, e.g. for the connector :my_index and a query that would retrieve all wines made in 2012:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
:totalHits ?totalHits .
}
{code}

As there are three wines made in 2012, the value 3 (of type xsd:long) is bound to ?totalHits.

h1. Creation parameters

The creation parameters define how a connector instance is created by the :createConnector predicate. Some parameters are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.

All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.
h4. Property chain to map: propertyChain (list of URI)

The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs and at least one URI must be provided. If you need to store the entity URI in the connector, you may map it by defining a property chain with a single special URI: $self. Only one field per connector may use the $self notation.
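The way a property chain is resolved can be illustrated with a few lines of Python over an in-memory list of triples. This is a sketch of the definition above, not the connector's implementation: the function name follow_chain is hypothetical, and the triples are illustrative stand-ins written as plain strings rather than the actual sample data.

```python
def follow_chain(triples, entity, chain):
    """Return the set of values reached from `entity` by following `chain`.

    `triples` is an iterable of (subject, predicate, object) tuples. The
    object of each step becomes the subject of the next one. The special
    chain ["$self"] maps the entity URI itself.
    """
    if not chain:
        raise ValueError("at least one URI must be provided")
    if chain == ["$self"]:
        return {entity}
    current = {entity}
    for prop in chain:
        current = {o for (s, p, o) in triples if s in current and p == prop}
    return current


# Illustrative data only (assumed for this example).
triples = [
    ("wine:Yoyowine", "wine:madeFromGrape", "wine:CabernetSauvignon"),
    ("wine:CabernetSauvignon", "rdfs:label", "Cabernet Sauvignon"),
    ("wine:Yoyowine", "wine:hasSugar", "dry"),
]

print(follow_chain(triples, "wine:Yoyowine", ["wine:madeFromGrape", "rdfs:label"]))
print(follow_chain(triples, "wine:Yoyowine", ["wine:hasSugar"]))
print(follow_chain(triples, "wine:Yoyowine", ["$self"]))
```

A chain with a single element behaves like a direct property lookup, and a chain that cannot be followed to the end yields an empty set.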

h4. The default value: defaultValue (string)
h4. Skipping the analyser: syncAsIs (boolean)

When literal fields are indexed in Solr, they will be analysed according to the analyser settings. Should you require that a given field is not analysed, you may use syncAsIs. False by default.

h2. Optional parameters


h3. Literals in what language: languages (list of string)

h3. Entity filtering: entityFilter (string)

The _entityFilter_ parameter is used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values will be synchronised to Solr if, and only if, they pass the filter. The entity filter is similar to a FILTER() inside a SPARQL query but not exactly the same. Each configured field can be referred to in the entity filter by prefixing it with a "?", much like referring to a variable in SPARQL. Several operators are supported:

|| Operator || Meaning || Example ||
The atomic operators _in_, _not in_ and _bound_ accept either an operand that is a field name variable as in the examples above, or a special construction composed of a field name variable followed by a URI. The URI will be used as a property of the particular field value bound to the field name variable, fetched from GraphDB and then its value will be used to evaluate the entity filter expression. It can be illustrated with this SPARQL snippet:

{code}
?fieldName <http://some.property#uri> ?evaluatedValue
{code}

Instead of using the values of ?fieldName, the values of ?evaluatedValue will be used.

In addition to full URIs within < >, the filters support the shorthand form *type*, which stands for <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.

Entity filters can be combined with default values in order to get more flexible behaviour.

A typical use-case for an entity filter is having soft deletes, i.e. instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.

h4. Basic entity filter example

For example, if we create a connector like this:
{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
''' .
}
{code}

and then insert some entities:

{code}
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example#> .
:city "London" .

# the entity below will not be synchronised because it lacks the property completely: bound(?city)
:beta
rdf:type :gadget ;
:name "Mary Syncless" ;
:city "Liverpool" .
{code}

We could create the following index to specify a default value for _city_:

{code}
...
{
...
}
{code}

The default value will be used for entity:b as it has no value for city in the repository. As the value is "London", the entity will be synchronised.
h4. Advanced entity filter example

Sometimes data represented in RDF is not well suited to map directly to non-RDF. For example, if we have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:

{code}
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
rdfs:comment "An article about the Cannes Film Festival in 2013." ;
:taggedWith :Cannes-FF .
{code}

Now, if we want to map this data to Solr in such a way that the property *:taggedWith _x_* is mapped to separate fields *taggedWithPerson* and *taggedWithLocation* according to the type of _x_ (we are not interested in events), we can map :taggedWith twice to different fields and then use an entity filter to get the desired values:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
''' .
}
{code}

Note: *type* is the short way to write <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
|| Article URI || Entity mapped? || Value in taggedWithPerson || Value in taggedWithLocation || Explanation ||
| :Article1 | yes | :Einstein | :Berlin | :taggedWith has the values :Einstein, :Berlin and :Cannes-FF. The filter leaves only the correct values in the respective fields. The value :Cannes-FF is ignored as it does not match the filter. |
| :Article2 | yes | | :Berlin | :taggedWith has the value :Berlin. After the filter is applied, only taggedWithLocation is populated. |
| :Article3 | yes | :Mozart | | :taggedWith has the value :Mozart. After the filter is applied, only taggedWithPerson is populated. |
| :Article4 | yes | :Mozart | :Berlin | :taggedWith has the values :Berlin and :Mozart. The filter leaves only the correct values in the respective fields. |
| :Article5 | yes | | | :taggedWith has no values. The filter is not relevant. |
| :Article6 | yes | | | :taggedWith has the value :Cannes-FF. The filter removes it as it does not match. |

This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:

{code}
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
:facetCount ?facetCount .
}
{code}

If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:

|| ?facetName || ?facetValue || ?facetCount ||
}


{plantuml}


h1. Solr specifics

h2. Solr schema generation

When data is added to the index for the first time, the connector creates the appropriate schema fields in the core, based on the values in the synchronised fields. For this to work properly, you need to be using the schemaless Solr home from the examples in the Solr distribution. It contains the mapped Solr field types and it is declared as managed, which allows the connector to add fields. Here is the mapping from RDF types to Solr field types:

|| RDF type || Solr type ||
h2. Solr deployment

To create new Solr cores on the fly, you have to use the custom admin handler provided with the Solr Connector. The admin handler has to be added to Solr:

* Get the solr-extension jar from the GraphDB distribution and copy it to <solr-home>.
* Tell Solr to scan the jar and use our custom handler instead of the default one - add the following lines to the root *solr* tag in solr.xml in <solr-home>:
{code}
<str name="adminHandler">org.apache.solr.handler.admin.ExtendedCoreAdminHandler</str>
<str name="sharedLib">${sharedLib:}</str>
{code}

The GraphDB distribution also provides a complete <solr-home> that you can extract to a convenient place and then use to run Solr.

Note that this is a limitation of Solr and you are not required to use the custom handler. However, if you choose not to, you will have to create the Solr core yourself.

h1. Caveats
Even though SPARQL per se is not sensitive to the order of triple patterns, the connectors expect to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.

The diagram in [#Overview of connector predicates] provides a quick overview of the predicates.
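One way to avoid ordering mistakes is to generate the search block from code that always emits the query and option predicates before the result predicates. The Python sketch below shows the idea. The function name build_search_block and the two predicate groups are assumptions made for this example, limited to predicates mentioned on this page, and the query string "year:2013" is likewise only illustrative.

```python
# Predicates that configure the search; they must come first.
OPTION_PREDICATES = ("query", "orderBy", "limit", "offset", "facetFields")
# Predicates that fetch results from the search; they must come last.
RESULT_PREDICATES = ("entities", "totalHits", "facets")


def build_search_block(index, options, results):
    """Return the triple patterns of a connector search in a safe order.

    `options` maps option predicates to literal values and `results` maps
    result predicates to variable names (without the leading "?").
    """
    lines = [f"?search a inst:{index}"]
    for predicate, value in options.items():
        if predicate not in OPTION_PREDICATES:
            raise ValueError(f"not an option predicate: {predicate}")
        if '"' in str(value):
            raise ValueError("this sketch does not escape double quotes")
        lines.append(f':{predicate} "{value}"')
    for predicate, variable in results.items():
        if predicate not in RESULT_PREDICATES:
            raise ValueError(f"not a result predicate: {predicate}")
        lines.append(f":{predicate} ?{variable}")
    return " ;\n    ".join(lines) + " ."


block = build_search_block(
    "my_index",
    {"query": "year:2013", "orderBy": "-sugar"},
    {"entities": "entity"},
)
print(block)
```

Entity-level predicates such as *:score* and *:snippets* are attached to the bound entity rather than to the search instance, so they are outside the scope of this helper.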


h2. Using a connector with a GraphDB cluster

h3. Multiple undesired synchronisations to a single Solr instance

In this scenario, each worker sees the same instance at the provided Solr URI (because the URI is unique within the network that connects the workers together). Typically these are URIs that contain normal IP addresses or hostnames that resolve to one and the same IP address on all the workers. Whenever a worker receives updates, it connects to Solr and synchronises the data there. The process is identical for all workers and thus each update on the Solr side is executed as many times as the number of workers in the cluster. This should only impact update performance but should not lead to any inconsistency errors.

h3. Multiple synchronisations to multiple Solr instances

The user can also provide a URI that is not unique within the cluster, e.g. URIs based on the localhost or URIs with hostnames that resolve to different IPs on each worker. In other words, each worker sees a different instance of Solr and thus when the worker sends an update to the Solr instance there are no redundant operations.

h1. Migrating from Lucene4 plugin

You can easily migrate your existing [lucene4 plugin|https://confluence.ontotext.com/display/EM/Lucene4+OWLIM+Plug-in] setup to the new connectors interface.

h3. Create index queries

We provide an automated migration tool for your create index queries. The tool is distributed with GraphDB 6.0 onward and can be found in the tools subdirectory. Here is how to use it:

{code}
java -jar migration.jar --file <input-file> <output-file>
{code}
where *input-file* is your old SPARQL file and *output-file* is the new SPARQL file.

You can list the available options with:
{code}
java -jar migration.jar --help
{code}

h3. Select queries using the index
We changed the syntax of the search queries to support new features and a better design. Here is an example query using the lucene4 plugin:

{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?c ?snippet WHERE {
?c rdf:type <http://data.ontotext.com/ontologies/ontology1/Type1> .

?c luc4:content ("gold" "limit=10;snippet.size=200") .
?c luc4:snippet ?snippet .
?c luc4:score ?score .
}
{code}

and here is its connectors variant:

{code}
PREFIX conn:<http://www.ontotext.com/connectors/lucene#>
PREFIX inst:<http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippet WHERE {
[] a inst:content ;
conn:query "gold" ;
conn:limit "10" ;
conn:snippetSize "200" ;
conn:entities ?entity .

?entity conn:snippets ?s .
?s conn:snippetText ?snippet .
}
{code}

Note the following changes:

* We are using special predicates for everything - no more key value options in a string
* The query is actually an instance of the index
* snippets belong to the entity
* snippets are now first class objects - you can also get the field of the match
* indexes are now an instance of another namespace. This allows you to create indexes with the name "entities" for example.

Look at [#Overview of connector predicates] for more info on the new syntax and how everything is linked together.
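When rewriting many select queries by hand, the old key-value option string can be split mechanically into the new dedicated predicates. The Python sketch below is not the migration tool mentioned above, and the function name migrate_options is hypothetical. It covers only the two options that appear in the example (limit and snippet.size) and refuses anything else instead of guessing a mapping.

```python
# Mapping taken from the example above; other options are deliberately absent.
OPTION_TO_PREDICATE = {
    "limit": "limit",
    "snippet.size": "snippetSize",
}


def migrate_options(options):
    """Turn "key=value;key=value" into a list of connector predicate/value pairs."""
    patterns = []
    for pair in options.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        key = key.strip()
        if key not in OPTION_TO_PREDICATE:
            raise KeyError(f"no known connector predicate for option: {key}")
        patterns.append(f'conn:{OPTION_TO_PREDICATE[key]} "{value.strip()}"')
    return patterns


for pattern in migrate_options("limit=10;snippet.size=200"):
    print(pattern, ";")
```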