Solr GraphDB Connector

h1. Overview and features

The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches, typically implemented by an external component or a service such as Solr, but have the additional benefit of staying automatically up-to-date with the GraphDB repository data.

The Connectors provide synchronisation at the _entity_ level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support _property chains_. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.
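
As a rough illustration, a property chain can be read as an ordinary SPARQL graph pattern. The sketch below assumes sample wine data in the namespace http://www.ontotext.com/example/wine# with a madeFromGrape property; it follows a two-step chain from a wine to its grape and then to the grape's label.

```sparql
PREFIX wine: <http://www.ontotext.com/example/wine#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Two-step chain: wine -> grape -> label.
# The object of the first triple (?grape) is the subject of the second.
SELECT ?wine ?grapeLabel WHERE {
  ?wine wine:madeFromGrape ?grape .
  ?grape rdfs:label ?grapeLabel .
}
```
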
In general, this corresponds to _INSERT adds or modifies data_ and _SELECT queries existing data_.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Solr GraphDB Connector, this is http://www.ontotext.com/connectors/solr#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/solr#createConnector to create a connector instance for Solr.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Solr, the instance prefix is http://www.ontotext.com/connectors/solr/instance#.
In order to be able to create new Solr cores on the fly, you have to use the custom admin handler provided with the Solr Connector. These are the necessary steps:

# Copy the file {{solr-core-admin-handler.jar}} from the folder tools of the GraphDB distribution to your Solr home.
# Tell Solr to scan the jar and use our custom handler instead of the default one. Add this to the root *solr* tag in solr.xml in your Solr home:
{code:language=html/xml}

{note}
Note that this is a limitation of Solr and you are not required to use the custom handler. If you do not wish to deploy it, you will be responsible for creating the Solr core yourself.
{note}

Creating a connector instance is done by sending a SPARQL query with the following configuration data:

* the name of the connector instance (e.g., my_index);
* a Solr instance to synchronise to;
* classes to synchronise;

{tip:title=What we recommend}
Use the GraphDB Connectors management interface provided by the GraphDB Workbench as it will let you create the configuration easily, and then create the connector instance directly or copy the configuration and execute it elsewhere.
{tip}

The create command is triggered by a SPARQL *INSERT* with the *createConnector* predicate, e.g., this will create a connector instance called *my_index* that will synchronise the wines from the sample data above:

{div:style=width: 70em}{noformat}
The above command creates a new Solr connector instance that connects to the Solr instance accessible at port 8983 on the localhost as specified by the "solrUrl" key.

The "types" key defines the RDF type of the entities to synchronise and, in the example, it is only entities of the type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Solr. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, we map three bits of information - the grape the wines are made of, sugar content, and year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.
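
Putting the pieces described above together, a create statement might look like the following sketch. The /solr path in "solrUrl" and the hasSugar and hasYear property URIs are assumptions made for illustration; only the port, the Wine type, the madeFromGrape property and the three field names come from the description.

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

# Sketch only: hasSugar and hasYear are assumed property names.
# The configuration is passed as a JSON string in the object position.
INSERT DATA {
  inst:my_index :createConnector '''
{
  "solrUrl": "http://localhost:8983/solr",
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasSugar"
      ]
    },
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ]
    }
  ]
}
''' .
}
```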

h4. Schema and core management
* _manageSchema_: if true, GraphDB will manage the schema. True by default.

The automatic core management requires the custom Solr admin handler provided with the GraphDB distribution. For more information, see [#Solr core creation].

Note that if either of the options is set to false, you will be responsible for creating, updating or removing the core/schema and, if you have misconfigured Solr, the connector instance will not function correctly.
h5. Using a non-managed schema

The present version provides no support for changing some advanced options, such as stopwords, on a per-field basis. The recommended way to do this for now is to manage the schema yourself and tell the connector to just sync the object values in the appropriate fields. Here is an example:

{div:style=width: 70em}{noformat}
{noformat}{div}

This will create the same connector instance as above but it would expect fields with the specified field names to be already present in the core, as well as some internal GraphDB fields. For the example, you must have the following fields:

|| field name || Solr config ||
Dropping a connector instance removes all references to its external store from GraphDB as well as the Solr core associated with it.

The drop command is triggered by a SPARQL *INSERT* with the *dropConnector* predicate where the name of the connector instance has to be in the subject position, e.g., this will remove the connector *:my_index*:

{div:style=width: 70em}{noformat}
{noformat}{div}
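
A minimal sketch of such a drop statement follows. The instance name sits in the subject position as described; the empty string used as the object is an assumption, since the text only fixes the subject and the predicate.

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

# Drops the connector instance my_index.
# The empty-string object is an assumed placeholder value.
INSERT DATA {
  inst:my_index :dropConnector "" .
}
```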

*?cntUri* will be bound to the prefixed URI of the connector instance that was used during creation, e.g., <http://www.ontotext.com/connectors/solr/instance#my_index>, while *?cntStr* will be bound to a string, representing the part after the prefix, e.g., "my_index".
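
A query producing these two bindings could be sketched as follows, assuming the listing predicate is named :listConnectors (the name is not given in the text above).

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>

# ?cntUri is bound to the instance URI, ?cntStr to its short name.
# The predicate name listConnectors is an assumption.
SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}
```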

h2. Instance status check
h2. Adding, updating and deleting data

From the user's point of view, all synchronisation happens transparently without using any additional predicates or naming a specific store explicitly, i.e., the user simply executes standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
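
For instance, a plain SPARQL update such as the sketch below would be enough to add a wine and later change its year; no connector predicates appear anywhere. The resource wine:ExampleWine and the property wine:hasYear are hypothetical names used only for illustration.

```sparql
PREFIX wine: <http://www.ontotext.com/example/wine#>

# Add a new entity of a synchronised type (hypothetical names).
INSERT DATA {
  wine:ExampleWine a wine:Wine ;
      wine:hasYear 2012 .
} ;

# Replace the year value; the connector is expected to pick up the change.
DELETE { wine:ExampleWine wine:hasYear ?old }
INSERT { wine:ExampleWine wine:hasYear 2013 }
WHERE  { wine:ExampleWine wine:hasYear ?old }
```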

h2. Simple queries
{noformat}{div}

You can get those parameters when you do your query from the admin interface in Solr, or from the response payload (where they are included). We also support the query parameters from the select endpoint in Solr, if you prefer that. Here is an example:

{div:style=width: 70em}{noformat}
{noformat}{div}

Note that you have to specify *_q=_* as the first parameter because we use it for detecting the raw query.


{noformat}{div}

It is important to specify the fields we want to facet by using the *facetFields* predicate. Its value must be a simple comma-delimited list of field names. In order to get the faceted results, we have to use the *facets* predicate and, as each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates *facetName*, *facetValue*, and *facetCount*.
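
The pattern described above can be sketched as the query below. It assumes a connector instance named my_index with fields "year" and "sugar", and that a search is started by typing a variable with the instance URI and attaching a :query predicate; the match-all query string is standard Solr syntax.

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?facetName ?facetValue ?facetCount WHERE {
  # Start a search on the (assumed) instance and request two facet fields.
  ?search a inst:my_index ;
          :query "*:*" ;
          :facetFields "year,sugar" ;
          :facets _:f .
  # Each facet is a blank node with three components.
  _:f :facetName ?facetName ;
      :facetValue ?facetValue ;
      :facetCount ?facetCount .
}
```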

The resulting bindings will look like those in the table below:
h3. Supported Solr facets and aggregations

The Solr GraphDB Connector supports mapping of range, interval and pivot facets. For more information, please refer to the documentation of Solr.

h3. RDF mapping of the results

{note:title=Solr caveat}
Solr imposes an additional requirement on fields used for sorting. They must be defined with multivalued = false.
{note}

It is possible to tweak how the snippets are collected/composed by using the following option predicates:

* *:snippetSize* sets the maximum size of the extracted text fragment, 250 by default;
* *:snippetSpanOpen* sets the text to insert before the highlighted text, <em> by default;
* *:snippetSpanClose* sets the text to insert after the highlighted text, </em> by default.
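
As a sketch of how these options might be combined, the query below overrides all three defaults. It assumes that the options are attached to the search instance, that the search is started with a :query predicate, and that snippets expose their parts through predicates named :snippetField and :snippetText; none of these details are spelled out in this section.

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity ?snippetField ?snippetText {
  # Shorter fragments, highlighted with <b>...</b> instead of <em>...</em>.
  ?search a inst:my_index ;
          :query "grape:cabernet" ;
          :snippetSize 100 ;
          :snippetSpanOpen "<b>" ;
          :snippetSpanClose "</b>" ;
          :entities ?entity .
  # snippetField and snippetText are assumed predicate names.
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
```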

h2. Total hits

You can get the total number of hits by using the *:totalHits* predicate, e.g., for the connector instance :my_index and a query that would retrieve all wines made in 2012:
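
A minimal sketch of such a query, assuming the search is started with a :query predicate and that the year is indexed in a field named "year":

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

# Binds ?totalHits to the number of documents matching the Solr query.
SELECT ?totalHits {
  ?r a inst:my_index ;
     :query "year:2012" ;
     :totalHits ?totalHits .
}
```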

{div:style=width: 70em}{noformat}
h1. List of creation parameters

The creation parameters define how a connector instance is created by the :createConnector predicate. Some parameters are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.

All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.
h3. languages (list of string), optional, valid languages for literals

RDF data is often multilingual but you may want to map only some of the languages represented in the literal values. This can be done by specifying a list of language ranges that will be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges will map all existing literals that have matching language tags.
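
For illustration, the sketch below restricts a connector to English and German literals. The instance name my_index_en and the /solr path are hypothetical; the field definition reuses the grape chain from the wine example.

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

# Sketch only: the instance name my_index_en is hypothetical.
# Under RFC 4647 basic filtering, the range "en" matches the tag "en"
# as well as more specific tags such as "en-GB".
INSERT DATA {
  inst:my_index_en :createConnector '''
{
  "solrUrl": "http://localhost:8983/solr",
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "languages": ["en", "de"],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    }
  ]
}
''' .
}
```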

h3. fields (list of field object), required, defines the mapping from RDF to Solr
When literal fields are indexed in Solr, they will be analysed according to the analyser settings. Should you require that a given field is not analysed, you may use "analyzed". This option has no effect for URIs (they are never analysed). True by default.

This option affects the Solr type that will be used for the field. True will use a type suitable for the values (i.e., text or numeric), while false will use the type "string", which is never analysed by Solr.

h4. multivalued (boolean), optional, default true
"fieldName": "grapeFacet",
"propertyChain": [
"@grape"
],
"analyzed": false,
h4. datatype (string), optional, manual datatype override

By default, the Solr GraphDB Connector will use the datatype of literal values to determine how they should be mapped to Solr types. For more information on the supported datatypes, see [#Datatype mapping].

The mapping can be overridden through the property "datatype", which can be specified per field. The value of "datatype" may be any of the xsd: types supported by the automatic mapping or a native Solr type prefixed by native:, e.g., both xsd:long and native:tlongs will map to the tlongs type in Solr.
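
A sketch of a per-field override follows. The instance name my_index_typed, the /solr path and the hasYear property URI are assumptions; the point is only the "datatype" key on the field object.

```sparql
PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

# Sketch only: forces the "year" field to be mapped as xsd:long,
# regardless of the datatype of the first value encountered.
# The hasYear property and the instance name are hypothetical.
INSERT DATA {
  inst:my_index_typed :createConnector '''
{
  "solrUrl": "http://localhost:8983/solr",
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ],
      "datatype": "xsd:long"
    }
  ]
}
''' .
}
```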

h3. Copy fields

Often, it is convenient to synchronise one and the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs full-text search. The Solr GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. For example, with this snippet:

h1. Datatype mapping
| literal | xsd:date | tdates |

The datatype mapping can be affected by the synchronisation options, too. For example, a non-analysed field that has xsd:long values will not use "tlongs" but "string" instead.


Note that for any given field the automatic mapping will use the first value it sees. This will work fine for clean datasets but might lead to problems if your dataset has non-normalised data, e.g., the first value has no datatype but other values do.

h1. Advanced filtering and fine tuning
| ( expr ) | Grouping of expressions | {nf}(bound(?name) || bound(?company)) && bound(?address){nf} |

In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:

h4. Accessing the previous element in the chain

The construction *parent(?var)* can be used to go to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., *parent(parent(parent(?var)))* will go back in the chain three times. The effective value of *parent(?var)* can be used with the *in* or *not in* operator like this: {nf}parent(?company) in (<urn:a>, <urn:b>){nf}.

h4. Accessing an element beyond the chain
The URI parameter can be a full URI within < > or the special string _rdf:type_ (alternatively just _type_), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.

h4. Filtering by RDF graph

The construction *graph(?var)* can be used to access the RDF graph of a field's value. The typical use case is to sync only explicit values: {nf}graph(?a) not in (<http://www.ontotext.com/implicit>){nf}. The construction can be combined with *parent()* like this: {nf}graph(parent(?a)) in (<urn:a>){nf}.

h4. Entity filters and default values
Entity filters can be combined with default values in order to get more flexible behaviour.

A typical use-case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.

h3. Basic entity filter example

For example, if we create a connector instance such as:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/solr#>
:city "London" .

# the entity below will not be synchronised because it lacks the property completely: bound(?city)
:beta
rdf:type :gadget ;
:name "Peter Syncfree" .

# the entity below will not be synchronised because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
:gamma
h3. Advanced entity filter example

Sometimes data represented in RDF is not well suited to map directly to non-RDF. For example, if we have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:

{div:style=width: 70em}{noformat}
h1. Overview of connector predicates

The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labelled arrows.

{plantuml}
# deploy the new GraphDB version;
# modify the INSERT statement according to the changes described below;
# re-create the connector instance with the modified INSERT statement.

You might also need to change your queries to reflect any changes in field names or extra fields.

|| field || note ||
| firstName | produced, if the option "index" was true; used explicitly in queries |
| _facet_firstName | produced, if the option "facet" was true; used implicitly for facet search |
| _sort_firstName | produced, if the option "sort" was true; used implicitly for ordering connector results |

The current version always produces a single Solr field per field definition in the configuration. This means that you are responsible for creating all appropriate fields based on your needs. See more under [#Creation parameters].