Lucene GraphDB Connector

h1. Overview and features

The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches, typically implemented by an external component or a service such as Lucene, but have the additional benefit of staying automatically up-to-date with the GraphDB repository data.

The Connectors provide synchronisation at the _entity_ level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support _property chains_. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.
All interaction with a connector happens through SPARQL: in general, _INSERT_ adds or modifies data and _SELECT_ queries existing data.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Lucene GraphDB Connector, this is http://www.ontotext.com/connectors/lucene#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/lucene#createConnector to create a connector instance for Lucene.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Lucene, the instance prefix is http://www.ontotext.com/connectors/lucene/instance#.
Creating a connector instance is done by sending a SPARQL query with the following configuration data:

* the name of the connector instance (e.g., my_index);
* classes to synchronise;
* properties to synchronise.

{tip:title=What we recommend}
Use the GraphDB Connectors management interface provided by the GraphDB Workbench: it lets you create the configuration easily, and then either create the connector instance directly or copy the configuration and execute it elsewhere.
{tip}

The create command is triggered by a SPARQL *INSERT* with the *createConnector* predicate, e.g., this will create a connector instance called *my_index* that will synchronise the wines from the sample data above:
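A minimal create command following the mapping described below might look like this (a sketch; the property URIs for sugar and year are assumptions based on the sample wine ontology):

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
  inst:my_index :createConnector '''
{
  "types": ["http://www.ontotext.com/example/wine#Wine"],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"]
    },
    {
      "fieldName": "year",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"]
    }
  ]
}
''' .
}
{noformat}{div}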

The above command will create a new Lucene connector instance.

The "types" key defines the RDF type of the entities to synchronise and, in the example, it is only entities of the type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Lucene. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, we map three bits of information - the grape the wines are made of, sugar content, and year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.


Dropping a connector instance removes all references to its external store from GraphDB as well as all Lucene files associated with it.

The drop command is triggered by a SPARQL *INSERT* with the *dropConnector* predicate where the name of the connector instance has to be in the subject position, e.g., this will remove the connector *:my_index*:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
  inst:my_index :dropConnector [] .
}
{noformat}{div}
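The *?cntUri* and *?cntStr* bindings described below come from listing the available connector instances. A sketch of such a query, assuming the standard *:listConnectors* predicate:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}
{noformat}{div}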

*?cntUri* will be bound to the prefixed URI of the connector instance that was used during creation, e.g., <http://www.ontotext.com/connectors/lucene/instance#my_index>, while *?cntStr* will be bound to a string, representing the part after the prefix, e.g., "my_index".

h2. Instance status check
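The status of each instance can be retrieved with a SELECT query. A sketch, assuming a *:connectorStatus* predicate that reports a status string per instance:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStatus {
  ?cntUri :connectorStatus ?cntStatus .
}
{noformat}{div}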
h2. Adding, updating and deleting data

From the user's point of view, all synchronisation happens transparently, without using any additional predicates or naming a specific store explicitly: the user simply executes standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.

h2. Simple queries
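A basic search binds a query to a connector instance and retrieves the matching entities. A sketch, using the *my_index* instance from above and an illustrative full-text query:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
}
{noformat}{div}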

It is important to specify the fields we want to facet by using the *facetFields* predicate. Its value must be a simple comma-delimited list of field names. In order to get the faceted results, we have to use the *facets* predicate and, as each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates *facetName*, *facetValue*, and *facetCount*.
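Putting the faceting predicates together, a faceted query might look like this (a sketch; the choice of facet fields is illustrative):

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetName ?facetValue ?facetCount {
  ?search a inst:my_index ;
      :facetFields "year,sugar" ;
      :facets _:f .
  _:f :facetName ?facetName ;
      :facetValue ?facetValue ;
      :facetCount ?facetCount .
}
{noformat}{div}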

The resulting bindings will look like in the table below:
It is possible to tweak how the snippets are collected/composed by using the following option predicates:

* *:snippetSize* sets the maximum size of the extracted text fragment, 250 by default;
* *:snippetSpanOpen* text to insert before the highlighted text, <em> by default;
* *:snippetSpanClose* text to insert after the highlighted text, </em> by default.
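Snippets are attached to each matched entity via a blank node. A sketch of a query that retrieves highlighted fragments (the query value is illustrative):

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
{noformat}{div}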

h2. Total hits

You can get the total number of hits by using the *:totalHits* predicate, e.g., for the connector instance :my_index and a query that would retrieve all wines made in 2012:
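A sketch of such a query (the *year* field follows the earlier example configuration):

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?totalHits {
  ?search a inst:my_index ;
      :query "year:2012" ;
      :totalHits ?totalHits .
}
{noformat}{div}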

h1. List of creation parameters

The creation parameters define how a connector instance is created by the :createConnector predicate. Some parameters are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.

All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.
h3. languages (list of string), optional, valid languages for literals

RDF data is often multilingual but you may want to map only some of the languages represented in the literal values. This can be done by specifying a list of language ranges that will be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges will map all existing literals that have matching language tags.
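For example, the following parameter value (the chosen ranges are illustrative) maps English literals, German literals, and literals without a language tag:

{div:style=width: 70em}{noformat}
"languages": ["en", "de", ""]
{noformat}{div}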

h3. fields (list of field object), required, defines the mapping from RDF to Lucene
"fieldName": "grapeFacet",
"propertyChain": [
  "@grape"
],
"analyzed": false,
h4. datatype (string), optional, manual datatype override

By default, the Lucene GraphDB Connector uses the datatype of literal values to determine how they should be mapped to Lucene types. For more information on the supported datatypes, see [#Datatype mapping].

The datatype mapping can be overridden through the parameter "datatype", which can be specified per field. The value of "datatype" may be any of the xsd: types supported by the automatic mapping.
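For example, a field definition that forces its values to be treated as xsd:long might look like this (the property URI is an assumption based on the sample data):

{div:style=width: 70em}{noformat}
{
  "fieldName": "year",
  "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"],
  "datatype": "xsd:long"
}
{noformat}{div}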
h3. Copy fields

Often, it is convenient to synchronise the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs full-text search. The Lucene GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field.
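A sketch of such a configuration: the "grape" field is indexed for full-text search, while "grapeFacet" copies its value for faceting (the field names are illustrative):

{div:style=width: 70em}{noformat}
"fields": [
  {
    "fieldName": "grape",
    "propertyChain": [
      "http://www.ontotext.com/example/wine#madeFromGrape",
      "http://www.w3.org/2000/01/rdf-schema#label"
    ]
  },
  {
    "fieldName": "grapeFacet",
    "propertyChain": ["@grape"],
    "analyzed": false
  }
]
{noformat}{div}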

h1. Datatype mapping
|| RDF value || RDF datatype || Lucene value ||
| literal | xsd:date | DateTools.timeToString(), day precision |

The datatype mapping can be affected by the synchronisation options too, e.g., a non-analysed field that has xsd:long values will be indexed with a StringField.



Note that for any given field, the automatic mapping will use the first value it sees. This works fine for clean datasets but might lead to problems if your dataset has non-normalised data, e.g., if the first value has no datatype but other values do.

h1. Advanced filtering and fine tuning
|| Operator || Meaning || Example ||
| ( expr ) | Grouping of expressions | {nf}(bound(?name) || bound(?company)) && bound(?address){nf} |

In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:

h4. Accessing the previous element in the chain

The construction *parent(?var)* can be used to go to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., *parent(parent(parent(?var)))* will go back in the chain three times. The effective value of *parent(?var)* can be used with the *in* or *not in* operator like this: {nf}parent(?company) in (<urn:a>, <urn:b>){nf}.

h4. Accessing an element beyond the chain
The URI parameter can be a full URI within < > or the special string _rdf:type_ (alternatively just _type_), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.

h4. Filtering by RDF context graph

The construction *graph(?var)* can be used to access the RDF context graph of a field's value. The typical use case is to sync only explicit values: {nf}graph(?a) not in (<http://www.ontotext.com/implicit>){nf}. The construction can be combined with *parent()* like this: {nf}graph(parent(?a)) in (<urn:a>){nf}.

h4. Entity filters and default values
Entity filters can be combined with default values in order to get more flexible behaviour.

A typical use-case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.

h3. Basic entity filter example

For example, if we create a connector instance such as:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
# the entity below will be synchronised (entity names are illustrative):
:alpha
    rdf:type :gadget ;
    :name "John Synced" ;
    :city "London" .

# the entity below will not be synchronised because it lacks the property completely: bound(?city)
:beta
rdf:type :gadget ;
:name "Peter Syncfree" .

# the entity below will not be synchronised because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
:gamma
    rdf:type :gadget ;
    :city "Liverpool" .
{noformat}{div}
h3. Advanced entity filter example

Sometimes data represented in RDF is not well suited to mapping directly to non-RDF. For example, if we have news articles that can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:
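A minimal dataset in this shape might look like the following (the article and concept names are illustrative):

{div:style=width: 70em}{noformat}
@prefix : <http://www.ontotext.com/example#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:Berlin rdf:type :Location .
:Mozart rdf:type :Person .

:Article1 rdf:type :Article ;
    :taggedWith :Berlin ;
    :taggedWith :Mozart .

:Article2 rdf:type :Article ;
    :taggedWith :Berlin .
{noformat}{div}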

h1. Overview of connector predicates

The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labelled arrows.

To migrate an existing connector instance to the new version:

# deploy the new GraphDB version;
# modify the INSERT statement according to the changes described below;
# re-create the connector instance with the modified INSERT statement.

You might also need to change your queries to reflect any changes in field names or extra fields.

|| field || note ||
| firstName | produced if the option "index" was true; used explicitly in queries |
| _facet_firstName | produced if the option "facet" was true; used implicitly for facet search |
| _sort_firstName | produced if the option "sort" was true; used implicitly for ordering connector results |

The current version always produces a single Lucene field per field definition in the configuration. This means that you are responsible for creating all appropriate fields based on your needs. See more under [#List of creation parameters].
{code:language=bash}
java -jar migration.jar --file <input-file> <output-file>
{code}
where *input-file* is your old SPARQL file and *output-file* is the new SPARQL file.

You can find possible options with:
{code:language=bash}
java -jar migration.jar --help
{code}

h3. Select queries using the index
We have changed the syntax of the search queries to accommodate new features and a better design. Compared to queries using the old lucene4 plugin:

* Snippets belong to the entity;
* Snippets are now first-class objects - you can also get the field of the match;
* Indexes are now an instance of another namespace. This allows you to create indexes with the name "entities", for example.

For more information on the new syntax and how everything is linked together, see [#Overview of connector predicates].