Lucene GraphDB Connector

Current version by Nikola Petrov on May 12, 2015 14:48.

The GraphDB Connectors provide extremely fast normal and facet (aggregation) searches. Such searches are typically implemented by an external component or service such as Lucene, but the Connectors have the additional benefit of staying automatically up to date with the GraphDB repository data.

The Connectors provide synchronisation at the _entity_ level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support _property chains_. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.

h1. Features
The main features of the GraphDB Connectors are:

* maintaining an index that is always in sync with the data stored in GraphDB
* multiple independent instances per repository
* the entities for synchronisation are defined by:
** a list of fields (on the Lucene side) and property chains (on the GraphDB side) whose values will be synchronised
** a list of rdf:type's of the entities for synchronisation
** a list of languages for synchronisation (the default is all languages)
** additional filtering by property and value
* full-text search using native Lucene queries
* custom scoring expressions at query time to evaluate score based on Lucene score and entity boost

Each feature is described in detail below.

h1. Sample data

All examples below use the following sample data. It describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova, as well as the grape varieties needed to make these wines. The minimum ruleset level required in GraphDB is RDFS.

{code}
{code}


h1. Usage

All interactions with the Lucene GraphDB Connector are done through SPARQL queries.

There are three types of SPARQL queries:
In general this corresponds to _INSERT adds or modifies data_ and _SELECT queries existing data_.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Lucene GraphDB Connector this is *http://www.ontotext.com/connectors/lucene#*. Each command or predicate that is executed by the connector uses this prefix, e.g. <http://www.ontotext.com/connectors/lucene#createConnector> for creating a connector for Lucene.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix in order to avoid clashing with any of the command predicates. For Lucene, the instance prefix is http://www.ontotext.com/connectors/lucene/instance#.

h2. Creating a connector

Creating a connector is done by sending a SPARQL query with the following configuration data:

* the name of the connector (e.g. my_index),
* classes to synchronise,
* properties to synchronise.

The configuration data must be provided as a JSON string representation and passed together with the create command.
{tip}

The create command is triggered by a SPARQL *INSERT* with the *createConnector* predicate, e.g. this will create a connector called *my_index* that will synchronise the wines from the sample data above:

{code}
{code}

Note that one of the fields has _"sort": true_. This is explained further under [Sorting|#sorting].

The above command will create a new Lucene connector.

The "types" key defines the RDF type of the entities to synchronise. In the example, these are only entities of the type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Lucene. The basic building block is the property chain, i.e. a sequence of RDF properties where the object of each property is the subject of the following property. In the example we map three bits of information: the wine's grape, sugar content, and year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First we take the wine's madeFromGrape property, the object of which is an instance of type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.
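
As an illustration only, the configuration described above could be written roughly as follows. The JSON key *fieldName*, the property URIs *hasSugar* and *hasYear*, and the choice of "year" as the sortable field are assumptions made for this sketch; only *types*, *fields*, *propertyChain*, *sort*, *madeFromGrape* and *rdfs:label* are named in the text.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
  # The JSON configuration is passed as a long string literal.
  inst:my_index :createConnector '''
{
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasSugar"
      ]
    },
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ],
      "sort": true
    }
  ]
}
''' .
}
{code}

The "grape" field uses a two-element chain, while "sugar" and "year" each use a single property.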





h2. Dropping a connector

h2. Listing available connectors

Listing connectors returns all previously created connectors. It is a *SELECT* query with the *listConnectors* predicate:

{code}
{code}

*?cntUri* will be bound to the prefixed URI of the connector that was used during creation, e.g. <http://www.ontotext.com/connectors/lucene/instance#my_index>, while *?cntStr* will be bound to a string, representing the part after the prefix, e.g. "my_index".
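
A minimal sketch of such a listing query, using the variable names from the paragraph above and assuming the default prefix is bound to the connector namespace:

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}
{code}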

h2. Status check
{code}

*?cntUri* will be bound to the connector prefixed URI, while *?cntStatus* will be bound to a string representation of the status of the connector represented by this URI. The status is key-value based.
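
As a rough sketch, a status check could look like the query below. The predicate name *connectorStatus* does not appear in the text above and is an assumption here; check it against your GraphDB version before relying on it.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStatus {
  # connectorStatus is an assumed predicate name
  ?cntUri :connectorStatus ?cntStatus .
}
{code}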

h2. Adding, updating and deleting data

From the user's point of view, all synchronisation will happen transparently without using any additional predicates or naming a specific store explicitly, i.e. the user should simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
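
For illustration, a plain SPARQL update such as the one below would be enough to have a new wine picked up by a connector that synchronises the Wine type. The entity *wine:Testwine* is hypothetical, and the sketch assumes that the sample data uses the same namespace for its instances and properties as for the Wine type.

{code}
PREFIX wine: <http://www.ontotext.com/example/wine#>

INSERT DATA {
  # No connector predicates are involved; this is an ordinary update.
  wine:Testwine a wine:Wine ;
      wine:madeFromGrape wine:CabernetFranc .
}
{code}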

h2. Querying data

Once a connector has been created, it will be possible to query data from it through SPARQL. For each matching abstract document, the connector returns the document's subject. In its simplest form querying is achieved by using a *SELECT* and providing the Lucene query as the object of the *:query* predicate:

{code}
The result will bind ?entity to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.

Note that you must use the field names you chose when you created the connector. It is perfectly valid to have field names identical to the property URIs but then you are responsible for escaping any special characters according to what Lucene expects.

First we get an instance of the requested connector by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector. X will be bound to an instance of this connector. Then we assign a query to that instance by using the system predicate *:query*. Finally we request the matching entities through the *:entities* predicate.
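
The three steps above can be sketched as follows, assuming a connector named *my_index* with a field called "grape":

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;          # instance of the connector
      :query "grape:cabernet" ;      # Lucene query on the "grape" field
      :entities ?entity .            # matching entities
}
{code}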

It is also possible to provide per-query search options by using one or more option predicates. The option predicates are described in detail below.



h3. Combining Lucene results with GraphDB data

{code}

The result will look like this:

|| ?entity || ?grape || ?year ||
| :Franvino | :CabernetFranc | 2012 |

Note that :Franvino is returned twice because it is made from two different grapes, both of which are returned.
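
A sketch of how a connector query can be joined to data stored in GraphDB. It assumes the *my_index* connector and that *madeFromGrape* lives in the same namespace as the Wine type; only the grape is joined here.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  # Ordinary triple pattern evaluated against the repository
  ?entity wine:madeFromGrape ?grape .
}
{code}

An entity with several grapes produces one row per grape, which is why the same wine can appear more than once.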

h3. Entity match score

It is possible to access the match score returned by Lucene with the *:score* predicate. As each entity has its own score, the predicate must come at the entity level. For example:

{code}
{code}
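
A minimal sketch of retrieving the score, assuming the *my_index* connector; note that *:score* is attached to the entity, not to the search instance:

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?score {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :score ?score .
}
{code}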

It is important to specify the fields we want to facet by using the *facetFields* predicate. Its value must be a simple comma-delimited list of field names. In order to get the faceted results, we have to use the *facets* predicate. As each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates *facetName*, *facetValue*, and *facetCount*.

The resulting bindings look like the table below:

|| facetName || facetValue || facetCount ||
| sugar | medium | 2 |

We can easily see that there are three wines that were produced in 2012 and two in 2013. We also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines, as each facet is computed independently.
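
A sketch of a faceted query over the "year" and "sugar" fields, assuming the *my_index* connector and that faceting is enabled for both fields. No *:query* is given here, on the assumption that an absent query matches all documents.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetName ?facetValue ?facetCount {
  ?search a inst:my_index ;
      :facetFields "year,sugar" ;
      :facets _:f .
  # The blank node gives access to the three components of each facet
  _:f :facetName ?facetName ;
      :facetValue ?facetValue ;
      :facetCount ?facetCount .
}
{code}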

{anchor:sorting}
h2. Sorting

It is possible to sort the entities returned by a connector query according to one or more fields. In order to be able to use a certain field for sorting, you have to specify this at the time of creating the connector instance. Sorting is achieved by the *orderBy* predicate the value of which must be a comma-delimited list of fields. Each field may be prefixed with a minus to indicate sorting in descending order. For example:

{code}
| Yoyowine |

By default, entities are sorted according to their matching score in descending order.
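
A sketch of sorting in descending order by a single field. It assumes the *my_index* connector and that the "year" field was declared with "sort": true when the connector was created.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "sugar:dry" ;
      :orderBy "-year" ;     # leading minus = descending order
      :entities ?entity .
}
{code}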

Note that if you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble the order. To remedy this, use ORDER BY from SPARQL.

h2. Limit and offset

Limit and offset are supported on the Lucene side of the query. This is achieved through the predicates *limit* and *offset*. Consider this example, in which we specify an offset of 1 and a limit of 1:

{code}
| Blanquito |

Note that the specific order in which GraphDB returns the results depends on how Lucene returns the matches, unless you specified sorting.
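
A sketch of a query with an offset of 1 and a limit of 1, assuming the *my_index* connector. The values are passed as plain strings, as in the migration example further down this page.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "sugar:dry" ;
      :offset "1" ;          # skip the first match
      :limit "1" ;           # return at most one match
      :entities ?entity .
}
{code}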

h2. Snippet extraction

Snippet extraction is used to extract highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate *:snippets*, which binds a blank node that in turn provides the actual snippets via the predicates *:snippetField* and *:snippetText*. The predicate :snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:

{code}
{code}
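
A minimal sketch of snippet retrieval for a search on "cabernet", assuming the *my_index* connector:

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  # Snippets belong to the entity, not to the search instance
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
{code}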

As there are three wines made in 2012, the value 3 (of type xsd:long) will be bound to ?totalHits.
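
A sketch of how the total number of hits might be requested. The predicate name *totalHits* is an assumption inferred from the variable name above, and the *my_index* connector with a "year" field is assumed as well.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?totalHits {
  ?search a inst:my_index ;
      :query "year:2012" ;
      :totalHits ?totalHits .   # assumed predicate name
}
{code}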

h1. Creation parameters
h4. Property chain to map: propertyChain (list of URI)

The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs and at least one URI must be provided. If you need to store the entity URI in the connector, you may map it by defining a property chain with a single special URI: $self. Only one field per connector may use the $self notation.

h4. The default value: defaultValue (string)
h4. Indexing the field: index (boolean)

Fields are indexed by default but that can be changed by using the Boolean option "index". True by default. Fields that are not indexed will be unavailable for queries but may still be used for faceting or sorting, if these are enabled.

h4. Synchronising for faceting: facet (boolean)
h4. Skipping the analyser: syncAsIs (boolean)

When literal fields are indexed in Lucene, they will be analysed according to the analyser settings. Should you require that a given field is not analysed you may use syncAsIs. False by default.
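
The per-field options above could be combined as in the following sketch. The connector name *my_index2*, the JSON key *fieldName* and the property URIs *hasSugar* and *hasYear* are assumptions; *propertyChain*, *$self*, *index*, *syncAsIs* and *sort* are taken from this page.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
  inst:my_index2 :createConnector '''
{
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "id",
      "propertyChain": ["$self"]
    },
    {
      "fieldName": "sugar",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"],
      "syncAsIs": true
    },
    {
      "fieldName": "year",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"],
      "index": false,
      "sort": true
    }
  ]
}
''' .
}
{code}

Here "id" stores the entity URI itself, "sugar" bypasses the analyser, and "year" cannot be queried but remains available for sorting.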

h2. Optional parameters

h3. Lucene Analyzer

The Lucene Connector supports custom Analyzer implementations. They may be specified via the _analyzer_ parameter whose value must be a fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer. The class must have either a default constructor or a constructor with exactly one parameter of type org.apache.lucene.util.Version. For example, these two classes would be valid implementations:

{code}
package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;

public class FancyAnalyzer extends Analyzer {
    public FancyAnalyzer() {
        ...
    }
    ...
}
{code}

{code}
package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;

public class SmartAnalyzer extends Analyzer {
    public SmartAnalyzer(Version luceneVersion) {
        ...
    }
    ...
}
{code}

FancyAnalyzer and SmartAnalyzer could then be used by specifying their fully qualified names, for example:

{code}
...
"analyzer": "com.ontotext.example.SmartAnalyzer",
...
{code}

h3. Literals in what language: languages (list of string)

h3. Entity filtering: entityFilter (string)

The _entityFilter_ parameter is used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values will be synchronised to Lucene if, and only if, they pass the filter. The entity filter is similar to a FILTER() inside a SPARQL query but not exactly the same. Each configured field can be referred to in the entity filter by prefixing it with a "?", much like referring to a variable in SPARQL. Several operators are supported:

|| Operator || Meaning || Example ||
{code}

We could create the following index to specify a default value for _city_:

{code}
{code}

The default value will be used for entity:b as it has no value for city in the repository. As the value is "London", the entity will be synchronised.

h4. Advanced entity filter example

Sometimes data represented in RDF is not well suited to map directly to non-RDF. For example, if we have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model that is a single property :taggedWith. Consider the following RDF data:

{code}
{code}

Now, if we want to map this data to Lucene such that the property *:taggedWith _x_* is mapped to separate fields *taggedWithPerson* and *taggedWithLocation* according to the type of _x_ (we are not interested in events), we can map :taggedWith twice to different fields and then use an entity filter to get the desired values:

{code}
|| Article URI || Entity mapped? || Value in taggedWithPerson || Value in taggedWithLocation || Explanation ||
| :Article1 | yes | :Einstein | :Berlin | :taggedWith has the values :Einstein, :Berlin and :Cannes-FF. The filter leaves only the correct values in the respective fields. The value :Cannes-FF is ignored as it does not match the filter. |
| :Article2 | yes | | :Berlin | :taggedWith has the value :Berlin. After the filter is applied, only taggedWithLocation is populated. |
| :Article3 | yes | :Mozart | | :taggedWith has the value :Mozart. After the filter is applied, only taggedWithPerson is populated. |
| :Article4 | yes | :Mozart | :Berlin | :taggedWith has the values :Berlin and :Mozart. The filter leaves only the correct values in the respective fields. |
| :Article5 | yes | | | :taggedWith has no values. The filter is not relevant. |
| :Article6 | yes | | | :taggedWith has the value :Cannes-FF. The filter removes it as it does not match. |

This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:

{code}
h1. Overview of connector predicates

The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities you need to use :entities on a search instance and to retrieve snippets you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labelled arrows.

{plantuml}
h2. Order of control

Even though SPARQL per se is not sensitive to the order of triple patterns, the connectors expect to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.

The diagram in [#Overview of connector predicates] provides a quick overview of the predicates.
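
A sketch of a query that respects this ordering, assuming the *my_index* connector: the query and its options come first, and the predicates that fetch results come last.

{code}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?score {
  ?search a inst:my_index ;
      # 1. the query and the query options
      :query "grape:cabernet" ;
      :limit "5" ;
      # 2. the predicates that fetch results
      :entities ?entity .
  ?entity :score ?score .
}
{code}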


h1. Migrating from Lucene4 plugin

You can easily migrate your existing [lucene4 plugin|https://confluence.ontotext.com/display/EM/Lucene4+OWLIM+Plug-in] setup to the new connectors interface.

h3. Create index queries

We provide an automated migration tool for your create index queries. The tool is distributed with GraphDB 6.0 onward and can be found in the tools subdirectory. Here is how to use it:

{code}
java -jar migration.jar --file <input-file> <output-file>
{code}
where *input-file* is your old SPARQL file and *output-file* is the new SPARQL file.

You can list the available options with:
{code}
java -jar migration.jar --help
{code}

h3. Select queries using the index
We changed the syntax of the search queries to support new features and a better design. Here is an example query using the lucene4 plugin:

{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?c ?snippet WHERE {
?c rdf:type <http://data.ontotext.com/ontologies/ontology1/Type1> .

?c luc4:content ("gold" "limit=10;snippet.size=200") .
?c luc4:snippet ?snippet .
?c luc4:score ?score .
}
{code}

And here is its connectors variant:

{code}
PREFIX conn:<http://www.ontotext.com/connectors/lucene#>
PREFIX inst:<http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?c ?snippet WHERE {
    [] a inst:content ;
       conn:query "gold" ;
       conn:limit "10" ;
       conn:snippetSize "200" ;
       conn:entities ?c .

    ?c conn:snippets ?s .
    ?s conn:snippetText ?snippet .
}
{code}

Note the following changes:

* We use dedicated predicates for everything; there are no more key-value options in a string.
* The query is actually an instance of the index.
* Snippets belong to the entity.
* Snippets are now first-class objects, so you can also get the field of the match.
* Indexes are now instances in a separate namespace. This allows you to create an index with the name "entities", for example.

Look at [#Overview of connector predicates] for more info on the new syntax and how everything is linked together.