GraphDB-SE Lucene Connector

compared with
Current by Pavel Mihaylov
on Jun 26, 2015 14:32.

{toc:maxLevel=2}

h1. Overview and features

The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches, which are typically implemented by an external component or service such as Lucene, with the additional benefit of staying automatically up to date with the GraphDB repository data.

The Connectors provide synchronisation at the _entity_ level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support _property chains_. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.

h1. Features

The main features of the GraphDB Connectors are:

* paging of results using _offset_ and _limit_;
* custom mapping of RDF types to Lucene types;
* specifying which Lucene analyzer to use (the default is Lucene's {{StandardAnalyzer}});
* stripping HTML/XML tags in literals (the default is not to strip markup);
* boosting an entity by the \[numeric\] value of one or more predicates;
* custom scoring expressions at query time to evaluate score based on Lucene score and entity boost.

Each feature is described in detail below.

h1. Usage

All interactions with the Lucene GraphDB Connector are done through SPARQL queries.

There are three types of SPARQL queries:

* INSERT for creating and deleting connector instances;
* SELECT for listing connector instances and querying their configuration parameters;
* INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.

In general, this corresponds to _INSERT adds or modifies data_ and _SELECT queries existing data_.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Lucene GraphDB Connector, this is http://www.ontotext.com/connectors/lucene#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/lucene#createConnector to create a connector instance for Lucene.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Lucene, the instance prefix is http://www.ontotext.com/connectors/lucene/instance#.

h3. Sample data

All examples use the following sample data, which describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.

{div:style=width: 70em}{noformat}
{noformat}{div}

h1. Setup and maintenance

h2. Creating a connector instance

Creating a connector instance is done by sending a SPARQL query with the following configuration data:

* the name of the connector instance (e.g., my_index);
* classes to synchronise;
* properties to synchronise.

{tip:title=What we recommend}
Use the GraphDB Connectors management interface provided by the GraphDB Workbench, as it lets you create the configuration easily and then either create the connector instance directly or copy the configuration and execute it elsewhere.
{tip}

The create command is triggered by a SPARQL *INSERT* with the *createConnector* predicate. For example, the following creates a connector instance called *my_index*, which synchronises the wines from the sample data above:

{div:style=width: 70em}{noformat}
{noformat}{div}

The above command creates a new Lucene connector instance.

The "types" key defines the RDF type of the entities to synchronise. In the example, these are only entities of the type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Lucene. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, three pieces of information are mapped: the grape the wine is made from, its sugar content, and its year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.
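The mapping described above can be sketched as the following create command. This is a minimal sketch rather than the exact original listing: the prefix label inst: is an illustrative choice for the instance namespace, and the property names hasSugar and hasYear are assumptions, since only madeFromGrape and rdfs:label are named in the text.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

# Assumption: wine#hasSugar and wine#hasYear are hypothetical property names
# standing in for the single-property chains behind "sugar" and "year".
INSERT DATA {
  inst:my_index :createConnector '''
{
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasSugar"
      ]
    },
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ]
    }
  ]
}
''' .
}
```

The instance name sits in the subject position, the command predicate comes from the connector prefix, and the whole JSON configuration is passed as a single string literal.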



h2. Dropping a connector instance

Dropping a connector instance removes all references to its external store from GraphDB as well as all Lucene files associated with it.

The drop command is triggered by a SPARQL *INSERT* with the *dropConnector* predicate, where the name of the connector instance must be in the subject position. For example, the following removes the connector instance *my_index*:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
{noformat}{div}
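A complete drop command might look like the sketch below. The prefix label inst: is an illustrative choice, and the empty string in the object position is an assumption about the expected value (the text only specifies the predicate and the subject).

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

# The instance to drop is the subject; the object is assumed to be unused.
INSERT DATA {
  inst:my_index :dropConnector "" .
}
```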

h2. Listing available connector instances

Listing connector instances returns all previously created connector instances. It is a *SELECT* query with the *listConnectors* predicate:

{div:style=width: 70em}{noformat}
{noformat}{div}
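A minimal sketch of such a listing query, using the variable names referred to in this section:

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>

# One row is returned per existing connector instance.
SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}
```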

*?cntUri* is bound to the prefixed URI of the connector instance that was used during creation, e.g., <http://www.ontotext.com/connectors/lucene/instance#my_index>, while *?cntStr* is bound to a string representing the part after the prefix, e.g., "my_index".

h2. Instance status check

The internal state of each connector instance can be queried using a *SELECT* query and the *connectorStatus* predicate:

{div:style=width: 70em}{noformat}
{noformat}{div}
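A minimal sketch of a status query, using the variable names referred to in this section:

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>

# Returns the status string of every connector instance.
SELECT ?cntUri ?cntStatus {
  ?cntUri :connectorStatus ?cntStatus .
}
```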

*?cntUri* is bound to the prefixed URI of the connector instance, while *?cntStatus* is bound to a string representation of the status of the connector instance identified by this URI. The status is key-value based.


h1. Working with data

h2. Adding, updating and deleting data

From the user's point of view, all synchronisation happens transparently without using any additional predicates or naming a specific store explicitly, i.e., you should simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.

h2. Simple queries

Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a *SELECT* and providing the Lucene query as the object of the *query* predicate:

{div:style=width: 70em}{noformat}
{noformat}{div}
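A minimal sketch of such a query against the my_index instance, searching the "grape" field for "cabernet". The prefix label inst: and the variable name ?search are illustrative choices.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;          # query instance of the connector instance
      :query "grape:cabernet" ;      # Lucene query using the field name "grape"
      :entities ?entity .            # matching entities
}
```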

The result binds *?entity* to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.

{note}
Note that you should use the field names you chose when you created the connector instance. They can be identical to the property URIs, but then you have to escape any special characters according to what Lucene expects.
{note}

The query proceeds in three steps:

# Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector instance URI. X is bound to a query instance of the connector instance.
# Assign a query to the query instance by using the system predicate :query.
# Request the matching entities through the :entities predicate.

It is also possible to provide per-query search options by using one or more option predicates. The option predicates are described in detail below.


h3. Combining Lucene results with GraphDB data

The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they were made:

{div:style=width: 70em}{noformat}
{noformat}{div}
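A sketch of such a combined query follows. The prefix labels inst: and wine: are illustrative choices, and hasYear is an assumed property name for the year of a wine; madeFromGrape is the property named earlier in the text.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape ?year {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  # Ordinary triple patterns evaluated by GraphDB on the matched entities.
  ?entity wine:madeFromGrape ?grape .
  ?entity wine:hasYear ?year .       # assumption: hypothetical property name
}
```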

The result looks like this:

|| ?entity || ?grape || ?year ||
| :Franvino | :CabernetFranc | 2012 |

{note}
Note that :Franvino is returned twice because it is made from two different grapes, both of which are returned.
{note}

h3. Entity match score

It is possible to access the match score returned by Lucene with the *score* predicate. As each entity has its own score, the predicate should come at the entity level. For example:

{div:style=width: 70em}{noformat}
{noformat}{div}
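A minimal sketch of a query that retrieves the score, with the *score* predicate attached to the entity rather than to the query instance. The prefix label inst: is an illustrative choice.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?score {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :score ?score .            # score is per entity
}
```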

The result looks like this, but the actual score might be different as it depends on the specific Lucene version:

|| ?entity || ?score ||
| :Franvino | 0.7554128170013428 |

h2. Basic facet queries

Consider the sample wine data and the my_index connector instance described previously. You can also query facets using the same instance:

{div:style=width: 70em}{noformat}
{noformat}{div}
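A sketch of a facet query over the "year" and "sugar" fields. The prefix label inst: is an illustrative choice, and omitting the *query* predicate to match all documents is an assumption about the connector's behaviour.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetName ?facetValue ?facetCount {
  # Assumption: with no :query predicate, all documents are matched.
  ?search a inst:my_index ;
      :facetFields "year,sugar" ;    # comma-delimited list of field names
      :facets _:f .                  # blank node, one per facet entry
  _:f :facetName ?facetName ;
      :facetValue ?facetValue ;
      :facetCount ?facetCount .
}
```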

It is important to specify the facet fields by using the *facetFields* predicate. Its value is a simple comma-delimited list of field names. In order to get the faceted results, use the *facets* predicate. As each facet has three components (name, value and count), the *facets* predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates *facetName*, *facetValue*, and *facetCount*.

The resulting bindings look like the following:

|| facetName || facetValue || facetCount ||
| sugar | medium | 2 |

You can easily see that there are three wines produced in 2012 and two in 2013. You can also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines, as each facet is computed independently.

h2. Advanced faceting and aggregations

While basic faceting allows for simple counting of documents based on the discrete values of a particular field, there are more complex faceted or aggregation searches in Lucene. The connector provides a mapping from Lucene results to RDF results but no mechanism for specifying the queries other than executing a [raw query|#Raw queries].


h2. Sorting

It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the *orderBy* predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:

{div:style=width: 70em}{noformat}
{noformat}{div}
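A minimal sketch of a sorted query: wines from 2013, ordered by the "sugar" field in descending order. The prefix label inst: is an illustrative choice.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "year:2013" ;
      :orderBy "-sugar" ;            # leading minus = descending order
      :entities ?entity .
}
```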

The result contains wines produced in 2013 sorted according to their sugar content in descending order:

|| entity ||

By default, entities are sorted according to their matching score in descending order.

{note}
Note that if you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble the order. To remedy this, use ORDER BY from SPARQL.
{note}

{tip:title=Sorting by textual fields}
Sorting by an analysed textual field works but might produce unexpected results. Analysed textual fields are composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, "North America" will be sorted before "Europe" because the token "america" is lexicographically smaller than the token "europe". If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed":false. For more information, see [#Copy fields].
{tip}


h2. Limit and offset

Limit and offset are supported on the Lucene side of the query. This is achieved through the predicates *limit* and *offset*. Consider this example, in which an offset of 1 and a limit of 1 are specified:

{div:style=width: 70em}{noformat}
{noformat}{div}
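A sketch of a query with an offset of 1 and a limit of 1. The prefix label inst: is an illustrative choice, and the Lucene query string "sugar:dry" is an assumption used only to have something to page through.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "sugar:dry" ;           # assumption: illustrative query string
      :offset 1 ;                    # skip the first match
      :limit 1 ;                     # return at most one match
      :entities ?entity .
}
```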

The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino is second in the list:

|| entity ||
| Blanquito |

{note}
Note that the specific order in which GraphDB returns the results depends on how Lucene returns the matches, unless sorting is specified.
{note}

h2. Snippet extraction

Snippet extraction is used to extract highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate *snippets*. It binds a blank node that in turn provides the actual snippets via the predicates *snippetField* and *snippetText*. The predicate *snippets* must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:

{div:style=width: 70em}{noformat}
{noformat}{div}
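A minimal sketch of a snippet query for a search on "cabernet" in the "grape" field. The prefix label inst: is an illustrative choice.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :snippets _:s .            # snippets hang off the entity
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
```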

The query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:

|| ?entity || ?snippetField || ?snippetText ||
| :Franvino | grape | <em>Cabernet</em> Franc |

{note}
Note that the actual snippets might be somewhat different as this depends on the specific Lucene implementation.
{note}

It is possible to tweak how the snippets are collected/composed by using the following option predicates:

* *:snippetSize* sets the maximum size of the extracted text fragment, 250 by default;
* *:snippetSpanOpen* sets the text to insert before the highlighted text, <em> by default;
* *:snippetSpanClose* sets the text to insert after the highlighted text, </em> by default.

The option predicates are set on the connector query instance, much like the :query predicate.
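As a sketch of how the option predicates combine with a snippet query, the following sets all three on the query instance. The prefix label inst: and the chosen option values (100, <b>, </b>) are illustrative assumptions.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :snippetSize 100 ;             # shorter fragments than the default 250
      :snippetSpanOpen "<b>" ;       # replaces the default <em>
      :snippetSpanClose "</b>" ;     # replaces the default </em>
      :entities ?entity .
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
```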

h2. Total hits

You can get the total number of hits by using the *totalHits* predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:

{div:style=width: 70em}{noformat}
{noformat}{div}
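A minimal sketch of a total-hits query for wines made in 2012. The prefix label inst: is an illustrative choice.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?totalHits {
  ?search a inst:my_index ;
      :query "year:2012" ;
      :totalHits ?totalHits .        # set on the query instance, not per entity
}
```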

As there are three wines made in 2012, the value 3 (of type xsd:long) is bound to ?totalHits.

h1. List of creation parameters

The creation parameters define how a connector instance is created by the :createConnector predicate. Some parameters are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.

All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.


h3. analyzer (string), optional, specifies Lucene analyzer

The Lucene Connector supports custom Analyzer implementations. They can be specified via the *analyzer* parameter, whose value must be the fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer. The class requires either a default constructor or a constructor with exactly one parameter of type org.apache.lucene.util.Version. For example, these two classes are valid implementations:

{code}
package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;

public class FancyAnalyzer extends Analyzer {
    public FancyAnalyzer() {
        ...
    }
    ...
}
{code}

{code}
package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;

public class SmartAnalyzer extends Analyzer {
    public SmartAnalyzer(Version luceneVersion) {
        ...
    }
    ...
}
{code}

FancyAnalyzer and SmartAnalyzer can then be used by specifying their fully qualified names, for example:

{div:style=width: 70em}{noformat}
...
"analyzer": "com.ontotext.example.SmartAnalyzer",
...
{noformat}{div}

h3. types (list of URI), required, specifies the types of entities to sync

The RDF types of entities to sync are specified as a list of URIs. At least one type URI is required.

h3. languages (list of string), optional, valid languages for literals

RDF data is often multilingual, but you can map only some of the languages represented in the literal values. This is done by specifying a list of language ranges to be matched against the language tags of literals according to RFC 4647, Section 3.3.1, Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges maps all existing literals that have matching language tags.

h3. fields (list of field object), required, defines the mapping from RDF to Lucene

The fields define exactly which parts of each entity are synchronised as well as the specific details on the connector side. The field is the smallest synchronisation unit and it maps a property chain from GraphDB to a field in Lucene. The fields are specified as a list of field objects. At least one field object is required. Each field object has further keys that specify details.

h4. fieldName (string), required, name of the field in Lucene

The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name, but to avoid unnecessary escaping (which depends on how Lucene parses its queries), we recommend keeping the field names simple.

h4. propertyChain (list of URI), required, defines the property chain to reach the value

The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple, and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs where at least one URI must be provided.

See [#Copy fields] for defining multiple fields with the same property chain.

h4. defaultValue (string), optional, specifies a default value for the field

The default value (defaultValue) provides a means of specifying a default value for the field when the property chain has no matching values in GraphDB. The default value can be a plain literal, a literal with a datatype (xsd: prefix supported), a literal with language, or a URI. It has no default value.

h4. indexed (boolean), optional, default true

If indexed, a field is available for Lucene queries. True by default.

This option corresponds to Lucene's field option "indexed".

h4. stored (boolean), optional, default true

Fields can be stored in Lucene and this is controlled by the Boolean option "stored". Stored fields are required for retrieving snippets. True by default.

This option corresponds to Lucene's property "stored".

h4. analyzed (boolean), optional, default true

When literal fields are indexed in Lucene, they are analysed according to the analyser settings. If a given field should not be analysed, set "analyzed" to false. This option has no effect for URIs (they are never analysed). True by default.

This option corresponds to Lucene's property "tokenized".

h4. multivalued (boolean), optional, default true

RDF properties and synchronised fields may have more than one value. If "multivalued" is set to true, all values are synchronised to Lucene. If set to false, only a single value is synchronised. True by default.

h4. facet (boolean), optional, default true

Lucene needs to index data in a special way if it is to be used for faceted search. This is controlled by the Boolean option "facet". True by default. Fields that are not synchronised for faceting are not available for faceted search.

h4. datatype (string), optional, manual datatype override

By default, the Lucene GraphDB Connector uses the datatype of literal values to determine how they should be mapped to Lucene types. For more information on the supported datatypes, see [#Datatype mapping].

The datatype mapping can be overridden through the parameter "datatype", which can be specified per field. The value of "datatype" can be any of the xsd: types supported by the automatic mapping.

h3. Copy fields

Often, it is convenient to synchronise the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs full-text search. The Lucene GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:

{div:style=width: 70em}{noformat}
...
"fields": [
"analyzer": "com.ontotext.example.SmartAnalyzer", {
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
],
"analyzed": true,
},
{
"fieldName": "grapeFacet",
"propertyChain": [
"@grape"
],
"analyzed": false,
}
]
...
{noformat}{div}

The snippet creates an analysed field "grape" and a non-analysed field "grapeFacet". Both fields are populated with the same values, and "grapeFacet" is defined as a copy field that refers to the field "grape".

{note}
Note that the connector handles copy fields more efficiently than a field that is defined with exactly the same property chain as another field.
{note}

h1. Datatype mapping

The Lucene GraphDB Connector maps different types of RDF values to different types of Lucene values according to the basic type of the RDF value (URI or literal) and the datatype of literals. The autodetection uses the following mapping:

|| RDF value || RDF datatype || Lucene type ||
| URI | n/a | StringField |
| literal | none | TextField |
| literal | xsd:boolean | StringField with values "true" and "false" |
| literal | xsd:double | DoubleField |
| literal | xsd:float | FloatField |
| literal | xsd:long | LongField |
| literal | xsd:int | IntField |
| literal | xsd:dateTime | DateTools.timeToString(), second precision |
| literal | xsd:date | DateTools.timeToString(), day precision |

The datatype mapping can be affected by the synchronisation options too, e.g., a non-analysed field that has xsd:long values is indexed with a StringField.



{note}
Note that for any given field the automatic mapping uses the first value it sees. This works fine for clean datasets but might lead to problems if your dataset has non-normalised data, e.g., the first value has no datatype but other values have one.
{note}
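
The table and the first-value rule above can be illustrated with a small lookup sketch. It is not the connector's implementation: the class and method names are hypothetical, Lucene types are represented as plain strings taken from the table, and datatypes that the table does not list are reported as "unspecified" rather than guessed.

{code:language=java}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DatatypeMappingSketch {

    // Literal datatypes and their Lucene types, copied from the table above.
    private static final Map<String, String> LITERAL_TYPES = new LinkedHashMap<>();
    static {
        LITERAL_TYPES.put("xsd:boolean", "StringField");
        LITERAL_TYPES.put("xsd:double", "DoubleField");
        LITERAL_TYPES.put("xsd:float", "FloatField");
        LITERAL_TYPES.put("xsd:long", "LongField");
        LITERAL_TYPES.put("xsd:int", "IntField");
        LITERAL_TYPES.put("xsd:dateTime", "DateTools.timeToString(), second precision");
        LITERAL_TYPES.put("xsd:date", "DateTools.timeToString(), day precision");
    }

    // datatype == null stands for a literal without a datatype.
    public static String luceneType(boolean isUri, String datatype) {
        if (isUri) {
            return "StringField";
        }
        if (datatype == null) {
            return "TextField";
        }
        // The table does not cover other datatypes, so none is assumed here.
        return LITERAL_TYPES.getOrDefault(datatype, "unspecified");
    }

    // Mimics the autodetection rule from the note: the first value decides.
    public static String detectFieldType(List<String> datatypesInArrivalOrder) {
        return luceneType(false, datatypesInArrivalOrder.get(0));
    }

    public static void main(String[] args) {
        System.out.println(luceneType(true, null));        // StringField
        System.out.println(luceneType(false, "xsd:long")); // LongField
        // The first value has no datatype, so the field is detected as TextField
        // even though the second value on its own would map to IntField.
        System.out.println(detectFieldType(Arrays.asList(null, "xsd:int"))); // TextField
    }
}
{code}

The last call shows why non-normalised data is risky with autodetection. An explicit "datatype" setting on the field avoids depending on the order in which values arrive.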

h1. Advanced filtering and fine tuning

h3. entityFilter (string)

The _entityFilter_ parameter is used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values are synchronised to Lucene if, and only if, they pass the filter. The entity filter is similar to a FILTER() inside a SPARQL query but not exactly the same. Each configured field can be referred to in the entity filter by prefixing it with a "?", much like referring to a variable in SPARQL. Several operators are supported:

|| Operator || Meaning || Example ||
| ?var in (_value1_, _value2_, ...) | Tests if the field _var_'s value is one of the specified values. Values that do not match are treated as if they were not present in the repository. | {nf}?status in ("active", "new"){nf} |
| ?var not in (_value1_, _value2_, ...) | The negated version of the in-operator. | {nf}?status not in ("archived"){nf} |
| bound(?var) | Tests if the field _var_ has a valid value. This can be used to make the field compulsory. | bound(?name) |
| ( expr ) | Grouping of expressions | {nf}(bound(?name) || bound(?company)) && bound(?address){nf} |

In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:

h4. Accessing the previous element in the chain

The construction *parent(?var)* is used to go to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., *parent(parent(parent(?var)))* goes back in the chain three times. The effective value of *parent(?var)* can be used with the *in* or *not in* operator like this: {nf}parent(?company) in (<urn:a>, <urn:b>){nf}.

h4. Accessing an element beyond the chain

The construction *?var -> _uri_* (alternatively *?var o _uri_* or just *?var _uri_*) is used to access additional values that are accessible through the property _uri_. In essence, this construction corresponds to the triple pattern _value_ _uri_ ?effectiveValue, where _value_ is a value bound by the field _var_. The effective value of *?var -> _uri_* can be used with the *in* or *not in* operator like this: {nf}?company -> rdf:type in (<urn:c>, <urn:d>){nf}. It can be combined with *parent()* like this: {nf}parent(?company) -> rdf:type in (<urn:c>, <urn:d>){nf}.

The URI parameter can be a full URI within < > or the special string _rdf:type_ (alternatively just _type_), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.

h4. Filtering by RDF graph

The construction *graph(?var)* is used to access the RDF graph of a field's value. The typical use case is to sync only explicit values: {nf}graph(?a) not in (<http://www.ontotext.com/implicit>){nf}. The construction can be combined with *parent()* like this: {nf}graph(parent(?a)) in (<urn:a>){nf}.

h4. Entity filters and default values
Entity filters can be combined with default values in order to get more flexible behaviour.

A typical use-case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.

h3. Basic entity filter example

Given the following RDF data:

{div:style=width: 70em}{noformat}
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example#> .

# the entity below will be synchronised because it has a matching value for city: ?city in ("London")
:alpha
rdf:type :gadget ;
:name "John Synced" ;
:city "London" .

# the entity below will not be synchronised because it lacks the property completely: bound(?city)
:beta
rdf:type :gadget ;
:name "Peter Syncfree" .

# the entity below will not be synchronised because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
:gamma
rdf:type :gadget ;
:name "Mary Syncless" ;
:city "Liverpool" .
{noformat}{div}

If you create a connector instance such as:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
{noformat}{div}

The entity :beta is not synchronised as it has no value for _city_.

To handle such cases, you can modify the connector configuration to specify a default value for _city_:

{div:style=width: 70em}{noformat}
...
{noformat}{div}

The default value is used for the entity :beta as it has no value for city in the repository. As the value is "London", the entity is synchronised.
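
The interaction between the filter and the default value can be sketched as follows. This is a simplified model, not the connector's code: the class and method names are hypothetical, and it assumes that the entity filter combines {nf}?city in ("London"){nf} with {nf}bound(?city){nf}, as suggested by the comments in the RDF data above.

{code:language=java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EntityFilterSketch {

    // Values accepted by the assumed filter condition: ?city in ("London")
    private static final List<String> ALLOWED_CITIES = Collections.singletonList("London");

    // cityValues: the values of the field "city" found in the repository.
    // defaultCity: the configured default value, or null if none is configured.
    public static boolean isSynchronised(List<String> cityValues, String defaultCity) {
        List<String> effective = new ArrayList<>(cityValues);
        // The default value is used only when the repository has no value at all.
        if (effective.isEmpty() && defaultCity != null) {
            effective.add(defaultCity);
        }
        // ?city in ("London"): values that do not match are treated as absent.
        effective.removeIf(city -> !ALLOWED_CITIES.contains(city));
        // bound(?city): at least one valid value must remain.
        return !effective.isEmpty();
    }

    public static void main(String[] args) {
        List<String> alpha = Arrays.asList("London");
        List<String> beta = Collections.emptyList();
        List<String> gamma = Arrays.asList("Liverpool");

        System.out.println(isSynchronised(alpha, null));     // true
        System.out.println(isSynchronised(beta, null));      // false: no value for city
        System.out.println(isSynchronised(gamma, null));     // false: "Liverpool" is filtered out
        System.out.println(isSynchronised(beta, "London"));  // true: the default value passes the filter
        System.out.println(isSynchronised(gamma, "London")); // false: the default is not used, a value exists
    }
}
{code}

The last line shows that a default value does not rescue :gamma, because the entity already has a value for city in the repository.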

h3. Advanced entity filter example

Sometimes data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:

{div:style=width: 70em}{noformat}
{noformat}{div}

Now, if you want to map this data to Lucene so that the property *:taggedWith _x_* is mapped to separate fields *taggedWithPerson* and *taggedWithLocation* according to the type of _x_ (we are not interested in events), you can map :taggedWith twice to different fields and then use an entity filter to get the desired values:

{div:style=width: 70em}{noformat}
{noformat}{div}

{note}
Note that *type* is the short way to write <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
{note}

The six articles in the RDF data above will be mapped as such:
{noformat}{div}

If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:

|| ?facetName || ?facetValue || ?facetCount ||
h1. Overview of connector predicates

The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labelled arrows.

{plantuml}
scale 0.85
left to right direction

{plantuml}



h1. Caveats

h2. Order of control

Even though SPARQL per se is not sensitive to the order of triple patterns, the Lucene GraphDB Connector expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.

The diagram in [#Overview of connector predicates] provides a quick overview of the predicates.
h1. Migrating from a pre-6.2 version

GraphDB prior to 6.2 shipped with a version of the Lucene GraphDB Connector that had different options and slightly different behaviour and internals. Unfortunately, it is not possible to migrate existing connector instances automatically. To prevent any data loss, the Lucene GraphDB Connector will not initialise if it detects an existing connector in the old format. The recommended way to migrate your existing instances is:

# back up the INSERT statement used to create the connector instance;
# drop the connector;
# deploy the new GraphDB version;
# modify the INSERT statement according to the changes described below;
# re-create the connector instance with the modified INSERT statement.

You might also need to change your queries to reflect any changes in field names or extra fields.

h2. Changes in field configuration and synchronisation

Prior to 6.2, a single field in the config could produce up to three individual fields on the Lucene side, based on the field options. For example, for the field "firstName":

|| field || note ||
| firstName | produced, if the option "index" was true; used explicitly in queries |
| _facet_firstName | produced, if the option "facet" was true; used implicitly for facet search |
| _sort_firstName | produced, if the option "sort" was true; used implicitly for ordering connector results |

The current version always produces a single Lucene field per field definition in the configuration. This means that you have to create all appropriate fields based on your needs. See more under [#Creation parameters].


{tip}
To mimic the functionality of the old _sort_fieldName fields, you can create a non-analysed [copy field|#Copy fields] (for textual fields) or just use the normal field (for non-textual fields).

{tip}



h1. Migrating from the Lucene4 plugin

{code:language=bash}
java -jar migration.jar --file <input-file> <output-file>
{code}
where *input-file* is your old SPARQL file and *output-file* is the new SPARQL file.

You can find possible options with:
{code:language=bash}
java -jar migration.jar --help
{code}

h3. Select queries using the index
We have changed the syntax of the search queries to accommodate new features and a better design. Here is an example query using the Lucene4 plugin:

{div:style=width: 70em}{noformat}
{noformat}{div}

and here is the connector variant:

{div:style=width: 70em}{noformat}
{noformat}{div}

{note}
Note the following changes:

* We are using special predicates for everything - no more key-value options in a string;
* The query is actually an instance of the index;
* Snippets belong to the entity;
* Snippets are now first-class objects - you can also get the field of the match;
* Indexes are now an instance of another namespace. This allows you to create indexes with the name "entities", for example.
{note}

For more information on the new syntax and how everything is linked together, see [#Overview of connector predicates].