Lucene GraphDB Connector

Version 1 by Pavel Mihaylov
on Jul 02, 2015 17:57.

compared with
Current by Pavel Mihaylov
on Oct 14, 2015 19:17.

Changes (12)

See [#Copy fields] for defining multiple fields with the same property chain.

See [#Multiple property chains per field] for defining a field whose values are populated from more than one property chain.

h4. defaultValue (string), optional, specifies a default value for the field



h2. Special field definitions

This section provides an overview of additional ways to define a field besides the regular field definitions composed of a field name and a property chain. The following methods are applicable in specific use cases.

h3. Copy fields

{note}

h3. Multiple property chains per field

Sometimes you have to work with data models that define the same concept (in terms of what you want to index in Lucene) with more than one property chain. For example, the concept of "name" could be defined as a single canonical name, multiple historical names, and some unofficial names. If you want to index those together as a single field in Lucene, you can define the field with multiple property chains.

Fields with multiple property chains are defined as a set of separate _virtual_ fields that will be merged into a single _physical_ field when indexed. Virtual fields are distinguished by the suffix {nf}/xyz{nf}, where xyz is any alphanumeric sequence of convenience. For example, we can define the fields *name/1* and *name/2* like this:

{div:style=width: 70em}{noformat}
{
    ...
    "fields": [
        {
            "fieldName": "name/1",
            "propertyChain": [
                "http://www.ontotext.com/example#canonicalName"
            ]
        },
        {
            "fieldName": "name/2",
            "propertyChain": [
                "http://www.ontotext.com/example#historicalName"
            ]
        },
        ...
    ],
    ...
}
{noformat}

The values of the fields *name/1* and *name/2* will be merged and synchronised to the field *name* in Lucene.

{note}
You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined *myField/new* and *myField/old*, you cannot have a field called just *myField*.
{note}
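
The merging rule and the naming restriction described above can be modelled in a few lines of Python. This is only an illustrative sketch, not connector code: the function name {{merge_virtual_fields}} and the sample values are hypothetical, and the sketch assumes that merged values are simply concatenated in definition order.

```python
from collections import defaultdict


def merge_virtual_fields(values_by_field):
    """Merge virtual fields (name/1, name/2, ...) into physical fields (name).

    values_by_field maps each configured field name to the list of values
    collected through that field's property chain.
    """
    physical = defaultdict(list)
    suffixed, unsuffixed = set(), set()

    for field_name, values in values_by_field.items():
        # "name/1" -> base "name"; a plain "name" has no separator.
        base, sep, _suffix = field_name.partition("/")
        (suffixed if sep else unsuffixed).add(base)
        physical[base].extend(values)

    # A field name may be used with suffixes or without, never both.
    clash = suffixed & unsuffixed
    if clash:
        raise ValueError(f"suffixed and unsuffixed fields mixed: {sorted(clash)}")

    return dict(physical)


merged = merge_virtual_fields({
    "name/1": ["Acme Corporation"],          # hypothetical canonical name
    "name/2": ["Acme Ltd", "Acme & Sons"],   # hypothetical historical names
})
print(merged)  # {'name': ['Acme Corporation', 'Acme Ltd', 'Acme & Sons']}
```

Passing both *myField/new* and *myField* to this function raises an error, mirroring the restriction in the note.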

h4. Filters and fields with multiple property chains

Filters can be used with fields defined with multiple property chains. Both the physical field values and the individual virtual field values are available:
* Physical fields are specified without the suffix, e.g. ?myField
* Virtual fields are specified with the suffix, e.g. ?myField/2 or ?myField/alt.

{note:title=Limitation}
Physical fields cannot be combined with parent() as their values come from different property chains. If you really need to filter the same parent level, you can rewrite {nf}parent(?myField) in (<urn:x>, <urn:y>){nf} as {nf}parent(?myField/1) in (<urn:x>, <urn:y>) || parent(?myField/2) in (<urn:x>, <urn:y>) || parent(?myField/3) ...{nf} and surround it with parentheses if it is part of a bigger expression.
{note}
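
If a field has many virtual fields, the rewritten filter can be generated rather than typed by hand. The following Python sketch builds the expanded expression as a string; the helper name {{expand_parent_filter}} is hypothetical and the connector itself provides no such function. The result is always wrapped in parentheses so that it can be embedded in a bigger expression.

```python
def expand_parent_filter(field, suffixes, uris):
    """Rewrite parent(?field) in (...) as a disjunction over virtual fields."""
    uri_list = ", ".join(f"<{uri}>" for uri in uris)
    clauses = [f"parent(?{field}/{suffix}) in ({uri_list})" for suffix in suffixes]
    # Parenthesise the whole disjunction so it is safe inside larger filters.
    return "(" + " || ".join(clauses) + ")"


print(expand_parent_filter("myField", ["1", "2"], ["urn:x", "urn:y"]))
# (parent(?myField/1) in (<urn:x>, <urn:y>) || parent(?myField/2) in (<urn:x>, <urn:y>))
```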

h1. Datatype mapping

| ( expr ) | Grouping of expressions | {nf}(bound(?name) || bound(?company)) && bound(?address){nf} |

{note}
* *?var in (...)* filters the values of ?var and leaves only the matching values, i.e. it will modify the actual data that will be synchronised to Lucene
* *bound(?var)* checks if there is any valid value left after filtering operators like *?var in (...)* have been applied
{note}
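
The difference between the two operators can be pictured with a simplified Python model. It only mirrors the semantics described in the note and is not how the connector is implemented; the function names and the sample URIs are hypothetical.

```python
def apply_in(values, allowed):
    """Model of `?var in (...)`: keep only the values listed in `allowed`."""
    return [value for value in values if value in allowed]


def bound(values):
    """Model of `bound(?var)`: true if at least one value is left."""
    return len(values) > 0


status = ["urn:active", "urn:archived"]

# The `in` operator changes what would be synchronised to Lucene.
kept = apply_in(status, {"urn:active"})
print(kept, bound(kept))        # ['urn:active'] True

# If nothing survives the filtering, bound() evaluates to false.
nothing = apply_in(status, {"urn:deleted"})
print(nothing, bound(nothing))  # [] False
```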

In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:

h4. Accessing the previous element in the chain

The construction *parent(?var)* is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., *parent(parent(parent(?var)))* goes back in the chain three times. The effective value of *parent(?var)* can be used with the *in* or *not in* operator like this: {nf}parent(?company) in (<urn:a>, <urn:b>){nf}, or in the *bound* operator like this: {nf}parent(bound(?var)){nf}.

h4. Accessing an element beyond the chain

The construction *?var -> _uri_* (alternatively *?var o _uri_* or just *?var _uri_*) is used to access additional values that are accessible through the property _uri_. In essence, this construction corresponds to the triple pattern _value_ _uri_ ?effectiveValue, where _value_ is a value bound by the field ?var. The effective value of ?var -> _uri_ can be used with the *in* or *not in* operator like this: {nf}?company -> rdf:type in (<urn:c>, <urn:d>){nf}. It can be combined with parent() like this: {nf}parent(?company) -> rdf:type in (<urn:c>, <urn:d>){nf}. The same construction can be applied to the *bound* operator like this: {nf}bound(?company -> <urn:hasBranch>){nf}, or even combined with parent() like this: {nf}bound(parent(?company) -> <urn:hasGroup>){nf}.

The URI parameter can be a full URI within < > or the special string _rdf:type_ (alternatively just _type_), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
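
The expansion rule for the URI parameter can be sketched as follows. The Python function below is a hypothetical illustration of the rule as stated above, not the connector's actual parser, and it assumes that any other input is rejected.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand_uri_parameter(token):
    """Return the full URI for a filter URI parameter."""
    # The special strings rdf:type and type expand to the RDF type URI.
    if token in ("rdf:type", "type"):
        return RDF_TYPE
    # Anything else must be a full URI written within < >.
    if len(token) > 2 and token.startswith("<") and token.endswith(">"):
        return token[1:-1]
    raise ValueError(f"not a valid URI parameter: {token!r}")


print(expand_uri_parameter("type"))             # http://www.w3.org/1999/02/22-rdf-syntax-ns#type
print(expand_uri_parameter("<urn:hasBranch>"))  # urn:hasBranch
```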
The diagram in [#Overview of connector predicates] provides a quick overview of the predicates.

h1. Upgrading from previous versions

No special procedures are required for upgrading from:
* GraphDB 6.2 / Lucene Connector 4.0
* GraphDB 6.3 / Lucene Connector 4.1
* GraphDB 6.4 / Lucene Connector 4.1

h3. Migrating from a pre-6.2 version of GraphDB

GraphDB prior to 6.2 shipped with version 3.x of the Lucene GraphDB Connector, which had different options and slightly different behaviour and internals. Unfortunately, it is not possible to migrate existing connector instances automatically. To prevent any data loss, the Lucene GraphDB Connector will not initialise if it detects an existing connector in the old format. The recommended way to migrate your existing instances is:

# back up the INSERT statement used to create the connector instance;
# drop the connector;
You can easily migrate your existing [lucene4 plugin|https://confluence.ontotext.com/display/EM/Lucene4+OWLIM+Plug-in] setup to the new connectors interface.

h3. Create index queries

We provide an automated migration tool for your create index queries. The tool is distributed with GraphDB 6.0 onward and can be found in the tools subdirectory. Here is how to use it:

{code:language=bash}
java -jar migration.jar --file <input-file> <output-file>
{code}
where *input-file* is your old SPARQL file and *output-file* is the new SPARQL file.

You can find possible options with:
{code:language=bash}
java -jar migration.jar --help
{code}

h3. Select queries using the index
We have changed the syntax of the search queries to accommodate new features and a better design. Here is an example query using the lucene4 plugin: