GraphDB-SE Lucene Connector

All examples use the following sample data, which describes five fictitious wines (Yoyowine, Franvino, Noirette, Blanquito and Rozova) as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.

{div:style=width: 70em}{noformat}
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:hasSugar "medium" ;
:hasYear "2013"^^xsd:integer .
{noformat}{div}

h1. Setup and maintenance
The create command is triggered by a SPARQL *INSERT* with the *createConnector* predicate. For example, the following creates a connector instance called *my_index* that synchronises the wines from the sample data above:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
''' .
}
{noformat}{div}
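
The listing above is abbreviated. As a rough sketch, a complete create command could look like the one below. The field names (grape, sugar, year), the wine namespace <http://www.ontotext.com/example/wine#>, the class Wine, the property madeFromGrape and the option values are assumptions chosen to fit the queries used later on this page. They are not a definitive configuration:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

# Assumed layout: one Lucene field per wine property.
# The class, property and namespace URIs below are illustrative only.
INSERT DATA {
  inst:my_index :createConnector '''
{
  "types": ["http://www.ontotext.com/example/wine#Wine"],
  "fields": [
    { "fieldName": "grape",
      "propertyChain": ["http://www.ontotext.com/example/wine#madeFromGrape",
                        "http://www.w3.org/2000/01/rdf-schema#label"] },
    { "fieldName": "sugar",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"],
      "analyzed": false },
    { "fieldName": "year",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"],
      "analyzed": false }
  ]
}
''' .
}
{noformat}{div}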


The drop command is triggered by a SPARQL *INSERT* with the *dropConnector* predicate, where the name of the connector instance must be in the subject position. For example, the following removes the connector *inst:my_index*:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
  inst:my_index :dropConnector "" .
}
{noformat}{div}

h2. Listing available connector instances
Listing connector instances returns all previously created instances. It is a *SELECT* query with the *listConnectors* predicate:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}
{noformat}{div}

*?cntUri* will be bound to the prefixed URI of the connector instance that was used during creation, e.g. <http://www.ontotext.com/connectors/lucene/instance#my_index>, while *?cntStr* will be bound to a string, representing the part after the prefix, e.g. "my_index".
The internal state of each connector instance can be queried using a *SELECT* query and the *connectorStatus* predicate:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStatus {
  ?cntUri :connectorStatus ?cntStatus .
}
{noformat}{div}

*?cntUri* will be bound to the prefixed URI of the connector instance, while *?cntStatus* will be bound to a string representation of the status of the connector represented by this URI. The status is key-value based.
Once a connector instance has been created, it will be possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a *SELECT* and providing the Lucene query as the object of the *:query* predicate:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
:entities ?entity .
}
{noformat}{div}
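
Only part of the query is shown above. A complete query along these lines might read as follows. This is a sketch that assumes the index has a field named grape, that a search is expressed as a resource typed with the connector instance, and that ?search is an arbitrary variable name for that search:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  # ?search stands for this particular search against the my_index instance
  ?search a inst:my_index ;
          # Lucene query string; the field name "grape" is an assumption
          :query "grape:cabernet" ;
          # each matching document subject is bound to ?entity
          :entities ?entity .
}
{noformat}{div}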

The result will bind ?entity to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.
The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB. For example, to see the actual grapes in the matching wines as well as the year they were made:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
?entity wine:hasYear ?year
}
{noformat}{div}
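
As a hedged illustration of combining the search with ordinary triple patterns, the full query could take the shape below. The wine: namespace and the property wine:madeFromGrape are assumptions, and so is the Lucene query string. Only wine:hasYear appears in the listing above:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
# Assumed namespace for the sample wine data
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape ?year {
  # Lucene part: find the matching wines
  ?search a inst:my_index ;
          :query "grape:cabernet" ;
          :entities ?entity .
  # GraphDB part: fetch additional data for each match
  ?entity wine:madeFromGrape ?grape .
  ?entity wine:hasYear ?year
}
{noformat}{div}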

The result will look like this:
It is possible to access the match score returned by Lucene with the *:score* predicate. As each entity has its own score, the predicate must come at the entity level. For example:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
?entity :score ?score
}
{noformat}{div}
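
For orientation, a complete score query might be written as below. The query string and the ?search variable are assumptions. The point of the sketch is that *:score* is attached to ?entity rather than to the search itself:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?score {
  ?search a inst:my_index ;
          :query "grape:cabernet" ;   # assumed field name and term
          :entities ?entity .
  # the score belongs to the individual entity, not to the search
  ?entity :score ?score
}
{noformat}{div}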

The result will look like this but the actual score might be different as it depends on the specific Lucene version:
Consider the sample wine data and the my_index connector instance described previously. We can use the same instance to also query facets:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
_:f :facetCount ?facetCount .
}
{noformat}{div}

The fields to facet by must be specified with the *facetFields* predicate, whose value is a simple comma-delimited list of field names. The faceted results are retrieved with the *facets* predicate. As each facet has three components (name, value and count), the *facets* predicate binds a blank node, which in turn gives access to the individual components through the predicates *facetName*, *facetValue* and *facetCount*.
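
Putting these predicates together, a facet query could be sketched as follows. The field names year and sugar are assumptions about how the index was defined, and leaving out *:query* so that all documents match is also an assumption about the connector's behaviour:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetName ?facetValue ?facetCount {
  ?search a inst:my_index ;
          # comma-delimited list of fields to facet by (assumed field names)
          :facetFields "year,sugar" ;
          # each facet is exposed through a blank node
          :facets _:f .
  _:f :facetName ?facetName .
  _:f :facetValue ?facetValue .
  _:f :facetCount ?facetCount .
}
{noformat}{div}
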
It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the *orderBy* predicate, the value of which must be a comma-delimited list of fields. Each field may be prefixed with a minus to indicate sorting in descending order. For example:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
:entities ?entity .
}
{noformat}{div}
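
The abbreviated listing above might expand to something like this sketch, which assumes fields named year and sugar exist in the index:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
          :query "year:2013" ;   # assumed field name
          # leading minus: sort by the sugar field in descending order
          :orderBy "-sugar" ;
          :entities ?entity .
}
{noformat}{div}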

The result will contain wines produced in 2013 sorted according to their sugar content in descending order:
Limit and offset are supported on the Lucene side of the query. This is achieved through the predicates *limit* and *offset*. Consider this example in which we specify an offset of 1 and a limit of 1:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
:entities ?entity .
}
{noformat}{div}
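
A full query with paging could be sketched as below. The Lucene query string "sugar:dry" is purely an assumption for illustration, and only the *offset* and *limit* values are taken from the text above:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
          :query "sugar:dry" ;   # assumed query, not taken from the original
          :offset 1 ;            # skip the first hit
          :limit 1 ;             # return at most one hit
          :entities ?entity .
}
{noformat}{div}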

The result will contain a single wine, Franvino, because it would be second in the list if the query were executed without the limit and offset:
Snippet extraction is used to extract highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate *:snippets*, which binds a blank node that in turn provides the actual snippets via the predicates *:snippetField* and *:snippetText*. The predicate *:snippets* must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
:snippetText ?snippetText .
}
{noformat}{div}
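
Spelled out in full, the snippet query might look like the following sketch. The query string and the blank node label _:s are assumptions:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
          :query "grape:cabernet" ;   # assumed field name and term
          :entities ?entity .
  # snippets hang off the entity, one blank node per snippet
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
{noformat}{div}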

The query will return the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:
You can get the total number of hits by using the *:totalHits* predicate, e.g. for the connector instance :my_index and a query that would retrieve all wines made in 2012:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
:totalHits ?totalHits .
}
{noformat}{div}
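
A complete version of this query could be sketched as follows, assuming the year is indexed in a field named year:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?totalHits {
  ?search a inst:my_index ;
          :query "year:2012" ;   # assumed field name
          # total number of matching documents for this search
          :totalHits ?totalHits .
}
{noformat}{div}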

As there are three wines made in 2012, the value 3 (of type xsd:long) will be bound to ?totalHits.
FancyAnalyzer and SmartAnalyzer could then be used by specifying their fully qualified names, for example:

{div:style=width: 70em}{noformat}
...
"analyzer": "com.ontotext.example.SmartAnalyzer",
...
{noformat}{div}

h3. types (list of URI), required, specifies the types of entities to sync
Lucene needs to index data in a special way if it is to be used for faceted search. This is controlled by the Boolean option "facet", which is true by default. Fields that are not synchronised for faceting will not be available for faceted search.

{div:style=width: 70em}{noformat}
...
"fields": [
]
...
{noformat}{div}

Here we create an analysed field called "grape" and a non-analysed field called "grapeFacet". Both fields will be populated with the same values, because "grapeFacet" is defined as a copy field that refers to the field "grape".

For example, if we create a connector instance like this:
{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
''' .
}
{noformat}{div}

and then insert some entities:

{div:style=width: 70em}{noformat}
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example#> .
:name "Mary Syncless" ;
:city "Liverpool" .
{noformat}{div}

We could create the following index to specify a default value for _city_:

{div:style=width: 70em}{noformat}
...
{
...
}
{noformat}{div}

The default value will be used for entity :b, as it has no value for city in the repository. As the value is "London", the entity will be synchronised.
Sometimes data represented in RDF is not well suited to map directly to non-RDF. For example, if we have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model that is a single property :taggedWith. Consider the following RDF data:

{div:style=width: 70em}{noformat}
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
rdfs:comment "An article about the Cannes Film Festival in 2013." ;
:taggedWith :Cannes-FF .
{noformat}{div}

Now, if we want to map this data to Lucene so that the property *:taggedWith _x_* is mapped to separate fields *taggedWithPerson* and *taggedWithLocation* according to the type of _x_ (we are not interested in events), we can map :taggedWith twice to different fields and then use an entity filter to get the desired values:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
''' .
}
{noformat}{div}

Note that *type* is the short way to write <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:

{div:style=width: 70em}{noformat}
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
:facetCount ?facetCount .
}
{noformat}{div}

If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:


The syntax for search queries has changed to support new features and a cleaner design. Here is an example query using the lucene4 plugin:

{div:style=width: 70em}{noformat}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?c ?snippet WHERE {
?c luc4:score ?score .
}
{noformat}{div}

and here is the connector variant:

{div:style=width: 70em}{noformat}
PREFIX conn:<http://www.ontotext.com/connectors/lucene#>
PREFIX inst:<http://www.ontotext.com/connectors/lucene/instance#>
?s conn:snippetText ?snippet .
}
{noformat}{div}

Note the following changes: