{toc}

h2. Description/Motivation

The Lucene4 plug-in for GraphDB provides very fast facet (aggregation) searches of the kind normally available only through an external Apache Solr service, with the additional benefit that the index stays automatically up to date with the GraphDB repository data.

h2. Features

* maintain a Lucene index that is always synced with the data
* multiple indexes per repository with independent settings. Index options supported:
** stripping the \*ML tags in literals (*false* by default);
** index auto-update (*true* by default);
** specifying a Lucene analyzer (the default is the Lucene's *StandardAnalyzer*);
** a white list of predicates whose values are indexed;
** a white list of entity rdf:type-s to index (the default is *ALL*);
** a white list of languages to index (the default is *ALL*);
** a white list of predicates whose values are added to the facets index;
* full Lucene syntax for search;
* simple molecules - only literals reachable through zero hops (i.e. a single predicate of an entity);
* snippet extraction - retrieving snippets with highlighted search terms from the search result;
* search flags:
** results paging through _offset_ and _limit_ parameters;
** specifying snippet size per query;
** specifying facets to aggregate.

h2. Differences with Lucene2

h3. Architecture

Lucene2 indexes each relevant statement (triple) as a single Lucene document. This makes it cheap to keep an index up to date, but has a few undesirable effects: a single entity can be returned more than once for a given search (because more than one of its literals matches the query), and results cannot be sorted by specific predicates. The Lucene4 plug-in instead creates a single Lucene document per entity, with a field for each of the predicates listed in the index configuration. This leads to slower updates, but solves the two main problems above. These decisions have some other minor implications, described below. Another significant feature of Lucene4 is facets support (which is built into Apache Lucene 4.x.x).

h3. Index options

h5. Predicates

The _predicates_ index option is now mandatory and must not be empty. Indexing all predicates for a given entity is not supported.

h5. Additional joins

The _additionalJoins_ index option from Lucene2 has been removed in Lucene4, but it can be emulated. Consider the following Lucene2 index creation:
{code}
# Lucene2 snippet - create index with an additional join
luc2:my-index-name luc2:createIndex "predicates=...; additionalJoins=urn:join,joinValue"
{code}
To do this in Lucene4, create the same index, but add _urn:join_ to _predicates_ instead of _additionalJoins_:
{code}
# Lucene4 snippet - create index with the join predicate actually indexed
luc4:my-index-name luc4:createIndex "predicates=...,urn:join"
{code}
The filtering itself is done at search time. Since Lucene4 creates a Lucene document field for each predicate, and the field name is exactly the predicate URI, the filter can be expressed in the Lucene query syntax:
{code}
# Lucene4 snippet - search: equivalent to ?entity luc2:my-index-name "query"
?entity luc4:my-index-name "+(query) +urn\\:join:joinValue"
{code}
To dissect the {code}+urn\\:join:joinValue{code} part (all of it is normal Lucene query syntax):
* + means that the clause that follows is mandatory.
* {code}urn\\:join:joinValue{code} is a query of the form field:value. Since the field name contains a colon, the colon must be escaped with a backslash. The backslash is also an escape character in SPARQL string literals, so it must be doubled for the plugin to see a single backslash. Forward slashes are special in Lucene syntax as well and must be escaped the same way. So if the join predicate is _http://example.com/slash_, the proper query is "+(query) +http\\:\\/\\/example.com\\/slash:joinValue"
* Whatever the original FTS query is, put it in brackets with a plus in front. If you ask for "query +urn\\:join:joinValue" instead, _query_ becomes optional: Lucene returns every entity whose _urn:join_ field matches _joinValue_, whether or not it also matches _query_, which is not the intent.
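
Putting the pieces together, a complete query for the _http://example.com/slash_ case might look like the sketch below. The index name _my-index-name_ and the search term _query_ are placeholders carried over from the snippets above, not names the plugin defines:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity WHERE {
# what Lucene should see: +(query) +http\:\/\/example.com\/slash:joinValue
# every backslash is doubled because it sits inside a SPARQL string literal
?entity luc4:my-index-name "+(query) +http\\:\\/\\/example.com\\/slash:joinValue"
}
{code}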

h5. Optional joins

The _optionalJoins_ parameter works in Lucene4 as it does in Lucene2: entities that have one of the specified predicate values OR are missing the predicate completely are indexed, the rest of the entities are not. The optional join predicate values are indexed like other predicates, in one Lucene field per predicate, which allows searching for specific values as described above in the _Additional joins_ section.

For example, if we create an index like this:
{code}
luc4:my-index-name luc4:createIndex "predicates=urn:label;optionalJoins=urn:join,ok"
{code}
and then insert some entities:
{code}
# the entity below will be indexed because it has a matching value for <urn:join>
<entity:will-be-indexed> <urn:label> "this entity will be indexed" .
<entity:will-be-indexed> <urn:join> "ok" .

# the entity below will be indexed because it lacks the predicate completely
<entity:will-be-indexed-too> <urn:label> "this entity will be indexed too" .

# the entity below will not be indexed - different <urn:join> value
<entity:will-not-be-indexed> <urn:join> "cancel" .
{code}

Querying on urn:join is possible as in the additional joins example above:
{code}
# will return only the entities that had this specific value for the optional join predicate
?entity luc4:my-index-name "words... +urn\\:join:ok"
{code}

Entities that don't have the optional join predicate get a default value indexed in that field, so that they can be filtered for too. The built-in default value is _OPTIONALJOINDEFAULT_; it can be changed with the _optionalJoinDefaults_ index option. For example, with the above entities, searching for "words... +urn\\:join:OPTIONALJOINDEFAULT" will return only _<entity:will-be-indexed-too>_.

We could create the index like this to specify a different default value for _<urn:join>_:
{code}
luc4:my-index-name luc4:createIndex "predicates=urn:label;optionalJoins=urn:join,ok;optionalJoinDefaults=urn:join,newdefault"
{code}

Optional join predicate values are indexed but not tokenized, which means that queries on them must match the value exactly.
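
As a sketch of exact matching in practice, and assuming the index was created with _optionalJoinDefaults=urn:join,newdefault_ as above, the following query would select only entities that lack _<urn:join>_. The search term _words_ is a placeholder:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity WHERE {
# the field value must equal the configured default exactly, since it is not tokenized
?entity luc4:my-index-name "+(words) +urn\\:join:newdefault"
}
{code}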

h3. Search options

* The _return.entities_ parameter has been removed. Lucene4 always behaves like Lucene2 with _return.entities=dedup_.
* The _order.by_ parameter has been added to allow specifying predicates to sort on.

h2. User's Guide

h3. Creating an index

To create an index, issue the following SPARQL update:

{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
luc4:my-index-name luc4:createIndex "<index-options>" .
}
{code}

where *<index-options>* can be a combination of the following options, separated by a semicolon ';':
* stripMarkup=true\|false specifying whether to strip tags from HTML/XML literals (default is false)
* autoUpdate=true\|false specifying whether to keep this index automatically up-to-date (default is true)
* enableSnippets=true\|false specifying whether to enable snippets in this index. As of 2013-12-15, this is a dummy flag and snippets are always enabled. You should generally pass a meaningful value here in case we optimize our implementation later.
* analyzer=<analyzer-class-name> the Lucene analyzer to use when indexing literals in this index. There are two possibilities here:
** specifying a Lucene analyzer class name directly - in that case the analyzer must have either a default constructor or a constructor accepting a single _org.apache.lucene.util.Version_ parameter. If you specify an analyzer that has neither of those constructors, the index won't be created
** specifying a class derived from _com.ontotext.trree.plugin.lucene4.AnalyzerFactory_
* predicates=<comma-separated-list-of-URIs> - only triples with those predicates will be indexed
* languages=<comma-separated-languages> if specified, only literals tagged with the listed languages will be indexed
* types=<comma-separated-list-of-URIs> if specified, a white list of types to index (i.e. will only index entities that have rdf:type equal to one of the specified URIs)
* facets=<comma-separated-list-of-URIs> if specified, the listed predicates and their values will be indexed in the facets index
* optionalJoins=<\|-separated-predicate-object-pairs> if specified, a white list of additional optional joins to validate. Both URIs and literals are supported as objects, and literals can include spaces. Sample syntax: _optionalJoins=urn:ontology:predicate,longer value\|urn:ontology:predicate,another longer value_. An entity is indexed if and only if, for each specified predicate, it either has the predicate with one of the specified values or doesn't have the predicate at all. For an indexed entity, a field is created for each optional join predicate. The field holds all of the entity's values for that predicate or, when the entity doesn't have the predicate, the default value (see *optionalJoinDefaults*). The values are indexed but not tokenized, so any search within an optional joins field must match exactly
* optionalJoinDefaults=<\|-separated-predicate-object-pairs> if specified, provides a different default value to index in the field for entities that don't have the optional join predicate at all. The default for all predicates is "OPTIONALJOINDEFAULT"
* sortPredicates=<comma-separated-list-of-URIs> - predicates whose values will be used for sorting at search time. Only predicates specified here can be passed to _order.by_.

Examples:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
# creates an index, only indexing rdfs:comment-s with snippets enabled
luc4:my-index-name luc4:createIndex "predicates=http://www.w3.org/2000/01/rdf-schema#comment;enableSnippets=true" .
}
{code}

{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
# creates an index, only indexing entities with rdf:type that's either http://example.com/Type1 or http://example.com/Type2; and only indexing rdfs:label-s
luc4:my-index-name luc4:createIndex "types=http://example.com/Type1,http://example.com/Type2; predicates=http://www.w3.org/2000/01/rdf-schema#label" .
}
{code}

{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
# creates an index on rdfs:label-s, only indexing @en literals or literals without a language tag, using Lucene's EnglishAnalyzer
luc4:my-index-name luc4:createIndex "languages=en,;analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer;predicates=http://www.w3.org/2000/01/rdf-schema#label" .
}
{code}
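
One more sketch, combining options from the list above that the examples so far don't cover: sorting support and an optional join whose literal value contains a space. The predicate _urn:status_ and its values are made up for illustration:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
# indexes rdfs:label-s and allows order.by on rdfs:label at search time;
# only indexes entities whose urn:status is "in review" or "published", or that have no urn:status at all
luc4:my-index-name luc4:createIndex "predicates=http://www.w3.org/2000/01/rdf-schema#label;sortPredicates=http://www.w3.org/2000/01/rdf-schema#label;optionalJoins=urn:status,in review|urn:status,published" .
}
{code}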

h3. Drop an index

Example:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
# drops the index uniquely identified by its URI (here luc4:my-index-name)
luc4:my-index-name luc4:dropIndex "" .
}
{code}

h3. List all indices

Example:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?uri ?name WHERE {
?uri luc4:list ?name
}
{code}

h3. Query indices health

Indices might become corrupt due to disk failure or other issues. The _luc4:healthCheck_ predicate returns a list of indices along with their health status. In the example below, ?uri will bind to the URI of each index and ?health to its health status:

{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?uri ?health WHERE {
?uri luc4:healthCheck ?health .
}
{code}

h3. Search

# Simple - in the form {code}?entity luc4:indexName "lucene-query"{code} Returns the ids of all matching entities as the ?entity binding.
# Advanced - in the form {code}?result luc4:indexName ("lucene-query" "options"){code} Options is a semicolon-separated list of option=value pairs. What is bound to ?result depends on whether the options specify a facet query or not. All supported options:
|| Parameter || Value || Default || Comment ||
| offset | int | 0 | Returns the results starting from the specified value |
| limit | int | 2^31-1 | Lucene query limit - maximum number of results to return |
| snippet.size | int | 250 | The size of the returned snippets |
| class | string | none | Hit highlights are represented with <span>. This optional parameter provides the class, i.e. <span class="xxx"> |
| facets | string | none | Comma-separated list of facet predicates whose values to aggregate. All predicates specified here must also have been specified in the _facets_ index option. Passing a non-empty list of facet predicates changes the way results are returned, see the facet examples below |
| facets.limit | int | 2^31-1 | The number of top values to return per facet predicate. For example, specifying _facets=urn:facet1,urn:facet2;facets.limit=5_ will return up to ten facet results in total |
| order.by | string | score | Comma-separated list of indexed predicates. If specified, the results are ordered by the values of the specified predicates, in the specified order. Values are sorted in ascending order; to sort a predicate's values in descending order, prepend - (minus) to it. For example, _order.by=urn:predicate1,-urn:predicate2_ sorts the results by the value of urn:predicate1 ascending, then sorts results that have the same value for urn:predicate1 by the value of urn:predicate2, descending. The special value _score_ (lower-case) can be used instead of a predicate to order by the normal Lucene score (descending by default). For example, "_order.by=urn:predicate,score_" sorts on urn:predicate first (ascending) and then on score, while "_order.by=score,urn:predicate_" sorts on score first and on urn:predicate second. Passing "-score" to reverse the order (from lowest to highest score) is also allowed. *NOTE*: Predicates in this list should be specified as _sortPredicates_ at index creation. Ordering by a predicate not specified in _sortPredicates_ leads to unpredictable order |
# Faceted search results - for searches where the _facets_ search option is specified, the search predicate (e.g. luc4:indexName) binds a single dummy blank node to its object. You must then query for _luc4:entities_ to get the entities that matched the query and _luc4:facets_ to get the faceted results. The example below demonstrates this, along with related special predicates for extra result details:
{code}
PREFIX lucene4:<http://www.ontotext.com/owlim/lucene4#>

SELECT ?entity ?score ?snippet ?facetPredicate ?facetValue ?facetCount WHERE {
?r lucene4:basicFacets ( "query-string" "facets=urn:facet1,urn:facet2" ) . # ?r now contains dummy blank node
{
?r lucene4:entities ?entity . # ?entity will bind to the URIs of all entities that matched the query
?entity lucene4:score ?score . # ?score will bind to the Lucene score of the corresponding ?entity for this search
?entity lucene4:snippet ?snippet . # ?snippet will bind to a relevant snippet of the corresponding ?entity for this search
} UNION {
?r lucene4:facets ?facet . # ?facet will bind to all facet results. ?facet itself will be dummy blank node, use the special predicates below to extract information
?facet lucene4:facetPredicate ?facetPredicate . # a predicate for this facet, will be either urn:facet1 or urn:facet2 in this example
?facet lucene4:facetValue ?facetValue . # the facet predicate value for the current facet
?facet lucene4:facetCount ?facetCount . # the number of entities that matched the query having ?facetValue as value for their ?facetPredicate
}
}
{code}
Note the use of UNION to avoid a Cartesian product of the document and facet result sets. This is the recommended way to retrieve both document and facet results.

h4. Additional predicates allowed on bound search results

* ?entity <http://www.ontotext.com/owlim/lucene4#score> ?score - binds ?score to the score of the result
* ?entity <http://www.ontotext.com/owlim/lucene4#snippet> ?snippet - binds ?snippet to the extracted snippet, with highlights

h4. Additional predicates allowed on bound facets results

* ?facet <http://www.ontotext.com/owlim/lucene4#facetPredicate> ?predicate - binds ?predicate to the URI of the current facet predicate
* ?facet <http://www.ontotext.com/owlim/lucene4#facetValue> ?value - binds ?value to a value of ?predicate (up to _facets.limit_ values of ?predicate, ordered by their aggregate count in the result set)
* ?facet <http://www.ontotext.com/owlim/lucene4#facetCount> ?count - binds ?count to the number of entities in the result set that have ?value as a value of ?predicate

h3. Examples

* Make a simple query and return all results; ?entity will bind to the resource URI
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity WHERE {
?entity luc4:index-name "query string"
}
{code}
* Return the top 10 resources, snippets are of size 20, ?entity will bind to the resource URI
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity WHERE {
?entity luc4:index-name ("query string" "limit=10;snippet.size=20")
}
{code}
* Return the top 10 resources, starting from offset 10 (i.e. page size is 10, return second page), documents will be sorted by the value of rdfs:label instead of Lucene score
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity WHERE {
?entity luc4:index-name ("query string" "limit=10;offset=10;order.by=http://www.w3.org/2000/01/rdf-schema#label")
}
{code}
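
A further sketch of _order.by_, assuming the index was created with rdfs:label listed in both _predicates_ and _sortPredicates_: sort by label in descending order and break ties by Lucene score.
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity ?score WHERE {
# the leading minus reverses the sort on rdfs:label; score is the tie-breaker
?entity luc4:index-name ("query string" "limit=10;order.by=-http://www.w3.org/2000/01/rdf-schema#label,score") .
?entity luc4:score ?score .
}
{code}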


h3. Real-world examples

The examples below walk through creating an index and then searching it, to help you get started quickly.

The following update creates an index on all entities that have an rdfs:label with an @en or no language tag, using the EnglishAnalyzer with snippets enabled.
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
<http://www.ontotext.com/owlim/lucene4#sampleIndex> luc4:createIndex "languages=en,;analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer;predicates=http://www.w3.org/2000/01/rdf-schema#label;enableSnippets=true" .
}
{code}

Now that the index is created, you can run the following query to obtain the top 20 entities matching terms that start with "a", together with the snippet where the match occurs and the Lucene score.
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT * WHERE {
?entity luc4:sampleIndex ("a*" "limit=20") .
?entity luc4:snippet ?snippet .
?entity luc4:score ?score .
}
{code}
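
The _class_ search option from the options table can be added in the same way. A sketch, where the CSS class name _hit_ is an arbitrary choice:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity ?snippet WHERE {
# per the options table, highlights in ?snippet should then be wrapped in <span class="hit">
?entity luc4:sampleIndex ("a*" "limit=20;snippet.size=100;class=hit") .
?entity luc4:snippet ?snippet .
}
{code}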

h3. Facets example

Consider the following RDF data (in Turtle format):
{code}
@prefix rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:<http://www.w3.org/2000/01/rdf-schema#> .

<urn:a>
rdf:type <urn:Type1> ;
<test:facet> "facet-value-1" ;
rdfs:comment "this is instance a of Type1, hello world" .

<urn:b>
rdf:type <urn:Type2> ;
rdfs:comment "this is instance b of Type2, hello world" .

<urn:c>
rdf:type <urn:Type1> ;
<test:facet> "facet-value-2" ;
rdfs:comment "this is instance c of Type1, hello world" .

<urn:d>
rdf:type <urn:Type1> ;
<test:facet> "facet-value-1" ;
rdfs:comment "this is instance d of Type1, hello world" .
{code}

Now create an index, using rdf:type and test:facet as facet predicates, indexing rdfs:comment and having no type restriction:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT DATA {
<http://www.ontotext.com/owlim/lucene4#sampleIndex> luc4:createIndex "predicates=http://www.w3.org/2000/01/rdf-schema#comment;facets=http://www.w3.org/1999/02/22-rdf-syntax-ns#type,test:facet" .
}
{code}
Let's gather some results now:
{code}
PREFIX lucene4:<http://www.ontotext.com/owlim/lucene4#>

SELECT ?entity ?score ?facetPredicate ?facetValue ?facetCount WHERE {
# note empty query string is allowed and will just match all documents
?r lucene4:sampleIndex ( "" "facets=test:facet,http://www.w3.org/1999/02/22-rdf-syntax-ns#type" ) .
{
?r lucene4:entities ?entity .
?entity lucene4:score ?score .
} UNION {
?r lucene4:facets ?facet .
?facet lucene4:facetPredicate ?facetPredicate .
?facet lucene4:facetValue ?facetValue .
?facet lucene4:facetCount ?facetCount .
}
}
{code}
The result bindings will look like the table below; an empty cell means the value is unbound:
|| score || entity || facetValue || facetCount || facetPredicate ||
| 1.0 | urn:a | | | |
| 1.0 | urn:b | | | |
| 1.0 | urn:c | | | |
| 1.0 | urn:d | | | |
| | | facet-value-1 | 2.0 | test:facet |
| | | facet-value-2 | 1.0 | test:facet |
| | | urn:Type1 | 3.0 | [http://www.w3.org/1999/02/22-rdf-syntax-ns#type] |
| | | urn:Type2 | 1.0 | [http://www.w3.org/1999/02/22-rdf-syntax-ns#type] |
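
To limit the number of values per facet, _facets.limit_ can be added to the search options. A sketch against the same sample index; going by the description of _facets.limit_ above, one would expect at most one row per facet predicate, but the output is not reproduced here:
{code}
PREFIX lucene4:<http://www.ontotext.com/owlim/lucene4#>

SELECT ?facetPredicate ?facetValue ?facetCount WHERE {
# only facet results are requested here, so no UNION is needed
?r lucene4:sampleIndex ( "" "facets=test:facet,http://www.w3.org/1999/02/22-rdf-syntax-ns#type;facets.limit=1" ) .
?r lucene4:facets ?facet .
?facet lucene4:facetPredicate ?facetPredicate .
?facet lucene4:facetValue ?facetValue .
?facet lucene4:facetCount ?facetCount .
}
{code}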


h2. Dependencies & Deployment Details

Dependencies:
* OWLIM Plug-in API (releases after build 5.4.6686)
* Lucene 4.5.1
* Apache commons-io 2.4


h2. FAQ

h3. How to use UNION with Lucene4?

*A:* Just like in a normal query. Consider the following example:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?c WHERE {
?s luc4:content ("query string" "limit=10;snippet.size=200") .
{
?s rdf:type <http://data.ontotext.com/ontologies/ontology1/Type1> .
bind (?s as ?c) .
}
UNION
{
?s rdf:type <http://data.ontotext.com/ontologies/ontology2/Type2> .
?c <http://data.ontotext.com/ontologies/ontology2/Predicate1> ?s .
}
} LIMIT 10
{code}
The above query joins the union part (with bindings for ?c and ?s) with the Lucene part on ?s. This works provided that the Lucene index contains entities of the right classes, i.e. of both Type1 and Type2. There are a few noteworthy details though:
# The Lucene query limit applies before the SPARQL query limit.
# You can safely use DISTINCT in order to clean up duplicate results.

h3. If I union up two lucene4 queries (on different lucene4 indices), will the snippets and scoring still work?

*A:* The short answer is no. Lucene scores are generated per query, so one cannot execute two different Lucene queries and expect the scores to be comparable when the results are combined. Consider the following example:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?c ?snippet WHERE {
?c rdf:type <http://data.ontotext.com/ontologies/ontology1/Type1> .

# LUCENE QUERY
{
?c luc4:content ("gold" "limit=10;snippet.size=200") .
?c luc4:snippet ?snippet .
?c luc4:score ?score .
}
UNION
{
?annotation rdf:type <http://data.ontotext.com/ontologies/ontology2/Type2> .
?annotation luc4:annotations ("gold" "limit=10;snippet.size=200") .
?annotation luc4:snippet ?snippet .
?annotation luc4:score ?score .
?c <http://data.ontotext.com/ontologies/ontology2/Predicate1> ?annotation .
}

} ORDER BY DESC (?score)
{code}
While the query above is valid, it is not meaningful, for the reason given earlier: the two Lucene queries are scored independently, so ordering the combined results by ?score gives incorrect results. Instead of using UNION, create a single index for Type1, Type2 and Predicate1 and execute just one query.

h3. Sometimes the snippets are less than the number of snippet chars requested, why?

*A:* This is how Lucene's FastVectorHighlighter generates snippets. For example, if the match is on a relatively short indexed property (such as a report title), the snippet tends to be shorter than the entire title, even though the title is shorter than the requested snippet length. Lucene tends to cut the snippet off some characters before the first term match. The getBestFragment method's [javadoc|http://lucene.apache.org/core/3_5_0/api/contrib-highlighter/org/apache/lucene/search/vectorhighlight/FastVectorHighlighter.html#getBestFragment%28org.apache.lucene.search.vectorhighlight.FieldQuery,%20org.apache.lucene.index.IndexReader,%20int,%20java.lang.String,%20int%29] is not very helpful in explaining why.

h3. The drop index SPARQL update request throws an error if the index does not exist. Is there any way to ask if the index exists and if so - drop it?

This is where the _luc4:list_ comes in handy. If you have an index named _"myIndex"_, then you can execute the following SPARQL update and get response code 200, even when the index does not exist:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
INSERT {
?indexUri luc4:dropIndex "" .
} WHERE {
?indexUri luc4:list "myIndex" .
};
{code}

h3. How to search for terms/phrases containing stop words?

First of all, a note on terminology: a _term_ is a single word of text, so what you are actually searching for here is a _phrase_. The Lucene4 plug-in supports the Lucene query syntax, which means that you can search for a phrase such as "City of Manchester" like this (mind the quotes):
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
SELECT ?entity WHERE {
?entity luc4:index-name ('''"City of Manchester"''' "limit=10;snippet.size=200")
}
{code}
Note the caveat: in order to use " inside a literal, you need the additional quoting construct (triple quotes), see [http://www.w3.org/TR/sparql11-query/#QSynLiterals].

This query returns appropriate results. The analyzers do filter stop words, but this isn't an issue, since the same analyzer, the one specified at index creation time, is used for both indexing and searching. More information on how this works can be found in the "Token Position Increments" section of [http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html].

h3. Are prefix wildcards supported? Can I use both prefix and suffix wildcards on the same term?

Yes, you can. Go for it\!