
{toc}

h2. Description/Motivation

The Lucene2 Plug-in was developed to provide full-text search in GraphDB that is automatically kept up-to-date with the repository data. The existing OWLIM Lucene plug-in lacked incremental building and was not kept up-to-date automatically. This plug-in fills these gaps and adds some useful features.

h2. Features

* maintaining a Lucene index that is always synced with the data;
* multiple indices per repository with independent settings - the index options supported are as follows:
** stripping the \*ML tags in literals (*false* by default);
** index auto-update (*true* by default);
** enabling snippets in the index (enabling snippets will make the index significantly larger) (*false* by default);
** specifying a Lucene analyzer (the default is the Lucene's *StandardAnalyzer*);
** a white list of predicates, whose values to index (the default is *ALL*);
** a white list of entity rdf:type-s to index (the default is *ALL*);
** a white list of additional joins, used to restrict what should be indexed more than just by type;
** basic wildcard support for the additional join values;
** a white list of languages to index (the default is *ALL*);
* full Lucene syntax for search;
* simple molecules - only literals reachable through zero hops (i.e. a single predicate of an entity);
* snippet extraction - retrieving snippets with highlighted search terms from the search result;
* easy access to all parts of the triple that matched a query (i.e. easy way to retrieve the subject, predicate and object of the triple, having an object literal matching the query);
* search flags:
** binding the object of the query either to entity or literal;
** returning only the top N entries of the Lucene query;
** specifying snippet size per query.

h2. User's Guide

h3. Creating an index

To create an index, issue the following SPARQL update query:

{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
luc2:my-index-name luc2:createIndex "<index-options>" .
}
{code}

where *<index-options>* can be a combination of the following options, separated by a semicolon ';':
* {{stripMarkup=true\|false}} \- specifying whether to strip tags from HTML/XML literals (*false* by default);
* {{autoUpdate=true\|false}} \- specifying whether to keep the index automatically up-to-date (*true* by default);
* {{enableSnippets=true\|false}} \- specifying whether to enable snippets in this index (*false* by default);
* {{analyzer=<analyzer-class-name>}} \- specifying which Lucene analyzer to use when indexing literals in this index. There are two options here:
** directly specifying the Lucene analyzer class name - in this case, the analyzer should either have a default constructor or a constructor accepting a single {{org.apache.lucene.util.Version}} parameter. If you specify an analyzer that does not have one of these constructors, the index will not be created.
** specifying a class derived from {{com.ontotext.trree.plugin.lucene.AnalyzerFactory}} \- as used in the "old" Lucene plug-in;
* {{predicates=<comma-separated-list-of-URIs>}} \- if specified, only triples with these predicates will be indexed;
* {{languages=<comma-separated-languages>}} \- if specified, only literals tagged with the listed languages will be indexed;
* {{types=<comma-separated-list-of-URIs>}} \- if specified, a white list of types will be indexed (i.e. only entities that have rdf:type equal to one of the specified URIs will be indexed);
* {{additionalJoins=<\|-separated-predicate-object-pairs>}} \- if specified, a white list of additional joins will enforce a check after the type check, and in case no additional joins are found for a given entity, it will not be indexed. Supports both URIs and Literals as objects. Literals can include spaces. Sample syntax:
{code}additionalJoins=urn:ontology:predicate,longer value|urn:ontology:predicate,another longer value{code}
In order to provide a wildcard, just leave the value empty, for example:
{code}additionalJoins=urn:ontology:predicate,;{code}
* {{optionalJoins=<\|-separated-predicate-object-pairs>}} \- if specified, a white list of additional optional joins to validate. This option has the same behaviour as the additionalJoins, except that the check will succeed, and the entity will be indexed if the specified predicates are missing (in addition to the usual condition of a predicate-value match). The same syntax as _additionalJoins_ is used, except wildcard values are not supported.

Examples:
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
# creates an index, only indexing rdfs:comment-s with snippets enabled
luc2:my-index-name luc2:createIndex "predicates=http://www.w3.org/2000/01/rdf-schema#comment;enableSnippets=true" .
}
{code}

{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
# creates an index, only indexing entities with rdf:type that's either http://example.com/Type1 or http://example.com/Type2
luc2:my-index-name luc2:createIndex "types=http://example.com/Type1,http://example.com/Type2" .
}
{code}

{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
# creates an index, only indexing @en literals or literals without a language tag, use Lucene's EnglishAnalyzer
luc2:my-index-name luc2:createIndex "languages=en,;analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer" .
}
{code}
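Since the option string is just a list of semicolon-separated {{name=value}} pairs, it can also be assembled programmatically. The following Python sketch is not part of the plug-in: the helper names {{build_index_options}} and {{create_index_update}} are hypothetical, and the code only formats the SPARQL update text shown in the examples above. Sending the update to a repository is left out.

```python
LUC2 = "http://www.ontotext.com/owlim/lucene2#"


def build_index_options(**options):
    """Join name=value pairs with ';' (hypothetical helper, not a plug-in API)."""
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            # The plug-in options use lowercase true/false.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # List-valued options (predicates, types, languages) are comma-separated.
            value = ",".join(value)
        parts.append(f"{name}={value}")
    return ";".join(parts)


def create_index_update(index_name, options):
    """Format the createIndex update for an index in the luc2: namespace."""
    return (
        f"PREFIX luc2:<{LUC2}>\n"
        "INSERT DATA {\n"
        f'luc2:{index_name} luc2:createIndex "{options}" .\n'
        "}"
    )


if __name__ == "__main__":
    # An empty language entry stands for literals without a language tag.
    opts = build_index_options(
        languages=["en", ""],
        analyzer="org.apache.lucene.analysis.en.EnglishAnalyzer",
    )
    print(create_index_update("my-index-name", opts))
```

The {{additionalJoins}} and {{optionalJoins}} values use '\|' as their own separator, so in this sketch they would be passed as ready-made strings rather than lists.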

h3. Drop an index

Example:
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
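# drops an index uniquely specified by its URI (here luc2:my-index-name);
# INSERT DATA does not allow variables, so the index URI must be given explicitly
luc2:my-index-name luc2:dropIndex "" .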
}
{code}

h3. List all indices

Example:
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
SELECT ?uri ?name WHERE {
?uri luc2:list ?name
}
{code}

h3. Search

# Simple search, compatible with the search syntax of the old Lucene plug-in.
# Advanced configuration properties:
|| Parameter || Value \\ || Default \\ || Comment \\ ||
| return.entities \\ | *false*, *true* or *dedup* \\ | false \\ | *false* \- the query will return literals that matched the query, just like the old plugin; \\
*true* \- the query will return entity IDs with predicate values that matched the query. A single instance ID can be returned more than once (when it has multiple predicate values that matched the query); \\
*dedup* \- the same as "true", except duplicate instance IDs are filtered - if a single entity has multiple predicate values that matched the query, only the one with the top score is returned. \\ |
| offset \\ | int \\ | 0 \\ | Returns the results starting from the specified value; \\ |
| limit \\ | int \\ | 2^31 \\ | Lucene query limit - the maximum number of results to return; \\ |
| snippet.size \\ | int \\ | 250 | The size of the returned snippets (if enabled); \\ |
| class \\ | string \\ | none | Hit highlights are represented with <span>. This optional parameter provides the class, i.e. <span class="xxx">; \\ |
# Examples
#* The following query returns the top 10 resources, snippets are of size 20, ?entity will bind to the resource URI:
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
SELECT ?entity WHERE {
?entity luc2:index-name ("query string" "return.entities=true;limit=10;snippet.size=20")
}
{code}
#* To get the top result and bind ?literal to the literal that is matched:
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
SELECT ?literal WHERE {
?literal luc2:index-name ("query string" "limit=1")
}
{code}
# Additional predicates are allowed on bound search results whether for literals or resources:
#* ?result [http://www.ontotext.com/lucene2#score] ?score - binds ?score to the score of the result;
#* ?result [http://www.ontotext.com/lucene2#snippet] ?snippet - binds ?snippet to the extracted snippet, with highlights;
#* ?result [http://www.ontotext.com/lucene2#entity] ?entity - binds ?entity to the resource URI that has a predicate whose value is the literal that matched the query (basically the same as passing the "r" flag);
#* ?result [http://www.ontotext.com/lucene2#predicate] ?predicate - binds ?predicate to the predicate URI of the literal that matched the query;
#* ?result [http://www.ontotext.com/lucene2#literal] ?literal - binds ?literal to the literal that matched the query.
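The search flags from the table above are passed as a second, semicolon-separated string. As a rough illustration, the Python sketch below validates a set of flags against the parameter names in that table and formats the search query. The function names are hypothetical and the code does not talk to GraphDB; it only builds the query text.

```python
LUC2 = "http://www.ontotext.com/owlim/lucene2#"

# Parameter names taken from the table of advanced configuration properties.
ALLOWED_FLAGS = {"return.entities", "offset", "limit", "snippet.size", "class"}
RETURN_ENTITIES_VALUES = {"false", "true", "dedup"}


def build_search_flags(flags):
    """Turn a dict of search flags into 'name=value;name=value' (hypothetical helper)."""
    unknown = set(flags) - ALLOWED_FLAGS
    if unknown:
        raise ValueError(f"unknown search flag(s): {sorted(unknown)}")
    mode = str(flags.get("return.entities", "false"))
    if mode not in RETURN_ENTITIES_VALUES:
        raise ValueError(f"return.entities must be one of {sorted(RETURN_ENTITIES_VALUES)}")
    return ";".join(f"{name}={value}" for name, value in flags.items())


def build_search_query(index_name, query_string, flags):
    """Format a search query against the index luc2:<index_name>."""
    if '"' in query_string:
        # A plain double-quoted SPARQL literal cannot hold unescaped double quotes.
        raise ValueError("query string contains a double quote")
    return (
        f"PREFIX luc2:<{LUC2}>\n"
        "SELECT ?entity WHERE {\n"
        f'?entity luc2:{index_name} ("{query_string}" "{build_search_flags(flags)}")\n'
        "}"
    )


if __name__ == "__main__":
    print(build_search_query(
        "index-name",
        "query string",
        {"return.entities": "true", "limit": 10, "snippet.size": 20},
    ))
```

Rejecting unknown flag names early is a design choice of this sketch; how the plug-in itself reacts to an unknown flag is not covered here.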

h3. Real-world examples

These examples will help you understand how to create an index and then execute searches on it.

The following query creates an index on all entities that have rdfs:label values tagged with @en or with no language tag, using the EnglishAnalyzer with snippets enabled.
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
<http://www.ontotext.com/owlim/lucene2#sampleIndex> luc2:createIndex "languages=en,;analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer;predicates=http://www.w3.org/2000/01/rdf-schema#label;enableSnippets=true" .
}
{code}

Now that the index is created, you can run the following query to obtain the top 20 entries that start with "a", with their actual literal and the snippet where the literal occurs, including the Lucene score.
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
SELECT * WHERE {
?entity luc2:sampleIndex ("a*" "return.entities=true;limit=20") .
?entity luc2:snippet ?snippet .
?entity luc2:literal ?literal .
?entity luc2:score ?score .
}
{code}

h2. Benchmarking

The benchmark was run on a LUBM-50 dataset (6654856 explicit statements) using the default values for memory and repository configuration.

The CPU used was Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz.

* loading:
** LUBM-50 without including the plug-in library: 485689 ms ( < 486 s);
** LUBM-50 with library included: 684011 ms.

* creating a simple index over LUBM-50 with the following configuration: 2110 seconds (35 mins)
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
<http://www.ontotext.com/owlim/lucene2#sampleIndex> luc2:createIndex "languages=en,;analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer;predicates=http://www.w3.org/2000/01/rdf-schema#label;enableSnippets=true" .
}
{code}

* creating an index over LUBM-50 with the following configuration (types): 140 s
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
luc2:profAndStud luc2:createIndex "types=http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#Professor,http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent" .
}
{code}

* creating an index over LUBM-50 with the following configuration (types and labels): 30 s
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT DATA {
luc2:pasn3 luc2:createIndex "predicates=http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#name;types=http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#Professor,http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent;enableSnippets=true" .
}
{code}
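To put the loading figures above into perspective, the short Python calculation below derives the loading overhead of having the plug-in library included and converts the longest index build time to minutes. It uses only the numbers reported in this section; the variable names are ours.

```python
# Figures copied from the benchmark results above.
LOAD_WITHOUT_PLUGIN_MS = 485689
LOAD_WITH_PLUGIN_MS = 684011
SIMPLE_INDEX_S = 2110

overhead_ms = LOAD_WITH_PLUGIN_MS - LOAD_WITHOUT_PLUGIN_MS
overhead_pct = 100 * overhead_ms / LOAD_WITHOUT_PLUGIN_MS

print(f"loading overhead: {overhead_ms} ms ({overhead_pct:.1f}%)")  # 198322 ms (40.8%)
print(f"simple index build: {SIMPLE_INDEX_S / 60:.1f} min")         # 35.2 min
```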

h2. Dependencies & Deployment Details

* GraphDB Plug-in API
* Lucene 3.6.2
* Apache commons-io 2.4



h2. FAQ

h3. How to use UNION with Lucene2?

*A:* Just like in a normal query. Consider the following example:
{code}
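PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?c WHERE {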
?s luc2:content ("query string" "return.entities=true;limit=10;snippet.size=200") .
{
?s rdf:type <http://data.ontotext.com/ontologies/ontology1/Type1> .
bind (?s as ?c) .
}
UNION
{
?s rdf:type <http://data.ontotext.com/ontologies/ontology2/Type2> .
?c <http://data.ontotext.com/ontologies/ontology2/Predicate1> ?s .
}
} LIMIT 10
{code}
The above query joins the union part (with bindings for ?c and ?s) with the Lucene part on ?s, provided that the Lucene index contains entities of the right classes, i.e. entities of type Type1 and Type2. There are a few noteworthy details:
# The Lucene query limit applies before the SPARQL query limit.
# You can safely use DISTINCT in order to clean up duplicate results.

h3. If I UNION up two Lucene2 queries (on different Lucene2 indices), will the snippets and scoring still work?

*A:* The short answer is no, because Lucene scores are generated per query, so basically one cannot execute two different Lucene queries and expect adequate scoring when joining the results. Consider the following example:
{code}
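PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?c ?snippet WHERE {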
?c rdf:type <http://data.ontotext.com/ontologies/ontology1/Type1> .

# LUCENE QUERY
{
?c luc2:content ("gold" "return.entities=true;limit=10;snippet.size=200") .
?c luc2:snippet ?snippet .
?c luc2:score ?score .
}
UNION
{
?annotation rdf:type <http://data.ontotext.com/ontologies/ontology2/Type2> .
?annotation luc2:annotations ("gold" "return.entities=true;limit=10;snippet.size=200") .
?annotation luc2:snippet ?snippet .
?annotation luc2:score ?score .
?c <http://data.ontotext.com/ontologies/ontology2/Predicate1> ?annotation .
}

} ORDER BY DESC (?score)
{code}
While the query above is valid, it is not adequate because of the reasons mentioned earlier. The results will be incorrect since a different scoring is used for the two queries. Instead of using UNION, you should create a single index for Type1, Type2 and Predicate1, and execute just one query.

h3. Why are the snippets sometimes shorter than the requested snippet size?

*A:* This is due to the way Lucene's FastVectorHighlighter generates snippets. For example, if the match is on a relatively short indexed property (such as a report title), the snippet tends to be shorter than the entire title, even though the title is shorter than the requested snippet length. Lucene tends to cut off the characters that precede the first term match. The {{getBestFragment}} method's [javadoc|http://lucene.apache.org/core/3_5_0/api/contrib-highlighter/org/apache/lucene/search/vectorhighlight/FastVectorHighlighter.html#getBestFragment%28org.apache.lucene.search.vectorhighlight.FieldQuery,%20org.apache.lucene.index.IndexReader,%20int,%20java.lang.String,%20int%29] is not very helpful in explaining why.

h3. The drop index SPARQL update request throws an error, if the index does not exist. Is there a way to ask if the index exists, and if so - drop it?

*A:* This is where _luc2:list_ comes in handy. If you have an index named _"myIndex"_, you can execute the following SPARQL update and get response code 200, even when the index does not exist:
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
INSERT {
?indexUri luc2:dropIndex "" .
} WHERE {
?indexUri luc2:list "myIndex" .
};
{code}
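If the guarded drop is issued from application code, the update text can be generated from the index name. The Python sketch below only formats the update shown above; {{drop_index_if_exists}} is a hypothetical helper name, and executing the update against a repository is not shown.

```python
LUC2 = "http://www.ontotext.com/owlim/lucene2#"


def drop_index_if_exists(index_name):
    """Format an update that drops the index only if luc2:list reports it."""
    if '"' in index_name or "\\" in index_name:
        # Keep the sketch simple: refuse names that would need literal escaping.
        raise ValueError("index name must not contain quotes or backslashes")
    return (
        f"PREFIX luc2:<{LUC2}>\n"
        "INSERT {\n"
        '?indexUri luc2:dropIndex "" .\n'
        "} WHERE {\n"
        f'?indexUri luc2:list "{index_name}" .\n'
        "}"
    )


if __name__ == "__main__":
    print(drop_index_if_exists("myIndex"))
```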

h3. How to search for terms/phrases containing stop words?

*A:* In our terminology, a _term_ represents a single word of text, so what you are actually searching for is a _phrase_. The Lucene2 plug-in supports the Lucene query syntax, which means that you can search for a phrase such as "City of Manchester" by enclosing it in double quotes, as in the query below:
{code}
PREFIX luc2:<http://www.ontotext.com/owlim/lucene2#>
SELECT ?entity WHERE {
?entity luc2:index-name ('''"City of Manchester"''' "return.entities=true;limit=10;snippet.size=200")
}
{code}
There is a caveat, though: in order to use double quotes inside a literal, you need to use the additional quoting construct (the triple-quoted literal in the query above) - [http://www.w3.org/TR/sparql11-query/#QSynLiterals].

Only then will you get appropriate results. It is true that the analyzers filter out stop words, but this is not an issue, since the same analyzer is used for both indexing and search - the one that was specified at index creation time. More information on how this works can be found in the "Token Position Increments" section of [http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html].
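When phrase queries are generated by a program, both quoting layers have to be handled: the Lucene phrase quotes and the SPARQL long literal around them. The following Python sketch shows one way to do this. The helper names are hypothetical, and the escaping covers only backslashes and quote characters, based on the Lucene query syntax and the SPARQL long-literal grammar.

```python
def lucene_phrase(phrase):
    """Wrap text in double quotes so that Lucene treats it as a phrase query."""
    # Backslash-escape characters that would otherwise end or alter the phrase.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def sparql_long_literal(text):
    """Wrap text in a '''...''' SPARQL literal, which may contain double quotes."""
    # Backslashes and single quotes are escaped so they cannot close the literal early.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'''{escaped}'''"


if __name__ == "__main__":
    literal = sparql_long_literal(lucene_phrase("City of Manchester"))
    print(literal)  # '''"City of Manchester"'''
```

The printed literal can be used in place of the hand-written one in the query above.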

h3. Are prefix wildcards supported? Can I use both prefix and suffix wildcards on the same term?

*A:* Yes, you can. Go for it\!