The motivation behind the development of the Lucene2 Plug-in was the need for full-text search in GraphDB that is automatically kept up to date with the repository data. The existing OWLIM Lucene plug-in lacked incremental building and was not updated automatically. This plug-in fills these gaps and adds some useful features.
Features
maintaining a Lucene index that is always synced with the data;
multiple indices per repository with independent settings - the index options supported are as follows:
stripping the *ML tags in literals (false by default);
index auto-update (true by default);
enabling snippets in the index (enabling snippets will make the index significantly larger) (false by default);
specifying a Lucene analyzer (the default is Lucene's StandardAnalyzer);
a white list of predicates, whose values to index (the default is ALL);
a white list of entity rdf:type values to index (the default is ALL);
a white list of additional joins, used to restrict what is indexed beyond the type check;
basic wildcard support for the additional join values;
a white list of languages to index (the default is ALL);
full Lucene syntax for search;
simple molecules - only literals reachable through zero hops (i.e. a single predicate of an entity);
snippet extraction - retrieving snippets with highlighted search terms from the search result;
easy access to all parts of the triple that matched a query (i.e. an easy way to retrieve the subject, predicate and object of the triple whose object literal matches the query);
search flags:
binding the object of the query either to entity or literal;
returning only the top N entries of the Lucene query;
specifying snippet size per query.
User's Guide
Creating an index
To create an index, issue the following SPARQL update query:
where <index-options> can be a combination of the following options, separated by a semicolon ';':
stripMarkup=true|false - specifying whether to strip tags from HTML/XML literals (false by default);
autoUpdate=true|false - specifying whether to keep the index automatically up-to-date (true by default);
enableSnippets=true|false - specifying whether to enable snippets in this index (false by default);
analyzer=<analyzer-class-name> - specifying which Lucene analyzer to use when indexing literals in this index. There are two options here:
directly specifying the Lucene analyzer class name - in this case, the analyzer should either have a default constructor or a constructor accepting a single org.apache.lucene.util.Version parameter. If you specify an analyzer that does not have one of these constructors, the index will not be created.
specifying a class derived from com.ontotext.trree.plugin.lucene.AnalyzerFactory - as used in the "old" Lucene plug-in;
predicates=<comma-separated-list-of-URIs> - if specified, only the values of triples with these predicates will be indexed;
languages=<comma-separated-languages> - if specified, only literals tagged with the listed languages will be indexed;
types=<comma-separated-list-of-URIs> - if specified, only entities whose rdf:type is equal to one of the listed URIs will be indexed;
additionalJoins=<|-separated-predicate-object-pairs> - if specified, a white list of additional joins is checked after the type check; an entity for which none of the additional joins is found will not be indexed. Both URIs and literals are supported as objects, and literals can include spaces. Sample syntax:
To use a wildcard, leave the value empty, for example:
optionalJoins=<|-separated-predicate-object-pairs> - if specified, a white list of additional optional joins to validate. This option has the same behaviour as the additionalJoins, except that the check will succeed, and the entity will be indexed if the specified predicates are missing (in addition to the usual condition of a predicate-value match). The same syntax as additionalJoins is used, except wildcard values are not supported.
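The option list above can be assembled programmatically. The following is a minimal sketch, assuming only what the list states: options are key=value pairs separated by a semicolon, booleans are written as true or false, and list-valued options are comma-separated. The helper name build_index_options is hypothetical, the EnglishAnalyzer class name is used purely as an illustration, and the sketch does not cover how URIs or join pairs should be written.

```python
def build_index_options(**options):
    """Join index options into the 'key=value;key=value' form described above.

    Hypothetical helper: it only formats the string and does not check
    that the option names are ones the plug-in accepts.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Booleans are written as lowercase true/false.
            rendered = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # List-valued options (e.g. languages) are comma-separated.
            rendered = ",".join(value)
        else:
            rendered = str(value)
        parts.append(f"{key}={rendered}")
    return ";".join(parts)


options = build_index_options(
    stripMarkup=True,
    enableSnippets=True,
    analyzer="org.apache.lucene.analysis.en.EnglishAnalyzer",
    languages=["en", "de"],
)
print(options)
```

The resulting string is what would take the place of <index-options> in the index creation query.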
Examples:
Drop an index
Example:
List all indices
Example:
Search
Simple search, compatible with that of the old Lucene plug-in.
Advanced configuration properties:
| Parameter | Value | Default | Comment |
|---|---|---|---|
| return.entities | false, true or dedup | false | false - the query will return literals that matched the query, just like the old plug-in; true - the query will return entity IDs with predicate values that matched the query. A single entity ID can be returned more than once (when it has multiple predicate values that matched the query); dedup - the same as "true", except that duplicate entity IDs are filtered - if a single entity has multiple predicate values that matched the query, only the one with the top score is returned. |
| offset | int | 0 | Returns the results starting from the specified value |
| limit | int | 2^31 | Lucene query limit - the maximum number of results to return |
| snippet.size | int | 250 | The size of the returned snippets (if enabled) |
| class | string | none | Hit highlights are represented with <span>. This optional parameter provides the class, i.e. <span class="xxx"> |
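The three return.entities modes can be pictured with a small model. This is only an illustrative sketch of the behaviour described above, not the plug-in's implementation: the Hit type, the apply_return_entities function and the sample data are hypothetical, and ordering results by descending score is an assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    entity: str   # resource that owns the matching literal
    literal: str  # literal value that matched the query
    score: float  # Lucene score of the match


def apply_return_entities(hits, mode):
    """Model the documented modes of return.entities on a list of hits."""
    # Assumption: results are ordered by descending score.
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if mode == "false":
        # Return the literals that matched the query.
        return [h.literal for h in ranked]
    if mode == "true":
        # Return the entity of every hit; an entity may appear repeatedly.
        return [h.entity for h in ranked]
    if mode == "dedup":
        # Keep only the top-scoring hit of each entity.
        seen = set()
        result = []
        for h in ranked:
            if h.entity not in seen:
                seen.add(h.entity)
                result.append(h.entity)
        return result
    raise ValueError(f"unknown mode: {mode}")


hits = [
    Hit("ex:Manchester", "City of Manchester", 0.9),
    Hit("ex:Manchester", "Manchester", 0.7),
    Hit("ex:Leeds", "City of Leeds", 0.8),
]
for mode in ("false", "true", "dedup"):
    print(mode, apply_return_entities(hits, mode))
```

With this sample data, "true" lists ex:Manchester twice because two of its literals matched, while "dedup" lists it once.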
Examples
The following query returns the top 10 resources with snippets of size 20 and binds ?entity to the resource URI:
To get the top result and bind ?literal to the literal that is matched:
Additional predicates are allowed on bound search results whether for literals or resources:
?result http://www.ontotext.com/lucene2#entity ?entity - binds ?entity to the resource URI that has a predicate whose value is the literal that matched the query (basically the same as passing the "r" flag);
These examples will help you understand how to create an index and then execute searches on it.
The following query creates an index on all entities that have an rdfs:label tagged with @en or with no language tag, using an EnglishAnalyzer with snippets enabled.
Now that the index is created, you can run the following query to obtain the top 20 entries that start with "a", with their actual literal and the snippet where the literal occurs, including the Lucene score.
Benchmarking
This benchmark was run on the LUBM-50 dataset (6654856 explicit statements) using the default values for test memory and repository configuration.
The CPU used was Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz.
loading:
LUBM-50 without including the plug-in library: 485689 ms (< 486 s);
LUBM-50 with library included: 684011 ms.
creating a simple index over LUBM-50 with the following configuration: 2110 s (about 35 min)
creating an index over LUBM-50 with the following configuration (types): 140 s
creating an index over LUBM-50 with the following configuration (types and labels): 30 s
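To put the figures above in perspective, the short calculation below derives the loading overhead of including the plug-in library and converts the index creation times to minutes. It uses only the numbers reported above; the variable names are illustrative.

```python
# Loading times reported above, in milliseconds.
load_without_plugin_ms = 485689
load_with_plugin_ms = 684011

# Extra loading time when the plug-in library is included.
overhead_ms = load_with_plugin_ms - load_without_plugin_ms
overhead_pct = 100 * overhead_ms / load_without_plugin_ms
print(f"loading overhead: {overhead_ms} ms ({overhead_pct:.1f}%)")

# Index creation times reported above, in seconds.
index_times_s = {"simple": 2110, "types": 140, "types and labels": 30}
index_times_min = {name: round(s / 60, 1) for name, s in index_times_s.items()}
for name, minutes in index_times_min.items():
    print(f"{name}: {minutes} min")
```

On these figures, including the library adds roughly 198 s to loading, i.e. about 41% over the baseline.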
Dependencies & Deployment Details
GraphDB Plug-in API
Lucene 3.6.2
Apache commons-io 2.4
FAQ
How to use UNION with Lucene2?
A: Just like in a normal query. Consider the following example:
The above query joins the union part (with bindings for ?c and ?s) with the Lucene part on ?s, provided that the Lucene index contains entities of the right classes, i.e. of both Type1 and Type2. There are a few noteworthy details:
The Lucene query limit applies before the SPARQL query limit.
You can safely use DISTINCT in order to clean up duplicate results.
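The effect of the limit ordering can be seen in a small model. This is a sketch under stated assumptions, not the plug-in's code: lucene_search and run_query are hypothetical names, the hits and types are made up, and the join is reduced to a membership test.

```python
def lucene_search(scored_hits, lucene_limit):
    """Return the top lucene_limit hits by descending score."""
    ranked = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)
    return ranked[:lucene_limit]


def run_query(scored_hits, typed_entities, lucene_limit, sparql_limit):
    """Apply the Lucene limit first, then the join, then the SPARQL LIMIT."""
    top_hits = lucene_search(scored_hits, lucene_limit)
    # Join: keep only the hits that also satisfy the rest of the query.
    joined = [entity for entity, _score in top_hits if entity in typed_entities]
    return joined[:sparql_limit]


scored_hits = [("ex:a", 0.9), ("ex:b", 0.8), ("ex:c", 0.7), ("ex:d", 0.6)]
typed_entities = {"ex:b", "ex:c", "ex:d"}  # ex:a fails the type patterns

# The Lucene limit cuts the hit list before the join, so only one row survives.
small_lucene_limit = run_query(scored_hits, typed_entities, lucene_limit=2, sparql_limit=2)
# A larger Lucene limit lets the SPARQL LIMIT be filled.
large_lucene_limit = run_query(scored_hits, typed_entities, lucene_limit=4, sparql_limit=2)
print(small_lucene_limit, large_lucene_limit)
```

In this model, a Lucene limit equal to the SPARQL limit returns fewer rows than requested, because one of the top hits is discarded by the join.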
If I UNION up two Lucene2 queries (on different Lucene2 indices), will the snippets and scoring still work?
A: The short answer is no, because Lucene scores are generated per query, so basically one cannot execute two different Lucene queries and expect adequate scoring when joining the results. Consider the following example:
While the query above is valid, it is not adequate because of the reasons mentioned earlier. The results will be incorrect since a different scoring is used for the two queries. Instead of using UNION, you should create a single index for Type1, Type2 and Predicate1, and execute just one query.
Why are the snippets sometimes shorter than the requested snippet size?
A: It is the way Lucene's FastVectorHighlighter generates snippets. For example, if the match is on an indexed property that is relatively short (such as a report title), the snippet tends to be shorter than the entire title, even though the title is shorter than the requested snippet length. Lucene tends to cut off the characters before the first term match. The getBestFragment method's javadoc is not very helpful in explaining why.
The drop index SPARQL update request throws an error, if the index does not exist. Is there a way to ask if the index exists, and if so - drop it?
A: This is where luc2:list comes in handy. If you have an index named "myIndex", you can execute the following SPARQL update and get response code 200, even when the index does not exist:
How to search for terms/phrases containing stop words?
A: In our terminology, a term represents a single word of text; what you are actually searching for is a phrase. The Lucene2 plug-in supports the Lucene query syntax, so a phrase such as "City of Manchester" must be enclosed in quotes, as in the query below:
Only then will you get appropriate results. Analyzers do filter out stop words, but this is not an issue because the same analyzer, the one specified at index creation time, is used for both indexing and search. More information on how this works can be found in the "Token Position Increments" section of http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html.
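The reason stop words do not break phrase search can be sketched with a toy model of position increments. This is not Lucene's code: the stop-word set, analyze and phrase_matches are hypothetical simplifications that assume whitespace tokenization and that a removed stop word still leaves a gap in the token positions.

```python
STOP_WORDS = {"of", "the", "a", "an"}  # illustrative stop-word set


def analyze(text):
    """Lowercase, split on whitespace, drop stop words, keep positions."""
    tokens = enumerate(text.lower().split())
    return [(pos, tok) for pos, tok in tokens if tok not in STOP_WORDS]


def phrase_matches(document, phrase):
    """Check whether the phrase occurs in the document at matching offsets.

    Both texts go through the same analyzer, so a removed stop word
    leaves the same gap in the document and in the query.
    """
    doc_positions = {}
    for pos, tok in analyze(document):
        doc_positions.setdefault(tok, set()).add(pos)
    query = analyze(phrase)
    if not query:
        return False
    first_pos, first_tok = query[0]
    for start in doc_positions.get(first_tok, ()):
        if all(
            start + (pos - first_pos) in doc_positions.get(tok, ())
            for pos, tok in query
        ):
            return True
    return False


print(analyze("City of Manchester"))
print(phrase_matches("The City of Manchester is in England", "City of Manchester"))
print(phrase_matches("Manchester City football club", "City of Manchester"))
```

In this model the query keeps "city" at position 0 and "manchester" at position 2, so it only matches text where the two words are separated by exactly one token.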
Are prefix wildcards supported? Can I use both prefix and suffix wildcards on the same term?