Lucene2 Plug-in

This documentation is NOT for the latest version of GraphDB.

Latest version - GraphDB 6.6

OWLIM Documentation

[OWLIM 5.6]

Description/Motivation

OWLIM needs full-text search over literals that stays automatically in sync with the repository data. The existing OWLIM Lucene plug-in lacks incremental index building and is not kept up-to-date automatically; this plug-in fills these gaps and adds some useful features.

Features

  • maintain a Lucene index that is always in sync with the data
  • multiple indexes per repository with independent settings. Index options supported:
    • whether to strip *ML tags in literals or not (default is to not strip markup)
    • whether the index is auto-updated or not (default is to keep the index up-to-date)
    • whether to enable snippets in the index or not (enabling snippets will make the index significantly larger) (default is no support for snippets)
    • specifying Lucene analyzer to use (the default is Lucene's StandardAnalyzer)
    • a white list of predicates whose values to index (default is all)
    • a white list of rdf:type values of entities to index (default is all)
    • a white list of additional joins in order to restrict what should be indexed more than just by type
    • basic wildcard support for the additional join values
    • a white list of languages to index (default is all)
  • full Lucene syntax for search
  • simple molecules - only literals reachable through zero hops (i.e. a single predicate of an entity)
  • snippet extraction - retrieve snippet with highlighted search terms from the search result
  • easy access to all parts of the triple that matched a query (i.e. easy way to retrieve the subject, predicate and object of the triple having an object literal matching the query)
  • search flags
    • binding the object of the query to either entity or literal
    • returning only the top N entries from the Lucene query
    • specifying snippet size per query

User's Guide

Creating an index

To create an index, issue the following SPARQL update query

where <index-options> can be a combination of the following options, separated by a semicolon ';':

  • stripMarkup=true|false specifying whether to strip tags from HTML/XML literals (default is false)
  • autoUpdate=true|false specifying whether to keep this index automatically up-to-date (default is true)
  • enableSnippets=true|false specifying whether to enable snippets in this index (default is false)
  • analyzer=<analyzer-class-name> Lucene analyzer to use when indexing literals in this index. There are two possibilities here:
    • specifying Lucene analyzer class name directly - in that case the analyzer should either have a default constructor or a constructor accepting a single org.apache.lucene.util.Version parameter. If you specify an analyzer that doesn't have one of those constructors the index won't be created
    • specifying a class derived from com.ontotext.trree.plugin.lucene.AnalyzerFactory - as used in the "old" Lucene plugin
  • predicates=<comma-separated-list-of-URIs> if specified, only triples with those predicates will be indexed
  • languages=<comma-separated-languages> if specified, only literals tagged with the listed languages will be indexed
  • types=<comma-separated-list-of-URIs> if specified, a white list of types to index (i.e. will only index entities that have rdf:type equal to one of the specified URIs)
  • additionalJoins=<|-separated-predicate-object-pairs> if specified, a white list of additional joins will enforce a check after the type check and in case no additional joins are found for the entity, it will not be indexed. Supports both URIs and Literals as objects. Literals can include spaces. Sample syntax: additionalJoins=urn:ontology:predicate,longer value|urn:ontology:predicate,another longer value. In order to provide a wildcard, just leave the value empty like this additionalJoins=urn:ontology:predicate,;
  • optionalJoins=<|-separated-predicate-object-pairs> if specified, a white list of additional optional joins to validate. The same behaviour as additionalJoins, except the check will succeed and the entity will be indexed if the specified predicates are missing (in addition to the usual condition of a predicate-value match). The same syntax as additionalJoins is used, except wildcard values are not supported
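
The option string syntax described above can also be assembled programmatically. The following is a minimal Python sketch and not part of the plug-in: the helper names build_index_options and format_joins are hypothetical, and the code only encodes the rules stated in this section (options joined with ';', joins written as predicate,value pairs joined with '|', an empty value acting as a wildcard). No escaping rules are documented here, so none are applied.

```python
def format_joins(pairs):
    """Render (predicate, value) pairs as 'pred,value|pred,value'.

    A value of None is rendered as an empty string, which this section
    describes as the wildcard form for additionalJoins.
    """
    rendered = []
    for predicate, value in pairs:
        text = "" if value is None else value
        rendered.append(predicate + "," + text)
    return "|".join(rendered)


def build_index_options(**options):
    """Join key=value options with ';' (hypothetical helper, see lead-in)."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"   # lowercase booleans, as in the option list
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)                # comma-separated URIs or languages
        parts.append(f"{key}={value}")
    return ";".join(parts)


options = build_index_options(
    stripMarkup=True,
    enableSnippets=True,
    languages=["en", "de"],
    additionalJoins=format_joins([
        ("urn:ontology:predicate", "longer value"),
        ("urn:ontology:predicate", None),          # wildcard: any value
    ]),
)
print(options)
```

The resulting string is what would be passed as <index-options>; options that are left out fall back to the defaults listed above.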

Examples:

Drop an index

Example:

List all indices

Example:

Search

  1. Simple search, compatible with that of the old Lucene plug-in
  2. Advanced search with configuration properties. The following options are supported:
    • return.entities (value: false, true or dedup; default: false)
      • false - the query will return the literals that matched the query, just like the old plug-in
      • true - the query will return the ids of entities that have predicate values that matched the query. A single instance id can be returned more than once (when it has multiple predicate values that matched the query)
      • dedup - the same as "true", except that duplicate instance ids are filtered: if a single entity has multiple predicate values that matched the query, only the one with the top score is returned
    • offset (value: int; default: 0) - returns the results starting from the specified value
    • limit (value: int; default: 2^31) - Lucene query limit, i.e. the maximum number of results to return
    • snippet.size (value: int; default: 250) - the size of the returned snippets (if enabled)
    • class (value: string; default: none) - hit highlights are represented with <span>. This optional parameter provides the class, i.e. <span class="xxx">
  3. Examples
    • The following query returns the top 10 resources, snippets are of size 20, ?entity will bind to the resource URI
    • To get the top result and bind ?literal to the literal that matched:
  4. Additional predicates are allowed on bound search results whether for literals or resources:
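
As an aside on the return.entities option listed above, the difference between its three modes can be modelled in a few lines of Python. This is only an illustrative sketch: the Hit type, the function name and the sample data are hypothetical, and it assumes hits are ranked by descending score, which the plug-in's implementation may or may not do in exactly this way.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    entity: str     # subject of the matching triple
    literal: str    # object literal that matched the query
    score: float    # Lucene score of the match


def apply_return_entities(hits, mode):
    """Model the false / true / dedup modes on a list of hits."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if mode == "false":
        return [h.literal for h in ranked]       # matched literals
    if mode == "true":
        return [h.entity for h in ranked]        # entities, duplicates kept
    if mode == "dedup":
        seen, result = set(), []
        for h in ranked:                         # first occurrence has the top score
            if h.entity not in seen:
                seen.add(h.entity)
                result.append(h.entity)
        return result
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical hits: entity urn:a matches on two different predicate values.
hits = [
    Hit("urn:a", "Alpha label", 0.9),
    Hit("urn:b", "Beta label", 0.7),
    Hit("urn:a", "Alpha comment", 0.4),
]
for mode in ("false", "true", "dedup"):
    print(mode, apply_return_entities(hits, mode))
```

With this data, "true" lists urn:a twice, while "dedup" keeps only its top-scoring occurrence.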

Real-world examples

To help you get started more quickly, these examples show how to create an index and then execute searches on it.

The following query creates an index on all entities that have an rdfs:label tagged @en or with no language tag, using an EnglishAnalyzer, with snippets enabled.

Now that the index is created, you can run the following query to obtain the top 20 entries that start with "a" with their actual literal and snippet where the literal occurs, including the Lucene score.

Benchmarking

Benchmark testing utilized a LUBM-50 dataset (6654856 explicit statements) using default values for test memory and repository configuration. CPU used was Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz.

  • loading:
    • loading LUBM-50 without including the plugin library: 485689 ms ( < 486 s)
    • loading LUBM-50 with library included: 684011 ms
  • creating a simple index over LUBM-50 with the following configuration: 2110 seconds (35 mins)
  • creating an index over LUBM-50 with following configuration (types): 140 s
  • creating an index over LUBM-50 with following configuration (types and labels): 30 s
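
To put the figures above in relation to each other, the following short Python snippet derives the loading overhead and converts the index build time to minutes. It uses only the numbers reported in this section; the variable names are arbitrary.

```python
# Figures copied from the benchmark list above.
load_without_plugin_ms = 485_689
load_with_plugin_ms = 684_011
simple_index_s = 2110

# Extra loading time caused by including the plug-in library.
overhead_ms = load_with_plugin_ms - load_without_plugin_ms
overhead_pct = 100 * overhead_ms / load_without_plugin_ms

print(f"loading overhead: {overhead_ms / 1000:.1f} s ({overhead_pct:.1f} %)")
print(f"simple index build: {simple_index_s / 60:.1f} min")
```

In other words, including the library adds roughly 198 s (about 41 %) to the loading time in this test.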

Dependencies & Deployment Details

Dependencies:

  • OWLIM Plug-in API
  • Lucene 3.6.2
  • Apache commons-io 2.4

FAQ

How to use UNION with Lucene2?

A: Just like in a normal query. Consider the following example:

The above query joins the union part (with bindings for ?c and ?s) with the Lucene part on ?s. This works provided that the Lucene index contains things of the right classes, i.e. things of type Type1 AND Type2. There are a few noteworthy details though:

  1. The Lucene query limit applies before the SPARQL query limit.
  2. You can safely use DISTINCT in order to clean up duplicate results.
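
The effect of the two points above can be illustrated with a small Python model. It is a sketch under assumptions, not the plug-in's actual evaluation code: run_query and the sample data are hypothetical, and the model simply applies the Lucene limit first, then the join on ?s, then DISTINCT and finally the SPARQL LIMIT.

```python
def run_query(lucene_hits, union_subjects, lucene_limit, sparql_limit):
    """Model the order: Lucene limit, join on ?s, DISTINCT, SPARQL LIMIT."""
    top_hits = lucene_hits[:lucene_limit]                   # 1. Lucene query limit
    joined = [s for s in top_hits if s in union_subjects]   # 2. join with the UNION part
    distinct = list(dict.fromkeys(joined))                  # 3. DISTINCT, order preserved
    return distinct[:sparql_limit]                          # 4. SPARQL LIMIT


# Hypothetical data: hits ordered by score, and the subjects bound by the UNION part.
lucene_hits = ["s1", "s2", "s2", "s3", "s4", "s5"]
union_subjects = {"s2", "s4", "s5"}

narrow = run_query(lucene_hits, union_subjects, lucene_limit=3, sparql_limit=10)
wide = run_query(lucene_hits, union_subjects, lucene_limit=6, sparql_limit=2)
print(narrow)
print(wide)
```

With a Lucene limit of 3, only one subject survives the join even though the SPARQL limit would allow ten, so a Lucene limit that is too small can hide matches that the rest of the query would have accepted.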

If I union up two lucene2 queries (on different lucene2 indices), will the snippets and scoring still work?

A: The short answer is no, because Lucene scores are generated per query, so one cannot execute two different Lucene queries and expect adequate scoring when joining the results. Consider the following example:

While the query above is valid, it is not sane because of the reasons mentioned earlier. The results will be incorrect since different scoring is used for the two queries. Instead of using UNION, you should create a single index for Type1, Type2 and Predicate1 and execute just one query.

Sometimes the snippets are shorter than the number of snippet chars requested, why?

A: This is due to the way Lucene's FastVectorHighlighter generates snippets. For example, if the match is on an indexed property that is relatively short (such as a report title), then the snippet tends to be shorter than the entire title even though the title length is less than the requested snippet length. Lucene tends to cut off the snippet text that precedes the first term match. The getBestFragment method's javadoc is not very helpful in explaining why.

The drop index SPARQL update request throws an error if the index does not exist. Is there any way to ask if the index exists and if so - drop it?

This is where the luc2:list comes in handy. If you have an index named "myIndex", then you can execute the following SPARQL update and get response code 200, even when the index does not exist:

How to search for terms/phrases containing stop words?

First of all, let's get the terminology right - a term represents a word of text. What you are actually trying to search for is a phrase. In addition, the Lucene2 plug-in supports the Lucene query syntax. This means that if you search for phrases such as "City of Manchester" like this (mind the quotes):

There is a caveat: in order to be able to use " in a literal, you need to use the additional quoting construct; see http://www.w3.org/TR/sparql11-query/#QSynLiterals

You will get appropriate results. Indeed, the analyzers filter stop words, but this isn't an issue, since the same analyzer is used for indexing and search: the one that was specified at index creation time. More information on how this works can be found in the "Token Position Increments" section of http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html.
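
Regarding the quoting caveat above: SPARQL 1.1 allows a double quote inside a double-quoted string literal when it is escaped with a backslash, and also allows it unescaped inside a single-quoted literal. The following Python helper is a minimal sketch of the first approach; the function name sparql_string_literal is hypothetical and the helper only produces the literal, not a complete query.

```python
def sparql_string_literal(text):
    """Return text as a double-quoted SPARQL literal.

    Backslashes, double quotes and line breaks are escaped, which are the
    characters that may not appear raw inside a short "..." literal.
    """
    escaped = (
        text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
    )
    return f'"{escaped}"'


phrase_query = '"City of Manchester"'         # Lucene phrase query, quotes included
print(sparql_string_literal(phrase_query))    # "\"City of Manchester\""
```

The single-quoted form '"City of Manchester"' is an equivalent alternative when the phrase itself contains no single quotes.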

Are prefix wildcards supported? Can I use both prefix and suffix wildcards on the same term?

Yes, you can. Go for it!
