Lucene4 Plug-in (deprecated)

This documentation is NOT for the latest version of GraphDB.

Latest version - GraphDB 7.1

Description/Motivation

The Lucene4 plug-in for GraphDB provides very fast facet (aggregation) searches of the kind normally available through external Apache Solr services, with the additional benefit of staying automatically up to date with the GraphDB repository data.

Features

  • maintain a Lucene index that is always synced with the data
  • multiple indexes per repository with independent settings. Index options supported:
    • stripping the *ML tags in literals (false by default);
    • index auto-update (true by default);
    • specifying a Lucene analyzer (the default is Lucene's StandardAnalyzer);
    • a white list of predicates whose values are to be indexed;
    • a white list of entity rdf:type values to be indexed (the default is ALL);
    • a white list of languages to index (the default is ALL);
    • a white list of predicates whose values are to be added to the facets index;
  • full Lucene syntax for search;
  • simple molecules - only literals reachable through zero hops (i.e. a single predicate of an entity);
  • snippet extraction - retrieving snippets with highlighted search terms from the search result;
  • search flags:
    • results paging through offset and limit parameters;
    • specifying snippet size per query;
    • specifying facets to aggregate.

Differences with Lucene2

Architecture

Lucene2 indexes each relevant statement (triple) as a single Lucene document. This makes it easy to keep the index up to date, but has a few undesirable effects: a single entity can be returned more than once for a given search (because more than one of its literals matches the query), and results cannot be sorted by specific predicates.

The Lucene4 plug-in creates a single Lucene document per entity, with a field for each of the predicates listed in the index configuration. This leads to slower update times, but solves the two problems above. These decisions have some other minor implications, described below.

Another significant feature of Lucene4 is facets support, which is built into Apache Lucene 4.x.x.
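
The difference between the two document models can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the triples, the urn:label and urn:comment predicates, and the substring "search" are hypothetical and have nothing to do with how the plug-in or Lucene actually store or match documents.

```python
from collections import defaultdict

def statement_documents(triples, predicates):
    """Lucene2-style model: one document per relevant statement."""
    return [
        {"entity": s, "predicate": p, "text": o}
        for s, p, o in triples
        if p in predicates
    ]

def entity_documents(triples, predicates):
    """Lucene4-style model: one document per entity, one field per predicate."""
    docs = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        if p in predicates:
            docs[s][p].append(o)
    return {entity: dict(fields) for entity, fields in docs.items()}

# Hypothetical sample data.
triples = [
    ("urn:a", "urn:label", "red apple"),
    ("urn:a", "urn:comment", "an apple that is red"),
    ("urn:b", "urn:label", "green pear"),
]
predicates = {"urn:label", "urn:comment"}

per_statement = statement_documents(triples, predicates)
per_entity = entity_documents(triples, predicates)

# Naive substring matching stands in for a real full-text search.
statement_hits = [d["entity"] for d in per_statement if "apple" in d["text"]]
entity_hits = [
    entity
    for entity, fields in per_entity.items()
    if any("apple" in value for values in fields.values() for value in values)
]

print(statement_hits)  # urn:a appears twice, once per matching literal
print(entity_hits)     # urn:a appears once
```

With one document per statement, the entity urn:a is reported once for every matching literal; with one document per entity, it is reported once.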

Index options

Predicates

The predicates index option is now mandatory and must not be empty. Indexing all predicates for a given entity is not supported.

Additional joins

The additionalJoins index option from Lucene2 has been removed in Lucene4, but it can be emulated. Consider the following Lucene2 index creation:

In order to do this in Lucene4, create the same index, but add urn:join to predicates instead of additionalJoins:

The filtering itself is done through searching. Since Lucene4 creates a Lucene document field for each predicate, and the field name is exactly the predicate URI, you can make use of that through the Lucene query syntax:

Where the

means:

  • + - the following query is mandatory;
  • is a query in the form field:value. However, since the field name contains :, it must be escaped with a backslash. The backslash is also an escape character in SPARQL, so it must be escaped again for the plug-in to see a single backslash. Forward slashes are also special in the Lucene syntax and need to be escaped as well.
    This means that if the join predicate is http://example.com/slash, the proper query would be "+(query) +http\\:\/\/example.com\/slash:joinValue".
  • also note that, whatever the original FTS query is, it is best to put it in brackets with a plus in front. If you ask for
    then Lucene will return all entities that match either query OR
    which is not the intent.
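
The two layers of escaping described above can be sketched in Python. The helper names are hypothetical, and the set of special characters is the one documented for the classic Lucene query parser. The sketch doubles the backslash in front of forward slashes as well, because the SPARQL grammar defines only a fixed set of escape sequences; the sample string in the text above writes them with a single backslash, so check which form your SPARQL parser accepts.

```python
# Characters with special meaning in the Lucene query syntax.
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')

def escape_lucene_field(name):
    """Layer 1: escape a predicate URI so Lucene reads it as a field name."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in name)

def escape_sparql_string(text):
    """Layer 2: escape backslashes and double quotes for a "..." SPARQL literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def join_query(fts_query, join_predicate, join_value):
    """Build the literal content for '+(query) +field:value'.

    Assumes fts_query is already valid Lucene syntax and that join_value
    contains no Lucene special characters.
    """
    lucene = "+(" + fts_query + ") +" + escape_lucene_field(join_predicate) + ":" + join_value
    return escape_sparql_string(lucene)

print(join_query("query", "urn:join", "joinValue"))
# +(query) +urn\\:join:joinValue
```

The printed text is what goes between the double quotes of the SPARQL string literal; after SPARQL unescaping, the plug-in sees a single backslash before the colon.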

Optional joins

The optionalJoins parameter is still present in Lucene4. Entities that either have one of the specified predicate values or lack the predicate completely are indexed; all other entities are not. The optional join predicate values are indexed like the other predicates, in one Lucene field per predicate, which makes it possible to search for specific values as described in the Additional joins section above.

For example, if you create the following index:

and then insert some entities:

Querying on urn:join works in the same way as for additional joins above:

Entities that do not have the optional join predicate get a default value. The default value is OPTIONALJOINDEFAULT and can be changed with the optionalJoinDefaults index option. For example, with the above entities, searching for

will return only

To specify a different default value for <urn:join>, create an index such as:

Optional join predicate values are indexed, but not tokenised, which means that queries for them should match exactly.
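
The optional join rule can be summarised with a short Python sketch. It models only the decision described in this section (index the entity or not, and which values end up in the optional join field). The function name and data shapes are hypothetical, and the handling of multi-valued predicates is an assumption based on the wording "has the predicate with one of the specified values".

```python
DEFAULT_VALUE = "OPTIONALJOINDEFAULT"

def optional_join_fields(entity, optional_joins, defaults=None):
    """Return the optional join fields for an entity, or None if it is skipped.

    entity:         dict mapping predicate -> list of values the entity has
    optional_joins: dict mapping predicate -> set of accepted values
    defaults:       dict mapping predicate -> replacement default value
    """
    defaults = defaults or {}
    fields = {}
    for predicate, accepted in optional_joins.items():
        values = entity.get(predicate, [])
        if not values:
            # The entity lacks the predicate: index the default value instead.
            fields[predicate] = [defaults.get(predicate, DEFAULT_VALUE)]
        elif any(value in accepted for value in values):
            # The entity has an accepted value: index all of its values.
            fields[predicate] = list(values)
        else:
            # The predicate is present but has no accepted value: not indexed.
            return None
    return fields

joins = {"urn:join": {"yes"}}
print(optional_join_fields({"urn:join": ["yes"]}, joins))  # {'urn:join': ['yes']}
print(optional_join_fields({"urn:join": ["no"]}, joins))   # None
print(optional_join_fields({}, joins))                     # {'urn:join': ['OPTIONALJOINDEFAULT']}
```

Because the field values are not tokenized, a search on such a field has to use the full value, including the default value when looking for entities without the predicate.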

Search options

  • The return.entities parameter is removed. Lucene4 always acts as Lucene2 with return.entities=dedup.
  • The order.by parameter is added to allow specifying predicates to sort by.
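
The ordering rules for order.by (ascending by default for predicates, a leading minus for descending, and the special value score sorting from highest to lowest by default) can be illustrated with the Python emulation below. It is a sketch of the semantics only, not of the plug-in's implementation; the function names and the sample results are hypothetical.

```python
def parse_order_by(spec):
    """Split e.g. 'urn:p1,-urn:p2,score' into (key, descending) pairs."""
    keys = []
    for item in spec.split(","):
        item = item.strip()
        negated = item.startswith("-")
        name = item[1:] if negated else item
        # Predicates sort ascending by default, score sorts descending by
        # default; a leading minus flips whichever default applies.
        descending = (name == "score") != negated
        keys.append((name, descending))
    return keys

def order_results(results, spec):
    """Sort result dicts by the keys in spec; earlier keys take priority."""
    ordered = list(results)
    # Python's sort is stable, so applying the keys from last to first
    # yields the usual multi-key ordering.
    for name, descending in reversed(parse_order_by(spec)):
        ordered.sort(key=lambda row: row[name], reverse=descending)
    return ordered

# Hypothetical search results.
results = [
    {"entity": "urn:a", "urn:predicate1": "x", "urn:predicate2": 1, "score": 0.5},
    {"entity": "urn:b", "urn:predicate1": "x", "urn:predicate2": 2, "score": 0.9},
    {"entity": "urn:c", "urn:predicate1": "w", "urn:predicate2": 3, "score": 0.1},
]

ordered = order_results(results, "urn:predicate1,-urn:predicate2")
print([row["entity"] for row in ordered])  # ['urn:c', 'urn:b', 'urn:a']
```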

User's Guide

Creating an index

To create an index, issue the following SPARQL update query:

where <index-options> can be a combination of the following options, separated by a semicolon ';':

  • stripMarkup=true|false - specifying whether to strip tags from HTML/XML literals (false by default);
  • autoUpdate=true|false - specifying whether to keep the index automatically up-to-date (true by default);
  • enableSnippets=true|false - specifying whether to enable snippets in this index. (As of 2013-12-15, this is a dummy flag and snippets are always enabled. You should generally pass a meaningful value here, in case we optimise our implementation later.)
  • analyzer=<analyzer-class-name> - specifying which Lucene analyzer to use when indexing literals in this index. There are two options here:
    • directly specifying the Lucene analyzer class name - in this case, the analyzer should either have a default constructor or a constructor accepting a single org.apache.lucene.util.Version parameter. If you specify an analyzer that does not have one of these constructors, the index will not be created.
    • specifying a class derived from com.ontotext.trree.plugin.lucene4.AnalyzerFactory
  • predicates=<comma-separated-list-of-URIs> - only triples with these predicates will be indexed. This option is mandatory and must not be empty;
  • languages=<comma-separated-languages> - if specified, only literals tagged with the listed languages will be indexed;
  • types=<comma-separated-list-of-URIs> - if specified, a white list of types will be indexed (i.e. only entities that have rdf:type equal to one of the specified URIs will be indexed);
  • facets=<comma-separated-list-of-URIs> - if specified, the listed predicates and their values will be indexed in the facets index;
  • optionalJoins=<|-separated-predicate-object-pairs> - if specified, a white list of additional optional joins to validate. Supports both URIs and Literals as objects. Literals can include spaces. Sample syntax:

    An entity is indexed only if, for each specified predicate, it either has the predicate with one of the specified values or does not have the predicate at all. If an entity is indexed, a field is created for each optional join predicate, holding all of its values. If the entity does not have the predicate, the field holds the default value for that predicate, as specified with optionalJoinDefaults. The value is indexed, but not tokenized, so any search in an optional join field should be an exact match.
  • optionalJoinDefaults=<|-separated-predicate-object-pairs> - if specified, provides a different default value to be indexed in the field of entities that do not have the optional join predicate. (The default for all predicates is "OPTIONALJOINDEFAULT");
  • sortPredicates=<comma-separated-list-of-URIs> - predicate values that are used for sorting during search time. Only predicates specified here can be passed to order.by.
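
The <index-options> string is a plain ;-separated list of name=value pairs, so it can be assembled programmatically. The Python helper below is a hypothetical sketch that covers only the flag and comma-separated list options. How URIs must be written inside the option string is not shown in this section, so the helper passes them through unchanged, and it leaves out optionalJoins and optionalJoinDefaults because their exact syntax is not reproduced here.

```python
def build_index_options(predicates, strip_markup=None, auto_update=None,
                        languages=None, types=None, facets=None,
                        sort_predicates=None):
    """Assemble a ';'-separated index option string (hypothetical helper)."""
    if not predicates:
        # The predicates option is mandatory and must not be empty.
        raise ValueError("predicates must contain at least one URI")

    def flag(value):
        return "true" if value else "false"

    options = []
    # Flags are emitted only when given, so the plug-in defaults apply otherwise.
    if strip_markup is not None:
        options.append("stripMarkup=" + flag(strip_markup))
    if auto_update is not None:
        options.append("autoUpdate=" + flag(auto_update))
    options.append("predicates=" + ",".join(predicates))
    for name, uris in (("languages", languages), ("types", types),
                       ("facets", facets), ("sortPredicates", sort_predicates)):
        if uris:
            options.append(name + "=" + ",".join(uris))
    return ";".join(options)

print(build_index_options(["urn:p1", "urn:p2"], strip_markup=True,
                          languages=["en"], facets=["urn:p2"]))
# stripMarkup=true;predicates=urn:p1,urn:p2;languages=en;facets=urn:p2
```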

Examples:

Drop an index

Example:

List all indices

Example:

Query indices health

Indices might become corrupt due to disk failure or other issues. The luc4:healthCheck predicate returns a list of indices along with their health status. In the example below, ?uri will bind to

Search

  1. Simple search
    Returns all matching entities' ids as the ?entity binding.
  2. Advanced search

    The options are given as a ;-separated list of option=value pairs. What is bound as ?result depends on whether the options specify a facet query or not. The following is a list of all supported options:

    Parameter | Value | Default | Comment
    offset | int | 0 | Returns the results starting from the specified offset.
    limit | int | 2^31-1 | Lucene query limit: the maximum number of results to return.
    snippet.size | int | 250 | The size of the returned snippets.
    class | string | none | Hit highlights are represented with <span>. This optional parameter provides the class, i.e. <span class="xxx">.
    facets | string | none | A comma-separated list of facet predicates whose values to aggregate. All predicates specified here must also have been specified in the facets index option. Passing a non-empty list of facet predicates changes the way results are returned, see Facets example below.
    facets.limit | int | 2^31-1 | The number of top values to return per facet predicate. For example, specifying facets=urn:facet1,urn:facet2;facets.limit=5 will return at most ten facet results in total.
    order.by | string | score | A comma-separated list of indexed predicates. If specified, the results are ordered by the values of the specified predicates, in the specified order. Values are sorted in ascending order. To sort by a predicate in descending order, prepend - (minus) to it. For example, order.by=urn:predicate1,-urn:predicate2 sorts the results by the value of urn:predicate1 in ascending order; results that have the same value for urn:predicate1 are then sorted by the value of urn:predicate2 in descending order. The special value score (lower-case) can be used instead of a predicate to order by the normal Lucene score (descending by default). For example, order.by=urn:predicate,score sorts by urn:predicate first (in ascending order) and then by score, while order.by=score,urn:predicate sorts first by score and then by urn:predicate. Passing -score to reverse the order (from lowest to highest score) is also allowed. Predicates on this list should be specified as sortPredicates when the index is created. Ordering by a predicate that is not listed in sortPredicates leads to unpredictable ordering.
  3. Faceted search results - for searches where the facets search option is specified, the search predicate (e.g. luc4:indexName) binds a single dummy blank node to its object. You must then query for luc4:entities to get entities that match the query, and luc4:facets to get faceted results. The following example demonstrates this and the related special predicates for extra result details:

    Note the use of UNION to avoid the Cartesian product of the set of document and facet results. This is the recommended way to retrieve both documents and facets results.
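
The ;-separated search option list described under Advanced search can also be assembled programmatically. The Python helper below is a hypothetical sketch: the function and argument names are assumptions, while the option names come from the list of supported options above. It also turns a page number and page size into offset and limit.

```python
def build_search_options(page=1, page_size=None, snippet_size=None,
                         css_class=None, facets=None, facets_limit=None,
                         order_by=None):
    """Assemble a ';'-separated search option string (hypothetical helper)."""
    options = []
    if page_size is not None:
        # Pages are numbered from 1, so page 2 with size 10 starts at offset 10.
        options.append("offset=%d" % ((page - 1) * page_size))
        options.append("limit=%d" % page_size)
    if snippet_size is not None:
        options.append("snippet.size=%d" % snippet_size)
    if css_class:
        options.append("class=" + css_class)
    if facets:
        options.append("facets=" + ",".join(facets))
    if facets_limit is not None:
        options.append("facets.limit=%d" % facets_limit)
    if order_by:
        options.append("order.by=" + ",".join(order_by))
    return ";".join(options)

# Second page of ten results, ordered by a hypothetical urn:label predicate.
print(build_search_options(page=2, page_size=10, order_by=["urn:label"]))
# offset=10;limit=10;order.by=urn:label
```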

Additional predicates allowed on bound search results

Additional predicates allowed on bound facets results

Examples

  • Return all results, ?entity will bind to the resource URI.
  • Return the top 10 resources, snippets are of size 20, ?entity will bind to the resource URI.
  • Return the top 10 resources, starting from offset 10 (i.e. with a page size of 10, return the second page). The documents will be sorted by the value of rdfs:label instead of the Lucene score.

Real-world examples

These examples will help you understand how to create an index and then execute searches on it.
The following query creates an index on all entities that have an rdfs:label with @en or no language tag, using an EnglishAnalyzer with snippets enabled.

Now that the index is created, you can run the following query to obtain the top 20 entries that start with "a", along with the snippet where the match occurs and the Lucene score.

Facets example

Consider the following RDF data (in Turtle format):

Now create an index, using rdf:type and test:facet as facet predicates, indexing rdfs:comment and having no type restriction:

Let's get some results now:

The result bindings will look like the table below; an empty cell means that the value is unbound:

score | entity | facetValue | facetCount | facetPredicate
1.0 | urn:a | | |
1.0 | urn:b | | |
1.0 | urn:c | | |
1.0 | urn:d | | |
 | | facet-value-1 | 2.0 | test:facet
 | | facet-value-2 | 1.0 | test:facet
 | | urn:Type1 | 3.0 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type
 | | urn:Type2 | 1.0 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type
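
Because entity rows and facet rows arrive in the same result set, client code usually needs to separate them again. The Python sketch below does this for the bindings in the table above, with unbound values represented as None. The row construction helpers and function names are hypothetical; a real client would read the rows from its SPARQL library instead.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def hit(entity, score):
    """An entity row: the facet columns are unbound."""
    return {"score": score, "entity": entity,
            "facetValue": None, "facetCount": None, "facetPredicate": None}

def facet(predicate, value, count):
    """A facet row: the score and entity columns are unbound."""
    return {"score": None, "entity": None,
            "facetValue": value, "facetCount": count, "facetPredicate": predicate}

# The bindings from the table above.
rows = [
    hit("urn:a", 1.0), hit("urn:b", 1.0), hit("urn:c", 1.0), hit("urn:d", 1.0),
    facet("test:facet", "facet-value-1", 2.0),
    facet("test:facet", "facet-value-2", 1.0),
    facet(RDF_TYPE, "urn:Type1", 3.0),
    facet(RDF_TYPE, "urn:Type2", 1.0),
]

def split_results(rows):
    """Separate entity hits from facet counts, grouped by facet predicate."""
    entities = [(r["entity"], r["score"]) for r in rows if r["entity"] is not None]
    facets = defaultdict(dict)
    for r in rows:
        if r["facetPredicate"] is not None:
            facets[r["facetPredicate"]][r["facetValue"]] = int(r["facetCount"])
    return entities, dict(facets)

entities, facets = split_results(rows)
print(entities)              # four (entity, score) pairs
print(facets["test:facet"])  # {'facet-value-1': 2, 'facet-value-2': 1}
```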

Dependencies & Deployment Details

  • GraphDB Plug-in API (releases after build 5.4.6686)
  • Lucene 4.5.1
  • Apache commons-io 2.4

FAQ

How to use UNION with Lucene4?

A: Just like in a normal query. Consider the following example:

The above query joins the UNION part (with bindings for ?c and ?s) with the Lucene part on ?s, provided that the Lucene index contains things of the right classes, i.e. things of type Type1 AND Type2. There are a few noteworthy details:

  1. The Lucene query limit applies before the SPARQL query limit.
  2. You can safely use DISTINCT in order to clean up duplicate results.

If I UNION up two lucene4 queries (on different lucene4 indices), will the snippets and scoring still work?

A: The short answer is no. Lucene scores are generated per query, so you cannot execute two different Lucene queries and expect adequate scoring when combining the results. Consider the following example:

While the query above is valid, it is not adequate because of the reasons mentioned earlier. The results will be incorrect since different scoring is used for the two queries. Instead of using UNION, you should create a single index for Type1, Type2 and Predicate1, and execute just one query.

Sometimes the snippets are shorter than the number of snippet chars requested, why?

A: This is how Lucene's FastVectorHighlighter generates snippets. For example, if the "match" is on an indexed property that is relatively short (such as a report title), then the snippet tends to be shorter than the entire title, even though the title is shorter than the requested snippet length. Lucene tends to cut the snippet off chars before the first term match. The javadoc of the getBestFragment method is not very helpful in explaining why.

The drop index SPARQL update request throws an error if the index does not exist. Is there a way to ask if the index exists, and if so - drop it?

This is where the luc4:list comes in handy. If you have an index named "myIndex", then you can execute the following SPARQL update and get response code 200, even when the index does not exist:

How to search for terms/phrases containing stop words?

In our terminology, a term represents a single word of text. What you are actually trying to search for is a phrase. The Lucene4 plug-in supports the Lucene query syntax, which means that you can search for phrases such as "City of Manchester" as in the query below (mind the quotes):

There is a caveat: in order to be able to use "" in a literal, you need to use the additional quoting constructs described at http://www.w3.org/TR/sparql11-query/#QSynLiterals

Only then will you get appropriate results. It is true that the analyzers filter out the stop words, but this is not an issue, since the same analyzer is used for indexing and search: the one that was specified when the index was created. More information on how this works can be found at http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html, in the "Token Position Increments" section.
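
When queries are generated by client code, the quoting caveat can be handled in one place. The Python sketch below is a hypothetical helper that wraps a Lucene query in a SPARQL string literal: it uses a single-quoted literal when the text contains double quotes and nothing else that needs escaping, and falls back to backslash escapes inside a double-quoted literal otherwise. Both forms are part of the SPARQL 1.1 literal syntax linked above.

```python
def sparql_string_literal(text):
    """Wrap text in a SPARQL string literal (hypothetical helper)."""
    needs_escape = any(ch in text for ch in ("'", "\\", "\n", "\r"))
    if '"' in text and not needs_escape:
        # A single-quoted literal can contain double quotes as they are.
        return "'" + text + "'"
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return '"' + escaped + '"'

print(sparql_string_literal('"City of Manchester"'))
# '"City of Manchester"'
```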

Are prefix wildcards supported? Can I use both prefix and suffix wildcards on the same term?

Yes, you can. Go for it!
