The Lucene4 plug-in for GraphDB provides very fast facet (aggregation) searches of the kind normally available through external Apache Solr services, with the additional benefit of staying automatically up to date with the GraphDB repository data.
Lucene2 indexes each relevant statement (triple) as a single Lucene document. This makes it easy to keep the index up to date, but has a few undesirable effects: a single entity can be returned more than once for a given search (because more than one of its literals matches the query), and results cannot be sorted by specific predicates. The Lucene4 plugin instead creates a single Lucene document per entity, with a field for each of the predicates listed in the index configuration. This leads to slower updates, but solves the two main problems described above. These decisions have some other minor implications, described below. Another significant feature of Lucene4 is facet support (which is built into Apache Lucene 4.x.x).
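The difference between the two document models can be illustrated with a small, self-contained Python sketch. It is only a toy model of the behaviour described above, not the plug-in's implementation: the triples, the predicate names and the whitespace-based matching are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy triples: (subject, predicate, literal). All names are hypothetical.
TRIPLES = [
    ("urn:e1", "rdfs:label", "apple pie"),
    ("urn:e1", "rdfs:comment", "a pie made with apple"),
    ("urn:e2", "rdfs:label", "cherry pie"),
]


def search_per_statement(triples, term):
    """Lucene2-style model: one document per triple, so one hit per matching literal."""
    return [s for s, _p, lit in triples if term in lit.split()]


def build_entity_docs(triples, predicates):
    """Lucene4-style model: one document per entity, one field per configured predicate."""
    docs = defaultdict(lambda: defaultdict(list))
    for s, p, lit in triples:
        if p in predicates:
            docs[s][p].append(lit)
    return docs


def search_per_entity(docs, term):
    """Each entity document is returned at most once, however many fields match."""
    return [
        s
        for s, fields in docs.items()
        if any(term in lit.split() for lits in fields.values() for lit in lits)
    ]


docs = build_entity_docs(TRIPLES, {"rdfs:label", "rdfs:comment"})
print(search_per_statement(TRIPLES, "apple"))  # urn:e1 is listed twice
print(search_per_entity(docs, "apple"))        # urn:e1 is listed once
```

The per-entity model also explains the slower updates: changing one literal means rebuilding the whole entity document rather than replacing a single per-statement document.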
The predicates index option is now mandatory and must not be empty. Indexing all predicates for a given entity is not supported.
The additionalJoins index option from Lucene2 has been removed in Lucene4, but it can be emulated. Consider the following Lucene2 index creation:
In order to do this in Lucene4, create the same index, but add urn:join to predicates instead of additionalJoins:
The filtering itself is done through searching. Since Lucene4 creates a Lucene document field for each predicate and the field name is exactly the predicate URI, you can make use of that through the Lucene query syntax:
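As a hedged illustration of what such a fielded query string can look like, the Python sketch below builds one from a predicate URI and a value. It relies only on the classic Lucene query syntax, in which characters such as `:` are special and must be escaped with a backslash when they are part of a field name. The helper names and the `urn:type` value are hypothetical, and whether the plug-in expects exactly this form is an assumption.

```python
import re

# Characters that the classic Lucene query syntax treats as special.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_lucene(text: str) -> str:
    """Backslash-escape every Lucene special character in text."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


def field_phrase_query(predicate_uri: str, value: str) -> str:
    """Build field:"value", using the escaped predicate URI as the field name."""
    quoted = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{escape_lucene(predicate_uri)}:"{quoted}"'


# Hypothetical value; prints urn\:join:"urn:type"
print(field_phrase_query("urn:join", "urn:type"))
```

When such a string is embedded in a SPARQL string literal, its backslashes and double quotes need SPARQL-level escaping as well.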
The optionalJoins parameter is still present in Lucene4. Entities that have the specified predicate values OR lack the optional join predicates completely are indexed, while the rest of the entities are not. The optional join predicate values are indexed like the other predicates, in one Lucene field per predicate, which allows searching for specific values, as described above in the Additional joins section.
For example, if you create the following index:
and then insert some entities:
As with additional joins, you can query on urn:join. In the example above:
Entities that do not have the optional join predicate get a default value. The default value is OPTIONALJOINDEFAULT and can be set by using the optionalJoinDefaults index option. For example, with the above entities, searching for
will return only
To specify different default values for <urn:join>, create such an index:
Optional join predicate values are indexed but not tokenised, which means that queries for them must match the value exactly.
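The optional join rules can be summarised in a short Python model. This is a sketch of the behaviour as described in this section, not the plug-in's code: the entity data and function names are hypothetical, and the choice to index all values of an entity that has at least one accepted value is an assumption.

```python
OPTIONAL_JOIN_DEFAULT = "OPTIONALJOINDEFAULT"  # default value named in the text


def index_entities(entities, join_predicate, accepted_values,
                   default=OPTIONAL_JOIN_DEFAULT):
    """Return {entity: [indexed join values]}.

    entities maps each entity to {predicate: [values]}.
    """
    indexed = {}
    for entity, props in entities.items():
        values = props.get(join_predicate, [])
        if not values:
            # The entity lacks the predicate: index it with the default value.
            indexed[entity] = [default]
        elif any(v in accepted_values for v in values):
            # The entity has one of the specified values: index it (assumption: all values).
            indexed[entity] = list(values)
        # Otherwise the entity is not indexed at all.
    return indexed


def search_join(indexed, value):
    """Values are not tokenised, so only an exact match counts."""
    return sorted(e for e, values in indexed.items() if value in values)


# Hypothetical entities.
ENTITIES = {
    "urn:a": {"urn:join": ["urn:type"]},
    "urn:b": {},
    "urn:c": {"urn:join": ["urn:other"]},
}

indexed = index_entities(ENTITIES, "urn:join", {"urn:type"})
print(search_join(indexed, "urn:type"))             # ['urn:a']
print(search_join(indexed, OPTIONAL_JOIN_DEFAULT))  # ['urn:b']
print(search_join(indexed, "urn"))                  # [] because partial values do not match
```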
To create an index, issue the following SPARQL update query
where <index-options> can be a combination of the following options, separated by a semicolon ';':
Indices might become corrupt due to disk failure or other issues. The luc4:healthCheck predicate returns a list of indices along with their health status. In the example below, ?uri will bind to
These examples will help you understand how to create an index and then execute searches on it.
Now that the index is created, you can run the following query to obtain the top 20 entries that start with "a", together with the snippet in which the literal occurs and the Lucene score.
Consider the following RDF data (in Turtle format):
Now create an index, using rdf:type and test:facet as facet predicates, indexing rdfs:comment and having no type restriction:
Let's get some results now:
The result bindings will look like the table below; an empty cell means that the value is unbound:
A: Just like in a normal query. Consider the following example:
The above query joins the union part (with bindings for ?c and ?s) with the Lucene part on ?s, provided that the Lucene index contains things of the right classes, i.e. things of type Type1 AND Type2. There are a few noteworthy details:
If I UNION up two lucene4 queries (on different lucene4 indices), will the snippets and scoring still work?
A: The short answer is no. Lucene scores are generated per query, so you cannot execute two different Lucene queries and expect adequate scoring when joining the results. Consider the following example:
While the query above is valid, it is not adequate because of the reasons mentioned earlier. The results will be incorrect since different scoring is used for the two queries. Instead of using UNION, you should create a single index for Type1, Type2 and Predicate1, and execute just one query.
A: It is the way Lucene's FastVectorHighlighter generates snippets. For example, if the "match" is on an indexed property that is relatively short (such as a report title), then the snippet tends to be less than the entire title, even though the title length is less than the requested snippet length. Lucene tends to cut the snippet off chars before the first term match. The getBestFragment method's javadoc is not very helpful in explaining why.
The drop index SPARQL update request throws an error if the index does not exist. Is there a way to ask if the index exists, and if so - drop it?
This is where the luc4:list comes in handy. If you have an index named "myIndex", then you can execute the following SPARQL update and get response code 200, even when the index does not exist:
In our terminology, a term represents a single word of text. What you are actually trying to search for is a phrase. The Lucene4 plug-in supports the Lucene query syntax, which means that you can search for phrases such as "City of Manchester", as in the query below (mind the quotes):
There is a caveat: to be able to use double quotes inside a literal, you need to use one of the additional quoting constructs described at http://www.w3.org/TR/sparql11-query/#QSynLiterals
Only then will you get appropriate results. It is true that the analyzers filter out stop words, but this is not an issue, since the same analyzer is used for indexing and for search: the one that was specified at index creation time. More information on how this works can be found at http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html, in the "Token Position Increments" section.
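The quoting caveat can be made concrete with a small Python sketch that renders a Lucene phrase as a SPARQL string literal. It uses only the escape sequences defined by the SPARQL 1.1 grammar for double-quoted literals; the helper names are hypothetical, and how the resulting literal is passed to the plug-in is not shown.

```python
def lucene_phrase(phrase: str) -> str:
    """Wrap a phrase in double quotes, as the Lucene query syntax expects."""
    return '"' + phrase.replace("\\", "\\\\").replace('"', '\\"') + '"'


def sparql_string_literal(text: str) -> str:
    """Render text as a double-quoted SPARQL string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


literal = sparql_string_literal(lucene_phrase("City of Manchester"))
print(literal)  # "\"City of Manchester\""
```

SPARQL also allows single-quoted literals and the long `'''...'''` and `"""..."""` forms, so `'"City of Manchester"'` is an equivalent way to write the same literal without backslashes.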
Yes, you can. Go for it!