The Lucene2 plug-in was developed to provide full-text search in GraphDB that is automatically kept up to date with the repository data. The existing OWLIM Lucene plug-in lacked incremental index building and did not keep its indices up to date automatically. This plug-in fills these gaps and adds some useful features.
To create an index, issue the following SPARQL update query:
where <index-options> can be a combination of the following options, separated by a semicolon ';':
These examples will help you understand how to create an index and then execute searches on it.
The following query creates an index on all entities that have an rdfs:label with the @en language tag or with no language tag, using an EnglishAnalyzer with snippets enabled.
Now that the index is created, you can run the following query to obtain the top 20 entries that start with "a", together with the actual literal, the snippet in which the literal occurs, and the Lucene score.
This benchmark was run on a LUBM-50 dataset (6,654,856 explicit statements) using the default values for test memory and repository configuration.
The CPU used was Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz.
A: Just like in a normal query. Consider the following example:
The query above joins the union part (with bindings for ?c and ?s) with the Lucene part on ?s, provided that the Lucene index contains things of the right classes, i.e., things of type Type1 and Type2. There are a few noteworthy details:
If I UNION up two Lucene2 queries (on different Lucene2 indices), will the snippets and scoring still work?
A: The short answer is no. Lucene scores are generated per query, so you cannot execute two different Lucene queries and expect meaningful scoring when the results are combined. Consider the following example:
While the query above is valid, it does not produce meaningful results, for the reason given earlier: each of the two queries is scored differently, so the scores cannot be compared with each other. Instead of using UNION, create a single index for Type1, Type2 and Predicate1, and execute just one query.
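To illustrate why scores from separate indices cannot be compared, here is a minimal Python sketch. It uses a simplified TF-IDF-style formula chosen for illustration only. This is an assumption, not the scoring actually used by Lucene or the Lucene2 plug-in, and the function name toy_score and the sample documents are hypothetical.

```python
import math

def toy_score(term, doc, corpus):
    """Simplified TF-IDF-style score of `doc` for `term`, relative to `corpus`.

    Illustrative assumption only; not the formula Lucene itself applies.
    """
    tf = doc.lower().split().count(term)                       # term frequency in the document
    df = sum(1 for d in corpus if term in d.lower().split())   # documents containing the term
    idf = 1.0 + math.log(len(corpus) / (df + 1))               # rarer terms weigh more
    return tf * idf

# Two hypothetical indices that both contain the same document.
index_a = ["manchester city guide", "london city guide", "paris city guide"]
index_b = ["manchester city guide", "tokyo travel notes", "berlin museum list"]

doc = "manchester city guide"
score_a = toy_score("city", doc, index_a)  # "city" is common in index_a
score_b = toy_score("city", doc, index_b)  # "city" is rare in index_b

print(score_a, score_b)
```

The same document receives a different score for the same term in each index, because the score depends on the statistics of the index it is computed against. Sorting a UNION of both result sets by score would therefore mix values from two different scales.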
A: This is how Lucene's FastVectorHighlighter generates snippets. For example, if the match is on an indexed property that is relatively short (such as a report title), the snippet tends to be shorter than the entire title, even though the title is shorter than the requested snippet length. Lucene tends to cut the snippet off some characters before the first term match. The javadoc of the getBestFragment method is not very helpful in explaining why.
The drop index SPARQL update request throws an error if the index does not exist. Is there a way to check whether the index exists and, if so, drop it?
This is where luc2:list comes in handy. If you have an index named "myIndex", you can execute the following SPARQL update and get response code 200 even when the index does not exist:
In our terminology, a term represents a single word of text. What you are actually trying to search for is a phrase. The Lucene2 plug-in supports the Lucene query syntax, so to search for a phrase such as "City of Manchester", enclose it in double quotes, as in the query below (mind the quotes):
There is a caveat, though: to use double quotes inside a literal, you need one of the additional quoting constructs described at http://www.w3.org/TR/sparql11-query/#QSynLiterals.
Only then will you get appropriate results. It is true that the analyzers filter out stop words, but this is not an issue, since the same analyzer is used for both indexing and search: the one specified at index creation time. More information on how this works can be found at http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html, in the "Token Position Increments" section.
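The quoting caveat for phrase searches can be sketched in Python as follows. This is a minimal illustration, assuming you build the query string in client code. The helper names lucene_phrase and sparql_literal are hypothetical and not part of the plug-in. The sketch relies only on the backslash escapes that the SPARQL 1.1 grammar defines for string literals.

```python
def lucene_phrase(phrase: str) -> str:
    """Wrap a phrase in double quotes so Lucene treats it as a phrase query."""
    if '"' in phrase:
        # Kept simple on purpose: embedded quotes are rejected, not escaped.
        raise ValueError("phrase must not contain double quotes")
    return '"' + phrase + '"'

def sparql_literal(text: str) -> str:
    """Return `text` as a double-quoted SPARQL literal with escapes applied."""
    escapes = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    return '"' + "".join(escapes.get(ch, ch) for ch in text) + '"'

literal = sparql_literal(lucene_phrase("City of Manchester"))
print(literal)  # prints "\"City of Manchester\""
```

SPARQL also accepts single-quoted literals ('"City of Manchester"') and triple-quoted long literals, either of which avoids the backslashes altogether.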
Yes, you can. Go for it!