Lucene4 Plug-in (deprecated)

compared with
Current by Reneta Popova
on Sep 19, 2014 09:51.

Where the {code}+urn\\:join:joinValue{code} means:
* *+* \- the following query is mandatory;
* {code}urn\\:join:joinValue{code} is a query in the form field:value. However, since the field name contains {{:}}, we need to escape it with a backslash. The backslash, however, is also an escape character in SPARQL, so we need to escape it again for the plugin to see a single backslash. Forward slashes are also special Lucene syntax and need to be escaped as well.
This means that if the join predicate is {nolink}[http://example.com/slash]{nolink}, the proper query would be "+(query) \+http\\:\/\/example.com\/slash:joinValue".
* Also note that, whatever the original FTS query is, it is best to put it in brackets with a plus in front. If you ask for
{code}"query +urn\\:join:joinValue"{code} then Lucene will return all entities that match either _query_ OR
{code}urn\\:join:joinValue{code}, which is not the intent.
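
The two escaping layers described above can be applied mechanically. The following is a minimal Python sketch, not part of the plug-in: the helper names ({{escape_lucene_field}}, {{to_sparql_literal}}, {{join_query}}) are hypothetical. It assumes that the join value itself contains no characters that are special to Lucene, and it doubles every backslash, including those before forward slashes, on the assumption that the SPARQL parser accepts only the standard escape sequences.

```python
def escape_lucene_field(name: str) -> str:
    """Backslash-escape ':' and '/' so Lucene reads the name as one field."""
    return "".join("\\" + ch if ch in ":/" else ch for ch in name)


def to_sparql_literal(text: str) -> str:
    """Quote text as a SPARQL string literal, doubling each backslash."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def join_query(fts_query: str, join_predicate: str, join_value: str) -> str:
    """Build '+(query) +field:value' and wrap it as a SPARQL literal."""
    # Assumption: join_value needs no Lucene escaping of its own.
    field = escape_lucene_field(join_predicate)
    lucene = f"+({fts_query}) +{field}:{join_value}"
    return to_sparql_literal(lucene)


# Prints the literal to paste as the object of the search predicate.
print(join_query("query", "urn:join", "joinValue"))
```

The original FTS query is always wrapped in brackets with a plus in front, as recommended above, so that both parts of the query are mandatory.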

h5. Optional joins
{code}

Entities that do not have the optional join predicate get a default value. The default value is {{OPTIONALJOINDEFAULT}} and can be set by using the {{optionalJoinDefaults}} index option. For example, with the above entities, searching for
{code}"words...+urn\\:join:OPTIONALJOINDEFAULT"{code}
will return only
{code}<entity:will-be-indexed-too>{code}

* {{types=<comma-separated-list-of-URIs>}} \- if specified, a white list of types will be indexed (i.e. only entities that have rdf:type equal to one of the specified URIs will be indexed);
* {{facets=<comma-separated-list-of-URIs>}} \- if specified, the listed predicates and their values will be indexed in the facets index;
* {{optionalJoins=<\|-separated-predicate-object-pairs>}} \- if specified, a white list of additional optional joins to validate. Supports both URIs and Literals as objects. Literals can include spaces. Sample syntax: {code}optionalJoins=urn:ontology:predicate,longer value\|urn:ontology:predicate,another longer value{code}
An entity is indexed only if, for each specified predicate, it either has the predicate with one of the specified values or does not have the predicate at all. If an entity is indexed, a field is created for each optional join predicate, containing all of its values. If the entity does not have the predicate, the default value specified with {{optionalJoinDefaults}} is indexed instead. The value is indexed but not tokenized, so any search in an optional join field must be an exact match.
* {{optionalJoinDefaults=<\|-separated-predicate-object-pairs>}} \- if specified, provides a different default value to be indexed in the field for entities that do not have the optional join predicate. (The default for all predicates is *"OPTIONALJOINDEFAULT"*.)
* {{sortPredicates=<comma-separated-list-of-URIs>}} \- predicates whose values are used for sorting at search time. Only predicates specified here can be passed to {{order.by}}.
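
The optional join rule is easier to follow as a small model. The Python sketch below only illustrates the rule as it is worded above; it is not the plug-in's implementation. The function name {{optional_join_fields}} and the dictionary-based data layout are assumptions made for the example.

```python
DEFAULT_VALUE = "OPTIONALJOINDEFAULT"


def optional_join_fields(entity, optional_joins, defaults=None):
    """Model of the optional join rule (illustration only).

    entity:         predicate -> list of values the entity has
    optional_joins: predicate -> set of allowed values (the white list)
    defaults:       predicate -> custom default (optionalJoinDefaults)

    Returns None if the entity is not indexed, otherwise a mapping of
    predicate -> list of values stored in that predicate's field.
    """
    defaults = defaults or {}
    fields = {}
    for predicate, allowed in optional_joins.items():
        values = entity.get(predicate, [])
        if not values:
            # Predicate is missing: index the default value instead.
            fields[predicate] = [defaults.get(predicate, DEFAULT_VALUE)]
        elif any(value in allowed for value in values):
            # At least one allowed value: index all values.
            fields[predicate] = list(values)
        else:
            # Predicate present, but with no allowed value: skip entity.
            return None
    return fields
```

For example, with {{optional_joins = {"urn:join": {"a", "b"}}}}, an entity with the values {{a}} and {{c}} is indexed with both values, an entity with only {{c}} is skipped, and an entity without the predicate gets the default value.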

Examples:
h3. Search

# Simple search
{code}?entity luc4:indexName "lucene-query"{code} Returns the IDs of all matching entities as the ?entity binding.

# Advanced search
{code}?result luc4:indexName ("lucene-query" "options"){code}

The options are given as a ;-separated list of option=value pairs. What is bound to ?result depends on whether the options specify a facet query. The following is a list of all supported options:
|| Parameter || Value \\ || Default \\ || Comment \\ ||
| offset \\ | int \\ | 0 \\ | Returns the results starting from the specified value. \\ |
| facets \\ | string \\ | none \\ | A comma-separated list of facet predicates whose values to aggregate. All predicates specified here should also have been specified in the index option {{facets}}. Passing a non-empty list of facet predicates changes the way results are returned; see the Facet examples below. \\ |
| facets.limit \\ | int \\ | 2^31-1 \\ | The number of top values to return per facet predicate. For example, specifying {{facets=urn:facet1,urn:facet2;facets.limit=5}} will return at most ten facet results in total (five per predicate). \\ |
| order.by \\ | string \\ | score \\ | A comma-separated list of indexed predicates. If specified, the results are ordered by the values of the specified predicates, in the specified order. Values are sorted in ascending order. To sort the values of a predicate in descending order, prepend - (minus) to it. For example, {{order.by=urn:predicate1,-urn:predicate2}} sorts the results by the value of {{urn:predicate1}} in ascending order; results that have the same value for {{urn:predicate1}} are then sorted by the value of {{urn:predicate2}} in descending order. The special value {{score}} (lower-case) can be used instead of a predicate to order by the normal Lucene score (descending by default). For example, {{order.by=urn:predicate,score}} sorts by {{urn:predicate}} first (in ascending order) and then by score, while {{order.by=score,urn:predicate}} sorts first by score and then by {{urn:predicate}}. Passing {{\-score}} to reverse the order (from lowest to highest score) is also allowed. \\ (!) Predicates on this list should be specified as {{sortPredicates}} during the creation of the index. Ordering by a predicate that is not listed in {{sortPredicates}} will lead to unpredictable ordering. \\ |
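
Since the options are a plain ;-separated string, they can be assembled programmatically. The following Python sketch is an illustration only: the helper names {{order_by}} and {{build_options}} are hypothetical, and only option names that appear in the table above are used.

```python
def order_by(*keys):
    """Build an order.by value from (name, minus) pairs.

    minus=True prepends '-': descending for a predicate, or lowest
    score first for the special value 'score'.
    """
    return ",".join(("-" if minus else "") + name for name, minus in keys)


def build_options(options):
    """Join option=value pairs with ';' in the given order."""
    return ";".join(f"{name}={value}" for name, value in options.items())


options = build_options({
    "offset": 10,
    "order.by": order_by(("urn:predicate1", False), ("urn:predicate2", True)),
})
print(options)  # offset=10;order.by=urn:predicate1,-urn:predicate2
```

The resulting string is what goes into the second element of the {{("lucene-query" "options")}} list.
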
# Faceted search results - for searches where the {{facets}} search option is specified, the search predicate (e.g. {{luc4:indexName}}) binds a single dummy blank node to its object. You must then query for {{luc4:entities}} to get the entities that match the query, and for {{luc4:facets}} to get the faceted results. The following example demonstrates this and shows the related special predicates for extra result details:
{code}
PREFIX lucene4:<http://www.ontotext.com/owlim/lucene4#>
}
{code}
(!) Note the use of UNION to avoid the Cartesian product of the set of document and facet results. This is the recommended way to retrieve both documents and facets results.
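
To see why UNION is recommended, compare the number of result rows in the two cases. The Python sketch below is a toy model with made-up entities and facet values; it only counts rows and does not run any SPARQL.

```python
# Hypothetical results: three matching entities and two facet values.
docs = ["urn:a", "urn:b", "urn:c"]
facets = [("urn:facet1", "x", 2), ("urn:facet1", "y", 1)]


def joined_rows(docs, facets):
    """Plain join: every entity is paired with every facet value."""
    return [(doc, facet) for doc in docs for facet in facets]


def union_rows(docs, facets):
    """UNION: one row per entity plus one row per facet value.

    None stands for an unbound variable in the result row.
    """
    return [(doc, None) for doc in docs] + [(None, facet) for facet in facets]


print(len(joined_rows(docs, facets)))  # 6 rows (3 x 2)
print(len(union_rows(docs, facets)))   # 5 rows (3 + 2)
```

With realistic result sizes the difference between the product and the sum grows quickly, which is why the two result sets are retrieved in separate UNION branches.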

h4. Additional predicates allowed on bound search results
h3. Examples

* Return all results, ?entity will bind to the resource URI.
{code}
SELECT ?entity WHERE {
}
{code}
* Return the top 10 resources, snippets are of size 20, ?entity will bind to the resource URI.
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
}
{code}
* Return the top 10 resources, starting from offset 10 (i.e. the page size is 10, return second page), the documents will be sorted by the value of rdfs:label instead of the Lucene score.
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
h3. Real-world examples

These examples will help you understand how to create an index and then execute searches on it.

The following query creates an index on all entities that have an rdfs:label with an @en language tag or no language tag, using an EnglishAnalyzer with snippets enabled.
{code}
}
{code}
Let's get some results now:
{code}
PREFIX lucene4:<http://www.ontotext.com/owlim/lucene4#>
}
{code}
The result bindings will look like the table below, where an empty cell means that the value is unbound:
|| score || entity || facetValue || facetCount || facetPredicate ||
| 1.0 | urn:a | | | |
h2. Dependencies & Deployment Details

Dependencies:
* GraphDB Plug-in API (releases after build 5.4.6686)
* Lucene 4.5.1
* Apache commons-io 2.4
} LIMIT 10
{code}
The above query joins the union part (with bindings for ?c and ?s) with the Lucene part on ?s, provided that the Lucene index contains things of the right classes, i.e. things of type Type1 AND Type2. There are a few noteworthy details, though:
# The Lucene query limit applies before the SPARQL query limit.
# You can safely use DISTINCT in order to clean up duplicate results.
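
The first point can be surprising in practice: if the Lucene limit cuts off the hit list before the join, the SPARQL query may return fewer rows than its own LIMIT allows. The Python sketch below models this with made-up data; the function name {{search}} and all values are hypothetical.

```python
def search(ranked_hits, lucene_limit, allowed, sparql_limit):
    """Model of the evaluation order: Lucene limit, join, SPARQL limit."""
    top_hits = ranked_hits[:lucene_limit]                 # Lucene limit first
    joined = [hit for hit in top_hits if hit in allowed]  # join on ?s
    return joined[:sparql_limit]                          # SPARQL LIMIT last


# Ten Lucene hits in score order; only three of them survive the join.
hits = [f"urn:s{i}" for i in range(1, 11)]
allowed = {"urn:s2", "urn:s7", "urn:s9"}

print(search(hits, 5, allowed, 10))   # ['urn:s2']
print(search(hits, 10, allowed, 10))  # ['urn:s2', 'urn:s7', 'urn:s9']
```

In this model, two matching entities are lost with the smaller Lucene limit, even though the SPARQL limit of 10 was never reached.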

h3. If I UNION two lucene4 queries (on different lucene4 indices), will the snippets and scoring still work?

*A:* The short answer is no. Lucene scores are generated per query, so you cannot execute two different Lucene queries and expect adequate scoring when joining the results. Consider the following example:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
} ORDER BY DESC (?score)
{code}
While the query above is valid, it is not adequate, for the reasons mentioned earlier. The results will be incorrect, since different scoring is used for the two queries. Instead of using UNION, you should create a single index for Type1, Type2 and Predicate1, and execute just one query.

h3. Why are the snippets sometimes shorter than the requested number of snippet chars?

*A:* It is the way Lucene's FastVectorHighlighter generates snippets. For example, if the "match" is on an indexed property that is relatively short (such as a report title), then the snippet tends to be less than the entire title, even though the title length is less than the requested snippet length. Lucene tends to cut the snippet off chars before the first term match. The {{getBestFragment}} method's [javadoc|http://lucene.apache.org/core/3_5_0/api/contrib-highlighter/org/apache/lucene/search/vectorhighlight/FastVectorHighlighter.html#getBestFragment%28org.apache.lucene.search.vectorhighlight.FieldQuery,%20org.apache.lucene.index.IndexReader,%20int,%20java.lang.String,%20int%29] is not very helpful in explaining why.

h3. The drop index SPARQL update request throws an error if the index does not exist. Is there a way to check whether the index exists and, if so, drop it?

This is where the _luc4:list_ comes in handy. If you have an index named _"myIndex"_, then you can execute the following SPARQL update and get response code 200, even when the index does not exist:

h3. How to search for terms/phrases containing stop words?
In our terminology, a _term_ represents a single word of text. What you are actually trying to search for is a _phrase_. The Lucene4 plug-in supports the Lucene query syntax, so you can search for a phrase such as "City of Manchester" by enclosing it in quotes, as in the query below:
{code}
PREFIX luc4:<http://www.ontotext.com/owlim/lucene4#>
}
{code}
There is a caveat: to use {{"}} inside a literal, you need to use one of the additional quoting constructs described in [http://www.w3.org/TR/sparql11-query/#QSynLiterals].

Only then will you get appropriate results. It is true that the analyzers filter out stop words, but this is not an issue, since the same analyzer is used for indexing and search: the one specified at index creation time. More information on how this works can be found in the "Token Position Increments" section of [http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/package-summary.html].
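
The quoting caveat can be handled in two ways that the SPARQL grammar allows: escaping the inner quotes with a backslash, or using a long (triple-quoted) literal. The Python sketch below produces both forms; the helper names are hypothetical, and the phrase is assumed to contain no quotes or backslashes of its own.

```python
def phrase_literal(phrase: str) -> str:
    """Return a SPARQL literal with the Lucene phrase quotes escaped."""
    lucene = '"' + phrase + '"'  # Lucene phrase query
    escaped = lucene.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def phrase_literal_long(phrase: str) -> str:
    """Return a triple-quoted SPARQL literal; inner quotes stay as-is."""
    return "'''" + '"' + phrase + '"' + "'''"


print(phrase_literal("City of Manchester"))
print(phrase_literal_long("City of Manchester"))
```

Either output can be used as the query string of the search predicate; in both cases the plug-in receives the phrase together with its surrounding quotes.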

h3. Are prefix wildcards supported? Can I use both prefix and suffix wildcards on the same term?