Intro
The ordering of autocomplete results (search terms) is very important, since the user won't/can't scroll past the first 50 or 100.
Currently they are returned in alphabetical order.
- Actually it's a complex ordering function that prefers terms where the query appears at the start of prefLabel
- (While the autocomplete returns all results where the query appears at the start of any word in any label)
Problem:
- A small village "Amsesomewhere" will come before "Amsterdam", even if the village does not appear anywhere in the data.
- Alphabetical ordering also makes it hard to find the British Museum, since there are many other terms starting with "British".
It's better to order by popularity/importance (i.e. how widely used a term is).
If Autocomplete Test is satisfactory, cut off at 50 or 100 results
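The difference between the two orderings can be sketched in a few lines of Python. The labels and rank values are made up for illustration, and both helper functions are hypothetical simplifications (the real ordering function is more complex than a plain sort by label).

```python
# Hypothetical candidates for the query "ams": (label, rank in 0..1).
# The rank values are invented for illustration only.
candidates = [
    ("Amsesomewhere", 0.01),
    ("Amstelveen", 0.20),
    ("Amsterdam", 0.90),
]

def alphabetical_order(terms):
    """Simplified stand-in for the current ordering: sort by label."""
    return [label for label, _ in sorted(terms, key=lambda t: t[0])]

def rank_order(terms, limit=100):
    """Proposed ordering: highest rank first, label as tie-breaker, then cut off."""
    ranked = sorted(terms, key=lambda t: (-t[1], t[0]))
    return [label for label, _ in ranked[:limit]]

print(alphabetical_order(candidates))
print(rank_order(candidates, limit=2))
```

With rank ordering, the cut-off drops the least used terms instead of whatever happens to sort last alphabetically.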
RDFRank
The plan is to use RDFRank, a unique OWLIM feature, which works like this:
- Follows links and increases the score of visited nodes
- Diminishes score increment with every iteration (damping)
- Never decreases scores
- Iterates up to maxIterations, or stops earlier when the net change of an iteration is less than some epsilon
- At the end normalizes all scores to 0..1
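The steps above can be illustrated with a small PageRank-style sketch. This is not OWLIM's implementation: it is a toy reading of the bullet points only, and the function name `rank_sketch` and the default values for `damping`, `max_iterations` and `epsilon` are assumptions.

```python
from collections import defaultdict

def rank_sketch(edges, damping=0.85, max_iterations=20, epsilon=1e-4):
    """Toy link-based ranking following the bullet points above.

    edges: iterable of (source, target) pairs, i.e. links between nodes.
    The default parameter values are assumptions, not OWLIM's defaults.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    outgoing = defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)

    score = {n: 0.0 for n in nodes}
    # Every node starts with one unit of weight to pass along its links.
    carry = {n: 1.0 for n in nodes}

    for _ in range(max_iterations):
        increment = {n: 0.0 for n in nodes}
        for source in nodes:
            targets = outgoing[source]
            if not targets:
                continue
            # Damping shrinks what is passed on in each iteration.
            share = damping * carry[source] / len(targets)
            for target in targets:
                increment[target] += share
        # Scores only ever grow: increments are non-negative.
        for n in nodes:
            score[n] += increment[n]
        carry = increment
        # Stop early once an iteration changes almost nothing.
        if sum(increment.values()) < epsilon:
            break

    # Normalize to 0..1 by dividing by the highest score.
    top = max(score.values(), default=0.0)
    if top > 0:
        score = {n: s / top for n, s in score.items()}
    return score

scores = rank_sketch([("a", "c"), ("b", "c"), ("c", "d")])
print(scores)
```

In this toy graph "d" ends up with the highest score because it is reached both directly from "c" and indirectly from "a" and "b", while nodes with no incoming links stay at 0.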
Compute RDFRank
- compute RDFRank. This is an expensive operation done once as part of Repository Creation.
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#> INSERT DATA {[] rank:compute []}
- tell OWLIM FTS to use RDFRank to boost the score (relevance) of Lucene search results. Allowed values are 'no' (default), 'yes' and 'squared' (use the square of RDFRank).
PREFIX luc: <http://www.ontotext.com/owlim/lucene#> INSERT DATA { luc:useRDFRank luc:setParam "squared" . }
- The default is "no", and the new syntax uses INSERT rather than ASK.
- rebuild the Lucene index
PREFIX luc: <http://www.ontotext.com/owlim/lucene#> INSERT DATA { luc:thesIndex luc:createIndex "true" . }
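The three updates have to run in the order listed above, so it may help to keep them together in one script. The sketch below only collects the statements from this page; `run_setup` and its `send_update` argument are hypothetical, and how an update is actually submitted to the repository is left to the caller.

```python
RANK = "PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#> "
LUC = "PREFIX luc: <http://www.ontotext.com/owlim/lucene#> "

# The three updates from the steps above, in the order they are listed.
SETUP_UPDATES = [
    RANK + "INSERT DATA {[] rank:compute []}",
    LUC + 'INSERT DATA { luc:useRDFRank luc:setParam "squared" . }',
    LUC + 'INSERT DATA { luc:thesIndex luc:createIndex "true" . }',
]

def run_setup(send_update):
    """Send each update in order; an exception from send_update stops the run.

    send_update is a hypothetical callable supplied by the caller that
    submits one SPARQL update string to the repository and raises on error.
    """
    for update in SETUP_UPDATES:
        send_update(update)
    return len(SETUP_UPDATES)

# Dry run: collect the statements instead of sending them anywhere.
sent = []
run_setup(sent.append)
for statement in sent:
    print(statement)
```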
Incremental Update of RDFRank
RS-1462
It would be good to recompute term rank when data is updated:
- thesaurus terms are used in Data/Image Annotation
- tags are attached to/detached from an object (see Tags Spec).
Note: tags are also thesaurus terms, so below we talk only about "terms"
Computing the rank incrementally is hard:
- rank:computeIncremental computes the rank only of nodes that don't have any RDFRank, i.e. new terms. It cannot be used to update the rank of terms that are attached/detached
- rank:computeIncremental performance:
- Mitac is concerned this may be slow and may block the system. If so, it needs to be run nightly.
- Vlado thinks the scope of this update is small, so it shouldn't be slow
- The performance of rank:computeIncremental (and whether it blocks the system) needs to be timed
- If we use luc:score and not rank:hasRDFRank (see below) then we need to recreate the FTS index for the affected terms only.
How important is this?
- Updating the rank of existing terms is not critical, because these few new statements will have little effect on the overall rank compared to the much larger number of statements from imported data.
- Computing the rank of new terms is probably critical, because they won't be returned by the query using rank:hasRDFRank. This needs to be done when adding a tag to rs-tag
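The reason new terms go missing can be shown with a small simulation. The data is invented, and `direct_results` is a hypothetical stand-in for a query that requires a rank:hasRDFRank value for every returned term.

```python
# Hypothetical data: labels matched by the full-text index for "lond*",
# and the ranks computed before the term "London Eye" was added.
matches = ["London", "Londonderry", "London Eye"]
ranks = {"London": 0.8, "Londonderry": 0.3}

def direct_results(matches, ranks):
    """Mimic a join on the rank: terms without a rank are dropped."""
    ranked = [(term, ranks[term]) for term in matches if term in ranks]
    return [term for term, _ in sorted(ranked, key=lambda r: -r[1])]

before = direct_results(matches, ranks)

# Stand-in for an incremental computation that ranks only unranked terms;
# the value 0.05 is invented for illustration.
for term in matches:
    ranks.setdefault(term, 0.05)

after = direct_results(matches, ranks)
print(before)
print(after)
```

Before the incremental step the new term is silently absent from the results; afterwards it appears, although with a low rank.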
Use RDFRank
There are two methods to use RDFRank:
- Direct: use rank:hasRDFRank
- Indirect: use luc:score, which itself is boosted by rank:hasRDFRank
SELECT * {
  ?term luc:thesIndex "lond*".
  ?term rank:hasRDFRank ?rank.   # DIRECT
  # ?term luc:score ?score.      # INDIRECT
} ORDER BY DESC(?rank) LIMIT 100   # order by ?score when using the INDIRECT method
We currently use the Direct method because Indirect has the following complications:
- doesn't work with a wildcard query (e.g. "amst*")
OWLIM-1079
We need to change multiTermRewriteMethod of QueryParser.
- needs to update the FTS index incrementally after an Incremental Update of RDFRank