h1. Intro
The ordering of autocomplete results (search terms) is very important, since the user won't or can't scroll past the first 50 or 100.
Currently they are returned in essentially alphabetical order.
- More precisely, it's a complex ordering function that prefers terms where the query appears at the start of prefLabel
- (The autocomplete itself returns all results where the query appears at the start of any word in any label)

Alphabetical ordering causes problems:
- A small village "Amsesomewhere" comes before "Amsterdam", even if the village does not appear anywhere in the data.
- The British Museum is hard to find, since many other terms start with "British".

It's better to order by popularity/importance (i.e. how widely used a term is).
- (!) If [#Autocomplete Test] is satisfactory, cut off at 50 or 100 results
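
The difference between the two orderings can be illustrated with a minimal sketch. The candidate list and the usage counts are invented for illustration, and the function names are hypothetical; the real ordering function is more complex than a plain alphabetical sort.

```python
# Hypothetical autocomplete candidates: (label, usage count in the data).
# The counts are invented for illustration only.
candidates = [
    ("Amsterdam", 1200),
    ("Amsesomewhere", 0),
    ("Amstelveen", 35),
]


def order_alphabetically(terms):
    """Simplified model of the current behaviour: sort by label."""
    return [label for label, _ in sorted(terms, key=lambda t: t[0].lower())]


def order_by_popularity(terms, limit=50):
    """Proposed behaviour: most used first, ties broken by label, then cut off."""
    ranked = sorted(terms, key=lambda t: (-t[1], t[0].lower()))
    return [label for label, _ in ranked[:limit]]


print(order_alphabetically(candidates))
print(order_by_popularity(candidates, limit=2))
```

With the alphabetical sort the unused village comes first; with the popularity sort and a cut-off of 2 it drops out of the list entirely.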

h1. RDFRank
The plan is to use [RDFRank|], a unique OWLIM feature, which works like this:
- Follows links and increases the score of visited nodes
- Diminishes score increment with every iteration (damping)
- Never decreases scores
- Iterates up to maxIterations, or until the net change of an iteration is less than some epsilon
- At the end normalizes all scores to 0..1
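
The steps above can be sketched roughly as follows. This is not OWLIM's actual implementation, only a toy model of the listed behaviour; the function name, the damping value, the way a node's contribution is split over its links, and the sample graph are all assumptions.

```python
def rdfrank_sketch(links, damping=0.85, max_iterations=20, epsilon=1e-4):
    """Toy model of the steps above. links maps each node to the nodes it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    score = {n: 0.0 for n in nodes}
    # 'flow' is what each node passes along its outgoing links in the current round.
    flow = {n: 1.0 for n in nodes}
    for _ in range(max_iterations):
        received = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                # Follow the link; the increment is damped, so it shrinks every round.
                received[target] += damping * flow[source] / len(targets)
        for n in nodes:
            score[n] += received[n]  # scores only grow, never decrease
        net_change = sum(received.values())
        flow = received
        if net_change < epsilon:
            break
    top = max(score.values(), default=0.0)
    # Normalize all scores to 0..1.
    return {n: (s / top if top > 0 else 0.0) for n, s in score.items()}


# Hypothetical data: two objects use "london", one uses "londonderry".
ranks = rdfrank_sketch({
    "obj1": ["london"],
    "obj2": ["london"],
    "obj3": ["londonderry"],
})
print(ranks["london"], ranks["londonderry"])
```

In this toy graph the term used by two objects ends up with twice the rank of the term used by one, and the objects themselves, which nothing links to, get rank 0.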

h2. Compute RDFRank
- Compute RDFRank. This is an expensive operation, done once as part of [Repository Creation].
PREFIX rank: <>
INSERT DATA {[] rank:compute []}
- Tell [OWLIM FTS|] to use RDFRank to boost the score (relevance) of Lucene search results. Allowed values are 'no' (default), 'yes' and 'squared' (use the square of RDFRank).
PREFIX luc: <>
INSERT DATA { luc:useRDFRank luc:setParam "squared" . }
-- The default is "no". Note that the new syntax is INSERT, not ASK.
- Rebuild the Lucene index
PREFIX luc: <>
INSERT DATA { luc:thesIndex luc:createIndex "true" . }
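
To get a feel for the three useRDFRank values, here is a minimal sketch. It assumes the boost is applied by multiplying the text-match score by the rank, which this page does not confirm; the function name and the sample numbers are hypothetical.

```python
def boosted_score(lucene_score, rdf_rank, mode="no"):
    """Combine a text-match score with an RDFRank in 0..1.

    ASSUMPTION: the boost is a plain multiplication. The real formula may differ.
    """
    if mode == "no":
        return lucene_score
    if mode == "yes":
        return lucene_score * rdf_rank
    if mode == "squared":
        return lucene_score * rdf_rank ** 2
    raise ValueError("mode must be 'no', 'yes' or 'squared'")


# Compare a popular term (rank 0.8) with an obscure one (rank 0.2)
# that match the query equally well.
for mode in ("no", "yes", "squared"):
    print(mode, boosted_score(1.0, 0.8, mode), boosted_score(1.0, 0.2, mode))
```

Under this assumption, and because ranks are normalized to 0..1, 'squared' widens the gap between popular and obscure terms: the 4x advantage under 'yes' becomes 16x.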

h2. Incremental Update of RDFRank
It would be good to recompute term rank when data is updated:
- thesaurus terms are used in Data/Image Annotation
- tags are attached to/detached from an object (see [Tags Spec]).
Note: tags are also thesaurus terms, so below we talk only about "terms".

Computing the rank incrementally is hard:
- rank:computeIncremental computes the rank only of nodes that don't have any RDFRank, i.e. *new* terms. It cannot be used to *update* the rank of terms that are *attached/detached*
- rank:computeIncremental performance:
-- Mitac is concerned this may be slow and may block the system. If so, it needs to be run nightly.
-- Vlado thinks the scope of this update is small, so it shouldn't be slow
-- The performance of rank:computeIncremental (and whether it blocks the system) needs to be timed
- If we use luc:score and not rank:hasRDFRank (see below) then we need to recreate the FTS index for the affected terms *only*.

How important is this?
- Updating the rank is *not critical* because these few new statements will have little effect on the overall rank, compared to the much larger number of statements from imported data.
- Computing the rank of new terms is *probably critical*, because they won't be returned by the query using rank:hasRDFRank. This needs to be done when *adding a tag to rs-tag*
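
Why new terms are the critical case can be shown with a small model. This is only a sketch of the behaviour described above, not of OWLIM itself: the function names, the sample terms and the rank given to the new term are hypothetical.

```python
def direct_query(matches, ranks, limit=100):
    """Model of a query using rank:hasRDFRank: only terms that have a rank are returned."""
    ranked = [(term, ranks[term]) for term in matches if term in ranks]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:limit]]


def compute_incremental_model(ranks, all_terms, rank_of):
    """Model of the incremental computation: rank only terms that have no rank yet."""
    updated = dict(ranks)
    for term in all_terms:
        if term not in updated:
            updated[term] = rank_of(term)  # existing ranks are left untouched
    return updated


ranks = {"london": 1.0, "londonderry": 0.5}
# "london-2012-tag" is a newly added tag that has no rank yet.
matches = ["london", "londonderry", "london-2012-tag"]

before = direct_query(matches, ranks)
ranks_after = compute_incremental_model(ranks, matches, rank_of=lambda term: 0.1)
after = direct_query(matches, ranks_after)

print(before)
print(after)
```

Before the incremental run the new tag is missing from the results altogether; afterwards it appears, while the ranks of the existing terms stay as they were.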

h2. Use RDFRank
There are two methods to use RDFRank:
- Direct: use rank:hasRDFRank
- Indirect: use luc:score, which itself is boosted by rank:hasRDFRank

SELECT ?term ?rank {
?term luc:thesIndex "lond*".
?term rank:hasRDFRank ?rank. # DIRECT
# ?term luc:score ?score. # INDIRECT
} ORDER BY DESC(?rank) LIMIT 100 # for INDIRECT, select and order by ?score instead

We currently use the Direct method because the Indirect method has the following complications:
- It doesn't work with a wildcard query (e.g. "amst*")
-- To make it work, we need to change multiTermRewriteMethod of QueryParser
- It needs the FTS index to be updated incrementally after [#Incremental Update of RDFRank]