View Source


h1. Intro
The autocomplete query is slow for short queries (2 and 3 letters), because we use prefix search and there are many matches.
Note: this is largely mitigated by better [Autocomplete Ranking], since we can safely limit to 50-100 results.

There's a fine tradeoff between functionality and speed:
- not to exclude 2 letters, eg "Ur"
- not to make the user wait too long after
- not to make a short query too fast, thus delaying performance

h2. Related Experience
- The autocomplete at [] is very fast. It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin").
-- It uses editing (Levensthein) distance, which is a Lucene query option. It allows up to 3 misspelt chars:
-- it ranks matches using TF-IDF ranking
-- it uses the Forest autocomplete module
- The autocomplete of another system of the LifeSci group uses prefix queries and is still fast. But it uses external Solr (parallelized Lucene), not the Lucene built into OWLIM

h2. Suggestions
- when the search is for 2 chars, do exact keyword search ("re" instead of "re*"), to make it faster. Works for the "Ur" case above
- when the result list is bigger than 100 items, the SearchAPI should truncate it to 100 results (to make things faster in the js-part); the rationale is that the user doesn't need that many results in an autocomplete box; the results are sorted (currently semi-alphabetical, with the words starting with the typed word on top)
- delay the 2&3-char query for further 0.5 seconds
I.e. increase the UI typing timeout, so if the user starts typing "London", NO query for "Lon" will be made
- [#Approximate vs Prefix Search]
- Cache short queries (Vlado is against such complications)

h3. Approximate vs Prefix Search
The autocomplete at [] is fast and usable.
It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin"). Vlado asked Kosyo:
- it uses editing (Levensthein) distance, which is a Lucene query option. It allows up to 3 misspelt chars with a Lucene "approximate" query like:
- it ranks matches using TF-IDF ranking
- it uses the Forest autocomplete module

Vlado asked Dominic whether it's necessary to use prefix queries: should "Rem" find "Rembrandt", or is it sufficient for "Rembrand" and "Rembrant" and similar mis-spellings to find it

(!) While prefix search on the first word is not so necessary, it's necessary on subsequent words.
- Eg on []:
-# "aspiri" shows various matches for aspirin, including "litecoat aspirin"
-# "aspirin li" shows *nothing*
-# user has to type all the way to "aspirin litecoa" to get again "litecoat aspirin"
- If we don't do prefix search, then for "british museum"
-# "british" will show "british museum"
-# "british m" will show *nothing*
-# user has to type all the way to "british museu" to get again "british museum"

Maybe the best approach is to combine the two: use approximate search for the first word, and prefix for the rest, eg:
{noformat}britis~3 m*{noformat}
will find "british museum" and also "british mint", etc.

We finally decided that we want prefix

h1. Timing
(Old observation)
- The backend executes a prefix query "re*" in 14s (5482 results) and a non-prefix query "re" in 1.3s (16 results).
- The frontend is much slower (minutes) because of js-part of the autocomplete component which is slow to deal with big result sets.

Data as of 3 Sep 2012. Timing is in ms:
|| Query || #Results || Owlim4 Local || Owlim4 Remote || Owlim5 Local || Owlim5 Remote ||
| oil p | 27 | 906 | 1752 | 1113 | 425 |
| rem | 130 | 2143 | 10753 | 2721 | 1508 |
| fieren | 1 | 7 | 138 | 25 | 30 |
| Vries | 18 | 22 | 188 | 60 | 41 |
| Poorter | 3 | 10 | 91 | 17 | 29 |
| Metropolitan | 29 | 97 | 319 | 174 | 122 |
| Lon | 1151 | 2666 | 20075 | 4740 | 5146 |
| London | 448 | 1937 | 20606 | 2832 | 4426 |

h2. Lucene Alone vs with Thesaurus Restrictions
Mitac timed the Lucene query alone, without the additional thesaurus restrictions
- OWLIM/Lucene restriction alone: 200-500ms
- additional SPARQL clauses (thesaurus restrictions): 1-5s (5-10 times slower)

h2. Lucene Direct vs Lucene in Owlim
Mitac implemented Lucene direct indexing (using LuceneAPI). The index is created from the existing thesauri terms and has the following fields:
|| Field || Description || STORE/INDEX options ||
| CONTENT | all rdf:label-s | Store.NO, Index.ANALYZED |
| RANK | rdfRank from OWLIM | Store.YES, Index.NO; Used in the final lucene ordering |
| THES | List of the thesauri to which the current item belongs, separated by space | Store.YES, Index.NO |
| LABEL | prefLabel, previously computed by the rule (en, none, nl) - cached in this field | Store.YES, Index.NO |
| SCOPENOTE | The scope note, previously computed - cached in this field | Store.YES, Index.NO|
| URI | The URI of the item |Store.YES, Index.NO|
- the indexing takes ~30mins, including the execution of the Owlim queries
- the performance of the search varies from 22% (for easier searches, like "Poorter" above to 8800% for "Lon*"). See the spreadsheet for details
- tested that the top results of the current search and the new search match (up to certain point at least; e.g. for 100 or more results we cannot expect the order to be the same - just for the items with a non-zero rank)
- there are two differences, for "rem*" and for "ams*" where the LuceneDirect search results 8 and 1 less results, respectively. This is probably related to some of the searcher options in Lucene and seems OK

The autocomplete of an AZ project by our LifeSci group uses prefix, and is fast.
Vlado asked Dancho: it uses an external SOLR index (parallel Lucene), not the Lucene index embeded in OWLIM.