Intro
The autocomplete query is slow for short queries (2-3 letters) because we use prefix search, which produces many matches.
Note: this is largely mitigated by better Autocomplete Ranking, since we can safely limit to 50-100 results.
There's a fine tradeoff between functionality and speed:
- not to exclude 2-letter queries, eg "Ur"
- not to make the user wait too long after typing
- not to fire the short queries too eagerly, since those are the ones that hurt performance
Related Experience
- The autocomplete at http://www.linkedlifedata.com is very fast. It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin").
- It uses edit (Levenshtein) distance, which is a Lucene query option. It allows up to 3 misspelt chars:
aspiri~3
- it ranks matches using TF-IDF ranking
- it uses the Forest autocomplete module
- The autocomplete of another system of the LifeSci group uses prefix queries and is still fast. But it uses external Solr (parallelized Lucene), not the Lucene built into OWLIM
Suggestions
- when the search is for 2 chars, do exact keyword search ("re" instead of "re*") to make it faster; this handles the "Ur" case above (see the sketch after this list)
- when the result list is bigger than 100 items, the SearchAPI should truncate it to 100 results, to make things faster in the js part. The rationale is that the user doesn't need that many results in an autocomplete box; the results are sorted (currently semi-alphabetical, with the words starting with the typed word on top)
- delay the 2- and 3-char queries by a further 0.5 seconds, i.e. increase the UI typing timeout, so that if the user keeps typing "London", no query for "Lon" is made at all
- Cache short queries (Vlado is against such complications)
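A minimal sketch of the first two suggestions (exact keyword search for 2-char input, capping the hit list at 100), assuming a plain Lucene index with a CONTENT field. The class, the field name and the length threshold are illustrative, not the actual SearchAPI code:

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class AutocompleteQuery {

    // Cap the hit list so the js part never has to render a huge result set.
    private static final int MAX_RESULTS = 100;

    /** Exact keyword search for 2-char input, prefix search for longer input. */
    public static Query build(String typed) {
        Term term = new Term("CONTENT", typed.toLowerCase());
        // "re" as an exact term is far cheaper than "re*", which expands to
        // thousands of index terms before the search even starts.
        return typed.length() <= 2 ? new TermQuery(term) : new PrefixQuery(term);
    }

    public static TopDocs run(IndexSearcher searcher, String typed) throws IOException {
        return searcher.search(build(typed), MAX_RESULTS);
    }
}
```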
Approximate vs Prefix Search
The autocomplete at http://www.linkedlifedata.com is fast and usable.
It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin"). Vlado asked Kosyo how it works:
- it uses edit (Levenshtein) distance, which is a Lucene query option. It allows up to 3 misspelt chars with a Lucene "approximate" query like:
aspiri~3
- it ranks matches using TF-IDF ranking
- it uses the Forest autocomplete module
Vlado asked Dominic whether it's necessary to use prefix queries: should "Rem" find "Rembrandt", or is it sufficient for "Rembrand", "Rembrant" and similar misspellings to find it?
While prefix search on the first word is not so necessary, it's necessary on subsequent words.
- Eg on http://www.linkedlifedata.com:
- "aspiri" shows various matches for aspirin, including "litecoat aspirin"
- "aspirin li" shows nothing
- user has to type all the way to "aspirin litecoa" to get again "litecoat aspirin"
- If we don't do prefix search, then for "british museum"
- "british" will show "british museum"
- "british m" will show nothing
- user has to type all the way to "british museu" to get again "british museum"
Maybe the best approach is to combine the two: use approximate search for the first word, and prefix for the rest, eg:
britis~3 m*
will find "british museum" and also "british mint", etc.
We finally decided that we want prefix search.
Timing
(Old observation)
- The backend executes a prefix query "re*" in 14s (5482 results) and a non-prefix query "re" in 1.3s (16 results).
- The frontend is much slower (minutes) because the js part of the autocomplete component is slow to handle big result sets.
Data as of 3 Sep 2012. Timing is in ms:
Query | #Results | Owlim4 Local | Owlim4 Remote | Owlim5 Local | Owlim5 Remote |
---|---|---|---|---|---|
oil p | 27 | 906 | 1752 | 1113 | 425 |
rem | 130 | 2143 | 10753 | 2721 | 1508 |
fieren | 1 | 7 | 138 | 25 | 30 |
Vries | 18 | 22 | 188 | 60 | 41 |
Poorter | 3 | 10 | 91 | 17 | 29 |
Metropolitan | 29 | 97 | 319 | 174 | 122 |
Lon | 1151 | 2666 | 20075 | 4740 | 5146 |
London | 448 | 1937 | 20606 | 2832 | 4426 |
Lucene Alone vs with Thesaurus Restrictions
Mitac timed the Lucene query alone, without the additional thesaurus restrictions
- OWLIM/Lucene restriction alone: 200-500ms
- additional SPARQL clauses (thesaurus restrictions): 1-5s (5-10 times slower)
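For reference, a sketch of the shape of the query being timed, written out as a SPARQL string in Java. The luc: predicate is how OWLIM exposes its built-in Lucene index; the index name, the label property and the skos:inScheme thesaurus restriction are illustrative assumptions, not the actual SearchAPI query:

```java
public class AutocompleteSparqlSketch {

    /** Illustrative only: a full-text clause plus a thesaurus restriction clause. */
    public static final String QUERY =
        "PREFIX luc:  <http://www.ontotext.com/owlim/lucene#>\n" +
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n" +
        "SELECT ?item ?label WHERE {\n" +
        // the OWLIM/Lucene restriction alone: 200-500ms
        "  ?label luc:entityIndex \"lon*\" .\n" +
        "  ?item skos:prefLabel ?label .\n" +
        // the additional thesaurus restriction: this is where the extra 1-5s goes
        "  ?item skos:inScheme <http://example.org/thesaurus/example> .\n" +
        "}";
}
```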
Lucene Direct vs Lucene in Owlim
Mitac implemented Lucene direct indexing (using LuceneAPI). The index is created from the existing thesaurus terms and has the following fields:
Field | Description | STORE/INDEX options |
---|---|---|
CONTENT | all rdf:label-s | Store.NO, Index.ANALYZED |
RANK | rdfRank from OWLIM | Store.YES, Index.NO; Used in the final lucene ordering |
THES | List of the thesauri to which the current item belongs, separated by space | Store.YES, Index.NO |
LABEL | prefLabel, previously computed by the rule (en, none, nl) - cached in this field | Store.YES, Index.NO |
SCOPENOTE | The scope note, previously computed - cached in this field | Store.YES, Index.NO |
URI | The URI of the item | Store.YES, Index.NO |
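A sketch of how one document of this index might be assembled, using the Lucene 3.x Field API implied by the Store/Index options in the table; the class, method signature and variable names are illustrative:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ThesaurusTermIndexer {

    /** One Lucene document per thesaurus item, with the fields listed above. */
    public static Document toDocument(String uri, String allLabels, double rdfRank,
                                      String thesauri, String prefLabel, String scopeNote) {
        Document doc = new Document();
        // Only CONTENT is analysed and searched; the other fields are stored
        // payload that can be read back from the hits.
        doc.add(new Field("CONTENT",   allLabels,               Field.Store.NO,  Field.Index.ANALYZED));
        doc.add(new Field("RANK",      String.valueOf(rdfRank), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("THES",      thesauri,                Field.Store.YES, Field.Index.NO));
        doc.add(new Field("LABEL",     prefLabel,               Field.Store.YES, Field.Index.NO));
        doc.add(new Field("SCOPENOTE", scopeNote,               Field.Store.YES, Field.Index.NO));
        doc.add(new Field("URI",       uri,                     Field.Store.YES, Field.Index.NO));
        return doc;
    }
}
```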
- the indexing takes ~30mins, including the execution of the Owlim queries
- the performance of the search varies from 22% (for easier searches, like "Poorter" above) to 8800% (for "Lon*"). See the spreadsheet for details
- tested that the top results of the current search and the new search match (up to a certain point at least; e.g. for 100 or more results we cannot expect the order to be the same - just for the items with a non-zero rank)
- there are two differences, for "rem*" and for "ams*", where the LuceneDirect search returns 8 and 1 fewer results, respectively. This is probably related to some of the searcher options in Lucene and seems OK
lucene-direct-vs-owlim.xlsx
The autocomplete of an AZ project by our LifeSci group uses prefix queries and is fast.
Vlado asked Dancho: it uses an external SOLR index (parallel Lucene), not the Lucene index embedded in OWLIM.