{toc}

h1. Problem & Tasks
- The query is slow for 3-letter queries, because we use prefix search and there are many matches
- Kostadinov to increase the UI typing timeout, so if the user starts typing "London", NO query for "Lon" will be made
- Mitac proposes a cache, Vlado is against such complications
- Mitac to time the Lucene query alone, without the additional thesaurus restrictions
h1. Intro
The autocomplete query is slow for short queries (2 and 3 letters), because we use prefix search and there are many matches.
Note: this is largely mitigated by better [Autocomplete Ranking], since we can safely limit to 50-100 results.

h1. Alternative Approaches
There's a fine tradeoff between functionality and speed:
- not to exclude 2 letters, eg "Ur"
- not to make the user wait too long after typing
- not to fire a short query too early, thus degrading performance
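
The tension between the last two points can be illustrated with a small simulation of a UI typing timeout (debounce). This is only a sketch: the function name, the 300 ms base timeout and the keystroke timings are assumptions made for illustration, not the actual UI code.

```python
def fired_queries(keystrokes, base_ms, short_extra_ms=0, short_len=3):
    """Return the texts for which a debounced autocomplete box would send a query.

    keystrokes: list of (time_ms, text_in_box) pairs in typing order.
    A query for a text fires only if no further keystroke arrives within its
    timeout; texts of up to short_len chars wait short_extra_ms longer.
    """
    fired = []
    for i, (t, text) in enumerate(keystrokes):
        timeout = base_ms + (short_extra_ms if len(text) <= short_len else 0)
        is_last = i == len(keystrokes) - 1
        # The pending query is cancelled when the next key arrives before the timeout.
        if is_last or keystrokes[i + 1][0] - t >= timeout:
            fired.append(text)
    return fired


# Hypothetical typing of "London" with a 600 ms hesitation after "Lon".
typing = [(0, "L"), (150, "Lo"), (300, "Lon"), (900, "Lond"), (1050, "Londo"), (1200, "London")]

print(fired_queries(typing, base_ms=300))                      # no extra delay
print(fired_queries(typing, base_ms=300, short_extra_ms=500))  # extra 0.5 s for short texts
```

With these assumed timings, the first call sends queries for "Lon" and "London", while the second sends only "London". The price is that a user who really wants a 2- or 3-letter query waits 0.5 s longer.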

h2. Related Experience
- The autocomplete at [http://www.linkedlifedata.com] is very fast. It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin").
-- It uses edit (Levenshtein) distance, which is a Lucene query option. It allows up to 3 misspelt chars:
{noformat}aspiri~3{noformat}
-- it ranks matches using TF-IDF ranking
-- it uses the Forest autocomplete module
- The autocomplete of another system of the LifeSci group uses prefix queries and is still fast. But it uses external Solr (parallelized Lucene), not the Lucene built into OWLIM
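
To make the edit-distance behaviour concrete, here is a minimal sketch of the textbook Levenshtein distance. It is not Lucene's implementation (how Lucene interprets the number after {{~}} depends on the Lucene version); it only shows why "aspiri" is close to "aspirin" while "asp" is not.

```python
def levenshtein(a, b):
    """Number of single-char insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("aspiri", "aspirin"))  # one missing char
print(levenshtein("asp", "aspirin"))     # four missing chars
```

"aspiri" is 1 edit away from "aspirin", so it falls within a limit of 3. "asp" is 4 edits away, so it does not, which matches the behaviour described above.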

h2. Suggestions
- when the search is for 2 chars, do exact keyword search ("re" instead of "re*"), to make it faster. Works for the "Ur" case above
- when the result list is bigger than 100 items, the SearchAPI should truncate it to 100 results (to make things faster in the JS part). The rationale is that the user doesn't need that many results in an autocomplete box. The results are sorted (currently semi-alphabetically, with the words starting with the typed word on top)
- delay the 2- and 3-char queries by a further 0.5 seconds, i.e. increase the UI typing timeout, so if the user starts typing "London", NO query for "Lon" will be made
- [#Approximate vs Prefix Search]
- Cache short queries (Vlado is against such complications)
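
The first two suggestions above can be sketched as follows. The function names and the handling of 1-char and empty input are assumptions made for illustration. Escaping of Lucene special characters and multi-word input are deliberately left out.

```python
MAX_RESULTS = 100  # truncation limit proposed above


def build_query(typed):
    """Return the Lucene query string for the typed text, or None for empty input."""
    term = typed.strip()
    if not term:
        return None        # assumption: nothing typed, nothing to query
    if len(term) <= 2:
        return term        # exact keyword search, e.g. "re" (1-char case is an assumption)
    return term + "*"      # prefix search, e.g. "rem*"


def truncate(results, limit=MAX_RESULTS):
    """Keep only the first `limit` results; assumes they are already sorted."""
    return results[:limit]


print(build_query("Ur"))                      # exact search
print(build_query("rem"))                     # prefix search
print(len(truncate(list(range(5482)))))       # size of the list sent to the UI
```

Since truncation keeps the head of the list, it only makes sense after sorting, so that the words starting with the typed word stay on top.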

h2. Approximate vs Prefix Search
The autocomplete at [http://www.linkedlifedata.com] is fast and usable.
It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin"). Vlado asked Kosyo:
will find "british museum" and also "british mint", etc.

h2. SOLR vs Embedded Lucene
The autocomplete of an AZ project by our LifeSci group is prefix and very fast.
Vlado asked Dancho: it uses an external SOLR index, not the Lucene index embedded in OWLIM.
We finally decided that we want prefix search.

h1. Timing
(Old observation)
- The backend executes a prefix query "re*" in 14s (5482 results) and a non-prefix query "re" in 1.3s (16 results).
- The frontend is much slower (minutes) because the JS part of the autocomplete component is slow to deal with big result sets.

Data as of 3 Sep 2012. Timing is in ms:
|| Query || #Results || Owlim4 Local || Owlim4 Remote || Owlim5 Local || Owlim5 Remote ||
| oil p | 27 | 906 | 1752 | 1113 | 425 |
| London | 448 | 1937 | 20606 | 2832 | 4426 |
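
To read the table more easily, the remote/local ratio can be computed per query and OWLIM version. This is only arithmetic on the numbers above; the dictionary layout and function name are made up for the sketch.

```python
# (local_ms, remote_ms) copied from the table above
timings_ms = {
    "oil p": {"Owlim4": (906, 1752), "Owlim5": (1113, 425)},
    "London": {"Owlim4": (1937, 20606), "Owlim5": (2832, 4426)},
}


def remote_to_local_ratio(local_ms, remote_ms):
    """How many times slower (>1) or faster (<1) the remote run is."""
    return remote_ms / local_ms


for query, by_version in timings_ms.items():
    for version, (local_ms, remote_ms) in by_version.items():
        ratio = remote_to_local_ratio(local_ms, remote_ms)
        print(f"{query:8} {version}: remote/local = {ratio:.1f}x")
```

The largest gap is "London" on Owlim4, where remote is about 10.6 times slower than local. For "oil p" on Owlim5 the remote run is actually faster than the local one.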

h1. LuceneDirect vs. Lucene/Owlim
h2. Lucene Alone vs with Thesaurus Restrictions
Mitac timed the Lucene query alone, without the additional thesaurus restrictions
- OWLIM/Lucene restriction alone: 200-500ms
- additional SPARQL clauses (thesaurus restrictions): 1-5s (5-10 times slower)

h2. Lucene Direct vs Lucene in Owlim
- implemented Lucene direct indexing (using LuceneAPI). The index is created from the existing thesauri terms and has the following fields:
|| Field || Description || STORE/INDEX options ||
- there are two differences, for "rem*" and for "ams*", where the LuceneDirect search returns 8 and 1 fewer results, respectively. This is probably related to some of the searcher options in Lucene and seems OK
[^lucene-direct-vs-owlim.xlsx]

The autocomplete of an AZ project by our LifeSci group uses prefix, and is fast.
Vlado asked Dancho: it uses an external SOLR index (parallel Lucene), not the Lucene index embedded in OWLIM.