Dynamic partial match in thesaurus values
Also see Autocomplete Performance, Autocomplete Ranking, Autocomplete Testing
Specification
TODO: UPDATE THIS!
- We have a search field, say Place. The thesaurus used for that field is fixed: RS3.2 will deal with one thesaurus per field
- User types 3 or more letters. We search in the thesaurus
- We look at the beginning of any word in the thesaurus label
- We don't look in the middle of words
- We look in all languages
- Example: "ber" should find:
- berlin (en) and berlijn (nl) - same thesaurus entry, different labels for it
- village of bergen (en) - beginning of a word, only English label matches
- We dynamically show the matching entries to the user (EN labels). DO NB: What about the scope note? See the BVA example of Freebase, which shows contextual information.
- The ordering of the entries is based on relevance
- Mitac: based on the frequency of the usage of the entry in data (either Term Frequency or Document Frequency).
- That is, if 2 pictures are connected to Berlin and no pictures are connected to Village of Bergen, we will show Berlin first.
- Jana: This needs additional specification. Vlado: seems pretty clear to me? Jana: What's the ordering we are going to use: lexical, based on data frequency etc.?
- Done on the OWLIM side. Don't use caching in GUI/Nuxeo - we may have 80 thousand entries in a thesaurus.
- User chooses one entry
- User may enter multiple entries.
- Vlado: Jana, I don't understand. Can he select several entries at once? Or does he add entries one by one, using the "+"? Edit and merge into the bullet above. Jana: That's right, they use "+".
- DO - This is very much linked to the search interface and needs to support the predicate approach of the FR/FC system
- Backend notes
- both alphabetical ordering and ordering by relevance could be supported (the first one is trivial)
- the result of the call should be an Entity with its URI and label(s).
- Multilinguality will be partially handled by the Entity (e.g. if we decide to represent the labels as a Map of language -> label), but we probably also need to know which label matched
- the relevance can be pre-computed if slow - the thesauri do not change (often)
- Performance: what would be the maximum time for the autocomplete to work before it returns results?
- DO - how does this all relate to co-referencing of terms? Does the user select the vocabulary they want? This spec needs to address the co-referencing functionality
- GUI sends all selected URIs to SearchAPI
- SearchAPI constructs a search condition using OR (disjuncts)
- URI is a better choice than "entered word", as we may have several matches (eg 2 villages of Bergen - one in USA and one in Germany), and the user has selected the ones he wants. Would an indication of position in the hierarchy be required?
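The matching, ordering and search-condition rules above can be sketched roughly as follows. This is only an illustration in Python, not the OWLIM-side implementation: the Entry class, the usage_count field, the example.org URIs, the "Amberg" entry and the ?place variable name are all assumptions made up for the example. The OR condition is written as a SPARQL-style filter expression, which is also an assumption about how SearchAPI would express it.

```python
import re
from dataclasses import dataclass

MIN_CHARS = 3  # the spec: user types 3 or more letters


@dataclass
class Entry:
    """Hypothetical thesaurus entry: URI, labels per language, usage in data."""
    uri: str
    labels: dict          # language code -> label, e.g. {"en": "Berlin"}
    usage_count: int = 0  # how many objects refer to this entry (assumed field)


def matching_label(entry, typed):
    """Return (lang, label) of the first label in which some word starts
    with the typed text, or None. Matches in the middle of a word are ignored."""
    prefix = typed.casefold()
    for lang, label in entry.labels.items():
        words = re.findall(r"\w+", label.casefold())
        if any(word.startswith(prefix) for word in words):
            return lang, label
    return None


def autocomplete(entries, typed, limit=10):
    """Return [(entry, (lang, matching label)), ...], most used entries first."""
    typed = typed.strip()
    if len(typed) < MIN_CHARS:
        return []
    hits = []
    for entry in entries:
        match = matching_label(entry, typed)
        if match is not None:
            hits.append((entry, match))
    # Relevance = usage frequency in data; ties are broken by the EN label.
    hits.sort(key=lambda h: (-h[0].usage_count, h[0].labels.get("en", "")))
    return hits[:limit]


def or_condition(var, uris):
    """Build a disjunction over the URIs the user selected."""
    return " || ".join(f"?{var} = <{uri}>" for uri in uris)


# Made-up sample data, following the "ber" example in the spec.
ENTRIES = [
    Entry("http://example.org/place/bergen", {"en": "Village of Bergen"}, 0),
    Entry("http://example.org/place/berlin", {"en": "Berlin", "nl": "Berlijn"}, 2),
    Entry("http://example.org/place/amberg", {"en": "Amberg"}, 5),
]
```

With this data, autocomplete(ENTRIES, "ber") returns Berlin before Village of Bergen (2 uses vs. 0) and skips Amberg, where "ber" is in the middle of the word. Each hit carries the matching label, so the caller knows which language variant matched even if only the EN label is displayed.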
Limitations
- RS3.2 will use one thesaurus per field: this is a limitation that needs to be removed in later iterations
- RKD had separate thesauri "Support Material" vs "Frame Material". But they didn't give us "Frame Material", so we extracted values from the data, and put them in one merged thesaurus. (DO: presumably necessary to support the more primitive FCs)
- When we bring BM data into a shared space, one should be able to search using RKD and/or BM thesauri. That's under 13 Terminology Matching (Coreference). Eg it should collect all entries of Rembrandt from all thesauri (DO: Co-reference again)
- RS3.2 won't
- RS3.2 will not use skos:altLabel (i.e. "use for" labels). It will use only the main label (in all its language variants)
- What info to display: only label, always in EN
- Jana: highlighting the matching word would be nice. Vlado: not in RS3.2
- Vlado: for Artist it would be useful to display years of birth/death (many Rembrandts out there...). But we have no such data in the extracted rkd-artists (RKD did not provide this thesaurus)
- Jana: maybe display the Broader Place?
Vlado: we don't need to deal with broader, since the labels in rkd-places are unambiguous. E.g.: (DO: But you might elsewhere: please consider BM thesauri terms, not just RKD.)
"Bergen (Noorwegen)"
"Bergen Belsen"
"Bergen (Niedersachsen)"
"Bergen im Chiemgau"
- In http://www.linkedlifedata.com we show much more (eg term type, description, etc)
- RS3.2 will do autocomplete only over thesauri (FRs). It won't do autocomplete over FTS, since that'd be too broad (we're searching only for root Museum Objects, not sub-objects/ artists/ collections...)
- RS3.2 won't do spelling correction
- we have lots of experience with this, using both phonetic matching methods (Caverphone) and editing fixes (Levenshtein distance)
- eg "astma" on http://www.linkedlifedata.com finds all entries for "asthma", "astra" etc
- but Dominic doesn't want to make users lazy
- DO: If it is straightforward...
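For reference, the edit-distance variant of spelling correction mentioned above could look roughly like the sketch below. It is not part of RS3.2 and it is not the code behind http://www.linkedlifedata.com; the function names, the max_distance parameter and the "aspirin" label are assumptions for illustration. Phonetic matching (Caverphone) is not shown.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes or substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def fuzzy_matches(typed, labels, max_distance=1):
    """Return the labels within max_distance edits of the typed text,
    in their original order. max_distance is an assumed tuning knob."""
    typed = typed.casefold()
    return [
        label for label in labels
        if levenshtein(typed, label.casefold()) <= max_distance
    ]
```

With max_distance=1, fuzzy_matches("astma", ["asthma", "astra", "aspirin"]) keeps "asthma" (one insert) and "astra" (one substitution) and drops "aspirin". Comparing against every label is linear in the thesaurus size, so with 80 thousand entries this would need an index or pre-filtering on the server side.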
Possible Enhancements
Better Display
- shows the matching part of the word underlined
- shows the matching altLabel in parentheses (not only the prefLabel)
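A minimal sketch of these two display enhancements, assuming plain-text output where underscores stand in for underlining (a real GUI would use markup instead). The function names, the mark parameter and the "Den Haag" / "The Hague" pair are made up for the example.

```python
import re


def highlight(label, typed, mark="_"):
    """Wrap every occurrence of the typed text that starts a word in `mark`.
    Occurrences in the middle of a word are left alone."""
    if not typed:
        return label
    pattern = re.compile(r"\b" + re.escape(typed), re.IGNORECASE)
    return pattern.sub(lambda m: f"{mark}{m.group(0)}{mark}", label)


def display(pref_label, alt_labels, typed):
    """Show the prefLabel; if only an altLabel matches, append the first
    matching altLabel in parentheses with the match highlighted."""
    shown = highlight(pref_label, typed)
    if shown != pref_label:
        return shown
    for alt in alt_labels:
        alt_shown = highlight(alt, typed)
        if alt_shown != alt:
            return f"{pref_label} ({alt_shown})"
    return pref_label
```

For example, display("Berlin", [], "ber") gives "_Ber_lin", and display("Den Haag", ["The Hague"], "hague") gives "Den Haag (The _Hague_)".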