ResearchSpace Semantic Search Specification
Date | Ver | Who | What
2012-04-08 | 0.1 | Dominic Oldman | Created
2012-04-21 | 0.1-VA | Vladimir Alexiev | Commented
2012-04-20 | 0.2 | Dominic Oldman | Extended
2012-04-24 | 0.2-VA | Vladimir Alexiev | Commented
2 Adherence to Client Business Requirements
3.1 Search Sentence - Overview
3.2 Standard Search – details
4 Advanced Explore and Advanced Search
Appendix 1 - Example Queries
Appendix 2 - Authority control in Merlin (Peter Main 19/08/2009)
Appendix 3 – Search Sentence Process
Appendix 4 – Timeline and Geographical mapping
1 Introduction
Search is described at pages 29, c38, 43 – 46, and page 92 of the ResearchSpace Requirements & Specification v2 document. This describes the tool at a high level but includes the following:
Ontology Search - The ontology search would allow searching by CRM ontology terms. (The project is using the Fundamental Concepts and Relationships devised by Martin Doerr.)
Comment: Right, FR search.
Taxonomy searching - Co-reference of terms used by the different organizations submitting data.
Comment: Right, Thesaurus search.
Comment (on co-reference): Out of RS3 scope since there are no planned tasks to match thesauri. Various matching tools and RDFized thesauri are available for this. See comment in App2.
Related Results - The search environment should suggest different avenues of exploration by showing related results using different categories.
Precision Searching - Concentrate on searching material on a particular object or subject and therefore narrow down the potential results from the outset.
Comment: Same as FR search?
External relationships - The potential relationships to be discovered should extend to data sources outside the ResearchSpace sphere.
Keyword Searching – Free text searching of literals.
Comment: Right, FTS search. Our FTS index also indexes (multilingual) labels of thesauri values.
2 Adherence to Client Business Requirements
Comment: Roger that :-)
Please refer to ResearchSpace Business Requirements & Specification, Date: May 2010, Version: 2. These elements are part of client acceptance testing.
Relevant Business rules
Business Rule 2 : Established Data – Existing data and assets, for example accessible organisational collection data and associated images, should be uploaded to the shared ResearchSpace repository and be available to all projects. Some multimedia assets will require some access restrictions and this reflects the practical reality that some media assets are more likely to be subject to conditions and restrictions. However, ResearchSpace would always encourage free and unfettered access to cultural heritage data.
Business Rule 3: Open System Standards – The ResearchSpace tools should not be dependent on proprietary APIs. It should be possible to take any tool developed for use with open standard RDF data and incorporate that tool into ResearchSpace without substantial redevelopment. Similarly, tools developed specifically for ResearchSpace should be usable with little or no modification outside the ResearchSpace environment. This business rule ensures the open nature of ResearchSpace and keeps the RDF research tool model simple and accessible.
Business Rule 5 : Ontology Standard - In order to obtain a good level of data harmonisation and therefore allow effective exploration of data supplied from different sources, the ontology standard for all data will be the CIDOC Conceptual Reference Model (CRM).
Business Rule 6: Use Cases – The main ResearchSpace elements of data, collaboration, analysis, and web publication should be integrated but also available as separate functions to encourage a wide range of project and related use.
Business Rule 7: Open Source – New software created will be released to the community freely as open source. Existing open source tools should, if possible, be utilised for ResearchSpace development.
4 Standard Search
The ResearchSpace system will need to provide different search mechanisms for different users. Some will require very simple search interfaces and some will require more advanced and precise searching. Standard search will utilise the Fundamental Relationships and categories concept developed at FORTH. These can be used in many different implementations and UI presentations depending upon context.
This document refers to a part of the Stage 3 search interfaces which will be the default search mechanism used by a general user. The system is referred to as the search sentence, but at this stage this does not represent a grammatical rules-based search system (and the complexities that this would entail); rather, it refers to the horizontal ‘sentence like’ nature of the presentation.
4.1 Search Sentence - Overview
1. The user will conduct a keyword search that uses an index across text and controlled terms.
Comment: The FTS search looks at all words no matter where they occur (including thesauri!). The Thesaurus search auto-completes against all thesauri and is more restrictive: the term must occur in the object, and in a FR of the object. So we can combine the two searches, but they still use different indexes and very different mechanisms.
2. The results will be displayed with the search sentence available for the user to refine the results if they wish.
3. The first item will be the term typed into the keyword search but this could be replaced.
4. Search term boxes are auto-complete, and on selecting a term the thesaurus should be displayed in a box next to the term (similar to a mouse over event) with information about the term. This help method will be applied to other drop down items like Boolean operators and relationship terms.
5. Terms can be refined using AND, OR (and any other allowed Boolean operators) as well as WHICH – a special keyword for using relationship connectors.
6. Boolean operators should have their own scope note box which provides an example of usage. The WHICH keyword should also have an appropriate scope note box.
7. WHICH - refers to the use of a fundamental relationship in the next entry box.
8. Only fundamental relationships and thesauri terms that apply (based on the operators and relationships used at the entry position) should be available.
9. If a fundamental relationship is specified then only valid terms can be auto-completed in the next box.
10. When selecting a fundamental relationship this too should have a scope note to explain how it is applied and the types that it can be applied to.
11. The system should allow for further configuration of relationships and operators.
Comment: Given the complexity of FRs and the need to reload the repository when OWLIM rules are added, I'd say "further development".
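The rule that only valid terms may be auto-completed after a fundamental relationship (FR) can be illustrated with a small sketch. This is a minimal Python illustration, assuming each FR declares the thesaurus types it accepts; the FR names, thesaurus types, terms and function names below are hypothetical illustration data, not the ResearchSpace vocabulary or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    label: str
    thesaurus: str  # thesaurus type shown next to the term, e.g. "Place"


# Hypothetical FR ranges: the thesaurus types each FR accepts.
FR_RANGE = {
    "from place": {"Place"},
    "made of": {"Material"},
}

# Hypothetical thesaurus content.
TERMS = [
    Term("The Hague", "Place"),
    Term("Glass", "Material"),
    Term("Glass", "Object type"),
    Term("Greece", "Place"),
]


def autocomplete(prefix, fr=None):
    """Return terms whose label starts with prefix.

    If an FR is given, keep only terms from thesauri in that FR's range.
    """
    allowed = FR_RANGE[fr] if fr is not None else None
    return [
        t for t in TERMS
        if t.label.lower().startswith(prefix.lower())
        and (allowed is None or t.thesaurus in allowed)
    ]


def display(term):
    """Label a term with its thesaurus type so it is distinguishable from a keyword."""
    return f"{term.label} [{term.thesaurus}]"
```

Without an FR, typing "gl" would offer both "Glass [Material]" and "Glass [Object type]"; after the hypothetical "made of" FR only the Material term remains.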
4.2 Standard Search – details
N | Function | Phase | Comment (VA)
1 | The design should conform to the BVA design above, and this document adds to the features in the design. Where a feature is in the design but not in this document, it should be assumed and added. For any differences please call the author. | RS3.4 | ok
2 | The search term builds horizontally from drop down boxes as each entry is made | RS3.4 | How about year search, using numbers "from-to"?
3 | It should be possible to edit the boxes and re-submit searches. | RS3.4 | Propose to have edit only for keywords/terms. FR clauses will be deleted & reinserted
4 | It should be possible to delete boxes or add new ones and re-submit searches. | RS3.4 | Delete will work on complete FR clauses (FR+terms); or keywords/terms (if used without FR). If we adopt flexible Grammar, deleting/adding will be possible only at the end
5 | An entry can be either a free text search or a controlled search depending upon whether a controlled term is fully selected. For example a user could input “cotton” AND ‘cotton’ to search both free text and material controlled term | RS3.4 | Bad example: the keyword would also find the term, so the term is superfluous. Better example: "cotton and Canvas [Material]" where the first is a keyword, the second a term.
6 | Terminology items will be identified by the thesaurus they originate from, e.g. Glass (object), Glass (material) | RS3.4 | Ok. Thesaurus Type should be always on-screen, to indicate this is a Term search, and different from Keyword search
7 | All dropdown items will have scope notes in boxes that appear on a mouse over (tooltip). · Terminology – will have the thesauri title and the scope note · Boolean operators – will have a definition and an example · WHICH – will have a definition and an example · Relationships – will provide examples based on the predicates that they are aggregating | RS3.4 | · Term: if we have additional details (e.g. rdfs:comment): there aren't any in RKD thesauri. · Boolean: is it really necessary to explain what AND/OR mean? · WHICH: I still don't understand what it is for, and imho nobody has explained it clearly. Why can't the user just select the FR? I think it's a parasitic word and not needed · FR: And most of all to define it (e.g. the definition of Thing From Place is several sentences!)
8 | Intelligent Sentences. The search sentence will only provide relationships that apply to the subject that has just been entered. | ? | Let's discuss. E.g. if I enter term "The Hague [Place]", what should happen? Should it prepend a box with relations "any" (default), "from place", "about place"??
9 | It should be possible to add a thumbnail image to the scope note to illustrate the term. Therefore it should be possible in some later configuration to associate representative images with thesauri terms | RS4 | No images in RKD/BM thesauri, AFAIK
10 | When the term is selected the next box in the sentence appears | RS3.4 | Yes, there's no need for the user to enter AND since that's the only top-level connective
11 | A controlled term or free text word is always followed by a box to “refine” | ?? | "Refine" what? Describe in detail how this would work
12 | A WHICH is always followed by a “define” which has a drop down of relationships supported by that part of the search sentence | | In RS3 all FRs will be about Thing so all apply all the time. I don't see the need for WHICH
13 | Only authority files / thesauri that are supported by the relationship should be available on the next autocomplete box | RS3.4 | Ok. Each FR should know its range: Thesaurus Type(s) or Date Range
14 | If there is a Boolean limitation for ORs then only controlled terms from the same thesaurus as the connected term will be available for auto-complete | RS3.4 | Yes: in OR the FR stays the same
15 | The Boolean AND can be used with terminology from any Authority | RS3.4 | After AND the user enters a keyword, term, or selects a new FR that determines the thesaurus Types of the next term
16 | The search settings will allow a user to change the datasets that are being searched. A user can select all available datasets, or selected datasets including the project dataset. Defaults can also be set. E.g.: Select All, Select this project, Rembrandt, Cranach, etc. | RS4 | The plan is to address multi-project and security features in RS3.5. But we need to define "data set" since the spec talks about graphs (data spaces) and there are only 2 from the viewpoint of a given project: shared and project space. Need to decide what we use Named Graphs for. Need to define what happens when data is annotated then changed inside a project
17 | The search settings should be able to include external semantic sources (another museum endpoint, DBpedia…). Note: The relationship mapper functionality will probably need to incorporate external taxonomies | RS4 | Out of RS3 scope, but IMHO very important for RS4 since we want to leverage available resources: Wikipedia, LOD (including DBpedia, FreeBase), and anything else found through Google. E.g. SEMLIB (see Related projects) allows Annotation based on either internal thesauri, or external sources like FreeBase, and leverages a FB search API
18 | The absence of an object type denotes any object | RS3.4 | Clear (and no need to say it)
19 | Search results should be displayed with an image and summary metadata (currently: title, maker/artist, date, location, material, technique). | RS3.4 | Need to define "Display Fields": · Universally across CIDOC objects (in RSO extension ontology) · As specialized (subset) FRs · including Main Image
20 | Display Fields should be configurable: select from the main metadata fields and we should be able to draw from a larger list. A dropdown with the field possibilities should be available and provide multi-selection. You should also be able to turn all metadata text off. See also map and timeline functionality. | RS4 | What other fields do you contemplate beyond the ones mentioned above? Then let's see how we can define them universally across CIDOC objects
21 | It should be possible to filter results with facets. (Note that the design only presents core concept facets.) We should be able to make use of other controlled authorities (material, culture, etc – see the annex 2 – when available). An example is the Finnish site at p.62 of the Business Requirements ( www.museosuomi.fi ) | RS3.5? | Exhibit handles the faceting. Do we need to transfer selected facets into the Search (i.e. a Refine Search operation), forming an AND/OR FR search? Will facets coincide with FRs?
22 | It should be possible to select items in the summary results pages for inclusion in the data basket (copy & link) | RS3.4 | Link is clear. But what's Copy?
23 | It should be possible to select a summary result list and click through to see the object (full details). (Note: we currently only have the data annotation tool that does this and we will need to implement a non research tool detailed results page with a similar design) | RS3.4 | (Not sure what you mean: after Search you can also see the object details)
24 | Configuration should be available to change the relationship terms to other more meaningful terms (masks). [Comment: What do you mean by terms/masks?] For example: “From” could refer to relationships that are not obvious to the user. | ?? | Do we want to create divergence by letting people give their own "translations" of FRs? I think this goes against CIDOC's desire for standardization/unification. If we show them the scope notes, they'll quickly learn the FRs.
25 | Saving the Search should provide a dialogue box that allows the search to be named and a description of the search to be entered. | RS3.4/5 | Need to define ontology for searches, mechanism to save them in RDF, and be able to put them in Basket
26 | Note on Time line and GeoMap Functions: These features will need to be fully specified as applications in their own right and additional specifications will be required. Only features relevant to search functionality are described in this document. For search these plug-ins should simply be seen as a view replacing the standard summary thumbnail review but with some additional contextual user controls. | RS3.1, RS4 | Timeline: available since RS3.1. GeoMap: requires a mapping from RKD/BM Places to GeoNames, or another mechanism to provide coordinates (see App4 for discussion)
27 | The user should be able to switch to a timeline or map view (and between these views) which provides a visualisation but has the same functionality as the standard results screen, e.g. selection, facet filtering, presentation of metadata fields etc. The exhibit plug-in simply replaces the thumbnail view but all other features are retained with some additional exhibit controls. | RS3.1 | Yes, Exhibit accomplishes this
28 | The plotted results should have the same summary information as the standard results screen, and a click should provide a popup box with the thumbnail and additional selection functions as in the main results screen. (The popup is effectively providing the same as the thumbnail results for adding to the data basket etc.) Clicking on a result plotted on a timeline or map should result in a view of the object's details as a popup similar to the thumbnail view on the standard search screen. | RS3.1 | ok. Please reword to remove redundancy
29 | Timeline results Mapping: Should allow the user to change the time line scale (days, months, weeks, years – on two levels) | RS3.1 | Yes, Exhibit accomplishes this
30 | Time line and Map data representation: The object should be represented by a suitable marker similar to that in Google Maps. | RS3.4 |
31 | Text next to the marker should be configurable by the user and default to the object title. A dropdown with the field possibilities should be available and provide multi-selection. It should be possible to turn text on and off. | RS3.5? | Jana?
32 | Geo Results Mapping: Should allow zooming into particular parts of the map. | RS4 |
33 | Individual results can be saved to the data basket as they can on the main results screen. | RS3.4 |
34 | Explore Refine – The search sentence should allow the user to change the parameters to alter the results. This should include the ability to: · Replace terms with different ones, for example replacing Rembrandt with Flinck · Add additional search criteria, for example add Flinck so that the search is for Rembrandt OR Flinck or Rembrandt AND Flinck · Remove criteria | RS3.4 | Please merge with 3,4
35 | Use a result as a basis for exploring. For example a result could be used as a root for finding other items based on relationships with different metadata fields. For example TODO: The results may have found various objects some of which were created by a particular person | RS3.5? | (Called "Related Results" in the intro). I think this is very similar to "Refine Search" described above, but uses a single object instead of the selected facets. Should we use all facet values of the object, or only the lowest-level, or what?
36 | Results are paginated and the user must use a paging system. (Presented in rows with paging for large result sets.) In the Time Line and Map views all results are plotted but the user can change the scales or specify a limited result (Top 50 / 100 etc.) | RS3.1, RS4 | RS3 uses Exhibit2 for result-set presentation and supports the first 1k results. RS4 can leverage Exhibit3 (yet another server) to support 100k results. Various views are supported (rows, thumbnails = lightbox).
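The paging requirement in item 36 can be sketched independently of the presentation layer. This is a minimal Python illustration, not Exhibit code: the function name, the default page size and the treatment of the "first 1k results" limit as a simple cap are all assumptions made for the example.

```python
MAX_RESULTS = 1000  # assumption: models the "first 1k results" limit as a hard cap


def page_of(results, page, page_size=20, max_results=MAX_RESULTS):
    """Return (items, total_pages) for a 1-based page number.

    Results beyond max_results are ignored, so the page count reflects
    only what the presentation layer is assumed to be able to show.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    capped = results[:max_results]
    total_pages = max(1, -(-len(capped) // page_size))  # ceiling division
    start = (page - 1) * page_size
    return capped[start:start + page_size], total_pages
```

A "Top 50 / 100" limit for the timeline and map views could reuse the same function by passing a smaller `max_results` and a `page_size` equal to it.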
5 Advanced Explore and Advanced Search
To be specified in another document
Appendix 1 - Example Queries
Catalogue type questions
Dominic, please fix the numbering and references below, then I'll show how these would be implemented in FR/term searches
1. China vessels made of bronze
2. Vessels made of bronze from China
3. Bronze vessels from China
1. Same as 4. , but excluding objects which are prints (2 objects)
2. ceramic objects made between 200 BC and 100 AD - (13517 objects)
3. Topographic representations of Greece (894 objects)
4. Either topographic representations of Greece, or objects where Greece is mentioned in an inscription
5. 4.9 As for 8., but restricted to Greece specifically (i.e. excluding any narrower terms of Greece this time) (19 objects)
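The date-range question above ("made between 200 BC and 100 AD") needs BC and AD years on one numeric scale before ranges can be compared. The following is a minimal Python sketch, assuming astronomical year numbering (1 BC = 0, 2 BC = -1) and assuming that an object matches when its production span overlaps the query span; the function names and the overlap rule are illustrative assumptions, not part of this specification.

```python
def to_astronomical(year, era):
    """Convert a year with an era ("BC" or "AD") to astronomical year numbering."""
    if year < 1:
        raise ValueError("era years start at 1")
    if era == "AD":
        return year
    if era == "BC":
        return 1 - year  # 1 BC -> 0, 200 BC -> -199
    raise ValueError(f"unknown era: {era}")


def overlaps(span, query):
    """True if two inclusive (start, end) year spans share at least one year."""
    return span[0] <= query[1] and query[0] <= span[1]


# The query span for "between 200 BC and 100 AD".
query = (to_astronomical(200, "BC"), to_astronomical(100, "AD"))
```

Whether a partially overlapping object should count as a match (as here) or must fall wholly inside the range is a design decision the search specification would need to state.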
Conservation type questions
NONE of these are covered by FRs since Conservation is not Fundamental (in CIDOC's understanding ;-). Same as the "Exhibited at/on" example: we'd need Event FRs and FR composition
1. Which objects have both been conserved and scientifically examined?
2. Which bronze or brass objects have been conserved in the last year?
3. Which brands of cleaner have been used to treat marble?
4. What is the chronological sequence of treatments applied to this object?
5. What medieval brooches have been examined scientifically since 1980?
Appendix 2 - Authority control in Merlin (Peter Main 19/08/2009)
The Merlin catalogue record contains many fields where data entry is controlled by a thesaurus or by a flat authority file. The thesauri are also used to control searching using the hierarchies (e.g. searching for “vessel” will find records where only “cup” is mentioned). In some cases, more than one field is controlled by the same authority (e.g. Place of manufacture, Findspot and Associated place are all controlled by one Place thesaurus). This document describes what we have in place at present. This does not mean that we are not looking for improvements in the longer term. For example, we find the limitation of two broader terms in the Place thesaurus irksome, and we would prefer it to be polyhierarchical. We would also wish to make use of latitude/longitude information, which we do not currently have.
Comment (on the Merlin thesauri): We have good knowledge of the BM thesauri, since SSL implemented them and we checked them out while working on the offer. [ BM Thesauri ] describes 11 thesauri having 62k terms (45 in Places).
I think BM should strongly consider an effort to match BM Thesauri to more widely used ones. E.g. see [ Thesauri Tools#Cultural Heritage LOD ] concerning places:
- Getty TGN has 89k, RKD has 22k, Rijksmuseum has 11k
- GeoNames has 8M (but these are modern)
- then you have efforts like Lexicon of Greek Names, Google Places, Pleiades…
Comment (on latitude/longitude information):
- GeoNames has that.
- Location uncertainty of ancient places is non-trivial (e.g. we could know from a historic source that place Z is "between X and Y", but not exactly where)
- See "Integration of Coordinate Information in CIDOC CRM": Gerald Hiebel, Øyvind Eide and Mark Fichtner, CRM SIG mlist 11/7/2011
Standard Thesauri
The thesauri in use are as follows:
· Object type (e.g. pin, cup)
· Material (e.g. paper, stone)
· Technique of manufacture (e.g. carved, incised)
· Material Culture/Period (e.g. 13th dynasty, Late Minoan)
· Ware (specialised thesaurus for pottery, e.g. Black Glaze Ware, Samian)
· School (used for artworks, e.g. Italian, Aesthetic Movement)
· Escapement type (specialist thesaurus for clocks and watches)
· Subject (e.g. animal, acupuncture)
· Ethnic Name (e.g. Aztec, Yoruba)
The thesaurus structure is standard, but does not use all fields available in BS standards. We use:
· term
· term discriminator
· broader term(s)
· related term(s)
· use-for terms
· display term
· scope note
· whether the term has been authorised
The thesauri are polyhierarchical (i.e. they allow multiple broader terms)
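Hierarchical searching over a polyhierarchical thesaurus (a search for "vessel" also finding records that mention only "cup") amounts to expanding a term to all of its transitive narrower terms. The following is a minimal Python sketch, assuming broader-term links are available as a simple mapping; apart from the vessel/cup pair, the terms and all function names are hypothetical, and this is not how Merlin itself is implemented.

```python
from collections import defaultdict

# term -> its broader terms; more than one is allowed (polyhierarchy).
# Hypothetical data for illustration only.
BROADER = {
    "cup": ["vessel", "drinking equipment"],
    "goblet": ["cup"],
    "vessel": [],
    "drinking equipment": [],
}


def narrower_index(broader):
    """Invert the broader-term links into broader term -> set of narrower terms."""
    index = defaultdict(set)
    for term, parents in broader.items():
        for parent in parents:
            index[parent].add(term)
    return index


def expand(term, broader=BROADER, include_narrower=True):
    """Return the term plus, optionally, all of its transitive narrower terms."""
    if not include_narrower:
        return {term}
    index = narrower_index(broader)
    found, stack = {term}, [term]
    while stack:
        for child in index.get(stack.pop(), ()):
            if child not in found:  # a term reachable via two parents is visited once
                found.add(child)
                stack.append(child)
    return found
```

Passing `include_narrower=False` corresponds to the Appendix 1 example that is restricted to Greece specifically, excluding any narrower terms.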
Place thesaurus
This thesaurus is for geographical places. Its structure is the same as the standard thesauri except that:
· only two broader terms (up to one each of modern and archaic types) are allowed
· there are two additional fields: Place name type (i.e. modern or archaic) and Place Type (i.e. one of a series of codes distinguishing continents, countries, villages and the like)
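The broader-term restriction of the Place thesaurus can be expressed as a small validation check. This is a minimal Python sketch under the assumption that each broader term is recorded together with its Place name type; the function name, the data shape and the place names in the usage note are hypothetical.

```python
from collections import Counter

NAME_TYPES = ("modern", "archaic")


def place_broader_problems(broader_terms):
    """Check a list of (name, name_type) broader terms against the Place rule.

    The rule: at most one modern and at most one archaic broader term.
    Returns a list of problem descriptions, empty if the list is valid.
    """
    problems = []
    counts = Counter(name_type for _, name_type in broader_terms)
    for name_type, n in counts.items():
        if name_type not in NAME_TYPES:
            problems.append(f"unknown place name type: {name_type}")
        elif n > 1:
            problems.append(f"more than one {name_type} broader term")
    return problems
```

For example, a place with one modern and one archaic broader term passes, while a place with two modern broader terms is reported.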
Flat authorities
As well as a large number of simple drop-down lists, there are two important flat authorities:
Biographical: Used to record information about individuals and institutions. This includes the following fields:
· Name(s), and for each one title, name type, date range name was in use
· Display name
· Life date(s)
· Private address
· Public address
· Gender
· Nationality
· Profession
· School (for artists)
· Biography
· Bibliography
· Copyright details
· Whether the term has been authorised
There is no hierarchy associated with biographical records.
Bibliographical: Used to record information about publications. This does not conform to any of the library standards, and is quite simplistic in comparison.
It contains the following fields:
· Citation
· Title
· Author/Editor(s)
· Collective title
· Series title
· Place of Publication
· Date of Publication
· Journal
· Publisher
Appendix 3 – Search Sentence Process
You describe a combination of all 3 searches (Keyword, Term and FR).
- But is there benefit in combining simple (Keyword+ Term) with complex (FR)?
- Will the simple always come first?
- What's the meaning of e.g. "The Hague [Place] AND fromPlace The Hague" (Term AND FR)?
- if you want to combine all 3, we need a more detailed description of the interaction
Appendix 4 – Timeline and Geographical mapping
DO: In the business document these components are apps in their own right (and I assume should be part of a separate spec at some point), but timeline and geographical representations are built into the search result screen. To what extent should these be described as part of the search specification? Are these, in this specification, just different ways of showing results without much user interaction, but would these also form part of the more developed timeline and geo tools?
VA: I think we should start a separate spec for Result Set Handling.
· Timeline: we do it based on time of creation, which matches the FR "Thing From Event → has_time-span → at_some_time_within". But more needs to be done to accommodate dates specified to different levels of precision; see date vs gYearMonth vs gYear
· Geographic: cannot do it before we have geomapped Place thesauri (see comments in App3)
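For the timeline point above, one way to handle dates of differing precision is to turn each value into an (earliest, latest) interval before plotting. The following is a minimal Python sketch, assuming values arrive as year, year-month or full-date strings with positive four-digit years; the function name is hypothetical, and BC dates or time zones are deliberately not handled.

```python
import calendar
from datetime import date


def to_interval(value):
    """Map a year, year-month or full-date string to (earliest, latest) dates.

    "1642"       -> the whole year
    "1642-07"    -> the whole month
    "1642-07-15" -> that single day
    """
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:
        (year,) = parts
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        return date(year, month, 1), date(year, month, last_day)
    if len(parts) == 3:
        day = date(*parts)
        return day, day
    raise ValueError(f"unsupported date value: {value!r}")
```

A timeline could then plot the interval as a span rather than a point, which keeps a year-only date from being misread as 1 January.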