
Given RKD field name and text contents, return matching thesaurus URI

Name Size Creator Creation Date Comment  
Java Source entity-api-0.1-SNAPSHOT.jar 66 kB Matthew Carey Dec 16, 2011 14:26 Snapshot of Entity API build  
XML File settings.xml 2 kB Vladimir Alexiev Jul 09, 2013 17:52 File to put in ~/.m2/ to allow Maven to resolve dependencies  

Vlado: I think remove attachments, just point to SVN?
Matthew: Do we have these two files in SVN? I can see the logic in putting settings.xml there, but for the jar file, do we have a place where higher-level users of the API can pick it up?

The Requirement

The RKD data contains about 37 thesaurus-controlled fields, and several data files contain the thesaurus data. Both the data files and the thesaurus files have been encoded as RDF Turtle and ingested into a data set, which should be able to resolve the fields to their respective thesaurus items. We know which fields are controlled, and by which thesaurus; this information is encapsulated in thes.properties. So, given a field name and its contents, we need to be able to get the matching thesaurus URI. Some fields have entries in multiple languages and some in just one; the functions described below cater for each case. These functions are part of the larger entity-api.

Vladimir's original notes

Vlado: merge into section below: delete whatever is covered, or is not current anymore (eg for Fake version: document only if there's still a fake lookup for any thesaurus, else delete the whole saga)

Method Variants:
LookupInThesaurusByLabel (String field, String label)
Used for single-label fields, eg

  • Input: XML field name; label (field content).
  • Output: prefixed thesaurus URI, e.g. rkd-collection:private-collection
  • Driven by a properties file that knows which thesaurus to use for that field, and what prefix to return.
  • Used during migration: throws exception unless it finds exactly 1 match.
LookupInThesaurusByLabels (String field, StringWithLang[] labels)

Used for multi-label fields (nested <value>), for example

or

  • Input: XML field name; collection of string labels with language. Languages are chopped at "-" (e.g. "en").
  • Output: prefixed thesaurus URI, e.g. rkd-support:panel--oak or unit:Centimeter
  • Driven by a properties file that knows which thesaurus to use for that field, which language to lookup (uses only 1 from the collection), and what prefix to return.
  • Used during migration: throws exception unless it finds exactly 1 match.
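
The language handling described above can be sketched as follows. This is only an illustration: the StringWithLang class here is a hypothetical stand-in for the real type, the helper names baseLang and pick are invented, and the example labels are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the StringWithLang type named above. */
final class StringWithLang {
    final String text;
    final String lang;

    StringWithLang(String text, String lang) {
        this.text = text;
        this.lang = lang;
    }

    /** Chops a language tag at the first "-", so "en-GB" becomes "en". */
    static String baseLang(String lang) {
        if (lang == null) {
            return "";
        }
        int dash = lang.indexOf('-');
        return dash < 0 ? lang : lang.substring(0, dash);
    }

    /** Returns the first label whose chopped language equals the configured one. */
    static Optional<String> pick(List<StringWithLang> labels, String configuredLang) {
        for (StringWithLang l : labels) {
            if (baseLang(l.lang).equals(configuredLang)) {
                return Optional.of(l.text);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up labels; only one of them is used for the lookup.
        List<StringWithLang> labels = Arrays.asList(
                new StringWithLang("paneel", "nl"),
                new StringWithLang("panel", "en-GB"));
        System.out.println(pick(labels, "en").orElse("(no label in that language)"));
    }
}
```

Only the label in the configured language would then be passed on to the thesaurus query; the others are ignored.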
InteractiveLookupInThesaurusByLabel (String field, String label)

Used by the user through semantic search.

  • Output: an array of matching URIs (in RS3.1, looks for exact match only, so returns 0 or 1 URIs)
    Otherwise, same as the non-interactive version.

"Fake" version:

  • Instead of lookup in thesaurus, return a generated URI (same as is generated by Python during thesaurus migration).
  • This way you won't have to wait for RKD to send the thesauri

The Functions

Initially the lookup is faked, but as the set "fakeThesSet" is reduced and the thesaurus data is loaded, it should do proper lookups.

getThesaurusURIforLabel

Given a field name and label text, try to find the matching thesaurus URI.

  • Parameters:
    • field - Field in the main data to use
    • label - the text itself
  • Returns:
    • URI or null in the case of regular failure (this includes finding nothing or finding more than one item)
  • Throws:
    • java.lang.Exception - in the case of an error
getThesaurusURIforLabels

Given a field name, labels and their languages, try to find the matching thesaurus URI.

  • Parameters:
    • field - the field name in the main data
    • labels - an array of labels in different languages
    • langs - a matching array of languages eg "nl" "en" etc
  • Returns:
    • URI or null in the case of regular failure (this includes finding nothing or finding more than one item)
  • Throws:
    • java.lang.Exception - in the case of error
  • The language list is needed to allow the correct label to be looked up in the thesaurus.
  • There is a question of whether there should be parallel lists or a list of a data type with two members.
interactiveLookupInThesaurusByLabel

Given a field name and label text, try to find the matching thesaurus URIs.

  • Parameters:
    • field - Field in the main data to use
    • label - the text itself
  • Returns:
    • List of URI
      TODO: should return a list of tuples: uri and the label/note that matches the text search
      Luckily we don't need more info (e.g. even for places we don't need the Broader term, since the place note mentions the super-place).
  • Throws:
    • java.lang.Exception - in the case of an error
    • If it finds nothing, returns an empty list
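
The return conventions of the functions above can be illustrated with a small self-contained sketch. It is not the entity-api implementation: the class name ThesaurusLookupSketch and its add method are invented, an in-memory map stands in for the repository, and the label "vertical rectangle" is a made-up example. Only the method names and the null / empty-list / exception behaviour are taken from the descriptions above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the documented contract; a map stands in for the real repository. */
final class ThesaurusLookupSketch {
    // Hypothetical index: "field|label" -> matching thesaurus URIs.
    private final Map<String, List<String>> index = new HashMap<>();

    void add(String field, String label, String uri) {
        index.computeIfAbsent(field + "|" + label, k -> new ArrayList<>()).add(uri);
    }

    private List<String> find(String field, String label) throws Exception {
        if (field == null || label == null) {
            // Stands in for a true error (configuration, network, logic).
            throw new Exception("field and label are required");
        }
        return index.getOrDefault(field + "|" + label, Collections.<String>emptyList());
    }

    /** Returns the single matching URI, or null for zero or several matches. */
    String getThesaurusURIforLabel(String field, String label) throws Exception {
        List<String> hits = find(field, label);
        return hits.size() == 1 ? hits.get(0) : null;
    }

    /** Returns all matching URIs; an empty list means nothing was found. */
    List<String> interactiveLookupInThesaurusByLabel(String field, String label) throws Exception {
        return new ArrayList<>(find(field, label));
    }

    public static void main(String[] args) throws Exception {
        ThesaurusLookupSketch thes = new ThesaurusLookupSketch();
        thes.add("vorm", "vertical rectangle", "http://rkd.nl/thesaurus/shape/vertical-rectangle");

        String uri = thes.getThesaurusURIforLabel("vorm", "no such label");
        if (uri == null) {
            // A regular search failure: the caller checks for null, no exception.
            System.out.println("no unique match");
        }
        System.out.println(thes.interactiveLookupInThesaurusByLabel("vorm", "vertical rectangle"));
    }
}
```

The caller therefore needs a null check for the single-URI functions and an emptiness check for the interactive one, and reserves try/catch for genuine failures.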

thes.properties

https://svn.ontotext.com/svn/researchspace/trunk/entity-api/src/resources/thes.properties
The lookup functions use data from a properties file to decide on the form of the query and which thesaurus to use for which field. The file is a standard Java properties file, that is, a list of key=value statements.

# field.name=thesaurus,lang,label|desc            # thesaurus-file.ttl
file.spec.front_back=rkd-area_captured,en,label   # thesauri.ttl

The key is the field name that is passed to the lookup functions. The value is a comma-separated list of 3 items:

  1. The thesaurus name to use.
  2. The language in the form "nl", "en", or blank if not used in thesaurus
  3. "desc" or "label", which decides on the form of the SPARQL query to make.
    • label: ?id rdfs:label ?label
    • desc: ?id crm:P1_is_identified_by ?name. ?name crm:P3_has_note ?label

As in standard Java properties files, a line starting with # is treated as a comment. Here, comments can also be added inline: anything after a # character on a line is ignored.
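
A minimal sketch of how one such line could be parsed, including the inline-comment rule, is shown here. The class ThesPropertiesLine and its field names are hypothetical and are not taken from the entity-api source; the real code may well load the file differently.

```java
/** Hypothetical holder for one parsed thes.properties entry. */
final class ThesPropertiesLine {
    final String field;
    final String thesaurus;
    final String lang;       // may be empty if the thesaurus has no language tags
    final String queryForm;  // third item of the value, e.g. "label" or "desc"

    ThesPropertiesLine(String field, String thesaurus, String lang, String queryForm) {
        this.field = field;
        this.thesaurus = thesaurus;
        this.lang = lang;
        this.queryForm = queryForm;
    }

    /** Parses one line; returns null for blank or comment-only lines. */
    static ThesPropertiesLine parse(String line) {
        // Everything after the first '#' is a comment, also inline.
        int hash = line.indexOf('#');
        String content = (hash < 0 ? line : line.substring(0, hash)).trim();
        if (content.isEmpty()) {
            return null;
        }
        int eq = content.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Missing '=' in: " + line);
        }
        String field = content.substring(0, eq).trim();
        // Limit -1 keeps empty items, so a blank language still yields 3 parts.
        String[] parts = content.substring(eq + 1).split(",", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated items in: " + line);
        }
        return new ThesPropertiesLine(field, parts[0].trim(), parts[1].trim(), parts[2].trim());
    }

    public static void main(String[] args) {
        ThesPropertiesLine e =
                parse("file.spec.front_back=rkd-area_captured,en,label   # thesauri.ttl");
        System.out.println(e.field + " -> " + e.thesaurus + " / " + e.lang + " / " + e.queryForm);
    }
}
```

Note that java.util.Properties on its own does not strip inline comments, so some pre-processing of this kind is needed if the file is loaded with the standard class.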

The SPARQL Query used to lookup

Both functions generate one of two forms of query, because the thesaurus data is held in two different general forms. The thes.properties file allows the code to distinguish between the forms and do the right thing.
For instance given the parameters

  • "vorm" is looked up in the thes.properties configuration
  • the result "rkd-shape,nl,label" is found, which means: use the rkd-shape type, the nl language, and the label form of query.
  • So the query looks like this

While given the parameters

  • "plaats" is looked up in the thes.properties configuration
  • the result "plaats=rkd-plaats,nl,note" is found, which means: use the rkd-plaats type, the nl language, and the note form of query.
  • So the query looks like this

    and returns the URI http://rkd.nl/thesaurus/plaats/eerste-egelantiersdwarsstraat-amsterdam

Implementation Note: The choice of query types is controlled by a map of string templates that contain the tokens to replace for each sort of query.
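
A minimal sketch of such a template map follows. The triple patterns come from the thes.properties description above; everything else is an assumption for illustration: the class name QueryTemplates, the {LABEL} token, the SELECT wrapper, and the example label "rond". Prefix declarations and the restriction to one thesaurus are omitted, so this is not the query the real code produces.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a query-form -> template map with token replacement. */
final class QueryTemplates {
    // {LABEL} is a hypothetical token name chosen for this sketch.
    static final Map<String, String> TEMPLATES = new HashMap<>();
    static {
        TEMPLATES.put("label",
                "SELECT ?id WHERE { ?id rdfs:label {LABEL} }");
        TEMPLATES.put("desc",
                "SELECT ?id WHERE { ?id crm:P1_is_identified_by ?name . ?name crm:P3_has_note {LABEL} }");
    }

    /** Fills the template for the given form with a (language-tagged) literal. */
    static String build(String form, String label, String lang) {
        String template = TEMPLATES.get(form);
        if (template == null) {
            throw new IllegalArgumentException("Unknown query form: " + form);
        }
        // Escape backslashes and quotes so the label is a valid SPARQL string literal.
        String escaped = label.replace("\\", "\\\\").replace("\"", "\\\"");
        String literal = "\"" + escaped + "\"" + (lang.isEmpty() ? "" : "@" + lang);
        return template.replace("{LABEL}", literal);
    }

    public static void main(String[] args) {
        System.out.println(build("label", "rond", "nl"));
        System.out.println(build("desc", "rond", ""));
    }
}
```

Keeping the forms in a map means that supporting a further data layout only requires a new template entry and a matching keyword in thes.properties.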

Refactoring

  • The above functions and their private support functions have been moved into a separate class "Thesauri", which is now a superclass of "EntityAPIImpl".

Questions

Exceptions vs null returns
  • After some discussion I decided to use exceptions only in true error cases and to have the caller check for a null return when a search fails. Thus, if an exception occurs, something in the configuration, network or logic has truly failed, whereas it is quite usual for a search to turn up no results.
The form of the returned URI
  • At present the URIs are returned in HTTP form (http://rkd.nl/thesaurus/shape/vertical-rectangle), obtained by walking through the result of an evaluateTupleQuery(query) call. I have the code in the class to change that to a form more like the one described in Vladimir's notes (rkd-shape:vertical-rectangle).
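
The conversion between the two URI forms could look roughly like this. The sketch assumes that every thesaurus lives under http://rkd.nl/thesaurus/<name>/ and maps to the prefix rkd-<name>, which matches the two examples on this page but is not confirmed for all thesauri (the real prefixes come from the properties file). The class name UriShortener is invented.

```java
/** Sketch: turns a full RKD thesaurus URI into a prefixed form. */
final class UriShortener {
    // Assumed common base of all RKD thesaurus URIs.
    static final String BASE = "http://rkd.nl/thesaurus/";

    /** Returns e.g. "rkd-shape:vertical-rectangle", or the input if it does not fit the pattern. */
    static String toPrefixed(String uri) {
        if (!uri.startsWith(BASE)) {
            return uri;
        }
        String rest = uri.substring(BASE.length());
        int slash = rest.indexOf('/');
        // Need a non-empty thesaurus name and a non-empty local name.
        if (slash <= 0 || slash == rest.length() - 1) {
            return uri;
        }
        return "rkd-" + rest.substring(0, slash) + ":" + rest.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(toPrefixed("http://rkd.nl/thesaurus/shape/vertical-rectangle"));
    }
}
```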

Running the code

  • Get the entity-api jar from https://confluence.ontotext.com/download/attachments/9509538/entity-api-0.1-SNAPSHOT.jar or build it as shown below.
  • Run it from the command line under Linux, assuming you have done an svn checkout of trunk and used mvn to resolve dependencies (put settings.xml in your ~/.m2 folder and run mvn dependency:copy-dependencies).

Building the code

Using the code

  • The Thesauri class constructor should never be called directly; use the BackendFactory to create an instance. The entity-api project uses the "Factory Design Pattern" (http://en.wikipedia.org/wiki/Factory_design_pattern). The BackendFactory does the various initializations, connects to the local or remote repository, and so on.
  • I have altered Thesauri.java so that the constructor is "protected" and thus not generally visible.

Main Function

  • The main function of this class connects to the remote server and runs a couple of queries to demonstrate the use of the functions.

which should give this result:
