
{excerpt}Given RKD field name and text contents, return matching thesaurus URI{excerpt}
{toc}
{attachments:sortBy=name}

Vlado: I think remove attachments, just point to SVN?
Matthew: Do we have these two files in SVN? I can see the logic in putting settings.xml there, but what about the jar file: do we have a place where higher-level users of the API can pick it up?

h4. The Requirement

The RKD data contains about 37 thesaurus-controlled fields, and we have several data files containing the thesaurus data. Both the data files and the thesaurus data files have been encoded as RDF Turtle and ingested into a data set. The data set should be able to resolve the fields to the respective thesaurus items. We know which fields are controlled, and by which thesaurus; this information is encapsulated in [#thes.properties]. So, given a field name and its contents, we need to be able to get the matching thesaurus URI. Some fields have entries in multiple languages and some have just one; the functions described below cater for each case. These functions are part of the larger entity-api.

h4. Vladimir's original notes

Vlado: merge into section below: delete whatever is covered, or is not current anymore (eg for Fake version: document only if there's still a fake lookup for any thesaurus, else delete the whole saga)

Method Variants:
h5. LookupInThesaurusByLabel (String field, String label)
Used for single-label fields, eg
{code}<soort_collectie_verblijfplaats>particuliere collectie</soort_collectie_verblijfplaats>{code}

- Input: XML field name; label (field content).
- Output: prefixed thesaurus URI, e.g. rkd-collection:private-collection
- Driven by a properties file that knows which thesaurus to use for that field, and what prefix to return.
- Used during migration: throws exception unless it finds exactly 1 match.

h5. LookupInThesaurusByLabels (String field, StringWithLang\[\] labels)

Used for multi-label fields (nested <value>), for example
{code}<object.support>
<value lang="en-US" invariant="true">panel (oak)</value>
<value lang="nl-NL" invariant="false">paneel (eikenhout)</value>
</object.support>
{code} or
{code}<doc.size.unit>
<value lang="neutral">CM</value>
<value lang="0">cm</value>
<value lang="1">cm</value>
</doc.size.unit>
{code}
- Input: XML field name; collection of string labels with language. Languages are chopped at "-" (e.g. "en").
- Output: prefixed thesaurus URI, e.g. rkd-support:panel--oak or unit:Centimeter
- Driven by a properties file that knows which thesaurus to use for that field, which language to lookup (uses only 1 from the collection), and what prefix to return.
- Used during migration: throws exception unless it finds exactly 1 match.
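
The language chopping mentioned above can be sketched as follows. This is a minimal illustration, not the entity-api code: the class name LangTags and the method primaryLanguage are hypothetical. How the real code treats non-standard tags such as "neutral", "0" or "1" is not stated here, so the sketch simply passes them through unchanged.

{code:java}
/** Hypothetical helper, for illustration only. */
final class LangTags {
    /** Returns the part of a language tag before the first "-", e.g. "en-US" becomes "en". */
    static String primaryLanguage(String tag) {
        if (tag == null) {
            return "";
        }
        int dash = tag.indexOf('-');
        // Tags without a "-" (e.g. "neutral") are returned unchanged.
        return dash < 0 ? tag : tag.substring(0, dash);
    }
}
{code}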

h5. InteractiveLookupInThesaurusByLabel (String field, String label)

Used by the user through semantic search.
- Output: an array of matching URIs (in RS3.1, looks for exact match only, so returns 0 or 1 URIs)
Otherwise, same as the non-interactive version.

"Fake" version:
* Instead of lookup in thesaurus, return a generated URI (same as is generated by Python during thesaurus migration).
* This way you won't have to wait for RKD to send the thesauri

h4. The Functions

Initially the lookup is faked. As the set "fakeThesSet" is reduced and the data is loaded, the functions should do proper lookups.

h5. getThesaurusURIforLabel
{code}public org.openrdf.model.URI getThesaurusURIforLabel(java.lang.String field,
        java.lang.String label)
        throws java.lang.Exception
{code}
Given a field name and label text, try to find the matching thesaurus URI.
* Parameters:
** field - Field in the main data to use
** label - the text itself
* Returns:
** URI or null in the case of regular failure (this includes finding nothing or finding more than one item)
* Throws:
** java.lang.Exception - in the case of an error

h5. getThesaurusURIforLabels
{code}public org.openrdf.model.URI getThesaurusURIforLabels(java.lang.String field,
        java.lang.String[] labels,
        java.lang.String[] langs)
        throws java.lang.Exception
{code}
Given a field name, labels and their languages, try to find the matching thesaurus URI.
* Parameters:
** field - the field name in the main data
** labels - an array of labels in different languages
** langs - a matching array of languages eg "nl" "en" etc
* Returns:
** URI or null in the case of regular failure (this includes finding nothing or finding more than one item)
* Throws:
** java.lang.Exception - in the case of error
* The language list is needed to allow the correct label to be looked up in the thesaurus.
* There is a question of whether there should be parallel lists or a list of a data type with two members.
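
To make the second option concrete, here is a minimal sketch of a two-member data type. The name StringWithLang comes from Vladimir's notes above, but its fields and the zip helper are assumptions made for illustration; they are not part of the entity-api.

{code:java}
/** Hypothetical two-member type pairing a label with its language. */
final class StringWithLang {
    final String label;
    final String lang;

    StringWithLang(String label, String lang) {
        this.label = label;
        this.lang = lang;
    }

    /** Converts the parallel arrays used by getThesaurusURIforLabels into one array of pairs. */
    static StringWithLang[] zip(String[] labels, String[] langs) {
        if (labels.length != langs.length) {
            throw new IllegalArgumentException("labels and langs must have the same length");
        }
        StringWithLang[] pairs = new StringWithLang[labels.length];
        for (int i = 0; i < labels.length; i++) {
            pairs[i] = new StringWithLang(labels[i], langs[i]);
        }
        return pairs;
    }
}
{code}

With a single array of pairs, a label can never be separated from its language, whereas parallel arrays rely on the caller keeping both in step.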

h5. interactiveLookupInThesaurusByLabel
{code}
public java.util.List<org.openrdf.model.URI> interactiveLookupInThesaurusByLabel(java.lang.String field,
        java.lang.String label)
        throws java.lang.Exception
{code}
Given a field name and label text, try to find the matching thesaurus URIs.
* Parameters:
** field - Field in the main data to use
** label - the text itself
* Returns:
** List of URIs; an empty list if nothing is found
** TODO: should return a list of tuples: URI and the label/note that matches the text search. Luckily we don't need more info (e.g. even for places we don't need the Broader term, since the place note mentions the super-place).
* Throws:
** java.lang.Exception - in the case of an error

h4. thes.properties
[https://svn.ontotext.com/svn/researchspace/trunk/entity-api/src/resources/thes.properties]
The lookup functions use a properties file to decide which thesaurus to use for each field and which form of query to build. It is a standard Java properties file, i.e. a list of key=value statements.

{noformat}
# field.name=thesaurus,lang,label|desc # thesaurus-file.ttl
file.spec.front_back=rkd-area_captured,en,label # thesauri.ttl
{noformat}

The key is the field name that is passed to the lookup functions. The value is a comma-separated list of 3 items:
# The thesaurus name to use.
# The language in the form "nl", "en", or blank if not used in thesaurus
# "desc" or "label", which decides on the form of the SPARQL query to make.
#- label: ?id rdfs:label ?label
#- desc: ?id crm:P1_is_identified_by ?name. ?name crm:P3_has_note ?label

As in standard Java properties files, a line starting with # is treated as a comment. In addition, comments can be added inline: anything after a # character on a line is ignored.
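
The format described in this section could be parsed roughly as below. This is a sketch under assumptions, not the entity-api implementation: the class ThesConfig and its members are hypothetical. Note that java.util.Properties only treats # as a comment marker at the start of a line, so the inline comment has to be stripped from the value by hand.

{code:java}
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical holder for one thes.properties entry. */
final class ThesConfig {
    final String thesaurus; // e.g. "rkd-area_captured"
    final String lang;      // e.g. "en", or "" if the thesaurus has no language tags
    final String form;      // which form of SPARQL query to build

    ThesConfig(String thesaurus, String lang, String form) {
        this.thesaurus = thesaurus;
        this.lang = lang;
        this.form = form;
    }

    /** Loads properties text; lines starting with # are skipped by Properties itself. */
    static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    /** Returns the configuration for a field, or null if the field is not listed. */
    static ThesConfig lookup(Properties props, String field) {
        String raw = props.getProperty(field);
        if (raw == null) {
            return null;
        }
        // Strip the inline comment: everything from the first # onwards.
        int hash = raw.indexOf('#');
        if (hash >= 0) {
            raw = raw.substring(0, hash);
        }
        // Limit -1 keeps empty items, so a blank language survives the split.
        String[] parts = raw.trim().split(",", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated items for " + field);
        }
        return new ThesConfig(parts[0].trim(), parts[1].trim(), parts[2].trim());
    }
}
{code}

For the sample entry above, lookup(props, "file.spec.front_back") would yield the thesaurus "rkd-area_captured", the language "en" and the form "label".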

h4. The SPARQL query used for lookup

Both functions generate one of two forms of query, because the thesaurus data is held in two different general forms. The thes.properties file allows the code to distinguish between the forms and do the right thing.
For instance, given the call
{code}getThesaurusURIforLabel("vorm","staande rechthoek");{code}
- "vorm" is looked up in the thes.properties configuration
- the result "rkd-shape,nl,label" is found, which means: use the rkd-shape thesaurus, the nl language, and the label form of query.
- So the query looks like this
{code}
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX rkd-shape: <http://rkd.nl/thesaurus/shape/>
select ?id where {
  ?id skos:inScheme rkd-shape: .
  ?id rdfs:label "staande rechthoek"@nl .
}
{code}

While given the call
{code}getThesaurusURIforLabel("plaats","Eerste Egelantiersdwarsstraat (Amsterdam)");{code}

- "plaats" is looked up in the thes.properties configuration
- the result "rkd-plaats,nl,note" is found, which means: use the rkd-plaats thesaurus, the nl language, and the note form of query (the crm:P3_has_note pattern).
- So the query looks like this
{code}
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX rkd-plaats: <http://rkd.nl/thesaurus/plaats/>
select ?id where {
  ?id crm:P1_is_identified_by ?id2 .
  ?id2 crm:P3_has_note "Eerste Egelantiersdwarsstraat (Amsterdam)"@nl .
  ?id skos:inScheme rkd-plaats: .
}
{code}
and returns the URI [http://rkd.nl/thesaurus/plaats/eerste-egelantiersdwarsstraat-amsterdam]

Implementation Note: The choice of query types is controlled by a map of string templates that contain the tokens to replace for each sort of query.
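
A map of string templates of this kind might look like the sketch below. It is an illustration only: the class QueryTemplates, the placeholder tokens and the build method are hypothetical, and the PREFIX lines are left out for brevity. Because this page uses both "desc" and "note" as the name of the second form, the sketch registers the same template under both names.

{code:java}
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of template-driven query building; not the entity-api code. */
final class QueryTemplates {
    private static final Map<String, String> TEMPLATES = new HashMap<>();
    static {
        String label =
              "select ?id where {\n"
            + "  ?id skos:inScheme {THES}: .\n"
            + "  ?id rdfs:label \"{LABEL}\"{LANGTAG} .\n"
            + "}";
        String note =
              "select ?id where {\n"
            + "  ?id crm:P1_is_identified_by ?id2 .\n"
            + "  ?id2 crm:P3_has_note \"{LABEL}\"{LANGTAG} .\n"
            + "  ?id skos:inScheme {THES}: .\n"
            + "}";
        TEMPLATES.put("label", label);
        TEMPLATES.put("desc", note); // name used in the thes.properties description
        TEMPLATES.put("note", note); // name used in the "plaats" example
    }

    static String build(String form, String thesaurus, String lang, String label) {
        String template = TEMPLATES.get(form);
        if (template == null) {
            throw new IllegalArgumentException("Unknown query form: " + form);
        }
        // Escape backslashes and double quotes in the label text.
        String escaped = label.replace("\\", "\\\\").replace("\"", "\\\"");
        // A blank language means the literal carries no language tag.
        String langTag = lang.isEmpty() ? "" : "@" + lang;
        // Substitute the label last so its text is never scanned for tokens.
        return template.replace("{THES}", thesaurus)
                       .replace("{LANGTAG}", langTag)
                       .replace("{LABEL}", escaped);
    }
}
{code}

Calling QueryTemplates.build("label", "rkd-shape", "nl", "staande rechthoek") produces the body of the first query shown above.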

h4. Refactoring

* The above functions and their private support functions have been moved into a separate class "Thesauri", which is now a superclass of "EntityAPIImpl".

h4. Questions

h5. Exceptions vs null returns

* After some discussion I decided to use exceptions only in true error cases and to have the caller check for a null return in the case of search failure. Thus, if an exception occurs, something in the configuration/network/logic has truly failed, whereas it is quite usual for a search to turn up no results.
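
From the caller's side, this convention might be handled as in the sketch below. The LabelLookup interface and the describe helper are hypothetical stand-ins, and the result is a plain String rather than org.openrdf.model.URI so that the example has no dependencies.

{code:java}
/** Hypothetical caller-side sketch of the null-versus-exception convention. */
final class LookupReport {
    /** Stand-in for a lookup such as getThesaurusURIforLabel. */
    interface LabelLookup {
        String find(String field, String label) throws Exception;
    }

    static String describe(LabelLookup lookup, String field, String label) {
        try {
            String uri = lookup.find(field, label);
            // null is the ordinary "nothing found, or more than one found" case.
            return uri != null ? uri : field + " -> " + label + " = no result";
        } catch (Exception e) {
            // An exception means configuration, network or logic has really failed.
            return "lookup failed: " + e.getMessage();
        }
    }
}
{code}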

h5. The form of the returned URI

* At present the URIs are returned in full http form ([http://rkd.nl/thesaurus/shape/vertical-rectangle]), obtained by walking through the result of an evaluateTupleQuery(query) call. I have the code in the class to change that to a form more like the one described in Vladimir's notes (rkd-shape:vertical-rectangle).
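
The conversion from the http form to the prefixed form could be sketched as follows. This is not the code in the Thesauri class: UriShortener and its methods are hypothetical, and the only namespace used in the usage note is the rkd-shape one shown in the query section of this page.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: turns a full URI into prefix:local-name form. */
final class UriShortener {
    // Maps namespace to prefix, in registration order.
    private final Map<String, String> prefixByNamespace = new LinkedHashMap<>();

    void register(String prefix, String namespace) {
        prefixByNamespace.put(namespace, prefix);
    }

    /** Returns the prefixed form, or the URI unchanged if no namespace matches. */
    String shorten(String uri) {
        for (Map.Entry<String, String> e : prefixByNamespace.entrySet()) {
            String namespace = e.getKey();
            if (uri.startsWith(namespace)) {
                return e.getValue() + ":" + uri.substring(namespace.length());
            }
        }
        return uri;
    }
}
{code}

After register("rkd-shape", "http://rkd.nl/thesaurus/shape/"), shortening the URI above gives rkd-shape:vertical-rectangle.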

h4. Running the code

* Get the entity-api jar from [https://confluence.ontotext.com/download/attachments/9509538/entity-api-0.1-SNAPSHOT.jar] or build it as shown below.
* From the command line under Linux, assuming you have done an svn checkout of trunk and used mvn to resolve dependencies (put settings.xml in your \~/.m2 folder and run mvn dependency:copy-dependencies):
* {code}
CLASSPATH="";for i in `find ~/.m2 -name '*.jar'`; do CLASSPATH=$CLASSPATH":"$i; done
java -cp $CLASSPATH:target/entity-api-0.1-SNAPSHOT.jar com.ontotext.rs.model.impl.Thesauri{code}

h4. Building the code

* Pre-requisites
** Java SDK
** Maven
*** Put [https://confluence.ontotext.com/download/attachments/9509538/settings.xml] in your \~/.m2/ folder
** Subversion
* {code}
svn co https://svn.ontotext.com/svn/researchspace/trunk
cd trunk/entity-api
mvn dependency:copy-dependencies
mvn -DargLine="-Dowlim-license=resources/OWLIM_SE.license" package{code}

h4. Using the code

* The Thesauri class constructor should never be called directly; use the BackendFactory to obtain an instance. The entity-api project uses the factory design pattern ([http://en.wikipedia.org/wiki/Factory_design_pattern]). The BackendFactory performs the various initializations and connects to the local or remote repository.
* I have altered Thesauri.java so that the constructor is "protected" and thus not generally visible.
* {code}
BackendFactory bf = BackendFactory.getInstance(
        RSConstants.REMOTE_REPOSITORY, RSConstants.REMOTE_USER,
        RSConstants.REMOTE_PASS);
EntityAPI ea = bf.getEntityAPI();
Thesauri thesauri = (Thesauri) ea;{code}

h4. Main Function

* The main function of this class connects to the remote server and runs a couple of queries to demonstrate the use of the functions.

{code}
public static void main(String[] args) throws Exception {
    BackendFactory bf = BackendFactory.getInstance(
            RSConstants.REMOTE_REPOSITORY, RSConstants.REMOTE_USER,
            RSConstants.REMOTE_PASS);

    EntityAPI ea = bf.getEntityAPI();

    URI ux = ((Thesauri) ea).getThesaurusURIforLabel("plaats", "Eerste Egelantiersdwarsstraat (Amsterdam)");
    if (ux != null) System.out.println(ux.toString());
    else System.out.println("plaats ->Eerste Egelantiersdwarsstraat (Amsterdam)= no result");

    ux = ((Thesauri) ea).getThesaurusURIforLabel("vorm", "staande rechthoek");
    if (ux != null) System.out.println(ux.toString());
    else System.out.println("vorm -> staande rechthoek = no result");

    ux = ((Thesauri) ea).getThesaurusURIforLabels("object.support", new String[]{"leather", "leer"}, new String[]{"en", "nl"});
    if (ux != null) System.out.println(ux.toString());
    else System.out.println("object.support -> leather, leer : en, nl = no result");

    ux = ((Thesauri) ea).getThesaurusURIforLabel("soort_collectie_verblijfplaats", "particuliere_collectie");
    if (ux != null) System.out.println(ux.toString());
    else System.out.println("soort_collectie_verblijfplaats -> particuliere_collectie =no result");

    ux = ((Thesauri) ea).getThesaurusURIforLabel("soort_collectie_verblijfplaats", "kunsthandel of particuliere collectie");
    if (ux != null) System.out.println(ux.toString());
    else System.out.println("soort_collectie_verblijfplaats -> kunsthandel of particuliere collectie =no result");

    List<URI> us = ((Thesauri) ea).interactiveLookupInThesaurusByLabel("vorm", "staande rechthoek");
    if (us.size() > 0) System.out.println(us.get(0).toString());
    else System.out.println("vorm -> staande rechthoek = no result");
}{code}
which should give this result:
{code}
http://rkd.nl/thesaurus/plaats/eerste-egelantiersdwarsstraat-amsterdam
http://rkd.nl/thesaurus/shape/vertical-rectangle
http://rkd.nl/thesaurus/support/leather
soort_collectie_verblijfplaats -> particuliere_collectie =no result
http://rkd.nl/thesaurus/type_where/art-dealer-or-private-collection
http://rkd.nl/thesaurus/shape/vertical-rectangle
{code}