View Source

{excerpt}Algorithm to fetch the complete data about a Museum Object{excerpt}
(!) Needs to be refactored.

{toc}

h1. Introduction
This is one of the main tasks of the [Entity API] (the other being search, and return of simple search results to Exhibit).
I think a MO cannot be (much) simplified from its CRM graph. We could consider its sub-objects as "entities", but I think it's better to think of the whole MO as an "entity". If/when we start searching/manipulating other things (eg Artists, Auctions, Collections...) we'll beget other entities.

h2. Algorithm Considerations
- walk down from the root MO, do a tree walk, put URIs in a dictionary so you don't repeat them, stop at certain cut-off points, finally collect and generate JSON
- The obvious (only?) way is to use DESCRIBE (get all <S,P,O> for a given S), but the number of queries will be large: the same as the number of nodes in the subgraph
-- We use the Sesame API for local access, which is about 2x faster than remote access
-- (?) Is there a smarter way than DESCRIBE?
- needs a set of CRM relations to follow: [Business Properties]
- breadth or depth-first walking? Does it matter?
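The walk sketched above can be expressed as a simple BFS with a visited set. This is a minimal in-memory sketch, not the actual implementation: the store map, businessProperties set and cutOffNodes set are stand-ins for the real Sesame DESCRIBE calls, the [Business Properties] list and the cut-off rules discussed later.

```java
import java.util.*;

/** Sketch of the MO graph walk: BFS from the root, follow only business
 *  properties, visit each node once, do not expand past cut-off nodes. */
public class GraphWalk {
    // Stand-in for the triple store: subject -> list of (property, object) pairs.
    static Map<String, List<String[]>> store = new HashMap<>();
    // Stand-in for the configured set of CRM relations to follow.
    static Set<String> businessProperties = new HashSet<>();
    // Stand-in for the cut-off test (e.g. "node is a skos:Concept").
    static Set<String> cutOffNodes = new HashSet<>();

    /** Collects all triples reachable from root along business properties. */
    static List<String[]> collect(String root) {
        List<String[]> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();   // don't DESCRIBE a node twice
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        visited.add(root);
        while (!queue.isEmpty()) {
            String node = queue.poll();          // breadth-first; the result set is the same either way
            for (String[] po : store.getOrDefault(node, List.of())) {
                if (!businessProperties.contains(po[0])) continue;
                result.add(new String[]{node, po[0], po[1]});
                // stop at cut-off points (thesaurus entries etc.)
                if (!cutOffNodes.contains(po[1]) && visited.add(po[1]))
                    queue.add(po[1]);
            }
        }
        return result;
    }
}
```

Since every node is described exactly once, the number of queries equals the number of visited nodes, which is why the per-object graph approach below is attractive.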

h2. CRM redundancy
CRM has various redundancies
- inferred relations through sub-property inference
- multi-typed nodes, eg
-- E12_Production of the physical painting (Part1) and E65_Creation of the conceptual Image coincide.
(That holds for an original painting; it won't be the case for a reproduction):
{code}
<obj/2926/1> a crm:E22_Man-Made_Object;
crm:P2_has_type rkd-object:painting; crm:P108i_was_produced_by <obj/2926/1/production>.
<obj/2926/1/image> a crm:E38_Image;
crm:P94i_was_created_by <obj/2926/1/production>.
{code}
-- A collection is both E39 actor (legal body) and E78 collection of physical things ([Collections#Collection's Dual Nature]).
As a result, the events "object entry into/exit from collection" are heavily multi-typed:
{code}
rkd-collection:Mauritshuis a crm:E78_Collection, crm:E39_Actor.
<obj/2926/collection/5/entry> a crm:E79_Part_Addition, crm:E10_Transfer_of_Custody, crm:E8_Acquisition.
{code}
- parallel relations. These express different semantics between the same nodes.
Eg continuing the previous example:
{code}
<obj/2926/collection/5/entry>
crm:P110_augmented rkd-collection:Mauritshuis; # domain E79_Part_Addition
crm:P29_custody_received_by rkd-collection:Mauritshuis; # domain E10_Transfer_of_Custody
crm:P22_transferred_title_to rkd-collection:Mauritshuis; # domain E8_Acquisition
{code}
- longcut (indirect path) vs shortcut (direct relation). Similar to "parallel relations", but one of them is a path, not a simple relation.
Eg (see [Material and Medium-Technique]):
{code}
<obj/2926/1> crm:P45_consists_of rkd-support:panel--oak_wood.
<obj/2926/1> crm:P108i_was_produced_by <obj/2926/1/production>.
<obj/2926/1/production> crm:P126_employed rkd-support:panel--oak_wood.
{code}

These redundancies may easily confuse us (or the user).
- the worst confusion would be to "leak" into fetching unnecessary data
- just as bad would be for the UI to "leak" into repeating the same target node twice
- how will Jana determine which inferred relations to skip?
- eg the best would be to enumerate all parallel relations on one line, but how to implement this?
{noformat}
Entry in collection
added to collection; received custody; received title: Mauritshuis
Date: 1816
{noformat}

h2. Statement Redundancy
RDF statements are stored in OWLIM as quadruples <?s ?p ?o ?g>.
If the same statement is stored in several graphs (one of them may be empty), it will be returned several times by the Sesame API.
- RKD data doesn't use graphs
- BM data uses per-object graphs so the data can be updated easily per-object in the future (delete the old graph, load the new chunk).
{noformat}
<http://collection.britishmuseum.org/id/object/YCA79332/graph> {
# all data of <http://collection.britishmuseum.org/id/object/YCA79332>
}
{noformat}
- in both cases there are plenty of inferred statements that go into context onto:implicit.

The algorithm takes care of this redundancy by TODO Mitac
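One straightforward way to handle this repetition (a sketch under my own assumptions, not necessarily Mitac's implementation) is to key each quad on its <s,p,o> components only, ignoring the graph:

```java
import java.util.*;

/** Sketch: deduplicate quads <s,p,o,g> that differ only in the graph g. */
public class QuadDedup {
    static List<String[]> dedup(List<String[]> quads) {
        Set<String> seen = new LinkedHashSet<>();
        List<String[]> result = new ArrayList<>();
        for (String[] q : quads) {
            // key on subject|predicate|object; q[3] (the graph) is ignored
            String key = q[0] + "|" + q[1] + "|" + q[2];
            if (seen.add(key)) result.add(q);
        }
        return result;
    }
}
```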

h2. Remove Redundant Triples
{jira:RS-265}
The API needs to remove two kinds of redundancies.
These removals are specified below using SPARQL notation, but they are implemented in-memory using Java code.

h3. Remove Inferred Superproperty
Remove inferred superproperties of an explicitly stated subproperty.
- Justification: CompleteMO works in Inferred mode, so it obtains a bunch of superproperties, but the subproperty is more specific.
- Examples:
-- we want to print P48_has_preferred_identifier before the other various identifiers (P1_is_identified_by)
-- bmo:PX_physical_description is the most useful amongst various other notes (P3_has_note), so we want to print it first and maybe with label "Description"
- Definition: remove <?s ?p ?o> whenever this holds:
{noformat}
?s ?p ?o.
?s ?p2 ?o.
?p2 rdfs:subPropertyOf ?p.
{noformat}
- Implementation (Mitac formulated the basic logic):
-- cache the rdfs:subPropertyOf hierarchy
-- TODO Mitac: describe the rest
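The removal can be sketched in-memory as follows. This is only an illustration of the definition above, assuming the cached rdfs:subPropertyOf hierarchy is available as a transitive-closure map; the actual code may differ.

```java
import java.util.*;

/** Sketch: drop <s,p,o> whenever some <s,p2,o> exists with p2 a
 *  (transitive, strict) subproperty of p. Triples are String[]{s,p,o}. */
public class SuperpropRemoval {
    // Cached closure: property -> set of its strict subproperties.
    static Map<String, Set<String>> subproperties = new HashMap<>();

    static List<String[]> remove(List<String[]> triples) {
        List<String[]> result = new ArrayList<>();
        for (String[] t : triples) {
            boolean redundant = false;
            Set<String> subs = subproperties.getOrDefault(t[1], Set.of());
            for (String[] t2 : triples)
                // same subject and object, strictly more specific property
                if (t2[0].equals(t[0]) && t2[2].equals(t[2]) && subs.contains(t2[1])) {
                    redundant = true;
                    break;
                }
            if (!redundant) result.add(t);
        }
        return result;
    }
}
```

With the P48/P1 example: the inferred <obj, P1_is_identified_by, id> is dropped because <obj, P48_has_preferred_identifier, id> is present, while P1 statements with other objects survive.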

h3. Remove Shortcut Property of Association
Remove the direct (shortcut) property if there is a longcut path through EX_Association.
- Justification: in some cases BM data carries extra "association codes" that qualify relations (see [BM Association Mapping]): then EX_Association carries more info (eg type=<bequathed_by>) than the direct property.
-- In other cases only a direct property is present.
-- RForms cannot distinguish between the two cases
- Example (from [BM Association Mapping#Code In Reified Association]):
!BM Association-bequathed_by.png!
- Definition: remove <?s ?p ?o> whenever this holds:
{noformat}
?s ?p ?o.
?s crm:P140i_was_attributed_by ?a.
?a bmo:PX_property ?p.
?a crm:P141_assigned ?o.
{noformat}
- Implementation:
-- TODO Mitac: that's a rather complex case, how to do it?
-- do [#Remove Inferred Superproperty] *before* this removal: Associations are stated at the most specific level, and you don't want to kill the subproperty prematurely
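One possible in-memory shape for this removal (a sketch of the definition above, not Mitac's answer to the open question; property names are written without prefixes and triples are String[]{s,p,o}): first collect all <s,p,o> combinations that are also asserted through an association, then filter the direct triples.

```java
import java.util.*;

/** Sketch: drop a direct <s,p,o> when an association node (reached via
 *  P140i_was_attributed_by) carries the same property p and object o. */
public class ShortcutRemoval {
    static List<String[]> remove(List<String[]> triples) {
        // pass 1: collect (s, p, o) combinations stated through an association
        Set<String> viaAssociation = new HashSet<>();
        for (String[] attr : triples) {
            if (!attr[1].equals("P140i_was_attributed_by")) continue;
            String s = attr[0], a = attr[2];
            String p = null, o = null;
            for (String[] t : triples) {
                if (!t[0].equals(a)) continue;
                if (t[1].equals("PX_property")) p = t[2];
                if (t[1].equals("P141_assigned")) o = t[2];
            }
            if (p != null && o != null) viaAssociation.add(s + "|" + p + "|" + o);
        }
        // pass 2: keep everything except the redundant shortcuts
        List<String[]> result = new ArrayList<>();
        for (String[] t : triples)
            if (!viaAssociation.contains(t[0] + "|" + t[1] + "|" + t[2]))
                result.add(t);
        return result;
    }
}
```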

h2. Configurable Expansion {color:red}(obsolete){color}
Vlado: this is awfully complicated and NOT needed.

The EntityAPI should offer "full" loading of Entities: not only the attributes and relations of the Entity itself, but also the related Entities, their attributes, relations and so on.

This expansion has to be done only for some of the relations. The current working assumption is that these could be specified by their names (URIs). There are two methods called shouldExpandXX():
# based on data about the property (e.g. parts are always expanded; so the property crm:P46_is_composed_of always indicates expansion).
TODO: so by "data" you mean the property name.
# allows for custom expansion rules (expand property crm:P102_has_title when the Entity has E55_Type "Painting").
TODO: think up a better example since this is contrived (see below)
# base the expansion on parsing the JSON templates from RForms.
Vlado: this is an overkill, and who can guarantee these templates will address all and only what is needed?

In the initial implementation there will be only rules of type 1.

h2. Cut-off example
We want to cut off at each thesaurus entry.
# We could cut off its relations by NOT navigating thesaurus-service relations (eg skos:broader, skos:inScheme and equivalent CRM relations) nor inverses (eg P14i_performed)
#- But when Persons get extra data (eg E67_Birth), more relations will be added (eg P98i_was_born), and some may get confused through sub-properties (eg P98i and P108i_was_produced_by are siblings i.e. sub-properties of P92i_was_brought_into_existence_by; and we do use P108i)
# We could cut the node by "rdf:type is skos:Concept" (or E55_Type or E21_Person or E53_Place)
#- But we still need to fetch its rdfs:label(s)!
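The second variant can be sketched as a pair of predicates: cut off by type, but still follow label properties at a cut-off node. The type and label lists below are assumptions illustrating the rule, not the final configuration.

```java
import java.util.*;

/** Sketch of the cut-off rule: a node is a cut-off point when it has a
 *  thesaurus-style type; even then, its labels are still fetched. */
public class CutOff {
    // Assumed list of cut-off types (thesaurus entries, persons, places).
    static final Set<String> CUT_OFF_TYPES =
        Set.of("skos:Concept", "crm:E55_Type", "crm:E21_Person", "crm:E53_Place");
    // Assumed list of label properties that are always kept.
    static final Set<String> LABEL_PROPERTIES = Set.of("rdfs:label", "skos:prefLabel");

    static boolean isCutOff(Set<String> rdfTypes) {
        for (String t : rdfTypes)
            if (CUT_OFF_TYPES.contains(t)) return true;
        return false;
    }

    /** Whether to follow a property from a node, given its cut-off status. */
    static boolean follow(boolean cutOff, String property) {
        return !cutOff || LABEL_PROPERTIES.contains(property);
    }
}
```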

h1. Named Graphs instead of Graph Walk

{info:title=Lesson Learned for RS4}
Overall, I now think the Graph Walk that we do to figure out the extent of an object's graph is wrong.
- It's too brittle: data variations easily break it, and adding data breaks it.
See one of the most insidious leaks: {jira:RS-1375}
- It's expensive both in terms of queries and processing
- It's a complicated algorithm.

The most logical approach is to use the named graph for this (one graph = one object).
{info}

- Dominic said the same on a call (2012-01-25) re the performance of the Graph Walk algorithm
- Josh emits the statements for <object> in named graph <object>/graph.
It's better to use just <object> for the graph URI, since one is not supposed to manipulate URIs (they are supposed to be opaque)
- But things are not so simple, because only the *explicit* statements are in the per-object graph.
The *implicit* (inferred) statements are in the empty (default) graph. They can also be fetched from the special ("magic") onto:implicit graph, which includes no explicit statements.
-- Dominic asked how important inference is for the CompleteMO (i.e. RForm detailed object view). It's used for:
--# P140i (see [#Remove Shortcut Property of Association] above), i.e. to get to all association codes.
--# Fetching some subprops (eg BM ext props) as the base CRM prop, eg:
bmo:PX_curatorial_comment as P3_has_note, bmo:PX_likelihood as P2_has_type, P131_is_identified_by as P1_is_identified_by
-- Vlado talked to the OWLIM team re the feasibility of an enhancement that would put inferred statements in a specific graph (predefined, or dynamically computed from the premises). Damyan explained that this is infeasible, because it can lead to the same ontological statement being asserted in each object's named graph, and for a variety of other technical issues.

Vlado: here's an idea for a 2-step algorithm:
# Get all statements <s,p,o,g> in the object's graph g
# Extract all distinct Nodes from these statements
Note: we also need literals, not just URIs!
# Get all inferred statements <?s, ?p, ?o, onto:implicit> where
?s,?o are in the set extracted above
?p is a business property

TODO: make the above precise:
- should *both* ?s and ?o be in Nodes, or is it enough for *one* of them to be?
- assess how long the lists of Nodes (max 300 per object?) and Business Properties (about 140) are, and what impact that has on the query
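Step 3 could be phrased as a single query with VALUES lists. Here is a sketch of building that query string (the exact query shape, including whether both ?s and ?o must be bound, is still open; the builder restricts both):

```java
import java.util.*;

/** Sketch: build the step-3 query fetching inferred statements from
 *  onto:implicit, restricted to the nodes collected in steps 1-2 and
 *  to the business properties. */
public class ImplicitQueryBuilder {
    static String build(Collection<String> nodes, Collection<String> properties) {
        StringBuilder q = new StringBuilder("SELECT ?s ?p ?o WHERE {\n");
        q.append("  GRAPH onto:implicit { ?s ?p ?o }\n");
        // restrict both ends to the known nodes (open question: one end only?)
        q.append("  VALUES ?s { ").append(String.join(" ", nodes)).append(" }\n");
        q.append("  VALUES ?o { ").append(String.join(" ", nodes)).append(" }\n");
        q.append("  VALUES ?p { ").append(String.join(" ", properties)).append(" }\n");
        q.append("}");
        return q.toString();
    }
}
```

Note that restricting ?o with VALUES would also accommodate literals, per the note in step 2.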

h1. EntityAPI
Input: URI of a Museum Object (MO)
Output: RDF/JSON of its complete data
TODO Mitac: describe the above as a real API

{code:borderStyle=solid}
/**
 * Loads the Entity, its attributes and its relations,
 * as well as some of the related Entities recursively.
 */
public Entity loadEntity(URI uri);

/**
 * Whether to expand the property (i.e. recursively load the dependent Entity),
 * based solely on the property.
 * @param rel the property URI
 * @return true if the property should be expanded
 */
private static boolean shouldExpandProperty(URI rel);

/**
 * Allows for specific expansion rules that depend on the Entity.
 * @param e the Entity being loaded
 * @param rel the property URI
 * @return true if the property should be expanded for this Entity
 */
private boolean shouldExpandPropertyForEntity(Entity e, URI rel);
{code}

The class Entity was extended with the following data structures (and corresponding methods):
- attributes: Map<URI, Literal[]>
- relations: Map<URI, Entity[]>
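The extended Entity can be sketched as below. This is only an illustration of the two maps: lists stand in for the arrays, Literal is simplified to String, and the add* helpers are assumptions, not the actual method names.

```java
import java.net.URI;
import java.util.*;

/** Sketch of the extended Entity: literal attributes and entity-valued
 *  relations, both keyed by property URI and multi-valued. */
public class Entity {
    final URI uri;
    final Map<URI, List<String>> attributes = new HashMap<>(); // Literal simplified to String
    final Map<URI, List<Entity>> relations = new HashMap<>();

    Entity(URI uri) { this.uri = uri; }

    void addAttribute(URI property, String literal) {
        attributes.computeIfAbsent(property, k -> new ArrayList<>()).add(literal);
    }

    void addRelation(URI property, Entity target) {
        relations.computeIfAbsent(property, k -> new ArrayList<>()).add(target);
    }
}
```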

h1. RDF/JSON generation
h2. Use SPARQLResultsJSONWriter
- there is a SPARQLResultsJSONWriter class from Sesame distribution
- its interface is to override the method handleSolution(BindingSet)
- currently it does not implement the check for multiple values for the same property and thus produces incorrect JSON
- fortunately, this is the class used in Forest as well
- after Kosyo fixes it in Forest ([LLD-599@jira]: bad JSON for duplicate subject & property), we can reuse the code
- Jana has currently written a version that uses Forest REST

Considerations:
- (+) This generates the JSON
- (-) depends on the above fix
- (-) may become slower with Forest because we involve an extra component, and there are numerous queries
- (-) Vlado: won't we need to merge the partial JSONs from DESCRIBE of adjacent nodes? Isn't this complicated?

h2. Generate JSON directly
- when an Entity is loaded fully, the JSON could be recursively generated by navigating it
- the json-simple library (http://code.google.com/p/json-simple/) may be useful for serializing web service results to JSON
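The direct generation could simply recurse over the loaded structure. A minimal sketch using nested maps (the output shape and all names are illustrative only, and do not follow the RDF/JSON spec precisely):

```java
import java.util.*;

/** Sketch: recursively render a loaded entity (nested maps) as JSON.
 *  Values are Strings (literals), Lists (multiple values), or Maps
 *  (related entities, rendered recursively). */
public class JsonGen {
    static String toJson(Object value) {
        if (value instanceof Map<?, ?> map) {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<?, ?> e : map.entrySet()) {
                if (!first) sb.append(",");
                first = false;
                sb.append("\"").append(e.getKey()).append("\":").append(toJson(e.getValue()));
            }
            return sb.append("}").toString();
        }
        if (value instanceof List<?> list) {
            StringBuilder sb = new StringBuilder("[");
            for (int i = 0; i < list.size(); i++) {
                if (i > 0) sb.append(",");
                sb.append(toJson(list.get(i)));
            }
            return sb.append("]").toString();
        }
        return "\"" + value + "\"";  // literal; real code must escape quotes etc.
    }
}
```

Because the Entity is loaded fully before serialization, this avoids the merge problem that the per-DESCRIBE JSON approach has.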