Algorithm to fetch the complete data about a Museum Object
Needs to be refactored.

Introduction

This is one of the main tasks of the Entity API (the other being search, and return of simple search results to Exhibit).
I think an MO cannot be (much) simplified from its CRM graph. We could consider its sub-objects as "entities", but I think it's better to think of the whole MO as an "entity". If/when we start searching/manipulating other things (eg Artists, Auctions, Collections...) we'll introduce other entities.

Algorithm Considerations

  • walk down from the root MO: do a tree walk, put URIs in a dictionary so you don't repeat them, stop at certain cut-off points, and finally collect the statements and generate JSON (see the sketch after this list)
  • The obvious (only?) way is to use DESCRIBE (get all <S,P,O> for a given S), but the number of queries will be large: same as number of nodes in the subgraph
    • We use the Sesame API for local access, which is about 2x faster than remote access
    • Is there a smarter way than DESCRIBE?
  • needs a set of CRM relations to follow: Business Properties
  • breadth-first or depth-first walking? Does it matter?
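
A minimal sketch of the walk described above, assuming the Sesame RepositoryConnection API and a precomputed set of business properties to follow. Class and method names (GraphWalk, fetchCompleteMO) are illustrative, and cut-off handling is omitted:

    import org.openrdf.model.*;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.RepositoryException;
    import org.openrdf.repository.RepositoryResult;
    import java.util.*;

    public class GraphWalk {
        public static List<Statement> fetchCompleteMO(RepositoryConnection conn, URI root,
                                                      Set<URI> businessProperties)
                throws RepositoryException {
            List<Statement> collected = new ArrayList<Statement>();
            Set<Resource> visited = new HashSet<Resource>();   // the "dictionary" of seen URIs
            Queue<Resource> queue = new LinkedList<Resource>(); // breadth-first frontier
            queue.add(root);
            visited.add(root);
            while (!queue.isEmpty()) {
                Resource node = queue.poll();
                // "DESCRIBE" the node: all <S,P,O> with S = node, inferred statements included
                RepositoryResult<Statement> stmts = conn.getStatements(node, null, null, true);
                try {
                    while (stmts.hasNext()) {
                        Statement st = stmts.next();
                        if (!businessProperties.contains(st.getPredicate())) continue;
                        collected.add(st);
                        // follow the relation, but never revisit a node (cut-off checks go here too)
                        if (st.getObject() instanceof Resource && visited.add((Resource) st.getObject())) {
                            queue.add((Resource) st.getObject());
                        }
                    }
                } finally {
                    stmts.close();
                }
            }
            return collected;
        }
    }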

CRM redundancy

CRM has various redundancies

  • inferred relations through sub-property inference
  • multi-typed nodes, eg
    • E12_Production of the physical painting (Part1) and E65_Creation of the conceptual Image coincide
      (that holds for an original painting; it won't be the case for a reproduction).
    • A collection is both E39 Actor (legal body) and E78 Collection of physical things (Collections#Collection's Dual Nature).
      As a result, the events "object entry in/exit from collection" are heavily multi-typed.
  • parallel relations. These express different semantics between the same pair of nodes
    (eg continuing the previous example).
  • longcut (indirect path) vs shortcut (direct relation). Similar to "parallel relations", but one of them is a path rather than a simple relation
    (eg see Material and Medium-Technique).

These may easily confuse us (or the user).

  • the worst confusion will be to "leak" into fetching unnecessary data
  • as bad will be for the UI to "leak" into repeating the same target node twice
  • how will Jana determine which inferred relations to skip?
  • eg the best would be to enumerate all parallel relations on one line, but how to implement this?
      Entry in collection
        added to collection; received custody; received title: Mauritshuis
        Date: 1816
    

Statement Redundancy

RDF statements are stored in OWLIM as quadruples <?s ?p ?o ?g>.
If the same statement is stored in several graphs (one of them may be empty), it will be returned several times by the Sesame API.

  • RKD data doesn't use graphs
  • BM data uses per-object graphs so the data can be updated easily per-object in the future (delete the old graph, load the new chunk).
    <http://collection.britishmuseum.org/id/object/YCA79332/graph> {
       # all data of <http://collection.britishmuseum.org/id/object/YCA79332>
    }
    
  • in both cases there are plenty of inferred statements that go into context onto:implicit.

The algorithm takes care of this redundancy by TODO Mitac
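
Until that TODO is filled in, here is one possible way to collapse the duplicates (a sketch, not necessarily what the code actually does): key each statement by its <s,p,o> triple and ignore the graph component. Class and method names are illustrative:

    import org.openrdf.model.Literal;
    import org.openrdf.model.Statement;
    import java.util.Collection;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TripleDeduplicator {
        /** Keep one copy of each <s,p,o>, no matter how many graphs it appears in. */
        public static Collection<Statement> dropDuplicates(Iterable<Statement> statements) {
            Map<String, Statement> unique = new LinkedHashMap<String, Statement>();
            for (Statement st : statements) {
                // the graph (context) is deliberately left out of the key;
                // the object is tagged so a literal never collides with a URI of the same text
                String key = st.getSubject().stringValue() + "|" + st.getPredicate().stringValue() + "|"
                        + (st.getObject() instanceof Literal ? "L:" : "R:") + st.getObject().stringValue();
                unique.put(key, st); // later copies from other graphs simply overwrite the same key
            }
            return unique.values();
        }
    }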

Remove Redundant Triples

RS-265
The API needs to remove two kinds of redundancies.
These removals are specified below using SPARQL notation, but they are implemented in-memory using Java code.

Remove Inferred Superproperty

Remove inferred superproperties of explicitly stated subproperty.

  • Justification: CompleteMO works in Inferred mode, so it obtains a bunch of superproperties, but the subproperty is more specific.
  • Examples:
    • we want to print P48_has_preferred_identifier before the other various identifiers (P1_is_identified_by)
    • bmo:PX_physical_description is the most useful amongst various other notes (P3_has_note), so we want to print it first and maybe with label "Description"
  • Definition: remove <?s ?p ?o> whenever this holds:
    ?s ?p ?o.
    ?s ?p2 ?o.
    ?p2 rdfs:subPropertyOf ?p.
    FILTER (?p2 != ?p)
    
  • Implementation (Mitac formulated the basic logic):
    • cache the rdfs:subPropertyOf hierarchy
    • TODO Mitac: describe the rest
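
A possible in-memory sketch of this removal, assuming the rdfs:subPropertyOf closure has already been cached as a map from each property to its transitive (strict) superproperties. Class and method names are illustrative:

    import org.openrdf.model.Statement;
    import org.openrdf.model.URI;
    import java.util.*;

    public class SuperpropertyFilter {
        /**
         * Drop <s,p,o> if some other statement <s,p2,o> uses a strict subproperty p2 of p.
         * superPropertiesOf maps each property to its transitive superproperties (cached once).
         */
        public static List<Statement> removeInferredSuperproperties(
                List<Statement> statements, Map<URI, Set<URI>> superPropertiesOf) {
            // index the set of properties used between each <subject, object> pair
            Map<String, Set<URI>> propsPerPair = new HashMap<String, Set<URI>>();
            for (Statement st : statements) {
                String pair = st.getSubject().stringValue() + "|" + st.getObject().stringValue();
                Set<URI> props = propsPerPair.get(pair);
                if (props == null) propsPerPair.put(pair, props = new HashSet<URI>());
                props.add(st.getPredicate());
            }
            List<Statement> kept = new ArrayList<Statement>();
            for (Statement st : statements) {
                String pair = st.getSubject().stringValue() + "|" + st.getObject().stringValue();
                boolean redundant = false;
                for (URI other : propsPerPair.get(pair)) {
                    Set<URI> supers = superPropertiesOf.get(other);
                    if (!other.equals(st.getPredicate()) && supers != null
                            && supers.contains(st.getPredicate())) {
                        redundant = true; // a more specific statement exists, so this one goes
                        break;
                    }
                }
                if (!redundant) kept.add(st);
            }
            return kept;
        }
    }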

Remove Shortcut Property of Association

Remove direct (shortcut) property if there is longcut path through EX_Association.

  • Justification: in some cases BM data carries extra "association codes" that qualify relations (see BM Association Mapping): then EX_Association carries more info (eg type=<bequathed_by>) than the direct property.
    • In other cases only a direct property is present.
    • RForms cannot distinguish between the two cases
  • Example (from BM Association Mapping#Code In Reified Association):
  • Definition: remove <?s ?p ?o> whenever this holds:
    ?s ?p ?o.
    ?s P140i_was_attributed_by ?a.
    ?a bmo:PX_property ?p.
    ?a P141_assigned ?o.
    
  • Implementation:
    • TODO Mitac: that's a rather complex case, how to do it?
    • do Remove Inferred Superproperty before this removal: Associations are stated at the most specific level, and you don't want to kill the subproperty prematurely
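
A hedged sketch of one way to implement this removal in memory, following the definition above. The three property URIs are passed in rather than hard-coded, and class and method names are illustrative:

    import org.openrdf.model.*;
    import java.util.*;

    public class ShortcutFilter {
        public static List<Statement> removeShortcuts(List<Statement> statements,
                URI p140iWasAttributedBy, URI pxProperty, URI p141Assigned) {
            // index statements by subject for quick lookup of an association's own statements
            Map<Resource, List<Statement>> bySubject = new HashMap<Resource, List<Statement>>();
            for (Statement st : statements) {
                List<Statement> list = bySubject.get(st.getSubject());
                if (list == null) bySubject.put(st.getSubject(), list = new ArrayList<Statement>());
                list.add(st);
            }
            // collect "s|p|o" keys that are already reified through an association
            Set<String> reified = new HashSet<String>();
            for (Statement st : statements) {
                if (!st.getPredicate().equals(p140iWasAttributedBy)) continue;
                if (!(st.getObject() instanceof Resource)) continue;
                List<Statement> assoc = bySubject.get(st.getObject());
                if (assoc == null) continue;
                Value prop = null, target = null;
                for (Statement as : assoc) {
                    if (as.getPredicate().equals(pxProperty)) prop = as.getObject();
                    if (as.getPredicate().equals(p141Assigned)) target = as.getObject();
                }
                if (prop != null && target != null) {
                    reified.add(st.getSubject().stringValue() + "|" + prop.stringValue()
                            + "|" + target.stringValue());
                }
            }
            // drop the direct (shortcut) statements that the associations already express
            List<Statement> kept = new ArrayList<Statement>();
            for (Statement st : statements) {
                String key = st.getSubject().stringValue() + "|" + st.getPredicate().stringValue()
                        + "|" + st.getObject().stringValue();
                if (!reified.contains(key)) kept.add(st);
            }
            return kept;
        }
    }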

Configurable Expansion (obsolete)

Vlado: this is awfully complicated and NOT needed.

The EntityAPI should offer "full" loading of Entities, not only the attributes and relations of the Entity, but also the other related Entities, their attributes, relations and so on.

This expansion has to be done only for some of the relations. The current working assumption is that these could be specified by their names (URIs). There are two methods called shouldExpandXX():

  1. based on data about the property (e.g. parts are always expanded; so the property crm:P46_is_composed_of always indicates expansion).
    TODO: so by "data" you mean the property name.
  2. allows for custom expansion rules (expand property crm:P102_has_title when the Entity has E55_Type "Painting").
    TODO: think up a better example since this is contrived (see below)
  3. base the expansion on parsing the JSON templates from RForms.
    Vlado: this is an overkill, and who can guarantee these templates will address all and only what is needed?

In the initial implementation there will be only rules of type 1.
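
For illustration only (the whole section is obsolete), a type-1 rule could be as small as this; the class and method names are hypothetical:

    import org.openrdf.model.URI;
    import java.util.Set;

    public class PropertyNameExpansionRule {
        private final Set<URI> expandableProperties; // e.g. crm:P46_is_composed_of

        public PropertyNameExpansionRule(Set<URI> expandableProperties) {
            this.expandableProperties = expandableProperties;
        }

        /** Rule of type 1: the decision depends on the property name (URI) alone. */
        public boolean shouldExpand(URI property) {
            return expandableProperties.contains(property);
        }
    }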

Cut-off example

We want to cut off at each thesaurus entry.

  1. We could cut off its relations by NOT navigating thesaurus-service relations (eg skos:broader, skos:inScheme and equivalent CRM relations) nor inverses (eg P14i_performed)
    • But when Persons get extra data (eg E67_Birth), more relations will be added (eg P98i_was_born), and some may get confused through sub-properties (eg P98i and P108i_was_produced_by are siblings i.e. sub-properties of P92i_was_brought_into_existence_by; and we do use P108i)
  2. We could cut the node by "rdf:type is skos:Concept" (or E55_Type, E21_Person or E53_Place)
    • But we still need to fetch its rdfs:label(s)!
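
A small sketch of the check in point 2, assuming the Sesame API; the set of cut-off classes is passed in rather than hard-coded, and names are illustrative:

    import org.openrdf.model.*;
    import org.openrdf.model.vocabulary.RDF;
    import org.openrdf.model.vocabulary.RDFS;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.RepositoryException;
    import java.util.*;

    public class CutOff {
        /** True if the node is typed with one of the cut-off classes (e.g. skos:Concept). */
        public static boolean isCutOffNode(RepositoryConnection conn, Resource node,
                                           Set<URI> cutOffClasses) throws RepositoryException {
            for (URI clazz : cutOffClasses) {
                if (conn.hasStatement(node, RDF.TYPE, clazz, true)) return true;
            }
            return false;
        }

        /** Even at a cut-off node we still fetch its rdfs:label(s). */
        public static List<Statement> fetchLabels(RepositoryConnection conn, Resource node)
                throws RepositoryException {
            return conn.getStatements(node, RDFS.LABEL, null, true).asList();
        }
    }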

Named Graphs instead of Graph Walk

Lesson Learned for RS4
Overall, I now think the Graph Walk that we do to figure out the extent of an object's graph is wrong.
  • It's too brittle: data variations easily break it, adding data breaks it...
    See one of the most insidious leaks: RS-1375
  • It's expensive both in terms of queries and processing
  • It's a complicated algorithm.

The most logical approach is to use the named graph for this (one graph = one object).

  • Dominic said the same on a call on 2012-01-25 re the performance of the Graph Walk algorithm
  • Josh emits the statements for <object> in named graph <object>/graph.
    It's better to use just <object> for the graph URI, since one is not supposed to manipulate URIs (they are supposed to be opaque)
  • But things are not so simple, because only the explicit statements are in the per-object graph
    The implicit (inferred) statements are in the empty (default) graph. They can also be fetched from the special ("magic") onto:implicit graph, which includes no explicit statements.
    • Dominic asked how important inference is for the CompleteMO (i.e. RForm detailed object view). It's used for:
      1. P140i (see Remove Shortcut Property of Association above), i.e. to get to all association codes.
      2. Fetching some subprops (eg BM ext props) as the base CRM prop, eg:
        bmo:PX_curatorial_comment as P3_has_note, bmo:PX_likelihood as P2_has_type, P131_is_identified_by as P1_is_identified_by
    • Vlado talked to the OWLIM team re the feasibility of an enhancement that would put inferred statements in a specific graph (predefined, or dynamically computed from the premises). Damyan explained that this is infeasible, because it can lead to the same ontological statement being asserted in each object's named graph, and because of a variety of other technical issues.

Vlado: here's an idea for a 3-step algorithm (sketched after the TODO below):

  1. Get all statements <s,p,o,g> in the object's graph g
  2. Extract all distinct Nodes from these statements
    Note: we also need literals, not just URIs!
  3. Get all inferred statements <?s, ?p, ?o, onto:implicit> where
    ?s,?o are in the set extracted above
    ?p is a business property

TODO: make the above precise:

  • should both ?s,?o be in Nodes, or it's enough for one of them to be?
  • assess how long the lists of Nodes (max 300 per object?) and Business Properties (about 140) are, and what impact that has on the query
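
A hedged sketch of these three steps using the Sesame API directly; onto:implicit is passed in as a URI, and the sketch requires both ?s and ?o to be in Nodes, which is only one possible answer to the first TODO question. Names are illustrative:

    import org.openrdf.model.*;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.RepositoryException;
    import org.openrdf.repository.RepositoryResult;
    import java.util.*;

    public class NamedGraphFetch {
        public static List<Statement> fetch(RepositoryConnection conn, Resource objectGraph,
                URI implicitGraph, Set<URI> businessProperties) throws RepositoryException {
            // 1. all explicit statements in the object's named graph
            List<Statement> explicit = conn.getStatements(null, null, null, false, objectGraph).asList();

            // 2. all distinct nodes mentioned in those statements (literals included)
            Set<Value> nodes = new HashSet<Value>();
            for (Statement st : explicit) {
                nodes.add(st.getSubject());
                nodes.add(st.getObject());
            }

            // 3. inferred statements between those nodes, restricted to business properties.
            //    Done as one lookup per node; a single SPARQL query is the alternative the TODO asks about.
            List<Statement> result = new ArrayList<Statement>(explicit);
            for (Value node : nodes) {
                if (!(node instanceof Resource)) continue; // literals cannot be subjects
                RepositoryResult<Statement> inferred =
                        conn.getStatements((Resource) node, null, null, true, implicitGraph);
                try {
                    while (inferred.hasNext()) {
                        Statement st = inferred.next();
                        if (businessProperties.contains(st.getPredicate())
                                && nodes.contains(st.getObject())) {
                            result.add(st);
                        }
                    }
                } finally {
                    inferred.close();
                }
            }
            return result;
        }
    }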

EntityAPI

Input: URI of a Museum Object (MO)
Output: RDF/JSON of its complete data
TODO Mitac: describe the above as a real API

The class Entity was extended with the following data structures (and corresponding methods):

  • attributes: Map<URI, Literal[]>
  • relations: Map<URI, Entity[]>
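
For concreteness, a minimal sketch of the extended class as described above; the constructor and getters are illustrative:

    import org.openrdf.model.Literal;
    import org.openrdf.model.URI;
    import java.util.HashMap;
    import java.util.Map;

    public class Entity {
        private final URI uri;
        // literal-valued attributes of this entity, keyed by property URI
        private final Map<URI, Literal[]> attributes = new HashMap<URI, Literal[]>();
        // relations to other (possibly expanded) entities, keyed by property URI
        private final Map<URI, Entity[]> relations = new HashMap<URI, Entity[]>();

        public Entity(URI uri) { this.uri = uri; }

        public URI getUri() { return uri; }
        public Map<URI, Literal[]> getAttributes() { return attributes; }
        public Map<URI, Entity[]> getRelations() { return relations; }
    }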

RDF/JSON generation

Use SPARQLResultsJSONWriter

  • there is a SPARQLResultsJSONWriter class from Sesame distribution
  • its interface is to override the method handleSolution(BindingSet)
  • currently it does not implement the check for multiple values for the same property and thus produces incorrect JSON
  • fortunately, this is the class used in Forest as well
  • after Kosyo fixes it in Forest (LLD-599@jira: bad JSON for duplicate subject & property), we can reuse the code
  • Jana has currently written a version that uses Forest REST

Considerations:

  • This generates the JSON
  • depends on the above fix
  • may become slower with Forest because we involve an extra component, and there are numerous queries
  • Vlado: won't we need to merge the partial JSONs from DESCRIBE of adjacent nodes? Isn't this complicated?

Generate JSON directly

  • when an Entity is loaded fully, the JSON could be recursively generated by navigating it
  • maybe this can be useful: JSON simple API (http://code.google.com/p/json-simple/): serialize web service results to JSON format
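
A recursive sketch of this direct generation using json-simple and the Entity sketch from the EntityAPI section above. The output shape (keys = property URIs, values = arrays) is only an assumption and would have to match what RForms/Exhibit expect:

    import org.json.simple.JSONArray;
    import org.json.simple.JSONObject;
    import org.openrdf.model.Literal;
    import org.openrdf.model.URI;
    import java.util.Map;

    public class EntityJsonWriter {
        @SuppressWarnings("unchecked")
        public static JSONObject toJson(Entity entity) {
            JSONObject json = new JSONObject();
            json.put("uri", entity.getUri().stringValue());
            // literal attributes become arrays of strings
            for (Map.Entry<URI, Literal[]> e : entity.getAttributes().entrySet()) {
                JSONArray values = new JSONArray();
                for (Literal lit : e.getValue()) values.add(lit.stringValue());
                json.put(e.getKey().stringValue(), values);
            }
            // related entities are serialized recursively (they were cut off while loading)
            for (Map.Entry<URI, Entity[]> e : entity.getRelations().entrySet()) {
                JSONArray values = new JSONArray();
                for (Entity related : e.getValue()) values.add(toJson(related));
                json.put(e.getKey().stringValue(), values);
            }
            return json;
        }
    }

Calling toJson(entity).toJSONString() would then yield the serialized string.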