Intro
RS uses a set of "Business Properties" (currently about 100) for two crucial tasks:
- Fetch Complete Museum Object
- FTS indexing (molecules), using a custom process
RS-977
Considerations
This is a subset of the more than 300 properties in the ontologies that we use:
- CIDOC CRM has 264 properties (125 inverse pairs, 14 literal properties); and 86 classes
- RSO has 35 properties; and 7 classes
- 25 properties; and 4 classes
- We also use a few properties from external ontologies: SKOS (thesauri), OAC (annotation), BIBO (bibliography), QUDT (units)
Considerations when making the list:
- For each property we must decide whether we want it, one of its superproperties, or both (see Complete Museum Object#Remove Inferred Superproperty)
- We should use superproperties whenever appropriate, to keep RForms simpler (more abstract)
- All appropriate CRM properties have owl:inverseOf and we use inverse inference, so we can use properties either as explicitly stated, or in the opposite direction.
- Leak Avoidance
Process, Tools, Files
- BM-properties.pl: Perl script that extracts all properties used in the BM mapping, from:
config.xml, bibliography-config.xml, biography-config.xml, dimensionunit-config.xml, flat-config.xml, image-config.xml, inline-thesauri-config.xml, thesaurus-config.xml
- BM-properties.xls: Excel that I use to record decisions about each property.
- Filter by WANT to generate properties.txt
- Filter by WARN to see notices to particular people
- The last columns show how many occurrences in each of the config files
- properties.txt: list extracted from the excel
- versioned artifact that feeds the Fetch algorithm and the FTS configuration
- lives in https://svn.ontotext.com/svn/researchspace/trunk/entity-api/src/resources/properties.txt
TODO Mitac: adapt the code to use prefixed names, then commit the file from conf to svn
- LuceneIndexCreation.lucene: ASK queries to generate Lucene index
- lives in https://svn.ontotext.com/svn/researchspace/trunk/data/LuceneIndexCreation.lucene
Mitac: I see only an example file of that name. Please correct this description
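The "Filter by WANT" step above could be sketched as follows. This is only an illustration: it assumes the spreadsheet has been exported to CSV, that the columns are called `property`, `WANT` and `WARN`, and that any non-empty WANT cell means "keep". None of these details are confirmed by this page, and the function name is hypothetical.

```python
import csv
import io


def wanted_properties(csv_text, want_column="WANT", name_column="property"):
    """Return the sorted property names whose WANT cell is non-empty.

    Column names and the "non-empty means wanted" rule are assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    wanted = set()
    for row in reader:
        name = (row.get(name_column) or "").strip()
        flag = (row.get(want_column) or "").strip()
        if name and flag:
            wanted.add(name)
    return sorted(wanted)


# Toy export of three rows; the real spreadsheet has about 100 wanted rows.
SAMPLE = """property,WANT,WARN
crm:P14_carried_out_by,y,
crm:P14i_performed,,leak
crm:P2_has_type,y,
"""

# One property per line, in the shape properties.txt is described as having.
print("\n".join(wanted_properties(SAMPLE)))
```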
Leak Avoidance
It is very important to avoid "leaks": property paths that go from one object to another (or many others).
- Such leaks embed the data (or fulltext) of other objects into the root object, and are very undesirable.
- Direction is important: eg we should follow P14_carried_out_by (to get the painting's author) but never its inverse P14i_performed (that would leak into all objects by the same author).
- We must "comb" all properties (Bulgarian: сресвам) so that they point from the object towards the periphery
- Leaks are caused by properties that can cause a loop (P14/P14i is an example of a trivial loop)
- Subproperty inference should also be taken into consideration
- superproperties.txt: list of immediate superproperties for each business property
- TODO Vlado: figure out a query (using rdfs:subPropertyOf) to find loops automatically.
- We cannot remove all looping properties from properties.txt (see examples below), so a key strategy is to cut off traversal at object collections and thesaurus terms:
RS-1139
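The loop check described in the list above could be prototyped outside the repository before writing the SPARQL query. The sketch below is in Python and is not the query the TODO asks for. It assumes three hypothetical in-memory inputs: the set of business properties, a map of owl:inverseOf pairs, and a map of immediate superproperties (the kind of data superproperties.txt holds). It flags a pair (p, q) when the inverse of q is p itself or one of p's superproperties, because a statement using p then also yields a q statement pointing back.

```python
def ancestors_or_self(prop, supers):
    """Return prop plus all its transitive superproperties."""
    seen = {prop}
    stack = [prop]
    while stack:
        current = stack.pop()
        for sup in supers.get(current, ()):
            if sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen


def find_candidate_loops(props, inverse_of, supers):
    """Return sorted pairs of business properties that can bounce back.

    These are candidates for review only: as noted above, not every
    looping property can be removed from properties.txt.
    """
    loops = set()
    for p in props:
        closure = ancestors_or_self(p, supers)
        for q in props:
            if inverse_of.get(q) in closure:
                loops.add(tuple(sorted((p, q))))
    return sorted(loops)


# Toy data (a small, hand-picked subset of CRM; not the real property list).
PROPS = {"P14_carried_out_by", "P14i_performed",
         "P11_had_participant", "P12i_was_present_at", "P2_has_type"}
INVERSE_OF = {
    "P14_carried_out_by": "P14i_performed",
    "P14i_performed": "P14_carried_out_by",
    "P11_had_participant": "P11i_participated_in",
    "P11i_participated_in": "P11_had_participant",
    "P12_occurred_in_the_presence_of": "P12i_was_present_at",
    "P12i_was_present_at": "P12_occurred_in_the_presence_of",
}
SUPERS = {
    "P14_carried_out_by": ["P11_had_participant"],
    "P11_had_participant": ["P12_occurred_in_the_presence_of"],
    "P14i_performed": ["P11i_participated_in"],
    "P11i_participated_in": ["P12i_was_present_at"],
}

for pair in find_candidate_loops(PROPS, INVERSE_OF, SUPERS):
    print(pair)
```

With this toy data the trivial P14/P14i pair is reported, and so is P11_had_participant with P12i_was_present_at, which is only visible once the superproperty P12_occurred_in_the_presence_of is taken into account.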
Examples of Looping Properties
- crm:P138i_has_representation: <obj> P138i_has_representation <image>.
crm:P138_represents: <obj> P65_shows_visual_item/P138_represents <person> or <place>
- <image> leads to only 1 object, and we cut off at <person> or <place>
- P12i_was_present_at: <obj> P12i_was_present_at <event> (eg exhibition, research)
P11_had_participant: BM Association Mapping#Acquired Through (intermediary or contributor)
- We cut off at BM (skos:Concept)
- crm:P46_is_composed_of: <obj> P46_is_composed_of <part> (bell, case, dial…)
crm:P46i_forms_part_of: <obj> P46i_forms_part_of Series/Exhibition/Collection
- We cut off at Series/Collection (E78_Collection) or Exhibition (skos:Concept)
RS-1138
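The cut-off idea in the examples above can be illustrated with a small traversal sketch. It is not the actual Fetch or FTS code. It assumes triples held as Python tuples, a hypothetical `types` map from node to its set of rdf:type values, and the interpretation that a cut-off node is still included in the result but never expanded further.

```python
from collections import deque

# Types at which traversal stops (taken from the examples above).
CUT_OFF_TYPES = frozenset({"skos:Concept", "crm:E78_Collection"})


def fetch_with_cut_off(root, triples, business_props, types,
                       cut_off_types=CUT_OFF_TYPES):
    """Collect triples reachable from root via business properties.

    A node whose type is in cut_off_types is reached but not expanded,
    so data of other objects hanging off it is not pulled in.
    """
    out_edges = {}
    for s, p, o in triples:
        if p in business_props:
            out_edges.setdefault(s, []).append((p, o))

    collected = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node != root and types.get(node, set()) & cut_off_types:
            continue  # cut off: keep the node, do not follow its edges
        for p, o in out_edges.get(node, ()):
            collected.append((node, p, o))
            if o not in seen:
                seen.add(o)
                queue.append(o)
    return collected


# Toy data modelled on the acquisition example (node names are illustrative).
TRIPLES = [
    ("obj1", "crm:P12i_was_present_at", "obj1/acquisition"),
    ("obj1/acquisition", "crm:P11_had_participant", "BM"),
    ("BM", "crm:P12i_was_present_at", "obj2/acquisition"),
]
PROPS = {"crm:P12i_was_present_at", "crm:P11_had_participant"}
TYPES = {"BM": {"skos:Concept"}}

for triple in fetch_with_cut_off("obj1", TRIPLES, PROPS, TYPES):
    print(triple)
```

Passing an empty `cut_off_types` to the same call shows the leak: the traversal continues through BM into obj2/acquisition.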
Examples of Leaks
Examples of potential leaks, or leaks we had in the past:
Object Present at Another's Acquisition
obj1 – P12i_was_present_at -> obj1/acquisition – P11_had_participant<P12_occurred_in_the_presence_of
-> seller/buyer=BM – P12i_was_present_at -> obj2/acquisition
Resolve by cutting off at BM (skos:Concept). See more at Investigating FTS Molecules
Object Part Of Collection
- obj2 – P46i_forms_part_of -> BM_Collection.
obj1/acquisition – P110_augmented -> BM_Collection.
BM_Collection – P110i_was_augmented_by<P12i_was_present_at -> obj1/acquisition.
- obj2 – FR12_was_present_at -> obj1/acquisition
Resolve by removing P46i_forms_part_of from FRs (cannot exclude by type E78_Collection in FR rules). See more at FR Implementation-old#BUG
Shows Features Of
- RKD uses P130i:
# <artistiek> relation to other artistic object
<obj/2926> crm:P130_shows_features_of <obj/2926/related/1>.
- BM uses P130:
<obj> P130i_features_are_also_found_on <obj/original>.
<obj/original> produced/took_place_at <place>
I thought that following both relations can create a loop/leak.
But because the referred object has only a few props (and no relations of its own), there can be no leak.
bibo:Document not marked as skos:Concept
RS-700
#3
Bibliography references are not declared skos:Concept, so we spill over from an object to all objects referenced by the same bib.
- Inference: <obj> P70i_is_documented_in <bibN> => <obj> P67i_is_referred_to_by <bibN> => <bibN> P67_refers_to <obj>
- We chase P67 because: <obj> P128_carries <obj/concept/1>. <obj/concept/1> P67_refers_to <place>
- Can be fixed by adding
prefix bibo: <http://purl.org/ontology/bibo/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
insert {?d a skos:Concept; skos:inScheme <http://collection.britishmuseum.org/id/bibliography>}
where {?d a bibo:Document}
Shared Image
RS-772
#6
Out of 958035 total images, 57813 (6%) are shared between objects. This is expected.
- For example, AN00014479_001 is used in:
MCM960 as main_representation
MCM7488 as P138
MCM7489 as main_representation
MCM7490 as P138
- we chase P138i_has_representation in BM RForms Mockup#Images of the Object to display the images of the object
- we chase P65_shows_visual_item/P138_represents in BM RForms Mockup#Inscriptions & Images on Object to describe what's depicted on the object
Fix by cutting off at rso:E22_Museum_Object
Shadow Object with Shared Images
RS-1375
RFC2637 is an "Album: 238 photographs taken in Tibet".
- It has 304 associated images:
select (count(*) as ?c) { <http://collection.britishmuseum.org/id/object/RFC2637> crm:P138i_has_representation ?i}
- 229 other objects (I guess individual photographs) share images with that album:
select (count(distinct ?e) as ?c) {
  <http://collection.britishmuseum.org/id/object/RFC2637> crm:P138i_has_representation ?i.
  ?e crm:P138i_has_representation ?i.
  FILTER(?e != <http://collection.britishmuseum.org/id/object/RFC2637>)}
This is a more insidious case than Shared Image above:
- RFC2637 is not amongst the 115k objects submitted by Josh
- So it's only a "shadow of an object": it has image data but no other data
- from the domain of bmo:PX_has_main_representation it is inferred as crm:E22_Man-Made_Object
- but because it doesn't have P52_has_current_owner id:the-british-museum, we don't infer it as rso:E22_Museum_Object, so we have no way to tell that traversal should be cut off at this object
select * {<http://collection.britishmuseum.org/id/object/RFC2637> a ?t}
As a result, each of the 229 photographs leaks through the Album to each other photograph.
- the JSON result for "Load Complete Object" is 450k (json.log) instead of typical 20k
- the FTS molecule is 199.5k, same for each photograph
Possible Solutions
- adding P52_has_current_owner to "shadow objects" is a bad idea because they'll become searchable, but they have no data.
- (Note: assets_*.trig use the same named graphs as real objects, so there'll be no duplicate statements)
- cut-off at crm:E22_Man-Made_Object doesn't work because parts and related objects are marked E22:
RKD: <obj/2926> crm:P130i_features_are_also_found_on <obj/2926/related/1>. <obj/2926/related/1> a crm:E22_Man-Made_Object.
RKD: rso:P46_has_other_part <obj/2926/part/2>. <obj/2926/part/2> a crm:E22_Man-Made_Object;
BM: <obj/part/M> a E22_Man-Made_Object; P46i_forms_part_of <obj>.
- the best solution is if Josh does not emit image assets for objects that are not included in the 115k
- we can hot-fix it by executing this update (in a SystemTransaction):
delete {?e bmo:PX_has_main_representation ?i}
where {?e bmo:PX_has_main_representation ?i. filter not exists {?e crm:P52_has_current_owner id:the-british-museum}};
delete {?e crm:P138i_has_representation ?i}
where {?e crm:P138i_has_representation ?i. filter not exists {?e crm:P52_has_current_owner id:the-british-museum}}
NOTE: this kills RKD images, which don't have such owner.
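Before running such an update, it may help to list which subjects would lose their image links. The sketch below does this over an in-memory list of triples, as a dry run of the same condition (has an image property, lacks P52_has_current_owner id:the-british-museum). The node names in the toy data, including the RKD image link, are hypothetical and only illustrate the NOTE above.

```python
IMAGE_PROPS = {"bmo:PX_has_main_representation", "crm:P138i_has_representation"}
OWNER_PROP = "crm:P52_has_current_owner"
BM = "id:the-british-museum"


def subjects_losing_images(triples):
    """Return sorted subjects that have image links but no BM owner.

    These are the subjects whose image statements the hot-fix would delete.
    """
    owned = {s for s, p, o in triples if p == OWNER_PROP and o == BM}
    with_images = {s for s, p, _ in triples if p in IMAGE_PROPS}
    return sorted(with_images - owned)


# Toy data: a shadow album, a real BM photograph, and an RKD object.
TRIPLES = [
    ("bm:RFC2637", "crm:P138i_has_representation", "img:1"),
    ("bm:photo1", "crm:P138i_has_representation", "img:1"),
    ("bm:photo1", OWNER_PROP, BM),
    ("rkd:obj/2926", "crm:P138i_has_representation", "img:2"),
]

for subject in subjects_losing_images(TRIPLES):
    print(subject)
```

In this toy data both the shadow album and the RKD object are listed, which is the side effect the NOTE warns about.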