
{jira:RS-680}
{toc}
{attachments}

h1. Intro
RS uses a set of "Business Properties" (currently about 100) for two crucial tasks:
- Fetch [Complete Museum Object]
- FTS indexing (molecules), using a custom process
{jira:RS-977}

h2. Considerations
The Business Properties are a subset of the more than 300 properties in the ontologies that we use:
- CIDOC CRM has 264 properties (125 inverse pairs, 14 literal properties); and 86 classes
- RSO has 35 properties; and 7 classes
- 25 properties; and 4 classes
- We also use a few properties from external ontologies: SKOS (thesauri), OAC (annotation), BIBO (bibliography), QUDT (units)

Considerations when making the list:
- For each property we must decide whether we want it, one of its superproperties, or both (see [Complete Museum Object#Remove Inferred Superproperty])
- We should use superproperties whenever appropriate, to keep RForms simpler (more abstract)
- All appropriate CRM properties have owl:inverseOf and we use inverse inference, so we can use properties either as explicitly stated, or in the opposite direction.
- [#Leak Avoidance]

h1. Process, Tools, Files
- [^BM-properties.pl]: Perl script that extracts all properties used in the BM mapping, from:
config.xml, bibliography-config.xml, biography-config.xml, dimensionunit-config.xml, flat-config.xml, image-config.xml, inline-thesauri-config.xml, thesaurus-config.xml
- [^BM-properties.xls]: Excel that I use to record decisions about each property.
!BM-properties.png!
-- Filter by WANT to generate properties.txt
-- Filter by WARN to see notices to particular people
-- The last columns show how many occurrences in each of the config files
- [^properties.txt]: list extracted from the excel
-- versioned artifact that feeds the Fetch algorithm and the FTS configuration
-- lives in [https://svn.ontotext.com/svn/researchspace/trunk/entity-api/src/resources/properties.txt]
-- (!) TODO Mitac: adapt the code to use prefixed names, then commit the file from conf to svn
- LuceneIndexCreation.lucene: ASK queries to generate Lucene index
-- lives in [https://svn.ontotext.com/svn/researchspace/trunk/data/LuceneIndexCreation.lucene]
-- (?) Mitac: I see only an example file of that name. Please correct this description
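
The "Filter by WANT" step above is done by hand in Excel. A minimal sketch of the same filter, assuming the sheet is exported as CSV: the column name "property" and the "x" marks are assumptions for illustration (only the WANT and WARN filters are named on this page), and the sample rows are made up.

```python
import csv
import io

def wanted_properties(csv_text, name_col="property", want_col="WANT"):
    """Return the sorted property names whose WANT cell is non-empty.

    name_col is a hypothetical column name; adjust to the real sheet.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    names = {
        row[name_col].strip()
        for row in rows
        if (row.get(want_col) or "").strip()
    }
    return sorted(names)

# Illustrative rows only, not the real spreadsheet content.
SAMPLE = """property,WANT,WARN
crm:P14_carried_out_by,x,
crm:P14i_performed,,leak
crm:P138i_has_representation,x,
"""

# One prefixed name per line, as a candidate properties.txt body.
print("\n".join(wanted_properties(SAMPLE)))
```

Writing the result with prefixed names matches the TODO above about adapting the code to prefixed names.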

h1. Leak Avoidance
It is very important to avoid "leaks": property paths that go from one object to another (or many others).
- Such leaks embed the data (or fulltext) of other objects into the root object, and are very undesirable.
- Direction is important: eg we should follow P14_carried_out_by (to get the painting's author) but never its inverse P14i_performed (that would leak into all objects by the same author).
- We must "comb" (Bulgarian: сресвам) all properties so that they point from the object towards the periphery
- Leaks are caused by properties that can cause a loop (P14/P14i is an example of a trivial loop)
- Subproperty inference should also be taken into consideration
-- [^superproperties.txt]: list of immediate superproperties for each business property
-- TODO Vlado: figure out a query (using rdfs:subPropertyOf) to find loops automatically.
- We cannot remove all looping properties from properties.txt (see examples below), so a key strategy is to cut off traversal at object collections and thesaurus terms:
{jira:RS-1139}
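
The cut-off strategy above can be sketched as a breadth-first walk that includes a cut-off node but does not expand it. This is an illustration only, not the actual Fetch or FTS code: the function name, the in-memory triple list and the type map are assumptions. The sample data follows the acquisition leak described on this page (obj1 and obj2 both reach BM).

```python
from collections import deque

def collect_molecule(root, triples, types, business_props, cutoff_types):
    """Walk from root along business properties.

    A node that has a type in cutoff_types is kept in the result
    but its outgoing properties are not followed. The root itself
    is always expanded.
    """
    out = {}
    for s, p, o in triples:
        if p in business_props:
            out.setdefault(s, []).append(o)
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node != root and types.get(node, set()) & cutoff_types:
            continue  # cut off: do not traverse past this node
        for nxt in out.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

TRIPLES = [
    ("obj1", "P12i_was_present_at", "obj1/acquisition"),
    ("obj1/acquisition", "P11_had_participant", "BM"),
    ("BM", "P12i_was_present_at", "obj2/acquisition"),
]
TYPES = {"BM": {"skos:Concept"}}
PROPS = {"P12i_was_present_at", "P11_had_participant"}
CUTOFF = {"skos:Concept", "crm:E78_Collection"}

with_cutoff = collect_molecule("obj1", TRIPLES, TYPES, PROPS, CUTOFF)
without_cutoff = collect_molecule("obj1", TRIPLES, TYPES, PROPS, set())
print(sorted(with_cutoff))
print(sorted(without_cutoff))
```

With the cut-off, the walk stops at BM. Without it, obj2/acquisition is pulled into the molecule of obj1, which is exactly the kind of leak to avoid.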

h2. Examples of Looping Properties
- crm:P138i_has_representation: <obj> P138i_has_representation <image>.
crm:P138_represents: <obj> P65_shows_visual_item/P138_represents <person> or <place>
-- <image> leads to only 1 object, and we cut off at <person> or <place>
- P12i_was_present_at: <obj> P12i_was_present_at <event> (eg exhibition, research)
P11_had_participant: [BM Association Mapping#Acquired Through (intermediary or contributor)]
-- We cut off at BM (skos:Concept)
- crm:P46_is_composed_of: <obj> P46_is_composed_of <part> (bell, case, dial…)
crm:P46i_forms_part_of: <obj> P46i_forms_part_of Series/Exhibition/Collection
-- We cut off at Series/Collection (E78_Collection) or Exhibition (skos:Concept)
{jira:RS-1138}
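
Pairs like the ones above could be found mechanically. The sketch below is one possible approach, not the query from the TODO: it flags a pair (p, r) when both are business properties and r is the inverse of p, or a (transitive) superproperty of that inverse, since inference then lets a traversal go out along p and come back along r. The inverse and superproperty maps are small hand-written assumptions; in practice they would come from owl:inverseOf and from a file like superproperties.txt.

```python
def ancestors(prop, supers):
    """Return prop plus all of its transitive superproperties."""
    seen, stack = set(), [prop]
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(supers.get(p, ()))
    return seen

def looping_pairs(business_props, inverse_of, supers):
    """Find (p, r) where following p and then r can return to the start."""
    pairs = set()
    for p in business_props:
        inv = inverse_of.get(p)
        if inv is None:
            continue  # literal property or inverse unknown
        for r in ancestors(inv, supers) & business_props:
            pairs.add((p, r))
    return pairs

# Hand-written sample maps (assumptions for illustration).
INVERSE = {
    "P11_had_participant": "P11i_participated_in",
    "P12i_was_present_at": "P12_occurred_in_the_presence_of",
    "P14_carried_out_by": "P14i_performed",
    "P46_is_composed_of": "P46i_forms_part_of",
    "P46i_forms_part_of": "P46_is_composed_of",
}
SUPERS = {
    "P11_had_participant": ["P12_occurred_in_the_presence_of"],
    "P11i_participated_in": ["P12i_was_present_at"],
}
PROPS = {
    "P14_carried_out_by",
    "P11_had_participant",
    "P12i_was_present_at",
    "P46_is_composed_of",
    "P46i_forms_part_of",
}

for pair in sorted(looping_pairs(PROPS, INVERSE, SUPERS)):
    print(pair)
```

P14_carried_out_by is not flagged because its inverse is not in the sample list. A flagged pair is not necessarily a leak: it only marks where a cut-off (or a removal) has to be considered.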

h2. Examples of Leaks
Examples of potential leaks, or leaks we had in the past:

h3. Object Present at Another's Acquisition
obj1 -- P12i_was_present_at -> obj1/acquisition -- P11_had_participant<P12_occurred_in_the_presence_of
-> seller/buyer=BM -- P12i_was_present_at -> obj2/acquisition

Resolved by cutting off at BM (skos:Concept). See more at [Investigating FTS Molecules]

h3. Object Part Of Collection
- obj2 -- P46i_forms_part_of -> BM_Collection.
obj1/acquisition -- P110_augmented -> BM_Collection.
BM_Collection -- P110i_was_augmented_by<P12i_was_present_at -> obj1/acquisition.
- obj2 -- FR12_was_present_at -> obj1/acquisition

Resolve by removing P46i_forms_part_of from FRs (cannot exclude by type E78_Collection in FR rules). See more at [FR Implementation-old#BUG]

h3. Shows Features Of
{jira:RS-1160}
- RKD uses P130i:
{noformat}
# <artistiek> relation to other artistic object
<obj/2926> crm:P130_shows_features_of <obj/2926/related/1>.
{noformat}
- BM uses P130:
{noformat}
<obj> P130i_features_are_also_found_on <obj/original>.
<obj/original> produced/took_place_at <place>
{noformat}

I thought that following both relations can create a loop/leak.
But because the referred object has only a few props (and no relations of its own), there can be no leak.

h3. bibo:Document not marked as skos:Concept
{jira:RS-700} #3
Bibliography references are not declared skos:Concept, so we spill over from an object to all objects referenced by the same bib.
- Inference: <obj> P70i_is_documented_in <bibN> => <obj> P67i_is_referred_to_by <bibN> => <bibN> P67_refers_to <obj>
- We chase P67 because: <obj> P128_carries <obj/concept/1>. <obj/concept/1> P67_refers_to <place>
- Can be fixed by adding
{noformat}
# SPARQL Update uses "prefix" declarations (the Turtle "@prefix ... ." form is not accepted)
prefix bibo: <http://purl.org/ontology/bibo/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
insert {?d a skos:Concept;
skos:inScheme <http://collection.britishmuseum.org/id/bibliography>
} where {?d a bibo:Document}
{noformat}

h3. Shared Image
{jira:RS-772} #6
Out of 958035 total images, 57813 (6%) are shared between objects. This is expected.
- For example, AN00014479_001 is used in:
MCM960 as main_representation
MCM7488 as P138
MCM7489 as main_representation
MCM7490 as P138
- we chase P138i_has_representation in [BM RForms Mockup#Images of the Object] to display the images *of* the object
- we chase P65_shows_visual_item/P138_represents in [BM RForms Mockup#Inscriptions & Images on Object] to describe what's depicted *on* the object

Fix by cutting off at rso:E22_Museum_Object
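
To see which images are shared (and so where a traversal without the cut-off would cross into another object), one can group the object/image pairs by image. A minimal sketch, assuming the pairs were already exported from the repository; the function name is hypothetical and the second image id is made up, while the first image and its four objects are the example listed above.

```python
from collections import defaultdict

def shared_images(pairs):
    """Map each image used by more than one object to the set of its objects."""
    by_image = defaultdict(set)
    for obj, image in pairs:
        by_image[image].add(obj)
    return {img: objs for img, objs in by_image.items() if len(objs) > 1}

PAIRS = [
    ("MCM960", "AN00014479_001"),
    ("MCM7488", "AN00014479_001"),
    ("MCM7489", "AN00014479_001"),
    ("MCM7490", "AN00014479_001"),
    ("MCM960", "AN_hypothetical_001"),  # made-up, unshared image
]

shared = shared_images(PAIRS)
for image, objects in shared.items():
    print(image, sorted(objects))
```

Each shared image is a point where one object's data could reach another's, which is why the traversal has to stop when it arrives at another rso:E22_Museum_Object.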

h3. Shadow Object with Shared Images
{jira:RS-1375}
RFC2637 is an "Album: 238 photographs taken in Tibet".
- It has 304 associated images:
{noformat}
select (count(*) as ?c) {
<http://collection.britishmuseum.org/id/object/RFC2637> crm:P138i_has_representation ?i}
{noformat}
- 229 other objects (I guess individual photographs) share images with that album:
{noformat}
select (count(distinct ?e) as ?c) {
<http://collection.britishmuseum.org/id/object/RFC2637> crm:P138i_has_representation ?i.
?e crm:P138i_has_representation ?i.
FILTER(?e != <http://collection.britishmuseum.org/id/object/RFC2637>)}
{noformat}

This is a more insidious case than [#Shared Image] above:
- RFC2637 is not amongst the 115k objects submitted by Josh
- So it's only a "shadow of an object": it has image data but no other data
- from the domain of bmo:PX_has_main_representation it is inferred as crm:E22_Man-Made_Object
- but because it doesn't have P52_has_current_owner id:the-british-museum, we don't infer it as rso:E22_Museum_Object, so the cut-off at rso:E22_Museum_Object does not apply to it
{noformat}
select * {<http://collection.britishmuseum.org/id/object/RFC2637> a ?t}
{noformat}

As a result, each of the 229 photographs leaks through the Album to each other photograph.
- the JSON result for "Load Complete Object" is 450k ([^json.log]) instead of typical 20k
- the FTS molecule is 199.5k, same for each photograph

h4. Possible Solutions
# adding P52_has_current_owner to "shadow objects" is a bad idea: they would become searchable, but they have no data.
-- (Note: assets_*.trig use the same named graphs as real objects, so there'll be no duplicate statements)
# cut-off at crm:E22_Man-Made_Object doesn't work because parts and related objects are marked E22:
{noformat}
RKD: <obj/2926> crm:P130i_features_are_also_found_on <obj/2926/related/1>.
<obj/2926/related/1> a crm:E22_Man-Made_Object.
RKD: rso:P46_has_other_part <obj/2926/part/2>.
<obj/2926/part/2> a crm:E22_Man-Made_Object;
BM: <obj/part/M> a E22_Man-Made_Object; P46i_forms_part_of <obj>.
{noformat}
# the best solution is if Josh *does not* emit image assets for objects that are not included in the 115k
# we can hot-fix it by executing this update (in a SystemTransaction):
{noformat}
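# DELETE WHERE accepts only triple patterns (no FILTER), so use the DELETE {...} WHERE {...} form
delete {?e bmo:PX_has_main_representation ?i}
where {?e bmo:PX_has_main_representation ?i.
filter not exists {?e crm:P52_has_current_owner id:the-british-museum}};
delete {?e crm:P138i_has_representation ?i}
where {?e crm:P138i_has_representation ?i.
filter not exists {?e crm:P52_has_current_owner id:the-british-museum}}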
{noformat}
NOTE: this also deletes RKD images, because RKD objects don't have that owner.