h1. Intro
Mitac wrote a molecule extractor that saves the FTS molecules for all objects:
- molecules-myIndex.txt: objects: traverse [Complete Museum Object], collect all literals, plus prefLabels of terms
- molecules-thesIndex.txt: thesauri: prefLabels and altLabels of terms

h2. Empty Molecules Due to SameAs
There used to be empty object molecules due to bad sameAs
{noformat} len=0{noformat}
These are now eliminated since Josh puts sameAs in separate files, which we don't load. To remove the leftover entries from the file:
{code:bash}grep -v id/codex molecules-myIndex.txt > a
mv a molecules-myIndex.txt {code}

h2. Empty Term Molecules
There are some empty terms: 5 in RKD, 215 in BM
{code:bash}grep len=0 molecules-thesIndex.txt{code}
- 15 that are referenced by related/broader, but not defined, eg:
- 200 normal-looking terms, eg thesauri/x114719. Most of these are again referenced by related/broader but not defined on their own.
- To find other errors, we filter by word "broader/related". But there are cases when a related term is on a second line:
skos:related idThes:x112260,
So we throw out terms that appear several times:
egrep -h "x103041|x103042|x103044|x103295|x103504|x103733|x103811|x103822|x103825|x103904|x103946|x103966|x103971|x103973|x104045|x104168|x104190|x104649|x104972|x104977|x105121|x105159|x105181|x105286|x105372|x105393|x105441|x105563|x105569|x105793|x105946|x105948|x105983|x106505|x107114|x107191|x107202|x107461|x107464|x107545|x107547|x107590|x107593|x107598|x107761|x107908|x108000|x108011|x108083|x108156|x108305|x108306|x108312|x108414|x108433|x108573|x108575|x108635|x108702|x108703|x108722|x108768|x108787|x108789|x108791|x108977|x109030|x109476|x109514|x109524|x109531|x109534|x109562|x109613|x109616|x109666|x109761|x109914|x110120|x110157|x110194|x110197|x110198|x110199|x110266|x110286|x110454|x110652|x110702|x110730|x110731|x110767|x110786|x110792|x110793|x110803|x110852|x110886|x111003|x111112|x111135|x111222|x111246|x111251|x111343|x111363|x111364|x111366|x111367|x111423|x111430|x111432|x111435|x111436|x111493|x111495|x111613|x111665|x111666|x111710|x112100|x112154|x112170|x112213|x112214|x112260|x112261|x112306|x112307|x112338|x112345|x112483|x112489|x112490|x112519|x112744|x112812|x112992|x113048|x113200|x113376|x113408|x113445|x113630|x113652|x113750|x113752|x113753|x113756|x113927|x114004|x114005|x114224|x114294|x114296|x114480|x114490|x114525|x114528|x114529|x114530|x114719|x115092|x115211|x115230|x115231|x115232|x115234|x115352|x115549|x115564|x115567|x116043|x116045|x116182|x116344|x116353|x116367|x116468|x116733|x116774|x116776|x116777|x116853|x116855|x116870|x116872|x116926|x116944|x117261|x117535|x117555|x117615|x117619|x117837|x117917|x117918|x118082|x118436|x118491" * |egrep -v "(broader|related)" | sort | uniq -u > empty-BM-terms.txt

{jira:RS-1379} Josh:
- fix thesauri/modification/RP: no prefLabel, this is required by RForm
- Please check these 23 terms (all are in thesaurusandplace_1.trig)
- If a term is not defined, is it worth keeping it as related/broader? Eg idThes:x103973 is not defined, so what does it mean to say that "Sa'a-Ulawa" is a sub-group thereof?
idThes:x103967 a crm:E74_Group, skos:Concept;
skos:broader idThes:x103966, idThes:x103973;
skos:inScheme idThes:ethname;
skos:prefLabel "Sa'a-Ulawa";

Can we find this more easily with SPARQL? This finds all 250 terms without prefLabel, but doesn't say whether they are defined on their own:
{code}select * {?c a skos:Concept.
  filter not exists {?c skos:prefLabel ?l}}{code}

h1. Investigate Molecules

h2. Reformat
Mitac changed the dump format:
{noformat}<uri>, len=<length>, <phrase1>
- <phrase2> ...{noformat}

Put <phrase1> on its own bulleted line:
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-myIndex.txt > a
mv a molecules-myIndex.txt
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-thesIndex.txt > a
mv a molecules-thesIndex.txt

(This alternative is slightly broken in case len=0):
perl -pe 's{(len=\d+),\s+}{$1\n - }' molecules-myIndex.txt > a

h2. Molecule Lengths
- Extract lengths, sort by len (descending):
grep "^http.*len=[^0]" molecules-myIndex.txt | sort -nrs -t = -k 2 > molecule-sizes-20121219.txt

- Check molecules of identical size (especially large ones) for common junk
-- Previously:
260722: RFI42168 RFI42169 RFI42170 RFI42171 RFI42172 RFI42173 RFI42174... (167 objects)
124704: PPA356887 PPA356889 PPA356890 PPA356892 PPA356893 PPA356895... (106 objects)
-- Currently:
199486: RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136... (229 objects)
172289: EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981... (126 objects)
- Count repeated sizes
perl -ne 'print $1 if /len=(.*\n)/' molecule-sizes-20121219.txt | uniq -dc | sort -nr > repeated-sizes.txt
-- the top-3 repeated sizes are
1059: YCA71368 YCA71442 YCA71829... (775 objects)
891: YCA70997 YCA71927 YCA37921... (610 objects)
889: YCA71479 YCA71936 YCA10263... (598 objects)
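
The repeated-size count above can also be sketched in Python, which keeps the object URIs together with each size. This is only a minimal sketch: it assumes every line of the sizes file looks like {{<uri>, len=<length>}}, and the function names {{group_by_len}} and {{repeated_sizes}} are hypothetical, not part of Mitac's extractor.

{code:python}
import re
from collections import defaultdict

# Assumed line shape: "<uri>, len=<length>" (anything else is skipped)
LEN_RE = re.compile(r"^(\S+?),?\s+len=(\d+)")

def group_by_len(lines):
    """Map each molecule length to the list of URIs that have that length."""
    groups = defaultdict(list)
    for line in lines:
        m = LEN_RE.match(line)
        if m:
            groups[int(m.group(2))].append(m.group(1))
    return groups

def repeated_sizes(lines):
    """Return (count, length, uris) for lengths shared by 2+ molecules,
    most frequent first (ties broken by larger length)."""
    groups = group_by_len(lines)
    repeated = [(len(uris), size, uris)
                for size, uris in groups.items() if len(uris) > 1]
    return sorted(repeated, key=lambda t: (-t[0], -t[1]))
{code}

Calling {{repeated_sizes(open("molecule-sizes-20121219.txt"))}} would then list the sizes worth inspecting for common junk, together with the objects that share them.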

h2. Extract Molecule
Get molecule of specified objects
- Long repeated sizes are due to leaks:
perl RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136
perl EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981
Searching with one of these (eg RFC40130) returns all 229
- Short repeated sizes show no commonality (the repetition is a coincidence)
perl YCA71368 YCA71442 YCA71829
perl YCA71479 YCA71936 YCA10263

h2. Investigate Common Junk
The investigation started because:
- The FTS molecules of all BM objects are between 149998 and 152029 bytes (there are even exact matches!).
- BM objects are bigger than RKD, which are 8-22k.

Neither of these is normal. I guessed that a large amount of common junk text is collected, together with a small amount of per-object specific text. I.e. we have an FTS leak: starting from object1, the properties that we chase go into a sub-object of another object2, and then to all objects of the same collection.

h3. GAA21182 and GAA21183
(this is old)
- The FTS molecule is one long sequence of words.
- Luckily the order is approximately the same, else I couldn't have compared it
- First split it into shared and specific parts (lines) using emacs M-x compare-windows, then sorted the lines
The wdiff program is also useful, see below
- Then compared, to extract the specific parts
{code:bash}diff --suppress-common --side-by-side GAA21182.txt GAA21183.txt{code}
- Extracted text from TTL
{code:bash}perl -ne 'm{"(.+?)"} and print qq{$1\n}' GAA21182.ttl | sort{code}
- Made summary table showing all specific parts and TTL text.
First char is a code: the comment applies to all lines with same first char (no char means it's ok, i.e. in both TTL and FTS specific part)

| *object1* | *object2* | *comment, all with same first char* |
| | | |
| len=150307 | len=150081 | FTS len. Most of this is common junk |
| !1882 | !1978 | in TTL and in FTS common part, checked manually |
| !Acquisition date :: 1882 :: | !Acquisition date :: 1978 :: | |
| !Consists of :: glass | !Consists of :: glass | |
| !Purchased from :: Chester, Greville John :: | !Donated by :: Roberts, V G :: | |
| ?BM-GAA21182 | ?BM-GAA21183 | NuxeoID generated by Kasabov |
| 1882,0510.17 | 1978,0818.2 | |
| 2.50 | | |
| 3.00 | | |
| 448765 | 448764 | |
| Dimension | | |
| Dimension :: 2.50cm :: | | |
| Dimension :: 3.00cm :: | | |
| GAA21182 | GAA21183 | |
| Object type :: bead :: | Object type :: necklace :: | |
| Opaque black glass disc bead with a goose or swan stamped... | Necklace of thirty-seven blue glass beads (twenty-two cube.. | |
| Subject :: bird :: | Subject :: mammal :: | |
| Uses technique :: stamped :: | | |
| >bead | >neck-ornament | thesaurus term. TODO check all: skos:altLabel or skos:broader? |
| >cm | >necklace | |
| >inscription | >xian lian | |
| >stamped | | |
| >Length | | |
| >Width | | |
| -BM Dimension | | thesaurus name. TODO investigate, should not appear (skos:inScheme is not traversed) |
| -BM Inscription Type | | |
| -BM Technique | | |
| -The British Museum TECHNIQUE Concept Scheme | | |
| -QUDT Unit | | |
| .Information Object | | class name. Doesn't hurt and cannot discard it (rdfs:label is traversed even for rdf:Class) |
| .Man-Made Feature | | |
| .Measurement Unit | | |
| .Physical Feature | | |
| .Visual Item | | |

Here's a diff, keep in mind that the truncated lines are often several kb long

This statistic shows 99% common junk:
{code:bash}wdiff -123s GAA21182.txt GAA21183.txt{code}
GAA21182.txt: 24914 words : 24838 99% common : 56 0% deleted : 20 0% changed
GAA21183.txt: 24871 words : 24838 99% common : 21 0% inserted : 12 0% changed

I was able to isolate the [^common-junk.txt] by using commands like this several times:
{code:bash}wdiff -12 GAA21182.txt GAA21183.txt | grep -v === > c{code}

The question is which is the sub-object that becomes shared between all objects
- I searched for this common junk string: "Purchased from :: Gordon, Margot".
- it comes from PD_101119_PPA59031.rdf
- so I guessed that Acquisition is the shared sub-object.
- I verified [properties.txt|] (used for GetCompleteMO) and [LuceneIndexCreation.lucene|] (used for FTS indexing)
- in both we have this pair of properties:
| *property* | *used for* |
| P12i_was_present_at | Obj present at Event (eg exhibition, research) |
| P11_had_participant | [BM Association Mapping#Acquired Through (intermediary or contributor)] |

- unfortunately it causes this leak:
obj1 -> P12i -> obj1/acquisition -> P11<P12 -> seller/buyer=BM -> P12i -> obj2/acquisition
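
The leak path above can be reproduced on a toy graph. This is a minimal sketch, not the real indexer: the node names and the {{EDGES}} data are made up for illustration, and {{reachable}} is a hypothetical helper. The BM node carries P12i edges to every acquisition because P11 is a sub-property of P12, so each P11 statement implies an inverse P12i statement on the participant.

{code:python}
from collections import deque

# Toy graph (hypothetical data): node -> list of (property, target)
EDGES = {
    "obj1": [("P12i_was_present_at", "obj1/acquisition")],
    "obj1/acquisition": [("P11_had_participant", "BM")],
    "obj2": [("P12i_was_present_at", "obj2/acquisition")],
    "obj2/acquisition": [("P11_had_participant", "BM")],
    # BM took part in every acquisition, so P12i on BM fans out to all of them
    "BM": [("P12i_was_present_at", "obj1/acquisition"),
           ("P12i_was_present_at", "obj2/acquisition")],
}

def reachable(start, edges, allowed):
    """Breadth-first walk from start, following only the allowed properties."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for prop, target in edges.get(node, []):
            if prop in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen
{code}

With both properties allowed, the walk from obj1 reaches obj2/acquisition. With only P12i allowed, it stops at obj1/acquisition.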

h3. RKD Objects
From the above I had a hunch the common junk comes (mostly?) from Acquisition.
All BM objects have the same acquirer (P22) and that's BM. So I found the RKD objects acquired by Mauritshuis:
{code:bash}grep "crm:P22_transferred_title_to <" *.ttl{code}


Then I extracted the corresponding molecules (2926.txt, 2946.txt, 3048.txt) and one that's not acquired by Mauritshuis (53707.txt).
The hunch was confirmed: common junk is much higher between objects with common acquirer:
> wdiff -123s 2926.txt 3048.txt
2926.txt: 2020 words 1031 51% common 134 6% deleted 855 42% changed
3048.txt: 2582 words 1031 39% common 391 15% inserted 1160 44% changed
> wdiff -123s 2926.txt 2946.txt
2926.txt: 2020 words 1124 55% common 306 15% deleted 590 29% changed
2946.txt: 1884 words 1124 59% common 181 9% inserted 579 30% changed
> wdiff -123s 2926.txt 53707.txt
2926.txt: 2020 words 522 25% common 0 0% deleted 1498 74% changed
53707.txt: 1820 words 522 28% common 10 0% inserted 1288 70% changed

h3. Shadow Objects
See [Business Properties#Shadow Object with Shared Images]

- len=199309: 229 objects "Photograph (black and white) from an album"
RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136 ...
Eg the first one includes foreign ids: BM-RFC40144, RFC40345, BM-RFC40143, BM-RFC40142...
- len=172269: 126 photographic negatives by Kissling, Werner, Ruatoki
EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981 EPF109982 EPF109994 EPF109995 EPF109996 EPF109997 EPF109998
- len=109100: 133 lithograph prints by Raffet, Denis Auguste Marie (sometimes with coauthors)
PPA368462 PPA368469 PPA368474 PPA368477 PPA368481 PPA368482 PPA368485 PPA368492 PPA368495 PPA368496 PPA368499 PPA368502
- len=100297: 19 bracelets, gaming sticks etc by Northwest Coast Peoples
ENA121769 ENA121770 ENA121789 ENA121790 ENA121791 ENA121828 ENA121829 ENA121830 ENA121831 ENA121832 ENA121834 EOC115652
(!) Not sure why this happens, the first 4 don't have any images

We hot-fixed this problem by deleting such shadow objects, which caused another bug: RKD images disappeared (their objects don't have BM as owner).

The same bug appears again.

h2. Investigate Susana FTS Molecule
{jira:RS-981} Anna
After [#Investigate Common Junk] is eliminated, check the molecule for Susana to ensure it's the same as the free text in Turtle

h1. Fixes

h2. Focused FTS Indexing
The problem was that the properties chased for the FTS index form a loop, so text leaks from one object to another.
Mitac patched OWLIM's Lucene module to do the same as getCompleteMO:
- use properties.txt to limit which properties are traversed
- when it reaches a skos:Concept, it uses a much reduced set of properties (only labels)
- if it hits FC70_Thing a second time, it cuts off
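
The three rules above can be sketched as a small traversal. This is one reading of the rules under stated assumptions, not Mitac's actual patch: the data model (dicts of edges, literals and types), the {{LABEL_PROPS}} set and the function name {{collect_molecule}} are all hypothetical.

{code:python}
# Assumed "only labels" property set for skos:Concept nodes
LABEL_PROPS = {"skos:prefLabel", "skos:altLabel", "rdfs:label"}

def collect_molecule(start, edges, literals, types, allowed):
    """Collect literal texts reachable from start.

    edges:    node -> [(property, target node)]
    literals: node -> [(property, text)]
    types:    node -> set of class names
    allowed:  properties that may be traversed (the role of properties.txt)
    """
    texts, seen, stack = [], {start}, [start]
    while stack:
        node = stack.pop()
        is_concept = "skos:Concept" in types.get(node, set())
        props = LABEL_PROPS if is_concept else allowed
        for prop, text in literals.get(node, []):
            if prop in props:
                texts.append(text)
        if is_concept:
            continue  # terms contribute labels only; do not traverse onwards
        for prop, target in edges.get(node, []):
            if prop not in allowed or target in seen:
                continue
            if "FC70_Thing" in types.get(target, set()):
                continue  # a second FC70_Thing: cut off here
            seen.add(target)
            stack.append(target)
    return texts
{code}

The start object counts as the first FC70_Thing, so any other FC70_Thing reached during the walk is skipped together with everything behind it.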

h2. Misc Notes

- Extracted the properties from all BM configs, put in [BMX Issues^BM-properties.xls], marked in red the ones I have objections about
- To analyze all possible loops, involving not just properties but also their superproperties
- Extracted sub-prop hierarchy from ecrm-current.ttl: [^ecrm_subPropertyOf.txt]
-- found a mistake, reported to ECRM mlist
-- the hierarchy is not very deep (3-4 levels)
-- Idea: extract with SPARQL query
- Added RSO and BMO subproperties, check against properties in use (all RSO, and BM-properties.xls)
- May have to distinguish properties for FullMO vs properties for FTS.
Eg P4_has_time-span is not needed for FTS, because there's no point FTS-indexing a date string
- Make RKD thesauri same as BM thesauri (skos:prefLabel instead of rdfs:label or P1/P3)
- Specific properties
-- investigate where this comes from
<object/PPA59031> crm:P15i_influenced <object/PPA59031/acquisition>
- P138 creates a potential loop since it's the inverse of P138i, whose superproperty is P67i:
P67i_is_referred_to_by P70i_is_documented_in P138i_has_representation
-- Can we possibly eliminate P3?
--- For their important labels, objects are supposed to use rdfs:label, and terms skos:prefLabel
--- We can get other interesting texts (eg bmo:PX_physical_description) explicitly
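
The loop analysis mentioned in these notes (properties together with their superproperties) could be automated roughly as follows. This is a sketch under assumptions: properties are identified by short ids such as P11 or P12i, a trailing "i" marks the inverse, the sub-property hierarchy is given for forward properties only and is mirrored on the inverses, and the helper names are hypothetical.

{code:python}
def split_id(pid):
    """Return (base id, is_inverse), e.g. "P12i" -> ("P12", True)."""
    return (pid[:-1], True) if pid.endswith("i") else (pid, False)

def inverse(pid):
    base, inv = split_id(pid)
    return base if inv else base + "i"

def supers(pid, sub_of):
    """pid plus all transitive superproperties, per sub_of (child -> parents)."""
    base, inv = split_id(pid)
    found, stack = {base}, [base]
    while stack:
        for parent in sub_of.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return {b + "i" if inv else b for b in found}

def back_edges(listed, sub_of):
    """Pairs (p, q) of listed properties where q can walk back over an edge
    created by p, i.e. q is the inverse of p or a superproperty of it."""
    listed = set(listed)
    return sorted((p, q) for p in listed
                  for q in supers(inverse(p), sub_of) & listed)
{code}

For the leak found earlier, {{back_edges(["P12i", "P11"], {"P11": ["P12"]})}} reports the pair (P11, P12i). A reported pair is only a candidate: whether it really leaks depends on how many nodes share the intermediate resource.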