
{toc}

{jira:RS-151}
{jira:RS-981}
{jira:RS-983}
{jira:RS-980}

h1. Intro
Mitac wrote a molecule extractor that saves the FTS molecules for all objects:
[http://researchspace.ontotext.com/molecules.rar]
- molecules-myIndex.txt: objects: traverse [Complete Museum Object], collect all literals, plus prefLabels of terms
- molecules-thesIndex.txt: thesauri: prefLabels and altLabels of terms

h2. Empty Molecules Due to SameAs
There used to be empty object molecules due to bad sameAs
{noformat}http://collection.britishmuseum.org/id/codex/457748 len=0{noformat}
These are now eliminated, since Josh puts sameAs statements in separate files that we don't load. To remove such entries from an existing dump file:
{code:bash}grep -v id/codex molecules-myIndex.txt > a
mv a molecules-myIndex.txt {code}

h2. Empty Term Molecules
There are some empty terms: 5 in RKD, 215 in BM
{code:bash}grep len=0 molecules-thesIndex.txt{code}
[^empty-terms.txt]
- 15 that are referenced by related/broader, but not defined, eg:
{noformat}
place/-not-found-in-the-place-thesaurus
thesauri/-sword-fitting-not-found-in-the-object-thesaurus
thesauri/-theatre/amphitheatre-not-found-in-the-subject-thesaurus
{noformat}
- 200 normal-looking terms, eg thesauri/x114719. Most of these are again referenced by related/broader but not defined on their own.
- To find other errors, we filter out lines containing "broader" or "related". But there are cases when a related term is on a continuation line:
{noformat}
idThes:x5814
skos:related idThes:x112260,
idThes:x112261.
{noformat}
So we throw out terms that appear several times:
{code:bash}
egrep -h "x103041|x103042|x103044|x103295|x103504|x103733|x103811|x103822|x103825|x103904|x103946|x103966|x103971|x103973|x104045|x104168|x104190|x104649|x104972|x104977|x105121|x105159|x105181|x105286|x105372|x105393|x105441|x105563|x105569|x105793|x105946|x105948|x105983|x106505|x107114|x107191|x107202|x107461|x107464|x107545|x107547|x107590|x107593|x107598|x107761|x107908|x108000|x108011|x108083|x108156|x108305|x108306|x108312|x108414|x108433|x108573|x108575|x108635|x108702|x108703|x108722|x108768|x108787|x108789|x108791|x108977|x109030|x109476|x109514|x109524|x109531|x109534|x109562|x109613|x109616|x109666|x109761|x109914|x110120|x110157|x110194|x110197|x110198|x110199|x110266|x110286|x110454|x110652|x110702|x110730|x110731|x110767|x110786|x110792|x110793|x110803|x110852|x110886|x111003|x111112|x111135|x111222|x111246|x111251|x111343|x111363|x111364|x111366|x111367|x111423|x111430|x111432|x111435|x111436|x111493|x111495|x111613|x111665|x111666|x111710|x112100|x112154|x112170|x112213|x112214|x112260|x112261|x112306|x112307|x112338|x112345|x112483|x112489|x112490|x112519|x112744|x112812|x112992|x113048|x113200|x113376|x113408|x113445|x113630|x113652|x113750|x113752|x113753|x113756|x113927|x114004|x114005|x114224|x114294|x114296|x114480|x114490|x114525|x114528|x114529|x114530|x114719|x115092|x115211|x115230|x115231|x115232|x115234|x115352|x115549|x115564|x115567|x116043|x116045|x116182|x116344|x116353|x116367|x116468|x116733|x116774|x116776|x116777|x116853|x116855|x116870|x116872|x116926|x116944|x117261|x117535|x117555|x117615|x117619|x117837|x117917|x117918|x118082|x118436|x118491" * |egrep -v "(broader|related)" | sort | uniq -u > empty-BM-terms.txt
{code}
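The hand-pasted ID list above is hard to maintain. A minimal sketch of the same filter, assuming empty term URIs in the dump end in /x<digits> and that the thesaurus files are passed as arguments. The function names are illustrative and not part of the extractor; unlike the egrep above, this matches IDs as whole words only.

```bash
# List the IDs (x<digits>) of terms whose molecule is empty.
# Assumption: each header line holds the term URI and "len=<n>".
empty_term_ids() {
  grep -E 'len=0([^0-9]|$)' "$1" | grep -oE '/x[0-9]+' | tr -d '/' | sort -u
}

# Print lines that mention an empty term, are not broader/related
# statements, and occur exactly once.
# usage: undefined_term_lines DUMP FILE...
undefined_term_lines() {
  dump=$1; shift
  ids=$(mktemp)
  empty_term_ids "$dump" > "$ids"
  # -F: fixed strings, -w: whole words, -f: read patterns from file
  grep -h -F -w -f "$ids" "$@" | grep -Ev 'broader|related' | sort | uniq -u
  rm -f "$ids"
}
```

Possible usage, with the file names from this page: undefined_term_lines molecules-thesIndex.txt *.trig > empty-BM-terms.txt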

{jira:RS-1379} Josh:
- fix thesauri/modification/RP: no prefLabel, this is required by RForm
- Please check these 23 terms (all are in thesaurusandplace_1.trig)
[^empty-BM-terms.txt]
- If a term is not defined, is it worth keeping it as related/broader? Eg idThes:x103973 is not defined, so what does it mean to say that "Sa'a-Ulawa" is a sub-group thereof?
{noformat}
idThes:x103967 a crm:E74_Group, skos:Concept;
skos:broader idThes:x103966, idThes:x103973;
skos:inScheme idThes:ethname;
skos:prefLabel "Sa'a-Ulawa";
{noformat}

Can we find this more easily with SPARQL? This finds all 250 terms without prefLabel, but doesn't say whether they are defined on their own:
{code:sql}
select * {?c a skos:Concept.
filter not exists {?c skos:prefLabel ?l}}
{code}

h1. Investigate Molecules

h2. Reformat
Mitac changed the dump format:
{noformat}
<uri>, len=<length>, <phrase1>
- <phrase2> ...
{noformat}

Put <phrase1> on its own bulleted line:
{noformat}
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-myIndex.txt > a
mv a molecules-myIndex.txt
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-thesIndex.txt > a
mv a molecules-thesIndex.txt
{noformat}

(This alternative is slightly broken in case len=0):
{noformat}
perl -pe "s{(len=\d+),\s+}{$1\n - }" molecules-myIndex.txt > a
{noformat}
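For shells where $1 or \1 inside double quotes is awkward, the same reformatting can be sketched in awk. This is a minimal sketch, assuming header lines look like <uri>, len=<length>, <phrase1> and that an empty molecule has nothing after the length; the function name reformat_dump is illustrative.

```bash
# Move <phrase1> from the header line onto its own " - " line.
# Lines with no phrase after "len=<n>," (e.g. len=0) are left unchanged.
reformat_dump() {
  awk '{
    if (match($0, /len=[0-9]+/)) {
      head = substr($0, 1, RSTART + RLENGTH - 1)   # up to and including len=<n>
      tail = substr($0, RSTART + RLENGTH)          # whatever follows it
      if (sub(/^,[ \t]+/, "", tail) && tail != "") {
        print head
        print " - " tail
        next
      }
    }
    print
  }' "$1"
}
```

Possible usage: reformat_dump molecules-myIndex.txt > a, then mv a molecules-myIndex.txt as above.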

h2. Molecule Lengths
- Extract lengths, sort by len (descending):
{noformat}
grep "^http.*len=[^0]" molecules-myIndex.txt | sort -nrs -t = -k 2 > molecule-sizes-20121219.txt
{noformat}

- Check molecules of identical size (especially large ones) for common junk
-- Previously:
260722: RFI42168 RFI42169 RFI42170 RFI42171 RFI42172 RFI42173 RFI42174... (167 objects)
124704: PPA356887 PPA356889 PPA356890 PPA356892 PPA356893 PPA356895... (106 objects)
...
-- Currently:
199486: RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136... (229 objects)
172289: EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981... (126 objects)
...
- Count repeated sizes
{noformat}
perl -ne "/len=(.*\n)/; print $1" molecule-sizes-20121219.txt | uniq -dc | sort -nr > repeated-sizes.txt
{noformat}
-- the top-3 repeated sizes are
1059: YCA71368 YCA71442 YCA71829... (775 objects)
891: YCA70997 YCA71927 YCA37921... (610 objects)
889: YCA71479 YCA71936 YCA10263... (598 objects)
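The counting step can also be sketched without relying on the input already being sorted by size. This assumes one header line per object, starting with http and containing a single len=<n>; the function name and output format are illustrative.

```bash
# Print "<size>: <count> objects" for every molecule size shared by more
# than one object, most frequent first. Empty molecules are skipped.
repeated_sizes() {
  grep '^http' "$1" |
    sed -n 's/.*len=\([0-9][0-9]*\).*/\1/p' |   # keep only the length
    grep -v '^0$' |
    sort -n | uniq -dc |                         # count sizes that repeat
    sort -k1,1nr -k2,2nr |                       # by count, then by size
    awk '{ print $2 ": " $1 " objects" }'
}
```

Possible usage: repeated_sizes molecule-sizes-20121219.txt > repeated-sizes.txt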

h2. Extract Molecule
Get molecule of specified objects
- Long repeated sizes are due to leaks:
{noformat}
perl fts-get.pl RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136
perl fts-get.pl EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981
{noformat}
Searching with one of these (eg RFC40130) returns all 229
- Short repeated sizes show no commonality (the repetition is a coincidence)
{noformat}
perl fts-get.pl YCA71368 YCA71442 YCA71829
perl fts-get.pl YCA70997 YCA71927 YCA37921
perl fts-get.pl YCA71479 YCA71936 YCA10263
{noformat}
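fts-get.pl itself is not shown on this page. As a hypothetical stand-in, the extraction can be sketched like this, assuming the reformatted dump where each molecule is a header line starting with the object URI, followed by its phrase lines. The function name get_molecule is an assumption, not the real script's interface.

```bash
# Print the header and phrase lines of each requested object.
# usage: get_molecule DUMP ID...
get_molecule() {
  dump=$1; shift
  for id in "$@"; do
    awk -v id="$id" '
      # New header line: does its URI end in /<id> (optionally followed by a comma)?
      /^http/ { keep = ($1 ~ ("/" id ",?$")) }
      # Print the header and every following line while the flag is set.
      keep
    ' "$dump"
  done
}
```

Possible usage: get_molecule molecules-myIndex.txt RFC40130 > RFC40130.txt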

h2. Investigate Common Junk
The investigation started because:
- The FTS molecules of all BM objects are between 149998 and 152029 bytes (there are even exact matches!).
- BM objects are bigger than RKD, which are 8-22k.

Neither of these is normal. I guessed that a large amount of common junk text is collected, together with a small amount of per-object specific text. I.e. we have an FTS leak: starting from object1, the properties that we chase go into a sub-object of another object2, and then into all objects of the same collection.

h3. GAA21182 and GAA21183
(this is old)
- The FTS molecule is one long sequence of words.
- Luckily the order is approximately the same, else I couldn't have compared it
- First split it into shared and specific parts (lines) using emacs M-x compare-windows, then sorted
The wdiff program is also useful, see below
- Then compared, to extract the specific parts
{code:bash}diff --suppress-common --side-by-side GAA21182.txt GAA21183.txt{code}
- Extracted text from TTL
{code:bash}perl -ne "m{\"(.+?)\"} and print qq{$1\n}" GAA21182.ttl|sort{code}
- Made summary table showing all specific parts and TTL text.
First char is a code: the comment applies to all lines with same first char (no char means it's ok, i.e. in both TTL and FTS specific part)

| *object1* | *object2* | *comment, all with same first char* |
| http://collection.britishmuseum.org/id/object/GAA21182 | http://collection.britishmuseum.org/id/object/GAA21183 | |
| len=150307 | len=150081 | FTS len. Most of this is common junk |
| !1882 | !1978 | in TTL and in FTS common part, checked manually |
| !Acquisition date :: 1882 :: | !Acquisition date :: 1978 :: | |
| !Consists of :: glass | !Consists of :: glass | |
| !Purchased from :: Chester, Greville John :: | !Donated by :: Roberts, V G :: | |
| ?BM-GAA21182 | ?BM-GAA21183 | NuxeoID generated by Kasabov |
| 1882,0510.17 | 1978,0818.2 | |
| 2.50 | | |
| 3.00 | | |
| 448765 | 448764 | |
| Dimension | | |
| Dimension :: 2.50cm :: | | |
| Dimension :: 3.00cm :: | | |
| GAA21182 | GAA21183 | |
| Object type :: bead :: | Object type :: necklace :: | |
| Opaque black glass disc bead with a goose or swan stamped... | Necklace of thirty-seven blue glass beads (twenty-two cube.. | |
| Subject :: bird :: | Subject :: mammal :: | |
| Uses technique :: stamped :: | | |
| >bead | >neck-ornament | thesaurus term. TODO check all: skos:altLabel or skos:broader? |
| >cm | >necklace | |
| >inscription | >xian lian | |
| >stamped | | |
| >Length | | |
| >Width | | |
| -BM Dimension | | thesaurus name. TODO investigate, should not appear (skos:inScheme is not traversed) |
| -BM Inscription Type | | |
| -BM Technique | | |
| -The British Museum TECHNIQUE Concept Scheme | | |
| -QUDT Unit | | |
| .Information Object | | class name. Doesn't hurt and cannot discard it (rdfs:label is traversed even for rdf:Class) |
| .Man-Made Feature | | |
| .Measurement Unit | | |
| .Physical Feature | | |
| .Visual Item | | |

Here's a diff; keep in mind that the truncated lines are often several kB long
!FTS-diff.png!

This statistic shows 99% common junk:
{code:bash}wdiff -123s GAA21182.txt GAA21183.txt{code}
{noformat}
GAA21182.txt: 24914 words : 24838 99% common : 56 0% deleted : 20 0% changed
GAA21183.txt: 24871 words : 24838 99% common : 21 0% inserted : 12 0% changed
{noformat}
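Where wdiff is not installed, a rough cross-check is possible with standard tools. This sketch ignores word order, so it can only overestimate the number of common words compared to wdiff. It is an approximation under that assumption, and the function names are illustrative.

```bash
# Print the words of a file one per line, sorted bytewise.
sorted_words() {
  tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | LC_ALL=C sort
}

# Report how many words of FILE1 also occur in FILE2 (counted as a multiset).
# usage: common_pct FILE1 FILE2
common_pct() {
  a=$(mktemp); b=$(mktemp)
  sorted_words "$1" > "$a"
  sorted_words "$2" > "$b"
  total=$(wc -l < "$a" | tr -d ' ')
  # comm -12 keeps only lines present in both sorted inputs
  common=$(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' ')
  rm -f "$a" "$b"
  if [ "$total" -eq 0 ]; then pct=0; else pct=$((100 * common / total)); fi
  echo "$total words, $common common ($pct%)"
}
```

Possible usage: common_pct GAA21182.txt GAA21183.txt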

I was able to isolate the [^common-junk.txt] by using commands like this several times:
{code:bash}wdiff -12 GAA21182.txt GAA21183.txt | grep -v === > c{code}

The question is which is the sub-object that becomes shared between all objects
- I searched for this common junk string: "Purchased from :: Gordon, Margot".
- it comes from http://collection.britishmuseum.org/id/object/PPA59031/acquisition in PD_101119_PPA59031.rdf
- so I guessed that Acquisition is the shared sub-object.
- I verified [properties.txt|https://svn.ontotext.com/svn/researchspace/trunk/entity-api/src/resources/properties.txt] (used for GetCompleteMO) and [LuceneIndexCreation.lucene|https://svn.ontotext.com/svn/researchspace/trunk/data/LuceneIndexCreation.lucene] (used for FTS indexing)
- in both we have this pair of properties:
| *property* | *used for* |
| P12i_was_present_at | Obj present at Event (eg exhibition, research) |
| P11_had_participant | [BM Association Mapping#Acquired Through (intermediary or contributor)] |

- unfortunately it causes this leak:
obj1 -> P12i -> obj1/acquisition -> P11<P12 -> seller/buyer=BM -> P12i -> obj2/acquisition
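
The leak path can be reproduced on toy data. A minimal sketch, assuming triples are stored one per line as "subject property object" and that the indexer follows every listed property transitively. The function name, the file layout and any sample triples are illustrative, not the actual indexer.

```bash
# Print every node reachable from START by chasing the listed properties.
# usage: reachable TRIPLES START PROPERTY...
reachable() {
  triples=$1; start=$2; shift 2
  awk -v start="$start" -v props="$*" '
    BEGIN { k = split(props, p, " "); for (i = 1; i <= k; i++) chase[p[i]] = 1 }
    ($2 in chase) { n[$1]++; edge[$1, n[$1]] = $3 }   # keep only chased properties
    END {
      queue[1] = start; seen[start] = 1; head = 1; tail = 1
      while (head <= tail) {                           # breadth-first walk
        node = queue[head++]
        print node
        for (i = 1; i <= n[node]; i++) {
          target = edge[node, i]
          if (!(target in seen)) { seen[target] = 1; queue[++tail] = target }
        }
      }
    }' "$triples"
}
```

With sample triples that link obj1 to obj1/acquisition (P12i), obj1/acquisition to BM (P11), and BM to both acquisitions (P12i, assumed to be present through inference), chasing both properties from obj1 also reaches obj2/acquisition, while chasing P12i alone stops at obj1/acquisition.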

h3. RKD Objects
From the above I had a hunch the common junk comes (mostly?) from Acquisition.
All BM objects have the same acquirer (P22) and that's BM. So I found the RKD objects acquired by Mauritshuis:
{code:bash}grep "crm:P22_transferred_title_to <http://rkd.nl/thesaurus/institution/Koninklijk_Kabinet_van_Schilderijen_Mauritshuis" *.ttl{code}

{noformat}
05_BadendeSusana.xml.ttl:<obj/2926/collection/5/entry>
07_man_met_baret.xml.ttl:<obj/2946/collection/4/entry>
08_NicolaesTulp.xml.ttl:<obj/3048/collection/3/entry>
10_oude_vrouw.xml.ttl:<obj/2952/collection/3/entry>
11_Andromeda.xml.ttl:<obj/2940/collection/5/entry>
12_lachende_man.xml.ttl:<obj/3064/collection/7/entry>
{noformat}

Then I extracted the corresponding molecules (2926.txt, 2946.txt, 3048.txt) and one for an object that's not acquired by Mauritshuis (53707.txt).
The hunch was confirmed: the share of common junk is much higher between objects with a common acquirer (39-59% common words, versus 25-28% against the unrelated object):
{noformat}
> wdiff -123s 2926.txt 3048.txt
2926.txt: 2020 words 1031 51% common 134 6% deleted 855 42% changed
3048.txt: 2582 words 1031 39% common 391 15% inserted 1160 44% changed
> wdiff -123s 2926.txt 2946.txt
2926.txt: 2020 words 1124 55% common 306 15% deleted 590 29% changed
2946.txt: 1884 words 1124 59% common 181 9% inserted 579 30% changed
> wdiff -123s 2926.txt 53707.txt
2926.txt: 2020 words 522 25% common 0 0% deleted 1498 74% changed
53707.txt: 1820 words 522 28% common 10 0% inserted 1288 70% changed
{noformat}

h3. Shadow Objects
See [Business Properties#Shadow Object with Shared Images]
{jira:RS-1375}

- len=199309: 229 objects "Photograph (black and white) from an album"
RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136 ...
Eg the first one includes foreign ids: BM-RFC40144, RFC40345, BM-RFC40143, BM-RFC40142...
- len=172269: 126 photographic negatives by Kissling, Werner, Ruatoki
EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981 EPF109982 EPF109994 EPF109995 EPF109996 EPF109997 EPF109998
- len=109100: 133 lithograph prints by Raffet, Denis Auguste Marie (sometimes with coauthors)
PPA368462 PPA368469 PPA368474 PPA368477 PPA368481 PPA368482 PPA368485 PPA368492 PPA368495 PPA368496 PPA368499 PPA368502
- len=100297: 19 bracelets, gaming sticks etc by Northwest Coast Peoples
ENA121769 ENA121770 ENA121789 ENA121790 ENA121791 ENA121828 ENA121829 ENA121830 ENA121831 ENA121832 ENA121834 EOC115652
(!) Not sure why this happens, the first 4 don't have any images

We hot-fixed this problem by deleting such shadow objects, which caused another bug: RKD images disappeared (their objects don't have BM as owner).

The same bug appears again.

h2. Investigate Susana FTS Molecule
{jira:RS-981} Anna
After [#Investigate Common Junk] is eliminated, check the molecule for Susana to ensure it's the same as the free text in Turtle

h1. Fixes

h2. Focused FTS Indexing
{jira:RS-977}
{jira:RS-1139}
The problem was that the properties chased for the FTS index form a loop, so text leaks from one object to another.
Mitac patched OWLIM's Lucene module to do the same as getCompleteMO:
- use properties.txt to limit which properties are traversed
- when it reaches a skos:Concept, it uses a much reduced set of properties (only labels)
- if it hits FC70_Thing a second time, it cuts off

h2. Misc Notes

- Extracted the properties from all BM configs, put in [BMX Issues^BM-properties.xls], marked in red the ones I have objections about
{jira:RS-934}
- To analyze all possible loops, involving not just properties but also their superproperties
{jira:RS-680}
- Extracted sub-prop hierarchy from ecrm-current.ttl: [^ecrm_subPropertyOf.txt]
-- found a mistake, reported to ECRM mlist
-- the hierarchy is not very deep (3-4 levels)
-- Idea: extract with SPARQL query
- Added RSO and BMO subproperties, check against properties in use (all RSO, and BM-properties.xls)
- May have to distinguish properties for FullMO vs properties for FTS.
Eg P4_has_time-span is not needed for FTS, because there's no point FTS-indexing a date string
- Make RKD thesauri same as BM thesauri (skos:prefLabel instead of rdfs:label or P1/P3)
- Specific properties
-- investigate where this comes from
<object/PPA59031> crm:P15i_influenced <object/PPA59031/acquisition>
- P138 creates a potential loop since it's inverse of P138i and superprop P67i:
P67i_is_referred_to_by P70i_is_documented_in P138i_has_representation
P138_represents
-- Can we possibly eliminate P3?
--- For their important labels, objects are supposed to use rdfs:label, and terms
--- We can get other interesting texts (eg bmo:PX_physical_description) explicitly