Current by Vladimir Alexiev
on Feb 13, 2013 12:52.

{toc}

h1. FTS Molecule Extractor
{jira:RS-151}
{jira:RS-981}
{jira:RS-983}
{jira:RS-980}

h1. Intro
Mitac wrote a molecule extractor that saves the FTS molecules for all objects:
[http://researchspace.ontotext.com/molecules.rar]
- molecules-myIndex.txt: objects: traverse [Complete Museum Object], collect all literals, plus prefLabels of terms
- molecules-thesIndex.txt: thesauri: prefLabels and altLabels of terms

h2. Empty Molecules Due to SameAs
There used to be empty object molecules due to bad sameAs
{noformat}http://collection.britishmuseum.org/id/codex/457748 len=0{noformat}
These are now eliminated, since Josh puts sameAs in separate files and we don't load them. To remove the stale entries from the file:
{code:bash}grep -v id/codex molecules-myIndex.txt > a
mv a molecules-myIndex.txt {code}

h2. Empty Term Molecules
There are some empty terms: 5 in RKD, 215 in BM
{code:bash}grep len=0 molecules-thesIndex.txt{code}
[^empty-terms.txt]
- 15 that are referenced by related/broader, but not defined, eg:
{noformat}
place/-not-found-in-the-place-thesaurus
thesauri/-sword-fitting-not-found-in-the-object-thesaurus
thesauri/-theatre/amphitheatre-not-found-in-the-subject-thesaurus
{noformat}
- 200 normal-looking terms, eg thesauri/x114719. Most of these are again referenced by related/broader but not defined on their own.
- To find other errors, we filter by word "broader/related". But there are cases when a related term is on a second line:
{noformat}
idThes:x5814
skos:related idThes:x112260,
idThes:x112261.
{noformat}
So we throw out terms that appear several times:
{code:bash}
egrep -h "x103041|x103042|x103044|x103295|x103504|x103733|x103811|x103822|x103825|x103904|x103946|x103966|x103971|x103973|x104045|x104168|x104190|x104649|x104972|x104977|x105121|x105159|x105181|x105286|x105372|x105393|x105441|x105563|x105569|x105793|x105946|x105948|x105983|x106505|x107114|x107191|x107202|x107461|x107464|x107545|x107547|x107590|x107593|x107598|x107761|x107908|x108000|x108011|x108083|x108156|x108305|x108306|x108312|x108414|x108433|x108573|x108575|x108635|x108702|x108703|x108722|x108768|x108787|x108789|x108791|x108977|x109030|x109476|x109514|x109524|x109531|x109534|x109562|x109613|x109616|x109666|x109761|x109914|x110120|x110157|x110194|x110197|x110198|x110199|x110266|x110286|x110454|x110652|x110702|x110730|x110731|x110767|x110786|x110792|x110793|x110803|x110852|x110886|x111003|x111112|x111135|x111222|x111246|x111251|x111343|x111363|x111364|x111366|x111367|x111423|x111430|x111432|x111435|x111436|x111493|x111495|x111613|x111665|x111666|x111710|x112100|x112154|x112170|x112213|x112214|x112260|x112261|x112306|x112307|x112338|x112345|x112483|x112489|x112490|x112519|x112744|x112812|x112992|x113048|x113200|x113376|x113408|x113445|x113630|x113652|x113750|x113752|x113753|x113756|x113927|x114004|x114005|x114224|x114294|x114296|x114480|x114490|x114525|x114528|x114529|x114530|x114719|x115092|x115211|x115230|x115231|x115232|x115234|x115352|x115549|x115564|x115567|x116043|x116045|x116182|x116344|x116353|x116367|x116468|x116733|x116774|x116776|x116777|x116853|x116855|x116870|x116872|x116926|x116944|x117261|x117535|x117555|x117615|x117619|x117837|x117917|x117918|x118082|x118436|x118491" * |egrep -v "(broader|related)" | sort | uniq -u > empty-BM-terms.txt
{code}
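
The ID list in the command above does not have to be typed by hand; it could be derived from the dump itself. A minimal sketch, assuming each empty BM term sits on a single line of {{molecules-thesIndex.txt}} that contains {{len=0}} and an ID of the form {{x<digits>}}. The function name {{empty_term_pattern}} is hypothetical.
{code:bash}
# Hypothetical helper: print the IDs of empty terms as an egrep alternation (x1|x2|...).
# Assumes each empty term is on one line containing "len=0" and an ID like x114719.
empty_term_pattern() {
  grep -E 'len=0([^0-9]|$)' "$1" |   # keep only molecules whose length is exactly 0
    grep -oE 'x[0-9]+' |             # pull out the term IDs
    sort -u |                        # de-duplicate
    paste -sd '|' -                  # join the IDs with "|" for egrep
}
{code}
The printed alternation can then be passed to the egrep command above in place of the literal list. Terms without an {{x<digits>}} ID (eg the RKD ones) are silently skipped by this sketch.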

h1. Investigate BM FTS Molecule
{jira:RS-1379} Josh:
- fix thesauri/modification/RP: no prefLabel, this is required by RForm
- Please check these 23 terms (all are in thesaurusandplace_1.trig)
[^empty-BM-terms.txt]
- If a term is not defined, is it worth keeping it as related/broader? Eg idThes:x103973 is not defined, so what does it mean to say that "Sa'a-Ulawa" is a sub-group thereof?
{jira:RS-983} {noformat}
idThes:x103967 a crm:E74_Group, skos:Concept;
skos:broader idThes:x103966, idThes:x103973;
skos:inScheme idThes:ethname;
skos:prefLabel "Sa'a-Ulawa";
{noformat}

Investigated objects GAA21182 and GAA21183 (TTL and FTS.txt are attached).
Can we find this more easily with SPARQL? This finds all 250 terms without prefLabel, but doesn't say whether they are defined on their own:
{code:sql}
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
select * {?c a skos:Concept.
  filter not exists {?c skos:prefLabel ?l}}
{code}

h1. Investigate Molecules

h2. Reformat
Mitac changed the dump format:
{noformat}
<uri>, len=<length>, <phrase1>
- <phrase2> ...
{noformat}

Put <phrase1> on its own bulleted line:
{noformat}
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-myIndex.txt > a
mv a molecules-myIndex.txt
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-thesIndex.txt > a
mv a molecules-thesIndex.txt
{noformat}

(This alternative is slightly broken in case len=0):
{noformat}
perl -pe "s{(len=\d+),\s+}{$1\n - }" molecules-myIndex.txt > a
{noformat}

h2. Molecule Lengths
- Extract lengths, sort by len (descending):
{noformat}
grep "^http.*len=[^0]" molecules-myIndex.txt | sort -nrs -t = -k 2 > molecule-sizes-20121219.txt
{noformat}

- Check molecules of identical size (especially large ones) for common junk
-- Previously:
260722: RFI42168 RFI42169 RFI42170 RFI42171 RFI42172 RFI42173 RFI42174... (167 objects)
124704: PPA356887 PPA356889 PPA356890 PPA356892 PPA356893 PPA356895... (106 objects)
...
-- Currently:
199486: RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136... (229 objects)
172289: EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981... (126 objects)
...
- Count repeated sizes
{noformat}
perl -ne "print /len=(.*\n)/" molecule-sizes-20121219.txt | uniq -dc | sort -nr > repeated-sizes.txt
{noformat}
-- the top-3 repeated sizes are
1059: YCA71368 YCA71442 YCA71829... (775 objects)
891: YCA70997 YCA71927 YCA37921... (610 objects)
889: YCA71479 YCA71936 YCA10263... (598 objects)
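
The per-size object lists shown above can be produced in one pass. A minimal sketch, assuming every line of the sizes file starts with the object URI and contains {{len=<digits>}}, and that the object id is the last path segment of the URI. The function name {{objects_by_size}} is hypothetical.
{code:bash}
# Hypothetical helper: list the objects that share each repeated molecule length.
# Assumes each input line starts with the object URI and contains "len=<digits>".
objects_by_size() {
  awk '
    match($0, /len=[0-9]+/) {
      len = substr($0, RSTART + 4, RLENGTH - 4)   # digits after "len="
      uri = $1; sub(/,$/, "", uri)                # first field, minus trailing comma
      n = split(uri, seg, "/")                    # object id = last URI segment
      ids[len] = ids[len] " " seg[n]
      count[len]++
    }
    END {
      for (len in count)
        if (count[len] > 1)                       # report only repeated sizes
          printf "%d\t%s:%s (%d objects)\n", count[len], len, ids[len], count[len]
    }
  ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2 | cut -f2-   # most frequent size first
}
{code}
Usage would be {{objects_by_size molecule-sizes-20121219.txt}}. Unlike the truncated lists above, this prints every object id of a group, so lines for large groups get long.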

h2. Extract Molecule
Get molecule of specified objects
- Long repeated sizes are due to leaks:
{noformat}
perl fts-get.pl RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136
perl fts-get.pl EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981
{noformat}
Searching with one of these (eg RFC40130) returns all 229
- Short repeated sizes show no commonality (the repetition is a coincidence)
{noformat}
perl fts-get.pl YCA71368 YCA71442 YCA71829
perl fts-get.pl YCA71479 YCA71936 YCA10263
{noformat}
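
{{fts-get.pl}} itself is not reproduced on this page. As a rough stand-in, the same lookup can be sketched with awk, assuming the reformatted dump (a header line {{<uri>, len=<n>}} starting with {{http}}, followed by that object's bulleted phrase lines). The function name {{fts_get}} and its argument order are hypothetical and may not match the real script.
{code:bash}
# Hypothetical stand-in for fts-get.pl: print the molecules of the given object ids.
# Usage: fts_get DUMP ID...
# Assumes header lines start with "http" and phrase lines do not.
fts_get() {
  dump=$1; shift
  awk -v ids="$*" '
    BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    /^http/ {                          # header line: decide whether this object is wanted
      uri = $1; sub(/,$/, "", uri)     # first field, minus trailing comma
      m = split(uri, seg, "/")         # object id = last URI segment
      keep = (seg[m] in want)
    }
    keep                               # print header and phrase lines of wanted objects
  ' "$dump"
}
{code}
For example, {{fts_get molecules-myIndex.txt RFC40130 RFC40131 > a.txt}} would put two molecules into one file for diffing.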

h2. Investigate Common Junk
The investigation started because:
- The FTS molecules of all BM objects are between 149998 and 152029 bytes (there are even exact matches!).
- BM objects are bigger than RKD, which are 8-22k.

Neither of these is normal. I guess that a large amount of common junk text is collected, together with a small amount of object-specific text. I.e. we have an FTS leak: starting from object1, the properties that we chase lead into a sub-object of another object2, and then to all objects of the same collection.

h3. GAA21182 and GAA21183
(this is old)
- The FTS molecule is one long sequence of words.
- Luckily the order is approximately the same, else I couldn't have compared it
| .Visual Item | | |

h2. BM Common Junk
Here's a diff; keep in mind that the truncated lines are often several kb long.
!FTS-diff.png!
obj1 -> P12i -> obj1/acquisition -> P11<P12 -> seller/buyer=BM -> P12i -> obj2/acquisition

h2. RKD Common Junk
h3. RKD Objects
From the above I had a hunch the common junk comes (mostly?) from Acquisition.
All BM objects have the same acquirer (P22) and that's BM. So I found the RKD objects acquired by Mauritshuis:
{noformat}
> grep "crm:P22_transferred_title_to <http://rkd.nl/thesaurus/institution/Koninklijk_Kabinet_van_Schilderijen_Mauritshuis" *.ttl
05_BadendeSusana.xml.ttl:<obj/2926/collection/5/entry>
07_man_met_baret.xml.ttl:<obj/2946/collection/4/entry>
{noformat}

h3. Shadow Objects
See [Business Properties#Shadow Object with Shared Images]
{jira:RS-1375}

- len=199309: 229 objects "Photograph (black and white) from an album"
RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136 ...
Eg the first one includes foreign ids: BM-RFC40144, RFC40345, BM-RFC40143, BM-RFC40142...
- len=172269: 126 photographic negatives by Kissling, Werner, Ruatoki
EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981 EPF109982 EPF109994 EPF109995 EPF109996 EPF109997 EPF109998
- len=109100: 133 lithograph prints by Raffet, Denis Auguste Marie (sometimes with coauthors)
PPA368462 PPA368469 PPA368474 PPA368477 PPA368481 PPA368482 PPA368485 PPA368492 PPA368495 PPA368496 PPA368499 PPA368502
- len=100297: 19 bracelets, gaming sticks etc by Northwest Coast Peoples
ENA121769 ENA121770 ENA121789 ENA121790 ENA121791 ENA121828 ENA121829 ENA121830 ENA121831 ENA121832 ENA121834 EOC115652
(!) Not sure why this happens, the first 4 don't have any images

We hot-fixed this problem by deleting such shadow objects, which caused another bug: RKD images disappeared (their objects don't have BM as owner).

The same bug appears again.

h2. Investigate Susana FTS Molecule
{jira:RS-981} Anna
After [#Investigate Common Junk] is eliminated, check the molecule for Susana to ensure it's the same as the free text in Turtle.

h1. TODO Fixes

h2. Focused FTS Indexing
{jira:RS-977}
{jira:RS-1139}
The problem was that the properties chased for the FTS index cause a loop, and thus leak from one object to another.
Mitac patched OWLIM's Lucene module to do the same as getCompleteMO:
- use properties.txt to limit which properties are traversed
- when it reaches a skos:Concept, it uses much reduced set of properties (only labels)
- if it hits FC70_Thing a second time, cuts off
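
The three rules above can be illustrated outside OWLIM with a small breadth-first traversal. This is only a sketch of the idea, not Mitac's patch. It assumes a hypothetical tab-separated triples file (subject, property, object) in which literals start with a double quote and types are given as {{rdf:type}} rows with the values {{skos:Concept}} and {{FC70_Thing}}, and it assumes "only labels" means {{skos:prefLabel}}. The function name {{focused_molecule}} is also hypothetical.
{code:bash}
# Hypothetical illustration of focused traversal (not the actual OWLIM patch).
# Usage: focused_molecule START TRIPLES.tsv PROPERTIES.txt
# PROPERTIES.txt must be non-empty: one property to follow per line.
focused_molecule() {
  awk -F '\t' -v start="$1" '
    FNR == NR { allowed[$1] = 1; next }            # first file: properties to follow
    $2 == "rdf:type" { type[$1, $3] = 1; next }    # second file: remember node types
    { n = ++deg[$1]; prop[$1, n] = $2; obj[$1, n] = $3 }
    END {
      queue[1] = start; head = 1; tail = 1; seen[start] = 1
      while (head <= tail) {
        node = queue[head++]
        concept = ((node, "skos:Concept") in type)
        for (i = 1; i <= deg[node]; i++) {
          p = prop[node, i]; o = obj[node, i]
          # rules 1 and 2: a concept contributes only its label, other nodes follow the list
          follow = concept ? (p == "skos:prefLabel") : (p in allowed)
          if (!follow) continue
          if (o ~ /^"/) { print o; continue }      # literal: part of the molecule
          if (o in seen) continue                  # already visited
          if ((o, "FC70_Thing") in type) continue  # rule 3: another object, cut off
          seen[o] = 1; queue[++tail] = o
        }
      }
    }
  ' "$3" "$2"
}
{code}
The output is the list of literals reachable from START, one per line. Note that the cut-off applies only at a second FC70_Thing, so text attached to a shared intermediate node (eg an acquirer) is still collected.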

h2. Misc Notes

- Extracted the properties from all BM configs, put in [BMX Issues^BM-properties.xls], marked in red the ones I have objections about, and attached to [BMX Issues]
{jira:RS-934}
- TODO: need to analyze in-depth all possible loops, involving not just properties but also their superproperties
{jira:RS-680}
- Extracted sub-prop hierarchy from ecrm-current.ttl: [^ecrm_subPropertyOf.txt]
-- found a mistake, reported to ECRM mlist
-- the hierarchy is not very deep (3-4 levels)
-- Idea: extract with SPARQL query
- Added RSO and BMO subproperties, check against properties in use (all RSO, and BM-properties.xls)
- Specific properties
-- investigate where this comes from:
<object/PPA59031> crm:P15i_influenced <object/PPA59031/acquisition>
- TODO: P138 creates a potential loop since it's inverse of P138i and superprop P67i:
P67i_is_referred_to_by P70i_is_documented_in P138i_has_representation
P138_represents
- Mitac (DONE?): patched OWLIM's Lucene module to do the same as FullMO:
when it reaches a skos:Concept, it uses much reduced set of properties (only labels)
- TODO: may have to distinguish properties for FullMO vs properties for FTS.
Eg P4_has_time-span is not needed for FTS, because there's no point FTS-indexing a date string
- TODO: make RKD thesauri same as BM thesauri (skos:prefLabel instead of rdfs:label or P1/P3)
-- Can we possibly eliminate P3?
--- For their important labels, objects are supposed to use rdfs:label, and terms
--- We can get other interesting texts (eg bmo:PX_physical_description) explicitly