
{excerpt}Counting and analysis of repository content{excerpt}

h2. Counting
- Total statements
select (count(*) as ?c) {?s ?p ?o}
- Statements per property.
Results are capped at 200 rows per query (max limit=200), so we fetch them in two portions:
select ?p (count(*) as ?c) {?s ?p ?o} group by ?p order by ?p
select ?p (count(*) as ?c) {?s ?p ?o} group by ?p order by ?p offset 200
- Class instances (one instance can have many rdf:type values!)
select ?t (count(*) as ?c) {?s rdf:type ?t} group by ?t order by ?t
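The two-portion retrieval of the per-property counts can be generalised to any number of pages. The following is a minimal Python sketch that only builds the query strings; the helper name {{paged_queries}} is hypothetical, and the explicit {{limit}} clause is an assumption (the queries above rely on the server-side cap instead). The {{order by ?p}} clause is what keeps the offsets stable between requests.

```python
def paged_queries(base_query, page_size, pages):
    """Return one query string per page by appending limit/offset clauses.

    Hypothetical helper: it only builds strings and does not contact any endpoint.
    """
    return [
        f"{base_query} limit {page_size} offset {page * page_size}"
        for page in range(pages)
    ]


# Per-property count query taken from the list above.
BASE = "select ?p (count(*) as ?c) {?s ?p ?o} group by ?p order by ?p"

# Page size 200 matches the cap mentioned above; two pages are assumed to be enough.
queries = paged_queries(BASE, page_size=200, pages=2)
for q in queries:
    print(q)
```

If the number of distinct properties grows beyond 400, increasing {{pages}} is the only change needed.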

h2. Analysis
We provide historical data, but focus on the latest data (BM-triples.xls of 2012-12).

h3. Properties
Without sameAs expansion: 89995389 (2.9M=3.1% fewer triples)

- rdf:type=58426160 is 62.9% of all triples (see breakdown below)
- Object (business) & thesauri triples are 26.0+4.9=30.9%, of which we can assume objects are 21% and thesauri 10%.
- FRs=5751214 are 6.2% of all triples, or 29% of business triples
- bmo:PX_physical_description=25584 ~ rso:FC70_Thing=23993 is 3x the 8k objects. This is due to owl:sameAs expansion.
- owl:sameAs=72010 is 9x the 8k objects.
Each object has 3 sameAs URIs (a,b,c), which causes 9 statements: aa bb cc ab bc ca ba cb ac
That's what an equivalence relation will do to you.
- skos:inScheme=357283 ~ skos:Concept=357318 is the total number of thesaurus terms
- skos:exactMatch=4495 come from RKD. E.g. rkd-plaats:renaix and rkd-plaats:renaix give 4 triples (2 symmetric, 2 reflexive)
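The sameAs and exactMatch counts above follow from the size of an equivalence class: n equivalent URIs yield n*n statements once the relation is fully materialised. A minimal Python sketch, assuming a reflexive, symmetric and transitive closure; the URI names are placeholders, not real identifiers:

```python
from itertools import product


def equivalence_statements(uris):
    """All ordered pairs (x, y) over one equivalence class, including x == y."""
    return list(product(uris, repeat=2))


# Three sameAs URIs per object (placeholders a, b, c).
same_as = equivalence_statements(["a", "b", "c"])
# Two URIs linked by exactMatch (placeholders p, q): 2 symmetric + 2 reflexive.
exact_match = equivalence_statements(["p", "q"])

print(len(same_as))      # 9
print(len(exact_match))  # 4

# Rough object count implied by the owl:sameAs total reported above.
print(72010 // len(same_as))  # 8001
```

The last line recovers the "8k objects" figure from the owl:sameAs count alone.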

h3. Types
- _:nodeXX=23528903: 40.3% of rdf:type triples are useless OWL DL restriction types
{noformat} crm:En_Whatever rdf:type [owl:Restriction...] {noformat}
We could eliminate these (24% of all triples) by:
-# Delete such statements *after* loading the ontologies and *before* loading the data
delete where {?e rdfs:subClassOf ?t. ?t a owl:Restriction}
-# Write a Perl script to cut down ECRM to RDFS + inverse (what Doerr wanted) + transitive
- CRM classes=30864964: 52.8% of rdf:type triples. The counts decrease going down the class hierarchy, as expected:
owl:Thing=3627096 ~ crm:E1_CRM_Entity=3626903
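The percentages in this breakdown are shares of the rdf:type total reported under Properties (58426160), not of all triples. A short Python check of that arithmetic, using only the counts quoted on this page; the dictionary labels are descriptive names chosen for this sketch:

```python
RDF_TYPE_TOTAL = 58426160  # rdf:type count from the Properties section

# Counts from the Types breakdown above.
breakdown = {
    "OWL restriction blank nodes": 23528903,
    "CRM classes": 30864964,
}


def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {label: share(count, RDF_TYPE_TOTAL) for label, count in breakdown.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```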

h3. Statements and MB