{toc}
h1. Old mapping
Source archive renamed to BM-data-full-old.rar
- Includes thesauri: biographical, bibliographical, thesaurusAndplace
- Test with one file (PrintsAndDrawings_133.rdf): 23576123 bytes RDF, 37302191 bytes NT (1.5822x more)
{noformat}perl -ne "/^(.*?) (.*?) (.*) \.$/; print qq{$1\n$3\n}"|sort|uniq{noformat}
-- this file is 1.13156x bigger than the average file
-- museum objects: 2440 (estimated total: 371.9k Prints and Drawings objects)
-- triples: 306386
-- thesaurus nodes: 2186: person-institution, department, thesauri, unit, dimension, event, exhibition, bibliography, series
-- *unique* literals: 18612
-- object-related nodes: 20922
-- blank nodes: 15066
- ratios
-- 76.95 bytes RDF per triple
-- 125.57 triples per object
-- about 25 nodes per object
- Totals
-- 14251190272 RDF/XML (14Gb unzipped, 338Mb rar)
-- 684 files: 20835073.5 bytes per file
-- 185.2 M triples
-- 36.87 M nodes
-- 1.475 M objects
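
The ratios and totals above can be reproduced from the sample-file counts. The following is a minimal Python sketch of that arithmetic. It assumes (the notes do not say so explicitly) that the totals were obtained by scaling the sample ratios up to the size of the full dump; the variable names are illustrative only.

```python
# Counts for the sample file PrintsAndDrawings_133.rdf, taken from the list above
sample_bytes = 23576123       # RDF/XML size of the sample file
sample_triples = 306386
sample_objects = 2440

# Size of the full dump, taken from the "Totals" list
total_bytes = 14251190272
total_files = 684

# Ratios measured on the sample file
bytes_per_triple = sample_bytes / sample_triples
triples_per_object = sample_triples / sample_objects

# How the sample file compares with the average file
bytes_per_file = total_bytes / total_files
size_vs_average = sample_bytes / bytes_per_file

# Assumed extrapolation: apply the sample ratios to the whole dump
est_triples = total_bytes / bytes_per_triple
est_objects = est_triples / triples_per_object

print(f"{bytes_per_triple:.2f} bytes RDF per triple")
print(f"{triples_per_object:.2f} triples per object")
print(f"sample is {size_vs_average:.5f}x the average file size")
print(f"{est_triples / 1e6:.1f} M triples, {est_objects / 1e6:.3f} M objects")
```

Because the estimate relies on a single Prints and Drawings file, it is only as good as the assumption that other departments have similar triples-per-object ratios.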

h1. New mapping
[http://dl.dropbox.com/u/57052428/P%26D.rar] (renamed to BM-data-PrintsAndDrawings.rar), only Prints and Drawings objects
{noformat}
riot --validate "P&D_133.rdf"
riot "P&D_133.rdf" > "P&D_133.nt"
perl -ne "/^(.*?) (.*?) (.*) \.$/; print qq{$1\n$3\n}" "P&D_133.nt" | sort | uniq > "P&D-subjects.txt"
perl -ne "/^(.*?) (.*?) (.*) \.$/; print qq{$2\n}" "P&D_133.nt" | sort | uniq > "P&D-props.txt"
{noformat}
- counts
-- size: 25.7Mb, just about average
-- museum objects: 2318 (estimated total: 387.1k Prints and Drawings objects)
-- triples: 251404
-- *unique* literals: 22468
-- object-related nodes: 36730+787 (codex/object + title)
-- blank nodes: 11897
- averages
-- 25723*1024/251404 = 104.77 bytes RDF per triple
-- 251404/2318 = 108.46 triples per object
- Data growth
-- 167/153 files = 1.092
-- 4278/3594 Mb = 1.190
-- 387.1/371.9 k Prints and Drawings objects = 1.04
-- new estimated total: 1.534 M objects
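
The averages and growth factors above can be checked with a few lines of Python. This is a sketch under two assumptions that the notes do not state: that the 387.1k estimate is the sample's object count multiplied by the 167 files (plausible because the sample is about average in size), and that the new total applies the growth factor rounded to 1.04. The variable names are illustrative.

```python
# Counts for the sample file P&D_133.rdf, taken from the list above
sample_kib = 25723            # file size in KiB (the 25.7Mb figure)
sample_triples = 251404
sample_objects = 2318

new_files, old_files = 167, 153
old_pd_estimate = 371_900     # old-mapping estimate of Prints and Drawings objects
old_total_objects = 1.475e6   # old-mapping estimate for all objects

bytes_per_triple = sample_kib * 1024 / sample_triples
triples_per_object = sample_triples / sample_objects

# Assumption: an average-sized file times the number of files
new_pd_estimate = sample_objects * new_files

file_growth = new_files / old_files
object_growth = new_pd_estimate / old_pd_estimate

# Assumption: the notes scale the old total by the growth factor rounded to 1.04
new_total_objects = old_total_objects * round(object_growth, 2)

print(f"{bytes_per_triple:.2f} bytes RDF per triple")
print(f"{triples_per_object:.2f} triples per object")
print(f"{new_pd_estimate / 1e3:.1f}k Prints and Drawings objects")
print(f"file growth {file_growth:.3f}, object growth {object_growth:.2f}")
print(f"{new_total_objects / 1e6:.3f} M objects in total")
```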

h1. Newest mapping
Count statements and unique entities (nodes & literals). Works for ttl and trig files:
{noformat}
ls -1 *.t* | xargs -n 1 riot.bat >> 0statements.nt
wc -l 0statements.nt
perl -ne "/^(.*?) (.*?) (.*) \.$/; print qq{$1\n$3\n}" 0statements.nt | sort | uniq > 0subjects.txt
wc -l 0subjects.txt
{noformat}
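
Where perl, sort and uniq are not at hand, the same two counts can be sketched in Python. This is not the script used for the numbers on this page. It assumes N-Triples input with one statement per line and applies the same regular expression as the perl one-liner; the function name {{count_statements_and_entities}} is hypothetical.

```python
import re

# Same pattern as the perl one-liner: subject, predicate, object, trailing " ."
TRIPLE = re.compile(r"^(.*?) (.*?) (.*) \.$")

def count_statements_and_entities(lines):
    """Return (number of statements, number of unique subjects and objects)."""
    statements = 0
    entities = set()
    for line in lines:
        match = TRIPLE.match(line.rstrip("\r\n"))
        if not match:
            continue  # skip blank or malformed lines
        statements += 1
        entities.add(match.group(1))  # subject
        entities.add(match.group(3))  # object: a node or a literal
    return statements, len(entities)

# Tiny made-up sample; the example.org URIs are placeholders, not BM data
sample = [
    '<http://example.org/obj/1> <http://example.org/p/title> "A print" .',
    '<http://example.org/obj/1> <http://example.org/p/dept> <http://example.org/dept/PD> .',
    '<http://example.org/obj/2> <http://example.org/p/dept> <http://example.org/dept/PD> .',
]
result = count_statements_and_entities(sample)
print(result)

# For a real file, pass the open file object instead of the list:
# with open("0statements.nt", encoding="utf-8") as f:
#     print(count_statements_and_entities(f))
```

Unlike {{wc -l}}, this sketch skips lines that do not match the pattern, so the statement count can differ slightly if the input contains blank or malformed lines.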

- 8000 objects (1k per department: AES AOA ASIA CM GR ME PD PE)
- 710184 triples
- 194207 unique entities (nodes & literals)
- 89 triples per object

h1. Full Set
- zipped 0.9Gb, unzipped 24Gb. 5929 files
- about 2M objects

Thesauri (occurrences of skos:Concept):
- BM-data/thesauri:
-- bibliography: 8425
-- biography: 176449
-- dimensionunits: 2
-- flatauthorities: 1467
-- thesaurusandplace: 182333
-- inline: 26917.
Note: when N objects use the same term, it's repeated N times in the same named graph.
- RS and RKD thesauri*.ttl: 27157
- TOTAL: 422750
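
The total can be verified by summing the per-source counts listed above. A short Python check, with the dictionary keys copied from the list:

```python
# skos:Concept occurrences per source, copied from the list above
concept_counts = {
    "bibliography": 8425,
    "biography": 176449,
    "dimensionunits": 2,
    "flatauthorities": 1467,
    "thesaurusandplace": 182333,
    "inline": 26917,
    "RS and RKD thesauri*.ttl": 27157,
}

total = sum(concept_counts.values())
print(total)
```

Note that the inline figure counts repeated occurrences, so the total is an upper bound on the number of distinct concepts.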