Specification for migrating Rembrandt data to CRM
Limitations
In RS3.1 we don't migrate the following fields:
- Various Remarks, since they map to Annotations and we have still not decided how to do it (see Alternatives in Property Types and Annotations)
- By way of example, we have mapped <toeschrijving> Attribution, which is the most complex case.
It includes author (source), qualification, remark, date; while others have only Remark
- Group <link_sample_record> (12 fields), since that maps to Image annotation
- Several fields with unknown meaning, typically "x" (see susana.ttl for details):
<positie_signatuur_ref>, <reference_image.front>, <reference_image.back>
- Key fields that are outside the corresponding record (so cannot be correlated reliably), and whose relation is unclear:
<link_research_record>.<link_documentation_record.lref>, <link_research_record>.<link_documentation_record.lref>
- Files that are missing in some records, "NGL" in one, and "RKD" in all others. (We'll use only own IIP server):
<file.image.location>, <file.application.location>
Special Selection
<collectie> Collection
The last record (CURRENT COLLECTION) has special treatment, see in susana.ttl. In 05_Susanna, the records are in chronological order.
Optional:
- If <begindatum_in_collectie> are not in chronological order, throw an exception
- If <einddatum_in_collectie> are not in chronological order, throw an exception
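The optional checks above could be sketched as follows. This is only an illustration: the function name `check_chronological` is hypothetical, and it assumes the dates are zero-padded strings of the form yyyy, yyyy/mm or yyyy/mm/dd, so that they compare correctly as text.

```python
def check_chronological(field, values):
    """Raise ValueError if the non-empty date strings are not in ascending order."""
    # Skip missing/empty values; normalise "/" to "-" so the strings compare as text.
    dates = [v.strip().replace("/", "-") for v in values if v and v.strip()]
    for earlier, later in zip(dates, dates[1:]):
        if later < earlier:
            raise ValueError(
                f"<{field}> not in chronological order: {later} follows {earlier}"
            )
    return True
```

It would be called once per field, with the values taken from all <collectie> records in document order.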
<toeschrijving> Attribution
The first <toeschrijving> record says only "Rembrandt". The second has more data, and the third is bogus (checked in XMLs 02..07). Therefore:
- If more than one, use the second one; else use the only one.
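A minimal sketch of this selection rule, assuming the <toeschrijving> records of one object have already been collected into a list in document order (the name `select_attribution` is hypothetical):

```python
def select_attribution(records):
    """Pick the <toeschrijving> record to migrate: the second if several, else the only one."""
    if not records:
        return None  # no attribution at all: nothing to emit
    return records[1] if len(records) > 1 else records[0]
```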
Special Field Handling
Duplicate fields
Many fields are duplicate: NL tag at outer level, and EN tag within <object_number_RKDtechnical>.
These pairs are listed in the comments column of Rembrandt data#Reduced Sample Record, and in comments in susana.ttl.
They are handled in two different ways, described below
Merge
For multi-value fields (denoted "," in the comment): emit both, eg
Overwrite
For single-value fields (denoted "=" in the comment): emit only the NL field, ignoring EN, eg:
(here we deal with two fields at once)
For thesaurus fields, coordinate with Maria whether to use the NL or EN field, eg
- we prefer bilingual thesaurus (RKDtechnical)
- RKD prefers Dutch thesaurus (RKDimages) since it's more authoritative (RKD are in the process of cleaning and merging thesauri)
Text fields
- Language tags: for a free text field coming from a NL tag, emit @nl. For an EN tag, emit @en
- If a text field includes quotes or newlines, emit it as extended Turtle string, eg
- This includes the following fields (because of quote):
<file.application>, <literature>, <literatuur>, <research.reason_objective><value>
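The text-field rules above might be implemented along these lines. This is a sketch under assumptions: the helper name `turtle_literal` is hypothetical, and the caller is expected to pass "nl" or "en" depending on whether the value came from a NL or EN tag.

```python
def turtle_literal(text, lang):
    """Return a Turtle literal with a language tag, eg "olieverf"@nl."""
    # Backslashes must be escaped in both short and long Turtle strings.
    escaped = text.replace("\\", "\\\\")
    if '"' in text or "\n" in text or "\r" in text:
        # Long (triple-quoted) string: newlines may appear raw. Every double
        # quote is escaped so that no run of quotes can end the literal early.
        escaped = escaped.replace('"', '\\"')
        return f'"""{escaped}"""@{lang}'
    return f'"{escaped}"@{lang}'
```

Escaping every double quote inside the long string is stricter than Turtle requires, but it avoids special cases for a quote at the very end of the text.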
Dates
Extract <datering>
The painting's date is present in two fields that are not always consistent with each other and that may include text:
file | <datering> | <date> |
---|---|---|
02_Aristoteles | 1653 gedateerd | 1653 (dated) |
03_Batseba | 1643 | 1643 (dated) |
04_HermanDoomer | 1640 gedateerd | 1640 (dated) |
05_BadendeSusana | 1636 | 1636 (dated) |
06_Flora | 1635 gedateerd | 1635 |
07_man_met_baret | rond het midden of in de tweede helft van de jaren 1630 | ca. 1635-1640 |
08_NicolaesTulp | 1632 gedateerd | 1632 (dated) |
09_man_in_orientaalse | 1632 gedateerd | 1632 (dated) |
10_oude_vrouw | na ca. 1631 | ca. 1660 |
11_Andromeda | 1630/1631 | 1630/1631 |
12_lachende_man | 1629/1630 | 1629/1630 |
To extract a useful date:
- use <datering> and ignore <date> (this is an arbitrary decision)
- look for numbers (digit sequences) in <datering>
- emit as "YYYY"^^xsd:gYear
- handle 1 or 2 dates (P82 vs P82a&P82b)
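The extraction steps above can be sketched as below. The function name `extract_datering_years` is hypothetical, and raising an error when a value contains no number or more than two numbers is an assumption of this sketch, not something the rules above prescribe.

```python
import re

def extract_datering_years(datering):
    """Return the digit sequences in a <datering> value as xsd:gYear literals."""
    years = re.findall(r"\d+", datering)
    if len(years) not in (1, 2):
        # Assumption: anything other than 1 or 2 numbers needs manual inspection.
        raise ValueError(f"expected 1 or 2 years in <datering>: {datering!r}")
    return [f'"{year}"^^xsd:gYear' for year in years]
```

With the sample values from the table, "1653 gedateerd" yields one literal and "1630/1631" yields two, which then feed the one-date or two-date case respectively.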
Other Dates
For other date fields
- Replace "/" with "-" (eg "1758/05/23" is not valid xsd:date lexical value)
- Assign types xsd:date vs xsd:gYearMonth vs xsd:gYear depending on the date form (yyyy-mm-dd vs yyyy-mm vs yyyy)
(More elaborate handling of date vs gYearMonth vs gYear is not for RS3.1)
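A possible shape for this rule, as a sketch only (the name `typed_date` is hypothetical, and rejecting any other date form with an exception is an assumption):

```python
import re

# Date forms in order of decreasing precision, with the matching XSD datatype.
_DATE_FORMS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "xsd:date"),
    (re.compile(r"\d{4}-\d{2}"), "xsd:gYearMonth"),
    (re.compile(r"\d{4}"), "xsd:gYear"),
]

def typed_date(value):
    """Replace "/" with "-" and attach the datatype that fits the date form."""
    normalised = value.strip().replace("/", "-")
    for pattern, datatype in _DATE_FORMS:
        if pattern.fullmatch(normalised):
            return f'"{normalised}"^^{datatype}'
    raise ValueError(f"unrecognised date form: {value!r}")
```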
P82 vs P82a&P82b
Several fields can contain one or two dates
- if there is one date, emit
- if there are two dates, emit
This applies to:
- <datering>: painting (this single field can hold 2 dates, see Extract <datering>)
- <begindatum_lijst>, <einddatum_lijst>: frame
- <begindatum_tentoonstelling>, <einddatum_tentoonstelling>: exhibition
- <begindatum_veiling>, <einddatum_veiling>: auction
- <research.date_begin>, <research.date_end>: research
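The choice between the one-date and two-date case could look roughly like this. Note the assumptions: this page names the properties only as P82, P82a and P82b, so the full local names used below (`crm:P82_at_some_time_within`, `crm:P82a_begin_of_the_begin`, `crm:P82b_end_of_the_end`) should be checked against susana.ttl, and the function name `date_statements` is hypothetical.

```python
# Assumed full property names; only "P82", "P82a" and "P82b" are given in this spec.
P82 = "crm:P82_at_some_time_within"
P82A = "crm:P82a_begin_of_the_begin"
P82B = "crm:P82b_end_of_the_end"

def date_statements(dates):
    """Return (property, literal) pairs for a list of one or two typed date literals."""
    if len(dates) == 1:
        return [(P82, dates[0])]
    if len(dates) == 2:
        return [(P82A, dates[0]), (P82B, dates[1])]
    raise ValueError(f"expected 1 or 2 dates, got {len(dates)}")
```

For begin/end field pairs the caller would first drop missing or empty values, then pass whatever remains.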
Numbers
- emit integer fields, eg <bedrag>=157, as a bare number (eg 157), which is equivalent to "157"^^xsd:integer
- emit floating fields, eg <hoogte>=47,2 as eg "47.2"^^xsd:double
- convert comma to dot
- emit keys as string, even if they are numeric, eg
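The number rules above can be illustrated with three small helpers. The names are hypothetical, and the sketch assumes that keys never contain quotes or backslashes, so they need no escaping.

```python
def integer_literal(value):
    """Emit an integer field as a bare Turtle number, eg 157."""
    return str(int(value.strip()))

def double_literal(value):
    """Emit a floating field as xsd:double, converting the decimal comma to a dot."""
    number = value.strip().replace(",", ".")
    float(number)  # raises ValueError if the field is not numeric after all
    return f'"{number}"^^xsd:double'

def key_literal(value):
    """Emit a key as a plain string, even when it looks numeric."""
    return f'"{value.strip()}"'
```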
Trim white space
Trim leading/trailing white-space from all fields. Best to use a parser option for this.
Useful for:
- Space before image file (06_Flora.xml)
- empty field (just a newline)
Missing or Empty fields
Missing or empty fields MUST NOT emit any RDF. This includes:
- missing elements
- totally empty elements:
- elements having only whitespace:
This is very important, otherwise invalid or inconsistent TTL will result
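One way to enforce both trimming and the missing/empty rule in a single place is an accessor like the one below. It is a sketch that assumes the XML is read with Python's `xml.etree.ElementTree`; the name `field_value` is hypothetical.

```python
import xml.etree.ElementTree as ET

def field_value(parent, tag):
    """Return the trimmed text of child `tag`, or None if missing, empty or whitespace-only."""
    element = parent.find(tag)
    if element is None or element.text is None:
        return None  # missing element, or totally empty element
    text = element.text.strip()
    return text or None  # whitespace-only content counts as empty
```

Callers then emit RDF only when the result is not None, so the three cases listed above are handled uniformly.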
Missing Frame
If there aren't any Frame fields (<begindatum_lijst>, <einddatum_lijst>, <naam_lijstenmaker>, <lijstmateriaal>) then:
- don't emit any statements related to part/2:
- crm:P57_has_number_of_parts should be 1 not 2
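The frame check could be sketched as follows, assuming the record's fields are available as a dict from tag name to raw text (the dict representation and the function names are assumptions of this sketch):

```python
FRAME_FIELDS = ("begindatum_lijst", "einddatum_lijst", "naam_lijstenmaker", "lijstmateriaal")

def has_frame(values):
    """True if at least one Frame field is present and non-empty after trimming."""
    return any((values.get(field) or "").strip() for field in FRAME_FIELDS)

def number_of_parts(values):
    """Value for crm:P57_has_number_of_parts: painting only, or painting plus frame."""
    return 2 if has_frame(values) else 1
```

When `has_frame` is false, the statements about part/2 are skipped altogether.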
Fake Values
Treat the following values as missing (i.e. don't emit)
- <naam_koper> = "-"
- <sample.name_number> = "x" (for RS3.3)
The following fields always have empty or fake value, so are simply ignored:
- <collectie_afdrukken> <oorspronkelijke_lijst> <reference_image.back> <reference_image.front>
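A small filter for these cases might look like this (a sketch; the names `FAKE_VALUES`, `IGNORED_FIELDS` and `is_emittable` are hypothetical):

```python
# Field-specific placeholder values that mean "no data".
FAKE_VALUES = {
    "naam_koper": {"-"},
    "sample.name_number": {"x"},  # relevant for RS3.3
}

# Fields that always hold empty or fake values and are skipped entirely.
IGNORED_FIELDS = {
    "collectie_afdrukken",
    "oorspronkelijke_lijst",
    "reference_image.back",
    "reference_image.front",
}

def is_emittable(field, value):
    """False for ignored fields, empty values, and known fake values."""
    if field in IGNORED_FIELDS or value is None:
        return False
    text = value.strip()
    return bool(text) and text not in FAKE_VALUES.get(field, set())
```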
Files
The XMLs include references to various files, see Documentation, Files, Images#File types for details.
They are handled according to the following decision table ("content" means to check if element content starts with this):
source tag | content | target property | extra actions (and justification) |
---|---|---|---|
<file.image> | | rso\:P3_has_image_file | |
<file.application> | < | rso\:P3_has_html | Decode HTML entities lt gt amp |
<file.application> | http | rso\:P3_has_url | |
<file.application> | (anything else) | | throw exception, printing the content |
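The decision table translates fairly directly into code. The sketch below is illustrative: the function name `map_file_field` is hypothetical, and it uses Python's `html.unescape`, which decodes all HTML entities rather than only lt, gt and amp.

```python
import html

def map_file_field(tag, content):
    """Return (target property, value) for a file element, following the decision table."""
    content = content.strip()
    if tag == "file.image":
        return ("rso:P3_has_image_file", content)
    if tag == "file.application":
        if content.startswith("<"):
            # Embedded HTML: decode entities such as &lt; &gt; &amp;
            return ("rso:P3_has_html", html.unescape(content))
        if content.startswith("http"):
            return ("rso:P3_has_url", content)
        raise ValueError(f"unexpected <file.application> content: {content!r}")
    raise ValueError(f"not a file element: <{tag}>")
```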
Counters
Some XML elements allow repetitions, which results in several nodes. We use counters to generate the URIs for these nodes.
The counters are reset to 1 at the start of every object and are incremented globally within that object (no matter the nesting):
- object: obj/priref (root of these below)
- parts: part/n (1=painting, 2=frame)
- <andere_benaming>, <title.other_older>: title/other/n (both XML elements use one counter)
- <artistiek>: related/n
- <literatuur>, <bronnen>: reference/n (both XML elements use one counter)
- <collectie>: collection/n
- <tentoonstellingen>: exhibition/n
- <veiling>: acquisition/n
- <link_research_record>: research/n
- <link_documentation_record>: document/n
- <link_file_record>: file/n
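The counter scheme can be sketched with one counter object per migrated record. Two assumptions are made here: node URIs are formed by appending the counter name and number to obj/priref, and the class and mapping names (`NodeCounters`, `COUNTER_FOR_TAG`) are hypothetical.

```python
from collections import defaultdict

# XML element -> counter name; elements sharing a name share one counter.
COUNTER_FOR_TAG = {
    "andere_benaming": "title/other",
    "title.other_older": "title/other",
    "artistiek": "related",
    "literatuur": "reference",
    "bronnen": "reference",
    "collectie": "collection",
    "tentoonstellingen": "exhibition",
    "veiling": "acquisition",
    "link_research_record": "research",
    "link_documentation_record": "document",
    "link_file_record": "file",
}

class NodeCounters:
    """Per-object counters; create a new instance for every object to reset them."""

    def __init__(self, priref):
        self.root = f"obj/{priref}"
        self.counts = defaultdict(int)

    def next_uri(self, tag):
        """Increment the counter for `tag` and return the URI of the new node."""
        name = COUNTER_FOR_TAG[tag]
        self.counts[name] += 1
        return f"{self.root}/{name}/{self.counts[name]}"
```

Because the counts live on the instance and not on the XML nesting level, a repeated element gets the next number wherever it occurs inside the object.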
Thesauri
See Thesaurus Lookup function for details!
- For thesaurus fields, call these:
- LookupInThesaurusByLabel (String field, String label)
for fields with simple content (eg <drager>)
- LookupInthesaurusByLabels (String field, StringWithLang[] labels)
for fields with <value> elements (eg <object.support>). These include multiple labels with language
- for <iconclass_code>, generate the URI yourself
Remove spaces, replace "(...)" with "_..._", prepend "_" and rst-iconclass: namespace
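The URI rule for <iconclass_code> could be implemented as below. This is a sketch: the function name `iconclass_uri` is hypothetical, the input in the usage note is only illustrative, and characters other than spaces and parentheses are passed through unchanged.

```python
import re

def iconclass_uri(code):
    """Build a prefixed name in the rst-iconclass: namespace from an Iconclass code."""
    compact = re.sub(r"\s+", "", code)                      # remove spaces
    compact = re.sub(r"\(([^)]*)\)", r"_\1_", compact)      # "(...)" becomes "_..._"
    return "rst-iconclass:_" + compact                      # prepend "_" and the namespace
```

For example, an input such as "25 F 23 (LION)" would become rst-iconclass:_25F23_LION_.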