{toc}
{attachments}

h1. Preparation

- read [Rembrandt data] to learn the data. A DB schema is enclosed (not too complex), plus sample data in Excel
- read [Rembrandt to CRM- old] for a draft sample converted to CRM.
-- the diagrams are not entirely correct
- IMHO the easiest to understand is the RDF Turtle:
-- [https://svn.ontotext.com/svn/researchspace/data/susana.ttl]
(SSL; to get svn accounts, post a task in the jira component "00 Infra" giving the names & emails)
-- [http://personal.sirma.bg/vladimir/crm/art/susana.ttl.html]
Color-coded HTML (export from Emacs), not the final version
- read [Data Migration and Ingestion Tools] for tools that can be useful.
The study of these tools should determine the approach
- I still need to write a spec of the migration (should be done by Nov 9, right after my vacation).
You can see notes in the Excel [Rembrandt data#Reduced Sample Record] (last column) that should give you some idea about the tricky parts

h1. Approach

Mitac, Kalin, SSL: please write your considerations below

h2. Preferred approach (Java DOM parser, output to RDF Turtle or N-Triples)

* parse the XML using DOM parser
* convert each XML tag to a list of statements (type, note, attributes and relations)
* output the statements to a file

h3. Example code:
{code:title=Excerpt from migration - simple fields; the code is refactored|borderStyle=solid}
//PRIREF
nodes = root.getElementsByTagName("priref");
currentURI = buildURI(baseTEXT, "/priref");

addTriple(baseURI, "crm:P48_has_preferred_identifier", currentURI);
addEntity(nodes, currentURI, "rst-identifier:RKD_priref");
//**************************************************************************************
//PRIMARY TITLE
nodes = root.getElementsByTagName("benaming_kunstwerk");
currentURI = buildURI(baseTEXT, "/title/primary");

addTriple(baseURI, "crm:P102_has_title", currentURI);
addEntity(nodes, currentURI, "rst-note:title-primary");
//**************************************************************************************
//OTHER TITLES
nodes = root.getElementsByTagName("andere_benaming");
for (int i = 0; i < nodes.getLength(); i++) {
    currentURI = buildURI(baseTEXT, "/title/other/" + (i + 1));
    addTriple(baseURI, "crm:P102_has_title", currentURI);
    addEntity(nodes, currentURI, "rst-note:title-other");
}
{code}
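The excerpts above call helpers ({{buildURI}}, {{addTriple}}, {{addEntity}}) whose bodies are only in the attached zip. A minimal sketch of what they might look like, assuming statements are collected as Turtle strings and flushed to a file at the end (names, signatures, and the {{rdfs:label}} output are assumptions, not the actual code):

```java
// Hypothetical sketch of the helper methods used in the excerpts; the real
// implementations are in the attached zip.
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.NodeList;

public class MigrationHelpers {
    // Collected Turtle statements, written out to file at the end of the run
    static final List<String> statements = new ArrayList<>();

    // Build a URI for a sub-resource of the record,
    // e.g. <.../painting/123/title/primary>
    static String buildURI(String base, String suffix) {
        return "<" + base + suffix + ">";
    }

    // Record one triple as a Turtle statement
    static void addTriple(String subject, String predicate, String object) {
        statements.add(subject + " " + predicate + " " + object + " .");
    }

    // Type the entity and attach the text of the first matching XML node
    static void addEntity(NodeList nodes, String uri, String type) {
        addTriple(uri, "a", type);
        if (nodes.getLength() > 0) {
            addTriple(uri, "rdfs:label",
                      "\"" + nodes.item(0).getTextContent() + "\"");
        }
    }
}
```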

{code:title=Excerpt from migration - nested/collection fields; the code is not refactored|borderStyle=solid}
dos.writeBytes("### <collectie>\n");

for (int i = 0; i < nodes.getLength() - 1; i++) {
    currentURI = buildURI(base + name, "/collection/" + (i + 1) + "/entry");
    // cast the node itself; item(i).getChildNodes() returns a NodeList, not an Element
    children = (Element) nodes.item(i);
    tmp = children.getElementsByTagName("collectienaam");
    tmp0 = children.getElementsByTagName("plaats_collectie_verblijfplaats");
    tmp1 = children.getElementsByTagName("begindatum_in_collectie");
    tmp2 = children.getElementsByTagName("einddatum_in_collectie");

    dos.writeBytes(currentURI + " a crm:E79_Part_Addition; crm:P111_added " + tmpURI + ".\n");
    dos.writeBytes(currentURI + " a crm:E10_Transfer_of_Custody; crm:P30_transferred_custody_of " + tmpURI + ".\n");
    dos.writeBytes(currentURI + " a crm:E8_Acquisition; crm:P24_transferred_title_of " + tmpURI + ".\n");
    dos.writeBytes(currentURI + " a crm:E79_Part_Addition; crm:P110_augmented rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");
    dos.writeBytes(currentURI + " a crm:E10_Transfer_of_Custody; crm:P29_custody_received_by rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");
    dos.writeBytes(currentURI + " a crm:E8_Acquisition; crm:P22_transferred_title_to rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");

    tmpURI = buildURI(base + name, "/collection/" + (i + 1) + "/entry/date");

    dos.writeBytes(currentURI + " crm:P4_has_time-span " + tmpURI + ".\n");

    if (tmp1.getLength() > 0) {
        dos.writeBytes(tmpURI + " crm:P82_at_some_time_within \"" + tmp1.item(0).getTextContent() + "\"^^xsd:gYear.\n");
    } else {
        dos.writeBytes(tmpURI + " crm:P82_at_some_time_within \"N/A\"^^xsd:gYear.\n");
    }

    currentURI = buildURI(base + name, "/collection/" + (i + 1) + "/exit");

    dos.writeBytes(currentURI + " a crm:E80_Part_Removal; crm:P113_removed " + tmpURI + ".\n");
    dos.writeBytes(currentURI + " a crm:E10_Transfer_of_Custody; crm:P30_transferred_custody_of " + tmpURI + ".\n");
    dos.writeBytes(currentURI + " a crm:E8_Acquisition; crm:P24_transferred_title_of " + tmpURI + ".\n");
    dos.writeBytes(currentURI + " a crm:E80_Part_Removal; crm:P112_diminished rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");
    dos.writeBytes(currentURI + " a crm:E10_Transfer_of_Custody; crm:P28_custody_surrendered_by rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");
    dos.writeBytes(currentURI + " a crm:E8_Acquisition; crm:P23_transferred_title_from rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");

    tmpURI = buildURI(base + name, "/collection/" + (i + 1) + "/exit/date");

    dos.writeBytes(currentURI + " crm:P4_has_time-span " + tmpURI + ".\n");
    dos.writeBytes(tmpURI + " crm:P82_at_some_time_within \"" + tmp2.item(0).getTextContent() + "\"^^xsd:gYear.\n");

    dos.writeBytes(tmpURI + " crm:P49_has_former_or_current_keeper rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");
    dos.writeBytes(tmpURI + " crm:P51_has_former_or_current_owner rkd-collection:\"" + tmp.item(0).getTextContent() + "\".\n");
    dos.writeBytes(tmpURI + " crm:P53_has_former_or_current_location \"" + tmp0.item(0).getTextContent() + "\".\n");

    coll = i;
}
{code}

h3. In progress

* thesaurus usage is just a mockup (not implemented), and the cases where an element is not found or multiple elements are found are not handled yet
* refactoring of the initial code is only partially done; the code examples above are meant to show the approach, not to be the full migration
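One possible way to make the "element not found / multiple elements found" cases explicit would be a small lookup helper. This is a sketch, not part of the current code; {{getSingleText}} and {{parseRoot}} are hypothetical names:

```java
// Hypothetical helper making missing/duplicate XML elements an explicit,
// logged case instead of a NullPointerException at item(0).
import org.w3c.dom.Element;

public class SafeLookup {
    /** Returns the text of exactly one matching child, or null with a warning. */
    static String getSingleText(Element parent, String tag) {
        org.w3c.dom.NodeList matches = parent.getElementsByTagName(tag);
        if (matches.getLength() == 0) {
            System.err.println("WARN: missing <" + tag + ">");
            return null;
        }
        if (matches.getLength() > 1) {
            System.err.println("WARN: " + matches.getLength() + " <" + tag + "> elements, using first");
        }
        return matches.item(0).getTextContent();
    }

    /** Convenience for testing: parse an XML string into its root element. */
    static Element parseRoot(String xml) {
        try {
            return javax.xml.parsers.DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new java.io.ByteArrayInputStream(
                        xml.getBytes(java.nio.charset.StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```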

* Files [migration-aproach-mitac-kalin.zip |https://confluence.ontotext.com/download/attachments/9503576/migration-aproach-mitac-kalin.zip]
** Sample output (OUTPUT.txt)
** Java files (*.java)
* Please note that the use of DataOutputStream writeBytes() in the code above is problematic: writeBytes() discards the high byte of each character, so we lose the benefit of Java's built-in UTF-8 support. Using a BufferedWriter with write() instead avoids the issue.
* I propose refactoring the code to parametrize the calls as much as possible, so that, apart from the parseFrame method, most of the migration could be driven by a properties file. This grows out of the proposal to use a properties file for setting the thesaurus references.
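To illustrate the writeBytes() issue above: wrapping the output stream in an OutputStreamWriter with an explicit UTF-8 charset preserves the Dutch titles and diacritics that writeBytes() would mangle. A minimal sketch ({{writeTurtle}} is a hypothetical name):

```java
// Sketch: encode output as UTF-8 via BufferedWriter instead of
// DataOutputStream.writeBytes(), which drops the high byte of each char.
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8Output {
    /** Writes one Turtle statement followed by a newline, UTF-8 encoded. */
    static byte[] writeTurtle(String statement) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            w.write(statement);
            w.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In the real migration the ByteArrayOutputStream would of course be a FileOutputStream.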
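As a sketch of the properties-file idea: each XML tag could map to its CRM property and type in one line, e.g. {{benaming_kunstwerk = crm:P102_has_title, rst-note:title-primary}}. The key/value layout and the {{MappingConfig}} helper below are assumptions for illustration, not a settled format:

```java
// Hypothetical sketch of a properties-driven tag-to-CRM mapping.
// Each line: <xml tag> = <crm property>, <entity type>
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class MappingConfig {
    /** Parses "tag = property, type" lines into a tag -> {property, type} map. */
    static Map<String, String[]> load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String[]> mapping = new LinkedHashMap<>();
        for (String tag : props.stringPropertyNames()) {
            // split "crm:P102_has_title, rst-note:title-primary" into the pair
            mapping.put(tag, props.getProperty(tag).split("\\s*,\\s*"));
        }
        return mapping;
    }
}
```

The migration loop could then iterate over the map instead of hard-coding one block of calls per tag, leaving only genuinely irregular fields (like the collection entries) in hand-written code.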