Preparation
- read Rembrandt data to learn the data. A DB schema is enclosed (it is not very complex), along with sample data in Excel
- read Rembrandt to CRM- old for a draft sample converted to CRM.
- the diagrams are not entirely correct
- IMHO the RDF Turtle is the easiest to understand:
- https://svn.ontotext.com/svn/researchspace/data/susana.ttl
(SSL, to get svn accounts, post a task in Jira component "00 Infra" giving the names & emails)
- http://personal.sirma.bg/vladimir/crm/art/susana.ttl.html
Color-coded HTML (export from Emacs), not the final version
- read Data Migration and Ingestion Tools for tools that may be useful.
The study of these tools should determine the approach. I still need to write a spec of the migration (should be done Nov 9, right after my vacation).
You can see notes in the Excel sheet Rembrandt data#Reduced Sample Record (last column) that should give you some idea of the tricky parts
Approach
Mitac, Kalin, SSL: please write your considerations
Preferred approach (Java/DOM parser + output to RDF, as Turtle or N-Triples)
- parse the XML using a DOM parser
- convert each XML tag to a list of statements (type, note, attributes and relations)
- output the statements to a file
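The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached migration: the class name, the `http://example.org/rembrandt/` namespace, the `type/`, `attr/`, `rel/` and `note` predicate layout, and the sample XML record are all hypothetical placeholders, and a real mapping to CRM would replace them.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DomToNTriplesSketch {
    // Hypothetical namespace, used only for this illustration.
    static final String BASE = "http://example.org/rembrandt/";
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** Step 1: parse the XML text into a DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    /** Step 2: convert one element (and, recursively, its children) to N-Triples lines. */
    static void toStatements(Element el, String subject, List<String> out) {
        // type statement, derived from the tag name
        out.add(iriTriple(subject, RDF_TYPE, BASE + "type/" + el.getTagName()));
        // one literal statement per XML attribute
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.add(literalTriple(subject, BASE + "attr/" + a.getNodeName(), a.getNodeValue()));
        }
        // one relation per child element; plain text is collected as the note
        StringBuilder text = new StringBuilder();
        NodeList children = el.getChildNodes();
        int childIndex = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                Element ce = (Element) child;
                childIndex++;
                String childSubject = subject + "/" + ce.getTagName() + "/" + childIndex;
                out.add(iriTriple(subject, BASE + "rel/" + ce.getTagName(), childSubject));
                toStatements(ce, childSubject, out);
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        String note = text.toString().trim();
        if (!note.isEmpty()) {
            out.add(literalTriple(subject, BASE + "note", note));
        }
    }

    static String iriTriple(String s, String p, String o) {
        return "<" + s + "> <" + p + "> <" + o + "> .";
    }

    static String literalTriple(String s, String p, String value) {
        return "<" + s + "> <" + p + "> \"" + escape(value) + "\" .";
    }

    /** Escapes the characters that may not appear raw inside an N-Triples literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r");
    }

    /** Step 3: write the statements to a file, explicitly encoded as UTF-8. */
    static void write(List<String> statements, Path file) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String s : statements) {
                w.write(s);
                w.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<object id=\"1\"><title>Susanna</title></object>"; // hypothetical record
        List<String> statements = new ArrayList<>();
        toStatements(parse(xml).getDocumentElement(), BASE + "object/1", statements);
        Path file = Files.createTempFile("output", ".nt");
        write(statements, file);
        System.out.println("Wrote " + statements.size() + " statements to " + file);
    }
}
```

N-Triples is used here because each statement is one self-contained line, which keeps the writer trivial; Turtle output would need prefix handling on top of this.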
Example code:
Excerpt from migration - simple fields; the code is refactored
Excerpt from migration - nested/collection fields; the code is not refactored
In progress
- thesaurus usage is just a mockup: it is not implemented, and the cases where an element is not found or multiple elements are found are not resolved
- refactoring of the initial code is only partially done; we wanted to show the approach (in the code examples above), not to write the full migration
- Files: migration-aproach-mitac-kalin.zip
- Sample output (OUTPUT.txt)
- Java files (*.java)
- Please note that the use of DataOutputStream writeBytes() in the code above is problematic: it writes only the low-order byte of each character, so we lose the benefit of Java's built-in UTF-8 support. If a BufferedWriter write() call is used instead (over a writer opened with UTF-8 encoding), we do not have the issue.
- I propose refactoring the code to parametrize the calls as much as possible, so that, for instance, much of what lies outside the parseFrame method could be driven by a properties file. This grows out of the proposal to use a properties file for setting the thesaurus references.
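The last two notes can be illustrated with a small sketch. It compares the bytes produced by DataOutputStream writeBytes() with those produced by a BufferedWriter over a UTF-8 OutputStreamWriter, and shows one possible way to read thesaurus references from properties text. The class name, the `thesaurus.<field>` key convention and the example.org URL are assumptions made for this illustration; they do not come from the attached code.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class OutputAndConfigSketch {

    /** Problematic: writeBytes() discards the high eight bits of every char. */
    static byte[] viaWriteBytes(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeBytes(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Preferred: a Writer with an explicit charset encodes every char as UTF-8. */
    static byte[] viaWriter(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(buffer, StandardCharsets.UTF_8))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Loads settings (for example thesaurus references) from properties-file text. */
    static Properties loadConfig(String propertiesText) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    public static void main(String[] args) {
        String label = "\u0141\u00f3d\u017a"; // "Lodz" with Polish diacritics, 4 chars

        byte[] lossy = viaWriteBytes(label); // 4 bytes, one truncated byte per char
        byte[] utf8 = viaWriter(label);      // 7 bytes of valid UTF-8
        System.out.println("writeBytes: " + lossy.length + " bytes, round-trips: "
                + label.equals(new String(lossy, StandardCharsets.UTF_8)));
        System.out.println("writer:     " + utf8.length + " bytes, round-trips: "
                + label.equals(new String(utf8, StandardCharsets.UTF_8)));

        // Hypothetical key convention: thesaurus.<field> = <thesaurus reference>
        Properties config = loadConfig(
                "thesaurus.material=http://example.org/thesaurus/material\n");
        System.out.println("material -> " + config.getProperty("thesaurus.material"));
    }
}
```

With settings kept in a properties file like this, changing a thesaurus reference would not require recompiling the migration code.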