{excerpt}Documents and Files about a Painting{excerpt}
{toc}

h2. Get XMLs

The XMLs of the paintings are obtained according to [Rembrandt data#Data Sources] ([RS-57@jira])
- "Zelfportret" [http://rkd.adlibsoft.com/rembrandt-demo/painting/zelfportret] returns error: "Sorry\! An error occured while you were using this site. The page you are looking for might have been removed, had its name changes, or is temporarily unavailable."
- "Borststuk van een man met een gevederde baret" [http://rkd.adlibsoft.com/rembrandt-demo/painting/borststuk-van-een-man-met-een-gevederde-baret] has Other/former title "Zelfportret", but is not to be confused with "Zelfportret".
- We renumbered the files and saved them under filenames that include the most important words from the title (see tables below)
- We put each XML element on its own line:
{code}pfind . /xml$/ "=~s{><}{>\n<}g"{code}

h2. File Kinds

The 11 Rembrandt XMLs include files (<link_file_record>) of the following kinds:
|| kind || tag || given as || ext || handling ||
| image | <file.image> | file name | tif | Original TIFs are 50-200 MB in size. Download as JPG (smallest rendition), store and serve through Nuxeo (in popup window or as block element) |
| link | <file.application> | absolute link | pdf | Save the link, open in popup window |
| html excerpt | <file.application> | object embed | swf | Save the html excerpt, serve (in popup window or as block element). Uses ScribdViewer.swf to render, so the browser must support Flash. E.g. [Documentation, Files, Images^Mh0146-Letter-1877-Hopman.htm] |

To cross-check that these are the only tags and file extensions involved, run these two (bash) commands; both should return nothing:
{code}egrep -i "tif|pdf|swf" *.xml | egrep -v "<file.(image|application)>"
egrep -v "<file.(image|application)>" *.xml | egrep -i "tif|pdf|swf"
{code}
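
A complementary view is a tally of which extension occurs under which tag. The sketch below assumes one element per line and that the element content contains no literal {{<}}; the sample file is made up, and {{kinds.txt}} is just an illustrative output name.

```shell
# Scratch directory with made-up sample lines (not real data)
cd "$(mktemp -d)"
cat > sample.xml <<'EOF'
<file.image>a.tif</file.image>
<file.image>b.TIF</file.image>
<file.application>http://example.org/report.pdf</file.application>
<title>Flora</title>
EOF

# Pull out "<tag>....ext" fragments, reduce each to "tag ext", and count the pairs
grep -Eioh '<file\.[a-z]+>[^<]*\.(tif|pdf|swf)' *.xml \
  | sed -E 's/^<([^>]+)>.*\.([A-Za-z]+)$/\1 \2/' \
  | tr 'A-Z' 'a-z' | sort | uniq -c | awk '{print $1, $2, $3}' > kinds.txt

cat kinds.txt
```

Any tag and extension pair outside the File Kinds table would show up as an extra row in the tally.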

h2. Number of Files

We can count the different kinds of files with commands like these:
{code}fgrep -c "<file.image>" *.xml
fgrep -c ".tif" *.xml{code}

We can cross-check by listing the image elements that do not mention .tif:
{code}fgrep "<file.image>" *.xml | fgrep -v ".tif"{code}
That's how I caught these problems (accounted for in [Data Migration Spec]):
- 08_NicolaesTulp.xml: <file.image>mh0146_front_nldetail_1997_038</file.image>
missing file extension
- 06_Flora.xml: <file.image>N-4930-00-000096-017-PYR.tif / N-4930-00-000096-018-PYR.tif</file.image>
Some bright mind put two images in one element
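
Both kinds of problem can also be listed directly. This is a sketch, assuming each <file.image> element sits on its own line; the sample lines are made up and only modelled on the two cases above, and the output file names are illustrative.

```shell
# Scratch directory with made-up sample lines (not real data)
cd "$(mktemp -d)"
cat > sample.xml <<'EOF'
<file.image>good_image.tif</file.image>
<file.image>no_extension</file.image>
<file.image>first.tif / second.tif</file.image>
EOF

# Image elements whose content does not end in .tif (missing extension)
grep -HF '<file.image>' sample.xml | grep -Eiv '\.tif</file\.image>' > no-ext.txt

# Image elements that mention .tif more than once (several images in one element)
grep -HEi '<file\.image>.*\.tif.*\.tif' sample.xml > multi.txt

cat no-ext.txt multi.txt
```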

{table-plus:autoTotal=true|columnTypes=S,I,I,I}
|| painting || images || links || html ||
| 02_Aristoteles.xml | 57 | | 5 |
| 03_Batseba.xml | 12 | | 5 |
| 04_HermanDoomer.xml | 8 | | 6 |
| 05_BadendeSusana.xml | 42 | 1 | 2 |
| 06_Flora.xml | 35 | | |
| 07_man_met_baret.xml | 18 | | 2 |
| 08_NicolaesTulp.xml | 64 | | 4 |
| 09_man_in_orientaalse.xml | 10 | | 2 |
| 10_oude_vrouw.xml | 3 | | |
| 11_Andromeda.xml | 46 | | |
| 12_lachende_man.xml | 28 | | 2 |
{table-plus}
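
The rows of such a table could be generated rather than typed. This is a sketch, assuming one element per line, and assuming that links are the <file.application> elements that mention .pdf and html excerpts are those that mention .swf (as in the File Kinds table). The sample files and the {{count}} helper are made up for illustration, and zero counts are printed as 0 rather than left blank.

```shell
# Scratch directory with two made-up XML files (not real data)
cd "$(mktemp -d)"
printf '%s\n' '<file.image>a.tif</file.image>' '<file.image>b.tif</file.image>' \
  '<file.application>http://example.org/doc.pdf</file.application>' > 01_sample.xml
printf '%s\n' '<file.image>c.tif</file.image>' \
  '<file.application>&lt;embed src="ScribdViewer.swf"&gt;</file.application>' > 02_sample.xml

# Count matching lines in one file; grep -c prints 0 but exits 1 when nothing matches
count() { grep -Eic "$1" "$2" || true; }

echo '|| painting || images || links || html ||' > counts.txt
for f in *.xml; do
  images=$(count '<file\.image>' "$f")
  links=$(count '<file\.application>.*\.pdf' "$f")
  html=$(count '<file\.application>.*\.swf' "$f")
  echo "| $f | $images | $links | $html |" >> counts.txt
done

cat counts.txt
```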

h2. Number of Images

- extract XML file names and images within each
{code}grep '<file.image>' *.xml > 0images-files.txt{code}
- extract image names only
{code}perl -ne 'm{<file.image>(.*)</file.image>} and print "$1\n"' *.xml |sort > images.txt{code}
- check for duplicates
{code}uniq -d images.txt{code}
Only N-4930-00-000052-PYR.tif: mentioned twice in Flora
- get all images at size Height=400 pixels (0images-get.bat):
{code}curl -s "http://rembrandtdatabase.adlibsoft.com/IIPImageServer/IIPImageServer.exe?FIF=D:/rembrandtdatabase.adlibsoft.com/Images/$1.tif&HEI=400&CVT=JPEG" -o $1.jpg{code}
- Out of 323 images, 147 (45%) are missing. Each failed download is 17 bytes long and contains this text:
{code}Error/7:1 3 FIF{code}
- Save missing image names in 0images-missing.txt
- Make full table of files, images, and status (MISSING or not)
{code}join -a 1 0images-files.txt 0images-missing.txt{code}
- Convert to excel ([Documentation, Files, Images^0images-files.xlsx]), add pivot chart (hint: add Status to both Legend Fields and Values\!)

!0images-files.png!
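
The missing-image steps above can be scripted. This is a sketch, assuming every failed download is a .jpg file that starts with the server's {{Error/7}} text; the sample files are fabricated stand-ins for the curl output, and {{files-by-image.txt}} is a hypothetical listing of image name and XML file, sorted by image name so that join can match on the first field.

```shell
# Use byte-wise collation so that sort and join agree on the ordering
export LC_ALL=C
cd "$(mktemp -d)"

# Fabricated stand-ins for downloaded files (not real data)
printf 'fake-jpeg-bytes' > DT509.jpg
printf 'Error/7:1 3 FIF' > 47208.jpg
printf 'Error/7:1 3 FIF' > 143570.jpg

# A download failed if the file holds the server's error text instead of JPEG data
for f in *.jpg; do
  if head -c 7 "$f" | grep -q '^Error/7'; then
    echo "${f%.jpg} MISSING"
  fi
done | sort > 0images-missing.txt

# Hypothetical listing: image name, then the XML file that mentions it
printf '%s\n' '143570 09_sample.xml' '47208 09_sample.xml' 'DT509 09_sample.xml' \
  | sort > files-by-image.txt

# Keep every image (-a 1) and append MISSING where the download failed
join -a 1 files-by-image.txt 0images-missing.txt > status.txt
cat status.txt
```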

- To check what the data migration has produced:
{code}perl -ne 'm{([^"]*?\.jpg)} and print qq{$1\n}' *.ttl | sort{code}

h2. Main Image

The main image (shown in search results) has these qualifiers
- file.spec.overall_detail/value\[@lang="neutral"\]="OVERALL"
- file.spec.front_back/value\[@lang="neutral"\]="FRONT"

Unfortunately, several images per painting match these qualifiers, as shown in the table below.
So the main image cannot be determined from a query alone.
The main image (shown in bold below) has to be marked manually, using rdf:type rso:E38_Main_Image.
(In 2 cases we downloaded better images from the web.)

|| .xml file || main .jpg images ||
| 02_Aristoteles | *DT219367* 177463 241722 223525 272638 DP147597 |
| 03_Batseba | *DT226609* 23941 115937 DT225592 263034 DP145915 |
| 04_HermanDoomer | *DP145921* 73943 206843 DT2102 DT230089 mh0147_front_uv_1982|
| 05_BadendeSusana | *mh0147_front_nl_2002* mh0147_front_uv_1985 mh0147_front_rl_1982|
| 06_Flora | *N-4930-00-000052-PYR* N-4930-00-000094-PYR N-4930-00-000096-027-PYR N-4930-00-000096-029-PYR N-4930-00-000096-032-PYR N-4930-00-000096-035-PYR N-4930-00-000052-PYR |
| 07_man_met_baret | *mh0149_front_nl_2010* mh0149_front_nl_1999_007|
| 08_NicolaesTulp | *mh0146_front_nl_1998* mh0146_front_nlsimulation_1998 mh0146_damages002_1877 (one with no <file.image>)|
| 09_man_in_orientaalse | *DT509* 47208 47208_1941 143570 263045 DP121368 207750|
| 10_oude_vrouw | *mh0610_front_nl_2008* (Downloaded), mh0610_front_uv_2008 |
| 11_Andromeda | *mh0707_irr_2001* (Downloaded), mh0707_back_nl_2001_001 |
| 12_lachende_man | *mh0598_front_nl_1998_001* mh0598_front_eer_1970_001 mh0598_irp_1970 mh0598_front_rl_1998_001 mh0598_front_nl_1998_007 mh0598_front_uv_1998_001 |