AnnoCultor

AnnoCultor started out as "Just Code it in Java", Mitac's favored approach. It is a Java application used by VU for the eCulture Pilot. Papers (local copies):

AnnoCultor has grown into a fully-fledged conversion framework that converts databases and XML files to RDF and semantically tags them with links to vocabularies, so they can be published as Linked Data on the Semantic Web. It consists of:

  • AnnoCultor Converter: converts SQL databases, XML files, and SPARQL datasets to RDF. Converters are written in XML in a simple declarative way, and common XML editing skills are sufficient to write one.
  • AnnoCultor Tagger: allows assigning semantic tags (terms from existing vocabularies) to your data. Recently used to semantically tag nearly 7 million records from the Europeana collections with location data. (A rough sketch of what such tagging looks like in RDF follows this list.)
  • AnnoCultor Time Ontology: a vocabulary of time periods: millennia, centuries, half centuries, quarters, decades, and years. It also includes historical periods, like the Middle Ages.
    (Last time I looked, there was no viable time periods ontology, maybe that's changed)
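
As a rough illustration of the Tagger's output (see the Tagger bullet above), here is a minimal Java/Jena sketch that links a record's free-text place field to a vocabulary URI. The record URI, the dct:spatial property and the GeoNames link are invented placeholders for the example, not AnnoCultor's actual configuration or output:

    import org.apache.jena.rdf.model.*;

    public class TaggingSketch {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();

            // Hypothetical namespaces; the real output depends on the converter configuration.
            String data = "http://example.org/record/";
            String dct  = "http://purl.org/dc/terms/";

            Resource record  = model.createResource(data + "12345");
            Property spatial = model.createProperty(dct, "spatial");

            // Before tagging: the place is just a string literal.
            record.addProperty(spatial, "Amsterdam");

            // After tagging: the same field is also linked to a vocabulary term (here a GeoNames URI).
            record.addProperty(spatial, model.createResource("http://sws.geonames.org/2759794/"));

            model.write(System.out, "TURTLE");
        }
    }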

Source:

It implements the following conversion rules (a conceptual sketch of two of them follows the list):

  • Create constant
  • Rename resource property
  • Rename literal property
  • Replace value
  • Sequence
  • Lookup person
  • Lookup place
  • Lookup term
  • Facet rename property
  • Batch
  • Use value of other path
  • Use other subject
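
In AnnoCultor these rules are written declaratively in the converter XML. Purely as a conceptual illustration (not AnnoCultor's actual syntax or API), here is a hedged Java/Jena sketch of what two of them, "Create constant" and "Rename literal property", amount to on an RDF model; the property and class names are invented for the example:

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.RDF;
    import java.util.List;

    public class RuleSketch {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            String ex = "http://example.org/ns#";

            Resource rec      = model.createResource("http://example.org/record/1");
            Property oldTitle = model.createProperty(ex, "titel"); // source field name (invented)
            Property newTitle = model.createProperty(ex, "title"); // target property (invented)
            rec.addProperty(oldTitle, "Still Life with Flowers");

            // "Create constant": attach a fixed value to every converted record.
            rec.addProperty(RDF.type, model.createResource(ex + "Artwork"));

            // "Rename literal property": re-emit each statement under the target property.
            List<Statement> oldStatements = model.listStatements(null, oldTitle, (RDFNode) null).toList();
            for (Statement s : oldStatements) {
                s.getSubject().addProperty(newTitle, s.getObject());
                model.remove(s);
            }

            model.write(System.out, "TURTLE");
        }
    }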

CRM Mapping

An approach by Doerr and company for mapping anything to CRM.
This is all about slicing, dicing and combining source paths to target paths.

Obtained from the CRM site: crm_mappings, tools.

Papers:

Tool versions (please note that older releases may have stuff that is not in later releases!)

Talend

Open source ETL framework.

  • Used extensively by Onto's LifeSci group. They also develop custom components for RDF output, semantic annotation, etc.
  • Proposed for use by FP7 SME [Bids:Linked City]
  • Used by UC Berkeley for a CollectionSpace deployment

Vaso and Deyan Peychev swear by it. Includes:

  • GUI for creating, and a framework for executing, complex ETL flows, with exception catching, document routing, etc.
  • GUI for creating mappings, eg from XML to RDF.
    I'll talk to Deyan about whether the Talend mapper can implement CRM path slicing & dicing

Resources:

  • [Talend:] space (now open to SSL and SirmaITT)
  • [LIFESKIM:Talend] intro, [Tutorial] (these may already be merged into the above space)

MINT

MINT is a data conversion toolkit used by numerous projects (Athena, Judaica, etc.) to contribute to Europeana.
Nice graphical mapper, nice demo movie, etc.

Delving

The platform includes:

  • Aggregator toolkit, including SIP Creator
    SIP means "Submission Information Package", which is a name for the metadata sets that Europeana ingests through OAI-PMH
    • Graphical Metadata Mapping and storage Tool (may be useful for Rembrandt & Cranach data conversion)
      Can source any XML. Generates the "obvious" mappings. Scriptable using Groovy and a simple domain-specific language for splitting and joining fields, etc. Excellent demo movie.
    • Metadata Repository accessed via OAI-PMH (Open Archives Initiative - Protocol for Metadata Harvesting); a minimal harvesting sketch follows this list
    • Integrated Solr/Lucene Search Engine, Open-Search API
    • Persistent Identifier Management
    • Dynamic Thumbnail Caching
    • Source XML Data Analyser
    • Data Set Upload and Remote Management
  • Web Portal
    Make digital objects discoverable by the rest of the world through a web portal with powerful search features. The portal has an integrated multilingual Content Management System, interface support for 28 European languages, and role-based User Management.
    • Web Portal
      • Simple & Advanced search
      • Summary Views as Grid or List
      • Facet-based result drill-down
      • Related Items
      • Detailed Metadata Result View
      • Interface Support for 28 European Languages
    • User Management
      • Create and manage users
      • Custom views based on user-roles
    • Integrated CMS
      • Create pages and custom content
      • Upload images
      • Create custom menus
      • Simple versioning system
    • Annotation Component for Objects, Images, Movies and Maps (donated by Austrian Institute of Technology, full integration planned in 2011)
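
Since the Metadata Repository above is exposed over OAI-PMH, any standard harvester can pull the metadata sets. As a minimal sketch, assuming a hypothetical endpoint URL and the standard oai_dc metadata prefix (the actual base URL and available prefixes depend on the installation), a first ListRecords request could look like this:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OaiHarvestSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; a real installation publishes its own base URL.
            String baseUrl = "http://example.org/oai";

            // Standard OAI-PMH ListRecords request; oai_dc is the mandatory Dublin Core prefix.
            URI request = URI.create(baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc");

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(request).GET().build(),
                          HttpResponse.BodyHandlers.ofString());

            // The response is XML with <record> elements and, for large sets,
            // a <resumptionToken> used to request the next page.
            System.out.println(response.body());
        }
    }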

CultureCloud

CultureCloud is a sort of transnational Europeana aggregator that wants to do a lot of things that would be important for us.
E.g. terminology mapping, cross-linking of objects, crowd-sourcing (many people being able to edit), Europeana publishing...

Karma

Karma is a data integration tool by USC. It enables users to quickly and easily integrate data from a variety of sources, including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs, and convert it to RDF.

  • http://www.isi.edu/integration/karma/: the Karma website is very informative, including papers and videos.
    It describes applications to biosciences, cultural heritage (Smithsonian), geo mashups, web APIs (eg Twitter).
  • It includes a nice graphical tool for creating the semantic mapping of data, and a nice, informative way of presenting it.

  • Builds the mapping semi-automatically: uses field pattern learning (based on Conditional Random Fields) and ontology graph traversal to help the user construct the mapping. (The kind of graph-shaped RDF a finished mapping produces is sketched after this list.)
  • Property domain and range definitions are very important for Karma's work. I think that CRM is a bit too abstract to be an appropriate target for Karma, but it would be interesting to try
  • Stronger semantic capabilities than Google Refine, but weaker (or no) data munging capabilities (see review below)
  • I wonder whether it can be integrated with the W3C RDB2RDF standard, and maybe with Ultrawrap.
  • Karma is a data structure mapping tool, not an individual (term) matching tool. The latest application (see below) includes term matching, but no tool support
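
To make the mapping idea concrete, here is a hand-written Java/Jena sketch of the kind of graph-shaped RDF such a mapping produces from one source row. The namespaces, class and property names are invented for the example; they are not Karma output and not the SAAM ontology:

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.RDF;

    public class KarmaStyleMappingSketch {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();

            // Invented namespaces standing in for whatever target ontology the mapping uses.
            String onto = "http://example.org/ontology#";
            String data = "http://example.org/data/";

            // One source row: objectId=o42, title="Still Life", constituentId=c123, constituentName="Jane Doe".
            Resource artwork = model.createResource(data + "object/o42");
            artwork.addProperty(RDF.type, model.createResource(onto + "Artwork"));
            artwork.addProperty(model.createProperty(onto, "title"), "Still Life");

            Resource creator = model.createResource(data + "person/c123");
            creator.addProperty(RDF.type, model.createResource(onto + "Person"));
            creator.addProperty(model.createProperty(onto, "name"), "Jane Doe");

            // The graph structure (artwork -> creator) is what the ontology-path traversal helps the user pick.
            artwork.addProperty(model.createProperty(onto, "creator"), creator);

            model.write(System.out, "TURTLE");
        }
    }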

Review of Karma Application to the Museum Domain

Recently Karma has been applied to the Museum Domain (for the Smithsonian museum). A nice infographic:

Dominic got a preprint:
Connecting the Smithsonian American Art Museum to the Linked Data Cloud (ESWC 2013).pdf

Here's a brief review of that paper by Vlado:

  • Smithsonian's Gallery Systems TMS installation has 100 tables, but only 8 tables are mapped: those that drive the museum Web site. So it's NOT a complex mapping task
  • Not a large collection: 41k objects, 8k authors, 44k total terms
  • Map to own ontology based on EDM (not a very complex model).
    • Why did you need your own ontology? You can attach extra properties to EDM classes, without introducing subclasses.
    • Does not map to full EDM representation, eg Proxies are missing (see Fig.1)
    • "EDM and CIDOC CRM: both are large and complex ontologies, but neither fully covers the data that we need to publish": I see two inaccuracies here:
      • EDM is considerably simpler than CRM (although EDM events are inspired by CRM)
      • CRM is certainly adequate to represent all of the info. (Note: "constituent" means crm:Agent, so saam:constituentId would be mapped to an Agent_Appellation)
  • Also use these ontologies:
    • SKOS for classification of artworks, artist and place names
    • RDAGr2 for biographical data (same as Josh)
    • schema.org for places (why not geonames?)
  • "in the complete SAAM ontology there are 407 classes, 105 data properties, 229 object properties": why so many? Fig.1 depicts only a few, you wouldn't need so many to map 8 tables, and that's a lot more than CRM
    • Ok, I think I can guess the reason. That's the sum of entities (classes and properties) in all used ontologies. But the particular mapping uses only a few. In fact it's typical in an ontology engineering task that you'd bring in a large number of entities, but use relatively few. So I think Karma needs a "subsetting" function so the user can let it know which entities are relevant (consider "shopping basket" in NIEM)
  • "the community can benefit from guidance on vocabularies to represent data": that's true
  • "Challenge 1: Data preparation... We addressed these data preparation tasks before modeling them in Karma by writing scripts in Java": yes, very often in the real world you need to split, pattern-match or concatenate. IMHO these are first-class tasks just like semantic modeling. Tool support for them is also needed, eg as Google Refine provides.
    • "Lesson 3: The data preparation/data mapping split is effective": in more complex situations the data munging depends on the meaning of other data that's already semantically mapped, therefore such split is not always easy. That's why GUI tools sometimes hit a limitation and you need to "escape" into a programming model/language
    • "RDF mapping tools (including Karma) lack the needed expressivity": languages like XQuery and XSPARQL have it
  • "Lesson 4: Property domain and range definitions are important": indeed! I think that CRM is a bit too abstract to be an appropriate target for Karma, but it would be interesting to try
  • "3 Linking to External Resources" describes a quite simple approach of matching people by name and life dates. It uses simple/standard comparison metrics and combination methods (I am pretty sure SILK supports these). It shows great F scores on a small set. There is no tool support.
    • Maps to owl:sameAs triples. Would be interesting to hear your thoughts on the "skos:exactMatch or owl:sameAs" question discussed on answers.semanticweb.com
  • "4 Curating the RDF and the Links": PROV info is recorded about the mapping links (eg mapping confidence/score, who/when verified) and displayed in a GUI tool. SILK has a similar tool, and positive/negative examples are used for machine learning.
  • "5 Related Work" needs to be elaborated, and made more objective
    • eg there's also the Polish National Digital Museum aggregation. Both LODAC and the Polish aggregation use OWLIM, which is why we know about them. I'm sure there are more...
    • "we have performed significantly more linking of data than any of these previous efforts": that is not true IMHO. Check out Europeana Enrichment (part of the Europeana dataset) that maps entities from 20M CH objects to dbPedia, Geonames and a subject thesaurus. I'm not sure how comprehensive is that enrichment, but the volume is much bigger
    • "We tried to use Silk on this project, but we found it extremely difficult to write a set of matching rules that produced high quality matches": I find that hard to believe, but would be very interested to try it if it proves a weakness in SILK. If you publish the Smithsonian thesaurus data, I'll try it out
    • "[an approach that] deals well with missing values and takes into account the discriminability of the attribute values in making a determination of the likelihood of a match": SILK is open source and has an extensible Rules language, couldn't these needs/features be added to SILK?

TODO:

  • read Song, D., Heflin, J.: Domain-independent entity coreference for linking ontology
    instances. ACM Journal of Data and Information Quality (ACM JDIQ) (2012)
  • check out EverythingIsConnected.be