Ontotext Experience
CTO Research
Our CTO group researched such tools in May-Aug 2012, investing considerable time (25 person-days)
- conf page: [Ontotext:CTO Link Discovery] (you may have no access), includes presentation, relevant parts copied below
- jira task: ONTOTECH-207 (you may have no access)
- overview of SILK, LIMES, FEBRL.
- In their opinion SILK is clearly the best
- practical experiments with SILK, including:
- DBpedia/Geonames: cities, museums
- DBpedia/MusicBrainz: music bands
- DrugBank/LinkedCT: drugs vs interventions
- SILK test installation is still available
Ontotext IdRF
[IdRF:]: implemented by Milena Yankova as part of her thesis.
Used for a European Skills, Competences and Occupations classification (ESCO) mapping pilot.
Notable Tools
Research this list
- Also see [DOM:CLAS#CLAS and ontology matching]
Note: Karma is not a terminology matching tool. It's a data mapping tool, see review: Data Migration and Ingestion Tools#Karma
SILK
Silk - A Link Discovery Framework for the Web of Data
"A tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web."
Mapping framework for mapping LOD entities, terminologies etc. Includes a workbench, matching engine, server, MapReduce implementation. Excellent scalability, includes distributed processing option.
- Output - alignment of entities, connected by owl:sameAs or any other defined predicate
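To illustrate the output described above, here is a minimal sketch that serialises matched entity pairs as N-Triples using owl:sameAs (or any other predicate). The function name `to_ntriples` and the sample URIs are assumptions for illustration only; this is not Silk's own API or output code.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def to_ntriples(links, predicate=OWL_SAME_AS):
    """Serialise (source_uri, target_uri) pairs as N-Triples lines."""
    return "\n".join(f"<{s}> <{predicate}> <{t}> ." for s, t in links)

# Hypothetical matched pair, for illustration only
links = [("http://dbpedia.org/resource/Berlin", "http://sws.geonames.org/2950159/")]
output = to_ntriples(links)
print(output)
```

Passing a different `predicate` (e.g. a SKOS mapping property) covers the "any other defined predicate" case.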
Silk Links
- homepage: http://www4.wiwiss.fu-berlin.de/bizer/silk
- source: http://www.assembla.com/code/silk/git/nodes
- wiki: http://www.assembla.com/spaces/silk/wiki
- user manuals: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/#manual
- discussion: http://groups.google.com/group/silk-discussion
Papers:
- A Comparison and Generalization of Blocking and Windowing Algorithms for Duplicate Detection (QDB 2009)
- Efficient Multidimensional Blocking for Link Discovery without losing Recall (WebDB 2011)
- Learning Linkage Rules using Genetic Programming
Silk Implementation
- Developed by Chris Bizer with funding by Vulcan Inc and FP7 LOD2
- Apache License, version 2.0 (ASL 2.0)
- Implemented in Scala (see Yammer discussions), a scalable JVM language
- Multi-threading support
- Source code available and recently updated
- Integration: Java API for in-process integration
- Server edition can be used to implement a reconciliation service: it maintains an internal DB of known entities and can process an input stream of new data to be reconciled
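The reconciliation idea in the last bullet can be sketched as follows. This is a toy in-memory model, not the Silk server: the class `ReconciliationIndex`, its methods and the sample identifiers are hypothetical, and matching is by exact normalised label, whereas a real service would apply linkage rules.

```python
def normalise(label):
    """Lowercase and collapse whitespace so trivially different labels compare equal."""
    return " ".join(label.lower().split())

class ReconciliationIndex:
    """In-memory stand-in for a store of known entities, keyed by normalised label."""

    def __init__(self):
        self._by_label = {}

    def add(self, entity_id, label):
        self._by_label[normalise(label)] = entity_id

    def reconcile(self, stream):
        """Yield (incoming_id, known_id or None) for each (id, label) in the stream.

        Unmatched records are added to the index so later records can match them.
        """
        for entity_id, label in stream:
            known = self._by_label.get(normalise(label))
            if known is None:
                self.add(entity_id, label)
            yield entity_id, known

index = ReconciliationIndex()
index.add("known:1", "The Beatles")
incoming = [("new:1", "the  beatles"), ("new:2", "Queen"), ("new:3", "QUEEN")]
results = list(index.reconcile(incoming))
```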
Silk Rules
Data sets are interlinked via mapping rules, written in a powerful rule language that supports:
- mappings based on similarities of property values (labels, names, dates, etc.)
- fuzzy match strategies, similarity metrics, data normalisation, aggregation on similarity metrics (min, max, average)
- restrictions on (ontology) classes/types
E.g. a rule can be: Match cities based on
- name (string comparison after lowercase transformation and with a certain Levenshtein distance tolerance),
- population (numeric with certain tolerance) and geographic location;
- combined in a certain way
Example linkage (dbpedia:City vs. geonames:P)
[Figure: silkRule.png (image not available)]
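The city rule described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Silk's rule language: the function names, the tolerances (`max_edits`, `tolerance`, `max_km`), the threshold and the sample records are all hypothetical, and "min" is chosen as the aggregation so that every comparison has to pass.

```python
import math

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def name_score(a, b, max_edits=2):
    """1.0 for identical lowercased names, falling to 0.0 beyond max_edits."""
    d = levenshtein(a.lower(), b.lower())
    return max(0.0, 1.0 - d / (max_edits + 1))

def population_score(p1, p2, tolerance=0.1):
    """1.0 when equal, 0.0 once the relative difference reaches tolerance."""
    rel = abs(p1 - p2) / max(p1, p2, 1)
    return max(0.0, 1.0 - rel / tolerance)

def geo_score(lat1, lon1, lat2, lon2, max_km=10.0):
    """Score from the haversine distance; 0.0 at max_km or further."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    km = 2 * 6371.0 * math.asin(math.sqrt(h))
    return max(0.0, 1.0 - km / max_km)

def match(a, b, threshold=0.5):
    """Return (is_link, confidence) using "min" aggregation over the three scores."""
    scores = [
        name_score(a["name"], b["name"]),
        population_score(a["population"], b["population"]),
        geo_score(a["lat"], a["lon"], b["lat"], b["lon"]),
    ]
    confidence = min(scores)
    return confidence >= threshold, confidence

# Hypothetical records, for illustration only
city_a = {"name": "Sofia", "population": 1236000, "lat": 42.6977, "lon": 23.3219}
city_b = {"name": "sofia", "population": 1200000, "lat": 42.70, "lon": 23.32}
city_c = {"name": "Plovdiv", "population": 346000, "lat": 42.1354, "lon": 24.7453}
same, _ = match(city_a, city_b)
different, _ = match(city_a, city_c)
```

Swapping `min` for `max` or an average changes how strict the combined rule is, which is the role of the aggregation step mentioned in the list above.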
Silk Rule Development
- Manual mapping (via Linkage Rule Editor) - visual interface supports rule development (drag & drop, wizards, templates)
- Linkage rule generation
- Suggests pairing candidates and the user approves/rejects suggestions
- Based on the user response the system generates linkage rules using genetic algorithms (interactive rule learning)
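A heavily simplified sketch of learning a rule from approved/rejected pairs is shown below. It uses a plain genetic algorithm over numeric weights and a threshold, which is much simpler than evolving full rule trees as in the genetic programming paper listed above. All names, parameters and the labelled data are assumptions for illustration; this is not Silk's learner.

```python
import random

def predict(rule, sims):
    """A rule is (weights, threshold); it links a pair when the weighted mean similarity passes."""
    weights, threshold = rule
    total = sum(weights) or 1.0
    return sum(w * s for w, s in zip(weights, sims)) / total >= threshold

def fitness(rule, labelled):
    """Fraction of user-labelled pairs the rule classifies correctly."""
    return sum(predict(rule, sims) == label for sims, label in labelled) / len(labelled)

def mutate(rule, rng):
    """Nudge every weight and the threshold by a small random amount, clamped to [0, 1]."""
    weights, threshold = rule
    weights = [min(1.0, max(0.0, w + rng.uniform(-0.2, 0.2))) for w in weights]
    threshold = min(1.0, max(0.0, threshold + rng.uniform(-0.1, 0.1)))
    return weights, threshold

def crossover(a, b, rng):
    """Pick each weight and the threshold from one of the two parents."""
    weights = [rng.choice(pair) for pair in zip(a[0], b[0])]
    return weights, rng.choice((a[1], b[1]))

def learn(labelled, n_features, generations=30, size=20, seed=0):
    rng = random.Random(seed)
    population = [([rng.random() for _ in range(n_features)], rng.random())
                  for _ in range(size)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, labelled), reverse=True)
        elite = population[: size // 2]  # keep the better half unchanged (elitism)
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                    for _ in range(size - len(elite))]
        population = elite + children
    return max(population, key=lambda r: fitness(r, labelled))

# Hypothetical user feedback: (similarity scores per comparison, approved?)
labelled = [((0.9, 0.8), True), ((0.95, 0.9), True),
            ((0.2, 0.3), False), ((0.1, 0.4), False)]
best = learn(labelled, n_features=2)
```

In an interactive setting the `labelled` list would grow as the user approves or rejects further suggestions, and `learn` would be re-run on the extended list.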
STITCH@CATCH
http://stitch.cs.vu.nl/: Vocabulary&Alignment Repository ("Semantic Interoperability To access Cultural Heritage")
"Web-based Repository Service for Vocabularies and Alignments in Cultural Heritage": ESWC10 paper and slides by the current Europeana scientific advisor
A vocabulary server including:
- conversion to SKOS
- alignment (mapping)
- access by Web services, SOAP, JSON...
- annotation suggestions, especially for multi-lingual mapping. "subjected a corpus of 250.000 dually indexed books of the National Library of the Netherlands (KB) to an instance-based method to derive an alignment between two KOSs, the Brinkman thesaurus and the Biblion one" (KB is the host organization for Europeana)
- numerous demos
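The instance-based method quoted above can be illustrated with a small sketch: two concepts from different KOSs are aligned when they annotate largely the same books. The Jaccard overlap used here is an assumption (the quote does not say which measure was used), and the function name, threshold and concept labels are hypothetical rather than taken from the Brinkman or Biblion thesauri.

```python
from collections import defaultdict
from itertools import product

def instance_based_alignment(books, min_jaccard=0.5):
    """books: iterable of (concepts_from_kos_a, concepts_from_kos_b), one pair per book."""
    books_a, books_b = defaultdict(set), defaultdict(set)
    for book_id, (concepts_a, concepts_b) in enumerate(books):
        for c in concepts_a:
            books_a[c].add(book_id)
        for c in concepts_b:
            books_b[c].add(book_id)
    alignment = []
    for (a, ia), (b, ib) in product(books_a.items(), books_b.items()):
        jaccard = len(ia & ib) / len(ia | ib)  # shared books / all books using either concept
        if jaccard >= min_jaccard:
            alignment.append((a, b, jaccard))
    return sorted(alignment, key=lambda t: -t[2])

# Hypothetical dually indexed books, for illustration only
books = [
    ({"history"}, {"geschiedenis"}),
    ({"history"}, {"geschiedenis"}),
    ({"history", "art"}, {"kunst"}),
    ({"art"}, {"kunst"}),
]
alignment = instance_based_alignment(books)
```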
Also see CATCH, STITCH for tools from related people
Amalgame
Amalgame is a tool for semi-automatic vocabulary mapping, developed by VU Amsterdam.
Loaded with a number of vocabularies.
- http://semanticweb.cs.vu.nl/europeana/datacloud: Auto-generated, but doesn't seem to include the complete list
- http://semanticweb.cs.vu.nl/europeana/browse/list_graphs
- You can see the number of triples. E.g. http://viaf.org/viaf-persons.ttl.gz has 50M (but the file is not available at that URL)
- Many vocabularies are split between several files. E.g. search for "iconclass", then you can learn about it from the various URLs
- http://semanticweb.cs.vu.nl/europeana/amalgame/conceptfinder: Browse the different vocabularies