
Terminology (Vocabulary) Matching draft spec by Dominic


Requirement for Terminology Matching

Author: Dominic Oldman              

Date: 25th February 2012

Version: 0.96 (Draft)

1                 Introduction


The document describes an application to be used for co-referencing terminologies, people and places. These terminologies will be provided by the projects that use ResearchSpace but may also be sourced from specialist terminology services (the exact mechanism to be decided). The projects that use ResearchSpace will be required to map their data to the CRM standard providing contextual information through their object collection. This contextual information, along with string matching techniques, will be used to provide a hybrid system of determining equivalency between two different datasets.


The overall objective of the co-referencing system is that when a researcher searches using a particular term, place or name, the search operates across all the different CRM datasets, regardless of the local terminology and vocabulary systems, and locates similar objects with similar properties.


The approach needs to cater for situations where data coming into ResearchSpace does not use popular and well known public thesauri and authority controls (or any authority control at all). Even where such authorities are used, many terms, people and places are not currently documented within them.


For example, a dataset may refer to a person with very little biographical information about their birth, death, nationality, gender, profession, etc. However, that person will be linked to records whose object information tends to provide evidence about those attributes, including the time-span in which they were active, where they worked, the types of objects they worked with and the subject areas they were interested in.


<Note: Systems like the Getty artist name authority could be easily converted to the same CRM format that we are using for BM bibliographic data. – question? Would this be sensible?>

2                 User Interface & Efficiency

The system needs to be able to present potential matches and allow the user to confirm matches as quickly as possible; facilitating this matching is its main aim. There are a number of ways by which the user can decide whether two terms or things are equivalent. In most cases these methods will be combined. They are (in no particular order):

1.      The position and context of the term within the respective thesaurus hierarchies. In cases involving basic terminology the user can quickly see, from the surrounding terminology and the broader and narrower terms, that the terms are being used in the same context. Therefore when a user is deciding whether to match two terms from different terminologies, the position of the terms in the two thesauri should be displayed clearly.

There is clear evidence in these two hierarchies that the term ‘cotton’ is the same in both thesauri.

2.      Scope notes can also provide additional clarity. In the British Museum’s materials thesaurus the term, ‘cotton’, has the scope note, “This term covers raw cotton as well as the textile.” Therefore the UI should provide any scope notes for terms that are a potential match.


3.      Alternative (non-preferred terms) and related terms can also be used to qualify preferred terms.


4.      Contextual information from the records in which the terms are used can provide substantial evidence that terms are used in the same way, and most thesauri management systems provide ‘usage’ facilities for exactly this purpose.  However, this particular manual method is the one that the system should seek to eliminate, and the user should only be asked to browse object records as a last resort. The measure of the system is that the user never has to move away from the matching screen to investigate object records manually. The algorithms used to determine a good match should be used to provide the most relevant evidence on screen.




3                 Contextual Evidence

3.1           CRM relationships

CRM datasets are mapped to a consistent set of concepts and relationships. This means that objects can be easily compared to support terminology, name and place matching. The following CRM properties, for example, may be used to determine whether two terms describing a material, say cotton, are the same.  







1.      They both use the word ‘cotton’. CRM mapping: Type of E57_Material.

2.      They both have other materials of manufacture that are the same. CRM mapping: Type of E57_Material.

3.      They are both used in the same object types. CRM mapping: E55_Type with BM scheme – object, e.g. a veil, a dress, a curtain, etc.

4.      They were produced/created by the same person.

5.      They were produced in the same location. CRM mapping: Type E12_Production. Priority: Low - Medium.

6.      They were both produced using similar processes. CRM mapping: Production type (Type E12_Production).

7.      They both have the same production technique. CRM mapping: E55_Type - In scheme technique. Priority: Medium - High.

8.      They were created from the same school or have the same style. CRM mapping: E74_Group – In scheme – school - (Type E12_Production).

9.      They were produced by the same Ethnic group. CRM mapping: E74_Group – In scheme – Ethnic - (Type E12_Production).

10.     They share a production Association type. CRM mapping: Production Association terms.

11.     They both originate from the same place. CRM mapping: Type E12_Production.

12.     They come from the same period. CRM mapping: Type E12_Production.

13.     They come from the same culture. CRM mapping: Type E12_Production.

14.     They both have a depiction of the same thing. Priority: Low - Medium.

15.     They have the same bibliography attached to them. CRM mapping: E31_Document, in scheme bibliography – isbn or title. Priority: Low - Medium.

16.     CRM mapping: P107i_is_current_or_former_member_of. Standard URI - id:thesauri/nationality/Roman.

17.     They appeared in the same exhibition. CRM mapping: rdfs:label of type exhibition (in scheme).

18.     They carry the same concept. CRM mapping: e.g. P67_refers_to an event, ethnic group or place; PX_commemorates an event; P129_is_about a subject.



The priority will be slightly different for different terminology schemes. In particular, relationships that relate specifically to a particular terminology would have greater significance. For example, if we were matching an Ethnic group then number 9 (produced by the same Ethnic group) would have more significance.


Clearly string matching is part of the process, particularly where there is little other evidence. However, it should not be primary. Take the example where different organisations use different names or titles to describe the same thing, perhaps a different alias or spelling of the name. This should not stop the system from suggesting viable matches based on contextual information. If one system uses the maiden name of a person while another uses the married name, this should not preclude a match.   There may be similar cases for concept terminology where the evidence shows clearly that, say, two objects are made of the same thing, but two different terms are used. It is possible that one system has used the wrong term!
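The point about names can be sketched as follows. This is a minimal illustration only: the evidence keys, the sample values and the helper names (`jaccard`, `context_overlap`) are assumptions made for the example and are not part of this specification or of any existing ResearchSpace interface.

```python
from difflib import SequenceMatcher


def jaccard(a: set, b: set) -> float:
    """Share of values two records have in common for one kind of evidence."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def context_overlap(rec_a: dict, rec_b: dict) -> float:
    """Average overlap across the kinds of evidence that both records supply."""
    shared = [key for key in rec_a if key in rec_b]
    if not shared:
        return 0.0
    return sum(jaccard(rec_a[key], rec_b[key]) for key in shared) / len(shared)


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity of two labels, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical evidence gathered from object records for two person entries,
# one recorded under a maiden name and one under a married name.
maiden = {"active_period": {"1880s", "1890s"}, "place": {"London"}, "technique": {"etching"}}
married = {"active_period": {"1880s", "1890s"}, "place": {"London"}, "technique": {"etching"}}

name_score = string_similarity("Mary Brown", "Mary Whitfield")
context_score = context_overlap(maiden, married)
print(name_score, context_score)
```

Here the names are only partly similar while the contextual evidence agrees completely, so a system driven primarily by context could still put the pair forward as a suggested match.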

3.2           In scheme

When comparing two datasets it will be the CRM properties and types that will provide additional evidence regarding comparison (and this may be backed up by the use of similar strings). In other words, regardless of how different organisations may have named their terminologies (if authorised and managed separately) it is the CRM mapping that determines the appropriateness of how terms, places and names are compared. However, where organisations have used SKOS concept schemes the user should be able to optionally configure the behaviour further against this classification.

The system will provide manual mapping functionality. This may be useful for smaller terminologies that have been used for typing resources with institutional terms (P2_has_type). These may be relevant to the point above. For example, the British Museum has a small number of production types that narrow the result of a query and therefore the types of example that would be relevant. These types may be available using a different set of terms in other systems.

3.3           Configuration

The system should provide a configuration screen with a list of all the relationships that can be applied, and any other matching parameters, with default settings. As more relationships are added to the application and to the matching algorithms they can be added to this configuration. The user can turn on and off these relationships depending on those considered most appropriate for the terminology matching routine they are running. This means that a user could turn off all string matching and simply rely on contextual evidence, or vice versa and rely on string equivalency, scope notes and alternative terms. CRM configuration may require another level of configuration for datasets that have slightly different CRM mappings. It would be useful to allow a profile for different situations that can be saved and reused without each user having to configure each setting individually each time.
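A saved profile of this kind might look like the sketch below. The class name `MatchingProfile`, its field names and the relationship labels are hypothetical, chosen only to illustrate switches with defaults that can be stored and reloaded.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MatchingProfile:
    """One reusable set of matching switches (all names are illustrative)."""
    name: str
    string_matching: bool = True
    scope_notes: bool = True
    alternative_terms: bool = True
    # CRM relationships used as contextual evidence, each with an on/off switch
    relationships: dict = field(default_factory=lambda: {
        "same_material": True,
        "same_object_type": True,
        "same_production_technique": True,
        "same_production_place": True,
    })

    def to_json(self) -> str:
        """Serialise the profile so it can be saved and reused."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "MatchingProfile":
        """Rebuild a profile from its saved form."""
        return cls(**json.loads(text))


# A profile that turns off all string-based evidence and relies on context
context_only = MatchingProfile(
    name="context-only",
    string_matching=False,
    scope_notes=False,
    alternative_terms=False,
)
restored = MatchingProfile.from_json(context_only.to_json())
print(restored)
```

New relationships would be added as further keys in `relationships`, so existing saved profiles remain readable.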


4                 Matching Levels

4.1           Level 1

Where string matching is used alone the only verification to determine whether in fact the terms are the same is to look at example object data that is associated with the use of the terms. In this event the number of objects that could be using the terms may be large. At a very basic level a matching application could present object information to the user until the user was able to determine whether the match was correct or incorrect.

Note: In this scenario an additional level of functionality, beyond different string matching algorithms, would be synonym matching (see WordNet).

4.2           Level 2

At the next level the system could be preconfigured to provide particular information from associated records that would most likely be relevant to determining the match rather than general object information. The presentation of this data would still be haphazard but the UI may be able to provide more focussed information that the user can page through, rather than the emphasis being on the user to find the relevant information within example object records.

4.3           Level 3

The next level is that the type of information presented at level 2 is used to determine the match in the first instance. Rather than presenting potentially useful information for a user to make a decision, the evidence provided from the records would be directly relevant because it was used by the algorithm to determine the match in the first place.

The system should be designed for the ultimate implementation of level 3.

5                 Hierarchical matching

This section assumes that the data is provided with an authorised thesaurus. This may not be the case and data may not come with any associated thesaurus or organised authority.

5.1           Exact match

The following hierarchy is from the British Museum materials thesaurus and the Library of Congress subject headings.

Here there is a match on cotton where both are spelt the same and both are used with similar object types and are used in similar production processes, etc. The result of the match would mean that where cotton is specified all the narrower terms for cotton in both thesauri would be used.

This would also be true in this example where there is an exact match but the terms are at a different level in the hierarchy. A search for linen would include narrower terms from both vocabularies and a search for Flax would also do the same.
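The effect of an exact match on searching can be sketched as below. The two small thesauri are invented for illustration and are not the actual British Museum or Library of Congress hierarchies; the function names are likewise assumptions.

```python
def narrower_closure(thesaurus: dict, term: str) -> set:
    """Return the term plus all of its narrower terms, at any depth."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(thesaurus.get(current, []))
    return found


def expand_search(term: str, source: dict, target: dict, exact_matches: dict) -> set:
    """Expand a source term with narrower terms from both thesauri."""
    expanded = {("source", t) for t in narrower_closure(source, term)}
    matched = exact_matches.get(term)
    if matched is not None:
        expanded |= {("target", t) for t in narrower_closure(target, matched)}
    return expanded


# Invented hierarchies: each term maps to its list of narrower terms
source = {"fibre": ["cotton", "linen"], "cotton": ["raw cotton"]}
target = {"Fibers": ["Cotton"], "Cotton": ["Cotton lint"]}
confirmed = {"cotton": "Cotton"}  # a confirmed exact match

result = expand_search("cotton", source, target, confirmed)
print(sorted(result))
```

Because the closure is computed from wherever the matched term sits, the same expansion works when the two terms are at different levels of their hierarchies.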

5.2           Broader match

In the example above, there is no match for ‘cotton’. However, other terms in the same leaf of the hierarchy are matched. The user should therefore have the opportunity to match cotton to the broader term, ‘Fibre’. In this case, a search for Fibre would include, ‘cotton’, but the user could only search for the word, ‘cotton’ in the ResearchSpace system if it were added to the master thesaurus or the user can search using the local terminology. 

It should be possible for ResearchSpace administrators to identify these terms and add them to the ResearchSpace master thesauri. This will ensure that the ResearchSpace master terminologies are developed with new datasets.

5.3           Use Case

The system is trying to match the materials of objects being brought into the ResearchSpace environment.  It finds objects with the term cotton in the target dataset but it finds no typographically similar term in the master dataset. However, it does find objects with similar attributes to the target and these have the material term, ‘Fibre’.  The algorithm would:

1.      Look at other terms at the same level as cotton in the target thesaurus and see if there are other correlations in the source thesaurus;

2.      See if other objects with similar attributes have a different term in the same place in the hierarchy and potentially suggest an exact match;

3.      Offer a broader match on the term above the level where some correlation has been established.
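The third step of this use case could be approximated as in the sketch below. It is a simplification under stated assumptions: each thesaurus is reduced to a hypothetical term-to-broader-term mapping, and the sibling correlations are taken from matches that have already been confirmed.

```python
from collections import Counter


def suggest_broader(term, target_parent, source_parent, confirmed):
    """Suggest a broader source term for an unmatched target term.

    target_parent and source_parent map each term to its broader term.
    confirmed maps target terms to source terms already matched.
    """
    parent = target_parent.get(term)
    # Other terms at the same level as the unmatched term
    siblings = [t for t, p in target_parent.items() if p == parent and t != term]
    # Count the broader terms of the source terms those siblings matched to
    votes = Counter(
        source_parent[confirmed[s]]
        for s in siblings
        if s in confirmed and confirmed[s] in source_parent
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Invented example data
target_parent = {"cotton": "plant fibre", "linen": "plant fibre", "hemp": "plant fibre"}
source_parent = {"Linen": "Fibre", "Hemp": "Fibre"}
confirmed = {"linen": "Linen", "hemp": "Hemp"}

suggestion = suggest_broader("cotton", target_parent, source_parent, confirmed)
print(suggestion)
```

The suggestion would still be presented to the user as a candidate broader match rather than asserted automatically.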

6                 Terminology and things without Identifiers

The target dataset may have terminology where no identifiers are provided and the term is simply part of the URI. For example: <term string>

This lack of authority may result in different spellings of the same term or the use of different terms to describe the same thing within the same dataset.

It should be possible for the same dataset to be checked for these inconsistencies and errors using the same algorithms used when comparing two different datasets. In other words it should be possible to set the source and target as the same dataset.

Where this is done the user should have the option to select the term that should be used in all circumstances. For example, where the word ‘cotton’ is sometimes spelt ‘coton’, the user can match ‘cotton’ with ‘coton’ so that all instances of ‘coton’ used in the same context (P45_consists_of) are changed to the same URI.
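A check of a dataset against itself might be sketched as below, using the standard library's `difflib`. The cutoff value and the function names are assumptions, and the sketch works on plain labels rather than URIs.

```python
from difflib import get_close_matches


def spelling_variants(terms, cutoff=0.85):
    """Map each term to other terms in the same dataset that look like variants."""
    unique = sorted(set(terms))
    variants = {}
    for term in unique:
        others = [t for t in unique if t != term]
        close = get_close_matches(term, others, n=5, cutoff=cutoff)
        if close:
            variants[term] = close
    return variants


def apply_preferred(terms, preferred, variant):
    """Replace every use of a variant with the term the user selected."""
    return [preferred if t == variant else t for t in terms]


# Hypothetical material terms used within one dataset
materials = ["cotton", "coton", "linen", "cotton", "silk"]
variants = spelling_variants(materials)
cleaned = apply_preferred(materials, preferred="cotton", variant="coton")
print(variants, cleaned)
```

The candidate pairs found this way would be shown to the user, who selects the preferred term before any replacement is made.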

The system should provide the ability for unauthorised datasets to swap out their terminology with that of the master terminology set so that the same identifiers are used within the context of the target dataset.





7                 People

In some cases the same spelling of a name will be significant; in other cases it will have less significance. For example, the name John Smith will match many John Smiths, and the user would therefore have to examine the evidence for all of them to determine which one, if any, is a match.

The number of matches and therefore the amount of evidence provided to the user may be reduced by examining the contextual evidence provided by the object. For names this may include;

  1. The date of birth
  2. The date of death
  3. Period when active
  4. Nationality
  5. Profession
  6. Things that they wrote (bibliographic evidence)
  7. Other names they had

In some cases this information will come from a biographical authority maintained by the dataset owner, or the record may be linked to an external ID, for example VIAF, Getty ULAN or ISNI. In these circumstances the system should be able to pull information from these sources to support the matching process. The UI should be able to show external evidence which is relevant to the decision on equivalency.

In other cases, the person who is being matched will not be part of these external resources and the information available within a local biographical authority will be sparse or even non-existent. In these cases the ability to use contextual information becomes paramount. 
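Reducing the number of same-name candidates with contextual evidence can be sketched as follows. The `Person` fields, the sample people and the rule that missing data never rules a candidate out are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    active_from: Optional[int] = None
    active_to: Optional[int] = None
    nationality: Optional[str] = None


def periods_overlap(a: Person, b: Person) -> bool:
    """True when the active periods overlap or either one is unknown."""
    if None in (a.active_from, a.active_to, b.active_from, b.active_to):
        return True  # sparse data cannot rule a candidate out
    return a.active_from <= b.active_to and b.active_from <= a.active_to


def plausible(a: Person, b: Person) -> bool:
    """Reject a candidate only when known evidence contradicts the target."""
    if a.nationality and b.nationality and a.nationality != b.nationality:
        return False
    return periods_overlap(a, b)


def narrow(target: Person, candidates: list) -> list:
    """Keep only the candidates that the contextual evidence allows."""
    return [c for c in candidates if plausible(target, c)]


# Invented records sharing the same name
target = Person("John Smith", 1850, 1880, "British")
candidates = [
    Person("John Smith", 1600, 1640, "British"),
    Person("John Smith", 1860, 1900, "British"),
    Person("John Smith", 1855, 1875, "American"),
    Person("John Smith"),  # no biographical data at all
]
remaining = narrow(target, candidates)
print(len(remaining))
```

The candidate with no biographical data stays in the list, which is where evidence drawn from associated object records would be needed.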

8                 Types & In scheme

When comparing two datasets it will be the CRM properties and types that will provide additional evidence regarding comparison. In other words, regardless of how different organisations may have named their terminologies (if authorised and managed separately) it is the CRM mapping that determines the appropriateness of how terms, places and names are compared. However, where organisations have used SKOS concept schemes the user should be able to optionally configure the behaviour further against this classification, i.e. only map in situations where the user has determined the schemes for the source and the target that they want to concentrate on.

The system should provide manual mapping functionality. This may be useful for smaller terminologies that have been used for typing resources with institutional terms (P2_has_type). These may be relevant to the point above. For example, the British Museum has a small number of production types that narrow the result of a query and therefore the types of example that would be relevant. These types may be available using a different set of terms in other systems.

This is important for activities like production where P14_carried_out_by could be either a person or a group or school or an Ethnic Group.


9                 Precision & Recall - Scope

The first iteration of the system must follow a design which makes it amenable to further criteria and expansion of the algorithms used. The aim is to improve matching and reduce the amount of material that the user needs to read in the UI to determine the match. Therefore the UI should consider carefully the mechanism for showing evidence which may require more examples in the earlier iterations of the system compared to a fully equipped system.

10           Weighting


There should be a weighting system for the string matching and object evidence relationships used, which can be configured by the user. For general terminology matching, for example, string matching may be weighted higher than for people and place matching. A default should be defined by the system as appropriate. The system will present the evidence for the highest rated match to the user. If the user is unconvinced they should be able to call up the evidence from the next example.
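A configurable weighting of this kind might be sketched as below. The weight values are placeholders invented for the example, not proposed defaults, and the candidate scores are assumed to have been computed elsewhere.

```python
# Placeholder weights per kind of matching; real defaults are to be decided
DEFAULT_WEIGHTS = {
    "terminology": {"string": 0.6, "context": 0.4},
    "people": {"string": 0.3, "context": 0.7},
    "places": {"string": 0.2, "context": 0.8},
}


def combined_score(string_score, context_score, kind, weights=DEFAULT_WEIGHTS):
    """Weighted sum of string evidence and contextual evidence."""
    w = weights[kind]
    return w["string"] * string_score + w["context"] * context_score


def rank(candidates, kind, weights=DEFAULT_WEIGHTS):
    """Order candidates so the highest rated match is presented first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c["string"], c["context"], kind, weights),
        reverse=True,
    )


# Two hypothetical candidates with precomputed evidence scores
candidates = [
    {"term": "A", "string": 1.0, "context": 0.2},
    {"term": "B", "string": 0.4, "context": 0.9},
]
as_terminology = [c["term"] for c in rank(candidates, "terminology")]
as_people = [c["term"] for c in rank(candidates, "people")]
print(as_terminology, as_people)
```

The same two candidates are ordered differently for terminology and for people, and the second entry in each ranking is what the user would call up if unconvinced by the first.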


11           Summary Key elements


1.      The system will make use of the relationships provided by the CRM


2.      The system requires that any matches it suggests are confirmed by a user of the system. The system therefore must provide the user with the evidence that was collected to determine the match in the first instance. It should be a primary concern of the application to reduce to a minimum and even eliminate the need for the user to consult object record information outside the immediate matching screen.


3.      The algorithms should be constructed such that additional logic can be added as more is learnt about the best evidence to use.


4.      The system will always need a user of the system to confirm matches made by the system and indeed, will record the user (and the date-time) when the matched term was confirmed.


5.      The desired outcome of co-referencing should be that an identifier for a term used in one set of records should be matched with identifiers for terms from another where they are used to mean the same thing. In some cases an exact match may not be possible and only a broader match achieved.


6.      In the case of terminology that is uncontrolled and without identifiers, the system should create those identifiers.
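For the last point, creating identifiers for uncontrolled terms could be sketched as below. The base namespace is a placeholder (`example.org`), not a real ResearchSpace URI, and the normalisation rules are assumptions.

```python
import re
import unicodedata

BASE = "http://example.org/id/term/"  # placeholder namespace for illustration


def slugify(label: str) -> str:
    """Reduce a label to lowercase ASCII words joined by hyphens."""
    text = unicodedata.normalize("NFKD", label)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def mint_identifiers(labels):
    """Give each label an identifier; labels that differ only in case,
    spacing or punctuation end up sharing one."""
    return {label: BASE + slugify(label) for label in labels}


ids = mint_identifiers(["Cotton", " cotton ", "Natural fibre"])
print(ids)
```

Different spellings of the same term would still receive different identifiers here, so this step complements rather than replaces the matching process.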


12           Recursion


A recursive system is required so that resolved matches can be used as evidence for further matches. When the system has enough evidence to suggest a match between two terms, these terms will be identified, but their equivalency can only be asserted by a manual process which is recorded, dated and attributed with provenance information. Once asserted, the match can be used as further evidence for other term, place and name matches. For example, once a match for the material ‘cotton’ has been made, this provides additional information when selecting objects for matching other terms, e.g. that a person only worked with the material ‘cotton’.
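The recursive behaviour can be sketched as a loop that repeats until no new matches are asserted. The data shapes are assumptions, and the `confirm` callable simply stands in for the manual confirmation step described above.

```python
def iterate_matching(candidates, confirmed, confirm):
    """Repeat matching until no further matches are confirmed.

    candidates: dict mapping a (source, target) pair to the set of
        (source, target) pairs it relies on as evidence.
    confirmed: pairs already asserted by a user.
    confirm: callable standing in for the manual confirmation step.
    """
    confirmed = set(confirmed)
    changed = True
    while changed:
        changed = False
        for pair, needs in candidates.items():
            if pair in confirmed:
                continue
            # A pair is only suggested once all its evidence has been asserted
            if needs <= confirmed and confirm(pair):
                confirmed.add(pair)
                changed = True
    return confirmed


# Invented chain: the person match relies on the material match, and so on
candidates = {
    ("cotton", "Cotton"): set(),
    ("weaver A", "Weaver A"): {("cotton", "Cotton")},
    ("mill X", "Mill X"): {("weaver A", "Weaver A")},
    ("loom Y", "Loom Y"): {("unmatched", "Unmatched")},
}
result = iterate_matching(candidates, set(), confirm=lambda pair: True)
print(sorted(result))
```

In a real system each pass would wait on user decisions, so the loop would run as a sequence of matching sessions rather than in one call.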


Note and Question:


Let’s say that we choose a master terminology not used by any of the existing datasets in ResearchSpace. How do we match against it? And should we only choose a master set at a later stage, when we have a better mass of data?


13           Different Vocabularies – Concept terms, People and Places

13.1      Concept terms

Concept terms will differ in the extent to which they are used in the same ways within different information systems. In many cases they are described with SKOS which provides preferred, alternative and related labels for any given concept.  Semantic mapping will still provide the best mechanism for matching, for example, in trying to match the material clay with clay the following evidence may be available.


·       The same object  types (e.g. two tablets)

·       The same production techniques

·       The same production period.

·       The same ethnic groups involved

·       The same people involved in the production.

·       The same production locations. (same town, region, country, etc)

·       The same language used for inscription

·       Etc….


Building up this type of evidence may suggest that the two objects are made of clay and therefore that the term ‘clay’ from one data source means the same as, and is used in the same way as, ‘clay’ in the other data source, and that these terms can be co-referenced.


Concept terms can also lend themselves to typographical matching as a secondary form of evidence, e.g. clay is spelt the same as clay.


13.2      Place names

The use of string matching for place names is more dangerous since place names are replicated across the Earth and disambiguation is certainly required. The same type of semantic evidence can be used to deduce place names (an object of a particular type and material may have only been made in a particular location).  


[Note – if a place name thesaurus has different levels which point to disambiguation then how are these used? If object evidence points to a particular location then would comparing the surrounding context be useful, e.g. for Hockley > Essex > England?  If evidence showed that an object is likely to have originated from Hockley, and the other data source records Hockley > Essex > England, then wouldn't this be useful evidence?]


[Note that Martin recommends the use of the TGN and the Alexandria Gazetteer schemas as a way of describing name vocabularies and that a derived schema could be formulated into a CRM extension.]




Object 1 consists of ‘cotton’; Object 2 consists of ‘natural fibre’.

The evidence shows that both objects are of the same type.

They were produced in the same country (one in Yorkshire and one in Lancashire). The system may consult a geo web service to verify this.

The objects were used in the same process.


13.3      String Matching Techniques

String matching is secondary but should nevertheless be used to give different ratings in different circumstances. There should be a configuration option for turning both string and semantic matching algorithms on and off. String matching may be used as partial evidence when matching terms, including terms used for evidence. For example, appellations may be used as evidence for a person match (Emperor Augustus has the appellations Gaius Octavius, Augustus Caesar, Octavian, Octavianus, Caesar, C).


String matching may be required when little or no semantic contextual information is available.
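Using appellations as partial evidence might be sketched as below with the standard library's `difflib`. Taking the best score over all pairs of appellations is one possible choice among several, and the second record is invented.

```python
from difflib import SequenceMatcher


def best_appellation_score(names_a, names_b):
    """Highest similarity between any appellation of A and any of B."""
    return max(
        (
            SequenceMatcher(None, a.lower(), b.lower()).ratio()
            for a in names_a
            for b in names_b
        ),
        default=0.0,  # no appellations on one side means no string evidence
    )


augustus = ["Gaius Octavius", "Augustus Caesar", "Octavian", "Octavianus", "Caesar"]
other_record = ["Octavian", "Imperator Caesar Augustus"]  # hypothetical second source

score = best_appellation_score(augustus, other_record)
print(score)
```

Because both records carry the appellation Octavian, the score is 1.0 even though the preferred names differ; the result would feed into the overall rating rather than decide the match on its own.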

14           Approach

14.1      Internal

1.      The system should deal with both flat and hierarchical vocabularies.

2.      The system will use a hybrid of matching techniques but primarily using semantic based evidence. 

3.      The ResearchSpace system should have a set of ‘master’ vocabularies against which all imported vocabularies are co-referenced.

4.      The default vocabularies for tools within ResearchSpace are the master vocabularies.  

5.      A user can choose another set of vocabularies or choose other individual vocabularies.

6.      [this would make sense in many cases where projects are searching across other ResearchSpace repositories and the imported data only has a limited amount of terminology to use ]

7.      The vocabulary set is chosen by the searcher from a configuration menu. The default authorities are those designated as the ResearchSpace master set. 

8.      When a user completes a page of matching, the non-matched terms can be used in future as evidence of non-equivalency.

9.      The majority of data submitted for co-referencing is expected to be mapped to the CRM.

10. Some terminologies will be standalone and submitted with little or no contextual or semantic information beyond the scope of the authority file that they belong to.

11. Some data will contain terminology that has no controlling authority and will be inconsistent and, in some cases, erroneous (typos). The system should identify erroneous terms so that suggestions can be provided.

12. Provide a UI that supports matching decisions and commitment of those decisions quickly.

14.2      External

External services (e.g. DBpedia) can be used as additional semantic evidence.  In some cases terminology will be delivered with identifiers that are known and can be incorporated without further matching techniques.

15           Level of Matching

The aim of the system is to match equivalent terminology and/or to match terms to broader terminology so that, as a minimum, researchers can start their work from a broad starting point. For example, the object type ‘wine glass’ should attempt to match with an equivalent term but must at least establish a match with the broader term ‘glass’ in a hierarchical object name thesaurus.
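The fallback from an equivalent term to a broader one can be sketched as a walk up the hierarchy. The sketch assumes terms are compared by label only and that the incoming thesaurus is available as a hypothetical term-to-broader-term mapping.

```python
def match_or_broader(term, broader_of, master_terms):
    """Return (matched term, match type), walking up to broader terms if needed."""
    seen = set()
    current = term
    while current is not None and current not in seen:
        if current in master_terms:
            kind = "exact" if current == term else "broader"
            return current, kind
        seen.add(current)  # guards against cycles in a malformed hierarchy
        current = broader_of.get(current)
    return None, "not the same"


# Invented hierarchy and master vocabulary
broader_of = {"wine glass": "drinking glass", "drinking glass": "glass"}
master_terms = {"glass", "bottle"}

wine_glass = match_or_broader("wine glass", broader_of, master_terms)
print(wine_glass)
```

A term with no match at any level comes back as ‘not the same’, which is the point at which an administrator might consider adding it to the master thesaurus.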


16           Wireframes

16.1      Single term Matching View

16.2      Matching Entire Vocabularies – Multiple term View

The diagram should be changed to reflect the fact that no specific vocabulary is selected and that the system looks at the datasets as a whole. However, as suggested above the system could be configured to display only matches for specific schemes from a UI perspective (other matches would still be recorded for verification at a later date).

17           Functionality

17.1      Individual Linking View

This view allows a user to view and match terms individually either through manual means or with the help of a find match button. This provides the same functionality as the finding algorithms in the multiple match view but is applied term by term.

·       The user determines the source and targets.

·       The user determines the concepts that s/he wants to work with.

·       The terms / taxonomies can be arranged in alphabetical order.

·       The terms / taxonomies can additionally be arranged in matched and unmatched term order (the user can review already matched terms and change them if there is an error).

·       Clicking on any term shows previous matches in the table below including type of match and relevance.

·       Terms are highlighted in their respective structures (flat or hierarchical – if available).

·       The information that established a match is displayed.

·       This can be semantic (primary) and typographic (secondary) evidence as well as the author who confirmed the match.

·       Only an administrator, or the same user themselves, can override a match that has been confirmed by a user.

·       The equivalency suggestions are displayed in the table below the vocabularies and can be adjusted, selected and saved by the user.

·       A search facility with wildcard functions is available for both source and target vocabularies.

·       The user can change the equivalency type suggested by the system or leave it.

·       The user confirms the matches individually but can block select a number and confirm all within that selection, or confirm a whole page of match results.

·       A source term may have a number of matches in the target vocabulary.

·       The match can be exact, broader or ‘not the same’.

·       Scope note displays provide information to the user for the selected terms in each vocabulary.

·       Selecting a term in the master vocabulary then allows the user to hunt for a term to match in the target vocabulary. (A find button is provided that performs the same function as the multiple linking view but for one term).

·       The default is that there is no relationship.

·       Saved equivalencies are stored with provenance and user information (including date of assertion).
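A stored equivalency with its provenance, together with the override rule above, might be sketched as follows. The class and field names are hypothetical, and how the record is actually encoded is left open by this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

ALLOWED_TYPES = {"exact", "broader", "not the same"}


@dataclass(frozen=True)
class Equivalency:
    source_id: str
    target_id: str
    match_type: str
    confirmed_by: str
    confirmed_at: datetime
    evidence: tuple = ()


def confirm(source_id, target_id, match_type, user, evidence=()):
    """Record a user's assertion together with who made it and when."""
    if match_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown match type: {match_type!r}")
    return Equivalency(
        source_id, target_id, match_type, user,
        datetime.now(timezone.utc), tuple(evidence),
    )


def can_override(record, user, is_admin=False):
    """Only an administrator or the confirming user may change a match."""
    return is_admin or user == record.confirmed_by


record = confirm("src:cotton", "tgt:Cotton", "exact", "curator1",
                 evidence=["same object types", "same production technique"])
print(record)
```

Keeping the evidence with the record means the information that established the match can be redisplayed whenever the term is selected.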


17.2      Multiple Linking View

·       This view attempts to show the status of all the terms in relation to the target vocabulary after running a matching algorithm. The user can block select terms and save the equivalency status of the terms.

·       The matches are displayed in a long list, however the user can check their status in the hierarchy by selecting an individual term. If there is a suggested equivalency this term is displayed in the target hierarchy (or flat authority).

·       The user can sort on any of the headers (source term, equivalency status, relevance, target term).

·       A paging system exists so that the user can go through a block of suggested matches at a time.

·       A filter allows the user to show a particular status.

·       The user can see the evidence that created the suggested equivalency.

·       Scope note information should be available for a selected term and its suggested target.

18           Broader Terms

The system will group together multiple matches for a term and help determine whether a broader match is more appropriate for a particular target term.

19           Configuration

·       The system has both semantic algorithms and typographical algorithms. These can be turned on and off independently.

·       The algorithms themselves have parameters that can be turned on and off.

·       The user can turn on and off aspects of the semantic algorithm which relate to a particular type of CRM evidence.

·       The user can adjust the typographic parameters.

·       A user can refresh the matching after changing the configuration; this will only affect unconfirmed matches.

·       The user can reset the co-referencing and start from scratch. 
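The refresh behaviour can be sketched as below. The dictionary layout and the `rerun` callable, which stands in for the matching algorithms under the new configuration, are assumptions for illustration.

```python
def refresh(matches, rerun):
    """Recompute suggestions while leaving confirmed matches untouched."""
    refreshed = {}
    for term, match in matches.items():
        if match["confirmed"]:
            refreshed[term] = match  # user-confirmed matches are preserved
        else:
            refreshed[term] = {"target": rerun(term), "confirmed": False}
    return refreshed


def reset(matches):
    """Start from scratch: drop every suggestion and confirmation."""
    return {term: {"target": None, "confirmed": False} for term in matches}


# Invented state: one confirmed match and one unconfirmed suggestion
matches = {
    "cotton": {"target": "Cotton", "confirmed": True},
    "linen": {"target": "Flax", "confirmed": False},
}
after_refresh = refresh(matches, rerun=lambda term: term.capitalize())
after_reset = reset(matches)
print(after_refresh, after_reset)
```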

20           Technical requirements

To be agreed and documented in a simple technical specification

Platform should be web based.

21           Additional Work

Place names support

Extension of CRM to support place names              

People Name Support

22           Equivalency encoding 

To be explained



1.      How are semantic algorithms combined with syntactical ones?

2.      How are the equivalency matches actually encoded in the RDF?

3.      How do we implement VIAF identifiers?

4.      How is ResearchSpace affected if co-referencing for a particular dataset is low?

  1. Mar 24, 2013

    1. As properly mentioned, string vs contextual info approaches' success would vary depending on specifics of the data being matched. Where well established names are present, string comparison approach would be quite successful. For abstract terms, context would be more useful.

    2. We need to account for large data volumes. Let's say we compare persons and we intend to use string comparison primarily. Usually vocabularies would contain tens of thousands of entries with multiple labels each. Comparing each label to every other is not an option. We need to narrow down some candidates for comparison and apply the task only to them. And even so, it won't be wise to expect that this can be done in seconds. So the GUI should be designed for long tasks executed in the background.

    3. Let's say we already have a couple of different data sets in RS and we need another one mapped. Do we match against a single vocabulary or all of the existing ones? Where do we search for usages/context: in all data sets or just in one specific data set among the existing ones?

    4. A major problem is that we currently don't have another data set to test our tool with or to get ideas from. We can proceed with string matching as we can use VIAF for example. But we don't have any data set with deep info (related places, techniques, schools, exhibitions etc.)

    5. Contextual-based matching should be done in iterations - each one using mappings from previous iterations. (described above in section Recursion) For example, if we have a tool related to "South America", "Molasses" and "1920s" in one data set, in first iteration we need to match "South America" to some geo term, "molasses" to some material and "1920s" to a time period in the second data set. It is after that that we can have enough common context to match the very tool. An implication of this is that if user unmatches some of the terms (like "molasses") we may need to reconsider whether the two tools - A and B have really a common context. If they had been matched automatically, that is. Matching manually 80000 entries is not really an option, is it?

    6. If the matching tool really has to be implemented in the whole described scope, I believe 2 iterations should be planned for that.

  2. Aug 23, 2013


    Doerr, M. (1998). Effective Terminology Support for Distributed Digital Collections. In: Sixth DELOS Workshop, Preservation of Digital Information. 17-19 June, Tomar, Portugal.

    Doerr, M. (1997). Reference Information Acquisition and Coordination. In ASIS '97-Digital Collections: Implications for Users, Funders, Developers and Maintainers, Proc. of the 60th Annual Meeting of the American Society for Information Sciences. 1-6 November, (pp. 295-312). New Jersey, USA: Information Today Inc.:Medford (1-57387-048-X).

    This was a Masters Thesis I supervised:

    Doerr, M., Kritsotaki, A., & Stead, S. (2004). Which Period is it? A Methodology to Create Thesauri of Historical Periods. Computer Applications and Quantitative Methods in Archaeology Conference, CAA2004. 13-17 April, Prato, Italy.