analysis of matches, precision and recall
Intro
- AGROVOC: http://aims.fao.org/standards/agrovoc/functionalities/search
32.2k terms in up to 22 languages
- NALT: http://agclass.nal.usda.gov/
Stats for the EN version: Preferred Terms 51k; Lead-in Terms (cross-references) 41k; total 90k terms.
The paper "Thesaurus Alignment for Linked Data Publishing" describes the matching:
- Matching is done using only the EN prefLabel (FR in case of RAMEAU), i.e. simple string matching
- Uses these string measures: Hamming distance, N-Grams, Levenshtein, Jaro, Jaro-Winkler, SMOA
- Computes closeness using all the string measures, then averages the results. What is the purpose of this? The paper makes no judgement about which measure(s) are better, so in effect the better measures are just watered down by the worse ones.
- Reports 98% precision.
It's surprisingly good news that so much can be accomplished with label matching alone.
- About 42% of all terms are mapped to NALT (13.4k/31.9k = 42%)
- Homonymy example showing the need for disambiguation (taking the context into account):
Calice (Lat, plant perianth) vs Calices (FR, liturgical objects)
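To make the "watering down" concern concrete, here is a minimal sketch that averages just two of the listed measures (normalised Levenshtein and a character-bigram Dice coefficient). It is not the paper's implementation: the function names, the choice of two measures and the normalisation are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance rescaled to a 0..1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigram_sim(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def averaged_sim(a: str, b: str) -> float:
    """Unweighted mean of the measures (assumed here; no measure is preferred)."""
    measures = (levenshtein_sim, bigram_sim)
    return sum(m(a, b) for m in measures) / len(measures)


# Two label pairs that mean different things but differ by one letter.
for a, b in [("health", "wealth"), ("aviculture", "apiculture")]:
    print(a, b,
          round(levenshtein_sim(a, b), 2),
          round(bigram_sim(a, b), 2),
          round(averaged_sim(a, b), 2))
```

The average always lands between the individual scores, so it cannot separate pairs better than the most discriminating single measure would.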
RDF Files
- obtained NALT_2013_SKOS.rar (NALT_2013_SKOS.rdf) and fixed the skos namespace
- obtained agrovoc.skos.xml.zip (agrovoc_20120709.rdf)
- received AGROVOC_NALT.nt from FAO.
- The SPARQL endpoint (http://202.45.142.113:10035/repositories/agrovoc) returns only 100 rows at a time
- "wc -l AGROVOC_NALT.nt" returns 13390 total matches
- loaded these files to OWLIM
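As a cross-check on the "wc -l" count, the mapping file can also be read triple by triple. The sketch below assumes every mapping line is a plain IRI-IRI-IRI N-Triples statement; the function name and the regular expression are illustrative, not part of any tool used above.

```python
import re

# Matches a simple N-Triples line whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')
EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def read_exact_matches(lines):
    """Yield (subject, object) pairs for every skos:exactMatch line."""
    for line in lines:
        m = TRIPLE.match(line.strip())
        if m and m.group(2) == EXACT_MATCH:
            yield m.group(1), m.group(3)


# Blank lines and comments are skipped, unlike with a raw line count.
sample = [
    "<http://aims.fao.org/aos/agrovoc/c_9655> "
    "<http://www.w3.org/2004/02/skos/core#exactMatch> "
    "<http://lod.nal.usda.gov/nalt/1890> .",
    "",
    "# a comment line",
]
pairs = list(read_exact_matches(sample))
print(len(pairs), pairs)
```

For the real file, the same function can be fed an open file handle, e.g. `read_exact_matches(open("AGROVOC_NALT.nt", encoding="utf-8"))`.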
Check basic matching:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX agrovoc: <http://aims.fao.org/aos/agrovoc/>
PREFIX nalt: <http://lod.nal.usda.gov/nalt/>
# FILTER(STRSTARTS(STR(?nalt),"http://lod.nal.usda.gov/nalt/")) doesn't work, not sure why
# (possibly because ?nalt is bound to the label below, while the concept URI is ?n)
select * {
  ?a skos:exactMatch ?n.
  OPTIONAL {?a skos:prefLabel ?agro. FILTER(lang(?agro)="en")}
  OPTIONAL {?n skos:prefLabel ?nalt. FILTER(lang(?nalt)="en")}
} limit 10
Correspondence
I have a couple of questions, since I'm trying to assess the Recall of the method described in the paper.
- Was this made using the approach described in this paper: Thesaurus Alignment for Linked Data Publishing (DC 2011)?
- Yes
- It seems this is the same mapping from 2011, since the number of exactMatches (13391) is very close to the one quoted in the paper.
- Yes
- Do you have any info about the growth of NALT?
The paper says "30.3k concepts" while the NALT site says "51k Preferred Terms"
If you rerun the matcher now, do you think it'd find significantly more matches?
- We haven’t done a lot of matching in the past 10 months or so. We have been discussing this problem. As we don’t have an alignment management tool (we hope to get one into VocBench this year), it is not so easy to maintain the matches across evolving thesauri.
Manual Recall Estimation
The paper describes that about 42% of AGROVOC terms are mapped to NALT (13.4k/31.9k = 42%). This seemed a bit low for two thesauri that are dedicated to agriculture.
- I've tried manual matching of 25 concepts from each thesaurus (the first descriptor term from each letter of the alphabet),
and I get 12% matches
- The matching ratio of 42% is significantly better (3.5x), which is quite interesting (i.e. I can't explain it)
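The arithmetic behind this comparison can be checked with a few lines. The Wilson score interval is an addition here, not something used in the paper or the manual check; it assumes a random sample, which "first term per letter" is not, so it should be read only as a rough indication of how much a 25-item sample can vary.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - margin, centre + margin


paper_ratio = 13.4 / 31.9   # mapped AGROVOC terms according to the paper
manual_ratio = 3 / 25       # manual sample: 3 matches out of 25
low, high = wilson_interval(3, 25)

print(f"paper:  {paper_ratio:.0%}")
print(f"manual: {manual_ratio:.0%} (interval {low:.0%} to {high:.0%})")
print(f"factor: {paper_ratio / manual_ratio:.1f}x")
```

Under these assumptions the interval runs from about 4% to about 30%, so even allowing for the small sample the manual estimate stays below the 42% the paper reports.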
From AGROVOC to NALT
For each letter, take the first term with status=Descriptor
N | AGROVOC | NALT | Note |
1 | Aaptosyax grypus | – | |
2 | B-lymphocytes | – | |
3 | C3 plants | C3 plants | |
4 | Dabs (UF Yellowtail flounder, a fish) | – | |
5 | eagles | – | |
6 | F1 hybrids | – | |
7 | GABA | – | |
8 | habitat improvement | – | there is habitat conservation (and alts: preservation, protection, restoration) |
9 | IAA | – | |
10 | Jacaranda | – | |
11 | Kabatiella | – | |
12 | La Pampa | – | |
13 | Macaca | – | |
14 | NAA (Naphthylacetic acid) | – | even alt doesn't match |
15 | Oak (tree) | – | there is oak logs |
16 | Padus (Pachnaeus is before it in alphabetical order: no match) | – | |
17 | Q fever | Q fever | |
18 | Rabaulichthys altipinnis | – | |
19 | Saarland (first non-place is Sabal: no match) | – | |
20 | T-2 toxin (also Tabanidae: no match) | – | |
21 | Uaru amphiacanthoides | – | |
22 | Vaccination | vaccination | |
23 | Wadi | – | |
24 | X ray irradiation | – | |
25 | Yaks | – | |
3/25=12%
From NALT to AGROVOC
For each letter, take the first preferred term (not in italic).
I skip US-specific organizations etc., eg U.S. Cooperative Extension Service
N | NALT | AGROVOC |
1 | A-DNA | – |
2 | babassu oil | – |
3 | C3 plants | C3 plants |
4 | Daily Reference Values | – |
5 | early childhood education | – |
6 | factor VIII | – |
7 | galactosides | – |
8 | H-Y antigen | – |
9 | ice milk | – |
10 | jackfruits | Jack fruit; jackfruit (tree) |
11 | kallikreins | – |
12 | La Nina | – |
13 | macroalgae | – |
14 | nafcillin | – |
15 | oases | – |
16 | p-anisidine value | – |
17 | Q fever | – |
18 | radiation resistance | – |
19 | sacral spine | – |
20 | table wines | – |
21 | udic regimes | – |
22 | vaccination | – |
23 | waferboards | – |
24 | X-ray diffraction | – |
25 | yams | yams |
3/25=12%
Analysis and Precision
Most of the matches (95%) are trivial, meaning the two matched labels are the same (case-insensitive comparison).
Analysis details:
AGROVOC-NALT-analysis.xls
Below we analyze the non-trivial matches.
Deleted Terms
Some matches (231 = 1.7%) refer to old AGROVOC terms that have since been removed.
# select (count(*) as ?c) {   # optional wrapper to count the rows; needs its own closing brace
select * {
  ?a skos:exactMatch ?n.
  OPTIONAL {?a skos:prefLabel ?agro. FILTER(lang(?agro)="en")}
  OPTIONAL {?n skos:prefLabel ?nalt. FILTER(lang(?nalt)="en")}
  FILTER (!BOUND(?agro) || !BOUND(?nalt))
}
Eg here is one such error:
<http://aims.fao.org/aos/agrovoc/c_9655> <http://www.w3.org/2004/02/skos/core#exactMatch> <http://lod.nal.usda.gov/nalt/1890> .
- http://aims.fao.org/aos/agrovoc/c_9655 returns nothing but the exactMatch statement
- http://lod.nal.usda.gov/nalt/1890 is "Acacia farnesiana"
- this corresponds to http://aims.fao.org/aos/agrovoc/c_39 (also this)
- c_39 has one exactMatch statement, but it's to GND not to NALT (and is a correct match):
http://d-nb.info/gnd/4665983-3
- I guess c_9655 was deleted, but the mapping does not reflect that.
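The same check can be sketched outside SPARQL: flag every mapping whose AGROVOC or NALT side has no English prefLabel, then look for a live AGROVOC concept carrying the NALT label. The dictionaries below are toy data built from the example above (the c_39 label is assumed to equal the NALT one), and both function names are hypothetical.

```python
def dangling_matches(matches, agrovoc_labels, nalt_labels):
    """Return the exactMatch pairs where either side has no English prefLabel."""
    return [(a, n) for a, n in matches
            if a not in agrovoc_labels or n not in nalt_labels]


def repair_candidates(dangling, agrovoc_labels, nalt_labels):
    """For each dangling pair, list live AGROVOC concepts sharing the NALT label."""
    by_label = {}
    for uri, label in agrovoc_labels.items():
        by_label.setdefault(label.lower(), []).append(uri)
    return {a: by_label.get(nalt_labels[n].lower(), [])
            for a, n in dangling if n in nalt_labels}


# Toy data: c_9655 has no label left, c_39 is the concept that replaced it.
agrovoc_labels = {"http://aims.fao.org/aos/agrovoc/c_39": "Acacia farnesiana"}
nalt_labels = {"http://lod.nal.usda.gov/nalt/1890": "Acacia farnesiana"}
matches = [("http://aims.fao.org/aos/agrovoc/c_9655",
            "http://lod.nal.usda.gov/nalt/1890")]

dangling = dangling_matches(matches, agrovoc_labels, nalt_labels)
print(repair_candidates(dangling, agrovoc_labels, nalt_labels))
```

Any candidate found this way would still need a manual look before the mapping is rewritten.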
Non-Trivial Matches
Find nontrivial matches (the labels are different): 375 (2.8%)
select * {
  ?a skos:exactMatch ?n.
  {?a skos:prefLabel ?agro. FILTER(lang(?agro)="en")}
  {?n skos:prefLabel ?nalt. FILTER(lang(?nalt)="en")}
  FILTER (LCASE(?agro) != LCASE(?nalt))
} ORDER BY LCASE(?agro) LIMIT 100 OFFSET 0
# then OFFSET 100, OFFSET 200, OFFSET 300
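The three buckets used in this analysis (missing label, trivial, non-trivial) and their shares can be reproduced as follows. The sample rows are illustrative, and the total of 13390 is taken from the "wc -l" count; the function name is an assumption.

```python
from collections import Counter


def classify(agro_label, nalt_label):
    """Bucket one exactMatch by comparing its two English prefLabels."""
    if agro_label is None or nalt_label is None:
        return "deleted"        # one side has no label left
    if agro_label.lower() == nalt_label.lower():
        return "trivial"        # same label, ignoring case
    return "non-trivial"


rows = [
    ("Vaccination", "vaccination"),
    ("C3 plants", "C3 plants"),
    ("chitosan", "chitin"),
    (None, "Acacia farnesiana"),
]
counts = Counter(classify(a, n) for a, n in rows)

# Shares quoted in the text, recomputed from the raw counts.
total, deleted, non_trivial = 13390, 231, 375
trivial = total - deleted - non_trivial
shares = {"deleted": deleted / total,
          "non-trivial": non_trivial / total,
          "trivial": trivial / total}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```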
Precision
We find 50 wrong matches (see next section)
- 15 are due to a systemic error
- 35 are due to misspelling-tolerant string metrics (Levenshtein and Jaro-Winkler). These introduce false positives, eg
aviculture | apiculture
health | wealth
forest range | forest ranger
health care | health card
Qualite de la viande | Qualite de la vie
- The original paper reports 98% precision, i.e. 2% false positives (about 270), which were removed with maybe 20p/d of manual cleaning
- We still find that 11% of these were missed by the manual cleaning.
- Especially in biology, there are many Latin terms that are similar, but mean different things (eg genus vs species, or unrelated species)
So which is better: to allow misspelling-tolerant metrics or not?
- The Variant Spelling excel section shows 137 good matches due to such metrics
- The original paper implies 270 wrong matches
- My conclusion is that since thesaurus terms are not very likely to include misspellings, such metrics do more harm than good.
- It's better to include explicitly legitimate spelling variants (eg behaviour-behavior, programme-program) than to allow random misspellings
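The trade-off can be illustrated with a small sketch: several of the wrong matches are exactly one edit apart, so any edit-distance threshold that tolerates a misspelling accepts them too, while an explicit variant table only accepts reviewed pairs. The variant table holds just the two pairs named above; the function names and the word-by-word normalisation are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Explicit, reviewed spelling variants (only the two pairs named in the text).
VARIANTS = {"behaviour": "behavior", "programme": "program"}


def normalise(label: str) -> str:
    """Lower-case a label and map known variant spellings word by word."""
    return " ".join(VARIANTS.get(word, word) for word in label.lower().split())


def strict_match(a: str, b: str) -> bool:
    """Match only on identical labels after variant normalisation."""
    return normalise(a) == normalise(b)


false_positives = [("aviculture", "apiculture"), ("health", "wealth"),
                   ("macroclimate", "microclimate"), ("petrology", "metrology")]
for a, b in false_positives:
    print(a, b, edit_distance(a, b), strict_match(a, b))
print(strict_match("Animal behaviour", "animal behavior"))
```

The cost of the strict approach is that the variant table has to be maintained by hand, which is the price for not admitting random one-letter differences.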
Wrong Matches
AGROVOC term | NALT term | comments |
agricultural economics | agricultural economist | unrelated |
balanitis | balanites | Male genital disease vs tree species |
baphia | raphia | legume vs palm |
bidens pilosa | bidens | species vs genus. Appropriate is broadMatch not exactMatch |
birnaviridae | barnaviridae | different viruses, see http://viralzone.expasy.org |
capillaria hepatica | capillaria | species vs genus. Appropriate is broadMatch not exactMatch |
chitosan | chitin | Chitosan is produced by deacetylation of chitin |
chlamydomonadales | chlamydomonadaceae | order vs family. Appropriate is narrowMatch not exactMatch |
clostridium butyricum | clostridium acetobutylicum | some systemic error (misaligned matches?) |
clostridium pasteurianum | clostridium butyricum | " |
clostridium thermocellum | clostridium pasteurianum | " |
cofactors | clostridium thermocellum | chemical compound bound to a protein vs bacterium |
dicentrarchus | decapterus | different family: Moronidae vs Carangidae (jack) |
endrin | endria | organochloride insecticide vs (can't find in NALT?) |
fumariaceae | funariaceae | herbaceous plants vs mosses |
integrated land management | integrated weed management | unrelated |
intracellular fluid | extracellular fluid | opposite |
intraspecific hybridization | interspecific hybridization | opposite |
irrigation equipment | fumigation equipment | unrelated |
jordan river | jordan | river vs country |
larix occidentalis | strix occidentalis | larch tree vs spotted owl |
macroclimate | microclimate | opposite |
percophidae | percopsidae | different order: Perciformes vs Percopsiformes |
petrology | metrology | unrelated |
portuguesa | portugal | Venezuelan region vs European country |
puccinia graminis | puccinia coronata | some systemic error (misaligned matches?) |
puccinia helianthi | puccinia graminis | " |
puccinia hordei | puccinia helianthi | " |
puccinia horiana | puccinia hordei | " |
puccinia melanocephala | puccinia horiana | " |
puccinia pelargonii zonalis | puccinia melanocephala | " |
puccinia recondita | puccinia polysora | " |
puccinia striiformis | puccinia sorghi | " |
pumping | jumping | unrelated |
pyrrhocoris | puccinia striiformis | some systemic error (misaligned matches?) |
pythium aphanidermatum | pyrrhocori | " |
pythium aphanidermatum | pyrrhocoris | " |
pythium butleri | pythium aphanidermatum | " |
radium | radio | unrelated |
raillietia | raillietina | Different phylum: Arthropoda vs Platyhelminthes |
retinoid | retina | chemical vs part of the eye |
ribosomal rna | ribosomal dna | Ribonucleic acid vs Deoxyribonucleic acid |
salts | sales | unrelated |
selenium | helenium | chemical element vs herbaceous plant |
sesamum angustifolium | solanum angustifolium | Unrelated: family Pedaliaceae vs Solanaceae |
swine vesicular disease virus | human enterovirus b | Subtype vs Species. Appropriate is broadMatch not exactMatch |
syrinx | larynx | syrinx is avian equivalent to the mammalian larynx. Appropriate is broadMatch not exactMatch |
toxocara canis | toxocara cati | dog roundworm vs feline roundworm. closeMatch |
trichothecium | trichothelium | Unrelated: class Ascomycetes vs Lecanoromycetes |
urban development | human development | vaguely related |
wast | past | unrelated |