A high-level description of the phases (sub-pipelines) involved in an example concept extraction pipeline
4. Named entity recognition and disambiguation phase
4.1. Named entity disambiguation
Using the named entity candidates discovered by the LD Gazetteer, and given the specific article context, this phase makes use of a specialized classifier to assign a "positive" or "negative" label to each candidate. As a result, the ambiguity associated with the complete set of gazetteer lookups is eliminated – at most one named entity remains per document offset, and the redundant ("negative") named entity candidates are removed.
The disambiguation mechanism relies on a set of Lucene indexes that store:
a short textual description of each candidate (based on DBpedia and Freebase abstracts);
a set of URIs representing the entities that appear in the full DBpedia article for each candidate.
Various similarity scores are computed by a specialized processing resource that accesses the indexes and evaluates how well each candidate corresponds to its context (the article content and the content stored in the indexes described above). The final classification is performed by a separate processing resource, based on these pre-computed scores and some additional features.
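The scoring and classification steps described above can be illustrated with a minimal sketch. It is not the actual component: the real implementation reads from Lucene indexes and uses a trained classifier, whereas this sketch assumes in-memory candidates, a bag-of-words cosine similarity, a Jaccard overlap of entity URIs, and a fixed threshold in place of the classifier. The names Candidate, score and disambiguate, the equal weights and the threshold value are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Candidate:
    uri: str
    start: int
    end: int
    description: str  # stands in for the stored abstract of the candidate
    related_uris: set = field(default_factory=set)  # URIs found in the candidate's article


def tokens(text):
    """Lower-case alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists (term-frequency vectors)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard overlap between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(candidate, context_text, context_uris):
    """Combine text and URI similarity; the equal weights are an assumption."""
    text_sim = cosine(tokens(candidate.description), tokens(context_text))
    uri_sim = jaccard(candidate.related_uris, context_uris)
    return 0.5 * text_sim + 0.5 * uri_sim


def disambiguate(candidates, context_text, context_uris, threshold=0.1):
    """Keep at most one candidate per (start, end) span.

    Candidates scoring below the threshold are treated as "negative";
    among the remaining ones, the best-scoring candidate per span wins.
    """
    best = {}
    for c in candidates:
        s = score(c, context_text, context_uris)
        if s < threshold:
            continue
        key = (c.start, c.end)
        if key not in best or s > best[key][0]:
            best[key] = (s, c)
    return [c for _, c in sorted(best.values(), key=lambda pair: pair[1].start)]
```

Given two candidates for the same mention of "Paris", the one whose description and related URIs overlap more with the article context is kept, and the other is discarded.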
Currently, the component supports disambiguation of named entities belonging to any of the following classes: “Person”, “Location”, “Organization”, “PeriodicalPublication”, “Event”, “RecurringEvent”, “Activity”, “AnatomicalStructure”, “Award”, “CelestialBody”, “Color”, “Currency”, “Device”, “Disease”, “Drug”, “Food”, “GovernmentType”, “Holiday”, “Ideology”, “Language”, “MeanOfTransportation”, “MusicGenre”, “ProgrammingLanguage”, “Project”, “Species”, “Work”, “Thing”.
4.2. Discovery of novel named entities
Named entities that are not available in the “LD Gazetteer” component's cache are not recognized, and are therefore not handled by the disambiguation mechanism described above. Novel entities belonging to the “Person”, “Location” and “Organization” classes are instead recognized by the PLO Tagger processing resource, which compensates for the incomplete coverage of the classifier-based tagging approach.
The results of phases 4.1 and 4.2 are combined in a way that eliminates overlaps between the annotations produced by the disambiguation classifier and those produced by the PLO Tagger. The implemented logic guarantees that disambiguated entities with a meaningful URI are preferred over the anonymous entities discovered by the PLO Tagger.
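The preference rule above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual merge logic: annotations are modeled as simple character spans, the Annotation class and the merge function are hypothetical, and a tagger annotation is dropped whenever it overlaps any disambiguated annotation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    start: int
    end: int
    type: str
    uri: Optional[str] = None  # None marks an anonymous entity from the tagger


def overlaps(a, b):
    """True if the two half-open spans [start, end) share at least one character."""
    return a.start < b.end and b.start < a.end


def merge(disambiguated, tagged):
    """Keep all disambiguated annotations; keep a tagger annotation only
    if it does not overlap any disambiguated one."""
    kept = list(disambiguated)
    for t in tagged:
        if not any(overlaps(t, d) for d in disambiguated):
            kept.append(t)
    return sorted(kept, key=lambda a: (a.start, a.end))
```

A tagger annotation covering part of an already disambiguated mention is discarded, while a tagger annotation elsewhere in the text survives as an anonymous entity.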
5. Generic entity extraction phase
This phase implements a rule-based enrichment with entities of generic type. Currently, these include:
dates (normalization logic is provided as well)
numbers
money
percentages
measurements
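As an illustration of rule-based enrichment, the sketch below extracts three of the generic types listed above (dates, percentages and money) with regular expressions and normalizes dates to ISO 8601. The patterns, the RULES table and the function names are assumptions made for this example; the actual rules of the pipeline are not shown in this document and are likely far more extensive. Numbers and measurements are omitted for brevity.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical patterns; each covers only one common surface form.
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")
RULES = [
    ("Date", DATE_RE),
    ("Percent", re.compile(r"\b\d+(?:\.\d+)?\s?%")),
    ("Money", re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d+)?")),
]


def normalize_date(match):
    """Return an ISO 8601 date string, or None for impossible dates."""
    day, month, year = match.groups()
    try:
        return date(int(year), MONTHS.index(month) + 1, int(day)).isoformat()
    except ValueError:
        return None


def extract_generic(text):
    """Apply every rule to the text and return entities sorted by offset."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            entity = {"type": label, "start": m.start(), "end": m.end(), "text": m.group()}
            if label == "Date":
                entity["normalized"] = normalize_date(m)
            found.append(entity)
    return sorted(found, key=lambda e: e["start"])
```

For a sentence such as "On 5 March 2014 shares rose 3.5% to $1,200.50.", the function returns a date entity with a normalized value, followed by a percentage and a money entity.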
6. Result consolidation phase
This phase contains rules that take into account all entity types discovered during the preceding phases in order to refine the extraction results.
The phase also executes the “Orthomatcher” processing resource, which discovers orthographical variations of the labels and aliases of person, location and organization entities. Subsequently, the trusted URIs of disambiguated entities are propagated to the novel aliases annotated by the PLO Tagger, based on the links established by the “Orthomatcher” component. At the end of the phase, instance URIs are generated for the entities that were not assigned a valid URI during the preceding phases.
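The propagation and URI generation steps can be sketched as follows. The sketch assumes that the alias matching has already produced groups of co-referring mentions (passed in as lists of mention ids); it does not reproduce the matching itself. The data layout, the function names, the example.org namespace and the hash-based URI scheme are hypothetical, and generating one URI per label is a simplification.

```python
import hashlib

BASE = "http://example.org/generated/"  # hypothetical namespace for generated URIs


def generate_uri(entity_type, label):
    """Build a deterministic URI from the entity type and its lower-cased label."""
    digest = hashlib.sha1(f"{entity_type}|{label.lower()}".encode("utf-8")).hexdigest()[:12]
    return f"{BASE}{entity_type}/{digest}"


def consolidate(mentions, match_groups):
    """Propagate trusted URIs within each group, then fill the remaining gaps.

    mentions: dict mapping a mention id to {"label", "type", "uri"};
              "uri" is None for anonymous mentions.
    match_groups: lists of mention ids linked as aliases of one entity.
    """
    for group in match_groups:
        # Take the first trusted URI found in the group, if any.
        trusted = next((mentions[i]["uri"] for i in group if mentions[i]["uri"]), None)
        if trusted:
            for i in group:
                if not mentions[i]["uri"]:
                    mentions[i]["uri"] = trusted
    # Any mention still lacking a URI receives a generated one.
    for m in mentions.values():
        if not m["uri"]:
            m["uri"] = generate_uri(m["type"], m["label"])
    return mentions
```

An anonymous alias linked to a disambiguated mention inherits that mention's URI, while an unlinked anonymous mention receives a generated URI.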
7. Relation extraction phase
This phase conducts rule-based extraction of various relationships between the atomic entities discovered at the preceding stages. Currently, the following types of relations are supported:
“Person – Role”
“Person – Location”
“Organization – Location”
“Person – Role – Organization”
“Person – Role – Location”
“Person – Organization – Location”
“Person – Role – Organization – Location”
“Organization – Organization”
“Acquisition”
“Organization – Abbreviation”
“Quotation”
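To make the idea of rule-based relation extraction concrete, the sketch below implements one plausible rule for the “Organization – Abbreviation” relation: an organization mention that is immediately followed by a parenthesized upper-case token equal to the initials of its name. The rule, the stop-word list and the function names are assumptions made for this example and are not taken from the pipeline's actual rule set.

```python
import re

SKIP_WORDS = {"of", "the", "and", "for"}  # words ignored when building initials


def initials(name):
    """Upper-case initials of the words in a name, ignoring SKIP_WORDS."""
    words = re.findall(r"[A-Za-z]+", name)
    return "".join(w[0].upper() for w in words if w.lower() not in SKIP_WORDS)


def org_abbreviations(text, organizations):
    """Find organization/abbreviation pairs.

    organizations: list of (start, end) character spans of organization
    mentions that earlier phases have already annotated.
    """
    relations = []
    for start, end in organizations:
        # The abbreviation must follow the mention directly, in parentheses.
        m = re.match(r"\s*\(([A-Z]{2,})\)", text[end:])
        if m and m.group(1) == initials(text[start:end]):
            relations.append({
                "organization": text[start:end],
                "abbreviation": m.group(1),
            })
    return relations
```

The rule deliberately rejects parenthesized tokens that do not match the initials, so that, for example, a stock exchange code following a company name is not mistaken for an abbreviation.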
8. Clean-up phase
During this phase, a final clean-up takes place: redundant intermediate annotations are removed, the document readability is improved, and the annotation sets are reorganized into the structure expected by the components that process the documents after the concept extraction pipeline has completed.