A high-level description of the phases (sub-pipelines) involved in an example concept extraction pipeline
During this phase, generic pre-processing takes place.
Most of the processing resources used during this phase are part of the GATE distribution. Several additional rules that improve the output and facilitate the subsequent tasks are also provided (e.g. rules that insert additional splits on newline characters and document elements available through the “Original markups” annotation set; rules that modify noun phrase chunks in order to improve the extraction of keyphrase candidates; rules that generate canonical forms for such noun phrases).
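The additional newline splits mentioned above can be illustrated with a minimal sketch. It is not the GATE sentence splitter or the actual rules. The function name `split_sentences` and the punctuation-based splitting are assumptions made for illustration only.

```python
import re

def split_sentences(text):
    """Split text on newlines first, then on sentence-final punctuation."""
    sentences = []
    # Extra rule (illustrative): every newline forces a split, so headings
    # and list items are never merged with the sentence that follows them.
    for line in text.split("\n"):
        # Simplified stand-in for a real splitter: break after '.', '!' or '?'.
        for part in re.split(r"(?<=[.!?])\s+", line.strip()):
            if part:
                sentences.append(part)
    return sentences
```

For example, a headline with no closing punctuation, followed by a newline and body text, comes out as its own sentence instead of being merged with the first sentence of the body.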
This phase consists of logic that generates keyphrase candidates, assigns relevance scores to the candidates, and classifies them into positive or negative instances via a specialized processing resource for supervised classification.
At the completion of this phase, positive keyphrase instances are stored in a separate set for further reference.
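The flow of this phase (generate candidates, score them, classify, keep the positives in a separate set) can be sketched as follows. The relative-frequency score and the fixed threshold are assumptions that stand in for the actual relevance scoring and the supervised classifier, neither of which is specified here. The name `extract_keyphrases` is hypothetical.

```python
from collections import Counter

def extract_keyphrases(noun_phrases, threshold=0.5):
    """noun_phrases: canonical forms of noun phrase chunks, one per occurrence."""
    counts = Counter(noun_phrases)
    total = sum(counts.values())
    # Assumed relevance score: share of all candidate occurrences in the document.
    scored = {phrase: n / total for phrase, n in counts.items()}
    # Stand-in for the supervised classifier: a fixed threshold on the score.
    positives = {phrase for phrase, score in scored.items() if score >= threshold}
    return scored, positives
```

The second return value plays the role of the separate set in which positive keyphrase instances are stored.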
At this stage, document content is enriched by means of semantic gazetteers that annotate various types of entities. These gazetteer lookups are not part of the extraction results, but they provide features required by the statistical models and rules that produce the final set of entities.
Each sub-pipeline is responsible for executing a single gazetteer (ANNIE Gazetteer or LD Gazetteer) and for subsequently transferring features from the gazetteer-generated annotations to the other annotation types involved in the named entity recognition and disambiguation phases.
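As a rough illustration of the feature transfer step, the sketch below copies features from gazetteer lookups onto other annotations that cover exactly the same span. The dictionary layout, the exact-span matching rule, and the name `transfer_features` are assumptions. The actual pipeline operates on GATE annotations and may match spans differently.

```python
def transfer_features(lookups, targets):
    """Copy lookup features onto target annotations that share the same span."""
    by_span = {}
    for lookup in lookups:
        by_span.setdefault((lookup["start"], lookup["end"]), []).append(lookup)
    for target in targets:
        for lookup in by_span.get((target["start"], target["end"]), []):
            for key, value in lookup["features"].items():
                # Features already present on the target are not overwritten.
                target["features"].setdefault(key, value)
    return targets
```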
The “LD Gazetteer” component's cache is populated with instances from the following sources: DBpedia, Freebase, Geonames.
Using the named entity candidates discovered by the LD Gazetteer, and given the specific article context, this phase uses a specialized classifier to assign a “positive” or “negative” label to each candidate. As a result, the ambiguity associated with the complete set of gazetteer lookups is eliminated: at most one named entity remains per document offset, and the redundant (“negative”) named entity candidates are removed.
The disambiguation mechanism relies on a set of Lucene indexes that store:
a short textual description of each candidate (based on DBpedia and Freebase abstracts);
a set of URIs representing the entities that appear in the full DBpedia article for each candidate.
Various similarity scores are computed by a specialized processing resource that accesses the indexes and evaluates the correspondence between the candidate and the context (the article content and the content stored in the aforementioned indexes). The final classification is performed by a separate processing resource, based on these pre-computed scores and some additional features.
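The two-step idea (score each candidate against the context, then keep at most one candidate per offset) can be sketched in a simplified form. This is only an illustration under stated assumptions: Jaccard token overlap replaces the Lucene-based similarity scores, a score threshold replaces the trained classifier, and candidates are grouped by identical start and end offsets. The names `tokens`, `jaccard` and `disambiguate` are hypothetical.

```python
import re

def tokens(text):
    """Lower-cased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def disambiguate(candidates, article, min_score=0.1):
    """Keep the best-scoring candidate per (start, end) span, if any passes."""
    context = tokens(article)
    best = {}  # (start, end) -> (score, candidate)
    for cand in candidates:
        # Assumed similarity: overlap between the candidate's short
        # description and the article text.
        score = jaccard(tokens(cand["description"]), context)
        span = (cand["start"], cand["end"])
        if score >= min_score and (span not in best or score > best[span][0]):
            best[span] = (score, cand)
    ranked = sorted(best.values(), key=lambda pair: pair[1]["start"])
    return [cand for _, cand in ranked]
```

Candidates that are not returned correspond to the “negative” instances. A fuller version would also resolve candidates whose spans overlap without being identical.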
Currently, the component supports disambiguation of named entities belonging to any of the following classes: “Person”, “Location”, “Organization”, “PeriodicalPublication”, “Event”, “RecurringEvent”, “Activity”, “AnatomicalStructure”, “Award”, “CelestialBody”, “Color”, “Currency”, “Device”, “Disease”, “Drug”, “Food”, “GovernmentType”, “Holiday”, “Ideology”, “Language”, “MeanOfTransportation”, “MusicGenre”, “ProgrammingLanguage”, “Project”, “Species”, “Work”, “Thing”.
Named entities not available in the “LD Gazetteer” component's cache are not recognized, and therefore not handled by the above-described disambiguation mechanism. The recognition of novel entities belonging to the “Person”, “Location” and “Organization” classes is handled by the PLO Tagger processing resource, which compensates for the lack of perfect coverage by the classifier-based tagging approach.
The results extracted during phases 4.1 and 4.2 are combined in a way that eliminates overlaps between the annotations produced by the disambiguation classifier and those produced by the PLO Tagger. The implemented logic guarantees that disambiguated entities with a meaningful URI are preferred over the anonymous entities discovered by the PLO Tagger.
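The preference rule can be sketched as follows: every disambiguated entity is kept, and a tagger-produced entity is kept only if it does not overlap any of them. The dictionary layout and the names `overlaps` and `merge_entities` are assumptions made for illustration. The actual combination logic may apply further conditions.

```python
def overlaps(a, b):
    """True if the half-open offset ranges of two annotations intersect."""
    return a["start"] < b["end"] and b["start"] < a["end"]

def merge_entities(disambiguated, plo_tagged):
    """Combine both result sets, preferring entities that carry a URI."""
    merged = list(disambiguated)
    for entity in plo_tagged:
        # An anonymous entity survives only where no disambiguated entity
        # already covers part of the same text.
        if not any(overlaps(entity, d) for d in disambiguated):
            merged.append(entity)
    return sorted(merged, key=lambda e: e["start"])
```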
This phase implements a rule-based enrichment with entities of generic type. Currently, these include:
dates (normalization logic is provided as well)
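A minimal sketch of rule-based date extraction with normalization is given below. It covers a single assumed surface pattern (“5 March 2014”) and normalizes to ISO 8601. The patterns and output format used by the actual rules are not specified here, so both are assumptions, as is the name `find_dates`.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Assumed pattern: day, full English month name, four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTH_NAMES) + r") (\d{4})\b",
                     re.IGNORECASE)

def find_dates(text):
    """Return date annotations with offsets and an ISO 8601 normalized value."""
    found = []
    for match in DATE_RE.finditer(text):
        day, month, year = int(match.group(1)), MONTHS[match.group(2).lower()], int(match.group(3))
        try:
            normalized = date(year, month, day).isoformat()
        except ValueError:
            continue  # skip impossible dates such as "31 February 2014"
        found.append({"start": match.start(), "end": match.end(), "normalized": normalized})
    return found
```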
This phase contains rules that take into account all entity types discovered during the preceding phases in order to refine the extraction results. At its end, instance URIs are generated and assigned to entities that have no such identifiers.
This phase involves the execution of the “Orthomatcher” processing resource, which discovers orthographic variations of the labels and aliases of person, location and organization entities. Subsequently, the trusted URIs of disambiguated entities are propagated to novel aliases annotated by the PLO Tagger, based on the links established by the “Orthomatcher” component. At the end of the phase, URIs are generated for the entities that have not been assigned a valid URI during the above-described phases.
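The propagation and URI generation steps can be sketched as follows, assuming the matcher's output is available as chains of entity indexes that refer to the same real-world entity. The chain representation, the `http://example.org/instance/` namespace, the UUID-based identifiers, and the choice to share one generated URI within a chain are all assumptions, not the pipeline's actual scheme.

```python
import uuid

BASE = "http://example.org/instance/"  # hypothetical namespace

def mint_uri(entity):
    """Generate a deterministic placeholder URI from the entity type and label."""
    key = entity["type"] + "/" + entity["label"]
    return BASE + str(uuid.uuid5(uuid.NAMESPACE_URL, key))

def assign_uris(entities, chains):
    """Propagate trusted URIs along chains, then generate the missing ones."""
    for chain in chains:
        members = [entities[i] for i in chain]
        if not members:
            continue
        # Prefer a URI already assigned by disambiguation; otherwise the
        # whole chain shares one generated URI (an assumption).
        trusted = next((m["uri"] for m in members if m["uri"]), None)
        shared = trusted or mint_uri(members[0])
        for member in members:
            if not member["uri"]:
                member["uri"] = shared
    for entity in entities:
        if not entity["uri"]:
            entity["uri"] = mint_uri(entity)
    return entities
```

In this sketch, an alias such as “Merkel” that is linked to a disambiguated “Angela Merkel” entity inherits that entity's URI, while an unlinked entity with no URI receives a generated one.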
This phase conducts rule-based extraction of various relationships between the atomic entities discovered at the preceding stages. Currently, the following types of relations are supported:
“Person – Role”
“Person – Location”
“Organization – Location”
“Person – Role – Organization”
“Person – Role – Location”
“Person – Organization – Location”
“Person – Role – Organization – Location”
“Organization – Organization”
“Organization – Abbreviation”
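To illustrate what a rule over previously discovered entities might look like, the sketch below handles only the “Organization – Abbreviation” case: an organization mention immediately followed by an upper-case abbreviation in parentheses. The pattern, the dictionary layout, the relation label string, and the name `org_abbreviations` are assumptions. The actual rules are not reproduced here.

```python
import re

# Assumed surface pattern: optional whitespace, then "(ABBR)" in capitals.
ABBR_RE = re.compile(r"\s*\(([A-Z]{2,10})\)")

def org_abbreviations(text, organizations):
    """Link each organization annotation to an abbreviation that follows it."""
    relations = []
    for org in organizations:
        # Match only at the position where the organization mention ends.
        match = ABBR_RE.match(text, org["end"])
        if match:
            relations.append({
                "type": "Organization-Abbreviation",
                "organization": text[org["start"]:org["end"]],
                "abbreviation": match.group(1),
            })
    return relations
```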
During this phase, a final clean-up takes place: redundant intermediate annotations are removed, document readability is improved, and the annotation sets are reorganized into the structure expected by the components that process the documents after the concept extraction pipeline completes.