
Text can be analyzed at multiple levels including:

* categorization into specific categories such as blog post, political news, sports news, etc.
* topic extraction - recognizing important words and phrases in the text
* named entity recognition (NER) - extracting people, organizations, locations, times, amounts of money, etc.
* term extraction - linking text to term definitions organized in hierarchies or thesauri
* concept extraction - extracting well-defined rich entities from a database
* relation extraction - extracting relations between concepts

h3. Categorization

News stories are classified into one or more of the following news categories:

* National
* International
* Economy
* Politics
* Sports
* Media/culture
* Science
* Religion

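As a rough illustration of the idea (not the production classifier), categorization can be sketched as keyword scoring against categories like the ones above; the seed keywords below are invented for the example, whereas a real system would be trained on labeled stories:

```python
from collections import Counter

# Hypothetical seed keywords per category -- a real classifier would be
# trained on labeled stories rather than use a hand-made list.
CATEGORY_KEYWORDS = {
    "Economy": {"bank", "market", "shares", "inflation"},
    "Politics": {"election", "parliament", "minister", "party"},
    "Sports": {"match", "league", "goal", "tournament"},
}

def categorize(text: str) -> list[str]:
    """Return matching categories, strongest first."""
    tokens = set(text.lower().replace(".", " ").split())
    scores = Counter({c: len(tokens & kw) for c, kw in CATEGORY_KEYWORDS.items()})
    return [c for c, score in scores.most_common() if score > 0]

print(categorize("The minister won the election after the party congress"))
# -> ['Politics']
```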
We are in the process of collecting documents for categories that currently have no stories, or too few useful ones, such as Science, Religion, Tech, and Sports.

h3. Topic extraction

Topics are important phrases, or simply keywords, that are explicitly mentioned in the document. These phrases are not bound to a specific ontology or knowledge base and are extracted on the fly from the stories. As such, they represent the news better in terms of coverage, but do not bring along rich background information the way concepts do.

For instance, a news story about Mark Rutte mentions that he is the leader of the "People's Party for Freedom and Democracy (VVD)"; however, the abbreviation "VVD" is not among the available names for this party (unlike "ANP", for example), so it will not be extracted by the concept extraction algorithm. Instead, it will be extracted by the topic extraction routine. Abbreviations are used particularly in informal chats and as search words, which is why topics are very valuable for the search and navigation of content. Topic extraction also adds dynamics and increases the coverage of the system.
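A minimal sketch of on-the-fly topic extraction, under the simplifying assumption that frequency and all-caps abbreviations are good topic signals (the stopword list here is a tiny invented one):

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system uses a full one.
STOPWORDS = {"the", "of", "for", "and", "a", "is", "he", "that", "in"}

def extract_topics(text: str, top_n: int = 5) -> list[str]:
    """Rank non-stopword words by frequency; all-caps abbreviations
    (such as "VVD") are always kept, since they are common search terms."""
    tokens = re.findall(r"[A-Za-z]+", text)
    abbrevs = sorted({t for t in tokens if t.isupper() and len(t) > 1})
    counts = Counter(t.lower() for t in tokens
                     if t.lower() not in STOPWORDS and not t.isupper())
    return abbrevs + [w for w, _ in counts.most_common(top_n)]

print(extract_topics("Mark Rutte leads the VVD and the VVD backs Rutte"))
```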

h3. Term extraction

In term extraction, algorithms automatically link chunks of text to terms of a thesaurus or another representation. For instance, a scientific paper abstract is linked to terms from the INSPEC thesaurus.
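The simplest form of this linking is dictionary matching, sketched below against a toy thesaurus of invented terms (the real INSPEC thesaurus contains thousands of controlled terms and real linkers also handle inflection, synonyms and hierarchy):

```python
# A toy thesaurus with a handful of invented terms.
THESAURUS = {"neural networks", "machine learning", "signal processing"}

def link_terms(text: str) -> set[str]:
    """Return thesaurus terms that occur verbatim (case-insensitively)
    in the text."""
    lowered = text.lower()
    return {term for term in THESAURUS if term in lowered}

print(link_terms("A survey of Machine Learning methods for signal processing"))
```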

h3. Named Entity Recognition

In the Named Entity Recognition task, an algorithm seeks to locate and classify chunks of text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

NER systems take an unannotated block of text, such as this one:

_Jim bought 300 shares of Acme Corp in 2006._

and produce an annotated one, such as:

_[Jim]Person bought 300 shares of [Acme Corp]Organization in [2006]Time._


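The tagging step can be sketched with regex patterns for the easy entity types only -- times and quantities; person and organization names are assumed out of scope here, since they require a trained sequence model rather than regexes:

```python
import re

# Two toy patterns only: four-digit years and bare numbers.
PATTERNS = [
    ("Time", re.compile(r"\b(?:19|20)\d{2}\b")),
    ("Quantity", re.compile(r"\b\d+\b")),
]

def tag(text: str) -> list[tuple[str, str]]:
    """Return (chunk, label) pairs; earlier (more specific) patterns
    win when matches overlap."""
    entities, taken = [], set()
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            span = set(range(m.start(), m.end()))
            if span & taken:
                continue  # already claimed by a more specific pattern
            taken |= span
            entities.append((m.start(), m.group(), label))
    return [(chunk, label) for _, chunk, label in sorted(entities)]

print(tag("Jim bought 300 shares of Acme Corp in 2006."))
# -> [('300', 'Quantity'), ('2006', 'Time')]
```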
h3. Concept extraction

Concept extraction is a task that aims to link a chunk of text to a rich concept representation of an object in a database. These objects, which we call concepts, have additional links to other concepts and thus represent a rich graph structure. For instance, Barack Obama is linked to the Democratic Party (his political orientation), other Democrats such as Joe Biden, a place of birth, a spouse, children, education, etc.

The main problem in this task is disambiguating between multiple candidates for a single mention of their name in the text.
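One classic disambiguation heuristic is context overlap: compare the words around the mention with words associated with each candidate in the knowledge base. The candidates and context words below are invented for the example:

```python
# Invented candidate concepts for the mention "Washington", each with a
# bag of context words taken from its knowledge-base neighbourhood.
CANDIDATES = {
    "Washington_DC": {"capital", "congress", "government"},
    "George_Washington": {"president", "general", "revolution"},
    "Washington_State": {"seattle", "pacific", "northwest"},
}

def disambiguate(context: str) -> str:
    """Choose the candidate whose profile overlaps most with the text."""
    words = set(context.lower().split())
    return max(CANDIDATES, key=lambda c: len(CANDIDATES[c] & words))

print(disambiguate("the general led his troops during the revolution"))
# -> George_Washington
```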


To illustrate the levels of ambiguity and the importance of handling them, below we present the level of ambiguity based on 354 manually curated documents.


* Concept: an instance in the knowledge base, with a unique ID and multiple names/labels
* Mention: an actual occurrence of an entity name in the text

A mention of a concept is a match for any of the concept's names, including alternative names. The number of mentions with two or more competing concepts in the gold standard is 9304. The most ambiguous mention is "Paul Smith", for which there are 70 candidate concepts. The average number of candidates for a single mention is 6, which we may take as a representative number of concepts our algorithm disambiguates between on average.

There are 16401 mentions in total, and 57% of them are ambiguous.
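The reported share follows directly from the two counts:

```python
total_mentions = 16401
ambiguous_mentions = 9304  # mentions with two or more candidate concepts

print(f"{ambiguous_mentions / total_mentions:.0%}")  # -> 57%
```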

Below we show the 10 most ambiguous mentions.

Paul Smith: 70 candidates
John: 64 candidates
William: 62 candidates
Premier League: 59 candidates
De Hoop: 55 candidates
Alex: 54 candidates
Alexander: 52 candidates
Washington: 49 candidates
John Roberts: 48 candidates
Michael: 44 candidates

In the first sheet we list the first 10 alternative concepts for "Paul Smith", identified by their URIs.

In the second sheet we list the top 100 most frequently appearing ambiguous concepts in the gold standard set. It can be seen that the Netherlands and, more generally, locations such as Europe are mentioned many times throughout this set of documents, and all of these mentions carry a certain ambiguity.

h3. Relation extraction

Relations are connections between one or more concepts and other things, extracted solely from the text of the story. They should not be confused with relations that already exist in the knowledge base. Extracted relations add new knowledge and connections between concepts in the knowledge base and can be of the following types:

h5. Generic Relations (they appear as Relation in the testing UI)

Generic Relations attempt to connect at least one recognized concept to other concepts or topics without predefining relationship types.

h5. PersonCareer relations

A PersonCareer relation expresses a relation between a person and: (i) a company where he or she holds a position (e.g. John Smith from General Electric); (ii) his or her occupation (President Obama, general director Mark Thomas); (iii) a position within a location (Mayor of London); (iv) a position within an organization active in a specific location (Ben Bernanke is the chairman of the Federal Reserve, the central bank of the United States).

h5. Company relations

Company relations express a relation between a company and: (i) its location (Siemens of Germany); (ii) a daughter company (VOX Global, a subsidiary of Omnicom Group Inc (OMC)); (iii) its mother company; (iv) competitor companies (The Toulouse, France-based aircraft maker is outselling rival Bombardier Inc.); (v) its company type (bank, telecom, etc.); (vi) customer companies; (vii) collaborator companies (Bloomberg Finance L.P., a Delaware limited partnership); (viii) another company in a merger or acquisition (Amgen Inc. agreed to acquire Onyx Pharmaceuticals Inc.); (ix) the company's abbreviation; and (x) a quotation attributed to the company ("...", an allegation denied by Herbalife.).

h5. Other relation types

* Relations between two locations, such as subregion of, part of, located in, etc. (Tegucigalpa, Honduras; Buddhist kingdom of Mustang, northwest of Katmandu, Nepal)
* Relations between a person and a quotation ("There's a perception that doctors are meant to heal wounds, not bleed them," Mr. Kamara said.)
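The pattern-based flavour of relation extraction can be sketched as follows; the two regex patterns are invented for the example, whereas a real extractor relies on many patterns or a trained model, anchored on already-recognized concepts:

```python
import re

# One illustrative pattern per relation type.
RELATION_PATTERNS = {
    # "Mayor of London" -> a person's position within a location
    "PersonCareer": re.compile(r"\b(Mayor|Chairman|President) of ([A-Z][a-z]+)\b"),
    # "Siemens of Germany" -> a company and its location
    "CompanyLocation": re.compile(r"\b([A-Z][a-z]+) of (Germany|France|Japan)\b"),
}

def extract_relations(text: str) -> list[tuple[str, str, str]]:
    """Return (relation_type, argument_1, argument_2) triples."""
    found = []
    for rel_type, pattern in RELATION_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((rel_type, m.group(1), m.group(2)))
    return found

print(extract_relations("The Mayor of London praised Siemens of Germany."))
# -> [('PersonCareer', 'Mayor', 'London'), ('CompanyLocation', 'Siemens', 'Germany')]
```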