Current by Laura Tolosi
on Dec 15, 2014 16:30.

Document classification automatically assigns a document to a category chosen from a large set of predefined categories. For example, a document can be about "Sport", "Science and Technology", or "Politics". A document can also belong to several categories at once, each with a higher or lower affinity.
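
To make the idea of multiple categories with differing affinity concrete, here is a minimal sketch in Python. The affinity values, the `assign_categories` helper, and the `0.2` threshold are illustrative assumptions, not the output or settings of the actual service.

```python
def assign_categories(affinities, threshold=0.2):
    """Return (category, affinity) pairs at or above the threshold, strongest first.

    The threshold is a hypothetical cut-off chosen for this illustration.
    """
    selected = [(cat, aff) for cat, aff in affinities.items() if aff >= threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Made-up affinities for a single document.
example_affinities = {
    "Sport": 0.70,
    "Politics": 0.25,
    "Science and Technology": 0.05,
}

assigned = assign_categories(example_affinities)
print(assigned)
```

With these made-up numbers the document is assigned to both "Sport" and "Politics", with "Sport" ranked first because of its higher affinity.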

Technically, document classification is carried out by a _machine learning model_ trained on a _large corpus_ of example documents labeled according to a predefined _categorization scheme_. Category definitions are inherently _domain-specific_, because it is hard to define a scheme that covers every theme a text document can be about. We opted for a categorization that best suits news data and is also suitable for blogs, Twitter, etc. The technical documentation presents the approach in detail.
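
As a rough illustration of training a model on labeled example documents, the sketch below implements a small multinomial Naive Bayes classifier using only the Python standard library. This is a stand-in technique chosen for brevity; the text above does not say which model the service uses, and the toy corpus, function names, and tokenizer are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; a real system would do far more.
    return text.lower().split()


def train(labeled_docs):
    """Count documents and words per category from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for tok in tokenize(text):
            word_counts[category][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def affinities(model, text):
    """Return a normalized per-category score for the given text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category, n_docs in doc_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)  # category prior
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[category][tok]
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (total_words + len(vocab)))
        log_scores[category] = score
    # Convert log scores to values that sum to 1 (numerically stable softmax).
    peak = max(log_scores.values())
    exp_scores = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


# Toy corpus invented for this example; a real corpus would be far larger.
TOY_CORPUS = [
    ("the team won the match after a late goal", "Sport"),
    ("the striker scored twice in the final game", "Sport"),
    ("researchers built a faster processor chip", "Science and Technology"),
    ("the new telescope captured a distant galaxy", "Science and Technology"),
    ("parliament voted on the new election law", "Politics"),
    ("the minister announced a coalition government", "Politics"),
]

model = train(TOY_CORPUS)
scores = affinities(model, "the striker scored a goal in the match")
best = max(scores, key=scores.get)
print(best, scores)
```

The categories here are exactly those that appear in the training labels, which mirrors the point that the categorization scheme is fixed in advance and tied to the domain of the training data.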