
Categorization scheme
We chose the IPTC categorization scheme, which is well suited to news articles: https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora
To date, the categorization comprises 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace, Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.
In the next development versions, it is possible (and desirable) to extend the categorization scheme with sub-categories that are identical to, or inspired by, those of the IPTC. More refined categories describe the topic of a document more specifically, but they can make model fitting harder, because each category is then left with fewer training examples.
Corpus
A. A corpus of long abstracts from DBpedia for articles that belong to the 17 IPTC categories, as shown here: https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora. The corpus is available in EN and BG.
B. A corpus obtained from the ACM classification system. It consists of titles and abstracts of scientific papers published by ACM. In the corpus file, each row starts with CCS, which is the root category of the tree. Tab-separated records specify paths in the tree. The leaves are articles, given as title and abstract, tab-separated. Example of articles in the category:
CCS -> General and reference -> Cross-computing tools and techniques -> Metrics
Title | Abstract |
---|---|
Measured impact of crooked traceroute | Data collected using traceroute-based algorithms underpins research into the Internet's router-level topology, though it is possible to infer false links from this data... |
Semantic mining on customer survey | Business intelligence aims to support better business decision-making. Customer survey is priceless asset for intelligent business decision-making.... |
Predicting software complexity by means of evolutionary testing | One characteristic that impedes software from achieving good levels of maintainability is the increasing complexity of software... |
Runtime monitoring of software energy hotspots | GreenIT has emerged as a discipline concerned with the optimization of software solutions with regards to their energy consumption.... |
Structured merge with auto-tuning: balancing precision and performance | Software-merging techniques face the challenge of finding a balance between precision and performance... |
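The row format of corpus B described above can be read with a few lines of Python. This is a minimal sketch, not the project's actual loader: it assumes that the last two tab-separated fields of a row are the title and the abstract, and that every field before them is a category on the path from the root. The names `Article` and `parse_row` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Article(NamedTuple):
    path: List[str]  # category path, root first, e.g. ["CCS", "General and reference", ...]
    title: str
    abstract: str


def parse_row(row: str) -> Optional[Article]:
    """Parse one tab-separated row; return None if it does not look like an article row."""
    fields = [f.strip() for f in row.rstrip("\n").split("\t")]
    # Assumption: root "CCS" + at least one category + title + abstract.
    if len(fields) < 4 or fields[0] != "CCS":
        return None
    # Assumption: the last two fields are title and abstract, the rest is the path.
    *path, title, abstract = fields
    return Article(path, title, abstract)


# Illustrative row built from the example category and the first article above.
sample = "\t".join([
    "CCS",
    "General and reference",
    "Cross-computing tools and techniques",
    "Metrics",
    "Measured impact of crooked traceroute",
    "Data collected using traceroute-based algorithms underpins research...",
])

article = parse_row(sample)
print(" -> ".join(article.path))
print(article.title)
```

Keeping the full path, rather than only the leaf category, lets the same records be used for flat classification (take `path[-1]`) and for hierarchical classification (use the whole path).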
Approaches
- Multi-label large margin hierarchical perceptron, Woolam and Khan, Int. J. of Data Mining, Modelling and Management, 2008
- Large margin hierarchical classification, Dekel et al., ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, 2004
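Large margin hierarchical methods such as that of Dekel et al. score a wrong prediction by how far it lies from the true label in the category tree, so that confusing two sibling categories costs less than confusing two distant ones. The following is a minimal sketch of such a tree-induced distance only, not of the learning algorithms themselves. It assumes each label is given as a root-first path of category names under a common root; the function name `tree_distance` and the sibling label `Evaluation` are chosen for illustration.

```python
from typing import Sequence


def tree_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of edges on the tree path between two nodes given as root-first paths."""
    # Length of the shared prefix, i.e. the depth of the lowest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Edges from each node up to the lowest common ancestor.
    return (len(a) - common) + (len(b) - common)


metrics = ["CCS", "General and reference", "Cross-computing tools and techniques", "Metrics"]
# Illustrative sibling label under the same parent category.
sibling = ["CCS", "General and reference", "Cross-computing tools and techniques", "Evaluation"]

print(tree_distance(metrics, metrics))  # 0: identical labels
print(tree_distance(metrics, sibling))  # 2: one edge up to the parent, one edge down
print(tree_distance(metrics, ["CCS"]))  # 3: three edges up to the root
```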
Model
A pipeline based on the DBpedia articles corpus (A) is already available at S4: http://docs.s4.ontotext.com/display/S4docs/News+Classifier