h2. Categorization scheme

We chose the [IPTC |] categorization scheme, which is well suited to news articles.

To date, the categorization comprises 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace, Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.

For the next development versions, it is possible (and desirable) to extend the categorization scheme by adding sub-categories that are identical to, or inspired by, those of the IPTC. More refined categories allow a more specific description of a document's topic, but they can make model fitting harder.
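
As a rough illustration of how sub-categories could be attached to the 17 topics, the sketch below stores the scheme as a child-to-parent map and expands a label into its full path from the top-level topic. The sub-category names ({{Sports/Soccer}}, {{Sports/Tennis}}, {{Health/Disease}}) and the helper names {{parent_map}} and {{with_ancestors}} are hypothetical and chosen only for this example. They are not taken from the IPTC scheme or from the project code.

```python
# The 17 top-level topics listed above.
TOP_LEVEL = [
    "Arts_Culture_Entertainment", "Conflicts_War_Peace", "Crime_Law_Justice",
    "Disaster_Accident", "Economy_Business_Finance", "Education",
    "Environment", "Health", "Human_Interest", "Labor", "Lifestyle_Leisure",
    "Politics", "Religion_Belief", "Science_Technology", "Society",
    "Sports", "Weather",
]

# Hypothetical sub-categories, for illustration only.
SUBCATEGORIES = {
    "Sports": ["Sports/Soccer", "Sports/Tennis"],
    "Health": ["Health/Disease"],
}


def parent_map(top_level, subcategories):
    """Return a dict mapping every category to its parent (None for top level)."""
    parents = {name: None for name in top_level}
    for parent, children in subcategories.items():
        if parent not in parents:
            raise KeyError(f"unknown parent category: {parent}")
        for child in children:
            parents[child] = parent
    return parents


def with_ancestors(label, parents):
    """Return the path from the top-level topic down to `label`, inclusive."""
    path = []
    while label is not None:
        path.append(label)
        label = parents[label]
    return path[::-1]


PARENTS = parent_map(TOP_LEVEL, SUBCATEGORIES)
print(with_ancestors("Sports/Soccer", PARENTS))
```

Expanding a fine-grained label into its ancestors keeps the broad topic available as a fallback when a sub-category has too few training documents, which is one way to soften the model-fitting problem mentioned above.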

h2. Corpus

One corpus has been obtained from the [ACM classification system|]. It consists of titles and abstracts of scientific papers published by ACM. The file is available [here|^Links.txt].

h2. Approaches
* [Multi-label large margin hierarchical perceptron|^MultiLabelHierarchicalPerceptron.pdf], Woolam and Khan, Int. J. of Data Mining, Modelling and Management, 2008
* [Large margin hierarchical classification|^DekelKeSi04.pdf], Dekel et al., ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, 2004

h2. Model