
{toc}
{attachments}

h2. Categorization scheme

We chose the [IPTC|http://show.newscodes.org/index.html?newscodes=medtop&lang=en-GB] categorization scheme, which is suitable for news articles.
https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora

To date, the categorization comprises 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace, Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.

For the next development versions, it is possible (and desirable) to extend the categorization scheme by adding sub-categories that are identical to, or inspired by, those of the IPTC. More refined categories can describe the topic of a document more specifically, but can raise problems with model fitting.
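
As an illustration of how such an extension could be represented, the following is a minimal sketch and not the project's actual data model. The 17 top-level names are taken from the list above; the {{PARENT}} mapping, the sub-category names (deliberate placeholders, not real IPTC codes) and the helper functions {{path_to_root}} and {{expand_labels}} are assumptions introduced only for this example.

```python
# The 17 broad topics listed on this page.
TOP_LEVEL = (
    "Arts_Culture_Entertainment", "Conflicts_War_Peace", "Crime_Law_Justice",
    "Disaster_Accident", "Economy_Business_Finance", "Education",
    "Environment", "Health", "Human_Interest", "Labor", "Lifestyle_Leisure",
    "Politics", "Religion_Belief", "Science_Technology", "Society",
    "Sports", "Weather",
)

# Hypothetical child -> parent mapping. The sub-category names are
# placeholders for illustration only, not real IPTC sub-categories.
PARENT = {
    "Sports/Placeholder_Sub_A": "Sports",
    "Sports/Placeholder_Sub_B": "Sports",
    "Health/Placeholder_Sub_A": "Health",
}


def path_to_root(label):
    """Return the label followed by its ancestors, ending at a top-level topic."""
    if label not in PARENT and label not in TOP_LEVEL:
        raise ValueError(f"unknown category: {label}")
    path = [label]
    # Walk upwards until a label with no parent (a top-level topic) is reached.
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def expand_labels(labels):
    """Add every ancestor of each label, so a document tagged with a
    sub-category also counts as an example of its broad topic."""
    expanded = set()
    for label in labels:
        expanded.update(path_to_root(label))
    return expanded
```

With this layout, a document tagged only with a sub-category still contributes to its broad topic, which is one possible way to keep the existing 17-topic labels usable while finer categories are introduced.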

h2. Corpus

h2. Approaches
* [Multi-label large margin hierarchical perceptron|^MultiLabelHierarchicalPerceptron.pdf], Woolam and Khan, Int. J. of Data Mining, Modelling and Management, 2008
* [Large margin hierarchical classification|^DekelKeSi04.pdf], Dekel et al., ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, 2004


h2. Model