We chose the IPTC categorization scheme, suitable for news articles. https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora
To date, the categorizations comprizes 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace,Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.
For the next develpment versions, it is possible (and desired) to extend the categorization scheme by appending sub-categories, identical or inspired by the IPTC. More refined categories can result in a more specific description of the topic of the document, but can raise problems with model fitting.
One corpus has been obtained form the ACM classification system
- Multi-label large margin hierarchical perceptron, Woolam and Khan, Int. J. of Data Mining, Modelling and Management, 2008
- Large margin hierarchical classification, Dekel et al, ICML '04 Proceedings of the twenty-first international conference on Machine learning