We chose the IPTC categorization scheme, suitable for news articles. https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora
To date, the categorization comprises 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace, Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.
In future development versions, it is possible (and desirable) to extend the categorization scheme by adding sub-categories that are identical to, or inspired by, those of the IPTC. More refined categories can describe the topic of a document more specifically, but they can also cause problems with model fitting.
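As an illustration only, the 17 top-level labels could be kept in a single constant so that training data and classifier output are checked against the same list. This is a minimal sketch, not part of the described pipeline: the names `TOP_LEVEL_TOPICS` and `split_label`, and the `/` separator for possible future sub-categories, are hypothetical.

```python
# The 17 top-level topics, spelled exactly as listed in the text above.
TOP_LEVEL_TOPICS = (
    "Arts_Culture_Entertainment",
    "Conflicts_War_Peace",
    "Crime_Law_Justice",
    "Disaster_Accident",
    "Economy_Business_Finance",
    "Education",
    "Environment",
    "Health",
    "Human_Interest",
    "Labor",
    "Lifestyle_Leisure",
    "Politics",
    "Religion_Belief",
    "Science_Technology",
    "Society",
    "Sports",
    "Weather",
)


def split_label(label, separator="/"):
    """Split a label into (top_level, sub_category or None).

    The separator is a hypothetical convention for sub-categories;
    the text above does not prescribe one.
    Raises ValueError if the top-level part is not a known topic.
    """
    top, _, sub = label.partition(separator)
    if top not in TOP_LEVEL_TOPICS:
        raise ValueError(f"unknown top-level topic: {top!r}")
    return top, (sub or None)
```

With this convention, a plain label such as `"Sports"` yields no sub-category, while a refined label keeps its top-level topic available for coarse-grained evaluation.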
One corpus has been obtained from the ACM classification system. It consists of titles and abstracts of scientific papers published by ACM. Here is the file. Each row starts with CCS, which is the root category of the tree. Tab-separated records specify paths in the tree. The leaves are articles, given as title and abstract, tab-separated. Example of articles in the category:
CCS -> General and reference -> Cross-computing tools and techniques -> Metrics
|Measured impact of crooked traceroute||Data collected using traceroute-based algorithms underpins research into the Internet's router-level topology, though it is possible to infer false links from this data. One source of false inference is the combination of per-flow load-balancing, in which more than one path is active from a given source to destination, and classic traceroute, which varies the UDP destination port number or ICMP checksum of successive probe packets, which can cause per-flow load-balancers to treat successive packets as distinct flows and forward them along different paths. Consequently, successive probe packets can solicit responses from unconnected routers, leading to the inference of false links....|
|Semantic mining on customer survey||Business intelligence aims to support better business decision-making. Customer survey is priceless asset for intelligent business decision-making. However, business analysts usually have to read hundreds of textual comments and tabular data in survey to manually dig out the necessary information to feed business intelligence models and tools. This paper introduces a business intelligence system to solve this problem by extensively utilizing Semantic Web technologies....|
|Predicting software complexity by means of evolutionary testing||One characteristic that impedes software from achieving good levels of maintainability is the increasing complexity of software. Empirical observations have shown that typically, the more complex the software is, the bigger the test suite is. Thence, a relevant question, which originated the main research topic of our work, has raised: "Is there a way to correlate the complexity of the test cases utilized to test a software product with the complexity of the software under test?". ...|
|Runtime monitoring of software energy hotspots||GreenIT has emerged as a discipline concerned with the optimization of software solutions with regards to their energy consumption. In this domain, most of the state-of-the-art solutions concentrate on coarse-grained approaches to monitor the energy consumption of a device or a process. However, none of the existing solutions addresses in-process energy monitoring to provide in-depth analysis of a process energy consumption....|
|Structured merge with auto-tuning: balancing precision and performance||Software-merging techniques face the challenge of finding a balance between precision and performance. In practice, developers use unstructured-merge (i.e., line-based) tools, which are fast but imprecise. In academia, many approaches incorporate information on the structure of the artifacts being merged. While this increases precision in conflict detection and resolution, it can induce severe performance penalties....|
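The file layout described above can be read with a few lines of Python. This is a minimal sketch under one assumption: each row holds the category path starting with `CCS`, followed by the title and the abstract as the last two tab-separated fields. The function names `parse_row` and `group_by_category` are hypothetical, and the sample row is assembled from the example above with the abstract shortened.

```python
from collections import defaultdict


def parse_row(line):
    """Parse one tab-separated row into (path, title, abstract).

    Assumes the last two fields are title and abstract and that
    everything before them is the category path rooted at 'CCS'.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] != "CCS":
        raise ValueError("row must start with 'CCS' and end with title and abstract")
    *path, title, abstract = fields
    return tuple(path), title, abstract


def group_by_category(lines):
    """Map each category path to its list of (title, abstract) pairs."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        path, title, abstract = parse_row(line)
        groups[path].append((title, abstract))
    return dict(groups)


# Sample row built from the example category above (abstract shortened).
sample = "\t".join([
    "CCS",
    "General and reference",
    "Cross-computing tools and techniques",
    "Metrics",
    "Measured impact of crooked traceroute",
    "Data collected using traceroute-based algorithms ...",
])
corpus = group_by_category([sample])
```

Keeping the full path as a tuple key makes it possible to train on any level of the tree, for example by truncating each path to its first two or three elements.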
- Multi-label large margin hierarchical perceptron, Woolam and Khan, Int. J. of Data Mining, Modelling and Management, 2008
- Large margin hierarchical classification, Dekel et al, ICML '04 Proceedings of the twenty-first international conference on Machine learning