We chose the IPTC categorization scheme, suitable for news articles. https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora
To date, the categorizations comprizes 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace,Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.
For the next develpment versions, it is possible (and desired) to extend the categorization scheme by appending sub-categories, identical or inspired by the IPTC. More refined categories can result in a more specific description of the topic of the document, but can raise problems with model fitting.
One corpus has been obtained form the ACM classification system . It consists of titles and abstracts of scientific papers published by ACM. Here is the file. Each row starts with CCS, which is the root category of the tree. Tab-separated records specify parths in the tree. The leaves are articles, given as title and abstract, tab-separated. Example:
CCS -> General and reference -> Cross-computing tools and techniques -> Metrics
|t Measured impact of crooked traceroute||Data collected using traceroute-based algorithms underpins research into the Internet's router-level topology, though it is possible to infer false links from this data. One source of false inference is the combination of per-flow load-balancing, in which more than one path is active from a given source to destination, and classic traceroute, which varies the UDP destination port number or ICMP checksum of successive probe packets, which can cause per-flow load-balancers to treat successive packets as distinct flows and forward them along different paths. Consequently, successive probe packets can solicit responses from unconnected routers, leading to the inference of false links. This paper examines the inaccuracies induced from such false inferences, both on macroscopic and ISP topology mapping. We collected macroscopic topology data to 365k destinations, with techniques that both do and do not try to capture load balancing phenomena. We then use alias resolution techniques to infer if a measurement artifact of classic traceroute induces a false router-level link. This technique detected that 2.71% and 0.76% of the links in our UDP and ICMP graphs were falsely inferred due to the presence of load-balancing. We conclude that most per-flow load-balancing does not induce false links when macroscopic topology is inferred using classic traceroute. The effect of false links on ISP topology mapping is possibly much worse, because the degrees of a tier-1 ISP's routers derived from classic traceroute were inflated by a median factor of 2.9 as compared to those inferred with Paris traceroute.|
|t Semantic mining on customer survey||Business intelligence aims to support better business decision-making. Customer survey is priceless asset for intelligent business decision-making. However, business analysts usually have to read hundreds of textual comments and tabular data in survey to manually dig out the necessary information to feed business intelligence models and tools. This paper introduces a business intelligence system to solve this problem by extensively utilizing Semantic Web technologies. Ontology based knowledge extraction is the key to extract interesting terms and understand the logic concept of them. All knowledge extracted forms a semantic knowledge base. Flexible user queries and intelligent analysis can be easily issued to the system over the semantic data store through standard protocol. Besides resolving problems in theory, we designed a flexible, intuitive user interaction interface to explain and present the analysis result for business analysts. Through the real usage of this system, it is validated that our system gives good solution for semantic mining on customer survey for business intelligence.|
|t Predicting software complexity by means of evolutionary testing||One characteristic that impedes software from achieving good levels of maintainability is the increasing complexity of software. Empirical observations have shown that typically, the more complex the software is, the bigger the test suite is. Thence, a relevant question, which originated the main research topic of our work, has raised: "Is there a way to correlate the complexity of the test cases utilized to test a software product with the complexity of the software under test?". This work presents a new approach to infer software complexity with basis on the characteristics of automatically generated test cases. From these characteristics, we expect to create a test case profile for a software product, which will then be correlated to the complexity, as well as to other characteristics, of the software under test. This research is expected to provide developers and software architects with means to support and validate their decisions, as well as to observe the evolution of a software product during its life-cycle. Our work focuses on object-oriented software, and the corresponding test suites will be automatically generated through an emergent approach for creating test data named as Evolutionary Testing.|
|t Runtime monitoring of software energy hotspots||GreenIT has emerged as a discipline concerned with the optimization of software solutions with regards to their energy consumption. In this domain, most of the state-of-the-art solutions concentrate on coarse-grained approaches to monitor the energy consumption of a device or a process. However, none of the existing solutions addresses in-process energy monitoring to provide in-depth analysis of a process energy consumption. In this paper, we therefore report on a fine-grained runtime energy monitoring framework we developed to help developers to diagnose energy hotspots with a better accuracy than the state-of-the-art. Concretely, our approach adopts a 2-layer architecture including OS-level and process-level energy monitoring. OS-level energy monitoring estimates the energy consumption of processes according to different hardware devices (CPU, network card). Process-level energy monitoring focuses on Java-based applications and builds on OS-level energy monitoring to provide an estimation of energy consumption at the granularity of classes and methods. We argue that this per-method analysis of energy consumption provides better insights to the application in order to identify potential energy hotspots. In particular, our preliminary validation demonstrates that we can monitor energy hotspots of Jetty web servers and monitor their variations under stress scenarios.|
|t Structured merge with auto-tuning: balancing precision and performance||Software-merging techniques face the challenge of finding a balance between precision and performance. In practice, developers use unstructured-merge (i.e., line-based) tools, which are fast but imprecise. In academia, many approaches incorporate information on the structure of the artifacts being merged. While this increases precision in conflict detection and resolution, it can induce severe performance penalties. Striving for a proper balance between precision and performance, we propose a structured-merge approach with auto-tuning. In a nutshell, we tune the merge process on-line by switching between unstructured and structured merge, depending on the presence of conflicts. We implemented a corresponding merge tool for Java, called JDime. Our experiments with 8 real-world Java projects, involving 72 merge scenarios with over 17 million lines of code, demonstrate that our approach indeed hits a sweet spot: While largely maintaining a precision that is superior to the one of unstructured merge, structured merge with auto-tuning is up to 12 times faster than purely structured merge, 5 times on average.|
- Multi-label large margin hierarchical perceptron, Woolam and Khan, Int. J. of Data Mining, Modelling and Management, 2008
- Large margin hierarchical classification, Dekel et al, ICML '04 Proceedings of the twenty-first international conference on Machine learning