h2. Categorization scheme

We chose the [IPTC | http://show.newscodes.org/index.html?newscodes=medtop&lang=en-GB] categorization scheme, which is suitable for news articles; many of our competitors also base their categorization on the IPTC standard. Advantage: it describes news well. Disadvantage: it is a flat categorization with no levels, and it is too targeted at news.

To date, the categorization comprises 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace, Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.

For the next development versions, it is possible (and desirable) to extend the categorization scheme by appending sub-categories, identical to or inspired by the IPTC ones. More refined categories can give a more specific description of a document's topic, but can raise problems with model fitting.

Alternatively, for unsupervised approaches, where the categories are not specified a priori, one can use ontology terms, such as dbpedia categories, of various degrees of specificity.

h2. Corpora
A. A corpus consisting of dbpedia long abstracts of articles that belong to the 17 IPTC categories, as shown here: https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora . The corpus is available in EN and BG.

B. One corpus has been obtained from the [ACM classification system| http://dl.acm.org/ccs/ccs.cfm?CFID=514626755&CFTOKEN=67410232] . It consists of titles and abstracts of scientific papers published by ACM. [Here|^Links.txt] is the file. Each row starts with CCS, which is the root category of the tree. Tab-separated records specify paths in the tree. The leaves are articles, given as title and abstract, tab-separated. Example of articles in a category (a parsing sketch follows the table):

*CCS* -> *General and reference* -> *Cross-computing tools and techniques* -> *Metrics*
{csv:delimiter=tab}Title Abstract
Measured impact of crooked traceroute Data collected using traceroute-based algorithms underpins research into the Internet's router-level topology, though it is possible to infer false links from this data...
Semantic mining on customer survey Business intelligence aims to support better business decision-making. Customer survey is priceless asset for intelligent business decision-making....
Predicting software complexity by means of evolutionary testing One characteristic that impedes software from achieving good levels of maintainability is the increasing complexity of software...
Runtime monitoring of software energy hotspots GreenIT has emerged as a discipline concerned with the optimization of software solutions with regards to their energy consumption....
Structured merge with auto-tuning: balancing precision and performance Software-merging techniques face the challenge of finding a balance between precision and performance...
{csv}
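
As a minimal parsing sketch, and only under an assumed layout (each row fully tab-separated, first field CCS, intermediate fields forming the category path, last two fields holding the title and the abstract; Links.txt refers to the attachment above), the corpus file could be loaded like this:

{code:language=python}
# Sketch only: assumes rows of the form
# CCS <TAB> category <TAB> ... <TAB> title <TAB> abstract
import csv
from collections import defaultdict

def load_acm_corpus(path="Links.txt"):
    """Return a dict {category path tuple: [(title, abstract), ...]}."""
    articles = defaultdict(list)
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 4 or row[0] != "CCS":
                continue  # skip rows that do not match the assumed layout
            *category_path, title, abstract = row[1:]
            articles[tuple(category_path)].append((title, abstract))
    return articles

if __name__ == "__main__":
    corpus = load_acm_corpus()
    for path, docs in list(corpus.items())[:3]:
        print(" -> ".join(("CCS",) + path), "-", len(docs), "articles")
{code}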
h2. Approaches


h2. Existing models

* A pipeline based on the dbpedia articles corpus (A), already available at S4: http://docs.s4.ontotext.com/display/S4docs/News+Classifier . It is available only for EN, has low recall, and rarely outputs more than 2 categories.
* An ensemble model, which combines a gazetteer and a classifier. The classifier outputs "yes" or "no" for each category. It is based on a small number of features, up to 30. It also uses a reduced language model that hash-codes words to categories.
* An unsupervised model that works with the tagged entities in the documents and tries to find dbpedia supercategories that cover the entities well. Unwanted aspect: very broad, unspecific supercategories such as "Living_people" are output very often. The approach is promising, but some specificity score of the output categories must be introduced (see the sketch after this list).
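
One possible shape for such a specificity score (a sketch under assumptions: the candidate supercategories with the entities they cover, the per-category page counts, and the dbpedia totals are taken as given inputs, not something the current pipeline already produces) is to weight each candidate's entity coverage by an IDF-style penalty on its size, so that near-universal categories such as "Living_people" drop to the bottom of the ranking:

{code:language=python}
# Sketch: rank candidate dbpedia supercategories by coverage * specificity.
# `candidate_categories` maps category -> set of covered tagged entities;
# `category_sizes` maps category -> number of dbpedia pages in it.
# Both are assumed inputs for the purpose of this illustration.
import math

def rank_supercategories(candidate_categories, category_sizes,
                         n_entities, n_pages_total):
    ranked = []
    for cat, covered in candidate_categories.items():
        coverage = len(covered) / n_entities
        # IDF-style specificity: huge categories such as "Living_people"
        # get a score close to zero and are pushed down the list.
        specificity = math.log(n_pages_total / max(category_sizes[cat], 1))
        ranked.append((coverage * specificity, cat))
    return [cat for score, cat in sorted(ranked, reverse=True)]
{code}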

h2. Model
Features:
* We are currently using stop-word elimination, stemming, and a bigram model for feature extraction (a minimal sketch follows).
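
A minimal sketch of this feature-extraction step (the stop-word list below is only illustrative, and nltk's PorterStemmer stands in for whatever stemmer is actually used):

{code:language=python}
# Sketch: stop-word elimination, stemming, and unigram+bigram counts.
import re
from collections import Counter
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}  # illustrative
stemmer = PorterStemmer()

def extract_features(text):
    tokens = [stemmer.stem(t)
              for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    unigrams = Counter(tokens)
    bigrams = Counter(" ".join(b) for b in zip(tokens, tokens[1:]))
    return unigrams + bigrams  # sparse bag of unigram and bigram counts
{code}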
Algorithm:
* The multi-label classification is achieved by training K independent classifiers (perceptrons or sigmoid perceptrons), one per possible label. The interpretation of each classifier is: what is the likelihood that sample x has label l, against the alternative that it does not? After training all K classifiers, for each sample the highest likelihoods give the set of labels; a rule of thumb is used to decide how many labels should be returned (see the sketch below).
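
A compact sketch of this one-vs-rest scheme (assumptions: a dense feature matrix X of shape (n_samples, n_features) and a binary label matrix Y of shape (n_samples, K); the final selection rule is only an illustration of the kind of rule of thumb mentioned above, not the exact rule in use):

{code:language=python}
# Sketch: K independent sigmoid perceptrons (logistic units), one per label,
# trained with plain batch gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_rest(X, Y, lr=0.1, epochs=200):
    """X: (n, d) features; Y: (n, K) 0/1 labels. Returns a (d+1, K) weight matrix."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])        # append a bias column
    W = np.zeros((Xb.shape[1], Y.shape[1]))
    for _ in range(epochs):
        grad = Xb.T @ (sigmoid(Xb @ W) - Y) / Xb.shape[0]
        W -= lr * grad
    return W

def predict_labels(x, W, ratio=0.5):
    """Illustrative rule of thumb: keep labels within `ratio` of the best likelihood."""
    probs = sigmoid(np.append(x, 1.0) @ W)               # one likelihood per label
    return [k for k, p in enumerate(probs) if p >= ratio * probs.max()]
{code}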