
{toc}

h2. Introduction

Edlin is a collection of machine learning algorithms that implements a number of state-of-the-art methods for classification and sequence tagging. Although at their core these are general machine-learning approaches (perceptrons, logistic regression), the implementation is optimized for NLP learning tasks:
* inputs are represented as sparse document-term matrices
* parallel computation is used whenever possible (in order to deal with very large datasets)
* task-specific evaluation metrics such as Precision/Recall/F are reported
* feature selection methods are included to reduce dimensionality
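
As an illustration of the first and third points, the following is a minimal Python sketch and not Edlin code. The function names {{to_sparse_rows}} and {{precision_recall_f1}} are hypothetical, and whitespace tokenization is an assumption made only to keep the example short. Each document is stored as a sparse row that maps a term index to its count, and Precision/Recall/F are computed for one chosen label.

```python
from collections import Counter


def to_sparse_rows(documents):
    """Build a vocabulary and one sparse {term_index: count} row per document."""
    vocab = {}
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())  # assumed: whitespace tokenization
        row = {}
        for term, count in counts.items():
            index = vocab.setdefault(term, len(vocab))
            row[index] = count
        rows.append(row)
    return vocab, rows


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for a single label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


vocab, rows = to_sparse_rows(["the cat sat", "the dog sat down"])
scores = precision_recall_f1(
    ["pos", "neg", "pos", "pos"],  # gold labels
    ["pos", "pos", "neg", "pos"],  # predicted labels
    "pos",
)
```

Only the terms that occur in a document are stored in its row, which is what keeps very large vocabularies manageable.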

Edlin consists of four sub-projects (Basics, Edlin-Wrapper, Mallet-Wrapper and Feature Extraction). Technical details on each sub-project are given below. The source can be found [here|https://svn.ontotext.com/svn/kim/others/edlin/trunk/].


h2. Edlin Basics

Edlin Basics is the core of the tool, containing all ML algorithms divided into two general groups: classification and sequence (tagging).

h3. Algorithms for classification
h5. Maxent

Maximum Entropy (Maxent) is essentially multi-class logistic regression. It was first adapted and applied to NLP tasks by Berger et al. (1996) and Della Pietra et al. (1997) ([Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, A Maximum Entropy Approach to Natural Language Processing, Computational Linguistics, 1996, vol. 22, pp. 39-71|http://acl.ldc.upenn.edu/J/J96/J96-1002.pdf]). Maxent is a log-linear model: each class receives a score that is a linear combination of the input features, and the posterior class probabilities are obtained by exponentiating and normalizing these scores. To fit the model weights to the training observations, the conditional log-likelihood of the training data is maximized (equivalently, the negative log-likelihood loss is minimized) with a gradient-based method.
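
For reference, the model, its objective and its gradient can be written out in the standard notation for multi-class logistic regression. The symbols are illustrative and are not taken from the Edlin source: x is a feature vector, y a label, w_y the weight vector of class y, and N the number of training examples.

```latex
P(y \mid x) = \frac{\exp(w_y \cdot x)}{\sum_{y'} \exp(w_{y'} \cdot x)}
\qquad
\mathcal{L}(w) = \sum_{i=1}^{N} \log P(y_i \mid x_i)
\qquad
\frac{\partial \mathcal{L}}{\partial w_y}
  = \sum_{i=1}^{N} \bigl( \mathbb{1}[y_i = y] - P(y \mid x_i) \bigr)\, x_i
```

The gradient is the difference between the observed and the expected feature counts for class y, so it vanishes when the model's expectations match the training data.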

* Gradient methods available:
** Gradient ascent
** Conjugate gradient
** Stochastic gradient ascent
* Parallelization
* Modified objective for targeted optimization of a particular Precision/Recall tradeoff
* Regularization
** L1 regularization
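
To make two of the options above concrete, here is a minimal Python sketch of stochastic gradient ascent on the Maxent log-likelihood with an L1 penalty. It is not Edlin's implementation: all names ({{train_maxent_sga}}, {{soft_threshold}}, {{predict}}) and the toy data are hypothetical. The L1 penalty is applied with a soft-thresholding step, which is one common choice and is assumed here. As a simplification, only the weights of features active in the current example are shrunk.

```python
import math
import random


def softmax_probs(weights, row):
    """P(label | row) for every label: softmax over linear scores."""
    scores = [sum(w.get(j, 0.0) * v for j, v in row.items()) for w in weights]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_threshold(value, amount):
    """Shrink value towards zero by amount, clamping at zero (L1 step)."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def train_maxent_sga(rows, labels, num_labels, epochs=50, step=0.1, l1=0.0, seed=0):
    """Stochastic gradient ascent over sparse rows ({feature_index: value})."""
    rng = random.Random(seed)
    weights = [dict() for _ in range(num_labels)]  # one sparse vector per label
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            probs = softmax_probs(weights, rows[i])
            for y in range(num_labels):
                # Per-example gradient: (observed - expected) * feature value.
                residual = (1.0 if labels[i] == y else 0.0) - probs[y]
                for j, v in rows[i].items():
                    updated = weights[y].get(j, 0.0) + step * residual * v
                    weights[y][j] = soft_threshold(updated, step * l1)
    return weights


def predict(weights, row):
    """Index of the most probable label for a sparse row."""
    probs = softmax_probs(weights, row)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy data: feature 0 signals label 0, feature 1 signals label 1,
# feature 2 is shared by both labels.
rows = [{0: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0}, {0: 2.0}, {1: 2.0}]
labels = [0, 1, 0, 1]
weights = train_maxent_sga(rows, labels, num_labels=2, l1=0.01)
```

Batch gradient ascent would sum the same per-example gradient over the whole training set before each update. The stochastic variant updates after every example, which is usually the cheaper option on large datasets.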

h5. Perceptron
h5. Naive Bayes
h5. MIRA

h3. Algorithms for sequence
h5. CRF
h5. Perceptron

h2. Feature Extraction module

A module for feature extraction.

h2. Edlin-Wrapper (for GATE)

Edlin-Wrapper wraps the algorithms of Edlin, so that they can be used in [GATE|http://gate.ac.uk/] for multiple information extraction purposes.
The algorithms are wrapped as ProcessingResources and LanguageResources and can be applied directly in a pipeline.
For more details, see [Edlin-Wrapper|Edlin-Wrapper].


h2. Mallet-Wrapper (for GATE)

Mallet-Wrapper wraps the algorithms of [Mallet|http://mallet.cs.umass.edu/], so that they can be used in [GATE|http://gate.ac.uk/] for multiple information extraction purposes.
The algorithms are wrapped as ProcessingResources and LanguageResources and can be applied directly in a pipeline.

h2. Document classification API (DAPI)

Currently not part of Edlin.