Edlin is a collection of machine learning algorithms comprising a large number of state-of-the-art methods for classification and sequence tagging. Even though at their core they are general machine-learning approaches (perceptrons, logistic regression), the implementation is optimized for NLP learning tasks:
* inputs are represented as sparse document-term matrices
* parallel computation is used whenever possible (in order to deal with very large datasets)
* NLP-specific evaluation metrics such as Precision, Recall and F-measure are reported
* appropriate feature selection methods are added in order to reduce dimensionality, etc.
Edlin consists of four sub-projects (Basics, Edlin-Wrapper, Mallet-Wrapper and Feature Extraction). Technical details on each sub-project are given below. The source can be found [here|https://svn.ontotext.com/svn/kim/others/edlin/trunk/].
h2. Edlin Basics
Edlin Basics is the core of the tool, containing all ML algorithms divided into two general groups: classification and sequence (tagging).
h3. Algorithms for classification
All algorithms are used for multi-class classification. Typically, for each input example, the raw model output is a list of scores, one for each class. As the final prediction, the class with the largest score is chosen.
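The scoring and prediction step can be sketched as follows. This is a minimal illustration in Python, not Edlin's own code: the dictionary-based sparse representation, the function names and the toy weights are all assumptions made for the example.

```python
from typing import Dict, List

# Assumed sparse representation: feature name -> value; absent features are zero.
SparseVector = Dict[str, float]


def class_scores(x: SparseVector, weights: List[Dict[str, float]]) -> List[float]:
    """Return one linear score per class (sparse dot product of weights and input)."""
    return [sum(w_c.get(f, 0.0) * v for f, v in x.items()) for w_c in weights]


def predict(x: SparseVector, weights: List[Dict[str, float]]) -> int:
    """Return the index of the class with the largest score."""
    scores = class_scores(x, weights)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-class weights for three classes.
weights = [
    {"goal": 1.5, "match": 0.5},    # class 0
    {"stock": 2.0, "market": 1.0},  # class 1
    {"vote": 1.0},                  # class 2
]
doc = {"stock": 2.0, "match": 1.0}  # term counts of one document

print(class_scores(doc, weights))
print(predict(doc, weights))
```

Only the features that actually occur in the document are visited, which is what makes the sparse representation efficient for large vocabularies.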
Maximum Entropy (Maxent) is essentially multi-class logistic regression. It was first adapted and applied to NLP tasks by Berger et al. (1996) and Della Pietra et al. (1997) ([Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, A Maximum Entropy Approach to Natural Language Processing, Computational Linguistics, 1996, vol. 22, pp. 39-71|http://acl.ldc.upenn.edu/J/J96/J96-1002.pdf]). Maxent is a log-linear model: each class receives a score that is a linear combination of the input features, and the posterior class probabilities are obtained by exponentiating and normalizing these scores (softmax). To fit the model weights to the training observations, the log-likelihood of the training data is maximized with a gradient-based method. The output of the maxent model for a given example is a list of posterior class probabilities, and the class with the largest probability is chosen as the prediction.
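As an illustration of the training procedure described above, here is a minimal sketch of maxent fitted with plain batch gradient ascent on the average log-likelihood. It uses dense lists and pure Python for readability; the function names, learning rate, epoch count and toy data are assumptions for this example and do not reflect Edlin's implementation.

```python
import math
from typing import List, Tuple

Example = Tuple[List[float], int]  # (feature vector, gold class index)


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def posteriors(W: List[List[float]], x: List[float]) -> List[float]:
    """Posterior class probabilities: softmax over the linear class scores."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def log_likelihood(W: List[List[float]], data: List[Example]) -> float:
    return sum(math.log(posteriors(W, x)[y]) for x, y in data)


def train_maxent(data: List[Example], n_classes: int, n_features: int,
                 lr: float = 0.5, epochs: int = 200) -> List[List[float]]:
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, y in data:
            p = posteriors(W, x)
            for c in range(n_classes):
                # Gradient of log p(y|x) w.r.t. W[c]: (1[c == y] - p[c]) * x
                coeff = (1.0 if c == y else 0.0) - p[c]
                for j, xj in enumerate(x):
                    grad[c][j] += coeff * xj
        for c in range(n_classes):
            for j in range(n_features):
                W[c][j] += lr * grad[c][j] / len(data)  # ascent step
    return W


# Toy data: two classes, three features (the last one acts as a bias term).
data = [([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1),
        ([1.0, 0.1, 1.0], 0), ([0.1, 1.0, 1.0], 1)]
W = train_maxent(data, n_classes=2, n_features=3)
print(posteriors(W, [1.0, 0.0, 1.0]))
```

Conjugate gradient or stochastic gradient ascent would replace only the weight update loop; the model and the objective stay the same.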
Several parameters for maxent can be selected, so that the most appropriate and efficient model is trained to suit a particular dataset:
* Gradient method:
** Gradient ascent
** Conjugate gradient
** Stochastic gradient ascent (fast)
* Modified objective for targeted optimization of a particular Precision/Recall trade-off:
** We implemented a weighted likelihood objective that allows for optimizing a specific F_beta, for a given beta, which means that we can specify a desired Precision/Recall trade-off. In practice, we can therefore train models that have very high Precision, or very high Recall, at the expense of the complementary measure.
* L1 regularization:
** L1 regularization is often used in practice to obtain sparse models and to reduce overfitting. An L1-regularized maxent can also serve as a feature selection procedure.
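To make the role of beta concrete, the sketch below computes F_beta from Precision and Recall using the standard formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The second function is only a hypothetical illustration of what a class-weighted log-likelihood could look like; the exact weighting scheme used in Edlin is not described here, so both the function and its parameter names are assumptions.

```python
import math
from typing import List, Sequence


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F_beta: beta < 1 favours Precision, beta > 1 favours Recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def weighted_log_likelihood(probs: Sequence[Sequence[float]],
                            labels: Sequence[int],
                            class_weights: Sequence[float]) -> float:
    """Hypothetical weighted objective (an assumption, not Edlin's formula):
    each example's log-probability is scaled by the weight of its gold class."""
    return sum(class_weights[y] * math.log(p[y]) for p, y in zip(probs, labels))


# A high-Precision, low-Recall system scored under three different betas.
p, r = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(p, r, beta), 4))
```

The same system looks better under a small beta and worse under a large one, which is why the choice of beta determines whether training should favour Precision or Recall.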
The Perceptron, invented by F. Rosenblatt, is one of the oldest algorithms for training a linear model (Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory). It is an online algorithm: it is trained by observing one input example at a time, and its weights are updated only when it makes a mistake on that example.
* The online training of the perceptron allows for easy parallelization. Batches of data are fed independently to several 'temporary' perceptrons, which are then averaged into a single model. The procedure is repeated for a number of epochs (see ["Distributed Training Strategies for the Structured Perceptron", McDonald et al. (2010)|http://aclweb.org/anthology/N/N10/N10-1069.pdf]).
* For many datasets, the perceptron is the fastest algorithm in the entire Edlin collection.
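The mistake-driven update and the averaging of 'temporary' perceptrons can be sketched as below. This is a minimal single-process illustration with dense lists: the shards are processed one after another here, whereas a real parallel setup would hand each shard to its own worker. All names and the toy data are assumptions for the example, not Edlin's API.

```python
from typing import List, Tuple

Example = Tuple[List[float], int]  # (feature vector, gold class index)
Weights = List[List[float]]        # one weight vector per class


def predict(W: Weights, x: List[float]) -> int:
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)


def perceptron_epoch(W: Weights, shard: List[Example]) -> Weights:
    """One online pass over a shard, starting from a copy of the shared model."""
    W = [row[:] for row in W]
    for x, y in shard:
        y_hat = predict(W, x)
        if y_hat != y:  # update only upon a mistake
            for j, xj in enumerate(x):
                W[y][j] += xj      # promote the gold class
                W[y_hat][j] -= xj  # demote the wrongly predicted class
    return W


def average(models: List[Weights]) -> Weights:
    n = len(models)
    return [[sum(m[c][j] for m in models) / n
             for j in range(len(models[0][c]))]
            for c in range(len(models[0]))]


def train_mixed(shards: List[List[Example]], n_classes: int,
                n_features: int, epochs: int = 10) -> Weights:
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        # Each temporary perceptron sees only its own shard; the results are averaged.
        W = average([perceptron_epoch(W, shard) for shard in shards])
    return W


shards = [
    [([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1)],
    [([1.0, 0.2, 1.0], 0), ([0.2, 1.0, 1.0], 1)],
]
W = train_mixed(shards, n_classes=2, n_features=3)
print([predict(W, x) for shard in shards for x, _ in shard])
```

Because every temporary perceptron starts each epoch from the same averaged model, the only communication needed between workers is one exchange of weights per epoch.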