EDLIN Technical Documentation

Introduction

Edlin is a collection of machine learning algorithms comprising a large number of state-of-the-art methods for classification and sequence tagging. Although at their core these are general machine-learning approaches (perceptrons, logistic regression), the implementation is optimized for NLP learning tasks.

Edlin consists of four sub-projects (Basics, Edlin-Wrapper, Mallet-Wrapper and Feature Extraction). Technical details on each sub-project are given below. The source can be found here.

Edlin Basics

Edlin Basics is the core of the tool, containing all ML algorithms divided into two general groups: classification and sequence (tagging).

Algorithms for classification

Maxent

Maximum Entropy (Maxent) is essentially multi-class logistic regression. It was first adapted and applied to NLP tasks by Berger et al. (1996) and Della Pietra et al. (1997) (Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, A Maximum Entropy Approach to Natural Language Processing, Computational Linguistics, 1996, vol. 22, pp. 39-71). Maxent is a log-linear model: each class receives a score that is a linear combination of the input features, and the posterior class probabilities are obtained by exponentiating and normalizing these scores. To fit the model weights to the training observations, the log-likelihood of the training data is maximized. The maximization is carried out by a gradient ascent method.
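The training procedure described above can be illustrated with a minimal sketch. It is not Edlin's implementation or API: the function names (softmax, predict_proba, log_likelihood, train_maxent), the toy data, the learning rate and the iteration count are all assumptions made for illustration, and plain batch gradient ascent stands in for whichever optimizer Edlin actually uses.

```python
import math


def softmax(scores):
    """Turn per-class linear scores into posterior probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, x):
    """P(class | x): softmax over one linear score per class."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    return softmax(scores)


def log_likelihood(weights, data):
    """Sum of log P(true class | x) over the training observations."""
    return sum(math.log(predict_proba(weights, x)[y]) for x, y in data)


def train_maxent(data, num_classes, num_features,
                 learning_rate=0.5, iterations=200):
    """Fit weights by batch gradient ascent on the log-likelihood.

    Hypothetical helper for illustration only; hyperparameters are assumed.
    """
    weights = [[0.0] * num_features for _ in range(num_classes)]
    for _ in range(iterations):
        grad = [[0.0] * num_features for _ in range(num_classes)]
        for x, y in data:
            probs = predict_proba(weights, x)
            for c in range(num_classes):
                # Gradient term: observed indicator minus predicted probability.
                err = (1.0 if c == y else 0.0) - probs[c]
                for j, x_j in enumerate(x):
                    grad[c][j] += err * x_j
        # Step uphill, using the gradient averaged over the observations.
        for c in range(num_classes):
            for j in range(num_features):
                weights[c][j] += learning_rate * grad[c][j] / len(data)
    return weights


# Toy data (assumed): (feature vector, class label); last feature is a bias.
TOY_DATA = [
    ([1.0, 0.0, 1.0], 0),
    ([0.9, 0.1, 1.0], 0),
    ([0.0, 1.0, 1.0], 1),
    ([0.1, 0.9, 1.0], 1),
]

if __name__ == "__main__":
    w = train_maxent(TOY_DATA, num_classes=2, num_features=3)
    print(predict_proba(w, [1.0, 0.0, 1.0]))
```

The gradient of the log-likelihood with respect to a class weight is the difference between the observed and the expected feature values for that class, which is why each update only needs the predicted probabilities and the input features.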

Several parameters for Maxent can be selected so that the most appropriate and efficient model is trained to suit a particular dataset.

Perceptron
Naive Bayes
MIRA

Algorithms for sequence

CRF
Perceptron

Feature Extraction module

A module for feature extraction.

Edlin-Wrapper (for GATE)

Edlin-Wrapper wraps the algorithms of Edlin, so that they can be used in GATE for multiple information extraction purposes.
The algorithms are wrapped as ProcessingResources and LanguageResources and can be applied directly in a pipeline.
More about [Edlin-Wrapper]

Mallet-Wrapper (for GATE)

Mallet-Wrapper wraps the algorithms of Mallet, so that they can be used in GATE for multiple information extraction purposes.
The algorithms are wrapped as ProcessingResources and LanguageResources and can be applied directly in a pipeline.

Document classification API (DAPI)

Currently not part of Edlin.