Clear annotation types
Having clear annotation types is another important prerequisite for the annotation process. The Subject Matter Expert (SME), or someone who is thoroughly familiar with the specifics of the domain, defines the annotation types based on empirical observations of the data and the content.
Annotation types (AT) are abstract descriptions of certain mentions, used for marking spans of text, i.e. recognising mentions of person, organisation, location, date, etc. within a text. An AT may have two parts:
- Type – the generic word that describes an entity
- Feature – a more specific sub-category of the type
For example, an annotation labelled “address” (the type) can be a “street address”, “city” or “country” (the features).
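As an illustration, a type/feature pair can be captured in a simple data structure. The names below (Annotation, its fields) are hypothetical and serve only to make the idea concrete, not to prescribe an implementation:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One marked span of text (illustrative structure, not a fixed format)."""
    start: int    # character offset where the mention begins
    end: int      # character offset where the mention ends
    type: str     # the generic label, e.g. "address"
    feature: str  # the more specific sub-category, e.g. "city"

# Example: marking "Berlin" in a sentence as an address of feature "city".
text = "She moved to Berlin in 2019."
ann = Annotation(start=13, end=19, type="address", feature="city")
print(text[ann.start:ann.end], ann.type, ann.feature)  # Berlin address city
```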
Initial Corpus
The corpus is a collection of documents, which can be in different formats; the ones we support here are XML, HTML, TXT and CSV. Depending on the annotation task, the texts should be sampled so that the corpus is representative and balanced. This means that it should contain all types of texts (categories) present in that particular domain (e.g. for the news domain, texts about general news, social life, economy, finance, religion, sport, celebrities, etc.), and the proportions of the text types should reflect their share in real-life usage.
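One simple way to approximate such a balanced corpus is to sample each category in proportion to its real-life share. The sketch below is a minimal illustration under the assumption that the documents are already grouped by category; the category names and shares are made up:

```python
import random

def sample_balanced(docs_by_category, total, shares):
    """Draw a corpus whose category proportions follow the given real-life shares."""
    corpus = []
    for category, share in shares.items():
        k = round(total * share)                      # documents wanted for this category
        pool = docs_by_category[category]
        corpus.extend(random.sample(pool, min(k, len(pool))))
    return corpus

# Toy example with made-up documents; in practice these would be real texts.
docs = {
    "news":    [f"news article {i}" for i in range(100)],
    "sport":   [f"sport article {i}" for i in range(100)],
    "finance": [f"finance article {i}" for i in range(100)],
}
corpus = sample_balanced(docs, total=20,
                         shares={"news": 0.6, "sport": 0.25, "finance": 0.15})
print(len(corpus))  # ~20 documents, roughly a 60/25/15 split
```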
We usually start a new annotation task by creating an initial corpus of a small number of documents. In this way we are able to see how well the annotation task and initial guidelines work and, if necessary, adjust the text analysis component/guidelines/text collection before adding more documents to our corpus.
There is no fixed rule for how big the corpus needs to be in order to get good results, as this depends largely on how complex the annotation task is. Typically, we use between 100 and 500 documents with examples for evaluation and 700 to 2,000 documents for the machine learning component.
Initial annotation guidelines
We need to create initial annotation guidelines, which will serve as guidance for our text analytics tasks. Depending on the domain and the complexity of the task, this can be done automatically or manually.
Automatic approach
Based on observations of the documents and the data, the text analysis expert creates the initial model of the phenomena (a software text-analysis component, ML-based or rule-based) associated with the problem we are trying to solve. In this way the first annotation guidelines become available automatically: they describe how the corpus should be annotated with the features in the model.
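A very small rule-based starting point might look like the sketch below: a gazetteer list plus a date pattern that produces the first automatic annotations. The lists, names and patterns are illustrative assumptions, not a prescribed implementation:

```python
import re

# Tiny illustrative gazetteer; a real initial model would use much larger
# lists, grammars or a trained ML component.
GAZETTEER = {
    "Person": ["John Smith", "Maria Garcia"],
    "Organisation": ["United Nations", "Acme Corp"],
}
DATE_PATTERN = re.compile(r"\b\d{1,2} (January|February|March|April|May|June|"
                          r"July|August|September|October|November|December) \d{4}\b")

def pre_annotate(text):
    """Return (start, end, type) triples found by the initial rule-based model."""
    annotations = []
    for ann_type, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                annotations.append((match.start(), match.end(), ann_type))
    for match in DATE_PATTERN.finditer(text):
        annotations.append((match.start(), match.end(), "Date"))
    return sorted(annotations)

print(pre_annotate("John Smith joined the United Nations on 3 March 2015."))
```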
Semantic Annotation Cycle
The semantic annotation cycle consists of the following steps:
Step 1: The initial set of documents is loaded and the project annotation schema (annotation types, features, values, etc.) is applied. A well-designed annotation schema and accurate annotations are critical for the machine learning component, which relies on information beyond the text itself.
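To make the schema concrete, it can be expressed as a simple declarative structure listing the allowed types, their features and the permitted values. The format below is only a hypothetical illustration:

```python
# Hypothetical project annotation schema: allowed types, their features,
# and the permitted values for each feature.
ANNOTATION_SCHEMA = {
    "Address": {
        "kind": ["street address", "city", "country"],
    },
    "Person": {
        "role": ["politician", "athlete", "other"],
    },
    "Date": {},  # no features; the span itself is enough
}

def is_valid(ann_type, feature=None, value=None):
    """Check an annotation against the schema before it enters the corpus."""
    if ann_type not in ANNOTATION_SCHEMA:
        return False
    if feature is None:
        return True
    allowed = ANNOTATION_SCHEMA[ann_type].get(feature)
    return allowed is not None and (value is None or value in allowed)

print(is_valid("Address", "kind", "city"))      # True
print(is_valid("Address", "kind", "postcode"))  # False
```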
Step 2: Automatic annotation is performed, based on the initial model of the phenomena (the extraction pipeline). This creates a pre-annotated corpus augmented with higher-level information from components such as tokenizers, sentence splitters, part-of-speech taggers, gazetteers, PER/ORG/LOC grammars, etc. Adding such information to a corpus allows the computer to find features that make the defined task easier and more accurate.
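As an example of what such an extraction pipeline can add, the sketch below uses spaCy (one possible toolkit chosen here for illustration; the text does not prescribe a specific library) to enrich raw text with tokens, sentences, part-of-speech tags and entity mentions:

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Angela Merkel visited Paris on Monday to meet officials from the UN.")

# Sentence splitting and tokenisation with part-of-speech tags.
for sent in doc.sents:
    print([(token.text, token.pos_) for token in sent])

# Pre-annotated entity mentions (PERSON/ORG/GPE roughly correspond to PER/ORG/LOC).
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```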
Step 3: The pre-annotated corpus is then sent to MA experts for curation. A well-defined manual curation process is essential to ensure that all automatically pre-annotated entries are handled consistently. The process consists of several steps (a small validation sketch follows the list):
- All annotated entries are checked against the initial annotation guidelines.
- All erroneous entries are corrected, if possible, or deleted.
- Depending on the task or the project, entries missed during the pre-processing stage are added.
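A small helper such as the hypothetical one below can support this curation step by flagging pre-annotated entries that do not conform to the guidelines, so a curator can correct or delete them:

```python
def triage_annotations(annotations, schema):
    """Split pre-annotations into those that pass the guidelines and those to review.

    annotations: list of dicts like {"type": ..., "feature": ..., "value": ...}
    schema: dict mapping allowed types -> {feature: [allowed values]}
    """
    accepted, to_review = [], []
    for ann in annotations:
        allowed = schema.get(ann["type"])
        if allowed is None:
            to_review.append(ann)   # unknown type: correct or delete
        elif ann.get("feature") and ann["value"] not in allowed.get(ann["feature"], []):
            to_review.append(ann)   # feature/value not covered by the guidelines
        else:
            accepted.append(ann)
    return accepted, to_review

schema = {"Address": {"kind": ["street address", "city", "country"]}}
anns = [
    {"type": "Address", "feature": "kind", "value": "city"},
    {"type": "Address", "feature": "kind", "value": "planet"},  # needs curation
]
print(triage_annotations(anns, schema))
```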
Step 4: Based on observations of the pre-annotated documents and the data, the cases in which entities appear in the text, and the contexts in which mentions of the annotation types occur, the MA experts revise the initial annotation guidelines and enrich them with specific use cases.
Step 5: A manually annotated corpus is created. It is further divided into two parts.
- One third or one quarter of the documents is used to evaluate the performance of the model; depending on the results achieved, the model can be revised and the whole cycle repeated.
- The other part of the corpus, which is the larger portion, is used for training and development of ML algorithms on the data (see the sketch after this list):
- The algorithms are trained and tested on the corpus.
- The results of training and testing are evaluated in order to see where the algorithms performed well and where they made mistakes.
- The design of the model is revised and, if necessary, other annotation types are created.
- The whole cycle or some parts of it are repeated.
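As a minimal illustration of the split and the train/test/evaluate loop in Step 5, the sketch below uses scikit-learn on a toy document classification task; the library, the feature representation and the 75/25 split are assumptions made for the sake of the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy manually annotated corpus: (document text, label) pairs.
texts = ["stocks fell sharply", "the match ended 2-1", "central bank raises rates",
         "striker scores twice", "markets rally on earnings", "coach announces squad"] * 10
labels = ["finance", "sport", "finance", "sport", "finance", "sport"] * 10

# Hold out roughly one quarter of the documents for evaluation.
X_train, X_eval, y_train, y_eval = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluate to see where the model performs well and where it makes mistakes,
# then revise the model/guidelines and repeat the cycle if necessary.
print(classification_report(y_eval, model.predict(X_eval)))
```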