The number of annotations per character in the table above may seem excessive at first glance. However, we need to estimate the size of the GATE document in memory at the peak of the semantic annotation process in order to avoid running out of memory. At that peak:
- there are Token annotations over every token, and a token averages 3-4 characters in length
- there are Split annotations between sentences and Sentence annotations over each sentence
- there are Lookup annotations from the gazetteer
- there are many temporary annotations, which represent intermediate results
This means that the size of the document in real situations depends on the design of the pipeline. To reduce memory consumption, we recommend that you:
- Use a tokenizer only if necessary. Token annotations are the primary cause of GATE document memory consumption.
- Remove temporary annotations as soon as they are no longer needed, rather than at the end of the pipeline. Avoid keeping annotations in the document longer than needed.
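
The peak-size reasoning above can be turned into a rough back-of-the-envelope count. The sketch below is only an illustration: the class name, the method names, and every rate in it (average sentence length, Lookup and temporary annotations per 1000 characters) are hypothetical placeholders rather than measured values. Only the token length of about 4 characters comes from the text above. It does not use the GATE API; it only counts annotations, so multiply the result by whatever per-annotation size you measure in your own environment.

```java
// Rough estimate of how many annotations a document holds at the peak
// of the pipeline. All rates are illustrative assumptions, not measurements.
final class PeakAnnotationEstimate {

    // One Token annotation per token; tokens average avgTokenChars characters.
    static long tokenAnnotations(long docChars, int avgTokenChars) {
        return docChars / avgTokenChars;
    }

    // One Sentence annotation plus one Split annotation per sentence.
    static long sentenceAnnotations(long docChars, int avgSentenceChars) {
        return 2 * (docChars / avgSentenceChars);
    }

    // Annotations that occur at a given rate per 1000 characters
    // (used here for Lookup and temporary annotations).
    static long perThousandChars(long docChars, int ratePerThousand) {
        return docChars * ratePerThousand / 1000;
    }

    // Total annotation count at the peak, with or without a tokenizer.
    static long peakAnnotations(long docChars, boolean withTokens) {
        long total = sentenceAnnotations(docChars, 100) // assumed 100 chars per sentence
                + perThousandChars(docChars, 20)        // assumed Lookup rate
                + perThousandChars(docChars, 50);       // assumed temporary-annotation rate
        if (withTokens) {
            total += tokenAnnotations(docChars, 4);     // 3-4 chars per token, taken as 4
        }
        return total;
    }

    public static void main(String[] args) {
        long docChars = 1_000_000L;
        System.out.println("With tokenizer:    " + peakAnnotations(docChars, true));
        System.out.println("Without tokenizer: " + peakAnnotations(docChars, false));
    }
}
```

Under these assumed rates, Token annotations account for most of the total, which is why dropping the tokenizer, where the pipeline allows it, has the largest effect.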