The following section covers the memory consumption of GATE documents, annotations and features.
All GATE document memory usage examples below result from synthetic tests over GATE 5.2.1 documents.
The numbers below are in megabytes, unless specified otherwise. They were measured on a 32-bit Java 6u20 JVM on Windows 7.
|Annotations per character / Document size||10 kB||100 kB||1 MB||10 MB|
- Each document should contain random words with an average length of 10 symbols.
- Each annotation should span approximately 4 symbols and carry 4 features.
- Annotations should have different start and end points.
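A setup along these lines can be approximated with a small harness. The sketch below is not the code that produced the numbers above and does not use GATE classes: plain Java collections stand in for documents and annotations, and the class name `SyntheticDocSketch`, its methods, the seed and the density value are all hypothetical. Offsets are random, so some spans may coincide, and heap readings taken through `Runtime` are only approximate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SyntheticDocSketch {

    /** Stand-in for an annotation: a span plus a feature map (not a GATE class). */
    static final class Span {
        final int start;
        final int end;
        final Map<String, String> features;

        Span(int start, int end, Map<String, String> features) {
            this.start = start;
            this.end = end;
            this.features = features;
        }
    }

    /** Builds text of random lowercase words, 5 to 15 symbols each (10 on average). */
    static String randomText(int targetChars, Random rnd) {
        StringBuilder sb = new StringBuilder(targetChars + 16);
        while (sb.length() < targetChars) {
            int len = 5 + rnd.nextInt(11);
            for (int i = 0; i < len; i++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
            sb.append(' ');
        }
        sb.setLength(targetChars); // trim the overshoot from the last word
        return sb.toString();
    }

    /** Creates count spans of 4 symbols, each with 4 features, at random offsets. */
    static List<Span> randomSpans(int docLength, int count, Random rnd) {
        List<Span> spans = new ArrayList<Span>(count);
        for (int i = 0; i < count; i++) {
            int start = rnd.nextInt(docLength - 4);
            Map<String, String> features = new HashMap<String, String>();
            for (int f = 0; f < 4; f++) {
                features.put("feature" + f, "value" + rnd.nextInt(1000));
            }
            spans.add(new Span(start, start + 4, features));
        }
        return spans;
    }

    /** Approximate used heap in bytes; System.gc() is only a hint to the JVM. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int docChars = 100 * 1024;        // a 100 kB document
        double annotationsPerChar = 1.0;  // hypothetical density
        long before = usedHeap();
        String text = randomText(docChars, rnd);
        List<Span> spans = randomSpans(text.length(), (int) (docChars * annotationsPerChar), rnd);
        long after = usedHeap();
        System.out.println("chars=" + text.length() + ", spans=" + spans.size()
                + ", approx MB=" + (after - before) / (1024.0 * 1024.0));
    }
}
```

Running `main` with different values of `docChars` and `annotationsPerChar` gives a rough feel for how memory grows with document size and annotation density. Real GATE annotations carry additional bookkeeping, so the absolute figures will differ.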
The number of annotations per character in the table above may seem excessive at first glance. However, to avoid running out of memory, we need to estimate the in-memory size of a GATE document at the peak of the semantic annotation process. At that peak:
- there are Token annotations over each token, and a token is 3-4 characters long on average
- there are Split annotations between sentences and Sentence annotations over the sentences themselves
- there are Lookup annotations from the gazetteer
- there are many temporary annotations, which represent intermediate results
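To see how these contributions add up, a back-of-the-envelope count can help. In the sketch below, only the token length of 3-4 characters comes from the list above. The sentence length, the Lookup rate and the number of temporary annotations are hypothetical inputs that depend entirely on the text and the pipeline, and the class and method names are illustrative.

```java
public class PeakAnnotationEstimate {

    /** Token annotations: roughly one per avgTokenChars characters. */
    static long tokenCount(long docChars, double avgTokenChars) {
        return Math.round(docChars / avgTokenChars);
    }

    /** Annotations per character = total annotations / document length. */
    static double annotationsPerChar(long totalAnnotations, long docChars) {
        return (double) totalAnnotations / docChars;
    }

    public static void main(String[] args) {
        long docChars = 1000000L;                  // a document of 1,000,000 characters
        long tokens = tokenCount(docChars, 3.5);   // 3-4 characters per token, from the text
        long sentences = docChars / 100;           // hypothetical: one sentence per 100 characters
        long splits = sentences;                   // assumes one Split per Sentence
        long lookups = tokens / 10;                // hypothetical gazetteer hit rate
        long temporary = tokens / 2;               // hypothetical, pipeline-dependent
        long total = tokens + sentences + splits + lookups + temporary;
        System.out.println("total annotations: " + total);
        System.out.println("annotations per character: " + annotationsPerChar(total, docChars));
    }
}
```

With any plausible inputs the Token term dominates the total, which is consistent with the recommendation to run a tokenizer only when it is needed.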
This means that the size of the document in real situations depends on the design of the pipeline. To reduce memory consumption, we recommend the following:
- Use a tokenizer only if necessary. The Token annotations are the primary cause of the GATE document memory consumption.
- Clean temporary annotations as soon as possible, rather than at the end of the pipeline. Avoid keeping annotations in the document longer than needed.