
"Annotations per character" in detail

The number of annotations per character in the table above may seem excessive at first glance. However, we need to estimate the size of the GATE document in memory at the peak of the semantic annotation process, in order to avoid running out of memory. At that peak:

  • there are Token annotations over each token, and a token is 3-4 characters long on average
  • there are split annotations between sentences and Sentence annotations over the sentences themselves
  • there are Lookup annotations produced by the gazetteer
  • there are many temporary annotations, which represent intermediate results
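The factors above can be turned into a back-of-the-envelope estimate. The sketch below is illustrative only: the parameter defaults (average sentence length, lookups per token, temporary-annotation factor, bytes per annotation) are assumptions for the example, not GATE measurements; only the 3-4 character average token length comes from the text above.

```python
def estimate_peak_annotations(n_chars, avg_token_len=3.5, avg_sentence_len=100.0,
                              lookups_per_token=0.1, temp_factor=1.0):
    """Rough count of annotations at the peak of semantic annotation.

    All defaults except avg_token_len are illustrative assumptions.
    """
    tokens = n_chars / avg_token_len        # one Token annotation per token
    sentences = n_chars / avg_sentence_len  # a Sentence plus a split per sentence
    lookups = tokens * lookups_per_token    # gazetteer Lookup annotations
    base = tokens + 2 * sentences + lookups
    temp = base * temp_factor               # temporary/intermediate annotations
    return base + temp


def estimate_memory_bytes(n_annotations, bytes_per_annotation=200):
    # bytes_per_annotation is an assumed average including features, not a
    # measured GATE figure; substitute your own profiling results.
    return n_annotations * bytes_per_annotation
```

For a 1 MB plain-text document these defaults yield on the order of several hundred thousand annotations, which is why the per-character figures in the table are not excessive in practice.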

This means that the size of the document in real situations depends on the design of the pipeline. To reduce memory consumption, we recommend the following:

  • Use a tokenizer only when necessary. Token annotations are the primary contributor to GATE document memory consumption.
  • Clean up temporary annotations as soon as possible, rather than at the end of the pipeline, and avoid keeping annotations in the document longer than needed.
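The second recommendation amounts to dropping a temporary annotation set right after the stage that consumes it. The sketch below models the pattern with a minimal stand-in class, not the real GATE API; in GATE itself the analogous Java calls are `Document.getAnnotations(name)` and `Document.removeAnnotationSet(name)`, and the set name "temp" here is purely hypothetical.

```python
class Document:
    """Minimal stand-in for a GATE-style document with named annotation sets."""

    def __init__(self, text):
        self.text = text
        self.annotation_sets = {}  # name -> list of annotations

    def get_annotations(self, name):
        return self.annotation_sets.setdefault(name, [])

    def remove_annotation_set(self, name):
        # Freeing a set as soon as it is consumed keeps peak memory low.
        self.annotation_sets.pop(name, None)


def run_pipeline(doc):
    # Stage 1: write intermediate results to a temporary set.
    doc.get_annotations("temp").extend(["candidate1", "candidate2"])

    # Stage 2: consume the temporary results, then drop the whole set
    # immediately instead of waiting for the end of the pipeline.
    kept = [a for a in doc.get_annotations("temp") if a != "candidate2"]
    doc.get_annotations("final").extend(kept)
    doc.remove_annotation_set("temp")
    return doc
```

The point of the pattern is that the temporary set's memory is reclaimable during the pipeline run, so the peak document size is bounded by the largest single stage rather than the sum of all intermediate results.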