
Supported formats

By using third-party libraries, the GATE document processing facilities, and direct application integration, the populater supports a variety of formats. The table below lists them.

Every document must have the correct file name extension so that its format can be recognized.

| Format | Extension | Quality of text extraction | Quality of formatting extraction | Third-party library involved / notes |
|---|---|---|---|---|
| Plain text | .txt | PERFECT | N/A | N/A |
| Microsoft Word (2003 or older; 2007 or newer) | .doc, .docx | PERFECT | GOOD* | Apache POI 3.7 via Apache Tika 0.7 |
| Microsoft PowerPoint (2003 or older; 2007 or newer) | .ppt, .pptx | PERFECT | GOOD* | Apache POI 3.7 via Apache Tika 0.7 |
| PDF | .pdf | VERY GOOD | GOOD | PDFBox 1.1 via Apache Tika 0.7 |
| RTF | .rtf | GOOD | N/A | Java RTF Editor Kit |
| HTML and XHTML | .htm, .html, .xhtml | PERFECT | VERY GOOD | Neko HTML Parser |
| XML | .xml | PERFECT | PERFECT | GATE XML Parser. All tags in the input file are converted to markup annotations. |
| GATE XML | .xml | PERFECT | PERFECT | GATE XML files that contain pre-annotated documents, created with GATE Developer or GATE Teamware, are detected automatically. They are imported "as is", without being annotated again by KIM. |
| Compressed GATE XML | .fi.xml.gz | PERFECT | PERFECT | GATE XML files like the above, but efficiently compressed. |
| Open Document Format | .odt, .ods, .odp | PERFECT | VERY GOOD | Apache Tika 0.7 |

* If you need improved formatting extraction for Microsoft Word documents, please contact us for more information.
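
Because formats are recognized by file name extension, it can help to check a batch of files before loading them. The following is only a minimal sketch of such a check, not the populater's own logic: the `EXTENSION_TO_FORMAT` table and the `guess_format` function are hypothetical names, and the mapping is simply copied from the table above.

```python
# Hypothetical lookup table copied from the supported-formats table.
# This is an illustration, not the populater's actual detection code.
EXTENSION_TO_FORMAT = {
    ".txt": "Plain text",
    ".doc": "Microsoft Word",
    ".docx": "Microsoft Word",
    ".ppt": "Microsoft PowerPoint",
    ".pptx": "Microsoft PowerPoint",
    ".pdf": "PDF",
    ".rtf": "RTF",
    ".htm": "HTML and XHTML",
    ".html": "HTML and XHTML",
    ".xhtml": "HTML and XHTML",
    # Plain XML and GATE XML share an extension; the table says GATE XML
    # is detected automatically, so the name alone cannot tell them apart.
    ".xml": "XML or GATE XML",
    ".fi.xml.gz": "Compressed GATE XML",
    ".odt": "Open Document Format",
    ".ods": "Open Document Format",
    ".odp": "Open Document Format",
}


def guess_format(filename):
    """Return the format name for a file name, or None if unsupported."""
    name = filename.lower()  # assumption: extensions are matched case-insensitively
    for extension, format_name in EXTENSION_TO_FORMAT.items():
        if name.endswith(extension):
            return format_name
    return None


if __name__ == "__main__":
    for candidate in ["report.docx", "corpus.fi.xml.gz", "notes.md"]:
        print(candidate, "->", guess_format(candidate))
```

Files for which the function returns `None` would need to be renamed or converted before loading. Whether the populater itself treats extensions case-insensitively is an assumption of this sketch.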
