The KIM populater module is a tool for populating a KIM Server with documents. It reads the documents and their associated metadata files, represents each of them as an internal document object, and invokes text analysis components on the document content. The resulting document, enriched with metadata, is stored and indexed. This section describes how to set up the module, which running scenarios are available, and how to load documents in standard or queue mode. The tool supports various types of logging, including logging through a Web service.
The population goes through the following main steps:
- finding and grouping files related to one KIM document (requests file, some copies of the file body, metadata, etc.)
- creating a document from the located file groups
- generating annotations for the document
- storing semantic information retrieved from the document
- storing the document in the document repository
- logging, in parallel, the events that occur during the population process (information, errors, etc.)
The module is controlled through the configuration file populater.xml. The tool supports both console and graphical modes. When the Graphical User Interface (GUI) mode is used, some of the configuration parameters can be changed through the interface. Logging is based on Logback, and its configuration is located in <KIM_HOME>/config/logback.xml.
After you install KIM, you need to configure the populater for your needs. You can start it directly with the evaluation corpus that comes with the downloadable version of KIM, or set it up to work with your own documents.
|GUI mode||How to start the Populater in the graphical console|
|Batch mode||How to run the Populater in batch (console) mode|
|As a service for Linux||How to run the Populater as a service for Linux|
|As a service for Windows||How to run the Populater as a service for Windows|
|Standalone installation (For Windows only)||How to run the standalone installation|
|Configuration parameters||How to set up the environment for the population process|
|Logging configuration||How logging works in the Populater module|
Through third-party libraries, the GATE document processing facilities, and direct application integration, the populater supports a variety of formats. All documents must have the correct file name extension so that their format can be recognized. For more details, see the table below:
|Format||Extension||Quality of text extraction||Quality of formatting extraction||Third-party library involved|
|Microsoft Word (2003 or older; 2007 or newer)||.doc .docx||PERFECT||GOOD*||Apache POI 3.7 via Apache Tika 0.7|
|Microsoft PowerPoint (2003 or older; 2007 or newer)||.ppt .pptx||PERFECT||GOOD*||Apache POI 3.7 via Apache Tika 0.7|
|PDF||.pdf||VERY GOOD||GOOD||PDFBox 1.1 via Apache Tika 0.7|
|RTF||.rtf||GOOD||N/A||Java RTF Editor Kit|
|HTML and XHTML||.htm(l) .xhtml||PERFECT||VERY GOOD||Neko HTML Parser|
|XML||.xml||PERFECT||PERFECT||GATE XML Parser. All tags in the input file are converted to markup annotations.|
|GATE XML||.xml||PERFECT||PERFECT||GATE XML files that contain pre-annotated documents, created with GATE Developer or GATE Teamware are automatically detected. They will be imported "as is" - without being annotated again by KIM.|
|Compressed GATE XML||.fi.xml.gz||PERFECT||PERFECT||GATE XML Files like the above, but efficiently compressed.|
|Open Document Format||.odt .ods .odp||PERFECT||VERY GOOD||Apache Tika 0.7|
If you need improved formatting extraction for Microsoft Word documents, please contact us for more information.
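Because the format is recognized from the file name extension, the lookup can be pictured as a simple suffix match. The following is a minimal Python sketch and not KIM's actual implementation: the names `FORMATS_BY_EXTENSION` and `detect_format` are hypothetical, the `.pdf` extension for the PDFBox row is assumed, and case-insensitive matching is an assumption as well.

```python
from typing import Optional

# Extension -> format label, mirroring the table of supported formats.
# Hypothetical names; this is not part of the KIM API.
FORMATS_BY_EXTENSION = {
    ".doc": "Microsoft Word",
    ".docx": "Microsoft Word",
    ".ppt": "Microsoft Powerpoint",
    ".pptx": "Microsoft Powerpoint",
    ".pdf": "PDF",  # assumed extension for the PDFBox row
    ".rtf": "RTF",
    ".htm": "HTML and XHTML",
    ".html": "HTML and XHTML",
    ".xhtml": "HTML and XHTML",
    # Plain XML and GATE XML share ".xml"; telling them apart would
    # require inspecting the file content, which this sketch does not do.
    ".xml": "XML",
    ".fi.xml.gz": "Compressed GATE XML",
    ".odt": "Open Document Format",
    ".ods": "Open Document Format",
    ".odp": "Open Document Format",
}


def detect_format(file_name: str) -> Optional[str]:
    """Return the format label for a file name, or None if unrecognized."""
    name = file_name.lower()  # assumption: matching ignores case
    # Try the longest extensions first so that multi-part extensions
    # such as ".fi.xml.gz" are checked before shorter ones.
    for ext in sorted(FORMATS_BY_EXTENSION, key=len, reverse=True):
        if name.endswith(ext):
            return FORMATS_BY_EXTENSION[ext]
    return None
```

A file with an extension that is not in the table, such as `notes.txt`, yields `None` in this sketch, which corresponds to a document whose format cannot be recognized.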
For every populated document, the populater loads the accompanying metadata and stores it as document features. It looks for the metadata in a file that has the same name as the document but the .xml extension.
For example, it will check for the metadata of the document "CompanyReport2003.html" in the file "CompanyReport2003.xml".
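The naming rule can be illustrated with a few lines of Python. This is only a sketch of the rule as described here: the function names `metadata_path` and `find_metadata` are hypothetical, and the handling of documents that are themselves `.xml` files is not specified in this section, so the sketch simply reports no metadata for them.

```python
from pathlib import Path
from typing import Optional


def metadata_path(document: Path) -> Path:
    """Return the expected metadata file: same name, .xml extension."""
    return document.with_suffix(".xml")


def find_metadata(document: Path) -> Optional[Path]:
    """Return the metadata file if it exists next to the document."""
    candidate = metadata_path(document)
    # Assumption: a document is never treated as its own metadata file.
    if candidate != document and candidate.is_file():
        return candidate
    return None


print(metadata_path(Path("CompanyReport2003.html")))
```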
By default, KIM recognizes the following feature types: TITLE, SUBTITLE, AUTHORS, TIMESTAMP, SUBJECT, SOURCE, URL, ORIGIN.
This restriction to the feature schema allows better indexing and a more consistent user interface. The document features will be available immediately after loading the document. As a result, they can be used in the semantic annotation pipeline.
The TIMESTAMP feature is parsed as a date. Typically, this is the date on which the document was created. The document dates are used in the Timelines section of the Web UI. Furthermore, developers can create queries that return documents from a specific time interval. See the Java RMI API or Web Service API for details.
A variety of date formats are recognized. For best results, use one of the sample date formats:
- Mon Jan 06 00:43:52 EET 2003
- Oct 20 00:00:00 EET 1950
If the TIMESTAMP feature is missing or contains no date, the document date is set to midnight, in the current time zone, of the day on which the document is processed.
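The date handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the populater's parser: the function name `resolve_document_date` is hypothetical, only the two sample formats are handled, the time zone abbreviation (for example EET) is dropped rather than interpreted, and falling back to midnight for an unparseable value is an assumption.

```python
from datetime import datetime
from typing import Optional

# The two sample formats with the zone abbreviation removed.
# %a and %b assume English day and month names (C/English locale).
_FORMATS = ("%a %b %d %H:%M:%S %Y", "%b %d %H:%M:%S %Y")


def resolve_document_date(timestamp: Optional[str],
                          now: Optional[datetime] = None) -> datetime:
    """Parse a TIMESTAMP value, or fall back to midnight of the processing day."""
    if timestamp:
        tokens = timestamp.split()
        if len(tokens) >= 2:
            # Drop the second-to-last token, which holds the zone
            # abbreviation in both sample formats (simplification).
            text = " ".join(tokens[:-2] + tokens[-1:])
            for fmt in _FORMATS:
                try:
                    return datetime.strptime(text, fmt)
                except ValueError:
                    continue
    # No usable date: midnight of the processing day, local time.
    now = now or datetime.now()
    return now.replace(hour=0, minute=0, second=0, microsecond=0)
```

Passing `now` explicitly makes the fallback easy to check, for example `resolve_document_date(None, datetime(2024, 5, 17, 15, 30))` yields midnight on 17 May 2024.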
If you intend to process documents larger than 1MB, make sure you update the populater memory configuration. Currently, the tool is set to use up to 512MB of memory. This is sufficient to process the documents in the test corpus, which contains documents with a maximum size of 1MB.
- For the graphical or console mode, the parameter POP_MAX_JAVA_HEAP=512m is located in the file <KIM_HOME>/bin/config/pop-config. You can increase it up to 1300MB on 32-bit operating systems; on 64-bit operating systems there is no such limit.
- On Linux, you can override the value for a single run by assigning the variable in front of the command: POP_MAX_JAVA_HEAP=1024m populater
- For Windows Service mode, the parameter -Xmx512m is located in the file <KIM_HOME>/config/service.conf. You can increase it in the same way as described above.
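Before editing either file, it can help to check a planned heap value against the limits mentioned above. The Python sketch below is a hypothetical helper and not part of KIM: it assumes the value uses the usual JVM size suffixes (k, m, g) and applies the 1300MB ceiling only for 32-bit operating systems.

```python
import re

# Conversion factors from each suffix to megabytes.
_UNITS_IN_MB = {"k": 1 / 1024, "m": 1, "g": 1024}
MAX_HEAP_MB_32BIT = 1300  # ceiling stated above for 32-bit systems


def heap_size_mb(value: str) -> float:
    """Convert a value such as '512m' or '1g' to megabytes."""
    match = re.fullmatch(r"(\d+)([kmg])", value.strip().lower())
    if not match:
        raise ValueError(f"unrecognized heap size: {value!r}")
    return int(match.group(1)) * _UNITS_IN_MB[match.group(2)]


def heap_setting_allowed(value: str, is_64bit: bool) -> bool:
    """Return True if the heap value respects the documented limit."""
    size = heap_size_mb(value)  # also validates the syntax
    return is_64bit or size <= MAX_HEAP_MB_32BIT
```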