

Overview

The KIM populater module is a tool for populating a KIM Server with documents. It reads the documents and their associated metadata files, represents each one as an internal document object, and invokes text analysis components to run over the document content. The resulting document, enriched with metadata, is stored and indexed. This section describes how to set up the module, what the different running scenarios are, and how to load documents in standard or queue mode. The tool supports various types of logging, including logging through a Web service.

The population goes through the following main steps:

  • finding and grouping files related to one KIM document (requests file, some copies of the file body, metadata, etc.)
  • creating a document by using located file groups
  • generating annotations for the document
  • storing semantic information retrieved from the document
  • storing the document in the document repository
  • parallel logging of the events appearing in the population process (information, errors, etc.)

The module is controlled through the configuration file populater.xml. The tool supports both console and graphical mode. In Graphical User Interface (GUI) mode, some of the configuration parameters can be changed through the interface. Logging is based on Logback, and its configuration is located in <KIM_HOME>/config/logback.xml.

Populater installation and configuration

After you install KIM, you need to configure the populater for your needs. You can start it directly with our evaluation corpus that comes with the download version of KIM or set it up to use it with your own documents.

  • GUI mode: how to start the Populater in the graphical console
  • Batch mode: how to run the Populater in batch (console) mode
  • As a service for Linux: how to run the Populater as a service for Linux
  • As a service for Windows: how to run the Populater as a service for Windows
  • Standalone installation (for Windows only): how to run the standalone installation
  • Configuration parameters: how to set up the environment for the population process
  • Logging configuration: what the logging routine of the Populater module is

Supported formats

By using third-party libraries, the GATE document processing facilities, and direct application integration, the populater supports a variety of formats. For more details, see the table below:

All documents must have the correct name extension, so their format can be recognized.

Format | Extension | Quality of text extraction | Quality of formatting extraction | Third-party library involved
Plain text | .txt | PERFECT | N/A | N/A
Microsoft Word (2003 or older; 2007 or newer) | .doc .docx | PERFECT | GOOD* | Apache POI 3.7 via Apache Tika 0.7
Microsoft PowerPoint (2003 or older; 2007 or newer) | .ppt .pptx | PERFECT | GOOD* | Apache POI 3.7 via Apache Tika 0.7
PDF | .pdf | VERY GOOD | GOOD | PDFBox 1.1 via Apache Tika 0.7
RTF | .rtf | GOOD | N/A | Java RTF Editor Kit
HTML and XHTML | .htm(l) .xhtml | PERFECT | VERY GOOD | Neko HTML Parser
XML | .xml | PERFECT | PERFECT | GATE XML Parser. All tags in the input file are converted to markup annotations.
GATE XML | .xml | PERFECT | PERFECT | GATE XML files that contain pre-annotated documents, created with GATE Developer or GATE Teamware, are automatically detected. They are imported "as is", without being annotated again by KIM.
Compressed GATE XML | .fi.xml.gz | PERFECT | PERFECT | GATE XML files like the above, but efficiently compressed.
Open Document Format | .odt .ods .odp | PERFECT | VERY GOOD | Apache Tika 0.7

* If you need improved formatting extraction for Microsoft Word documents, please contact us for more information.
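
Because formats are recognized by file name extension, the lookup can be pictured as a simple suffix match. The following is a minimal sketch only, not the populater's actual code: the function name detect_format, the table name, and the case-insensitive comparison are assumptions made for illustration. The extensions and format names are taken from the table above.

```python
# Hypothetical lookup table; extensions and format names come from the
# supported formats table, everything else is illustrative.
FORMATS_BY_EXTENSION = {
    ".txt": "Plain text",
    ".doc": "Microsoft Word",
    ".docx": "Microsoft Word",
    ".ppt": "Microsoft PowerPoint",
    ".pptx": "Microsoft PowerPoint",
    ".pdf": "PDF",
    ".rtf": "RTF",
    ".htm": "HTML and XHTML",
    ".html": "HTML and XHTML",
    ".xhtml": "HTML and XHTML",
    ".fi.xml.gz": "Compressed GATE XML",
    # Plain XML and GATE XML share the extension; the table says GATE XML
    # is detected automatically, which this sketch does not attempt.
    ".xml": "XML or GATE XML",
    ".odt": "Open Document Format",
    ".ods": "Open Document Format",
    ".odp": "Open Document Format",
}


def detect_format(filename):
    """Return the format name for a file name, or None if unsupported."""
    # Assumption: extensions are compared case-insensitively.
    name = filename.lower()
    for extension, format_name in FORMATS_BY_EXTENSION.items():
        if name.endswith(extension):
            return format_name
    return None
```

A file whose extension is not in the table yields None in this sketch, which mirrors the requirement that documents carry the correct extension to be recognized.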

Document metadata

For every populated document, the populater loads the accompanying metadata and stores it as document features. It looks for the metadata in a file that has the same name as the document but the .xml extension.
For example, for the document "CompanyReport2003.html" it checks the file "CompanyReport2003.xml".
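
The naming rule can be illustrated with a few lines of Python. This is a sketch under assumptions: the helper name metadata_path is hypothetical, and the populater's own lookup may differ in detail.

```python
from pathlib import Path


def metadata_path(document_path):
    """Return the path where the accompanying metadata file is expected.

    Illustrative only: the last extension of the document name is
    replaced with ".xml", as in the CompanyReport2003 example.
    """
    return Path(document_path).with_suffix(".xml")
```

For "CompanyReport2003.html" this yields "CompanyReport2003.xml" in the same directory. The sketch does not cover documents that are themselves .xml or .fi.xml.gz files; how the populater treats those cases is not described here.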

By default, KIM recognizes the following feature types: TITLE, SUBTITLE, AUTHORS, TIMESTAMP, SUBJECT, SOURCE, URL, ORIGIN.

Adding arbitrary elements to the .xml file will NOT add features to the document. If you want to customize the feature set, you need to edit the document repository configuration in <KIM_HOME>/config/document.repository.rebuild. Add your custom features to the com.ontotext.kim.KIMConstants.DOCUMENT_FEAT_LIST option. Like all options in document.repository.rebuild, the document feature list is updated when you rebuild your document storage.

This restriction to the feature schema allows better indexing and a more consistent user interface. The document features will be available immediately after loading the document. As a result, they can be used in the semantic annotation pipeline.

Notes:
  • All feature names (keys) will be converted to uppercase automatically.
  • If you want to develop a customized KIM application, make sure you configure the feature list before populating any documents.
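
The effect of the restricted feature schema and the uppercase rule can be sketched as a filter over the metadata read from the .xml file. The function name select_features and the dictionary representation of the metadata are assumptions for illustration; only the default feature names come from the text above.

```python
# Default feature types listed above. A customized deployment would use
# the list configured in DOCUMENT_FEAT_LIST instead.
DEFAULT_FEATURES = frozenset(
    ["TITLE", "SUBTITLE", "AUTHORS", "TIMESTAMP",
     "SUBJECT", "SOURCE", "URL", "ORIGIN"]
)


def select_features(raw_metadata, allowed=DEFAULT_FEATURES):
    """Keep only recognized features, with keys converted to uppercase."""
    features = {}
    for key, value in raw_metadata.items():
        name = key.upper()  # feature names are uppercased automatically
        if name in allowed:  # unrecognized elements add no feature
            features[name] = value
    return features
```

With the default list, metadata such as {"title": "Annual report", "reviewer": "J. Doe"} would produce the single feature TITLE; the "reviewer" entry is dropped until REVIEWER is added to the configured feature list.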

The TIMESTAMP feature will be parsed as a date. Typically, this is the date on which the document was created. The document dates are used in the Timelines section of the Web UI. Furthermore, developers can create queries that return documents from a specific time interval. See the Java RMI API or Web Service API for details.

A variety of date formats are recognized. For best results, use one of the sample date formats:

  • Mon Jan 06 00:43:52 EET 2003
  • 2006-01-06
  • Oct 20 00:00:00 EET 1950
  • 06-01-2006

If the TIMESTAMP feature is missing or contains no date, the document date is set to midnight, in the current time zone, of the day on which the document is processed.
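
As an illustration of the sample formats and the default rule, the sketch below parses the four samples and falls back to midnight of the current day when no value is given. It is not KIM's parser. The function name parse_timestamp is hypothetical, the time zone abbreviation (such as EET) is simply dropped rather than interpreted, "06-01-2006" is assumed to be day-month-year, and English month and weekday names are assumed.

```python
import re
from datetime import date, datetime, time

# Patterns matching the four sample formats once the zone token is removed.
SAMPLE_FORMATS = [
    "%a %b %d %H:%M:%S %Y",  # Mon Jan 06 00:43:52 EET 2003
    "%b %d %H:%M:%S %Y",     # Oct 20 00:00:00 EET 1950
    "%Y-%m-%d",              # 2006-01-06
    "%d-%m-%Y",              # 06-01-2006 (assumed day-month-year)
]


def parse_timestamp(text=None):
    """Parse a TIMESTAMP value; default to today at midnight if absent."""
    if not text:
        return datetime.combine(date.today(), time.min)
    # Drop an uppercase zone abbreviation that precedes the final year.
    cleaned = re.sub(r"\s[A-Z]{2,5}(?=\s\d{4}$)", "", text.strip())
    for pattern in SAMPLE_FORMATS:
        try:
            return datetime.strptime(cleaned, pattern)
        except ValueError:
            continue
    # What KIM does with an unparseable value is not stated above;
    # raising here is a choice made for this sketch.
    raise ValueError("unrecognized date format: " + text)
```

The returned values are naive local datetimes, which is a simplification: a real implementation would need to honor the time zone given in the value.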

Known issues

If you intend to process documents larger than 1 MB, make sure you update the populater memory configuration. Currently the tool is set to use up to 512 MB of memory. This is sufficient to process the documents in the test corpus, which contains documents with a maximum size of 1 MB.

  • For the graphical or console mode, the parameter POP_MAX_JAVA_HEAP=512m is located in the file <KIM_HOME>/bin/config/pop-config. You can increase it up to 1300 MB on 32-bit operating systems and without any limit on 64-bit operating systems.
  • On Linux, you can also override the value for a single run by prefixing the command with an assignment to this variable, for example: POP_MAX_JAVA_HEAP=1024m populater
  • For Windows Service mode, the parameter -Xmx512m is located in the file <KIM_HOME>/config/service.conf. You can increase it in the same way as described above.
