The document repository stores and indexes annotated documents. It provides keyword search, facet search and analysis, and combined search. You can use these search capabilities through either the user interface or the API.
By default, the document repository uses a combination of Lucene and the RDF database OWLIM SE to index and store the documents. The index is highly configurable and enables you to choose the best compromise between search capabilities and efficiency.
Non-default options may be less stable than the default ones.
To configure the document repository, edit the options in <KIM_HOME>/config/document.repository.properties.
The document repository type selects how annotated documents are stored and, optionally, indexed for full-text search. In all cases, the index provided by the repository is augmented by the CORE_INDEX_ADDON described below. Possible values are:
|lucene||stores the complete text of the documents and their annotations in an Apache Lucene index. This document repository provides full-text search capabilities. This is the recommended and default option.|
|only-store||stores the annotated documents as GATE XML files without indexing. Searching therefore requires a CORE_INDEX_ADDON to be selected. This option is highly recommended if your application, based on KIM 3.7, does not require full-text search in the body of the document. In that case, selecting only-store significantly increases performance and reduces storage requirements.|
|mimir||(experimental) stores the complete text of the documents and their annotations in Ontotext Mimir. In addition to a full-text index, Mimir enables searching for annotation patterns.|
|sar||(experimental) stores the text and the annotations of the document in the embedded OWLIM SE as RDF, using the schema of the KM module of the PROTON ontology. This document repository provides full-text search capabilities. Enabling the sar index requires additional configuration. Please contact support for assistance if you intend to use the Semantic Annotation Repository (SAR).|
|coredb||(deprecated) stores the text and the annotations of the document in an Oracle database. This repository supports both full-text and CORE queries, so the CORE_INDEX_ADDON is not necessary. However, because of the high requirements of the Oracle database and the relatively low performance of CORE queries in this setup, this option is deprecated. See this page for details.|
|other||to implement your own document repository (that is, to integrate another indexing or storage library or system), refer to the KIM Developer's Guide.|
The CORE_INDEX_ADDON enhances the document repository index with support for CORE queries, and optionally, other queries. Possible options are:
|rdf||provides support for CORE queries without ranking, and full-text queries over document metadata (features). The index is stored in the embedded OWLIM as RDF, according to the schema of the KM module of the PROTON ontology.|
|rdfranked||provides support for CORE queries with ranking, in addition to the capabilities of rdf. Ranking carries a significant performance cost when working with 100 000 documents or more. That cost can be mitigated by assigning more memory to KIM, which allows it to cache most of the expensive queries. For this reason, rdfranked is the recommended and default option. An advanced feature of rdfranked is that it recognizes semantic broader/narrower relations between terms and accumulates the mentions of narrower terms in the rank of their broader term. Contact support for more details.|
|none||provides no support for CORE queries. Select this option if CORE queries are not required, as it significantly reduces the load on the embedded OWLIM SE and increases indexing performance.|
|other||to implement your own CORE add-on, refer to the KIM Developer's Guide.|
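The interplay between the repository type and the CORE add-on described in the two tables above can be summarized as a small sanity check. The following is only an illustrative sketch: the function name `check_combination` is hypothetical, the value sets are copied from the tables, and the real property keys in document.repository.properties are deliberately not assumed here. Custom ("other") implementations are simply reported as not built-in.

```python
# Value sets copied from the option tables; custom implementations ("other")
# are not listed and are reported as "not built-in" by this sketch.
REPOSITORY_TYPES = {"lucene", "only-store", "mimir", "sar", "coredb"}
CORE_INDEX_ADDONS = {"rdf", "rdfranked", "none"}


def check_combination(repository, addon):
    """Return a list of warnings for a repository type / CORE add-on pair."""
    warnings = []
    if repository not in REPOSITORY_TYPES:
        warnings.append(f"'{repository}' is not a built-in repository type")
    if addon not in CORE_INDEX_ADDONS:
        warnings.append(f"'{addon}' is not a built-in CORE add-on")
    if repository == "only-store" and addon == "none":
        # only-store does not index anything, so search needs an add-on.
        warnings.append("only-store without a CORE add-on leaves nothing searchable")
    if repository == "coredb":
        warnings.append("coredb is deprecated")
        if addon in {"rdf", "rdfranked"}:
            warnings.append("coredb already supports CORE queries; the add-on is not necessary")
    if repository in {"mimir", "sar"}:
        warnings.append(f"'{repository}' is experimental")
    return warnings


if __name__ == "__main__":
    for pair in [("lucene", "rdfranked"), ("only-store", "none"), ("coredb", "rdf")]:
        print(pair, check_combination(*pair))
```

A check like this is only a reading aid for the tables; KIM itself decides at startup which combinations it accepts.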
Set the following options regardless of the type of document repository selected.
This sets the list of document feature names. The names must contain only letters and numbers. Letters are upper-cased internally, so names are not case-sensitive. Commas are treated as delimiters. Duplicate feature names are merged into one. The resulting feature structure is used as the default in the document population process. If this parameter is empty, the default set of document features is used. This set contains all features available for the documents in the KIM demonstration corpus.
Although the document repository works out of the box, some performance-related options may still need to be configured. The following three options restrict the complexity of the queries sent to the Lucene storage engine. KIM enforces them because exceeding these thresholds may affect the stability of the system.
To learn more about this parameter, check the details about Boolean clause restrictions.
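The parsing rules for the feature-name list (comma-delimited, letters and numbers only, upper-cased, duplicates merged) can be illustrated with a short sketch. The function name `parse_feature_names` is hypothetical, and several details are assumptions not confirmed by this page: surrounding whitespace is trimmed, empty entries are skipped, first-seen order is kept, and an invalid name raises an error rather than being silently dropped.

```python
def parse_feature_names(raw):
    """Parse a comma-separated feature list following the rules above."""
    names = []
    seen = set()
    for part in raw.split(","):
        name = part.strip().upper()  # assumption: whitespace is trimmed
        if not name:
            continue  # assumption: empty entries (e.g. trailing comma) are ignored
        if not name.isalnum():
            # Names must contain only letters and numbers.
            raise ValueError(f"invalid feature name: {part.strip()!r}")
        if name not in seen:  # equal names are merged into one
            seen.add(name)
            names.append(name)
    return names


if __name__ == "__main__":
    print(parse_feature_names("title, Author,TITLE,date2"))
```

An empty result corresponds to the case where the parameter is empty and the default feature set applies.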
When sar is used as a document repository module implementation, the annotated documents are saved as files in <KIM_HOME>/context/default/docs. The following option controls how the files are stored.
|gate-xml||stores files as GATE XML documents. They can be opened with GATE Developer directly or imported into GATE Teamware.|
|compressed-xml||stores files as GATE XML documents, compressed using Glassfish XML Fast Infoset encoding and then GZIP. Storage requirements are very low in this case, but the files need to be decompressed before they can be used outside KIM 3.7.|
|simple-gzip||stores files as GATE XML documents, compressed using the gzip compression algorithm.|
|xces-xml||stores files as XCES XML documents.|
|kryo-gzip||stores files as GATE XML documents compressed with gzip and Kryo. This option gives the best compression results, but has several limitations when used.|
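To give a feel for what plain gzip compression of an XML document involves, here is a minimal round-trip sketch using Python's standard `gzip` module. It illustrates the compression step only: the helper names are hypothetical, the sample string is a placeholder rather than real GATE XML, and nothing here reflects how KIM names or lays out its stored files.

```python
import gzip


def compress_xml(xml_text):
    """Encode the XML text as UTF-8 and gzip-compress it."""
    return gzip.compress(xml_text.encode("utf-8"))


def decompress_xml(data):
    """Reverse compress_xml: gunzip and decode back to text."""
    return gzip.decompress(data).decode("utf-8")


if __name__ == "__main__":
    # Placeholder markup; annotation-heavy XML is similarly repetitive.
    sample = "<doc>" + "<a>x</a>" * 200 + "</doc>"
    packed = compress_xml(sample)
    print(len(sample.encode("utf-8")), "bytes ->", len(packed), "bytes")
    assert decompress_xml(packed) == sample
```

Repetitive markup compresses well, which is why the compressed formats trade direct readability in external tools for lower storage requirements.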
In KIM 3.7 we introduce new objects called document handlers. They are attached to the Document Repository module and are executed:
- before a document is added: a handler can add various features to the document or gather statistics
- after a document is added: mainly for statistics, but a handler may also generate a detailed log for that document
- when a document is removed from the document repository
Document handlers are configured through the comma-separated list of values within <KIM_HOME>/config/install.properties called
More information about document handlers and their implementation can be found in the KIM Developer's Guide.
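The three execution points listed above can be pictured with a small, language-neutral sketch. Actual KIM document handlers are implemented against the API described in the KIM Developer's Guide; every class and method name below (`DocumentHandler`, `before_add`, `after_add`, `on_remove`, `Repository`) is hypothetical, and calling the removal hook after the document has been deleted is an assumption.

```python
class DocumentHandler:
    """Hypothetical base class with one no-op hook per execution point."""

    def before_add(self, doc):
        pass

    def after_add(self, doc):
        pass

    def on_remove(self, doc_id):
        pass


class StampHandler(DocumentHandler):
    """Adds a feature to the document before it is stored."""

    def before_add(self, doc):
        doc["features"]["PROCESSED"] = "true"


class StatsHandler(DocumentHandler):
    """Keeps simple counters for added and removed documents."""

    def __init__(self):
        self.added = 0
        self.removed = 0

    def after_add(self, doc):
        self.added += 1

    def on_remove(self, doc_id):
        self.removed += 1


class Repository:
    """Toy in-memory repository that invokes handlers in configured order."""

    def __init__(self, handlers):
        self.handlers = list(handlers)
        self.docs = {}

    def add(self, doc_id, doc):
        for handler in self.handlers:
            handler.before_add(doc)
        self.docs[doc_id] = doc
        for handler in self.handlers:
            handler.after_add(doc)

    def remove(self, doc_id):
        del self.docs[doc_id]
        for handler in self.handlers:  # assumption: hook runs after deletion
            handler.on_remove(doc_id)


if __name__ == "__main__":
    stats = StatsHandler()
    repo = Repository([StampHandler(), stats])
    repo.add("d1", {"text": "hello", "features": {}})
    repo.remove("d1")
    print(stats.added, stats.removed)
```

The handler list passed to `Repository` plays the role of the comma-separated configuration value: handlers run in the order in which they are listed.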
A KIM Extension can perform any procedure during KIM startup or register additional services that run within the KIM server context. For example, KIM uses extensions to initialize basic caching and to expose the RdfCore administration interface as JMX MBean objects.
KIM Extensions, similar to KIM Document Handlers, are configured through the comma-separated list of values within <KIM_HOME>/config/install.properties called
More information on implementing a KIM extension can be found in the KIM Developer's Guide.