compared with
Current by Reneta Popova
on Mar 21, 2013 15:35.

The Large Knowledge Base (LKB) gazetteer consists of a set of lists containing concepts (Persons, Locations, Organizations, etc.) that are loaded directly from the semantic repository you use as background knowledge, instead of from a predetermined, flat set of gazetteer lists.

This means that certain annotated entities are linked to specific instances in the semantic repository. The LKB is part of the GATE distribution and provides efficient representation of very large vocabularies, as well as query-based selective loading from RDF databases.

The Large Knowledge Base (LKB) Gazetteer allows loading collections of identifiers and labels, and uses them for gazetteer lookup. It supports custom implementations of the *{{DictionaryFeeder}}* interface to populate its dictionaries, so arbitrary dictionary data sources can be used.

The LKB Gazetteer supports a Static dictionary, loaded at component initialization, and Dynamic dictionaries, loaded during Document processing.

1. Static Dictionary highlights:

* Can hold huge amounts of data (millions of entries)
* Can be serialized - for faster second-time loading

2. Dynamic Dictionaries highlights:

* Are populated during Document processing and can hold only the data that is needed.
* Can be loaded with up-to-the-minute data or Document-dependent data.

The LKB makes use of a number of configuration files, such as the set of SPARQL queries to be run against the ontology.


h2. Quick usage overview

1. To use the Large KB gazetteer, set up your dictionary first.
The dictionary is a folder with some configuration files. Use the samples at GATE_HOME/plugins/Gazetteer_LKB/samples as a guide.

2. Load *GATE_HOME/plugins/Gazetteer_LKB* as a CREOLE plugin.

3. Create a new ‘Large KB Gazetteer’ processing resource (PR). Set the *{{dictionaryPath}}* parameter to the folder of the dictionary you created. You can leave the remaining parameters at their defaults.

4. Add the PR to your GATE application. The gazetteer doesn’t require a tokenizer or the output of any other processing resources.
The gazetteer will create annotations of type ‘Lookup’ with two features: ‘inst’, which contains the URI of the ontology instance, and ‘class’, which contains the URI of the ontology class that the instance belongs to.
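
As a rough illustration of the annotations described above, the following plain-Java sketch mimics what a lookup produces. It does not use the GATE API: the {{LookupSketch}} class, its nested {{Lookup}} type, and the example URIs are all hypothetical, and the naive substring scan merely stands in for the gazetteer's real matching.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a gazetteer lookup; this is NOT the GATE API. */
class LookupSketch {

    /** One match: character offsets plus the two features named in the text. */
    static final class Lookup {
        final int start;
        final int end;
        final Map<String, String> features = new LinkedHashMap<>();

        Lookup(int start, int end, String inst, String cls) {
            this.start = start;
            this.end = end;
            features.put("inst", inst);  // URI of the ontology instance
            features.put("class", cls);  // URI of the ontology class
        }
    }

    // label -> {instance URI, class URI}; filled with hypothetical data by the caller
    private final Map<String, String[]> dictionary = new LinkedHashMap<>();

    void add(String label, String inst, String cls) {
        if (label.isEmpty()) {
            return; // an empty label would match everywhere, so it is ignored
        }
        dictionary.put(label, new String[] {inst, cls});
    }

    /** Naive scan: reports every occurrence of every label in the text. */
    List<Lookup> annotate(String text) {
        List<Lookup> result = new ArrayList<>();
        for (Map.Entry<String, String[]> e : dictionary.entrySet()) {
            String label = e.getKey();
            int from = text.indexOf(label);
            while (from >= 0) {
                result.add(new Lookup(from, from + label.length(),
                        e.getValue()[0], e.getValue()[1]));
                from = text.indexOf(label, from + 1);
            }
        }
        return result;
    }
}
```

Each {{Lookup}} carries only offsets and the two URIs, which is why no tokenizer output is needed to interpret the result.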

h3. Dictionary setup

The dictionary is a folder with some configuration files. You can find samples at *GATE_HOME/plugins/Gazetteer_LKB/samples*.

1. Define your RDF ontology and then specify a SPARQL or SERQL query that will retrieve a subset of that ontology as a dictionary.

*config.ttl* is a Turtle RDF file which configures a local RDF ontology or connection to a remote Sesame RDF database.

(/) If you want to see examples of how to use local RDF files, please check *samples/dictionary_from_local_ontology/config.ttl*. The Sesame repository configuration section configures a local Ontotext SwiftOWLIM database that loads a list of RDF files.

2. Create a list of your RDF files and reuse the rest of the configuration.

The sample configuration supports datasets with 10,000,000 triples with acceptable performance. To work with larger datasets, advanced users can substitute SwiftOWLIM with another Sesame RDF engine. In that case, make sure you add the necessary JARs to the list in *GATE_HOME/plugins/Gazetteer_LKB/creole.xml*. For example, Ontotext BigOWL is a Sesame RDF engine that can load billions of triples on desktop hardware.

Since any Sesame repository can be configured in *config.ttl*, the Large KB Gazetteer can extract dictionaries from all significant RDF databases.

{excerpt}
*query.txt* contains a SPARQL query. You can write any query you like, as long as its projection contains at least two columns in the following order: label and instance. Optionally, you can add a third column for the ontology class of the RDF entity. Below you can see a sample query, which creates a dictionary from the names and the unique identifiers of 10,000 entertainers in DBpedia.

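{code}
PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Projection order matters: the label comes first, then the instance.
SELECT ?Name ?Person WHERE {
    ?Person a opencyc:Entertainer ; rdfs:label ?Name .
    FILTER (lang(?Name) = "en")
} LIMIT 10000
{code}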

Try this query at the Linked Data Semantic Repository.
When you load the dictionary configuration in GATE for the first time, it creates a binary snapshot of the dictionary. Thereafter it will load only this binary snapshot. If the dictionary configuration is changed, the snapshot will be reinitialized automatically. For more information, please see the dictionary lifecycle specification.
{excerpt}
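
The snapshot behaviour described above can be pictured with a small sketch. The source does not say how a configuration change is detected, so the fingerprint approach below is purely an assumption, and the {{SnapshotSketch}} class with its {{fingerprint}} and {{load}} methods is hypothetical. Configuration files are passed in as strings so that the example stays self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of "rebuild the snapshot only when the configuration changes". */
class SnapshotSketch {
    private String snapshotFingerprint; // fingerprint the current snapshot was built from
    private int rebuilds;               // how many times the snapshot was (re)built

    /** SHA-256 over file names and contents, in a stable (sorted) order. */
    static String fingerprint(Map<String, String> configFiles) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : new TreeMap<>(configFiles).entrySet()) {
                md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between name and content
                md.update(e.getValue().getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between files
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    /** Returns true if the snapshot had to be built or rebuilt, false if it was reused. */
    boolean load(Map<String, String> configFiles) {
        String current = fingerprint(configFiles);
        if (current.equals(snapshotFingerprint)) {
            return false; // configuration unchanged: reuse the existing snapshot
        }
        snapshotFingerprint = current; // a real loader would rerun the query here
        rebuilds++;
        return true;
    }

    int rebuilds() {
        return rebuilds;
    }
}
```

In this sketch, the first {{load}} call builds the snapshot, a second call with the same files reuses it, and editing {{query.txt}} triggers a rebuild.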

h3. Additional dictionary configuration
The following options can be set when the gazetteer PR is initialized:

{{dictionaryPath}} - the dictionary folder described above.
{{forceCaseSensitive}} - whether the gazetteer should return case-sensitive matches regardless of the loaded dictionary.
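
One way to read the {{forceCaseSensitive}} option is sketched below. This is an interpretation rather than the gazetteer's actual code: the {{CaseSensitivitySketch}} class and the {{dictionaryCaseSensitive}} flag are hypothetical and stand for whatever case setting the loaded dictionary carries.

```java
/** Hypothetical reading of forceCaseSensitive; NOT the gazetteer's implementation. */
class CaseSensitivitySketch {

    /**
     * Compares a dictionary label with a span of text.
     * When forceCaseSensitive is true, the dictionary's own setting is overridden.
     */
    static boolean matches(String dictLabel, String textSpan,
                           boolean dictionaryCaseSensitive, boolean forceCaseSensitive) {
        boolean caseSensitive = forceCaseSensitive || dictionaryCaseSensitive;
        return caseSensitive
                ? dictLabel.equals(textSpan)
                : dictLabel.equalsIgnoreCase(textSpan);
    }
}
```

Under this reading, a case-insensitive dictionary matches "london" against the label "London" unless {{forceCaseSensitive}} is set.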

h3. Runtime configuration

{{annotationSetName}} - The annotation set, which will receive the generated lookup annotations.
{{annotationLimit}} - The maximum number of the generated annotations. NULL or 0 for no limit.

(!) Limiting the number of created annotations reduces the memory consumption of GATE on large documents. Note that GATE documents consume gigabytes of memory if there are tens of thousands of annotations in the document. Any PR that creates a large number of annotations, such as a gazetteer or a tokenizer, may cause an Out Of Memory error on large texts. Setting this option limits the amount of memory that the gazetteer will use.
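
The effect of such a limit can be sketched in a few lines. The {{AnnotationLimitSketch}} class is hypothetical, and the choice to keep the first matches and drop the rest is an assumption; the source only states that NULL or 0 means no limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an annotation cap; NOT the gazetteer's implementation. */
class AnnotationLimitSketch {

    /**
     * Returns at most limit candidates, in their original order.
     * A null or zero limit means "no limit", as described for annotationLimit.
     */
    static <T> List<T> applyLimit(List<T> candidates, Integer limit) {
        if (limit == null || limit == 0) {
            return new ArrayList<>(candidates); // no limit: keep everything
        }
        if (limit < 0) {
            throw new IllegalArgumentException("limit must not be negative: " + limit);
        }
        int kept = Math.min(limit, candidates.size());
        return new ArrayList<>(candidates.subList(0, kept));
    }
}
```

Because the cap is applied to the number of annotations rather than to the document size, memory use stays bounded even for very long texts.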

h3. Semantic Enrichment PR

h2. The Shared Gazetteer for multithreaded processing

The DefaultGazetteer (and its subclasses such as the OntoRootGazetteer) compiles its gazetteer data into a finite state matcher at initialization time. For large gazetteers this FSM requires a considerable amount of memory. However, once the FSM has been built then (as long as you do not modify it dynamically using Gaze) it is accessed in a read-only manner at runtime. For a multi-threaded application that requires several identical copies of its processing resources (see section 7.14), GATE provides a mechanism whereby a single compiled FSM can be shared between several gazetteer PRs that can then be executed concurrently in different threads, saving the memory that would otherwise be required to load the lists several times.
(note) This feature is not available in the GATE Developer GUI, as it is only intended for use in embedded code. (note)

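To make use of it, first create a single instance of the regular DefaultGazetteer or OntoRootGazetteer. Then create any number of SharedDefaultGazetteer instances, passing the regular gazetteer (called mainGazetteer here) in the {{bootstrapGazetteer}} parameter:

{code}
// mainGazetteer is the regular DefaultGazetteer (or OntoRootGazetteer) created beforehand.
FeatureMap params = Factory.newFeatureMap();
params.put("bootstrapGazetteer", mainGazetteer);
LanguageAnalyser sharedGazetteer = (LanguageAnalyser) Factory.createResource(
    "gate.creole.gazetteer.SharedDefaultGazetteer", params);
{code}

The SharedDefaultGazetteer instance will re-use the FSM that was built by the mainGazetteer instead of loading its own.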
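
The underlying idea, one read-only structure built once and referenced by several workers, can be illustrated without GATE. In the sketch below every name is hypothetical: {{CompiledLists}} stands in for the compiled FSM and {{Worker}} for a gazetteer PR running in its own thread. It is meant only to show why sharing is safe when nothing modifies the structure after it is built.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration of sharing one read-only matcher between threads. */
class SharedMatcherSketch {

    /** Stands in for the compiled FSM: built once, never modified afterwards. */
    static final class CompiledLists {
        private final Set<String> entries;

        CompiledLists(List<String> words) {
            entries = Collections.unmodifiableSet(new HashSet<>(words));
        }

        boolean contains(String token) {
            return entries.contains(token);
        }
    }

    /** Stands in for one gazetteer PR; it holds a reference, not a copy. */
    static final class Worker implements Callable<Integer> {
        private final CompiledLists shared;
        private final String text;

        Worker(CompiledLists shared, String text) {
            this.shared = shared;
            this.text = text;
        }

        @Override
        public Integer call() {
            int hits = 0;
            for (String token : text.split("\\s+")) {
                if (shared.contains(token)) {
                    hits++;
                }
            }
            return hits;
        }
    }

    /** Runs one worker per text concurrently, all reading the same CompiledLists. */
    static List<Integer> countHits(CompiledLists shared, List<String> texts) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, texts.size()));
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String t : texts) {
                futures.add(pool.submit(new Worker(shared, t)));
            }
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures) {
                out.add(f.get());
            }
            return out;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException(ex);
        } finally {
            pool.shutdown();
        }
    }
}
```

Memory use grows with the number of workers only by the size of a reference, which mirrors the saving described for the shared FSM.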