The Large Knowledge Base (LKB) gazetteer consists of a set of lists containing concepts (Persons, Locations, Organizations, etc.) loaded directly from the semantic repository you are going to use as background knowledge, instead of a predetermined, flat set of gazetteer lists.
This means that annotated entities are linked to specific instances in the semantic repository. The LKB is part of the GATE distribution and provides an efficient representation of very large vocabularies, as well as query-based selective loading from RDF databases.
The Large Knowledge Base (KB) Gazetteer allows loading collections of identifiers and labels, and uses them for gazetteer lookup. It supports custom implementations of the DictionaryFeeder interface to populate its dictionaries, which can utilize arbitrary dictionary data sources.
The LKB Gazetteer supports a Static dictionary loaded at component initialization and Dynamic dictionaries loaded on Document processing.
1. Static Dictionary highlights:
- Can hold huge amounts of data (millions of entries)
- Can be serialized for faster loading on subsequent runs
- Can be shared in memory by multiple gazetteer instances working in parallel
2. Dynamic Dictionaries highlights:
- Are populated during Document processing and can hold just the entries that are needed.
- Can be loaded with up-to-the-minute data or Document-dependent data.
- Are disposed of when Document processing ends.
The Gazetteer can use only the Static dictionary, only Dynamic dictionaries, or both simultaneously.
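The lifecycle of the two dictionary kinds can be pictured with a small Java sketch. It does not use the LKB Gazetteer API: the class and method names (DictionarySketch, beginDocument, lookup, endDocument) are hypothetical, and the choice to consult the dynamic entries before the static ones is an assumption made only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of static vs. dynamic dictionaries; not the GATE API. */
public class DictionarySketch {
    // Loaded once at initialization and never modified afterwards.
    private final Map<String, String> staticDict;
    // Filled for each document and cleared when its processing ends.
    private final Map<String, String> dynamicDict = new HashMap<>();

    public DictionarySketch(Map<String, String> staticEntries) {
        this.staticDict = Collections.unmodifiableMap(new HashMap<>(staticEntries));
    }

    /** Start of document processing: load document-dependent entries. */
    public void beginDocument(Map<String, String> documentEntries) {
        dynamicDict.putAll(documentEntries);
    }

    /** Returns the identifier for a label, or null if the label is unknown. */
    public String lookup(String label) {
        // Assumption: dynamic entries take precedence over static ones.
        String id = dynamicDict.get(label);
        return id != null ? id : staticDict.get(label);
    }

    /** End of document processing: dispose of the dynamic entries. */
    public void endDocument() {
        dynamicDict.clear();
    }
}
```

In this model the static map plays the role of the large, shareable dictionary, while the dynamic map only ever holds the entries relevant to the document currently being processed.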
The LKB makes use of a number of configuration files such as the set of SPARQL queries to be used on the ontology.
1. To use the Large KB gazetteer, set up your dictionary first.
The dictionary is a folder with some configuration files. Use the samples at GATE_HOME/plugins/Gazetteer_LKB/samples as a guide.
2. Load GATE_HOME/plugins/Gazetteer_LKB as a CREOLE plugin.
3. Create a new ‘Large KB Gazetteer’ processing resource (PR). Put the folder of the dictionary you created in the dictionaryPath parameter. You can leave the rest of the parameters as defaults.
4. Add the PR to your GATE application. The gazetteer doesn’t require a tokenizer or the output of any other processing resources.
The gazetteer will create annotations of type ‘Lookup’ with two features: ‘inst’, which contains the URI of the ontology instance, and ‘class’, which contains the URI of the ontology class that the instance belongs to.
The dictionary is a folder with some configuration files. You can find samples at GATE_HOME/plugins/Gazetteer_LKB/samples.
1. Define your RDF ontology and then specify a SPARQL or SeRQL query that will retrieve a subset of that ontology as a dictionary.
config.ttl is a Turtle RDF file which configures a local RDF ontology or a connection to a remote Sesame RDF database.
If you want to see examples of how to use local RDF files, please check samples/dictionary_from_local_ontology/config.ttl. The Sesame repository configuration section configures a local Ontotext SwiftOWLIM database that loads a list of RDF files.
2. Create a list of your RDF files and reuse the rest of the configuration.
The sample configuration supports datasets of 10,000,000 triples with acceptable performance. For working with larger datasets, advanced users can substitute SwiftOWLIM with another Sesame RDF engine. In that case, make sure you add the necessary JARs to the list in GATE_HOME/plugins/Gazetteer_LKB/creole.xml. For example, Ontotext BigOWLIM is a Sesame RDF engine that can load billions of triples on desktop hardware.
Since any Sesame repository can be configured in config.ttl, the Large KB Gazetteer can extract dictionaries from all significant RDF databases.
query.txt contains a SPARQL query. You can write any query you like, as long as its projection contains at least two columns in the following order: label and instance. As an option, you can also add a third column for the ontology class of the RDF entity. Below you can see a sample query, which creates a dictionary from the names and the unique identifiers of 10,000 entertainers in DBpedia.
Try this query at the Linked Data Semantic Repository.
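The column-order rule for query.txt can be illustrated with a short Java sketch. It is not GATE code: QueryRowSketch, Entry and toEntries are hypothetical names, and the rows are modelled as plain string arrays rather than real SPARQL results. It only shows how a two- or three-column projection (label, instance, optional class) would map to dictionary entries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the column-order rule for query.txt; not GATE code. */
public class QueryRowSketch {

    /** One dictionary entry: label and instance URI, plus an optional class URI. */
    public static final class Entry {
        public final String label;
        public final String instanceUri;
        public final String classUri; // null when the query projects only two columns

        Entry(String label, String instanceUri, String classUri) {
            this.label = label;
            this.instanceUri = instanceUri;
            this.classUri = classUri;
        }
    }

    /** Converts result rows (column 1 = label, 2 = instance, optional 3 = class). */
    public static List<Entry> toEntries(List<String[]> rows) {
        List<Entry> entries = new ArrayList<>();
        for (String[] row : rows) {
            if (row.length < 2) {
                throw new IllegalArgumentException(
                        "a row needs at least the label and instance columns");
            }
            String classUri = row.length >= 3 ? row[2] : null;
            entries.add(new Entry(row[0], row[1], classUri));
        }
        return entries;
    }
}
```

A query that projected the instance before the label would still produce rows of the right width, but the two values would end up swapped in the dictionary, which is why the order of the projection matters.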
When you load the dictionary configuration in GATE for the first time, it creates a binary snapshot of the dictionary. Thereafter it will load only this binary snapshot. If the dictionary configuration is changed, the snapshot will be reinitialized automatically.
The config.ttl may contain additional dictionary configuration. Such configuration concerns only the initial loading of the dictionary from the RDF database. The options are still being determined and more will appear in future versions. They must be placed below the repository configuration section as attributes of a dictionary configuration. Here is a sample config.ttl file with additional configuration.
The following options can be set when the gazetteer PR is initialized:
dictionaryPath - the dictionary folder described above.
forceCaseSensitive - whether the gazetteer should return case-sensitive matches regardless of the loaded dictionary.
annotationSetName - the annotation set that will receive the generated Lookup annotations.
annotationLimit - the maximum number of annotations to generate. NULL or 0 means no limit.
Limiting the number of annotations created reduces the memory consumption of GATE on large documents. Note that GATE documents can consume gigabytes of memory if there are tens of thousands of annotations in the document. Any PR that creates a large number of annotations, such as a gazetteer or a tokenizer, may cause an Out Of Memory error on large texts. Setting this option limits the amount of memory that the gazetteer will use.
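The meaning of the annotationLimit values can be sketched in a few lines of Java. This is an illustration only, not the gazetteer's implementation: AnnotationLimitSketch and applyLimit are hypothetical names, matches are modelled as strings, and keeping the first matches when the limit is reached is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of annotationLimit semantics; not the GATE implementation. */
public class AnnotationLimitSketch {

    /**
     * Returns at most {@code limit} matches. A null or 0 limit means "no limit".
     * Assumption: negative values are also treated as "no limit".
     */
    public static List<String> applyLimit(List<String> matches, Integer limit) {
        if (limit == null || limit <= 0 || matches.size() <= limit) {
            return new ArrayList<>(matches);
        }
        // Assumption: the earliest matches are kept once the limit is reached.
        return new ArrayList<>(matches.subList(0, limit));
    }
}
```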
The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories. It is a companion to the Large KB Gazetteer that showcases the usefulness of using Linked Data URIs as identifiers.
Here a semantic annotation is an annotation that is linked to an RDF entity by having the URI of the entity in the ‘inst’ feature of the annotation. For every such annotation of a given type, this PR runs a SPARQL query against the defined repository and puts a comma-separated list of the values returned by the query in the ‘connections’ feature of the same annotation.
There is a sample pipeline that features the Semantic Enrichment PR.
inputASName - the annotation set whose annotations will be processed.
server - the URL of the Sesame 2 HTTP repository. Support for generic SPARQL endpoints can be implemented if required.
repositoryId - the ID of the Sesame repository.
annotationTypes - a list of types of annotation that will be processed.
query - a SPARQL query pattern. The query will be processed as String.format(query, uriFromAnnotation), so you can use parameters like %s or %1$s.
deleteOnNoRelations - whether to delete the annotations that were not enriched. Helps to clean up the input annotations.
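How the query parameter is expanded can be tried out with plain Java, since it relies only on String.format. The sketch below is not the PR's code: EnrichmentQuerySketch, buildQuery and toConnections are hypothetical names, the example.org URIs are placeholders, and the exact separator used for the ‘connections’ feature (a bare comma here) is an assumption.

```java
import java.util.List;

/** Hypothetical sketch of query expansion for the enrichment step; not GATE code. */
public class EnrichmentQuerySketch {

    /** Fills the annotation's URI into the pattern via String.format. */
    public static String buildQuery(String queryPattern, String uriFromAnnotation) {
        return String.format(queryPattern, uriFromAnnotation);
    }

    /** Joins query result values into one comma-separated string. */
    public static String toConnections(List<String> values) {
        // Assumption: a bare comma with no surrounding whitespace.
        return String.join(",", values);
    }

    public static void main(String[] args) {
        // %1$s can be repeated, so the same URI may appear several times in the pattern.
        String pattern = "SELECT ?o WHERE { <%1$s> ?p ?o }";
        System.out.println(buildQuery(pattern, "http://example.org/resource/X"));
    }
}
```

Because the pattern goes through String.format, any literal percent sign in the query has to be written as %%.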
The DefaultGazetteer (and its subclasses such as the OntoRootGazetteer) compiles its gazetteer data into a finite state matcher at initialization time. For large gazetteers this FSM requires a considerable amount of memory. However, once the FSM has been built then (as long as you do not modify it dynamically using Gaze) it is accessed in a read-only manner at runtime. For a multi-threaded application that requires several identical copies of its processing resources, GATE provides a mechanism whereby a single compiled FSM can be shared between several gazetteer PRs that can then be executed concurrently in different threads, saving the memory that would otherwise be required to load the lists several times.
Note: This feature is not available in the GATE Developer GUI, as it is only intended for use in embedded code.
To make use of it, first create a single instance of the regular DefaultGazetteer or OntoRootGazetteer, referred to here as mainGazetteer. Then create SharedDefaultGazetteer instances that refer to it.
Each SharedDefaultGazetteer instance will re-use the FSM that was built by mainGazetteer instead of loading its own.
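The sharing idea itself can be shown without the GATE classes. The following Java sketch is a stand-in, not the gazetteer API: CompiledLists models the read-only compiled FSM as an immutable set, and SharedMatcher models a gazetteer that holds a reference to an existing compiled structure instead of building its own. Both names are hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of sharing one read-only structure; not the GATE API. */
public class SharedMatcherSketch {

    /** Stand-in for the compiled, read-only FSM: built once, never modified. */
    public static final class CompiledLists {
        private final Set<String> entries;

        public CompiledLists(Set<String> entries) {
            // Defensive copy wrapped as unmodifiable, so concurrent reads are safe.
            this.entries = Collections.unmodifiableSet(new HashSet<>(entries));
        }

        public boolean matches(String token) {
            return entries.contains(token);
        }
    }

    /** Stand-in for a gazetteer that reuses an already compiled structure. */
    public static final class SharedMatcher {
        private final CompiledLists shared;

        public SharedMatcher(CompiledLists shared) {
            this.shared = shared; // a reference only; nothing is copied or rebuilt
        }

        public boolean matches(String token) {
            return shared.matches(token);
        }

        public CompiledLists compiled() {
            return shared;
        }
    }
}
```

Each thread would get its own SharedMatcher, but all of them point at the same CompiledLists object, so the memory cost of the compiled data is paid only once.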