compared with
Current by Neli Hateva
on Apr 20, 2016 18:27.

{panel}This page includes information about the gazetteer PR and the two LRs used by the gazetteer - TRIE Cache and Metadata.{panel}

{toc}

h3. Purpose of the Linked Data Gazetteer Processing Resource

## (optional) {{path2ignorewords}} \- the path to a plain text file containing words that should be ignored when filling the cache;
## (optional) {{path2rules}} \- the path to a Groovy source file containing rules for rewriting a label. The source must contain a {{workFlow}} method that returns a set of Strings. The Groovy code is used to create derivatives of a label: when a label is added to the cache, all its derivatives are added as well. For example, suppose you want to look up the label "Obama, Barak", but most articles mention the name as "Barak Obama". A rule in the Groovy code can return both "Obama, Barak" and "Barak Obama" when passed "Obama, Barak", so that both labels are added to the cache for matching.
## {{queryFile}} \- URL of a file containing SPARQL queries that return a list of triplets - {{<URI instance, Literal label, URI type>}}. The SPARQL queries in the file are separated by {{@@@}}. They are executed one after another and the result of each is fed to the cache. If there is no file for deserialization in {{path2dic}}, the queries are executed on the SPARQL server defined in {{connectionString}}. SPARQL variables bound to elements of the triplets:
{code}
?concept <-> URI instance;
{code}
## (optional) {{updateable}} \- in some cases the cache should not accept updates after the initial loading; set this to 'false' in such cases.
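
The label rewriting described for {{path2rules}} can be illustrated with a minimal sketch. It is written in Java rather than Groovy, and it is not the rules file shipped with the gazetteer. The class name {{LabelRules}} and the single String parameter of {{workFlow}} are assumptions; the page only states that the method must exist and return a set of Strings.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical rule set: the class name and the method signature are assumptions.
final class LabelRules {

    // Returns the original label plus a "First Last" derivative
    // when the label has the form "Last, First".
    static Set<String> workFlow(String label) {
        Set<String> derivatives = new LinkedHashSet<>();
        derivatives.add(label);
        int comma = label.indexOf(',');
        // Only rewrite labels that contain exactly one comma.
        if (comma > 0 && comma == label.lastIndexOf(',')) {
            String last = label.substring(0, comma).trim();
            String first = label.substring(comma + 1).trim();
            if (!last.isEmpty() && !first.isEmpty()) {
                derivatives.add(first + " " + last);
            }
        }
        return derivatives;
    }

    public static void main(String[] args) {
        // Prints the set of labels that would be added to the cache.
        System.out.println(workFlow("Obama, Barak"));
    }
}
```

With this rule, passing "Obama, Barak" yields both "Obama, Barak" and "Barak Obama", while a label without a comma is returned unchanged.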

Note: The SPARQL endpoint must be open, or its security must be turned off, because there are no parameters for a user name and password.
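
As a rough illustration of the {{queryFile}} format, the sketch below splits file content into individual queries on the {{@@@}} separator. It only shows the splitting step and is not the gazetteer's own loader. The class name {{QueryFileSplitter}} is hypothetical, and the variable names {{?label}} and {{?type}} in the sample queries are assumptions, since only {{?concept}} is named on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows how a query file separated by @@@ could be split.
final class QueryFileSplitter {

    // Splits the file content on the @@@ separator and drops empty chunks.
    static List<String> splitQueries(String fileContent) {
        List<String> queries = new ArrayList<>();
        for (String part : fileContent.split("@@@")) {
            String query = part.trim();
            if (!query.isEmpty()) {
                queries.add(query);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Sample content; ?label and ?type are assumed variable names.
        String content =
            "SELECT ?concept ?label ?type WHERE {\n"
          + "  ?concept <http://www.w3.org/2000/01/rdf-schema#label> ?label ;\n"
          + "           a ?type .\n"
          + "}\n"
          + "@@@\n"
          + "SELECT ?concept ?label ?type WHERE {\n"
          + "  ?concept <http://www.w3.org/2004/02/skos/core#altLabel> ?label ;\n"
          + "           a ?type .\n"
          + "}\n";
        // Each query would be executed in turn and its result fed to the cache.
        for (String query : splitQueries(content)) {
            System.out.println(query);
            System.out.println("----");
        }
    }
}
```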

h3. Metadata Language Resource

# Using the Resource: This resource represents a cache structure that maps {{identifier(instance URI, class URI)}} \-> a list of semantic metadata features in the form {{identifier(feature value, feature name)}}. It is used by the Linked Data Gazetteer Processing Resource to assign additional features to the generated Lookup annotations. The Metadata Language Resource and the TRIE Cache Language Resource are tightly coupled, as they need to use common structures to identify the entity instances represented by the Lookup annotations.
# Initialization: This Language Resource has several initialization parameters, two of which are mandatory. The parameters include the location of the SPARQL endpoint for cache loading, the paths to the structures shared with the TRIE Cache Language Resource, and the path to a file containing the queries used to load the cache.
# Parameters:
## connectionString - SPARQL endpoint for the remote repository where semantic data is stored;
## indexFilePath - the path to the filesystem location where the cache will be serialized;
## pathToEntityPoolFolder - the location of the Entity Pool structure, shared with the TRIE Cache Language Resource. The value of this parameter must be exactly the same as the value of the 'path2dic' parameter of the TRIE Cache;
## queryFilePath - the path to the file with queries to load the resource data.
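
The shape of the metadata cache can be pictured with the following minimal sketch. It is a simplified in-memory stand-in, not the resource's actual implementation: it uses plain strings where the real structure uses identifiers shared through the Entity Pool, and the names {{MetadataCacheSketch}}, {{Feature}}, {{addFeature}} and {{featuresFor}} are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the (instance URI, class URI) -> features mapping.
final class MetadataCacheSketch {

    // One metadata feature, stored as (feature value, feature name).
    static final class Feature {
        final String value;
        final String name;

        Feature(String value, String name) {
            this.value = value;
            this.name = name;
        }
    }

    // Key: [instance URI, class URI]; lists compare by content, so they work as map keys.
    private final Map<List<String>, List<Feature>> cache = new HashMap<>();

    private static List<String> key(String instanceUri, String classUri) {
        return Arrays.asList(instanceUri, classUri);
    }

    // Registers one feature for the given instance and class.
    void addFeature(String instanceUri, String classUri, String value, String name) {
        cache.computeIfAbsent(key(instanceUri, classUri), k -> new ArrayList<>())
             .add(new Feature(value, name));
    }

    // Returns the features for the given instance and class, or an empty list.
    List<Feature> featuresFor(String instanceUri, String classUri) {
        return cache.getOrDefault(key(instanceUri, classUri), Collections.emptyList());
    }
}
```

In this picture, the gazetteer would look up the instance and class behind a Lookup annotation and copy the returned features onto it.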

h3. Linked Data Gazetteer Processing Resource

# Using the Resource: This Processing Resource runs over the document text and produces Lookup annotations with (optionally) semantic data features. It uses the TRIE Cache Language Resource and (optionally) the Metadata Language Resource.
# Runtime parameters:
## cacheLR - this is the TRIE Cache Language Resource that will be used for matching;
## (optional) inputAsName - the name of the annotation set that contains the Token annotations; the default is <null>, i.e. the default annotation set;
## (optional) metadataLR - the Metadata Language Resource bound to the corresponding TRIE Cache Language Resource.
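
To give an intuition for how a trie-based cache can match labels against tokens, here is a minimal sketch of longest-match lookup over a token trie. This is a generic technique chosen for illustration and is not the gazetteer's actual matching code. The class {{TrieLookupSketch}}, its methods, and the "start-end:URI" result format are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical token trie: each edge is one token of a label.
final class TrieLookupSketch {

    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String instanceUri; // non-null when a complete label ends at this node
    }

    private final Node root = new Node();

    // Adds a whitespace-tokenized label that points to the given instance URI.
    void addLabel(String label, String instanceUri) {
        Node node = root;
        for (String token : label.trim().split("\\s+")) {
            node = node.children.computeIfAbsent(token, t -> new Node());
        }
        node.instanceUri = instanceUri;
    }

    // Scans the tokens left to right and records the longest match at each position
    // as "startToken-endToken:instanceUri" (token indexes are inclusive).
    List<String> lookup(String[] tokens) {
        List<String> lookups = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            Node node = root;
            int matchEnd = -1;
            String matchUri = null;
            for (int j = i; j < tokens.length; j++) {
                node = node.children.get(tokens[j]);
                if (node == null) {
                    break;
                }
                if (node.instanceUri != null) {
                    matchEnd = j;
                    matchUri = node.instanceUri;
                }
            }
            if (matchEnd >= 0) {
                lookups.add(i + "-" + matchEnd + ":" + matchUri);
                i = matchEnd + 1; // continue after the matched label
            } else {
                i++;
            }
        }
        return lookups;
    }
}
```

The recorded matches play the role of Lookup annotations here; in the real Processing Resource the tokens come from the Token annotations in the input annotation set.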

h3. Step-by-step Guide for Creating and Adding a Linked Data Gazetteer Processing Resource into a Pipeline

# Open a GATE Developer instance.
# (Optional) Load the GATE application where the Gazetteer PR is to be added.
# Load the gazetteer CREOLE plugin:
## File \-> Manage CREOLE plugins.
## Click the '+' button.
## Type in the directory location (prefixing it with 'file://') or use the 'Select a Directory' button.
## Click 'OK' and select the 'Load Now' option for the newly loaded plugin.
## Click 'Apply All'.
# Prepare a helper application:
## Right-click 'Applications' in the left-hand side menu and select 'Create New Application \-> Conditional Corpus Pipeline'.
## Double-click the created pipeline and add an existing tokenization PR to it; if no tokenization Processing Resources are present, use the following procedure:
### Load the ANNIE CREOLE plugin from the 'Manage CREOLE plugins' screen.
### Right-click 'Processing Resources \-> ANNIE English Tokeniser'.
### Add the newly created resource to the helper pipeline.
# Right-click 'Language Resources \-> TRIE Cache for Linked Data Gazetteer' to create a TRIE Cache for the Linked Data Gazetteer Language Resource.
# Set the parameter values; refer to the TRIE Cache Parameters section for more information.
# Click 'OK'; the TRIE Cache Language Resource will begin preparing its cache by evaluating the queries against the SPARQL endpoint or deserializing existing data.
# (Optional) Right-click 'Language Resources \-> Metadata LR' to create a Metadata Language Resource.
# (Optional) Set the parameter values; refer to the Metadata Language Resource Parameters section for more information.
# (Optional) Click 'OK'; the Metadata Language Resource will begin preparing its cache by evaluating the queries against the SPARQL endpoint or deserializing existing data.
# Right-click 'Processing Resources \-> Linked Data Gazetteer' to create a Linked Data Gazetteer Processing Resource.
# Open the pipeline where the Linked Data Gazetteer must be added and add the newly created instance at the desired position within the pipeline.
# Select the Linked Data Gazetteer Processing Resource and set its runtime parameters; refer to the Linked Data Gazetteer Processing Resource Runtime parameters section.