{toc}
h2. Overview
Document annotation is already available through SPARQL, so further concept extraction functionality is exposed in the same way. This page defines the embedded CES (concept extraction service) API, including the control mechanisms for its administration and configuration.
h2. Semantic annotation
The document annotation is executed through a specially crafted SELECT SPARQL query. It takes a single triple pattern, which consists of a binding variable, a special predicate, and an [RDF Collection|http://www.w3.org/TR/rdf-sparql-query/#collections] holding the parameters.
The following example will get you started:
{code:language=html/xml}SELECT * WHERE {
?s <http://www.ontotext.com/owlim/ces#annotate> (
"""content=<?xml version="1.0" encoding="UTF-8"?>
<tns:document id="http://ontotext.com/publishing/document/215351"
xmlns:tns="http://www.ontotext.com/DocumentSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<tns:feature-set>
<tns:feature>
<tns:name type="xs:string">sourceUrl</tns:name>
<tns:value type="xs:string"></tns:value>
</tns:feature>
</tns:feature-set>
<tns:document-parts>
<tns:document-part id="1" part="TITLE">
<tns:content>
China economy on the rise
</tns:content>
</tns:document-part>
</tns:document-parts>
</tns:document>"""
"domain=urn:my.domain:namespace"
"content-type=application/vnd.ontotext.ces.document+xml"
"accept-type=application/vnd.ontotext.ces.document+xml") .
}
{code}
Where:
* <[http://www.ontotext.com/owlim/ces#annotate]> is a special predicate, which means that GraphDB listens for it and knows how to interpret it;
* "content=..." carries the document. The example above passes a full XML document; a human-readable plain-text form such as "content=China economy on the rise" is accepted only if it is a single line (useful to show the idea);
* "domain=..." is a domain identifier, for example [http://www.ontotext.com/owlim/ces#default]. It explicitly denotes which extraction algorithm should be used. Different domains usually require different extraction techniques;
* "content-type=..." is the MIME type of the document, for example text/plain for single-line plain text.
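The single-line plain-text form described in the list above can be sketched as the following query. This is a minimal sketch: it assumes the parameter names from the Parameter reference table (_content_, _domain_, _content-type_) and assumes the target pipeline accepts _text/plain_ input. Adjust the domain to one of your deployed pipelines.
{code:language=html/xml}
# Sketch only: plain-text input is assumed to be accepted by the pipeline.
SELECT * WHERE {
?s <http://www.ontotext.com/owlim/ces#annotate> (
"content=China economy on the rise"
"domain=http://www.ontotext.com/owlim/ces#default"
"content-type=text/plain") .
}
{code}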
h3. Parameter reference
|| Parameter || Required || Supported values || Default value || Comment ||
| content | true | XML/JSON \\ {warning}The input should already be validated; no validation is performed at query parsing or processing level.{warning} | none | {note}The (de)serialising is done with the URL-safe flag turned on. \\
See [http://en.wikipedia.org/wiki/Base64#URL_applications] \\
and [http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html]{note} |
| domain | true | URN | none | e.g. [http://www.ontotext.com/owlim/ces#default] |
| content-type | false | "application/vnd.ontotext.ces.document+xml", "application/vnd.ontotext.ces.document+json" | "application/vnd.ontotext.ces.document+xml" | This is the type of the encoded document. A markup-aware GATE document is created from it. |
| accept-type | false | "application/vnd.ontotext.ces.document+xml", "application/vnd.ontotext.ces.document+json" | "application/vnd.ontotext.ces.document+xml" | This parameter indicates the preferred result type: XML or JSON. |
| annotation-sets-to-preserve | false | Comma-separated list of internal annotation set names (strings). An internal name has the format <annotation set name>\|<ref id>. For example, to preserve the annotation set {code}<tns:annotation-set name="brendan" ref="2">{code} add the parameter {code}"annotation-sets-to-preserve=brendan|2"{code} | none | This parameter specifies which annotation sets should be preserved. \\
Names are trimmed, so a name cannot start or end with white space. |
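Optional parameters are appended to the same RDF collection as further string literals. The sketch below reuses the sample document from the annotation example, requests a JSON result via _accept-type_, and passes the _annotation-sets-to-preserve_ value from the table. The sample document contains no annotation set named _brendan_, so that value is purely illustrative.
{code:language=html/xml}
# Sketch only: the annotation set name "brendan|2" is illustrative.
SELECT * WHERE {
?s <http://www.ontotext.com/owlim/ces#annotate> (
"""content=<?xml version="1.0" encoding="UTF-8"?>
<tns:document id="http://ontotext.com/publishing/document/215351"
xmlns:tns="http://www.ontotext.com/DocumentSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<tns:feature-set>
<tns:feature>
<tns:name type="xs:string">sourceUrl</tns:name>
<tns:value type="xs:string"></tns:value>
</tns:feature>
</tns:feature-set>
<tns:document-parts>
<tns:document-part id="1" part="TITLE">
<tns:content>
China economy on the rise
</tns:content>
</tns:document-part>
</tns:document-parts>
</tns:document>"""
"domain=urn:my.domain:namespace"
"content-type=application/vnd.ontotext.ces.document+xml"
"accept-type=application/vnd.ontotext.ces.document+json"
"annotation-sets-to-preserve=brendan|2") .
}
{code}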
h3. Results
The result of the semantic annotation is a JSON/XML document with annotations. The document is a valid instance of Ontotext's generic schema definition.
h2. Re-training
Not defined yet. Whether re-training applies depends on whether the pipeline contains re-trainable machine learning components.
h2. Administration and configuration
Concept extraction is disabled by default. Start/stop is achieved through SPARQL Update queries.
h3. Start
* The first time initialisation *DOES NOT* include gazetteer cache loading - see [Reload dictionary|#Reloaddictionary].
* On subsequent starts it loads the cache from the file system.
To start *ALL* registered concept extraction pipelines, use the following query:
{code:language=html/xml}
INSERT DATA {
[] <http://www.ontotext.com/owlim/ces#start> [].
}
{code}
To start a specific pipeline, wrap the statement in the pipeline's named graph. The following example uses the default pipeline:
{code:lang=xml}
INSERT DATA {
GRAPH <http://www.ontotext.com/owlim/ces#default> {
[] <http://www.ontotext.com/owlim/ces#start> [].
}
}
{code}
h3. Stop
To stop *ALL* concept extraction pipelines, use:
{code:language=html/xml}
INSERT DATA {
[] <http://www.ontotext.com/owlim/ces#stop> [].
}
{code}
To stop a specific pipeline, use the named graph of the pipeline:
{code:lang=xml}
INSERT DATA {
GRAPH <http://www.ontotext.com/owlim/ces#default> {
[] <http://www.ontotext.com/owlim/ces#stop> [].
}
}
{code}
h3. Reload dictionary
Reload dictionary clears the gazetteer cache from the file system and loads it again from the repository.
{note}If the concept extraction service is not started, this SPARQL update operation will *NOT* schedule a dictionary reload (unlike before).{note}
The first query below initiates a dictionary reload on all running pipelines. To target a particular pipeline, use its named graph, as in the second query and in the Start/Stop sections of this page.
{code:language=html/xml}
INSERT DATA {
[] <http://www.ontotext.com/owlim/ces#reloadDictionary> [].
}
{code}
{code:language=html/xml}
INSERT DATA {
GRAPH <http://www.ontotext.com/owlim/ces#default> {
[] <http://www.ontotext.com/owlim/ces#reloadDictionary> [].
}
}
{code}
h3. Add/remove gazetteer configuration
Gazetteer configuration registers (or removes) template queries for the different entity types via INSERT DATA / DELETE DATA.
* <[http://www.ontotext.com/owlim/ces#gazetteerConfig]> is a special (interpretable) predicate that denotes a gazetteer template query entry;
* Each gazetteer configuration should be added in a separate named graph (per domain), e.g. the default pipeline uses <[http://www.ontotext.com/owlim/ces#default]>;
* The template queries are also executed for all sub-classes of the defined class;
* The configuration is stored as regular triples in the repository and is loaded on the concept extraction initialisation.
The following example configuration loads the rdfs:label values of all Agents, Locations and EconomicConcepts into the gazetteer dictionary:
{code:language=html/xml}
INSERT DATA { GRAPH <http://www.ontotext.com/owlim/ces#default> {
<http://ontotext.com/ontologies/core/Agent> <http://www.ontotext.com/owlim/ces#gazetteerConfig> "select ?label ?inst where {<%s> a <http://ontotext.com/ontologies/core/Agent> . <%s> rdfs:label ?label.}" .
<http://ontotext.com/ontologies/location/Location> <http://www.ontotext.com/owlim/ces#gazetteerConfig> "select ?label ?inst where {<%s> a <http://ontotext.com/ontologies/location/Location> . <%s> rdfs:label ?label.}" .
<http://ontotext.com/ontologies/economy/EconomicConcept> <http://www.ontotext.com/owlim/ces#gazetteerConfig> "select ?label ?inst where {<%s> a <http://ontotext.com/ontologies/economy/EconomicConcept> . <%s> rdfs:label ?label.}" . }
}
{code}
{info}Adding or removing a gazetteer configuration does not take full effect immediately. For example, after a new template query is added, the CES plugin starts to listen for new entities of the corresponding type, but it does not load entities of that type that already exist. To load them, trigger a dictionary reload.{info}
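Removing a configuration entry can be sketched with DELETE DATA, mirroring the INSERT DATA example above. This assumes the entry was added exactly as shown there: DELETE DATA only removes a triple whose subject, predicate and literal match the stored one character for character.
{code:language=html/xml}
# Sketch only: removes the Agent template query from the default pipeline's configuration.
DELETE DATA { GRAPH <http://www.ontotext.com/owlim/ces#default> {
<http://ontotext.com/ontologies/core/Agent> <http://www.ontotext.com/owlim/ces#gazetteerConfig> "select ?label ?inst where {<%s> a <http://ontotext.com/ontologies/core/Agent> . <%s> rdfs:label ?label.}" . }
}
{code}
Since configuration changes do not take full effect immediately, a dictionary reload is presumably needed before the gazetteer reflects the removal.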
h2. FAQ
h3. How to deploy a pipeline?
Unpack your pipeline into _$\{info.aduna.platform.appdata.basedir\}/repositories/$\{repository.name\}/storage/ces/pipelines/_ and it will be discovered automatically. You can confirm it is discovered by finding an MBean called _PipelineManager_ and checking its _AvailablePipelines_ property, which lists the URIs of all deployed pipelines. Note that in order to use the pipeline you need to start it.
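Starting a newly deployed pipeline uses the same update as in the Start section, with the pipeline's URI as the named graph. In this sketch the URI _urn:my.domain:namespace_ is a placeholder borrowed from the annotation example; substitute the URI reported by _AvailablePipelines_. The second operation requests a dictionary reload, since first-time initialisation does not load the gazetteer cache. Sending both operations in one request is an assumption; if your client or the plugin does not handle that, send them as two separate updates.
{code:language=html/xml}
# Sketch only: <urn:my.domain:namespace> is a placeholder pipeline URI.
INSERT DATA {
GRAPH <urn:my.domain:namespace> {
[] <http://www.ontotext.com/owlim/ces#start> [].
}
} ;
# Populate the gazetteer cache after the first start.
INSERT DATA {
GRAPH <urn:my.domain:namespace> {
[] <http://www.ontotext.com/owlim/ces#reloadDictionary> [].
}
}
{code}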
h3. How to preserve annotation sets?
Use the _annotation-sets-to-preserve_ parameter of the annotation query, described in the Parameter reference above. No annotation sets are preserved by default.