
Overview
We already provide a means to annotate documents through SPARQL, so it makes sense to continue the trend and expose more concept-extraction-oriented functionality in the same fashion. The idea is to expose control mechanisms for administering and configuring the embedded CES service. This page provides a comprehensive definition of this API.
Semantic annotation
The document annotation is executed through a specially crafted SELECT SPARQL query. It takes a single triple pattern, which consists of a binding variable, a special predicate and an RDF Collection holding the parameters.
We will jump ahead with an example to get started:
Here's the intuition behind the query:
- <http://www.ontotext.com/owlim/ces#annotate> is a special predicate, which means that OWLIM listens for it and knows how to interpret it.
- "content=China economy on the rise" is the text of the document. Plain human-readable text is accepted only when it fits on a single line, which is enough to illustrate the idea.
- "domain-name=http://www.ontotext.com/owlim/ces#default" is a domain identifier - it explicitly denotes which extraction algorithm should be used. Different domains usually require different extraction techniques.
- "content-type=text/plain" is the MIME type of the document, which is just plain text in this example.
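The query shape described above can be sketched as a small Python helper that assembles the SELECT string. This is only an illustration under stated assumptions: that the binding variable sits in the subject position, that each parameter is passed as a plain "name=value" string literal inside the RDF Collection, and that the parameter names are passed through unchanged. The helper names (`escape_literal`, `build_annotate_query`) and the variable name `annotated` are hypothetical and not part of the API.

```python
ANNOTATE_PREDICATE = "http://www.ontotext.com/owlim/ces#annotate"


def escape_literal(value):
    """Escape a value for use inside a double-quoted SPARQL string literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def build_annotate_query(params, var="annotated"):
    """Build a SELECT query with one triple pattern: variable, predicate, collection.

    Assumption: every parameter becomes a "name=value" string literal and the
    literals are listed in an RDF Collection, written as ( ... ) in SPARQL.
    """
    literals = ['"' + escape_literal(f"{name}={value}") + '"'
                for name, value in params.items()]
    collection = "(" + " ".join(literals) + ")"
    return f"SELECT ?{var} WHERE {{ ?{var} <{ANNOTATE_PREDICATE}> {collection} }}"


# Parameters taken from the walkthrough above.
query = build_annotate_query({
    "content": "China economy on the rise",
    "domain-name": "http://www.ontotext.com/owlim/ces#default",
    "content-type": "text/plain",
})
print(query)
```

Escaping the values before they are embedded keeps quotes or line breaks in the content from terminating the string literal early.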
Parameter reference
Parameter | Required | Supported values | Default value | Comment
---|---|---|---|---
content | true | XML/JSON | none |
domain | true | URN | none | e.g. http://www.ontotext.com/owlim/ces#default
content-type | false | "application/vnd.ontotext.ces.document+xml", "application/vnd.ontotext.ces.document+json" | "application/vnd.ontotext.ces.document+xml" | This is the type of the encoded document. We create a markup aware GATE document from it.
accept-type | false | "application/vnd.ontotext.ces.document+xml", "application/vnd.ontotext.ces.document+json" | "application/vnd.ontotext.ces.document+xml" | This parameter serves to indicate the preferred result type - XML or JSON.
annotation-sets-to-preserve | false | Comma separated list of the internal annotation set names (strings). I.e. to preserve the annotation set | none | This parameter allows you to specify which annotation sets should be preserved. Trimming is performed, so no whitespaces are allowed at the start and end of a name.
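The required flags and default values in the table can be mirrored in a client-side check before a query is sent. The sketch below is a hypothetical helper, not part of the API: it uses the parameter names exactly as listed in the table, rejects a parameter set that lacks a required entry, fills in the documented defaults, and trims the names in annotation-sets-to-preserve. The function name `normalize_params` and the sample set names `A` and `B` are made up for illustration.

```python
XML_TYPE = "application/vnd.ontotext.ces.document+xml"

# Taken from the parameter reference table.
REQUIRED = ("content", "domain")
DEFAULTS = {"content-type": XML_TYPE, "accept-type": XML_TYPE}


def normalize_params(params):
    """Return a copy of params with defaults applied; raise if a required one is missing."""
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))

    result = {**DEFAULTS, **params}  # explicit values win over defaults

    sets = result.get("annotation-sets-to-preserve")
    if sets is not None:
        # Trim each name client-side and drop empty entries.
        names = [name.strip() for name in sets.split(",")]
        result["annotation-sets-to-preserve"] = ",".join(n for n in names if n)
    return result


checked = normalize_params({
    "content": "<document/>",
    "domain": "http://www.ontotext.com/owlim/ces#default",
    "annotation-sets-to-preserve": " A , B ",
})
print(checked["accept-type"])
```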
Results
Returns a JSON or XML document with annotations. The document is a valid instance of Ontotext's generic schema definition.
Re-training
Not defined yet. It also depends on whether the pipeline contains re-trainable machine learning components.
Administration and configuration
Concept extraction is disabled by default. Start/stop is achieved through SPARQL Update queries.
Start
- First-time initialization DOES NOT include Gazetteer cache loading - see Reload dictionary
- Subsequent starts load the cache from the file system
To start ALL registered concept extraction pipelines use the following query:
To start a specific pipeline, include its named graph in the query. See an example with the default pipeline:
Stop
Stops ALL concept extraction services; nothing special here.
Again, to stop a specific pipeline use the named graph of the pipeline:
Reload dictionary
Cleans the Gazetteer cache from the file system and loads it again from the repository.
Note: If the concept extraction service is not started, this SPARQL update operation will NOT schedule a dictionary reload (unlike before).
The following query initiates a dictionary reload on all running pipelines. To target a particular pipeline, use a named graph as shown in the Start/Stop sections of this page.
Add/remove Gazetteer configuration
Registers template queries for different entity types via INSERT/DELETE DATA.
- <http://www.ontotext.com/owlim/ces#gazetteerConfig> is a special (interpretable) predicate, which denotes a Gazetteer template query entry.
- Each Gazetteer configuration should be added in a separate named graph (per domain), e.g. the default pipeline uses <http://www.ontotext.com/owlim/ces#default>
- The template queries are also executed for all sub-classes of the defined class
- The configuration is stored as regular triples in the repository and is loaded on concept extraction initialization
Example configuration that loads all rdfs:labels of all Agents, Locations and EconomicConcepts into the Gazetteer dictionary.
Note: Adding or removing Gazetteer configuration doesn't take full effect immediately. For example, adding a new template query results in the CES plugin starting to listen for entities of its corresponding type. However, it does not load already existing entities of the same type. To achieve that, trigger a dictionary reload.
FAQ
How to deploy a pipeline?
Just unpack your pipeline package into ${info.aduna.platform.appdata.basedir}/repositories/${repository.name}/storage/ces/pipelines/ and it will be discovered automatically. You can confirm that it was discovered by finding an MBean called PipelineManager and checking its AvailablePipelines property, which lists the URIs of all deployed pipelines. Note that you need to start the pipeline before you can use it.
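The unpack step can be scripted. The following is a minimal sketch, assuming the pipeline package is an archive that Python's `shutil.unpack_archive` recognises by its file extension (for example .zip or .tar.gz) and that the values of info.aduna.platform.appdata.basedir and repository.name are known to the caller. The function names are hypothetical.

```python
import shutil
from pathlib import Path


def pipelines_dir(appdata_basedir, repository_name):
    """Return the pipelines directory, following the path template above."""
    return (Path(appdata_basedir) / "repositories" / repository_name
            / "storage" / "ces" / "pipelines")


def deploy_pipeline(package, appdata_basedir, repository_name):
    """Unpack a pipeline package into the pipelines directory and return that directory.

    Assumption: the package format is one that shutil.unpack_archive supports.
    """
    target = pipelines_dir(appdata_basedir, repository_name)
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(package), str(target))
    return target
```

Unpacking only places the files on disk. Discovery can still be confirmed through the PipelineManager MBean, and the pipeline still has to be started separately.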
How to preserve annotation sets?
Check out the annotation-sets-to-preserve parameter of the annotation query above. No annotation sets are preserved by default.