Concept Extraction Plug-in (CES)

h2. Overview

We already provide a means to annotate documents through SPARQL, so it makes sense to continue the trend and expose more concept-extraction-oriented functionality in the same fashion. This page provides a comprehensive definition of the embedded CES API service, including control mechanisms for its administration and configuration.


h2. Semantic annotation

The document annotation is executed through a specially crafted SELECT SPARQL query. It takes a single triple pattern, which consists of a binding variable, a special predicate, and an [RDF Collection|http://www.w3.org/TR/rdf-sparql-query/#collections] holding the parameters.

The following example will get you started:
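{code:language=html/xml}SELECT * WHERE {
?s <http://www.ontotext.com/owlim/ces#annotate> (
"content=China economy on the rise"
"domain-name=http://www.ontotext.com/owlim/ces#default"
"content type=text/plain"
)
}
{code}

Where: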
* <[http://www.ontotext.com/owlim/ces#annotate]> is a special predicate, which means that GraphDB listens for it and knows how to interpret it;
* "content=China economy on the rise" is the text of the document. Human-readable text is accepted in this form only if it is a single line, which is enough to show the idea;
* "domain-name=[http://www.ontotext.com/owlim/ces#default]" is a domain identifier. It explicitly denotes which extraction algorithm should be used, since different domains usually require different extraction techniques;
* "content type=text/plain" is the MIME type of the document, which is just a plain text in this example.
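
As an illustration of the parameter list described above, here is a minimal client-side sketch in Python that assembles the annotation query as a string. The function name {{build_annotate_query}} and the constants are hypothetical and not part of CES; the parameter keys are copied verbatim from the list above, and the quote escaping is an assumption about what a valid SPARQL string literal needs rather than documented plug-in behaviour.

```python
# Hypothetical helper, not part of CES: builds the annotation query text only.
CES_ANNOTATE = "http://www.ontotext.com/owlim/ces#annotate"
CES_DEFAULT_DOMAIN = "http://www.ontotext.com/owlim/ces#default"


def build_annotate_query(content, domain=CES_DEFAULT_DOMAIN, content_type="text/plain"):
    """Return a SELECT query string asking CES to annotate `content`."""
    # Human-readable content is accepted only as a single line.
    if "\n" in content or "\r" in content:
        raise ValueError("content must be a single line")

    # Parameter keys as they appear in the list above.
    params = [
        "content=" + content,
        "domain-name=" + domain,
        "content type=" + content_type,
    ]

    # Assumption: escape backslashes and double quotes so that each
    # parameter is a well-formed SPARQL string literal.
    literals = " ".join(
        '"' + p.replace("\\", "\\\\").replace('"', '\\"') + '"' for p in params
    )
    return "SELECT * WHERE { ?s <" + CES_ANNOTATE + "> ( " + literals + " ) }"
```

Calling {{build_annotate_query("China economy on the rise")}} yields a query with the same three parameters as the example; the resulting string can then be sent through whatever SPARQL client you already use.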

h3. Results

The result of the semantic annotation is a JSON/XML document with annotations. The document is a valid instance of Ontotext's generic schema definition.


h2. Re-training

Not defined yet. It depends on the pipeline whether or not it contains re-trainable machine learning components.


h3. Start

* The first-time initialisation *DOES NOT* include gazetteer cache loading - see [Reload dictionary|#Reloaddictionary].
* On subsequent starts it loads the cache from the file system.

To start *ALL* registered concept extraction pipelines, use the following query:


h3. Stop

To stop *ALL* concept extraction pipelines, use:



To stop a specific pipeline, use the named graph of the pipeline:
{code:lang=xml}
INSERT DATA {
{code}

h3. Reload dictionary

{note}If the concept extraction service is not started, this SPARQL update operation will *NOT* schedule a dictionary reload (unlike before).{note}

The following query initiates a dictionary reload on all running pipelines. To specify a particular pipeline, use a named graph as shown in the Start/Stop sections of this page.

{code:language=html/xml}
INSERT DATA {
{code}

{info}Adding or removing a gazetteer configuration does not take full effect immediately. For example, after a new template query is added, the CES plugin starts to listen for entities of the corresponding type. However, it does not load already existing entities of the same type. To load them, trigger a dictionary reload.{info}

h2. FAQ
h3. How to deploy a pipeline?

Just unpack your pipeline into _$\{info.aduna.platform.appdata.basedir\}/repositories/$\{repository.name\}/storage/ces/pipelines/_ and it will be discovered automatically. You can confirm that it has been discovered by finding the MBean called _PipelineManager_ and checking its _AvailablePipelines_ property, which lists the URIs of all deployed pipelines. Note that in order to use the pipeline, you need to start it.
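
The unpacking step can be scripted. The following is a minimal Python sketch, assuming the pipeline is distributed as an archive that {{shutil.unpack_archive}} understands (for example, a zip file); the archive format, the function names, and the decision to create the target directory when it is missing are all assumptions, not documented CES behaviour. Only the directory layout is taken from the path above.

```python
import shutil
from pathlib import Path


def pipelines_dir(appdata_basedir, repository_name):
    """Build the pipelines directory path from the layout given above."""
    return (
        Path(appdata_basedir)
        / "repositories"
        / repository_name
        / "storage"
        / "ces"
        / "pipelines"
    )


def deploy_pipeline(archive, appdata_basedir, repository_name):
    """Unpack a pipeline archive (format is an assumption) into the pipelines directory."""
    target = pipelines_dir(appdata_basedir, repository_name)
    # Assumption: create the directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target))
    return target
```

Here {{appdata_basedir}} stands for the value of _info.aduna.platform.appdata.basedir_ and {{repository_name}} for the repository name. After unpacking, the _AvailablePipelines_ check described above still applies, and the pipeline still has to be started before use.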

h3. How to preserve annotation sets?