Concept Extraction Plug-in (CES)

Compared with Version 4 by reneta.popova on Sep 16, 2014 15:20.

h2. Overview

Since documents can already be annotated through SPARQL, it makes sense to expose more concept-extraction functionality in the same fashion. This page defines the API of the embedded CES service: the control mechanisms for its administration and configuration.


The document annotation is executed through a specially crafted SELECT SPARQL query. It takes a single triple pattern, which consists of a binding variable, a special predicate and an [RDF Collection|http://www.w3.org/TR/rdf-sparql-query/#collections] holding the parameters.

An example to get started:
{code:language=html/xml}SELECT * WHERE {
?s <http://www.ontotext.com/owlim/ces#annotate> (
<tns:document id="http://ontotext.com/publishing/document/215351"
xmlns:tns="http://www.ontotext.com/DocumentSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<tns:feature-set>
<tns:feature>
</tns:feature>
</tns:feature-set>

<tns:document-parts>
<tns:document-part id="1" part="TITLE">
}
{code}
Where:
* <[http://www.ontotext.com/owlim/ces#annotate]> is a special predicate, which means that GraphDB listens for it and knows how to interpret it.
* "content=China economy on the rise" is the text of the document. Human-readable text is accepted only if it fits on a single line, which is enough to show the idea.
* "domain-name=[http://www.ontotext.com/owlim/ces#default]" is a domain identifier - it explicitly denotes which extraction algorithm should be used. Different domains usually require different extraction techniques.
* "content type=text/plain" is the MIME type of the document, which is just plain text in this example.
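
The bullets above can be sketched as a small helper that assembles the query text. This is only an illustration, assuming each parameter is passed as a plain string literal of the form "key=value" inside the RDF collection, as the bullets suggest. The names {{build_annotate_query}} and {{escape_literal}} are hypothetical and not part of the CES API.

```python
ANNOTATE_PREDICATE = "http://www.ontotext.com/owlim/ces#annotate"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_annotate_query(params: dict) -> str:
    """Build a SELECT query with a single triple pattern:
    a binding variable, the special predicate and an RDF collection
    holding one "key=value" literal per parameter (assumed form)."""
    literals = " ".join(
        '"{}={}"'.format(key, escape_literal(value))
        for key, value in params.items()
    )
    return "SELECT * WHERE {{\n  ?s <{}> ( {} )\n}}".format(
        ANNOTATE_PREDICATE, literals
    )


# Parameter names and values are taken from the bullets above.
query = build_annotate_query({
    "content": "China economy on the rise",
    "domain-name": "http://www.ontotext.com/owlim/ces#default",
    "content type": "text/plain",
})
print(query)
```

The helper does not escape line breaks, which matches the single-line restriction on human-readable content described above.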

h3. Parameter reference

|| Parameter || Required || Supported values || Default value || Comment ||
| content | true | XML/JSON \\ {warning}The input must already be validated; no validation is performed at the query parsing or processing level.{warning} | none | {note}Base64 (de)serialisation is done with the URL-safe flag turned on. \\
See [http://en.wikipedia.org/wiki/Base64#URL_applications] \\
and [http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html]\\ \\
\\
{note} |
| domain | true | URN | none | e.g. [http://www.ontotext.com/owlim/ces#default] \\ |
| content-type | false | "application/vnd.ontotext.ces.document+xml", "application/vnd.ontotext.ces.document+json" | "application/vnd.ontotext.ces.document+xml" | This is the type of the encoded document. A markup-aware GATE document is created from it. |
| accept-type | false \\ | "application/vnd.ontotext.ces.document+xml", "application/vnd.ontotext.ces.document+json" | "application/vnd.ontotext.ces.document+xml" | This parameter serves to indicate the preferred result type - XML or JSON. \\ |
| annotation-sets-to-preserve | false \\ | Comma-separated list of internal annotation set names (strings). An internal name has the format <annotation set name>\|<ref id>. For example, to preserve the annotation set {code}<tns:annotation-set name="brendan" ref="2">{code} add the parameter {code}"annotation-sets-to-preserve=brendan|2"{code} \\ | none \\ | This parameter specifies which annotation sets should be preserved. \\
Names are trimmed, so white space at the start and end of a name is discarded. \\ |
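
As a rough sketch of how a client might prepare two of the parameter values above: it assumes that the {{content}} value carries the XML/JSON document Base64-encoded with the URL-safe alphabet, which is what the note in the table suggests. The function names are hypothetical. Python's encoder keeps the trailing "=" padding, which may differ from what the server-side codec emits.

```python
import base64


def encode_content(document: str) -> str:
    """Base64-encode a document with the URL-safe alphabet
    ('-' and '_' replace '+' and '/')."""
    return base64.urlsafe_b64encode(document.encode("utf-8")).decode("ascii")


def decode_content(encoded: str) -> str:
    """Reverse of encode_content."""
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")


def preserve_param(annotation_sets) -> str:
    """Build the annotation-sets-to-preserve parameter from (name, ref) pairs.

    Each internal name has the form <annotation set name>|<ref id>;
    names are trimmed, mirroring the trimming described in the table.
    """
    names = ",".join(
        "{}|{}".format(name.strip(), ref) for name, ref in annotation_sets
    )
    return "annotation-sets-to-preserve=" + names


print(preserve_param([("brendan", 2)]))
```

Round-tripping a document through {{encode_content}} and {{decode_content}} is a quick way to check that nothing is lost before the value is placed into the query.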

h3. Results
h2. Re-training

Not defined yet. It also depends on whether the pipeline contains re-trainable machine learning components.


h3. Start

* First-time initialisation *DOES NOT* include Gazetteer cache loading - see [Reload dictionary|#Reloaddictionary]
* Subsequent starts load the cache from the file system