Annotate

Introduction

This section describes how to annotate documents with CES (Concept Extraction Service).

Annotating a document is the process of adding metadata about words or phrases in unstructured text.

A mention is a slice of text with attached metadata features. Mentions always have:

  • type - the type of annotation, usually Person, Organization or Location, but other types can be returned
  • startOffset, endOffset - 0-based offsets in the original text
  • features - a map containing any number of properties, depending on the mention's origin

A mention is usually (but not necessarily) associated with a concept. A concept is a real-world entity that has been recognized; a mention is a reference to that concept in the text.

For example, annotating the text Hello London yields a single mention. That mention has offsets within the original text and is associated with the concept http://dbpedia.org/resource/London
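To make the offsets concrete, here is a minimal Python sketch of what such a mention might look like once parsed. The dictionary is hand-written for illustration and is not actual CES output. It assumes that endOffset is exclusive, so the slice between the two offsets is exactly the mentioned text, and it uses the inst and string features described under Typical mention features.

```python
text = "Hello London"

# Hypothetical mention, written by hand for illustration (not real CES output).
# Field names follow the list above; endOffset is ASSUMED to be exclusive.
mention = {
    "type": "Location",
    "startOffset": 6,
    "endOffset": 12,
    "features": {
        "inst": "http://dbpedia.org/resource/London",
        "string": "London",
    },
}


def mention_text(text, mention):
    """Return the slice of the original text covered by the mention."""
    return text[mention["startOffset"]:mention["endOffset"]]


covered = mention_text(text, mention)
print(covered)
```

If the service turns out to use inclusive end offsets, the slice would need endOffset + 1 instead.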

Notation

All URLs in this document are of the form http://worker-base/endpoint, where http://worker-base is the host:port/context of a deployed CES worker and endpoint is the specific worker call. For example, if your worker is deployed at http://192.168.0.1/extractor-web and this guide mentions http://worker-base/extract, then the URL to query is http://192.168.0.1/extractor-web/extract
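The substitution can be sketched as a small Python helper. The function name worker_url is hypothetical and not part of CES; it simply joins the worker base and the endpoint with a single slash.

```python
def worker_url(worker_base, endpoint):
    """Join a deployed worker base URL and an endpoint name with one slash."""
    return worker_base.rstrip("/") + "/" + endpoint.lstrip("/")


url = worker_url("http://192.168.0.1/extractor-web", "extract")
print(url)
```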

Annotation request

Annotation requests go to http://worker-base/extract. There are two ways to invoke annotation:

It's also advisable to specify an Accept header with the desired output MIME type. The default is usually application/vnd.ontotext.ces+json; see Supported output formats for more.
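As a rough sketch, a plain-text annotation request with an explicit Accept header could be built with Python's standard urllib as follows. The worker address is the example deployment from the Notation section, the helper name build_extract_request is hypothetical, and the charset parameter on Content-Type is an assumption rather than something this guide requires. The request is only constructed here, not sent.

```python
import urllib.request


def build_extract_request(worker_base, text,
                          accept="application/vnd.ontotext.ces+json"):
    """Build (but do not send) a POST request to the /extract endpoint."""
    return urllib.request.Request(
        url=worker_base.rstrip("/") + "/extract",
        data=text.encode("utf-8"),
        method="POST",
        headers={
            # The charset parameter is an assumption, not a documented requirement.
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,
        },
    )


request = build_extract_request("http://192.168.0.1/extractor-web", "Hello London")
print(request.get_method(), request.full_url)
```

To actually send it against a running worker, pass the request to urllib.request.urlopen and read the response body.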

Supported input formats

  • the standard web text formats such as text/xml, text/html, text/plain
  • Ontotext's generic document schema in either JSON (application/vnd.ontotext.ces.document+json) or XML (application/vnd.ontotext.ces.document+xml)
  • formats supported by Apache Tika should also work fine most of the time

Supported output formats

If the Accept header is not specified, the simple mentions JSON format is returned (application/vnd.ontotext.ces+json).
  • Ontotext's generic document schema in either JSON (application/vnd.ontotext.ces.document+json) or XML (application/vnd.ontotext.ces.document+xml)
  • the "simple mentions" JSON format (application/vnd.ontotext.ces or application/vnd.ontotext.ces+json), described in more detail below

Typical mention features

Mention features can vary widely depending on the subsystem that generated the mention. Most mentions, however, will have:

  • inst - a URI for this mention's concept. It may point to an external concept database (Freebase, DBpedia, etc.) or be generated by machine learning subsystems
  • class - the URI of a class within the concept database, generally related to the type of the mention
  • string - the slice of text associated with this mention, it is the text between startOffset and endOffset
  • id - numeric id of the mention, unique within the document

Other returned features may include confidence (how sure the annotator is about this mention), ambiguityRank, etc.
Further features are database and type dependent. For example, locations such as London can have featClass, featCode, countryCode, etc., giving more information about the concept.
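
Because the feature set varies, client code should treat most features as optional. The Python sketch below collects concept URIs from a list of already-parsed mentions. It makes several assumptions: mentions are plain dictionaries shaped as described above, confidence is a number where higher means more certain, and a mention without a confidence feature is kept. The sample data, including the http://example.org URI, is made up for illustration.

```python
def concept_uris(mentions, mention_type=None, min_confidence=0.0):
    """Collect inst URIs, optionally filtered by mention type and confidence."""
    uris = []
    for mention in mentions:
        features = mention.get("features", {})
        if mention_type is not None and mention.get("type") != mention_type:
            continue
        if "inst" not in features:
            continue  # mention is not linked to a concept
        # ASSUMPTION: confidence is numeric; a missing value passes the filter.
        if float(features.get("confidence", 1.0)) < min_confidence:
            continue
        uris.append(features["inst"])
    return uris


# Made-up sample mentions, not real CES output.
sample = [
    {"type": "Location",
     "features": {"inst": "http://dbpedia.org/resource/London", "confidence": 0.9}},
    {"type": "Person",
     "features": {"inst": "http://example.org/person/1", "confidence": 0.4}},
    {"type": "Location", "features": {}},
]

print(concept_uris(sample, mention_type="Location"))
```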

Examples

Posting plain text

Request:

Response:

Posting and receiving generic document

Request:

Response:

Simple mentions format

JSON