An important step in the information extraction (IE) process is identifying entity names in the text based on lists. This is the role of the gazetteer. Gazetteer lists are plain text files with one entry per line. Each list represents a set of names, such as names of cities, organizations, or days of the week.
KIM uses a more complex gazetteer, which extracts its lists from the semantic repository. Its default behavior is to get the labels of all entities that meet three requirements:
* *have at least one alias (label)*
KIM supports two models for associating entities with their labels: aliases and labels. Aliases are separate objects, which allows metadata to be attached to a particular label. Labels are plain datatype properties, which makes them simpler and more compact. The model is set via the {{com.ontotext.kim.KIMConstants.ENTITY_DESCR}} property in *<KIM_HOME>/config/install.properties*.

* *are of a type that is a subclass of {{protons:Entity}}*

Example:

{code}
wkb:Person_Aristotle a protont:Person .
protont:Person rdfs:subClassOf protons:Entity .
{code}

* *are marked as Trusted*
An entity is trusted when the semantic repository statements reflect that this entity was generated by a trusted source.

Example:

{code}
wkb:Gazetteer a protons:Trusted .
wkb:Person_Aristotle protons:generatedBy wkb:Gazetteer .
{code}

The default KB defines some trusted sources, but new ones can always be defined.
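
The three default requirements can be read as a filter over the repository's statements. The Python sketch below only illustrates that logic over a small hand-written set of triples; it is not how KIM implements the check. The function names, the in-memory representation, the use of {{rdfs:label}} for the simpler label model, and the {{wkb:Person_Plato}} entity are all assumptions made for this example.

```python
# Illustrative only: a toy triple set mirroring the examples above.
TRIPLES = {
    ("wkb:Person_Aristotle", "rdf:type", "protont:Person"),
    ("protont:Person", "rdfs:subClassOf", "protons:Entity"),
    ("wkb:Gazetteer", "rdf:type", "protons:Trusted"),
    ("wkb:Person_Aristotle", "protons:generatedBy", "wkb:Gazetteer"),
    ("wkb:Person_Aristotle", "rdfs:label", "Aristotle"),
    # Hypothetical entity that has a label but no trusted source.
    ("wkb:Person_Plato", "rdf:type", "protont:Person"),
    ("wkb:Person_Plato", "rdfs:label", "Plato"),
}


def objects(subject, predicate):
    """Return all objects of statements with the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def is_subclass_of(cls, ancestor):
    """Follow rdfs:subClassOf links transitively, guarding against cycles."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(objects(current, "rdfs:subClassOf"))
    return False


def is_gazetteer_candidate(entity):
    """Check the three default requirements for one entity."""
    has_label = bool(objects(entity, "rdfs:label"))
    is_entity = any(
        is_subclass_of(cls, "protons:Entity")
        for cls in objects(entity, "rdf:type")
    )
    is_trusted = any(
        "protons:Trusted" in objects(source, "rdf:type")
        for source in objects(entity, "protons:generatedBy")
    )
    return has_label and is_entity and is_trusted
```

With this toy data, {{is_gazetteer_candidate("wkb:Person_Aristotle")}} holds, while the hypothetical {{wkb:Person_Plato}} is rejected because nothing states that it was generated by a trusted source.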

There are many approaches for adding new entities to KIM IE. Some of the most common are:

* *using the existing PROTON classes*

This is the quickest and easiest way. KIM already knows about most of PROTON's classes and has the grammars to create meaningful annotations over them. So, for example, if we want to recognize "Aristotle" as a person in the analyzed documents, a new person instance has to be defined like this:

{code}
customkb:Person_Aristotle a protont:Person ;
protons:hasMainAlias customkb:Person_Aristotle.1 .
customkb:Person_Aristotle.1 a protons:Alias;
rdfs:label "Aristotle" .
{code}

(/) The format of the URI is not strict. The only requirement is that it is unique.
(/) An entity can have multiple aliases and one main alias (or, in the label model, multiple labels and one main label).
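
When many instances have to be added this way, the statements can be generated rather than typed by hand. The following Python sketch is one possible helper, not part of KIM: the function name {{person_turtle}}, the {{customkb}} prefix default, and the {{.1}} alias suffix are assumptions that simply follow the example above. It handles only simple names made of letters, digits and spaces, and it leaves out the namespace prefix declarations.

```python
def person_turtle(name, prefix="customkb"):
    """Build the statements for one person with a single main alias.

    Hypothetical helper: the URI scheme copies the Aristotle example,
    and only simple names (letters, digits, spaces) are supported.
    """
    local = "Person_" + "_".join(name.split())
    entity = f"{prefix}:{local}"
    alias = f"{entity}.1"
    # Escape backslashes and double quotes inside the literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"{entity} a protont:Person ;\n"
        f"    protons:hasMainAlias {alias} .\n"
        f"{alias} a protons:Alias ;\n"
        f'    rdfs:label "{escaped}" .\n'
    )
```

Calling {{person_turtle("Aristotle")}} yields the same four statements as the example above, with slightly different whitespace.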

The gazetteer creates Lookup annotations, which serve as input for other resources and rules, so it pays to make them meaningful. KIM includes rules that match a Lookup annotation whose {{class}} feature is [http://proton.semanticweb.org/2006/05/protont#Person] and create a Person annotation over it. As a result, a new Person annotation will be created over "Aristotle".

The drawback of this approach is that the new ontology stays very close to PROTON. It works well for extending the instance base of the existing classes, but it is not the best choice for extending KIM with a completely new ontology.

* *default gazetteer with custom Jape rules*

If we have a complete ontology we want to adapt, for example:

{code}
customkb:Person_Aristotle a customkb:Person .
{code}

then the rest of the requirements for an entity to be included in the gazetteer lists are:

* *define aliases*

{code}
customkb:Person_Aristotle a customkb:Person ;
protons:hasMainAlias customkb:Person_Aristotle.1 .
customkb:Person_Aristotle.1 a protons:Alias;
rdfs:label "Aristotle" .
{code}

* *make the class a direct or indirect subclass of protons:Entity*

{code}
customkb:Person rdfs:subClassOf protont:Person .
{code}

* *mark the entity as Trusted*

{code}
customkb:Person_Aristotle protons:generatedBy wkb:Gazetteer .
{code}

These steps make the gazetteer create Lookup annotations whenever it encounters "Aristotle" in the text. To make these lookups useful, we can write a Jape grammar that creates Person annotations from them. The rules will be similar to the example below:
{code}
Rule: customkb_person
(
{Lookup.class == "http://customkb#Person"}
)
:person
-->
:person.Person = {rule = "customkb_person",
class = :person.Lookup.class, inst = :person.Lookup.inst, originalName = :person.Lookup.originalName }
{code}

We can write a similar rule for every class of the custom ontology.
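
Because the rules differ only in the class URI and the annotation name, they can also be generated from a template. The Python sketch below is one way to do that under a few assumptions: the helper names are made up for this example, the annotation type is taken from the part of the class URI after {{#}}, and {{http://customkb#Organization}} is a hypothetical second class. As in the example above, the grammar header (phase name, input annotations, options) is left out.

```python
# Template for one rule; doubled braces are literal braces in the output.
RULE_TEMPLATE = """Rule: {rule_name}
(
{{Lookup.class == "{class_uri}"}}
)
:match
-->
:match.{annotation} = {{rule = "{rule_name}",
class = :match.Lookup.class, inst = :match.Lookup.inst, originalName = :match.Lookup.originalName }}
"""


def jape_rule(class_uri):
    """Render one rule for a class URI of the form <namespace>#<LocalName>."""
    local_name = class_uri.rsplit("#", 1)[-1]
    rule_name = "customkb_" + local_name.lower()
    return RULE_TEMPLATE.format(
        rule_name=rule_name, class_uri=class_uri, annotation=local_name
    )


def jape_rules(class_uris):
    """Render one rule per class, separated by blank lines."""
    return "\n".join(jape_rule(uri) for uri in class_uris)
```

For instance, {{jape_rules(["http://customkb#Person", "http://customkb#Organization"])}} returns two rules, the first of which matches the hand-written {{customkb_person}} rule apart from the binding label.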

* *define a new gazetteer*

If we have a complete ontology, another way to add new entities is to create a new gazetteer dedicated to recognizing instances from this ontology. Its query will look like this:

{code}
select LA, I, DC from
{I} rdf:type {customkb:EntitiesRoot},
{I} serql:directType {DC},
{I} customkb:name {LA}
{code}

In this case, the new entities do not need to meet the above requirements (to have aliases, to be of type protons:Entity, to be Trusted). All the results from the query will be used to fill the gazetteer's dictionary.
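
To make the role of the three selected variables concrete, the following Python sketch shows how rows of the form (LA, I, DC) could fill a dictionary and then yield Lookup-like records with the {{class}}, {{inst}} and {{originalName}} features used by the Jape rule. It is a simplified illustration, not KIM's gazetteer: the sample rows, the function names, and the naive substring matching (a real gazetteer works on tokens) are all assumptions.

```python
from collections import defaultdict

# Hypothetical query results, in the order of the select clause: (LA, I, DC).
ROWS = [
    ("Aristotle", "http://customkb#Person_Aristotle", "http://customkb#Person"),
    ("Athens", "http://customkb#City_Athens", "http://customkb#City"),
]


def build_dictionary(rows):
    """Map each label to the (instance, class) pairs that carry it."""
    dictionary = defaultdict(list)
    for label, instance, direct_class in rows:
        dictionary[label].append((instance, direct_class))
    return dictionary


def find_lookups(text, dictionary):
    """Return one Lookup-like record per label occurrence, in text order."""
    lookups = []
    for label, entries in dictionary.items():
        start = text.find(label)
        while start != -1:
            for instance, direct_class in entries:
                lookups.append({
                    "start": start,
                    "end": start + len(label),
                    "originalName": label,
                    "inst": instance,
                    "class": direct_class,
                })
            start = text.find(label, start + 1)
    return sorted(lookups, key=lambda lookup: lookup["start"])
```

Running {{find_lookups("Aristotle taught in Athens.", build_dictionary(ROWS))}} produces two records, one per label, each carrying the instance and class that the query returned for it.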

After the new gazetteer has created lookups, we can use Jape rules like the one above.