An important step in the IE process is identifying entity names in the text based on lists. This is the role of the gazetteer. The gazetteer lists are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organizations, days of the week, etc.
KIM uses a more complex gazetteer, which extracts its lists from the semantic repository. Its default behavior is to get the labels of all entities that meet three requirements:
- have at least one alias (label)
- are of a type that is a subclass of protons:Entity
- are marked as Trusted
KIM uses two models to associate entities with their labels: aliases and labels. Aliases are separate objects, which makes it possible to attach metadata to a specific label. Labels are plain datatype properties, simpler and more compact. The model is set via the com.ontotext.kim.KIMConstants.ENTITY_DESCR property in <KIM_HOME>/config/install.properties.
An entity is trusted when the statements in the semantic repository indicate that it was generated by a trusted source.
The default KB defines some trusted sources, and new ones can be defined at any time.
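The three requirements can be read as a simple filter over the entities in the repository. The following is a minimal Python sketch of that logic, not KIM's actual implementation: the `Entity` record, the `SUPERCLASS` table, and the example URIs are all hypothetical, and the class hierarchy is deliberately simplified.

```python
from dataclasses import dataclass, field

ENTITY = "protons:Entity"

# Hypothetical, simplified class hierarchy: child class -> parent class.
SUPERCLASS = {
    "protont:Person": "protons:Entity",
    "ex:Philosopher": "protont:Person",
    "ex:Unrelated": None,
}


@dataclass
class Entity:
    uri: str
    cls: str
    labels: list = field(default_factory=list)
    trusted: bool = False


def is_subclass_of(cls, ancestor):
    """Walk up the hierarchy; a class counts as a subclass of itself."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUPERCLASS.get(cls)
    return False


def eligible(entity):
    """Apply the three default requirements described above."""
    return (
        bool(entity.labels)
        and is_subclass_of(entity.cls, ENTITY)
        and entity.trusted
    )


def gazetteer_labels(entities):
    """Map every label of every eligible entity to the entity URI."""
    return {label: e.uri for e in entities if eligible(e) for label in e.labels}


entities = [
    Entity("ex:Aristotle", "ex:Philosopher", ["Aristotle"], trusted=True),
    Entity("ex:NoLabel", "protont:Person", [], trusted=True),
    Entity("ex:Untrusted", "protont:Person", ["Plato"], trusted=False),
    Entity("ex:Other", "ex:Unrelated", ["Thing"], trusted=True),
]
labels = gazetteer_labels(entities)
```

Only the first entity satisfies all three conditions, so only its label ends up in the resulting dictionary.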
There are many approaches for adding new entities to KIM IE. Some of the most common are:
- using the existing PROTON classes
This is the quickest and easiest way. KIM already knows about most of PROTON's classes and has the grammars to create meaningful annotations over them. So, for example, if we want to recognize "Aristotle" as a person in the analyzed documents, a new person instance has to be defined like this:
The format of the URI is not strict. The only requirement is that it is unique.
An entity can have multiple aliases and one main alias (or, in the labels model, multiple labels and one main label).
The gazetteer creates Lookup annotations, which serve as input for other resources and rules, so it is worth making them meaningful. KIM includes rules that match a Lookup annotation with the feature class=http://proton.semanticweb.org/2006/05/protont#Person and create a Person annotation over it. As a result, a new Person annotation is created over "Aristotle".
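To make the Lookup-to-Person step concrete, here is a small Python sketch of what such a rule does. It is an illustration only: KIM expresses these rules as grammars, and the `Annotation` type and the `lookups_to_persons` function below are hypothetical names, not part of KIM or GATE.

```python
from dataclasses import dataclass

PERSON_CLASS = "http://proton.semanticweb.org/2006/05/protont#Person"


@dataclass
class Annotation:
    type: str       # annotation type, e.g. "Lookup" or "Person"
    start: int      # start offset in the document
    end: int        # end offset in the document
    features: dict  # feature map, e.g. {"class": "..."}


def lookups_to_persons(annotations):
    """Create a Person annotation over every Lookup whose class is Person."""
    persons = []
    for ann in annotations:
        if ann.type == "Lookup" and ann.features.get("class") == PERSON_CLASS:
            persons.append(
                Annotation("Person", ann.start, ann.end, dict(ann.features))
            )
    return persons


lookups = [
    Annotation("Lookup", 0, 9, {"class": PERSON_CLASS}),
    Annotation("Lookup", 20, 26, {"class": "http://example.org/onto#City"}),
]
persons = lookups_to_persons(lookups)
```

The second Lookup is ignored because its class feature does not match, which is why the class feature on the Lookup has to be meaningful.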
The drawback of this approach is that the new ontology is very close to PROTON. It works for extending the instance base of the already existing classes, but it is not the best choice for extending KIM with a completely new ontology.
- default gazetteer with custom Jape rules
If we have a complete ontology we want to adapt, for example:
then the rest of the requirements for an entity to be included in the gazetteer lists are:
- define aliases
- make the class a direct or indirect subclass of protons:Entity
- mark the entity as Trusted
These steps will make the gazetteer create Lookup annotations whenever it encounters "Aristotle" in the text. To make these lookups useful, we can write a Jape grammar to create Person annotations. The rules will be similar to the example below:
We can write a similar rule for every class of the custom ontology.
- define a new gazetteer
If we have a complete ontology, another way to add new entities is to create a new gazetteer, dedicated to recognizing instances from this ontology. Its query will look like this:
In this case, the new entities do not need to meet the above requirements (to have aliases, to be of type protons:Entity, to be Trusted). All the results from the query will be used to fill the gazetteer's dictionary.
After the new gazetteer has created lookups, we can use Jape rules like the one above.
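As a rough Python sketch of how query results could fill a gazetteer dictionary with no further filtering, consider the following. The row layout (label, instance URI, class URI), the `inst` feature name, and the example URIs are assumptions made for illustration. They are not taken from KIM's actual query or data model.

```python
def build_dictionary(rows):
    """Put every query result row into the dictionary, with no extra checks."""
    dictionary = {}
    for label, uri, cls in rows:
        dictionary.setdefault(label, []).append({"inst": uri, "class": cls})
    return dictionary


def create_lookups(text, dictionary):
    """Return (start, end, features) for each occurrence of a known label."""
    lookups = []
    for label, entries in dictionary.items():
        start = text.find(label)
        while start != -1:
            for features in entries:
                lookups.append((start, start + len(label), features))
            start = text.find(label, start + 1)
    return sorted(lookups, key=lambda item: (item[0], item[1]))


# Hypothetical query results from a custom ontology.
rows = [
    (
        "Aristotle",
        "http://example.org/philosophy#Aristotle",
        "http://example.org/philosophy#Philosopher",
    ),
]
dictionary = build_dictionary(rows)
lookups = create_lookups("Aristotle taught Alexander.", dictionary)
```

Each resulting lookup carries the class from the query row, so a downstream rule can turn it into an annotation of the matching type.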