
Rule-Based Inference
There are two principle strategies for rule-based inference called 'forward-chaining' and 'backward-chaining'. They can be briefly explained as follows:
- Forward-chaining: involves applying the inference rules to the known facts (explicit statements) to generate new facts. The rules can then be re-applied to the combination of original facts and inferred facts to produce yet more new facts. The process is iterative and continues until no new facts can be generated. This kind of reasoning can have diverse objectives, for instance: to compute the inferred closure, to answer a particular query, to infer a particular sort of knowledge (e.g. the class taxonomy), etc. The advantage of this approach is that when all inferences have been computed query answering can proceed extremely quickly. The disadvantages come from greater initialisation costs (inference computed at load time) and space/memory usage (especially when the number of inferred facts is very large);
- Backward-chaining: involves starting with a fact to be proved or a query to be answered. Typically, the reasoner examines the knowledge base to see if the fact to be proved is present and if not it examines the rule set to see which rules could be used to prove it. For the latter case, a check is made to see what other 'supporting' facts would need to be present to 'fire' these rules. The reasoner searches for proofs of each of these 'supporting' facts in the same way and iteratively maps out a search tree. The process terminates when either all of the leaves of the tree have proofs or no new candidate solutions can be found. Query processing is similar, but only stops when all search paths have been explored. The purpose in query answering is to find not just one, but all possible substitutions in the query expression. The advantages of this approach are that there are no inferencing costs at start-up and minimal space requirements. The disadvantage is that inference must be done each and every time a query is answered and for complex search graphs this can be computationally expensive and slow.
Both of these strategies have their advantages and disadvantages, which have been well studied in the history of KR and expert systems. Attempts to overcome the weak points have led to the development of various hybrid strategies (involving partial forward- and backward-chaining) which have proven efficient in many contexts
Reasoning and materialization are discussed in some detail in the OWLIM Primer .
OWLIM-SE's Logical Formalism
RDFS inference is achieved via a set of axiomatic triples and entailment rules. These rules allow the full set of valid inferences using RDFS semantics to be determined. Herman ter Horst in [25] defines RDFS extensions for more general rule support and a fragment of OWL, which is more expressive than DLP and fully compatible with RDFS. First, he defines R-entailment, which extends RDFS-entailment in the following ways:
- it can operate on the basis of any set of rules R (i.e. allows for extension or replacement of the standard set, defining the semantics of RDFS);
- it operates over so-called generalized RDF graphs, where blank nodes can appear as predicates (a possibility disallowed in RDF);
- rules without premises are used to declare axiomatic statements;
- rules without consequences are used to detect inconsistencies (integrity constraints).
OWLIM uses a notation almost identical to R-Entailment defined by Horst. OWLIM-SE performs reasoning based on forward-chaining of entailment rules defined using RDF triple patterns with variables. OWLIM-SE's reasoning strategy is one of 'materialisation', which is introduced in the OWLIM Primer in the Reasoning Strategies topic
Rule Format and Semantics
The rule format and the semantics enforced is analogous to R-entailment (see the Rule-Based Inference topic on page and [25]) with the following differences:
- Free variables in the head (without binding in the body) are treated as blank nodes. This feature must be used with extreme care, because custom rule-sets can easily be created that recursively infer an infinite number of statements making the semantics intractable;
- Variable inequality constraints can be specified in addition to the triple patterns (they can be placed after any premise or consequence). This leads to lower complexity compared to R-entailment;
- the [cut] operator can be associated with rule premises. This is an optimisation that tells the rule compiler not to generate a variant of the rule with the identified rule premise as the first triple pattern;
- Context can be used for both rule premises and rule consequences allowing more expressive constructions that utilise 'intermediate' statements contained within the given context URI;
- Consistency checking rules do not have consequences and will indicate an inconsistency when the premises are satisfied;
- Axiomatic triples can be provided as a set of statements, although these are not modelled as rules with empty bodies.
OWLIM-SE can be configured via "rule-sets" – sets of axiomatic triples, consistency checks and entailment rules - that determine the applied semantics. The implementation of OWLIM-SE relies on a compile stage, during which the rules are compiled into Java source code that is then further compiled using the Java compiler and merged together with the inference engine.
The Rule Language
A rule-set file has three sections named Prefices, Axioms, and Rules. All sections are mandatory and must appear sequentially in this order. Comments are allowed anywhere and follow the Java convention, i.e. "/* ... */" for block comments and "//" for end of line comments.
Prefices
This section defines abbreviations for the namespaces used in the rest of the file. The syntax is:
shortname : URI
A typical prefices section might look like this:
Prefices { rdf : http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs : http://www.w3.org/2000/01/rdf-schema# owl : http://www.w3.org/2002/07/owl# xsd : http://www.w3.org/2001/XMLSchema# }
Axioms
This section is used to assert 'axiomatic triples', which are usually used to describe the meta-level primitives used to define the schema, such as rdf:type, rdfs:Class, etc. This section contains a list of the (variable free) triples, one per line. For example, the RDF axiomatic triples are defined thus:
Axioms { // RDF axiomatic triples <rdf:type> <rdf:type> <rdf:Property> <rdf:subject> <rdf:type> <rdf:Property> <rdf:predicate> <rdf:type> <rdf:Property> <rdf:object> <rdf:type> <rdf:Property> <rdf:first> <rdf:type> <rdf:Property> <rdf:rest> <rdf:type> <rdf:Property> <rdf:value> <rdf:type> <rdf:Property> <rdf:nil> <rdf:type> <rdf:List> }
![]() | Axiomatic statements are considered to be inferred for the purpose of query-answering, because they are a result a semantic interpretation defined by the chosen rule-set. |
Rules
This section is used to define entailment rules and consistency checks, which share a similar format. Each definition consists of premises and corollaries that are RDF statements defined with subject, predicate, object and optional context components. The subject, predicate and object can each be a variable, blank node, literal, full URI or the short name for a URI. If given, the context must be a full URI or a short name for a URI. Variables are alpha-numeric and must begin with a letter.
If the context is provided, then the statements produced as rule consequences are not 'visible' during normal query answering. Instead, they can only be used as input to this or other rules and then only when the rule premise explicitly uses the given context. See the example below.
Furthermore, inequality constraints can be used to state that the values of the variables in a statement must not be equal to some specific full URI (or its short name) or to the value of another variable within the same rule. The behaviour of an inequality constraint depends whether it is placed in the body or head of a rule. If it is placed in the body of a rule, then the whole rule will not 'fire' is the constraint fails, i.e. the constraint can be next to any statement pattern in the body of a rule with the same behaviour (the constraint does not have to be placed next to the variables it references). If the constraint is in the head, then its location is significant, because a constraint that does not hold will prevent only the statement it is adjacent to from being inferred.
Entailment rules
The syntax of a rule definition is as follows:
Id: <rule_name> <premises> <optional_constraints> ------------------------------- <consequences> <optional_constraints>
Where each premise and consequence is on a separate line. The following example helps to illustrate the possibilities:
Rules { Id: rdf1_rdfs4a_4b x a y ------------------------------- x <rdf:type> <rdfs:Resource> a <rdf:type> <rdfs:Resource> y <rdf:type> <rdfs:Resource> Id: rdfs2 x a y [Constraint a != <rdf:type>] a <rdfs:domain> z [Constraint z != <rdfs:Resource>] ------------------------------- x <rdf:type> z Id: owl_FunctProp p <rdf:type> <owl:FunctionalProperty> x p y [Constraint y != z, p != <rdf:type>] x p z [Constraint z != y] [Cut] ------------------------------- y <owl:sameAs> z }
The symbols p, x, y, z and a are variables. The second rule contains two constraints that reduce the number of bindings for each premise, i.e. they 'filter out' those statements where the constraint does not hold.
In a forward chaining inference step, a rule is interpreted as meaning that for all possible ways of satisfying the premises, the bindings for the variables are used to populate the consequences of the rule. This generates new statements that will manifest themselves in the repository, for example, by being returned as query results.
The last rule contains an example of using the [Cut] operator, which is an optimisation hint for the rule compiler. When rules are compiled, a different variant of the rule is created for each premise, so that each premise occurs as the first triple pattern in one of the variants. This is done so that incoming statements can be efficiently matched to appropriate inferences rules. However, when a rule contains two or more premises that match identical triples patterns, but using different variable names, then the extra variant(s) are redundant and better efficiency can be achieved by simply not creating the extra rule variant(s). In the above example, rule owl_FunctProp would by default be compiled in to three variants:
p <rdf:type> <owl:FunctionalProperty> x p y x p z ------------------------------- y <owl:sameAs> z x p y p <rdf:type> <owl:FunctionalProperty> x p z ------------------------------- y <owl:sameAs> z x p z p <rdf:type> <owl:FunctionalProperty> x p y ------------------------------- y <owl:sameAs> z
As can be seen, the last two variants are identical apart from the rotation of variables y and z, so one of these variants is not needed. The use of the [Cut] operator above tells the rule compiler to eliminate this last variant, i.e. the one beginning with the premise x p z.
The use of context in rule bodies and rule heads is also best explained by example. The following three rules implement the OWL2-RL property chain rule (prp-spo2) [31] and are inspired by the Rule Interchange Format (RIF) implementation [27]:
Id: prp-spo2_1 p <owl:propertyChainAxiom> pc start pc last [Context <onto:_checkChain>] ---------------------------- start p last Id: prp-spo2_2 pc <rdf:first> p pc <rdf:rest> t [Constraint t != <rdf:nil>] start p next next t last [Context <onto:_checkChain>] ---------------------------- start pc last [Context <onto:_checkChain>] Id: prp-spo2_3 pc <rdf:first> p pc <rdf:rest> <rdf:nil> start p last ---------------------------- start pc last [Context <onto:_checkChain>]
The RIF rules that implement prp-spo2 use a relation (unrelated to the input or generated triples) called _checkChain. The OWLIM implementation maps this relation to the 'invisible' context of the same name with the addition of [Context <onto:_checkChain>] to certain statement patterns. Generated statements with this context can only used for bindings to rule premises when the exact same context is specified in the rule premise. The generated statements with this context will not be used for any other rules.
Consistency checks
Consistency checks are used to ensure that the data model is in a consistent state and are applied whenever an update transaction is committed. The syntax is similar to that of rules, except that Consistency replaces the Id tag that introduces normal rules. Also consistency checks do not have any consequences and will indicate an inconsistency whenever their premises can be satisfied, e.g.
Consistency: something_can_not_be_nothing x rdf:type owl:Nothing ------------------------------- Consistency: both_sameAs_and_differentFrom_is_forbidden x owl:sameAs y x owl:differentFrom y -------------------------------
When a transaction is committed and consistency checking is switched on (it is off by default - see the configuration section) then if any consistency check(s) fail then:
- A message is logged with the details of what consistency checks failed
- An exception is thrown with the same details
- The whole transaction is rolled back
Materialization
An OWLIM repository will use the configured rule-set to compute all inferred statements at load time. To some extent, this process increases processing cost and time taken to load a repository with a large amount of data. However, it has the desirable advantage that subsequent query evaluation can proceed extremely quickly.
Apart from a number of optimisations, the approach taken by OWLIM is one of 'total materialization', where the inference rules are applied repeatedly to the asserted (explicit) statements until no further inferred (implicit) statements are produced.
Retraction of assertions
OWLIM stores explicit and implicit statements, i.e. those statements inferred (materialized) from the explicit statements. It follows therefore, that when explicit statements are removed from the repository, any implicit statements that rely on the removed statement must also be removed.
In previous versions of OWLIM, this was achieved with a re-computation of the full closure (minimal model), i.e. applying the entailment rules to all explicit statements and computing the inferences. This approach guarantees correctness, but does not scale - the computation is increasingly slow and computationally expensive in proportion to the number of explicit statements and the complexity of the entailment rule-set.
Removal of explicit statements is now achieved in a more efficient manner, by invalidating only those inferred statements that can no longer be derived in any way.
One approach is to maintain track information for every statement, typically the list of statements that can be inferred from this statement. The list is built up during inference as the rules are applied and the statements inferred by the rules are added to the lists of all statements that triggered the inferences. The drawback of this technique is that track information inflates more rapidly than the inferred closure - in the case of large datasets up to 90% of the storage is required just to store the track information.
Another approach is to perform backward chaining. Backward chaining does not require track information, since it essentially re-computes the tracks as required. Instead, a flag for each statement is used in order that the algorithm can detect when a statement has been previously visited and so avoid an infinite recursion. The algorithm used in OWLIM-SE works as following:
- Apply a 'visited' flag to all statements which is 'false' by default;
- Store the statements to be deleted in the list L;
- For each statement in L that is not visited yet mark it as visited and apply the forward chaining rules. Statements marked as visited become invisible, which is why the statement must be first marked and then used for forward chaining;
- If there are no more unvisited statements in L then END;
- Store all inferred statements in the list L1;
- For each element in L1 check the following:
- If the statement is purely implicit statement (a statement can be both explicit and implicit and if so then it is not considered purely implicit) then mark it as deleted (prevent it from being returned by the iterators) and check whether it is supported by other statements. The isSupported() method uses queries which contain the premises of the rules and the variables of the rules are preliminarily bound using the statement in question. That is to say the isSupported() method starts from the projection of the query and then checks whether the query will return results (at least one), i.e. this method performs backward chaining.
- If a result is returned by any query (every rule is represented by a query) in isSupported() then this statement can be still derived from other statements in the repository, so it must not be deleted (its status is returned to 'inferred').
- If all queries return no results then this statement can no longer be derived from any other statements, so its status remains 'deleted' and the number of statements counter is updated;
- L := L1 and GOTO 3.
Special care is taken is taken when retracting owl:sameAs statements, so that the algorithm still works correctly when modifying equivalence classes.
![]() | One consequence of this algorithm, is that deletion can still have poor performance when deleting schema statements, due to the (probably) large number of implicit statements inferred from them. |
![]() | The forward chaining part of the algorithm terminates as soon as it detects that a statement is read only, because if it cannot be deleted there is no need to look for statements derived from it. For this reason, performance can be greatly improved when all schema statements are made read-only by importing ontologies (and OWL/RDFS vocabularies) using the imports repository parameter. |
Schema update transactions
In situations when fast statement retraction is required, but it is also necessary to update schemas, a special statement pattern can be used. By including a statement with the following form in the update:
?subject <http://www.ontotext.com/owlim/system#schemaTransaction> ?object
where ?subject and ?object can be anything, OWLIM will use the smooth-delete algorithm, but will also traverse read-only statements and allow them to be deleted/inserted. Such transactions are likely to be be much more computationally expensive to achieve, but are intended for the occasional, offline update to otherwise read-only schemas. The advantage is that fast-delete can still be used, but a repository export and import is not required when making a modification to a schema.
For any transaction that includes the above special predicate:
- Read-only (explicit or inferred) statements can be deleted
- New explicit statements are marked as read-only
- New inferred statements are marked:
- Read-only if all the premises that fired the rule are read-only
- Normal otherwise
Predefined Rule-Sets
There are a number of pre-defined rule-sets provided with OWLIM-SE that cover various well known knowledge representation formalisms. The following table gives the details:
Rule set | Description |
---|---|
empty | No reasoning at all, i.e. OWLIM operates as a plain RDF store |
rdfs | Supports the standard model-theoretic RDFS semantics |
owl-horst | OWL dialect close to OWL Horst – essentially pD* |
owl-max | RDFS and that part of OWL-Lite that can be captured in rules (deriving functional and inverse functional properties, all-different, subclass by union/enumeration; min/max cardinality constraints, etc) |
owl2-ql | The OWL2 QL profile – a fragment of OWL2 Full designed so that sound and complete query answering is LOGSPACE with respect to the size of the data. This OWL2 profile is based on DL-LiteR, a variant of DL-Lite that does not require the unique name assumption. |
owl2-rl | The OWL2 RL profile – an expressive fragment of OWL2 Full that is amenable for implementation on rule-engines |
![]() | Note that all rule-sets do not support data-type reasoning, which is the main reason that OWL-Horst is not the same as pD*. The rule-set to be used for a specific repository is defined through the ruleset parameter. There are optimized versions of all rule-sets that avoid some little used inferences. |
OWL2 QL non-conformance
The implementation of OWL2 QL is non-conformant with the W3C OWL2 profiles recommendation [31] as shown in Table 3:
Conformant behaviour | Implemented behaviour |
---|---|
Given a list of disjoint (data or object) properties and an entity that is related with these properties to objects {a, b, c, d,...} then infer an owl:AllDifferent restriction on an anonymous list of these objects. | For each pair {p, q} (p != q) of disjoint (data or object) properties then infer the triple: p owl:propertyDisjointWith q Which is more likely to be useful for query answering. |
For each class C in the knowledge base infer the existence of an anonymous class that is the union of a list of classes containing only C. | Not supported. Even if this infinite expansion were possible in a forward-chaining rule-based implementation, the resulting statements are of no use during query evaluation. |
If a instance of C1, and b instance of C2, and C1 and C2 disjoint then infer: a owl:differentFrom b |
Impractical for knowledge bases with many members of pairs of disjoint classes, e.g. Wordnet. Instead this is implemented as a consistency check: If x instance of C1 and C2, and C1 and C2 disjoint then inconsistent. |
Custom Rule-Sets
OWLIM has an internal rule compiler that can be configured with a custom set of inference rules and axioms. The user may define a custom rule-set (see 'The Rule Language' on page ) in a .pie file (e.g. MySemantics.pie). The easiest way to create a custom rule-set is to start modifying one of the .pie files that were used to build the precompiled rule-sets. All of these are provided as part of the OWLIM-SE distribution.
If the code generation or compilation cannot be completed successfully, a Java exception is thrown with an indication of the problem. It will state either the Id of the rule or the complete line from the source file where the problem is located. Line information is not preserved during the parsing of the rule file.
The user should specify the custom rule-set via the ruleset configuration parameter. The value of the ruleset parameter is interpreted as a filename and '.pie' is appended when not present. This file is processed to create Java source code that is compiled using the compiler from the Java Development Kit (JDK). The compiler is invoked using the mechanism provided by the JDK version 1.6 (or later). Therefore a prerequisite for using custom rule-sets is that the Java Virtual Machine (JVM) from a JDK version 1.6 (or later) is used to run the application. If all goes well, the class is loaded dynamically and instantiated for further use by OWLIM-SE during inference. The intermediate files are created in the folder that is pointed by the java.io.tmpdir system property. The JVM should have sufficient rights to read and write to this directory.
Performance Optimizations in RDFS and OWL Support
There are several features in the RDFS and OWL specifications that result in rather inefficient entailment rules and axioms, which can have a significant impact on the performance of a reasoning engine. Such examples are:
- The consequence X rdf:type rdfs:Resource for each URI node in the RDF graph;
- The system should be able to infer that URIs are classes and properties if they appear in schema-defining statements like X rdfs:subClassOf Y and X rdfs:subPropertyOf Y;
- The individual equality property in OWL is reflexive, i.e. the statement X owl:sameAs X holds for every OWL individual;
- All OWL classes are sub-classes of owl:Thing and for all individuals X rdf:type owl:Thing should hold;
- A class is inferred as being a rdfs:Class whenever an instance of the class is defined with I rdf:type C.
Although the above inferences are correct and important for the completeness of the formal semantics, users rarely execute queries whose results are affected by the existence of such statements. Moreover, these inferences generate so many inferred statements that performance and scalability can be severely degraded.
For this reason, optimized versions of standard rule-sets are provided. These have '-optimized' appended to the rule-set name, e.g. owl-horst-optimized, and use the optimizations listed in Table 4.
Optimization | Affects |
Remove axiomatic triples | <any> <any> <rdfs:Resource> <rdfs:Resource> <any> <any> <any> <rdfs:domain> <rdf:Property> <any> <rdfs:range> <rdf:Property> <owl:sameAs> <rdf:type> <owl:SymmetricProperty> <owl:sameAs> <rdf:type> <owl:TransitiveProperty> |
Remove rule conclusions | <any> <any> <rdfs:Resource> |
Remove rule constraints | [Constraint <variable> != <rdfs:Resource>] |
These optimization were previously achieved using the partialRDFS parameter, but are now achieved by using a previously optimized built-in rule-set, see the ruleset parameter in the configuration section for a complete list.
owl:sameAs Optimization
The performance of a OWLIM-SE repository is greatly improved with a specific optimisation that allows it to handle owl:sameAs statements efficiently. owl:sameAs is an OWL predicate that declares that two different URIs identify one and the same resource. Most often, it is used to align different identifiers of the same real-world entity used in different data sources. For example, in DBPedia, the URI of Vienna is http://dbpedia.org/page/Vienna, while in Geonames it is http://sws.geonames.org/2761369/. DBpedia contains the statement
(S1) dbpedia:Vienna owl:sameAs geonames:2761369
which declares that the two URIs are equivalent. owl:sameAs is probably the most important OWL predicate when it comes to merging data from different data sources.
Following the formal definition of OWL (OWL2 RL, to be more specific), whenever two URIs are declared equivalent, all statements that involve one of the URIs should be "replicated" with the other URI at the same position. For instance, in Geonames, the city of Vienna is defined as part of http://www.geonames.org/2761367/ (the first-order administrative division in Austria with the same name), which in turn, is part of Austria http://www.geonames.org/2782113:
(S2) geonames:2761369 gno:parentFeature geonames:2761367 (S3) geonames:2761367 gno:parentFeature geonames:2782113
Since gno:parentFeature is a transitive relationship, it will be inferred that the city of Vienna is also part of Austria:
(S4) geonames:2761369 gno:parentFeature geonames:2782113
Due to the semantics of owl:sameAs from (S1) it should also be inferred that statements (S2) and (S4) also hold for Vienna's DBpedia URI:
(S5) dbpedia:Vienna gno:parentFeature geonames:2761367 (S6) dbpedia:Vienna gno:parentFeature geonames:2782113
These implicit statements must hold no matter which one of the equivalent URIs is used, i.e. if a query is evaluated, the same results will be returned. When we consider that Austria, too, has an equivalent URI in DBpedia:
(S7) geonames:2782113 owl:sameAs dbpedia:Austria
we should also infer that:
(S8) dbpedia:Vienna gno:parentFeature dbpedia:Austria (S9) geonames:2761369 gno:parentFeature dbpedia:Austria (S10) geonames:2761367 gno:parentFeature dbpedia:Austria
In the above example, we had two alignment statements (S1 and S7), two statements carrying specific factual knowledge (S2 and S3), one statement inferred due to a transitive property (S4), and seven statements inferred as a result of owl:sameAs alignment (S5, S7, S8, S9, S10, as well as the inverse statements of S1 and S7). As we see, inference without owl:sameAs inflated the dataset by 25% (one new statement on top of 4 explicit), while the presence of the owl:sameAs statements increased the full closure by 175% (7 new statements). Considering that Vienna has a URI also in UMBEL, which is also declared equivalent to the one in DBpedia, the addition of one more explicit statement for this alignment, will cause the inference of 4 new implicit statements (duplicates of S1, S5, S6, and S8). Although this is a small example, it provides a indication about the performance implications of using owl:sameAs alignment statements from Linked Open Data. Also, because owl:sameAs is a transitive, reflexive, and symmetric relationship, for a set of N equivalent URIs N2 (N squared) owl:sameAs statements will be generated for each pair of URIs (in reality there are not that many examples of large owl:sameAs equivalence classes). Thus, although owl:sameAs is useful for interlinking RDF datasets, its semantics causes considerable inflation of the number of implicit statements that should be considered during inference and query evaluation (either through forward- or through backward-chaining).
To overcome this problem, OWLIM-SE handles owl:sameAs in a specific manner. In its indices, each set of equivalent URIs (equivalence class with respect to owl:sameAs) is represented by a single super-node. This way, OWLIM-SE does not inflate the indices and, at the same time, retains the ability to enumerate all statements that should be inferred using the equivalence during retrieval requests, i.e. during inference or query evaluation. Special care is taken to ensure that this optimisation does not hinder the ability to distinguish explicit from implicit statements.
![]() | The handling of owl:sameAs is technically a kind of backward chaining that occurs at query time, when equivalent URIs are enumerated and substituted in to query results. It should be noted that this will occur even when the 'empty' (no inference) rule-set is selected, i.e. even with no semantics selected owl:sameAs is still interpreted in a special way. However, the owl:sameAs optimisation can be disabled completely using the disable-sameAs configuration parameter, see the configuration section for details. |