There are two principal strategies for rule-based inference, called 'forward-chaining' and 'backward-chaining'. They can be briefly explained as follows:
Forward-chaining: involves applying the inference rules to the known facts (explicit statements) to generate new facts. The rules can then be re-applied to the combination of original and inferred facts to produce yet more new facts. The process is iterative and continues until no new facts can be generated. This kind of reasoning can have diverse objectives, for instance: to compute the inferred closure, to answer a particular query, to infer a particular sort of knowledge (e.g. the class taxonomy), etc. The advantage of this approach is that once all inferences have been computed, query answering can proceed extremely quickly. The disadvantages are greater initialisation costs (inference is computed at load time) and higher space/memory usage (especially when the number of inferred facts is very large);
Backward-chaining: involves starting with a fact to be proved or a query to be answered. Typically, the reasoner examines the knowledge base to see if the fact to be proved is already present and, if not, it examines the rule set to see which rules could be used to prove it. For the latter case, a check is made to see what other 'supporting' facts would need to be present to 'fire' these rules. The reasoner searches for proofs of each of these 'supporting' facts in the same way and iteratively maps out a search tree. The process terminates when either all of the leaves of the tree have proofs or no new candidate solutions can be found. Query processing is similar, but only stops when all search paths have been explored, because the purpose in query answering is to find not just one, but all possible substitutions for the variables in the query expression. The advantages of this approach are that there are no inferencing costs at start-up and minimal space requirements. The disadvantage is that inference must be performed each and every time a query is answered and, for complex search graphs, this can be computationally expensive and slow.
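A small illustration (using a hypothetical ex: namespace): suppose the repository holds the explicit statements ex:Dog rdfs:subClassOf ex:Mammal, ex:Mammal rdfs:subClassOf ex:Animal and ex:rex rdf:type ex:Dog. A forward-chaining reasoner applies the RDFS entailment rules at load time and stores the inferred statements ex:Dog rdfs:subClassOf ex:Animal, ex:rex rdf:type ex:Mammal and ex:rex rdf:type ex:Animal, so that a later query for all instances of ex:Animal reduces to a simple lookup. A backward-chaining reasoner stores nothing extra; when asked for the instances of ex:Animal, it works backwards through the sub-class statements at query time to discover that ex:rex qualifies.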
Both of these strategies have their advantages and disadvantages, which have been well studied in the history of KR and expert systems. Attempts to overcome the weak points have led to the development of various hybrid strategies (involving partial forward- and backward-chaining) which have proven efficient in many contexts.
Reasoning and materialisation are discussed in some detail in the GraphDB Primer.
GraphDB-Lite Logical Formalism
RDFS inference is achieved via a set of axiomatic triples and entailment rules. These rules allow the full set of valid inferences using RDFS semantics to be determined. Herman ter Horst in [21] defines RDFS extensions for more general rule support and a fragment of OWL, which is more expressive than DLP and fully compatible with RDFS. First, he defines R-entailment, which extends RDFS-entailment in the following ways:
it can operate on the basis of any set of rules R (i.e. allows for extension or replacement of the standard set, defining the semantics of RDFS);
it operates over so-called generalised RDF graphs, where blank nodes can appear as predicates (a possibility disallowed in RDF);
rules without premises are used to declare axiomatic statements;
rules without consequences are used to detect inconsistency (integrity constraints).
GraphDB uses a notation almost identical to R-entailment as defined by ter Horst. One major difference is that consistency-checking rules are not supported. GraphDB performs reasoning based on forward-chaining of entailment rules defined using RDF triple patterns with variables. GraphDB's reasoning strategy is 'total materialisation', which is introduced in the GraphDB Primer in the Reasoning Strategies topic.
Rule Format and Semantics
The rule format and the semantics enforced are analogous to R-entailment (see the Rule-Based Inference topic and [21]) with the following differences:
Free variables in the head (without binding in the body) are treated as blank nodes. This feature must be used with extreme care, because custom rule-sets can easily be created that recursively infer an infinite number of statements, making the semantics intractable;
Variable inequality constraints can be specified in addition to the triple patterns (they can be placed after any premise or consequence). This leads to lower complexity compared to R-entailment;
The [Cut] operator can be associated with rule premises. This is an optimisation that tells the rule compiler not to generate a variant of the rule with the identified rule premise as the first triple pattern;
Axiomatic triples can be provided as a set of statements, although these are not modelled as rules with empty bodies.
GraphDB can be configured via "rule-sets" (sets of axiomatic triples and entailment rules) that determine the applied semantics. The implementation of GraphDB relies on a compile stage, during which the rules are compiled into Java source code; this code is then compiled with the Java compiler and merged with the inference engine.
The Rule Language
A rule-set file has three sections named Prefices, Axioms, and Rules. All sections are mandatory and must appear in this order. Comments are allowed anywhere and follow the Java convention, i.e. "/* ... */" for block comments and "//" for end-of-line comments.
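The overall layout of a rule-set file is therefore roughly as follows (a sketch with the contents of each section elided; the bundled .pie files show the complete form):

Prefices
{
  // namespace abbreviations
}
Axioms
{
  // axiomatic triples
}
Rules
{
  // entailment rules
}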
Prefices
This section defines abbreviations for the namespaces used in the rest of the file. The syntax is:
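A sketch, modelled on the bundled .pie files, where each line gives a short name followed by a colon and the full namespace URI:

Prefices
{
  rdf  : http://www.w3.org/1999/02/22-rdf-syntax-ns#
  rdfs : http://www.w3.org/2000/01/rdf-schema#
  owl  : http://www.w3.org/2002/07/owl#
  xsd  : http://www.w3.org/2001/XMLSchema#
}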
Axioms
This section is used to assert 'axiomatic triples', which are usually used to describe the meta-level primitives used to define the schema, such as rdf:type, rdfs:Class, etc. It contains a list of the (variable-free) triples, one per line. For example, the RDF axiomatic triples are defined thus:
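A sketch listing a representative subset of the RDF axiomatic triples (URIs written using the short names declared in the Prefices section):

Axioms
{
  <rdf:type>      <rdf:type> <rdf:Property>
  <rdf:subject>   <rdf:type> <rdf:Property>
  <rdf:predicate> <rdf:type> <rdf:Property>
  <rdf:object>    <rdf:type> <rdf:Property>
  <rdf:first>     <rdf:type> <rdf:Property>
  <rdf:rest>      <rdf:type> <rdf:Property>
  <rdf:value>     <rdf:type> <rdf:Property>
  <rdf:nil>       <rdf:type> <rdf:List>
}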
Rules
This section is used to define entailment rules. Each definition consists of premises and corollaries that are RDF statements defined with subject, predicate and object components. Any component can be a variable, blank node, literal, full URI or the short name for a URI. Variables are alphanumeric and must begin with a letter.
Furthermore, inequality constraints can be used to state that the value of a variable in a statement must not be equal to some specific full URI (or its short name) or to the value of another variable within the same rule. The behaviour of an inequality constraint depends on whether it is placed in the body or the head of a rule. If it is placed in the body of a rule, then the whole rule will not 'fire' if the constraint fails, i.e. the constraint can be placed next to any statement pattern in the body of the rule with the same effect (it does not have to be placed next to the statement containing the variables it references). If the constraint is in the head, then its location is significant, because a constraint that does not hold prevents only the statement it is adjacent to from being inferred (an example is given after the example rules below).
The syntax of a rule definition is as follows:
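A sketch of the general shape, inferred from the example rules that follow (the constraints and the [Cut] operator are optional):

Id: <rule_name>
  <premise>      [<constraints>] [Cut]
  <premise>      [<constraints>]
  -------------------------------
  <consequence>  [<constraints>]
  <consequence>  [<constraints>]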
Where each premise and consequence is on a separate line. The following example helps to illustrate the possibilities:
Rules
{
Id: rdf1_rdfs4a_4b
x a y
-------------------------------
x <rdf:type> <rdfs:Resource>
a <rdf:type> <rdfs:Resource>
y <rdf:type> <rdfs:Resource>
Id: rdfs2
x a y [Constraint a != <rdf:type>]
a <rdfs:domain> z [Constraint z != <rdfs:Resource>]
-------------------------------
x <rdf:type> z
Id: owl_FunctProp
p <rdf:type> <owl:FunctionalProperty>
x p y [Constraint y != z, p != <rdf:type>]
x p z [Constraint z != y] [Cut]
-------------------------------
y <owl:sameAs> z
}
The symbols p, x, y, z and a are variables. The second rule contains two constraints that reduce the number of bindings for each premise, i.e. they 'filter out' those statements where the constraint does not hold.
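The following hypothetical rule (using an imaginary ex: namespace, which would have to be declared in the Prefices section) illustrates a constraint placed in the head:

Id: ex_head_constraint
  x <ex:hasSpouse> y
  -------------------------------
  x <rdf:type> <ex:Married>
  x <owl:differentFrom> y [Constraint x != y]

If x and y happen to be bound to the same resource, the constraint fails and only the owl:differentFrom consequence is withheld; the rdf:type consequence is still inferred. Had the same constraint been attached to the premise instead, the rule would not have fired at all for such bindings.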
In a forward chaining inference step, a rule is interpreted as meaning that for all possible ways of satisfying the premises, the bindings for the variables are used to populate the consequences of the rule. This generates new statements that will manifest themselves in the repository, for example, by being returned as query results.
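For instance, given the asserted statements (again using a hypothetical ex: namespace):

<ex:hasPet> <rdfs:domain> <ex:Person>
<ex:mary> <ex:hasPet> <ex:rex>

the rdfs2 rule above can be satisfied with a bound to ex:hasPet, x to ex:mary, y to ex:rex and z to ex:Person (both constraints hold), so the statement <ex:mary> <rdf:type> <ex:Person> is inferred and added to the repository.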
The last rule contains an example of the [Cut] operator, which is an optimisation hint for the rule compiler. When rules are compiled, a different variant of the rule is created for each premise, so that each premise occurs as the first triple pattern in one of the variants. This is done so that incoming statements can be efficiently matched to the appropriate inference rules. However, when a rule contains two or more premises that match identical triple patterns, but using different variable names, the extra variant(s) are redundant and better efficiency can be achieved by simply not creating them. In the above example, rule owl_FunctProp would by default be compiled into three variants:
p <rdf:type> <owl:FunctionalProperty>
x p y
x p z
-------------------------------
y <owl:sameAs> z

x p y
p <rdf:type> <owl:FunctionalProperty>
x p z
-------------------------------
y <owl:sameAs> z

x p z
p <rdf:type> <owl:FunctionalProperty>
x p y
-------------------------------
y <owl:sameAs> z
As can be seen, the last two variants are identical apart from the rotation of variables y and z, so one of these variants is not needed. The use of the [Cut] operator above tells the rule compiler to eliminate this last variant, i.e. the one beginning with the premise x p z.
Materialisation
A GraphDB repository will use the configured rule-set to compute all inferred statements at load time. This increases the cost and time taken to load a repository containing a large amount of data, but it has the desirable advantage that subsequent query evaluation can proceed extremely quickly.
Apart from a number of optimisations, the approach taken by GraphDB is one of 'total materialisation', where the inference rules are applied repeatedly to the asserted (explicit) statements until no further inferred (implicit) statements are produced.
Retraction of assertions
GraphDB stores explicit and implicit statements, i.e. those statements inferred (materialised) from the explicit statements. It follows, therefore, that when explicit statements are removed from the repository, any implicit statements that rely on them must also be removed.
This is achieved by re-computing the full closure (minimal model), i.e. applying the entailment rules to all explicit statements and computing the inferences. This approach guarantees correctness, but does not scale: the computation becomes increasingly slow and expensive in proportion to the number of explicit statements and the complexity of the entailment rule-set.
Predefined Rule-Sets
There are a number of pre-defined rule-sets provided with GraphDB-Lite that cover various well-known knowledge representation formalisms. The following list gives the details:
empty: no reasoning at all, i.e. GraphDB operates as a plain RDF store;
rdfs: supports the standard model-theoretic RDFS semantics;
owl-horst: OWL dialect close to OWL Horst, essentially pD*;
owl-max: RDFS and that part of OWL Lite that can be captured in rules (deriving functional and inverse functional properties, all-different, subclass by union/enumeration, min/max cardinality constraints, etc.);
owl2-ql: the OWL2 QL profile, a fragment of OWL2 Full designed so that sound and complete query answering is LOGSPACE with respect to the size of the data; this profile is based on DL-LiteR, a variant of DL-Lite that does not require the unique name assumption;
owl2-rl-reduced: the OWL2 RL profile, an expressive fragment of OWL2 Full that is amenable to implementation on rule engines, but without the prp-key rule (for efficiency reasons);
owl2-rl-conf: the conformant, but less efficient, OWL2 RL profile that includes the prp-key rule.
Note that none of the rule-sets supports data-type reasoning, which is the main reason why OWL-Horst is not the same as pD*. The rule-set to be used for a specific repository is defined through the ruleset parameter. There are also optimised versions of all rule-sets that avoid some little-used inferences.
OWL2 RL non-conformance
The GraphDB-Lite reasoner does not support the quad patterns that are available in GraphDB-SE. Therefore, it manages the auxiliary ternary predicates (associated with the expansion of list structures in the bodies of the OWL2 RL entailment rules) using blank nodes. This is much less efficient than in GraphDB-SE, so the OWL2 RL support is provided in two versions: owl2-rl-conf contains all the rules, but is not recommended for datasets larger than a few tens of thousands of statements; owl2-rl-reduced is identical, except that the prp-key rule has been removed, which makes it suitable for datasets of several hundreds of thousands of statements.
OWL2 QL non-conformance
The implementation of OWL2 QL is non-conformant with the W3C OWL2 profiles recommendation [31] as shown in Table 3:
Conformant behaviour: Given a list of disjoint (data or object) properties and an entity that is related by these properties to the objects {a, b, c, d, ...}, infer an owl:AllDifferent restriction on an anonymous list of these objects.
Implemented behaviour: For each pair {p, q} (p != q) of disjoint (data or object) properties, infer the triple p owl:propertyDisjointWith q, which is more likely to be useful for query answering.

Conformant behaviour: For each class C in the knowledge base, infer the existence of an anonymous class that is the union of a list of classes containing only C.
Implemented behaviour: Not supported. Even if this infinite expansion were possible in a forward-chaining, rule-based implementation, the resulting statements would be of no use during query evaluation.

Conformant behaviour: If a is an instance of C1, b is an instance of C2, and C1 and C2 are disjoint, then infer a owl:differentFrom b.
Implemented behaviour: Impractical for knowledge bases with many members of pairs of disjoint classes, e.g. Wordnet. Instead, this is implemented as a consistency check: if x is an instance of both C1 and C2, and C1 and C2 are disjoint, then the repository is inconsistent.
Custom Rule-Sets
GraphDB has an internal rule compiler that can be used to configure a custom set of inference rules and axioms. The user may define a custom rule-set (see section 7.1.2) in a .pie file (e.g. MySemantics.pie). The easiest way to create a custom rule-set is to start modifying one of the .pie files that were used to build the precompiled rule-sets. All of these are provided as part of the GraphDB-Lite distribution.
If the code generation or compilation cannot be completed successfully, a Java exception is thrown with an indication of the problem: it states either the Id of the rule or the complete line from the source file where the problem is located (line numbers are not preserved during the parsing of the rule file).
The user specifies the custom rule-set via the ruleset configuration parameter. The value of the ruleset parameter is interpreted as a filename, and '.pie' is appended when not present. This file is processed to create Java source code that is compiled using the compiler from the Java Development Kit (JDK). The compiler is invoked using the mechanism provided by JDK version 1.6 (or later). Therefore, a prerequisite for using custom rule-sets is that the application is run with the Java Virtual Machine (JVM) from a JDK version 1.6 (or later). If all goes well, the resulting class is loaded dynamically and instantiated for further use by GraphDB during inference. The intermediate files are created in the folder pointed to by the java.io.tmpdir system property, so the JVM must have sufficient rights to read and write to this directory.
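For example, assuming a custom rule file saved at ./rules/MySemantics.pie (a hypothetical path), setting the ruleset parameter to either ./rules/MySemantics or ./rules/MySemantics.pie should cause that file to be compiled and used for inference.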
In environments that use custom class-loader schemes, such as OSGI frameworks, the default method for locating .class files does not work when compiling rule files. To get around this, the rule compiler uses the values of two Java system properties, set with -Dowlim-lite.X.Y.jar.file and -Dopenrdf-model.jar.file, for the exact locations of the GraphDB and OpenRDF model libraries that should be used during the compilation process. The values should contain the exact file system path to both files.
An important issue with custom rule-sets is that GraphDB requires at least one of the following RDF(S) resources to be specified/mentioned within an axiomatic triple or a rule: rdf:type, rdfs:range, rdfs:domain, rdfs:subClassOf, rdfs:Class and rdfs:subPropertyOf. These are used in specific implementations related to basic RDF support and are hard-coded into GraphDB.
Due to some of the optimisation techniques used in GraphDB, a custom rule-set also requires that at least one of its rules derives triples matching the following patterns: x rdf:type y, x rdfs:subClassOf y and x rdfs:subPropertyOf y (the actual variable names do not matter).
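For illustration, a minimal sketch of rules that derive all three patterns, written in the rule notation described above (these correspond to the standard RDFS entailment rules for class membership and for sub-class and sub-property transitivity):

Id: rdfs9
  a <rdfs:subClassOf> b
  x <rdf:type> a
  -------------------------------
  x <rdf:type> b

Id: rdfs11
  x <rdfs:subClassOf> y
  y <rdfs:subClassOf> z
  -------------------------------
  x <rdfs:subClassOf> z

Id: rdfs5
  x <rdfs:subPropertyOf> y
  y <rdfs:subPropertyOf> z
  -------------------------------
  x <rdfs:subPropertyOf> z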
Performance Optimisations in RDFS and OWL Support
There are several features in the RDFS and OWL specifications that result in rather inefficient entailment rules and axioms, which can have a significant impact on the performance of a reasoning engine. Such examples are:
The consequence X rdf:type rdfs:Resource for each URI node in the RDF graph;
The system should be able to infer that URIs are classes and properties if they appear in schema-defining statements like X rdfs:subClassOf Y and X rdfs:subPropertyOf Y;
The individual equality property in OWL is reflexive, i.e. the statement X owl:sameAs X holds for every OWL individual;
All OWL classes are sub-classes of owl:Thing and for all individuals X rdf:type owl:Thing should hold;
A class is inferred as being an rdfs:Class whenever an instance of the class is defined with I rdf:type C.
Although the above inferences are correct and important for the completeness of the formal semantics, users rarely execute queries whose results are affected by the existence of such statements. Moreover, these inferences generate so many inferred statements that performance and scalability can be severely degraded.
For this reason, optimized versions of standard rule-sets are provided. These have '-optimized' appended to the rule-set name, e.g. owl-horst-optimized, and use the optimisations listed in Table 4.
These optimisations were previously achieved using the partialRDFS parameter, but are now achieved by selecting one of the optimised built-in rule-sets; see the ruleset parameter in the configuration section for a complete list.