GraphDB-SE Reasoner

The performance of a GraphDB-SE repository is greatly improved by a specific optimisation that handles {{owl:sameAs}} statements efficiently. {{owl:sameAs}} declares that two different URIs identify the same resource. Most often, it is used to align identifiers of the same real-world entity used in different datasets. For example, in DBpedia the URI of Vienna is {{http://dbpedia.org/page/Vienna}}, while in Geonames it is {{http://sws.geonames.org/2761369}}. DBpedia contains the statement
{noformat}(S1) dbpedia:Vienna owl:sameAs geonames:2761369{noformat}
that declares the two URIs as equivalent.

{{owl:sameAs}} is probably the most important OWL predicate when it comes to integrating data from different data sources and interlinking RDF datasets. However, its semantics cause an explosion in the number of inferred statements. Following the formal definition of OWL (OWL2 RL, to be more specific), whenever two URIs are declared equivalent, all statements that involve one of the URIs should be "replicated" using the other URI in the same position. For instance, in Geonames the city of Vienna is defined as part of {{http://www.geonames.org/2761367}} (the first-order administrative division in Austria with the same name), which, in turn, is part of Austria ({{http://www.geonames.org/2782113}}):
{noformat}
(S2) geonames:2761369 gno:parentFeature geonames:2761367

In the above example, there are two alignment statements (S1 and S7), two statements carrying specific factual knowledge (S2 and S3), one statement inferred due to a transitive property (S4), and seven statements inferred as a result of {{owl:sameAs}} expansion (S5, S6, S8, S9, S10, as well as the inverse statements of S1 and S7).
As we see, inference without {{owl:sameAs}} increased the dataset by 25% (one new statement on top of 4 explicit), while {{owl:sameAs}} inference added a further 175% (7 new statements). But Vienna also has a URI in UMBEL: if we add an {{owl:sameAs}} statement for this alignment, it will cause the inference of 4 new implicit statements (duplicates of S1, S5, S6, and S8). Although this is a small example, it provides an indication of the performance implications of using {{owl:sameAs}} alignment statements from Linked Open Data.
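
The counts above can be reproduced with a naive materialisation. The following is a minimal Python sketch, not GraphDB code. It assumes that S3 is the second {{gno:parentFeature}} statement described in the prose, and it uses the placeholder {{ex:Austria}} for the URI aligned by S7, because that statement is not shown in this excerpt. Reflexive {{owl:sameAs}} statements are left out, matching the way the text counts.

```python
SAME_AS = "owl:sameAs"
PARENT = "gno:parentFeature"

# Explicit statements. "ex:Austria" is a placeholder: the URI that S7 aligns
# with geonames:2782113 is an assumption, not taken from the page.
explicit = {
    ("dbpedia:Vienna", SAME_AS, "geonames:2761369"),    # S1
    ("geonames:2761369", PARENT, "geonames:2761367"),   # S2
    ("geonames:2761367", PARENT, "geonames:2782113"),   # S3 (from the prose)
    ("geonames:2782113", SAME_AS, "ex:Austria"),        # S7 (assumed)
}


def transitive_closure(statements, predicate=PARENT):
    """Add (a p c) whenever (a p b) and (b p c) are present, until stable."""
    closed = set(statements)
    while True:
        links = [(s, o) for s, p, o in closed if p == predicate]
        new = {(a, predicate, d) for a, b in links for c, d in links if b == c}
        if new <= closed:
            return closed
        closed |= new


def same_as_expansion(statements):
    """Naively materialise owl:sameAs: symmetry plus URI substitution.

    Transitivity of owl:sameAs itself is not handled; the clusters in this
    example contain only two URIs each, so it is not needed here.
    """
    closed = set(statements)
    while True:
        new = set()
        for a, p, b in closed:
            if p != SAME_AS:
                continue
            new.add((b, SAME_AS, a))          # symmetry
            for s, q, o in closed:
                if q == SAME_AS:
                    continue
                if s == a:
                    new.add((b, q, o))        # replicate with the other subject
                if o == a:
                    new.add((s, q, b))        # replicate with the other object
        if new <= closed:
            return closed
        closed |= new


after_transitive = transitive_closure(explicit)
full_closure = same_as_expansion(after_transitive)

print(len(after_transitive) - len(explicit))      # 1 statement from transitivity
print(len(full_closure) - len(after_transitive))  # 7 statements from owl:sameAs
```

The result does not depend on which URI stands in for {{ex:Austria}}, only on the fact that S7 introduces one additional equivalent URI.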

Furthermore, {{owl:sameAs}} is an equivalence relation (transitive, reflexive, and symmetric). Thus, for a set of N equivalent URIs, N{^}2^ (N squared) {{owl:sameAs}} statements should be considered. This inflation in the number of implicit statements affects both inference and query evaluation (through either forward- or backward-chaining).
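
To make the quadratic growth concrete, here is a short Python sketch that enumerates every {{owl:sameAs}} statement implied within a single equivalence class. The helper name {{same_as_statements}} and the URI {{ex:umbel-Vienna}} are hypothetical, introduced only for this illustration.

```python
from itertools import product


def same_as_statements(cluster):
    """All owl:sameAs statements implied by one equivalence class.

    Every ordered pair (a, b) is included, reflexive pairs too, because
    owl:sameAs is reflexive, symmetric and transitive.
    """
    return {(a, "owl:sameAs", b) for a, b in product(cluster, repeat=2)}


# "ex:umbel-Vienna" is a placeholder for the UMBEL URI mentioned in the text.
cluster = ["dbpedia:Vienna", "geonames:2761369", "ex:umbel-Vienna"]
implied = same_as_statements(cluster)
print(len(implied))  # 9, i.e. 3 squared
```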

GraphDB handles {{owl:sameAs}} in a special manner that avoids both problems. It neither inflates the statement indices nor stores a quadratic number of {{owl:sameAs}} statements per cluster (equivalence class). Instead, each {{owl:sameAs}} cluster is represented by a single super-node, and all statements are recorded against a selected representative of each cluster. During query evaluation, GraphDB performs a kind of backward chaining: it enumerates the equivalent URIs and substitutes them into the query results, which guarantees completeness of inference and query results. Special care is taken to ensure that this optimisation does not hinder the ability to distinguish between explicit and implicit statements.
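
The general idea of one representative per cluster with expansion at query time can be sketched with a union-find structure. This is an illustrative Python toy, not GraphDB's actual implementation: the class {{SameAsStore}} and its methods are hypothetical, {{ex:Austria}} is a placeholder URI, and the sketch does not track which statements are explicit and which are implicit.

```python
SAME_AS = "owl:sameAs"
PARENT = "gno:parentFeature"


class SameAsStore:
    """Toy triple store keeping one copy of each statement per sameAs cluster."""

    def __init__(self):
        self.parent = {}         # URI -> parent URI in the union-find forest
        self.statements = set()  # triples over cluster representatives only

    def find(self, uri):
        """Return the representative URI of the cluster containing uri."""
        self.parent.setdefault(uri, uri)
        while self.parent[uri] != uri:
            uri = self.parent[uri]
        return uri

    def add(self, s, p, o):
        """Store a statement; owl:sameAs merges clusters instead of being stored."""
        if p == SAME_AS:
            self._merge(s, o)
        else:
            self.statements.add((self.find(s), p, self.find(o)))

    def _merge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        self.parent[rb] = ra
        # Re-point statements already stored against the absorbed representative.
        self.statements = {
            (self.find(s), p, self.find(o)) for s, p, o in self.statements
        }

    def cluster(self, uri):
        """All known URIs equivalent to uri, including uri itself."""
        root = self.find(uri)
        return {u for u in list(self.parent) if self.find(u) == root}

    def query(self, predicate):
        """Yield every statement with this predicate, expanding clusters lazily."""
        for s, p, o in list(self.statements):
            if p == predicate:
                for s2 in self.cluster(s):
                    for o2 in self.cluster(o):
                        yield (s2, p, o2)


store = SameAsStore()
store.add("dbpedia:Vienna", SAME_AS, "geonames:2761369")
store.add("geonames:2761369", PARENT, "geonames:2761367")
store.add("geonames:2761367", PARENT, "geonames:2782113")
store.add("geonames:2782113", SAME_AS, "ex:Austria")  # placeholder URI

results = set(store.query(PARENT))
print(len(store.statements))  # 2 statements held in the index
print(len(results))           # 4 statements visible to a query
```

Only two {{gno:parentFeature}} statements are held in the toy index, while a query still sees all four variants produced by URI substitution. Transitive inference is deliberately left out to keep the sketch focused on the {{owl:sameAs}} handling.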

{info}This occurs even when the 'empty' (no inference) rule-set is selected, i.e. even with no semantics selected, {{owl:sameAs}} is still interpreted in a special way. However, the {{owl:sameAs}} optimisation can be disabled completely using the {{disable-sameAs}} configuration parameter; see [GraphDB-SE Configuration] for details.{info}