GraphDB-SE Performance Tuning

Current version by Vladimir Alexiev, Dec 13, 2014 16:14.


{excerpt}Hints about tuning the storage and retrieval performance of GraphDB{excerpt}

{toc}

h1. Introduction
This page is intended for system administrators that are already familiar with GraphDB-SE and the Sesame OpenRDF infrastructure, who wish to configure their system for optimal performance. Best performance is typically measured by the shortest load time and fastest query answering. Many factors affect the performance of a GraphDB-SE repository in many different ways.
This page attempts to bring together all factors that affect performance, however it is measured. See [GraphDB-SE Rule Profiling] for hints related to rulesets and reasoning.

h1. Memory configuration

If the dataset makes heavy use of literals with language tags, the {{in-memory-literal-properties}} configuration parameter can be set to {{true}}, causing these data values to be cached in memory.
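For context, this parameter is set in the Sesame-style repository configuration. The fragment below is an illustrative sketch only (the repository id is hypothetical and the surrounding template is abridged; the {{owlim}} prefix is assumed to be {{http://www.ontotext.com/trree/owlim#}}):

```turtle
@prefix rep: <http://www.openrdf.org/config/repository#> .
@prefix sr: <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .
@prefix owlim: <http://www.ontotext.com/trree/owlim#> .

[] a rep:Repository ;
   rep:repositoryID "example" ;          # hypothetical repository id
   rep:repositoryImpl [
      rep:repositoryType "openrdf:SailRepository" ;
      sr:sailImpl [
         sail:sailType "owlim:Sail" ;
         # cache data values with language tags in memory
         owlim:in-memory-literal-properties "true"
      ]
   ] .
```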

h2. Not enumerating {{owl:sameAs}}

During query answering, all URIs from each equivalence class produced by the [owl:sameAs Optimisation|GraphDB-SE Reasoner#sameAs Optimisation] are enumerated, so that query results are sound and complete. The mechanism that stores equivalences cannot be switched off, but you can use the {{onto:disable-sameAs}} pseudo-graph (see [Other special query behaviour|GraphDB-SE Query Behaviour#Other special query behaviour]) to suppress the enumeration, dramatically reducing these (in effect) duplicate results: a single representative from each equivalence class is returned instead.

The presence of many {{owl:sameAs}} statements -- such as when using several LOD datasets and link sets -- causes an explosion in the number of inferred statements. For a simple example, if A is a city in country X, and B and C are alternative names for A, and Y and Z are alternative names for X, then the inference engine should also infer: B in X, C in X, B in Y, C in Y, B in Z, C in Z.
As described in the GraphDB-SE user guide, GraphDB-SE avoids this explosion by grouping equivalent URIs into a single master node, which is used for inference and statement retrieval. This is in effect a kind of backward chaining, which allows all the sound and complete statements to be computed at query time.
This optimisation can save a large amount of space for two reasons:
# A single node is used for all N URIs in an equivalence class, which avoids storing N {{owl:sameAs}} statements;
# The reasoning engine would otherwise infer that all URIs in an equivalence class are equivalent to each other (and to themselves), i.e. another N{^}2^ {{owl:sameAs}} statements are avoided.
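To make the N{^}2^ figure concrete, here is a small Python sketch (illustrative only, not GraphDB code) that materialises the full reflexive, symmetric, and transitive {{owl:sameAs}} closure of an equivalence class and counts the statements that the single master node avoids:

```python
from itertools import product

def sameas_closure(pairs):
    """Materialise the full owl:sameAs closure (reflexive, symmetric,
    transitive) implied by the asserted sameAs pairs."""
    parent = {}

    def find(x):                            # union-find root lookup
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:                      # merge the equivalence classes
        parent[find(a)] = find(b)

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)

    closure = set()
    for members in classes.values():
        # every URI is sameAs every member, including itself: N^2 pairs
        closure.update(product(members, members))
    return closure

# a chain of asserted sameAs statements linking N = 5 equivalent URIs
N = 5
chain = [(f"uri{i}", f"uri{i+1}") for i in range(N - 1)]
print(len(sameas_closure(chain)))   # N^2 = 25 statements vs one shared node
```

Five equivalent URIs already imply 25 materialised {{owl:sameAs}} statements; the optimisation replaces all of them with a single node.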
Consider these example queries executed against the [FactForge|http://factforge.net/] combined dataset. The default is to enumerate:
{noformat}
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE { ?c rdfs:subClassOf dbpedia:Airport }
{noformat}
giving many results:
{noformat}
dbpedia:Air_strip
http://sw.cyc.com/concept/Mx4ruQS1AL_QQdeZXf-MIWWdng
umbel-sc:CommercialAirport
dbpedia:Airstrips
dbpedia:Airport
dbpedia:Airporgt
fb:guid.9202a8c04000641f800000000004ae12
opencyc-en:CommercialAirport
{noformat}
If we specify the {{onto:disable-sameAs}} pseudo-graph:
{noformat}
PREFIX onto: <http://www.ontotext.com/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * FROM onto:disable-sameAs
WHERE { ?c rdfs:subClassOf dbpedia:Airport }
{noformat}
then only two results are returned:
{noformat}
dbpedia:Air_strip
opencyc-en:CommercialAirport
{noformat}

The *Expand results over equivalent URIs* checkbox in the GraphDB Workbench SPARQL editor plays a similar role, but the meaning is reversed.

BEWARE: if a query uses a filter over the textual representation of a URI, e.g.
{noformat}filter(strstarts(str(?x),"http://dbpedia.org/ontology")){noformat}
then suppressing the enumeration may skip some valid solutions, since not all URIs within an equivalence class are matched against the filter.

h1. Reasoning complexity

The complexity of the ruleset has a large effect on loading performance and on the overall size of the repository after loading. The complexity of the standard rulesets increases as follows:
* none (lowest complexity, best performance)
* rdfs-optimized
* rdfs
* owl-horst-optimized
* owl-horst
* owl-max-optimized
* owl-max
* owl2-ql-optimized
* owl2-ql
* owl2-rl-optimized
* owl2-rl (highest complexity, worst performance)

It should be noted that all rules affect loading speed, even if they never actually infer any new statements. This is because, as new statements are added, they are pushed into every rule to see if anything can be inferred, which often results in many joins being computed even though the rule never 'fires'.

h2. Custom rulesets

If better load performance is required and it is known that the dataset does not contain anything that will cause certain rules to fire, then those rules can be omitted from the ruleset. To do this, copy the appropriate {{.pie}} file included in the distribution, remove the unused rules, and set the {{ruleset}} configuration parameter to the full pathname of the custom ruleset file.
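Assuming the standard repository configuration template, the parameter would then be set along these lines (the pathname is hypothetical):

```turtle
# in the sr:sailImpl block of the repository configuration (.ttl),
# with the owlim prefix <http://www.ontotext.com/trree/owlim#>
owlim:ruleset "/opt/graphdb/rules/owl-horst-custom.pie"
```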
If custom rules are being used to specify semantics not covered by the included standard rulesets, then some care must be taken for the following reasons:
* Recursive rules can lead to an explosion in the number of inferred statements;
* Rules with unbound variables in the head cause new blank nodes to be created -- the inferred statements can never be retracted and can cause other rules to fire.

h2. Long transitive chains

SwiftOWLIM version 2.9 contained a special optimisation that prevented the materialisation of inferred statements resulting from transitive chains; instead, these inferences were computed during query answering. Such an optimisation is NOT available in GraphDB-SE due to the nature of its indexing structures, so GraphDB-SE materialises all inferred statements at load time. When a transitive chain is long, this can produce a very large number of inferences. For example, for a chain of N {{rdfs:subClassOf}} relationships, GraphDB-SE infers (and materialises) a further (N{^}2^\-N)/2 statements. If the relationship is also symmetric, e.g. in a family ontology with a predicate such as {{relatedTo}}, then there will be N{^}2^\-N inferred statements.
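The quadratic growth for the transitive case can be checked with a short Python sketch (illustrative only) that computes the transitive closure of a chain of N relationships and counts the statements that would have to be materialised:

```python
def transitive_closure(edges):
    """Naive fixpoint computation: repeatedly join the relation
    with itself until no new pairs appear."""
    closure = set(edges)
    while True:
        new = {(a, d)
               for (a, b) in closure
               for (c, d) in closure
               if b == c and (a, d) not in closure}
        if not new:
            return closure
        closure |= new

N = 10                                   # chain of N transitive relationships
chain = {(i, i + 1) for i in range(N)}   # 0 -> 1 -> ... -> N
inferred = transitive_closure(chain) - chain
print(len(inferred))                     # (N^2 - N) / 2 = 45 inferred statements
```

A chain of just 10 relationships already yields 45 extra statements; at N = 1000 the figure approaches half a million, which is why long transitive chains deserve attention.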
Administrators should therefore take great care when managing datasets that have long chains of transitive relationships. If performance becomes a problem, it may be necessary to:
# Modify the schema, either by removing the symmetry of certain relationships or by removing the transitive nature of certain properties altogether;
# Reduce the complexity of inference by choosing a less sophisticated ruleset.

h1. Strategy