{excerpt}Hints about tuning the storage and retrieval performance of GraphDB{excerpt}
See [GraphDB-SE Rule Profiling] for hints related to rulesets and reasoning.


h1. Introduction
This page is intended for system administrators who are already familiar with GraphDB-SE and the Sesame OpenRDF infrastructure and who wish to configure their system for optimal performance. Best performance is typically measured by the shortest load time and the fastest query answering. Many factors affect the performance of a GraphDB-SE repository, often in different ways.
This page attempts to bring together all factors that affect performance, however it is measured.

h1. Memory configuration

Memory configuration is the single most important factor for optimising the performance of GraphDB-SE. In every respect, the more memory available, the better the performance. The only question is how to divide the available memory between the various GraphDB-SE data structures in order to achieve the best overall behaviour.

h2. Setting the maximum Java heap space

The maximum amount of heap space used by a Java virtual machine (JVM) is specified using the {{\-Xmx}} virtual machine parameter. The value should be no higher than the amount of free memory available in the target system multiplied by a factor that allows for extra runtime overhead, say about 90%.
For example, if a system has 16GB of total RAM and 1GB is used by the operating system, services, etc., then the JVM that hosts the application using GraphDB-SE should have a maximum heap size of at most 15GB (16-1), which can be set using the JVM argument {{\-Xmx15g}}.
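The sizing rule above can be sketched as a small calculation. This is only an illustration: the function names and the whole-gigabyte rounding are assumptions made here, not part of GraphDB-SE. With a factor of 1.0 the sketch reproduces the 15GB figure from the example, and with the suggested factor of about 90% it gives a more conservative value.

```python
def max_heap_gb(total_ram_gb, used_by_system_gb, overhead_factor=0.9):
    """Return a whole-gigabyte heap size: free memory scaled by an overhead factor."""
    free_gb = total_ram_gb - used_by_system_gb
    if free_gb <= 0:
        raise ValueError("no free memory left for the JVM")
    return int(free_gb * overhead_factor)  # round down to stay on the safe side


def xmx_flag(heap_gb):
    """Format the JVM argument for a heap size given in gigabytes."""
    return f"-Xmx{heap_gb}g"


# 16GB total RAM, 1GB used by the operating system and services
print(xmx_flag(max_heap_gb(16, 1, overhead_factor=1.0)))  # -Xmx15g
print(xmx_flag(max_heap_gb(16, 1)))                       # -Xmx13g
```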

h2. Data structures

The heap space available is used by:
* the JVM, the application and GraphDB-SE workspace (byte code, stacks, etc.);
* data structures for storing entities affected by specifying {{entity-index-size}};
* data structures for indexing statements specified using {{cache-memory}}.

The memory required for storing entities is determined by the number of entities in the dataset: 4 bytes per slot allocated by {{entity-index-size}}, plus 12 bytes for each stored entity.
The memory required for the indices (cache types), however, depends on which indices are being used. The {{SPO}} and {{PSO}} indices are always used. Optional indices include {{predicateLists}}, the context indices {{PCSO}} / {{PSOC}}, and the FTS (full-text search) indices.
The memory allocated to these cache types can be calculated automatically by GraphDB-SE, but some of them can be specified in a more fine-grained way. The following configuration parameters are relevant:

{noformat}cache-memory = tuple-index-memory + predicate-memory + fts-memory{noformat}
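As a rough illustration of the entity storage arithmetic described above (4 bytes per slot plus 12 bytes per stored entity), the following sketch estimates the heap taken by the entity data structures. The function name and the sample figures are hypothetical.

```python
BYTES_PER_SLOT = 4     # per slot allocated by entity-index-size
BYTES_PER_ENTITY = 12  # per stored entity


def entity_memory_bytes(entity_index_size, entity_count):
    """Estimate the memory used for storing entities, in bytes."""
    return BYTES_PER_SLOT * entity_index_size + BYTES_PER_ENTITY * entity_count


# Hypothetical figures: 10 million index slots, 20 million stored entities
total = entity_memory_bytes(10_000_000, 20_000_000)
print(total)                          # 280000000
print(f"{total / 1024 ** 2:.0f} MiB")  # 267 MiB
```

Whatever heap is left after this amount and the general workspace overhead is what can be given to {{cache-memory}}.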

A more complete view of memory use for GraphDB-SE is given here:

h2. Running in a servlet container

If the GraphDB-SE repository is hosted by the Sesame HTTP servlet, then the maximum heap space applies to the servlet container (Tomcat). In this case, allow some more heap memory for the runtime overhead, especially if other servlets are running at the same time. Some configuration of the servlet container might also improve performance, e.g. increasing the permanent generation, which by default is 64MB. Quadrupling it (for Tomcat) with {{\-XX:MaxPermSize=256m}} might help. Further information can be found in the Tomcat documentation.

h1. Delete operations

GraphDB-SE's inference policy is based on materialisation, where implicit statements are inferred from explicit statements as soon as they are inserted into the repository, using the specified semantics ({{ruleset}}). This approach has the advantage that query answering is very fast, since no inference needs to be done at query time.
However, no justification information is stored for inferred statements, so deleting a statement would normally require a full re-computation of all inferred statements, which can take a very long time for large datasets.
To avoid this, GraphDB-SE uses a special technique for handling the deletion of explicit statements and their inferences, called *smooth delete*.

h2. Algorithm

The algorithm used to identify and remove those inferred statements that can no longer be derived once the explicit statements have been deleted is as follows:
# Use forward chaining to determine what statements can be inferred from the statements marked for deletion;
# Use backward chaining to see if these statements are still supported by other means;
# Delete explicit statements and the no longer supported inferred statements.
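The three steps above can be sketched on a toy scale. The code below is not GraphDB-SE's implementation: it assumes a miniature rule set with only two rules (an {{rdfs:domain}} rule and an {{rdfs:subClassOf}} rule), it recomputes rule firings naively, and all function names are hypothetical.

```python
TYPE, DOMAIN, SUBCLASS = "rdf:type", "rdfs:domain", "rdfs:subClassOf"


def rule_instances(facts):
    """Yield (premises, conclusion) for every firing of the two toy rules."""
    for s, p, o in facts:
        for a, b, c in facts:
            # (s p o), (p rdfs:domain c)            => (s rdf:type c)
            if b == DOMAIN and a == p:
                yield ((s, p, o), (a, b, c)), (s, TYPE, c)
            # (s rdf:type o), (o rdfs:subClassOf c) => (s rdf:type c)
            if b == SUBCLASS and p == TYPE and a == o:
                yield ((s, p, o), (a, b, c)), (s, TYPE, c)


def materialise(explicit):
    """Forward-chain until no new statements can be inferred."""
    facts = set(explicit)
    while True:
        new = {c for _, c in rule_instances(facts)} - facts
        if not new:
            return facts
        facts |= new


def supported(stmt, facts, remaining, visiting=frozenset()):
    """Backward check: is stmt explicit, or derivable from supported premises?"""
    if stmt in remaining:
        return True
    if stmt in visiting:  # do not let a statement justify itself
        return False
    visiting = visiting | {stmt}
    return any(
        all(supported(p, facts, remaining, visiting) for p in premises)
        for premises, conclusion in rule_instances(facts)
        if conclusion == stmt
    )


def smooth_delete(explicit, to_delete):
    """Return the statements left after deleting to_delete and its dead inferences."""
    facts = materialise(explicit)
    remaining = set(explicit) - set(to_delete)
    # Step 1: forward chaining from the statements marked for deletion
    suspects = set(to_delete)
    frontier = set(to_delete)
    while frontier:
        reached = {c for prem, c in rule_instances(facts) if frontier & set(prem)}
        frontier = reached - suspects - remaining
        suspects |= frontier
    # Step 2: backward chaining to see which suspects are still supported
    unsupported = {s for s in suspects if not supported(s, facts, remaining)}
    # Step 3: delete the explicit statements and the unsupported inferences
    return facts - unsupported


explicit = {
    ("foaf:name", DOMAIN, "owl:Thing"),
    ("MyClass", SUBCLASS, "owl:Thing"),
    ("wayne_rooney", "foaf:name", "Wayne Rooney"),
    ("Reviewer40476", TYPE, "MyClass"),
}
after = smooth_delete(explicit, {("wayne_rooney", "foaf:name", "Wayne Rooney")})
print(("wayne_rooney", TYPE, "owl:Thing") in after)   # False
print(("Reviewer40476", TYPE, "owl:Thing") in after)  # True
```

Even in this toy version, the backward check re-evaluates rule firings for every suspect statement, which hints at why the number of statements reached in step 1 dominates the cost of a delete.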

h2. Problem

The difficulty with the current algorithm is that almost all delete operations follow inference paths that touch schema statements, which then lead to almost all other statements in the repository. This can lead to *smooth delete* taking a very long time indeed.

h2. Solution

What stops the algorithm from touching too many (possibly all) statements is that it does not go any further when a visited statement is marked read-only. Since a read-only statement cannot be deleted, there is no reason to find which statements are inferred from it (such inferred statements might still get deleted, but they will be found by following other inference paths).
Statements are marked as read-only if they occur in {{ruleset}} files (standard or custom) or are loaded at initialisation time via the {{imports}} configuration parameter.
Therefore, when using *smooth delete*, it is recommended to load all ontology/schema/vocabulary statements using the {{imports}} configuration parameter.

h2. Example

Consider the following statements:
{noformat}
<foaf:name> <rdfs:domain> <owl:Thing> .
<MyClass> <rdfs:subClassOf> <owl:Thing> .

<wayne_rooney> <foaf:name> "Wayne Rooney" .
<Reviewer40476> <rdf:type> <MyClass> .
<Reviewer40478> <rdf:type> <MyClass> .
<Reviewer40480> <rdf:type> <MyClass> .
<Reviewer40481> <rdf:type> <MyClass> .
{noformat}
When using the {{owl-horst}} rule set, the removal of the statement:
{noformat}<wayne_rooney> <foaf:name> "Wayne Rooney"{noformat}
will cause the following sequence of events:
{noformat}
x a y - (x=<wayne_rooney>, a=foaf:name, y="Wayne Rooney")
a rdfs:domain z (a=foaf:name, z=owl:Thing)
x rdf:type z - The inferred statement [<wayne_rooney> rdf:type owl:Thing] is to be removed.
x a u - (x=<wayne_rooney>, a=rdf:type, u=owl:Thing)
a rdfs:range z (a=rdf:type, z=rdfs:Class)
u rdf:type z - The inferred statement [owl:Thing rdf:type rdfs:Class] is to be removed.
x rdf:type rdfs:Class - (x=owl:Thing)
x rdfs:subClassOf x - The inferred statement [owl:Thing rdfs:subClassOf owl:Thing] is to be removed.
y q z - (y=owl:Thing, q=rdfs:subClassOf, z=owl:Thing)
p protons:transitiveOver q - (p=rdf:type, q=rdfs:subClassOf)
x p y - (x=[<Reviewer40476>, <Reviewer40478>, <Reviewer40480>, <Reviewer40481>], p=rdf:type, y=owl:Thing)
x p z - The inferred statements [<Reviewer40476> rdf:type owl:Thing], etc., are to be removed.
{noformat}
Statements {{\[<Reviewer40476> rdf:type owl:Thing\]}}, etc., exist because of the statements {{\[<Reviewer40476> rdf:type <MyClass>\]}} and {{\[<MyClass> rdfs:subClassOf owl:Thing\]}}.

In large datasets there are typically millions of statements {{\[X rdf:type owl:Thing\]}}, and they are all visited by the algorithm.
The {{\[X rdf:type owl:Thing\]}} statements are not the only problematic statements that are considered for removal. Every class that has millions of instances leads to similar behaviour.
One check to see if a statement is still supported requires around 30 query evaluations with {{owl-horst}}, hence the slow removal.
If {{\[owl:Thing rdf:type owl:Class\]}} is marked as an axiom (because it is derived from schema statements, which must themselves be axioms), then the process stops when it reaches this statement. Therefore, in the current version, the schema (the system statements) must be imported through the {{imports}} configuration parameter so that the schema statements are marked as axioms.

h2. Schema transactions

As mentioned above, ontologies and schemas imported at initialisation time using the {{imports}} configuration parameter are flagged as read-only. However, there are times when it is necessary to change a schema, and this can be done inside a 'system transaction'.
The user instructs GraphDB that the transaction is a system transaction by including a dummy statement with the special {{schemaTransaction}} predicate, i.e.
{noformat}_:b1 <> _:b2{noformat}
This statement is not inserted into the database; rather, it serves as a flag that tells GraphDB that it can ignore the read-only flag for imported statements.

h1. Optional indices

h2. Predicate lists

Predicate lists are two indices ({{SP}} and {{OP}}) that can improve performance in two separate situations:
* Loading/querying datasets that have a large number of predicates;
* Executing queries or retrieving statements that use a wildcard in the predicate position, for example using the statement pattern: {{dbpedia:Human ?predicate dbpedia:Land}}.

As a rough guideline, a dataset with more than about 1000 predicates will benefit from using these indices for both loading and query answering. Predicate list indices are not enabled by default, but can be switched on using the {{enablePredicateList}} configuration parameter.
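A quick way to apply this guideline is to count the distinct predicates in a sample of the data before deciding. The sketch below assumes the statements are already available as (subject, predicate, object) tuples; the function names and the exact threshold value are illustrative only.

```python
def distinct_predicates(triples):
    """Count the distinct predicates in an iterable of (s, p, o) tuples."""
    return len({predicate for _, predicate, _ in triples})


def suggest_predicate_list(triples, threshold=1000):
    """Apply the rough guideline: more than about 1000 predicates benefits."""
    return distinct_predicates(triples) > threshold


sample = [("s1", "p1", "o1"), ("s1", "p2", "o2"), ("s2", "p1", "o3")]
print(distinct_predicates(sample))     # 2
print(suggest_predicate_list(sample))  # False
```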

h2. Context indices

Two further indices, {{PCSO}} and {{PSOC}}, can also be used to improve performance when executing queries that use contexts. These are enabled using the {{enable-context-index}} configuration parameter.

h1. Query performance

h2. Query optimisation

GraphDB-SE uses a number of query optimisation techniques by default. These can be disabled by setting the {{enable-optimization}} configuration parameter to {{false}}, but there is rarely any need to do this.
See [GraphDB-SE Explain Plan] for a way to view query plans and applied optimisations.

h2. Caching literal language tags

This optimisation applies when the repository contains a large number of literals with language tags, and it is necessary to execute queries that filter based on language, e.g. using the following SPARQL query construct:
{{FILTER ( lang(?name) = "en" )}}
In this situation, the {{in-memory-literal-properties}} configuration parameter can be set to {{true}}, causing the data values with language tags to be cached.

h2. Not enumerating sameAs
During query answering, all URIs from each equivalence class produced by the [owl:sameAs Optimisation|GraphDB-SE Reasoner#sameAs Optimisation] are enumerated. You can use the {{onto:disable-sameAs}} pseudo-graph (see [Other special query behaviour|GraphDB-SE Query Behaviour#Other special query behaviour]) to dramatically reduce these (in effect) duplicate results, returning a single representative from each equivalence class instead.

Consider these example queries executed against the FactForge combined dataset. The default is to enumerate:
{noformat}
PREFIX dbpedia: <>
PREFIX rdfs: <>
SELECT * WHERE { ?c rdfs:subClassOf dbpedia:Airport}
{noformat}
giving many results:

If we specify the {{onto:disable-sameAs}} pseudo-graph:
{noformat}
PREFIX onto: <>
PREFIX dbpedia: <>
PREFIX rdfs: <>
SELECT * FROM onto:disable-sameAs
WHERE {?c rdfs:subClassOf dbpedia:Airport}
{noformat}
Then only two results are returned:

The *Expand results over equivalent URIs* checkbox in the GraphDB Workbench SPARQL editor plays a similar role, but the meaning is reversed.

BEWARE: if the query uses a filter over the textual representation of a URI, this may skip some valid solutions, since not all URIs within an equivalence class are matched against the filter.
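The caveat above can be illustrated with a small model. This is not how GraphDB-SE represents equivalence classes internally: the URIs are made up, and picking the first member of each class as its representative is an assumption of the sketch.

```python
# Hypothetical owl:sameAs equivalence classes (URIs are illustrative only)
equivalence_classes = [
    ["http://example.org/a/Airport", "http://example.org/b/Airport"],
    ["http://example.org/a/Harbour"],
]


def enumerate_all(classes):
    """Default behaviour: every URI of every equivalence class is a result."""
    return [uri for cls in classes for uri in cls]


def representatives(classes):
    """sameAs enumeration disabled: one representative per class."""
    return [cls[0] for cls in classes]


def uri_filter(uris, prefix):
    """A filter over the textual representation of the URIs."""
    return [uri for uri in uris if uri.startswith(prefix)]


prefix = "http://example.org/b/"
print(uri_filter(enumerate_all(equivalence_classes), prefix))
# ['http://example.org/b/Airport']
print(uri_filter(representatives(equivalence_classes), prefix))
# []
```

With enumeration disabled, the second class member is never tested against the filter, so the solution it would have matched is lost.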

h1. Strategy

The life-cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing of queries and updates. The loading of a large dataset can take a long time -- 12 hours for a billion statements with inference is not unusual. Therefore, it is often useful to use a different configuration during loading than during normal use. Furthermore, if a dataset is frequently loaded, because it changes gradually over time, then the loading configuration can be evolved as the administrator gets more familiar with the behaviour of GraphDB-SE with this dataset. Many properties of the dataset only become apparent after the initial load (such as the number of unique entities) and this information can be used to optimise the loading step the next time round or to improve the normal use configuration.

h2. Dataset loading

A typical initialisation life-cycle is like this:
# Configure a repository for best loading performance with many parameters estimated;
# Load data;
# Examine dataset properties;
# Refine loading configuration;
# Reload data and measure improvement.

Unless the repository needs to answer queries during the initialisation phase, the repository can be configured with the minimum number of options and indices, with a large portion of the available heap space given over to the cache memory:

{noformat}
enablePredicateList = false (unless the dataset has a large number of predicates)
enable-context-index = false
in-memory-literal-properties = false
cache-memory = approximately 50% of total heap space (-Xmx value)
{noformat}
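The loading configuration above can be derived from the heap size with a few lines of code. This is a sketch only: the function name is hypothetical, sizes are kept as plain megabyte numbers, and the exact value syntax accepted by each parameter is not covered here.

```python
def loading_config(heap_mb, many_predicates=False):
    """Build a minimal loading configuration for a given -Xmx value in megabytes."""
    return {
        "enablePredicateList": many_predicates,  # only for predicate-rich datasets
        "enable-context-index": False,
        "in-memory-literal-properties": False,
        "cache-memory": heap_mb // 2,  # approximately 50% of the heap, in MB
    }


config = loading_config(heap_mb=8192)
for name, value in config.items():
    print(f"{name} = {value}")
```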

h2. Normal operation

The optional indices can be built at a later time when the repository is used for query answering. The details of all optional indices, caches and optimisations have been covered previously in this document. Some experimentation is required using typical query patterns from the user environment.
The size of the data structures used to index entities is directly related to the number of unique entities in the loaded dataset. These data structures are always kept in memory. To get an upper bound on the number of unique entities loaded and to find the actual amount of RAM used to index them, some knowledge of the contents of the storage folder is useful.
Briefly, the total amount of memory needed to index entities is equal to the sum of the sizes of the files {{entities.index}} and {{entities.hash}}. This value can be used to determine how much memory is used and therefore how to divide the remaining memory between {{cache-memory}}, etc.
An upper bound on the number of unique entities is given by the size of {{entities.hash}} divided by 12 (memory is allocated in pages and therefore the last page will likely not be full).
The file {{entities.index}} is used to look up entries in the file {{entities.hash}} and its size is equal to the value of the {{entity-index-size}} parameter multiplied by 4. Therefore the {{entity-index-size}} parameter has less to do with efficient use of memory and more to do with the performance of entity indexing and lookup. The larger this value, the fewer collisions occur in the {{entities.hash}} table. A reasonable size for this parameter is at least half the number of unique entities. However, the size of this data structure is never changed once the repository is created, so this knowledge can only be used to adjust this value for the next clean load of the dataset with a new (empty) repository.
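The file-size arithmetic above can be sketched as follows. The function names and the sample sizes are hypothetical; only the file names, the divisors 12 and 4, and the "at least half" guideline come from the text.

```python
import os

BYTES_PER_ENTITY = 12  # entities.hash: 12 bytes per stored entity
BYTES_PER_SLOT = 4     # entities.index: 4 bytes per entity-index-size slot


def entity_stats(hash_bytes, index_bytes):
    """Derive entity figures from the sizes of entities.hash and entities.index."""
    max_entities = hash_bytes // BYTES_PER_ENTITY  # upper bound, last page may be partly empty
    return {
        "entity_memory_bytes": hash_bytes + index_bytes,
        "max_unique_entities": max_entities,
        "entity_index_size": index_bytes // BYTES_PER_SLOT,
        "suggested_min_entity_index_size": (max_entities + 1) // 2,
    }


def entity_stats_for(storage_dir):
    """Read the two file sizes from a repository storage folder."""
    return entity_stats(
        os.path.getsize(os.path.join(storage_dir, "entities.hash")),
        os.path.getsize(os.path.join(storage_dir, "entities.index")),
    )


# Hypothetical file sizes in bytes
stats = entity_stats(hash_bytes=240_000_000, index_bytes=40_000_000)
print(stats["max_unique_entities"])              # 20000000
print(stats["entity_index_size"])                # 10000000
print(stats["suggested_min_entity_index_size"])  # 10000000
```

If the current {{entity-index-size}} is well below the suggested minimum, the larger value can be used for the next clean load.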
The following parameters can be adjusted:
|| Parameter || Comment ||
| {{entity-index-size}} | set to a large enough value as described above |
| {{enablePredicateList}} | can speed up queries (and loading) |
| {{enable-context-index}} | |
| {{in-memory-literal-properties}} | |
| {{cache-memory}} | |
| {{tuple-index-memory}} | |
| {{predicate-memory}} | if predicate lists are enabled |
| {{fts-memory}} | if using Node Search |

Don't forget that:
{noformat}cache-memory = tuple-index-memory + predicate-memory + fts-memory{noformat}

If any one of these is omitted, it is calculated from the others. If two or more are unspecified, then the remaining cache memory is divided evenly between them.
Furthermore, the inference semantics can be adjusted by choosing a different rule set. However, this will require a reload of the whole repository; otherwise some inferences can remain when they should not.
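One possible reading of this rule is sketched below. It is a model of the arithmetic only, not GraphDB-SE's actual logic: the function name is hypothetical, sizes are plain megabyte numbers, and the sketch neither models disabled indices nor the case where {{cache-memory}} and one of its components are both missing, for which it simply raises an error.

```python
COMPONENTS = ("tuple-index-memory", "predicate-memory", "fts-memory")


def resolve_cache_memory(settings):
    """Fill in unspecified memory parameters; settings maps name -> size in MB."""
    resolved = dict(settings)
    missing = [name for name in COMPONENTS if name not in resolved]
    specified = sum(resolved[name] for name in COMPONENTS if name in resolved)

    if "cache-memory" not in resolved:
        if missing:
            raise ValueError("cache-memory and a component are both unspecified")
        resolved["cache-memory"] = specified  # the total is the sum of its parts
        return resolved

    remaining = resolved["cache-memory"] - specified
    if remaining < 0:
        raise ValueError("component sizes exceed cache-memory")
    for name in missing:
        # one missing component gets everything left, several share it evenly
        resolved[name] = remaining / len(missing)
    return resolved


print(resolve_cache_memory({"cache-memory": 1000, "tuple-index-memory": 600}))
```

In the example call, the 400MB left after {{tuple-index-memory}} is split evenly, so {{predicate-memory}} and {{fts-memory}} each receive 200MB.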