GraphDB-SE Performance Tuning

Version 1 by barry.bishop on Jun 01, 2012 10:14, compared with Version 2 by dimitar.manov on Jul 17, 2014 17:29.

This section is intended for system administrators who are already familiar with GraphDB-SE and the Sesame openRDF infrastructure and who wish to configure their system for optimal performance. Best performance is typically measured by the shortest load time and fastest query answering. Many factors affect the performance of a GraphDB-SE repository in many different ways. This section brings together all the factors that affect performance, however it is measured.

{toc}
h1. Memory configuration

Memory configuration is the single most important factor for optimising the performance of GraphDB-SE. In every respect, the more memory available, the better the performance. The only question is how to divide the available memory between the various GraphDB-SE data structures in order to achieve the best overall behaviour.

h2. Setting the maximum Java heap space

The maximum amount of heap space used by a Java virtual machine (JVM) is specified using the {{\-Xmx}} virtual machine parameter. The value should be no higher than the amount of free memory available in the target system, multiplied by some factor to allow for extra runtime overhead, say approximately \~90%.
For example, if a system has 16GB total RAM and 1GB is used by the operating system, services, etc., then ideally the JVM that hosts the application using GraphDB-SE would have a maximum heap size of 15GB (16-1), set using the JVM argument: {{\-Xmx15g}}
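The sizing arithmetic in the example above can be sketched as follows. This is only a rule-of-thumb helper based on the subtraction in this section; the function name and figures are illustrative, not GraphDB-SE parameters:

```python
def max_heap_gb(total_ram_gb, os_overhead_gb):
    """Suggest a -Xmx value: all RAM left after OS/services overhead."""
    return total_ram_gb - os_overhead_gb

# 16 GB machine, ~1 GB used by the operating system and services
print(f"-Xmx{max_heap_gb(16, 1)}g")  # -Xmx15g
```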

h2. Data structures

The heap space available is used by:
* the JVM, the application and the GraphDB-SE workspace (byte code, stacks, etc.)
* data structures for storing entities, whose size is controlled by the {{entity-index-size}} parameter
* data structures for indexing statements, whose size is controlled by the {{cache-memory}} parameter
Simplistically, the memory required for storing entities is determined by the number of entities in the dataset, where the memory required is 4 bytes per slot allocated by {{entity-index-size}} plus 12 bytes for each stored entity.
However, the memory required for the indices (cache types) depends on which indices are being used. The {{SPO}} and {{PSO}} indices are always used. Optional indices include {{predicateLists}}, the context indices {{PCSO}} / {{PSOC}}, and the FTS (full-text search) indices.
The memory allocated to these cache types can be calculated automatically by GraphDB-SE; some of them can also be specified in a more fine-grained way using the relevant configuration parameters.
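The entity storage formula above (4 bytes per slot plus 12 bytes per stored entity) can be checked with a quick sketch. The slot and entity counts here are illustrative examples, not measurements of GraphDB-SE:

```python
def entity_memory_bytes(entity_index_size, num_entities):
    # 4 bytes per slot allocated by entity-index-size,
    # plus 12 bytes for each stored entity
    return 4 * entity_index_size + 12 * num_entities

# e.g. a 10 million slot entity index holding 8 million entities
total = entity_memory_bytes(10_000_000, 8_000_000)
print(f"{total / (1024 * 1024):.0f} MB")
```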


A more complete view of memory use for GraphDB-SE is given here:
!memory_layout.png!

h2. Running in a servlet container

If the GraphDB-SE repository is being hosted by the Sesame HTTP servlet, then the maximum heap space applies to the servlet container (Tomcat). In this case, allow some more heap memory for the runtime overhead, especially if other servlets are running at the same time. Some configuration of the servlet container might also improve performance, e.g. increasing the permanent generation, which is 64MB by default. Quadrupling it (for Tomcat) with {{\-XX:MaxPermSize=256m}} might help. Further information can be found in the Tomcat documentation.

h1. Delete operations

GraphDB-SE's inference policy is based on materialisation, where implicit statements are inferred from explicit statements as soon as they are inserted into the repository, using the semantics specified by the {{ruleset}} parameter. This approach has the advantage that query answering can be achieved very quickly, since no inference needs to be done at query time.
However, no justification information is stored for inferred statements, therefore deleting a statement would normally require a full re-computation of all inferred statements, which can take a very long time for large datasets.
GraphDB-SE uses a special technique, called *smooth delete*, for handling the deletion of explicit statements and their inferences.
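As an illustration of materialisation, the following is a minimal forward-chaining sketch, not GraphDB-SE's actual rule engine: the transitive closure of {{rdfs:subClassOf}} is computed eagerly, so the inferred statement is already stored when a query arrives.

```python
def materialise_subclass(explicit):
    """Forward-chain rdfs:subClassOf transitivity to a fixpoint."""
    inferred = set(explicit)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(inferred):
            for (c, d) in list(inferred):
                # transitivity rule: (a subClassOf b), (b subClassOf d)
                # entails (a subClassOf d)
                if b == c and (a, d) not in inferred:
                    inferred.add((a, d))
                    changed = True
    return inferred

explicit = {("Dog", "Mammal"), ("Mammal", "Animal")}
closure = materialise_subclass(explicit)
# ("Dog", "Animal") is materialised, so query answering needs no inference
print(("Dog", "Animal") in closure)  # True
```

Deleting ("Mammal", "Animal") would invalidate ("Dog", "Animal"), but since no justification is stored, a naive delete must recompute the whole closure; smooth delete avoids this.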

h2. Schema transactions

Ontologies and schemas imported at initialisation time using the 'imports' configuration parameter are flagged as read-only. However, there are times when it is necessary to change a schema, and this can be done inside a 'system transaction'.
The user instructs GraphDB that the transaction is a system transaction by including a dummy statement with the special {{schemaTransaction}} predicate, i.e.
{noformat}
_:b1 <http://www.ontotext.com/owlim/system#schemaTransaction> ""
{noformat}
This statement is not inserted into the database; rather, it serves as a flag that tells GraphDB that it can ignore the read-only flag for imported statements.
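A schema change can therefore be submitted as one update that includes the marker statement. The sketch below only builds the SPARQL update string; the example class URIs are hypothetical, and only the {{schemaTransaction}} predicate comes from this section:

```python
SCHEMA_TX = "http://www.ontotext.com/owlim/system#schemaTransaction"

def schema_update(schema_triples):
    """Build a SPARQL update whose dummy statement marks it as a
    system (schema) transaction, allowing read-only imported
    statements to be changed."""
    body = "\n    ".join(schema_triples)
    return (
        "INSERT DATA {\n"
        f'    _:b1 <{SCHEMA_TX}> "" .\n'
        f"    {body}\n"
        "}"
    )

update = schema_update([
    "<http://example.com/NewClass> "
    "<http://www.w3.org/2000/01/rdf-schema#subClassOf> "
    "<http://example.com/Thing> ."
])
print(update)
```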

h1. Optional indices
h2. Query optimisation

GraphDB-SE uses a number of query optimisation techniques by default. These can be disabled by setting the {{enable-optimization}} configuration parameter to {{false}}, however there is rarely any need to do this.

h2. {{owl:sameAs}} optimisation

The presence of many {{owl:sameAs}} statements -- such as when using several LOD datasets and link sets -- causes an explosion in the number of inferred statements. For a simple example, if A is a city in country X, B and C are alternative names for A, and Y and Z are alternative names for X, then the inference engine should also infer: B in X, C in X, B in Y, C in Y, B in Z and C in Z.
As described in the GraphDB-SE user guide, GraphDB-SE avoids the inferred statement explosion caused by having many {{owl:sameAs}} statements by grouping equivalent URIs into a single master node and using this for inference and statement retrieval. This is in effect a kind of backward chaining that allows all the sound and complete statements to be computed at query time.
This optimisation can save a large amount of space:
# A single node is used for all N URIs in an equivalence class, which avoids storing N {{owl:sameAs}} statements;
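The effect can be sketched numerically for the city example above. The function below is an illustrative count, assuming one equivalence class of city names and one of country names:

```python
def inferred_in_statements(num_city_names, num_country_names):
    """Without the owl:sameAs optimisation, every (city alias,
    country alias) pair yields a materialised 'in' statement;
    with a single master node per equivalence class, one stored
    statement covers them all."""
    return num_city_names * num_country_names

# city A with aliases B, C (3 names); country X with aliases Y, Z (3 names)
print(inferred_in_statements(3, 3))  # 9 statements, versus 1 with the optimisation
```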
h2. Long transitive chains

SwiftOWLIM version 2.9 contained a special optimisation that prevented the materialisation of inferred statements resulting from transitive chains. Instead, these inferences were computed during query answering. However, such an optimisation is NOT available in GraphDB-SE due to the nature of the indexing structures. Therefore, GraphDB-SE will attempt to materialise all inferred statements at load time. When a transitive chain is long, this can cause a very large number of inferences to be made. For example, for a chain of N {{rdfs:subClassOf}} relationships, GraphDB-SE will infer (and materialise) a further (N{^}2^\-N)/2 statements. If the relationship is also symmetric, e.g. in a family ontology with a predicate such as relatedTo, then there will be N{^}2^\-N inferred statements.
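The counts above are easy to check with a sketch, where N is the chain length from the text:

```python
def chain_inferences(n, symmetric=False):
    """Inferred statements materialised for a chain of n transitive
    relationships: (n^2 - n) / 2, doubled when also symmetric."""
    base = (n * n - n) // 2
    return 2 * base if symmetric else base

# sanity check: A < B < C < D (n = 3) infers A < C, A < D, B < D
print(chain_inferences(3))          # 3
print(chain_inferences(1000))       # 499500
print(chain_inferences(1000, True)) # 999000
```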
Administrators should therefore take great care when managing datasets that have long chains of transitive relationships. If performance becomes a problem, then it may be necessary to:
# Modify the schema, either by removing the symmetry of certain transitive relationships or by removing the transitive nature of certain properties altogether
h1. Strategy

The life-cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing of queries and updates. The loading of a large dataset can take a long time - 12 hours for a billion statements with inference is not unusual. Therefore, it is often useful to use a different configuration during loading than during normal use. Furthermore, if a dataset is loaded frequently, because it changes gradually over time, then the loading configuration can be evolved as the administrator becomes more familiar with the behaviour of GraphDB-SE with that dataset. Many properties of the dataset only become apparent after the initial load (such as the number of unique entities) and this information can be used to optimise the loading step the next time round, or to improve the normal-use configuration.

h2. Dataset loading