Hints about tuning the storage and retrieval performance of GraphDB
This page is intended for system administrators who are already familiar with GraphDB-SE and the Sesame OpenRDF infrastructure and who wish to configure their system for optimal performance. Best performance is typically measured by the shortest load time and the fastest query answering. Many factors affect the performance of a GraphDB-SE repository, often in different ways.
Memory configuration is the single most important factor for optimising the performance of GraphDB-SE. In every respect, the more memory available, the better the performance. The only question is how to divide the available memory between the various GraphDB-SE data structures in order to achieve the best overall behaviour.
The maximum amount of heap space used by a Java virtual machine (JVM) is specified using the -Xmx virtual machine parameter. The value should be no higher than the amount of free memory available on the target system multiplied by a factor that allows for extra runtime overhead, for example about 90%.
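The sizing rule above can be turned into a small calculation. The following is a minimal sketch only: the function names are hypothetical, and the 90% default simply mirrors the example factor in the text rather than any value recommended by GraphDB-SE.

```python
def max_heap_megabytes(free_memory_mb: int, headroom_percent: int = 90) -> int:
    """Suggest an upper bound for -Xmx, in megabytes.

    free_memory_mb   -- free memory on the target system (assumed input)
    headroom_percent -- share of free memory given to the heap; the rest is
                        left for JVM runtime overhead (90 is only an example)
    """
    if free_memory_mb <= 0:
        raise ValueError("free_memory_mb must be positive")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in the range 1..100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return free_memory_mb * headroom_percent // 100


def xmx_flag(free_memory_mb: int, headroom_percent: int = 90) -> str:
    """Format the result as a JVM command-line parameter."""
    return f"-Xmx{max_heap_megabytes(free_memory_mb, headroom_percent)}m"
```

For a machine with 16000 MB free, xmx_flag(16000) yields a flag that caps the heap at 14400 MB.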
The heap space available is used by:
In other words, the memory required for storing entities is determined by the number of entities in the dataset, where the memory required is 4 bytes per slot, allocated by entity-index-size, plus 12 bytes for each stored entity.
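As a worked example of this formula, the sketch below computes the entity storage requirement from the two figures given in the text (4 bytes per index slot and 12 bytes per stored entity). The function name and the sample numbers are illustrative assumptions.

```python
BYTES_PER_INDEX_SLOT = 4   # per slot allocated by entity-index-size
BYTES_PER_ENTITY = 12      # per entity actually stored


def entity_storage_bytes(entity_index_size: int, entity_count: int) -> int:
    """Memory needed for storing entities, following the formula in the text."""
    if entity_index_size < 0 or entity_count < 0:
        raise ValueError("sizes must be non-negative")
    return (BYTES_PER_INDEX_SLOT * entity_index_size
            + BYTES_PER_ENTITY * entity_count)
```

With an entity-index-size of 1,000,000 and 500,000 stored entities, this gives 4,000,000 + 6,000,000 = 10,000,000 bytes.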
A more complete view of memory use for GraphDB-SE is given here:
If the GraphDB-SE repository is hosted by the Sesame HTTP servlet, then the maximum heap space applies to the servlet container (Tomcat). In that case, allow some more heap memory for the runtime overhead, especially if other servlets are running at the same time. Some configuration of the servlet container might also improve performance, e.g. increasing the permanent generation, which is 64MB by default. Quadrupling it (for Tomcat) with -XX:MaxPermSize=256m might help. Further information can be found in the Tomcat documentation.
GraphDB-SE's inference policy is based on materialisation, where implicit statements are inferred from explicit statements as soon as they are inserted into the repository, using the specified semantics ruleset. This approach has the advantage that query answering can be achieved very quickly, since no inference needs to be done at query time.
The algorithm used to identify and remove those inferred statements that can no longer be derived, using the explicit statements being deleted, is as follows:
The difficulty with the current algorithm is that almost all delete operations follow inference paths that touch schema statements, which then lead to almost all other statements in the repository. This can lead to smooth delete taking a very long time indeed.
What can stop the algorithm touching too many (possibly all) statements, however, is that the algorithm will not go further if a visited statement is marked read-only. Since a read-only statement cannot be deleted, there is no reason to find what statements are inferred from it (such inferred statements might still get deleted, but they will be found by following other inference paths).
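The effect of the read-only cut-off can be illustrated with a simplified traversal. This is not GraphDB's actual delete algorithm: it is a sketch that assumes inference paths can be modelled as a plain mapping from a statement to the statements reachable from it, and all names are hypothetical.

```python
from collections import deque


def visited_statements(deleted, inference_edges, read_only):
    """Walk inference paths breadth-first from the deleted statements.

    deleted         -- iterable of statements being removed
    inference_edges -- dict: statement -> statements reachable via one
                       inference step (a simplifying assumption)
    read_only       -- set of statements flagged read-only

    A read-only statement is visited but never expanded, so nothing that
    is reachable only through it gets touched.
    """
    visited = set()
    queue = deque(deleted)
    while queue:
        statement = queue.popleft()
        if statement in visited:
            continue
        visited.add(statement)
        if statement in read_only:
            continue  # stop here: do not follow paths out of this statement
        for successor in inference_edges.get(statement, ()):
            if successor not in visited:
                queue.append(successor)
    return visited
```

In a toy graph where a deleted statement leads to a schema statement, which in turn leads to many instance statements, marking the schema statement read-only reduces the visited set to just the deleted statement and the schema statement.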
Consider the following statements:
When using the owl-horst ruleset, the removal of the statement:
will cause the following sequence of events:
Statements such as [<Reviewer40476> rdf:type owl:Thing] exist because of the statements [<Reviewer40476> rdf:type <MyClass>] and [<MyClass> rdfs:subClassOf owl:Thing].
In large datasets there are typically millions of statements [X rdf:type owl:Thing], and they are all visited by the algorithm.
As mentioned above, ontologies and schemas, imported at initialisation time using the 'imports' configuration parameter, are flagged as read-only. However, there are times when it is necessary to change a schema and this can be done inside a 'system transaction'.
This statement is not inserted into the database; rather, it serves as a flag that tells GraphDB that it can ignore the read-only flag for imported statements.
Predicate lists are two indices (SP and OP) that can improve performance in two separate situations:
As a rough guideline, a dataset with more than about 1000 predicates will benefit from using these indices for both loading and query answering. Predicate list indices are not enabled by default, but can be switched on using the enablePredicateList configuration parameter.
Two further indices, PCSO and PSOC, can also be used to provide better performance when executing queries that use contexts. These are enabled using the enable-context-index configuration parameter.
GraphDB-SE uses a number of query optimisation techniques by default. These can be disabled by setting the enable-optimization configuration parameter to false; however, there is rarely any need to do this.
This optimisation applies when the repository contains a large number of literals with language tags, and it is necessary to execute queries that filter based on language, e.g. using the following SPARQL query construct:
During query answering, all URIs from each equivalence class produced by the owl:sameAs Optimisation are enumerated. You can use the onto:disable-sameAs pseudo-graph (see Other special query behaviour) to dramatically reduce these (in effect) duplicate results, returning instead a single representative from each equivalence class.
Consider these example queries executed against the FactForge combined dataset. The default is to enumerate:
giving many results:
If we specify the onto:disable-sameAs pseudo-graph:
Then only two results are returned:
The Expand results over equivalent URIs checkbox in the GraphDB Workbench SPARQL editor plays a similar role, but the meaning is reversed.
BEWARE: if the query uses a filter over the textual representation of a URI, e.g.
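A client that builds SPARQL queries could toggle this behaviour with a single flag. The sketch below is illustrative only: the helper name and its parameters are hypothetical, and the onto: namespace <http://www.ontotext.com/> is an assumption that should be checked against the product documentation. The expand_same_as flag mirrors the sense of the Workbench checkbox, so the pseudo-graph is added only when expansion is switched off.

```python
# Assumed namespace for the onto: prefix; verify before use.
ONTO_PREFIX = "PREFIX onto: <http://www.ontotext.com/>"


def build_select_query(variables, where_body, expand_same_as=True):
    """Assemble a SELECT query, optionally adding the disable-sameAs pseudo-graph.

    variables      -- list of variable names such as ["?s", "?o"]
    where_body     -- graph pattern placed inside WHERE { ... }
    expand_same_as -- True enumerates all equivalent URIs (the default
                      behaviour); False requests one representative per class
    """
    lines = []
    if not expand_same_as:
        lines.append(ONTO_PREFIX)
    lines.append("SELECT " + " ".join(variables))
    if not expand_same_as:
        lines.append("FROM onto:disable-sameAs")
    lines.append("WHERE { " + where_body + " }")
    return "\n".join(lines)
```

Calling build_select_query(["?s"], "?s ?p ?o", expand_same_as=False) produces a query whose dataset clause names the pseudo-graph, while the default call leaves the query untouched.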
this may skip some valid solutions since not all URIs within an equivalence class are matched against the filter.
The life-cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing of queries and updates. The loading of a large dataset can take a long time – 12 hours for a billion statements with inference is not unusual. Therefore, it is often useful to use a different configuration during loading than during normal use. Furthermore, if a dataset is frequently loaded, because it changes gradually over time, then the loading configuration can be evolved as the administrator gets more familiar with the behaviour of GraphDB-SE with this dataset. Many properties of the dataset only become apparent after the initial load (such as the number of unique entities) and this information can be used to optimise the loading step the next time round or to improve the normal use configuration.
A typical initialisation life-cycle is like this:
Unless the repository needs to answer queries during the initialisation phase, the repository can be configured with the minimum number of options and indices, with a large portion of the available heap space given over to the cache memory:
The optional indices can be built at a later time when the repository is used for query answering. The details of all optional indices, caches and optimisations have been covered previously in this document. Some experimentation is required using typical query patterns from the user environment.
Don't forget that:
If any one of these is omitted, it will be calculated. If two or more are unspecified, then the remaining cache memory is divided evenly among them.
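The division rule can be sketched as follows. The setting names used here are placeholders, not real GraphDB-SE parameters, and the use of integer division for the even split is an assumption made for the illustration.

```python
def divide_cache_memory(total_mb, settings):
    """Fill in unspecified cache sizes from the memory that is left over.

    total_mb -- total cache memory to distribute
    settings -- dict: setting name -> size in MB, or None if unspecified
                (names are hypothetical placeholders)
    """
    specified_total = sum(v for v in settings.values() if v is not None)
    remaining = total_mb - specified_total
    if remaining < 0:
        raise ValueError("specified sizes exceed the total cache memory")
    missing = [name for name, value in settings.items() if value is None]
    result = dict(settings)
    if missing:
        share = remaining // len(missing)  # even split (assumed rounding)
        for name in missing:
            result[name] = share
    return result
```

With 1000 MB in total and one setting fixed at 400 MB, a single unspecified setting receives the remaining 600 MB, while two unspecified settings receive 300 MB each.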