This section is intended for system administrators who are already familiar with OWLIM-SE and the Sesame openRDF infrastructure, and who wish to configure their system for optimal performance. Best performance is typically measured by the shortest load time and fastest query answering. Many factors affect the performance of an OWLIM-SE repository in many different ways. This section is an attempt to bring together all the factors that affect performance, however it is measured.
- Memory configuration
- Delete operations
- Optional indices
- Query performance
- Reasoning complexity
Memory configuration is the single most important factor in optimising the performance of OWLIM-SE. In every respect, the more memory available, the better the performance. The only question is how to divide the available memory between the various OWLIM-SE data structures in order to achieve the best overall behaviour.
The maximum amount of heap space used by a Java virtual machine (JVM) is specified using the -Xmx virtual machine parameter. The value should be no higher than the amount of free memory in the target system multiplied by some factor to allow for extra runtime overhead, say approximately 90%.
For example, if a system has 16GB total RAM and 1GB is used by the operating system, services, etc., then ideally the JVM that hosts the application using OWLIM-SE would have a maximum heap size of 15GB (16-1), set using the JVM argument: -Xmx15g
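The sizing rule above can be expressed as a small helper (a sketch; the function name and the 90% default are illustrative, not part of OWLIM-SE):

```python
def suggest_xmx_gb(total_ram_gb, os_used_gb=0, safety_factor=0.9):
    """Rough -Xmx suggestion: free RAM scaled by a safety factor
    to allow for extra runtime overhead (the ~90% rule above)."""
    free_gb = total_ram_gb - os_used_gb
    return int(free_gb * safety_factor)

# When the memory used by the OS and services is known precisely,
# as in the 16GB/1GB example, the safety factor can be set to 1.0:
print(f"-Xmx{suggest_xmx_gb(16, 1, safety_factor=1.0)}g")  # -Xmx15g
```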
The heap space available is used by:
- the JVM, the application and OWLIM-SE workspace (byte code, stacks, etc)
- data structures for storing entities, whose size is affected by the entity-index-size parameter
- data structures for indexing statements, whose size is specified using cache-memory
Simplistically, the memory required for storing entities is determined by the number of entities in the dataset: 4 bytes for each slot allocated according to entity-index-size, plus 12 bytes for each entity actually stored.
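Based on these figures, the entity-storage memory can be estimated with a short helper (a sketch using only the per-slot and per-entity sizes quoted in this section; the example values are illustrative):

```python
def entity_memory_bytes(entity_index_size, num_entities):
    """Memory for the entity data structures: 4 bytes per index slot
    (entity-index-size) plus 12 bytes per entity actually stored."""
    return 4 * entity_index_size + 12 * num_entities

# e.g. 10 million index slots and 8 million entities:
mb = entity_memory_bytes(10_000_000, 8_000_000) / (1024 * 1024)
print(f"{mb:.1f} MB")  # roughly 129.7 MB
```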
However, the memory required for the indices (cache types) depends on which indices are being used. The SPO and PSO indices are always used. Optional indices include predicateLists, PCSOT, PTSOC and FTS (full-text search) indices.
The memory allocated to these cache types can be calculated automatically by OWLIM-SE; however, some of them can be specified in a more fine-grained way using the relevant configuration parameters.
A more complete view of memory use for OWLIM-SE is given here:
During the initial load, some speed-up can be achieved by switching journaling off, i.e. setting the journaling configuration parameter to false. All this does is stop the transaction log from being written to disk (for recovery in case of failure), but a noticeable speed-up is possible. During normal use, it is recommended to enable journaling for more reliable recovery after an unexpected termination.
If the OWLIM-SE repository is hosted by the Sesame HTTP servlet, then the maximum heap space applies to the servlet container (Tomcat). In this case, allow some extra heap memory for the runtime overhead, especially if other servlets are running at the same time. Some configuration of the servlet container might also improve performance, e.g. increasing the permanent generation, which by default is 64MB. Quadrupling it (for Tomcat) with -XX:MaxPermSize=256m might help. Further information can be found in the Tomcat documentation.
OWLIM-SE's inference policy is one of total materialisation, where implicit statements are inferred from explicit statements as soon as they are inserted into the repository, using the specified semantics ruleset. This approach has the advantage that query answering can be achieved very quickly, since no inference needs to be done at query time.
However, no justification information is stored for inferred statements, therefore deleting a statement would normally require a full re-computation of all inferred statements, which can take a very long time for large datasets.
OWLIM-SE uses a special technique, called smooth delete, for handling the deletion of explicit statements and their inferences. This is switched on by default, but can be disabled by setting the enableSmoothDelete configuration parameter to false.
The algorithm used to identify and remove those inferred statements that can no longer be derived using the explicit statements being deleted is as follows:
- Use forward chaining to determine what statements can be inferred from the statements marked for deletion
- Use backward chaining to see if these statements are still supported by other means
- Delete explicit statements and the no longer supported inferred statements
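The three steps above can be sketched as Python-flavoured pseudocode (a simplified illustration; the store methods used here are hypothetical, not OWLIM-SE's actual API):

```
def smooth_delete(store, statements_to_delete):
    # 1. Forward chaining: collect everything inferable from the
    #    statements marked for deletion, stopping at read-only
    #    (axiom/schema) statements, which can never be deleted.
    candidates = set()
    frontier = list(statements_to_delete)
    while frontier:
        st = frontier.pop()
        if store.is_read_only(st):
            continue  # do not expand axioms
        for inferred in store.forward_chain(st):
            if inferred not in candidates:
                candidates.add(inferred)
                frontier.append(inferred)

    # 2. Backward chaining: keep only the candidates that are no
    #    longer derivable from the remaining explicit statements.
    unsupported = [st for st in candidates
                   if not store.still_derivable(st, excluding=statements_to_delete)]

    # 3. Remove the explicit statements and the unsupported inferences.
    store.remove(statements_to_delete)
    store.remove(unsupported)
```

The read-only check in step 1 is what keeps the algorithm from touching most of the repository, which is why marking schema statements as axioms (via imports) matters so much.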
The difficulty with the current algorithm is that almost all delete operations will follow inference paths that touch schema statements, which then lead to almost all other statements in the repository. This can lead to smooth delete taking a very long time indeed.
What can stop the algorithm touching too many (possibly all) statements, however, is that the algorithm will not go further if a visited statement is marked read-only. Since a read-only statement cannot be deleted, there is no reason to find what statements are inferred from it (such inferred statements might still get deleted, but they will be found by following other inference paths).
Statements are marked as read-only if they occur in ruleset files (standard or custom) or are loaded at initialisation time via the imports configuration parameter.
Therefore, when using smooth delete, it is recommended to load all ontology/schema/vocabulary statements using the imports configuration parameter.
Consider the following statements:
[&lt;MyClass&gt; rdfs:subClassOf owl:Thing]
[&lt;Reviewer40476&gt; rdf:type &lt;MyClass&gt;]
When using the owl-horst ruleset, the removal of the statement [&lt;Reviewer40476&gt; rdf:type &lt;MyClass&gt;] will cause the following sequence of events:
Statements [<Reviewer40476> rdf:type owl:Thing], etc, exist because of the statements [<Reviewer40476> rdf:type <MyClass>] and [<MyClass> rdfs:subClassOf owl:Thing].
In large datasets there are typically millions of statements [X rdf:type owl:Thing], and they will all be visited by the algorithm.
The [X rdf:type owl:Thing] statements are not the only problematic statements that will be considered for removal. Every class that has millions of instances will lead to similar behaviour.
One check to see if a statement is still supported requires around 30 query evaluations with owl-horst, hence the slow removal.
If [owl:Thing rdf:type owl:Class] were marked as an axiom (being derived from schema statements, which must be axioms), then the process would stop when reaching this statement. So, in the current version, the schema (system) statements must be imported through the imports configuration parameter in order to mark them as axioms.
Predicate lists are two indices (SP and OP) that can improve performance in two separate situations:
- Loading/querying datasets that have a large number of predicates
- Executing queries or retrieving statements that use a wildcard in the predicate position, for example using the statement pattern: dbpedia:Human ?predicate dbpedia:Land
As a rough guideline, a dataset with more than about 1000 predicates will benefit from using these indices for both loading and query answering. Predicate list indices are not enabled by default, but can be switched on using the enablePredicateList configuration parameter.
Two further indices can also be used for providing better performance when executing queries that use context and/or tripleset ids. (Tripleset ids are not exposed through the Sesame interface, so can be ignored).
The PCSOT index can be enabled using the build-pcsot configuration parameter. This index will improve performance when using statement patterns where the context and subject are bound, e.g.
?s skos:broader ?o onto:dataset1
The PTSOC index can be enabled with the build-ptsoc configuration parameter. This index will improve performance when using statement patterns where the predicate and subject are bound, e.g.
dbpedia:Human rdfs:subClassOf ?superclass ?context
OWLIM-SE uses a number of query optimisation techniques by default. These can be disabled by setting the enable-optimization configuration parameter to false; however, there is rarely any need to do this.
This optimization applies when the repository contains a large number of literals with language tags and it is necessary to execute queries that filter based on language, e.g. using the following SPARQL query construct:
FILTER ( lang(?name) = "ES" )
In this situation, the in-memory-literal-properties configuration parameter can be set to true, causing the data values with language tags to be cached.
The presence of many owl:sameAs statements – such as when using several LOD datasets and link sets – causes an explosion in the number of inferred statements. For a simple example, if A is a city in country X, and B and C are alternative names for A, and Y and Z are alternative names for X, then the inference engine should also infer: B in X, C in X, B in Y, C in Y, B in Z, C in Z.
As described in the OWLIM-SE user guide, OWLIM-SE avoids the explosion of inferred statements caused by having many owl:sameAs statements by grouping equivalent URIs into a single master node, which is used for inference and statement retrieval. This is in effect a kind of backward chaining that allows all the sound and complete statements to be computed at query time.
This optimisation can save a large amount of space for two reasons:
- A single node is used for all N URIs in an equivalence class, which avoids storing N owl:sameAs statements;
- If there are N equivalent URIs in one equivalence class, then the reasoning engine would otherwise infer that all URIs in this class are equivalent to each other (and themselves), i.e. another N² owl:sameAs statements can be avoided.
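The savings can be quantified with a short sketch (an illustration of the two counts above for a single equivalence class of N URIs; the function name is invented here):

```python
def sameas_statements_avoided(n):
    """owl:sameAs statements avoided by collapsing an equivalence class
    of n URIs into one master node: n explicit links, plus roughly n*n
    statements of the reflexive/symmetric/transitive closure."""
    return n + n * n

# A class of 5 equivalent URIs avoids about 30 owl:sameAs statements:
print(sameas_statements_avoided(5))  # 30
```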
During query answering, all members of each equivalence class appearing in a query are substituted in order to generate sound and complete query results. The mechanism for storing equivalences is built in and cannot be switched off; however, it is possible to prevent the enumeration of equivalence classes during query answering, so that a single URI from each equivalence class is chosen as its representative. When using a dataset with many owl:sameAs statements, turning off the enumeration can dramatically reduce the number of duplicated query results.
To turn off the enumeration of equivalent URIs, a special pseudo-graph name can be used:
FROM/FROM NAMED <http://www.ontotext.com/disable-SameAs>
The same query can be executed with and without this special graph name, e.g. against the FactForge combined dataset; the version with the pseudo-graph prevents equivalence class enumeration and returns far fewer duplicated results.
The complexity of the rule set has a large effect on loading performance and on the overall size of the repository after loading. The complexity of the standard rule sets increases as follows:
- none (lowest complexity, best performance)
- owl2-rl (highest complexity, worst performance)
It should be noted that all rules affect the loading speed, even if they never actually infer any new statements. This is because, as new statements are added, they are pushed into every rule to see if anything can be inferred. Often this can result in many joins being computed even though the rule never 'fires'.
If better load performance is required and it is known that the dataset does not contain anything that will apply to certain rules then they can be omitted from the ruleset. To do this, copy the appropriate '.pie' file included in the distribution and remove the unused rules. Then set the ruleset configuration parameter to the full pathname to this custom rule set.
If custom rules are being used to specify semantics not covered by the included standard rulesets, then some care must be taken for the following reasons:
- Recursive rules can lead to an explosion in the number of inferred statements
- Rules with unbound variables in the head cause new blank nodes to be created – the inferred statements can never be retracted and can cause other rules to fire
SwiftOWLIM version 2.9 contained a special optimisation that prevented the materialisation of inferred statements resulting from transitive chains; instead, these inferences were computed during query answering. Such an optimisation is NOT available in OWLIM-SE due to the nature of its indexing structures, so OWLIM-SE will attempt to materialise all inferred statements at load time. When a transitive chain is long, this can cause a very large number of inferences to be made. For example, for a chain of N rdfs:subClassOf relationships, OWLIM-SE will infer (and materialise) a further (N²-N)/2 statements. If the relationship is also symmetric, e.g. in a family ontology with a predicate such as relatedTo, then there will be N²-N inferred statements.
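These counts can be checked with a short sketch (assuming a simple chain of N transitive links a1 → a2 → … between N+1 nodes):

```python
def inferred_transitive(n):
    """Extra statements materialised for a chain of n transitive
    relationships: (n*n - n) / 2, as stated above."""
    return (n * n - n) // 2

def inferred_symmetric_transitive(n):
    """If the property is also symmetric, the count doubles: n*n - n."""
    return n * n - n

# A chain of 1000 rdfs:subClassOf links materialises ~500k extra statements:
print(inferred_transitive(1000))            # 499500
print(inferred_symmetric_transitive(1000))  # 999000
```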
Administrators should therefore take great care when managing datasets that have long chains of transitive relationships. If performance becomes a problem then it may be necessary to:
- Modify the schema, either by removing the symmetry of certain transitive relationships or by removing the transitive nature of certain properties altogether
- Reduce the complexity of inference by choosing a less sophisticated ruleset
The life-cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing of queries and updates. The loading of a large dataset can take a long time - 12 hours for a billion statements with inference is not unusual. Therefore, it is often useful to use a different configuration during loading than during normal use. Furthermore, if a dataset is loaded frequently, because it changes gradually over time, then the loading configuration can be evolved as the administrator becomes more familiar with the behaviour of OWLIM-SE on this dataset. Many properties of the dataset only become apparent after the initial load (such as the number of unique entities), and this information can be used to optimise the loading step the next time round, or to improve the normal-use configuration.
A typical initialisation life-cycle would be like this:
- Configure a repository for best loading performance with many parameters estimated
- Load data
- Examine dataset properties
- Refine loading configuration
- Reload data and measure improvement
Unless the repository needs to answer queries during the initialisation phase, the repository can be configured with the minimum number of options and indices, with a large portion of the available heap space given over to the cache memory:
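A load-optimised parameter set might look like the following sketch (the parameter names are those discussed in this section; the values are illustrative assumptions, not recommendations):

```python
# Sketch of a load-time configuration: minimal indices, journaling off,
# and a large share of the heap given to cache-memory.
load_config = {
    "entity-index-size": "200000000",  # estimate generously for a first load
    "cache-memory": "10g",             # most of the available heap
    "journaling": "false",             # faster initial load (see above)
    "enablePredicateList": "false",    # optional indices switched off...
    "build-pcsot": "false",
    "build-ptsoc": "false",            # ...and built later if needed
}

for name, value in load_config.items():
    print(f"{name} = {value}")
```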
The optional indices can be built at a later time when the repository is used for query answering. The details of all optional indices, caches and optimisations have been covered previously in this document. Some experimentation is required using typical query patterns from the user environment.
The size of the data structures used to index entities is directly related to the number of unique entities in the loaded dataset. These data structures are always kept in memory. In order to get an upper bound on the number of unique entities loaded, and to find the actual amount of RAM used to index them, some knowledge of the contents of the storage folder is useful.
Briefly, the total amount of memory needed to index entities is equal to the sum of the sizes of the files entities.index and entities.hash. This value can be used to determine how much memory is used, and therefore how to divide the remaining memory between cache-memory, etc.
An upper bound on the number of unique entities is given by the size of entities.hash divided by 12 (memory is allocated in pages and therefore the last page will likely not be full).
The file entities.index is used to look up entries in the file entities.hash, and its size is equal to the value of the entity-index-size parameter multiplied by 4. Therefore the entity-index-size parameter has less to do with efficient use of memory and more to do with the performance of entity indexing and lookup: the larger this value, the fewer collisions occur in the entities.hash table. A reasonable size for this parameter is at least half the number of unique entities. However, the size of this data structure is never changed once the repository is created, so this knowledge can only be used to adjust the value for the next clean load of the dataset into a new (empty) repository.
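The file-size relationships above can be combined into a small sketch (the file names and the 12-byte/4-byte figures are those quoted in this section; the function names are invented here):

```python
def entity_count_upper_bound(entities_hash_bytes):
    """Upper bound on unique entities: entities.hash size / 12 bytes
    per entry (the last allocated page is likely not full)."""
    return entities_hash_bytes // 12

def entities_index_bytes(entity_index_size):
    """Size of entities.index: 4 bytes per slot of entity-index-size."""
    return 4 * entity_index_size

def suggested_entity_index_size(unique_entities):
    """Rule of thumb above: at least half the number of unique entities."""
    return unique_entities // 2

# e.g. a 120 MB entities.hash file:
n = entity_count_upper_bound(120_000_000)
print(n, suggested_entity_index_size(n))  # 10000000 5000000
```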
The following parameters can be adjusted:
- entity-index-size: set to a large enough value, as described above
- enablePredicateList: can speed up queries (and loading)
- journaling: should be set to true
- predicate-memory: if predicate lists are enabled
- fts-memory: if using Node Search
Don't forget that if any one of the memory parameters is left unspecified, it will be calculated automatically; if two or more are unspecified, then the remaining cache memory is divided evenly between them.
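The fallback behaviour can be sketched as follows (a simplified illustration of the division logic described above; the parameter names are taken from this section, the values are illustrative):

```python
def divide_cache_memory(cache_memory_mb, explicit):
    """Fill in unspecified cache sizes: the memory left over after the
    explicitly sized caches is split evenly among the rest."""
    remaining = cache_memory_mb - sum(v for v in explicit.values() if v is not None)
    unspecified = [k for k, v in explicit.items() if v is None]
    share = remaining // len(unspecified) if unspecified else 0
    return {k: (v if v is not None else share) for k, v in explicit.items()}

# e.g. 1024 MB of cache-memory with only predicate-memory fixed:
print(divide_cache_memory(1024, {"predicate-memory": 512,
                                 "fts-memory": None,
                                 "tuple-index-memory": None}))
```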
Furthermore, the inference semantics can be adjusted by choosing a different ruleset. However, this requires a reload of the whole repository, otherwise some inferences may remain when they should not.