{toc}
h1. Comparison of GraphDB-SE and GraphDB-Lite
The major differences between GraphDB-SE and GraphDB-Lite are their performance and scalability. Both GraphDB editions deliver identical functionality for RDF storage, inference and query answering and they both implement Sesame's SAIL APIs. This guarantees that all essential functions of a semantic repository are supported by GraphDB in a standard, consistent, and interoperable manner.
Compared to GraphDB-Lite, GraphDB-SE adds features and improves many aspects of performance, in most cases through special indexing strategies that allow more efficient retrieval. Functionally, the differences fall into two groups:
* Do the same better: the corresponding feature does not allow the user to do more things with GraphDB-SE -- it rather makes it work better in specific circumstances. This is the case with [predicate lists|GraphDB-SE Indexing Specifics#Predicate Lists] and the [owl:sameAs optimisation|GraphDB-SE Reasoner].
* Do more: deliver a new type of functionality, which is not available in GraphDB-Lite. Examples are [RDF ranking|GraphDB-SE RDF Rank] and [RDF priming|GraphDB-SE Experimental Features]. [Full-text search features|GraphDB-SE Full-text Search] are also available in GraphDB-SE, although this could be seen as an enhancement, since similar but less powerful behaviour is available by using regular expression constraints (which disregard tokenisation and deliver different results).
In the 'do more' category, GraphDB-SE delivers functionality that is not exposed by the Sesame API. Typically, this is achieved with the use of special-purpose system predicates. One should be aware that using the 'do more' features will affect compatibility with other semantic repositories.
h1. Persistence Strategy
GraphDB-SE stores all of its data (statements, indexes, entity pool, etc.) in files in the configured storage directory, usually called 'storage'. The content and names of these files are not defined and are subject to change between versions. In general, the index structures used in GraphDB-SE are chosen and optimised to allow for efficient:
* handling of billions of statements under reasonable RAM constraints
* query optimisation
* transaction management
GraphDB-SE maintains two main indices on statements for use in inference and query evaluation: the predicate-object-subject (POS) index and the predicate-subject-object (PSO) index. Many other data structures are used to enable the efficient manipulation of RDF data, but these are not listed, since these internal mechanisms cannot be configured.
The following subsections describe several indexing options, which deliver considerable advantages for specific datasets, retrieval patterns and query loads. Most of these are switched off by default, thus the user should take the initiative to switch them on as necessary. Unless otherwise stated, GraphDB-SE allows one to switch indices on and off against an already populated repository; the repository should be shut down before the change of the configuration is specified. The next time the repository is started, GraphDB-SE will create or remove the corresponding index. In the case that the repository is already loaded with a large volume of data, switching on a new index can lead to considerable delays during initialisation -- this is the time required for building the new index.
h1. Transaction Mode
There are two transaction mechanisms in GraphDB. The default (safe) mode causes all updates to be flushed to disk as part of the commit operation. The ordering of updated pages in the index files and the sequence used to write them to the file-system mean that they are consistent with the state of the database prior to the update in the event of an abnormal termination. In other words, rollback is natively supported should the application crash, and recovery after such an event is instant. Also, the method for updating data structures (copy of page index and copy-on-write of pages) means that a high level of concurrency is supported between updates and queries.
In bulk-loading (fast) mode, updated pages are not automatically flushed to disk and remain in memory until the cache is exhausted and further pages are required. Only then are the least recently used dirty pages swapped to disk. This can be significantly faster than safe mode when updating using a single thread, but no guarantees of data durability are made in this mode. If a crash occurs, then data will be lost. The intention of this mode is to speed up regular bulk-loading in situations where query loads are negligible or non-existent. Query and update concurrency in this mode is not as sophisticated as in safe mode.
{warning}In fast mode, it is vitally important to shut down repository connections properly to ensure that unwritten data is flushed to the file-system. If the database is not shut down properly for any reason, then data corruption is assumed to have occurred and GraphDB will refuse to start with the same disk image.{warning}
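The fast-mode behaviour described above can be pictured with a small Java model. This is only a sketch under stated assumptions: the class {{FastModeCacheSketch}}, its capacity and the list standing in for disk writes are all hypothetical and do not reflect GraphDB's actual page cache. It shows dirty pages staying in memory until the cache is full, the least recently used page being written out first, and the remaining pages being written only on a proper shutdown.

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of fast-mode page caching; not GraphDB code. */
final class FastModeCacheSketch {
    // Page ids in the order they were "written to disk" (stand-in for real I/O)
    private final List<Integer> flushedToDisk = new ArrayList<>();
    private final LinkedHashMap<Integer, byte[]> dirtyPages;

    FastModeCacheSketch(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used
        this.dirtyPages = new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                if (size() > capacity) {
                    flushedToDisk.add(eldest.getKey()); // cache exhausted: swap out LRU page
                    return true;
                }
                return false;
            }
        };
    }

    /** Records an updated page; nothing is written unless the cache overflows. */
    void update(int pageId, byte[] content) {
        dirtyPages.put(pageId, content);
    }

    /** A proper shutdown writes every page that is still only in memory. */
    void shutdown() {
        flushedToDisk.addAll(dirtyPages.keySet());
        dirtyPages.clear();
    }

    List<Integer> flushed() {
        return flushedToDisk;
    }

    public static void main(String[] args) {
        FastModeCacheSketch cache = new FastModeCacheSketch(2);
        cache.update(1, new byte[0]);
        cache.update(2, new byte[0]);
        cache.update(1, new byte[0]); // page 1 becomes the most recently used
        cache.update(3, new byte[0]); // overflow: page 2 is swapped out
        System.out.println("flushed before shutdown: " + cache.flushed());
        cache.shutdown();
        System.out.println("flushed after shutdown: " + cache.flushed());
    }
}
{code}

In this model, a crash before {{shutdown()}} would lose every page still held in the map, which is the risk the warning above refers to.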
The transaction mode is set using the {{transaction-mode}} configuration parameter (see the [configuration section|GraphDB-SE Configuration]). Changing modes requires GraphDB to be restarted.
In fast transaction mode, the isolation constraint can be relaxed in order to improve concurrency behaviour when strict read isolation is not a requirement. This is controlled by the {{transaction-isolation}} parameter, which only has an effect in fast mode (see the [configuration section|GraphDB-SE Configuration]).
h1. Transaction Control
Transaction support is exposed via Sesame's *RepositoryConnection* interface. The three methods of this interface that give the client control over when updates are committed to the repository are shown below:
\\
|| Method || Effect ||
| {{void begin()}} | Begins a transaction. Subsequent changes effected through update operations will only become permanent after commit() is called. |
| {{void commit()}} | Commits all updates that have been performed through this connection since the last call to begin(). |
| {{void rollback()}} | Rolls back all updates that have been performed through this connection since the last call to begin(). |
GraphDB-SE supports the so-called 'read committed' transaction isolation level, well known from relational database management systems. It guarantees that changes will not impact query evaluation before the entire transaction they are part of is successfully committed. It does not guarantee that execution of a single transaction is performed against a single state of the data in the repository. Regarding concurrency:
* multiple update/modification/write transactions can be initiated and stay open simultaneously, i.e. one transaction does not need to be committed in order to allow another transaction to complete;
* update transactions are processed internally in sequence, i.e. GraphDB processes the commits one after another;
* update transactions do not block read requests in any way, i.e. hundreds of SPARQL queries can be evaluated in parallel (the processing is properly multi-threaded) while update transactions are being handled on separate threads.
One should note that GraphDB performs materialisation, making sure that all the statements that can be inferred from the current state of the repository are indexed and persisted (except for those compressed due to the *owl:sameAs* optimisation, described in section 7.5). When the commit method completes, all reasoning activities triggered by the changes that the corresponding transaction introduced will have already been performed.
{note}An uncommitted transaction will not affect the 'view' of the repository through any connection, +including the connection used to do the modification+. This is perhaps not in keeping with most relational database implementations. However, committing a modification to a semantic repository involves considerably more work, specifically the computation of the changes to the inferred closure resulting from the addition or removal of explicit statements. This computation is only carried out at the point where the transaction is committed and so to be consistent, neither the inferred statements nor the modified statements related to the transaction are 'visible'.{note}
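The visibility rule in the note above can be illustrated with a toy Java model. It is a sketch only: {{VisibilityModel}} and its nested {{Connection}} class are hypothetical names that merely mirror the {{begin()}}, {{commit()}} and {{rollback()}} methods from the table, statements are plain strings, and no inference is performed. The point it demonstrates is that pending changes stay invisible to every connection, including the one that made them, until the commit.

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of commit-time visibility; not GraphDB or Sesame code. */
final class VisibilityModel {
    // Statements visible to queries on any connection
    private final Set<String> committed = new HashSet<>();

    final class Connection {
        private final List<String> pending = new ArrayList<>();
        private boolean active;

        void begin() {
            pending.clear();
            active = true;
        }

        void add(String statement) {
            if (!active) {
                throw new IllegalStateException("call begin() first");
            }
            pending.add(statement); // buffered, not yet visible to anyone
        }

        void commit() {
            synchronized (committed) { // commits are applied one after another
                committed.addAll(pending);
            }
            pending.clear();
            active = false;
        }

        void rollback() {
            pending.clear();
            active = false;
        }

        /** Queries only ever see committed statements. */
        boolean contains(String statement) {
            synchronized (committed) {
                return committed.contains(statement);
            }
        }
    }

    Connection connect() {
        return new Connection();
    }

    public static void main(String[] args) {
        VisibilityModel repository = new VisibilityModel();
        Connection writer = repository.connect();
        writer.begin();
        writer.add("ex:a ex:p ex:b");
        System.out.println("visible before commit: " + writer.contains("ex:a ex:p ex:b"));
        writer.commit();
        System.out.println("visible after commit: " + writer.contains("ex:a ex:p ex:b"));
    }
}
{code}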
h1. Predicate Lists
Certain datasets and certain kinds of query activities, for example queries that use wild-card patterns for predicates, benefit from another type of index called a 'predicate list'. This index maps from entities (subject or object) to their predicates. This index is not switched on by default (see *enablePredicateList* in the [configuration section|GraphDB-SE Configuration]), because it is not always necessary. Indeed, for most datasets and query loads the performance of GraphDB-SE without such an index is good enough even with wild-card-predicate queries, and the overhead of maintaining this index is not justified. One should consider using this index for datasets that contain a very large number (greater than around 1000) of different predicates.
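Why a predicate list can help with wild-card predicates can be sketched with a small in-memory model. Everything in it is an assumption made for illustration: {{PredicateListSketch}} is a hypothetical class, the nested maps only imitate a PSO-style index, and only the subject side of the predicate list is modelled. Because the main index is keyed by predicate first, a pattern with a bound subject and an unbound predicate has to probe every predicate, whereas the predicate list narrows the probes to the predicates actually used with that subject.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of a predicate list next to a PSO-style index; not GraphDB code. */
final class PredicateListSketch {
    // PSO-style index: predicate -> subject -> objects
    private final Map<String, Map<String, Set<String>>> pso = new TreeMap<>();
    // Predicate list: subject -> predicates used with that subject
    private final Map<String, Set<String>> predicateList = new HashMap<>();
    // Number of predicate entries probed by the most recent lookup
    int probes;

    void add(String s, String p, String o) {
        pso.computeIfAbsent(p, k -> new HashMap<>())
           .computeIfAbsent(s, k -> new TreeSet<>())
           .add(o);
        predicateList.computeIfAbsent(s, k -> new TreeSet<>()).add(p);
    }

    /** Answers (s, ?p, ?o) by probing every predicate in the index. */
    List<String> matchByScan(String s) {
        return collect(s, pso.keySet());
    }

    /** Answers (s, ?p, ?o) by probing only the predicates listed for s. */
    List<String> matchWithPredicateList(String s) {
        return collect(s, predicateList.getOrDefault(s, Collections.<String>emptySet()));
    }

    private List<String> collect(String s, Set<String> candidatePredicates) {
        probes = 0;
        List<String> result = new ArrayList<>();
        for (String p : candidatePredicates) {
            probes++;
            Set<String> objects = pso.get(p).get(s);
            if (objects != null) {
                for (String o : objects) {
                    result.add(p + " " + o);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PredicateListSketch index = new PredicateListSketch();
        index.add("alice", "knows", "bob");
        index.add("alice", "name", "Alice");
        for (int i = 0; i < 1000; i++) {
            index.add("s" + i, "p" + i, "o"); // many unrelated predicates
        }
        index.matchByScan("alice");
        System.out.println("probes without predicate list: " + index.probes);
        index.matchWithPredicateList("alice");
        System.out.println("probes with predicate list: " + index.probes);
    }
}
{code}

The extra map has to be updated on every insertion, which corresponds to the maintenance overhead mentioned above.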
h1. Context Indices
There are two more optional indices that can be used to speed up query evaluation when searching statements via their context identifier. These indices are the PCSO and the PCOS indices and are switched on together. See the *enable-context-index* parameter in the [configuration section|GraphDB-SE Configuration].
h1. Index Compression
The pages containing index data structures can optionally be written to disk with ZIP compression. This adds a small overhead to the performance of read/write operations, but can save a significant amount of disk-storage space. This is particularly significant for large databases that use expensive SSD storage devices.
Index compression is controlled using a single configuration parameter called {{index-compression-ratio}}, whose default value is {{\-1}}, indicating no compression. To create a repository that uses ZIP compression, set this parameter to a value between 10 and 50 percent (inclusive). Once the repository is created, this compression ratio cannot be changed.
The value for this parameter indicates the attempted compression ratio for pages: the smaller the value, the more compression is attempted. Pages that cannot be compressed below the requested size are stored uncompressed. Therefore, setting this value too low will not save any disk space and will simply add to the processing overhead. Typically, a value of 30% gives good performance with significant disk-space reduction, i.e. around 70% less disk space used for each index. The total disk space requirements are typically reduced by around half when using index compression at 30%.
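The rule that a page is stored compressed only when it fits within the requested ratio can be sketched with the JDK's {{java.util.zip.Deflater}}. This is an illustration under assumptions, not GraphDB's implementation: the class {{PageCompressionSketch}}, the 8192-byte page size and the method {{storedForm}} are hypothetical.

{code:java}
import java.util.Arrays;
import java.util.Random;
import java.util.zip.Deflater;

/** Toy model of ratio-limited page compression; not GraphDB code. */
final class PageCompressionSketch {
    static final int PAGE_SIZE = 8192; // hypothetical page size in bytes

    /**
     * Returns the compressed bytes if they fit within ratioPercent of the
     * page size; otherwise returns the original page unchanged.
     */
    static byte[] storedForm(byte[] page, int ratioPercent) {
        if (ratioPercent < 10 || ratioPercent > 50) {
            throw new IllegalArgumentException("ratio must be between 10 and 50");
        }
        int limit = page.length * ratioPercent / 100;
        byte[] buffer = new byte[limit];
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(page);
            deflater.finish();
            int written = deflater.deflate(buffer);
            if (!deflater.finished()) {
                return page; // output did not fit in the limit: keep uncompressed
            }
            return Arrays.copyOf(buffer, written);
        } finally {
            deflater.end(); // release native resources
        }
    }

    public static void main(String[] args) {
        byte[] repetitive = new byte[PAGE_SIZE]; // all zeros, highly compressible
        byte[] noise = new byte[PAGE_SIZE];
        new Random(42).nextBytes(noise);         // effectively incompressible
        System.out.println("repetitive page stored as " + storedForm(repetitive, 30).length + " bytes");
        System.out.println("noisy page stored as " + storedForm(noise, 30).length + " bytes");
    }
}
{code}

In this model the incompressible page still costs a compression attempt, which is why a ratio that is too low adds overhead without saving space.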
h1. Literal Index
A literal index is built automatically and allows faster look-ups of numeric and date/time object values. The index is used during query evaluation only if a query or a subquery (e.g. union) has a filter composed of a conjunction of literal constraints using comparisons and equality (but no negation or inequality), e.g. FILTER(?x = 100 && ?y <= 5 && ?start > "2001-01-01"^^xsd:date).
Other patterns will not use the index in this version of GraphDB, i.e. no attempt is made to re-write filters into usable patterns.
For example, these FILTER patterns will all make use of the literal index:
{noformat}
FILTER( ?x = 7 )
FILTER( 3 < ?x )
FILTER( ?x >= 3 && ?y <= 5 )
FILTER( ?x > "2001-01-01"^^xsd:date )
{noformat}
Whereas these FILTER patterns will not:
{noformat}
FILTER( ?x > (1 + 2) )
FILTER( ?x < 3 || ?x > 5 )
FILTER( (?x + 1) < 7 )
FILTER( ! (?x < 3) )
{noformat}
The query optimiser decides whether to use this index based on statistics. If the estimated number of matches for a filter constraint is large relative to the rest of the query, e.g. a constraint with a large or one-sided range, then the index might not be used at all.
{info}
Due to the way that literals are stored, dates far in the future and far into the past (approximately 200,000,000 years forward or backward) will behave unexpectedly. Also, numbers beyond the range of 64-bit floating-point representation, i.e. above approximately 1.8e308 and below \-1.8e308, will behave unexpectedly.
{info}
{info}
In case of some unexpected behaviour or problem with the literal index implementation, the use of this index during query evaluation can be disabled with a configuration parameter called {{enable-literal-index}}. The default value is {{true}}.
{info}
h1. Handling of Explicit and Implicit Statements
As already described, GraphDB-SE applies the inference rules at load time in order to compute the full closure. Therefore, a repository will contain some statements that are explicitly asserted and other statements that exist through implication. In most cases clients will not be concerned with the difference; however, there are some scenarios when it is useful to work with only explicit or only implicit statements. The following sections describe how these two groups of statements can be isolated during programmatic statement retrieval using the Sesame API and during (SPARQL) query evaluation.
h2. Retrieving Statements with the Sesame API
The usual technique for retrieving statements is to use the *RepositoryConnection* method:
{code}
RepositoryResult<Statement> getStatements(
    Resource subj,
    URI pred,
    Value obj,
    boolean includeInferred,
    Resource... contexts)
{code}
The method retrieves statements by 'triple pattern', where any or all of the subject, predicate and object parameters can be *null* to indicate 'wild cards'.
To retrieve explicit and implicit statements, the *includeInferred* parameter must be set to *true*. To retrieve only explicit statements, the *includeInferred* parameter must be set to *false*.
However, the Sesame API does not provide the means to enable the retrieval of implicit statements only. In order to allow clients to do this, GraphDB-SE allows the use of the special 'implicit' pseudo-graph (section 10.7.1) with this API, which can be passed as the context parameter. The following example shows how only implicit statements can be retrieved:
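{code}
import org.openrdf.model.Statement;
import org.openrdf.model.impl.URIImpl;
import org.openrdf.repository.RepositoryResult;

// The 'implicit' pseudo-graph as context restricts the result to inferred statements
RepositoryResult<Statement> statements =
    repositoryConnection.getStatements(
        null, null, null, true,
        new URIImpl("http://www.ontotext.com/implicit"));
try {
    while (statements.hasNext()) {
        Statement statement = statements.next();
        // Process statement
    }
} finally {
    statements.close(); // release the result even if processing fails
}
{code}
The above example uses wildcards for subject, predicate and object and will therefore return all implicit statements in the repository.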
h2. SPARQL Query Evaluation
GraphDB-SE also provides mechanisms to differentiate between explicit and implicit statements during query evaluation. This is achieved by associating statements with two pseudo-graphs (explicit and implicit) and using special system URIs to identify these graphs. Full details can be found in the [query behaviour|GraphDB-SE Query Behaviour] section.