GraphDB-Lite Indexing Specifics

compared with
Current by Reneta Popova
on Aug 22, 2014 14:48.

h1. Comparison of GraphDB-SE and GraphDB-Lite

The major differences between GraphDB-SE and GraphDB-Lite are their performance and scalability. Both GraphDB editions deliver identical functionality for RDF storage, inference and query answering, and they both implement Sesame's SAIL APIs, as discussed in section 4. This guarantees that all essential functions of a semantic repository are supported by GraphDB in a standard, consistent, and interoperable manner.
Compared to GraphDB-SE, GraphDB-Lite does not scale as well in terms of the volume of data that can be managed; the upper limit is typically some tens of millions of statements. On the other hand, GraphDB-Lite can perform inferencing and query answering faster, because it holds all data in memory. This is not always the case, however, because GraphDB-SE has several optimisations that GraphDB-Lite lacks, namely special owl:sameAs handling and various query optimisations.
Furthermore, GraphDB-SE has a range of advanced features that are not included in GraphDB-Lite: RDF Ranking, RDF Priming, RDF Search, Node Search, and notifications.

GraphDB-Lite stores the repository contents in a binary file in the storage folder when it is shut down. The format is such that it can be quickly read back into memory when GraphDB-Lite is restarted, i.e. the synchronisation of the in-memory contents of the repository with a persistent binary storage file occurs only at initialisation and shutdown.
Furthermore, new statements that are added to the repository are also stored in N-Triples format in an external file (see the new-triples-file configuration parameter in section 8.4). In the event of abnormal termination, the contents of this external file are added to the repository immediately after the repository is restored from the binary file.
{warning}It is vitally important to shut down repository connections properly to ensure that the repository contents are written to the file system on shutdown.{warning}
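
The restore-then-replay behaviour described above can be illustrated with a small, self-contained sketch. It is not GraphDB-Lite code: the class {{TinyStore}}, the file names, and the plain-text snapshot (standing in for the binary storage file) are all assumptions made for illustration, as is the choice to delete the journal after a clean shutdown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical toy store, for illustration only. Statements are N-Triples lines.
class TinyStore implements AutoCloseable {
    private final Path snapshot; // stands in for the binary storage file
    private final Path journal;  // stands in for the new-triples-file
    private final Set<String> statements = new LinkedHashSet<>();

    TinyStore(Path dir) {
        this.snapshot = dir.resolve("snapshot.dat");
        this.journal = dir.resolve("new-triples.nt");
        try {
            Files.createDirectories(dir);
            // Initialisation: restore the snapshot written by the last clean shutdown ...
            if (Files.exists(snapshot)) {
                statements.addAll(Files.readAllLines(snapshot, StandardCharsets.UTF_8));
            }
            // ... then replay statements left behind by an abnormal termination.
            if (Files.exists(journal)) {
                statements.addAll(Files.readAllLines(journal, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Adds a statement in memory and appends it to the journal file.
    void add(String ntriple) {
        if (!statements.add(ntriple)) {
            return; // already known, nothing to record
        }
        try {
            Files.write(journal, Collections.singletonList(ntriple), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Clean shutdown: write the full contents, then drop the journal (an assumption of this sketch).
    @Override
    public void close() {
        try {
            Files.write(snapshot, statements, StandardCharsets.UTF_8);
            Files.deleteIfExists(journal);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int size() {
        return statements.size();
    }

    boolean contains(String ntriple) {
        return statements.contains(ntriple);
    }
}
```

Using the store in a try-with-resources block ensures that {{close()}} runs and the snapshot is written. If the process dies before {{close()}}, the next {{TinyStore}} opened on the same folder recovers the missing statements from the journal.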

| {{void setAutoCommit(boolean autoCommit)}} | Enables or disables auto-commit mode for the connection. |

GraphDB supports the so-called 'read committed' transaction isolation level, well known from relational database management systems. It guarantees that changes do not affect query evaluation until the entire transaction they are part of has been successfully committed. It does not guarantee that a single transaction is executed against a single state of the data in the repository. Regarding concurrency:
* multiple update/modification/write transactions can be initiated and stay open simultaneously, i.e. one transaction does not need to be committed in order to allow another transaction to complete;
* update transactions are processed internally in sequence, i.e. GraphDB processes the commits one after another;
* update transactions do not block read requests in any way, i.e. hundreds of SPARQL queries can be evaluated in parallel (the processing is properly multi-threaded), while update transactions are being handled on separate threads.
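
The three points above can be modelled with a toy store. This is only a sketch of the read committed idea under simplifying assumptions; the class {{ReadCommittedStore}} and its methods are hypothetical and do not reflect how GraphDB is implemented. Each transaction buffers its own changes, commits are applied one after another, and readers always see the last committed state without taking a lock.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model of read committed isolation, for illustration only.
class ReadCommittedStore {
    // Readers see only this immutable set; volatile makes each commit visible to all threads.
    private volatile Set<String> committed = Collections.emptySet();

    // A transaction buffers its changes until commit() is called.
    class Transaction {
        private final Set<String> pending = new HashSet<>();

        void add(String statement) {
            pending.add(statement);
        }

        void commit() {
            applyCommit(pending);
            pending.clear();
        }
    }

    // Several transactions may be open at the same time.
    Transaction begin() {
        return new Transaction();
    }

    // Commits are processed in sequence: only one thread at a time can publish a new state.
    private synchronized void applyCommit(Set<String> pending) {
        Set<String> next = new HashSet<>(committed);
        next.addAll(pending);
        committed = Collections.unmodifiableSet(next);
    }

    // Reads never block on writers; they use whichever state was committed last.
    boolean contains(String statement) {
        return committed.contains(statement);
    }

    int size() {
        return committed.size();
    }
}
```

Note that two successive reads may observe different committed states if a commit happens in between. This matches the caveat that a transaction is not guaranteed to run against a single state of the data.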

One should note that GraphDB performs materialisation, making sure that all the statements which can be inferred from the current state of the repository are indexed and persisted. When the commit method completes, all reasoning activities related to the changes in the data introduced by the corresponding transaction will have already been performed.
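
To make commit-time materialisation concrete, the following sketch applies a single rule (transitivity of a subclass relation) to a fixpoint inside {{commit}}. It is an assumption-laden illustration: the class {{TinyReasoner}}, its naive quadratic loop, and the single hard-coded rule are hypothetical and far simpler than a real rule engine.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy reasoner, for illustration only. An entry (A, B) means "A subClassOf B".
class TinyReasoner {
    private final Set<Map.Entry<String, String>> explicit = new HashSet<>();
    private final Set<Map.Entry<String, String>> implicit = new HashSet<>();

    static Map.Entry<String, String> subClassOf(String sub, String sup) {
        return new SimpleImmutableEntry<>(sub, sup);
    }

    // Stores the new explicit statements and computes the inferred closure before returning.
    void commit(List<Map.Entry<String, String>> added) {
        explicit.addAll(added);
        implicit.removeAll(added); // a statement that is now asserted is no longer merely inferred
        boolean changed = true;
        while (changed) { // repeat until no new statement can be derived
            changed = false;
            List<Map.Entry<String, String>> all = new ArrayList<>(explicit);
            all.addAll(implicit);
            for (Map.Entry<String, String> x : all) {
                for (Map.Entry<String, String> y : all) {
                    // Rule: (A subClassOf B) and (B subClassOf C) imply (A subClassOf C).
                    if (x.getValue().equals(y.getKey())) {
                        Map.Entry<String, String> inferred = subClassOf(x.getKey(), y.getValue());
                        if (!explicit.contains(inferred) && implicit.add(inferred)) {
                            changed = true;
                        }
                    }
                }
            }
        }
    }

    boolean isExplicit(String sub, String sup) {
        return explicit.contains(subClassOf(sub, sup));
    }

    boolean isImplicit(String sub, String sup) {
        return implicit.contains(subClassOf(sub, sup));
    }

    int implicitCount() {
        return implicit.size();
    }
}
```

Because the closure is complete when {{commit}} returns, a query issued afterwards can be answered by lookup alone, with no reasoning at query time.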

{note}
h1. Handling of Explicit and Implicit Statements

As already described, GraphDB-Lite applies the inference rules at load time in order to compute the full closure. Therefore, a repository will contain some statements that are explicitly asserted and other statements that exist through implication. In most cases clients will not be concerned with the difference; however, there are some scenarios in which it is useful to work with only explicit or only implicit statements. The following sections describe how these two groups of statements can be isolated during programmatic statement retrieval using the Sesame API and during (SPARQL) query evaluation.
The usual technique for retrieving statements is to use the RepositoryConnection method:
{code}RepositoryResult<Statement> getStatements(
h1. Multi-threading

GraphDB-Lite features thread-safe techniques for managing the internal storage of data structures, where several 'worker' threads do inferencing in parallel. The number of worker threads is controlled via the num.threads.run=n system property, see section 8.4. Each of the allocated threads operates on jobs of committed explicit statements. Each thread computes the increment to the inferred closure from applying the rule set to these new statements and the existing statements in the repository. The number of collected statements is controlled via the {{jobsize}} configuration parameter. When a transaction is committed, the caller is blocked until all worker threads have finished their job units.
There are other threads running simultaneously apart from the main application thread and the worker threads mentioned above. If persistence is not switched off, then a persistence thread wakes up every few seconds and scans for new explicit statements. Any new statements found are added to the persistence file. There is also an additional thread that is spawned (by Sesame) during the parsing of RDF at load time. This should also be considered when deciding how many worker threads to allocate for inference.
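
The worker-thread scheme described in this section can be sketched with the standard {{java.util.concurrent}} classes. The class {{BatchCommitter}} and its methods are hypothetical, and the per-job work is only a placeholder for inference. The parameters {{numThreads}} and {{jobSize}} play the roles of num.threads.run and {{jobsize}}; how GraphDB-Lite reads them internally is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of blocking, batched commit processing, for illustration only.
class BatchCommitter {

    // Splits the committed statements into jobs of at most jobSize statements each.
    static List<List<String>> toJobs(List<String> statements, int jobSize) {
        List<List<String>> jobs = new ArrayList<>();
        for (int i = 0; i < statements.size(); i += jobSize) {
            jobs.add(statements.subList(i, Math.min(i + jobSize, statements.size())));
        }
        return jobs;
    }

    // Hands the jobs to numThreads workers and returns only when every job has finished.
    static int commit(List<String> statements, int numThreads, int jobSize) {
        ExecutorService workers = Executors.newFixedThreadPool(numThreads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<String> job : toJobs(statements, jobSize)) {
                // Placeholder for real inference work: report how many statements were handled.
                tasks.add(() -> job.size());
            }
            int processed = 0;
            // invokeAll blocks the caller until all job units are complete.
            for (Future<Integer> done : workers.invokeAll(tasks)) {
                processed += done.get();
            }
            return processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("commit interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            workers.shutdown();
        }
    }
}
```

When choosing {{numThreads}} in a setup like this, remember that the persistence thread and the parser thread mentioned above compete for the same cores.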