The entity pool is a key component of the GraphDB storage layer. With version 6.2 we introduce a new transactional implementation that improves space usage and cluster behavior.
Entity pool basics
The entity pool stores the strings of all RDF values (URIs, BNodes, and Literals) together with their IDs. The data structure can be considered a file-based table that maps strings to IDs and vice versa. During inference and query evaluation, when the RDF values themselves are not explicitly required, we use their internal representations: 32-bit or 40-bit IDs. Strings are only required by FILTER and BIND clauses that perform SPARQL string operations. As comparison operations between strings are much slower, we use the IDs for all iterations that are performed to reach the needed result. Once the result is found, the corresponding strings are read from the entity pool in order to return it.
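The string-to-ID table described above can be pictured with a small in-memory sketch. This is only an illustration under simplifying assumptions: the real entity pool is file-based, and the class and method names here (`EntityPool`, `get_or_create_id`, `value_of`) are hypothetical, not GraphDB API.

```python
class EntityPool:
    """Toy in-memory stand-in for the file-based string <-> ID table."""

    def __init__(self):
        self._id_by_value = {}   # RDF value string -> ID
        self._value_by_id = []   # position (ID - 1) -> RDF value string

    def get_or_create_id(self, value: str) -> int:
        """Return the ID of a value, creating a new one on first sight."""
        existing = self._id_by_value.get(value)
        if existing is not None:
            return existing
        self._value_by_id.append(value)
        new_id = len(self._value_by_id)   # IDs start at 1 in this sketch
        self._id_by_value[value] = new_id
        return new_id

    def value_of(self, entity_id: int) -> str:
        """Map an ID back to its string."""
        return self._value_by_id[entity_id - 1]


pool = EntityPool()

# A statement is held internally as a tuple of IDs, not of strings.
statement = tuple(
    pool.get_or_create_id(value)
    for value in (
        "http://example.org/alice",
        "http://xmlns.com/foaf/0.1/knows",
        "http://example.org/bob",
    )
)

# Evaluation works on the integer IDs; strings are read back only
# when the final result has to be returned.
result = [pool.value_of(entity_id) for entity_id in statement]
```

Comparing the integers in `statement` is cheap, which is why the strings are touched only at the very end.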
Given that the entity pool should perform mappings while loading data and all statements that come from the inferencer should be materialized prior to the end of the transaction, the natural question is: What happens if the transaction is prematurely terminated, for example by an exception originating from the transaction itself or by an interruption?
In previous versions, the entity pool was not transactional, which meant that new entities were created at the point of adding statements and remained in the entity pool. Consequently, failed transactions and transactions that were intentionally rolled back kept leaving new garbage entities in the entity pool, where they consumed disk space.
Therefore, with version 6.2, the entity pool can also be transactional, which means that all entities newly created in it are deleted if the transaction is prematurely terminated. This not only saves disk space but also ensures that the same strings always map to the same IDs. This is crucial in a cluster environment because, during replications (partial or full) and ordinary commits, which take place asynchronously between worker nodes, the new entities are always transmitted separately from the data. It would be a serious problem if, during a transaction, an entity were missed or an extra one were added on some of the worker nodes. As all new data is shared between the worker nodes in the form of quads of IDs, this would shift the IDs, and all data after a given ID would become incorrect.
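The ID shift can be illustrated with a toy model of two worker nodes. This is a sketch under simplifying assumptions: each pool is a plain list whose position determines the ID, and the helper `assign_ids` and the `urn:` values are made up for the example.

```python
def assign_ids(pool, values):
    """Append unseen values to the pool and return their 1-based IDs."""
    ids = []
    for value in values:
        if value not in pool:
            pool.append(value)
        ids.append(pool.index(value) + 1)
    return ids


worker_a = []   # non-transactional pool on worker A
worker_b = []   # non-transactional pool on worker B

# A transaction fails on worker B only, but its entity stays behind.
assign_ids(worker_b, ["urn:garbage"])

# The next transaction succeeds everywhere. The entities are sent
# separately from the data, and the data itself is a quad of IDs.
new_entities = ["urn:s", "urn:p", "urn:o", "urn:g"]
quad_ids = assign_ids(worker_a, new_entities)   # IDs as assigned on worker A
assign_ids(worker_b, new_entities)              # worker B stores them one slot later

decoded_on_a = [worker_a[i - 1] for i in quad_ids]
decoded_on_b = [worker_b[i - 1] for i in quad_ids]
```

Here `decoded_on_a` is the original quad, while `decoded_on_b` starts with `urn:garbage` and every following position is off by one.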
The main reason for making the entity pool transactional is to always keep the data on all worker nodes in sync. This demand came from using the connectors in a cluster environment where IDs were frequently falling out of sync.
The transactional implementation is accessed through an entity pool connection, and all data written there is invisible to other connections until the commit. We can create new connections simultaneously and add any kind of entities, even duplicates, because all their IDs are temporary. This means that new entities have different IDs in different connections, and the entity->ID and ID->entity mappings are accessible only through the connection used to add them. In other words, the same entity always maps to the same temporary ID within a single connection, but it does not share that ID across connections. All committed entities are visible through all new connections. The connections can be used independently and asynchronously. However, because the IDs are temporary, they cannot be used to represent real statements in the storage or to run inference over them. For this reason, committing data has two stages: pre-commit and commit.
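A minimal sketch of this isolation follows. The class `EntityPoolConnection`, its methods, and the `("temp", n)` / `("perm", n)` tagging of IDs are assumptions made for illustration only; they are not the GraphDB interface.

```python
import itertools


class EntityPoolConnection:
    """Hypothetical connection that hands out connection-local temporary IDs."""

    def __init__(self, committed):
        self._committed = committed          # shared view: value -> permanent ID
        self._temp_ids = {}                  # value -> temporary ID, private
        self._counter = itertools.count(1)   # temporary IDs restart per connection

    def add(self, value):
        """Return the permanent ID if committed, else a temporary one."""
        if value in self._committed:
            return ("perm", self._committed[value])
        if value not in self._temp_ids:
            self._temp_ids[value] = next(self._counter)
        return ("temp", self._temp_ids[value])

    def lookup(self, value):
        """Resolve a value as seen through this connection only."""
        if value in self._committed:
            return ("perm", self._committed[value])
        temp_id = self._temp_ids.get(value)
        return None if temp_id is None else ("temp", temp_id)


committed = {"urn:known": 1}                 # entities already committed
conn1 = EntityPoolConnection(committed)
conn2 = EntityPoolConnection(committed)

first = conn1.add("urn:new")                 # ("temp", 1)
again = conn1.add("urn:new")                 # same connection, same temporary ID
conn2.add("urn:other")
elsewhere = conn2.add("urn:new")             # ("temp", 2): a different temporary ID
```

In this sketch, `conn2.lookup` returns `None` for a value that only `conn1` has added, while `urn:known` resolves to the same permanent ID through both connections.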
In the pre-commit stage, the IDs are fixed, so that they become permanent. Statements that use them can be inserted and inference can be performed, but the entities are not yet materialized in the entity pool. Entities in this stage are still invisible to the other connections.
The reason for this stage is to ensure that no entities are written to the entity pool until the transaction succeeds. Many operations take place while statements are committed, involving the storage, the inferencer, and the plugins, and most problems surface either as erroneous data or during consistency checking at the end of the commit. These operations are very time-consuming, and it is best to prevent data inconsistencies after transaction failures. Therefore, we want to make sure that everything is processed normally before writing the entities from the current entity pool connection to the entity pool.
The pre-commit stage fixes the entity IDs, mapping them from temporary to permanent ones, so that the storage, inference, plugin, and consistency-checking operations can run as if the IDs were already written to the entity pool. If there is a failure, the transaction is rolled back and all its new entities are deleted. While the IDs of one connection are fixed, no other connection can enter the pre-commit stage: we do not know whether the transaction will be committed or rolled back, so we cannot fix the IDs in other connections and guarantee that they will not change over time. So, during the pre-commit stage, as in earlier versions, only one transaction can be processed at a time.
The commit stage finalizes the entity pool transaction and writes the entities to disk. From this moment, these entities always appear with the IDs with which they were written. Any subsequent entity pool connection that starts a transaction can see them with their respective IDs. If a newly written entity is also present in another connection that is still waiting to be committed, it is resolved within that transaction to the same ID. As an entity can be written only once, there are no duplicate strings in the entity pool.
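The two stages can be sketched as follows. This is a simplified model under stated assumptions, not the actual implementation: a dictionary stands in for the on-disk table, a `threading.Lock` models the rule that only one connection may be in the pre-commit stage, and all class and method names are hypothetical.

```python
import threading


class EntityPool:
    """Toy pool; a dict stands in for the on-disk table."""

    def __init__(self):
        self.committed = {}                      # value -> permanent ID
        self.precommit_lock = threading.Lock()   # one connection in pre-commit


class Connection:
    def __init__(self, pool):
        self.pool = pool
        self.pending = []          # new values, in insertion order
        self.fixed = {}            # value -> ID fixed during pre-commit
        self._holds_lock = False

    def add(self, value):
        if value not in self.pool.committed and value not in self.pending:
            self.pending.append(value)

    def precommit(self):
        """Fix permanent IDs without writing anything to the pool yet."""
        self.pool.precommit_lock.acquire()
        self._holds_lock = True
        next_id = len(self.pool.committed) + 1
        for value in self.pending:
            if value not in self.pool.committed:   # skip what others committed
                self.fixed[value] = next_id
                next_id += 1

    def resolve(self, value):
        """ID of a value as seen by this connection after pre-commit."""
        if value in self.pool.committed:
            return self.pool.committed[value]
        return self.fixed[value]

    def commit(self):
        """Write the fixed entities to the pool and leave the stage."""
        self.pool.committed.update(self.fixed)
        self._finish()

    def rollback(self):
        """Discard all new entities; the pool is left untouched."""
        self._finish()

    def _finish(self):
        self.pending, self.fixed = [], {}
        if self._holds_lock:
            self._holds_lock = False
            self.pool.precommit_lock.release()


pool = EntityPool()
c1, c2 = Connection(pool), Connection(pool)
c1.add("urn:a")
c1.add("urn:b")
c2.add("urn:b")
c2.add("urn:c")

c1.precommit()
# False: c1 is in the pre-commit stage, so nobody else can enter it.
c2_could_enter = pool.precommit_lock.acquire(blocking=False)
c1.commit()                                   # urn:a -> 1, urn:b -> 2

c2.precommit()                                # urn:b is already committed
ids_seen_by_c2 = (c2.resolve("urn:b"), c2.resolve("urn:c"))
c2.rollback()                                 # urn:c never reaches the pool
```

After the run, the pool contains only `urn:a` and `urn:b`. The second connection saw `urn:b` with the ID written by the first one, and its rolled-back entity `urn:c` left no trace.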
Making the entity pool transactional fixes many issues related to creating IDs. However, we still need to pre-process the entities and perform all other commit operations, including adding the entities to the permanent store. All these operations are time-consuming, so the new transactional entity pool cannot be faster than the previous non-transactional one.
Therefore, we offer two options: a fully transactional implementation, as described above, and a 'classic' entity pool implementation, which avoids the overhead. The default is the 'classic' entity pool, which performs like the entity pool from the earlier GraphDB versions. It still uses connections and the same entity pool interface but when adding statements, the entities are directly added and cannot be rolled back.
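The entity pool implementation can be selected with the entity-pool-implementation config parameter or with the -D command line parameter of the same name (for example, -Dentity-pool-implementation=transactional-simple). There are two possible values: transactional-simple and classic (the default).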
In the transactional-simple mode, all entities are kept in memory. Keep this in mind when dealing with large transactions, as the memory dedicated to the transaction is released only after the entities have been committed to disk. Transactions of up to 100 million statements should not present a problem, but much larger ones will occupy significant memory resources. Therefore, such big loads should be run with the 'classic' option in order to prevent OutOfMemoryErrors.
On the other hand, a large number of small transactions will have the same performance as before, because the overall overhead per transaction is much larger than the overhead resulting from the transactional entity pool. So, the transactional-simple option is preferable to the classic one, especially in a cluster environment.
|Parameter|Value|Notes|
|---|---|---|
|entity-pool-implementation|transactional-simple|all entities are kept in memory|
|entity-pool-implementation|classic (default)|for bulk loads|