The entity pool is a key component of the GraphDB storage layer. With version 6.2 we introduce a new transactional implementation that improves space usage and cluster behavior.
The entity pool stores the strings of all RDF values (URIs, BNodes, and Literals) together with their IDs. The data structure can be considered a file-based table that maps strings to IDs and vice versa. For inference and query evaluation, when the RDF values themselves are not explicitly required, we use their internal representations: 32-bit or 40-bit IDs. Strings are only required by FILTER and BIND clauses that perform SPARQL string operations. Because comparing strings is much slower than comparing IDs, we use the IDs for all iterations performed to reach the result. Once the result is found, the corresponding strings are read back from the entity pool in order to return it.
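The mapping described above can be pictured with a minimal in-memory sketch. It is only an illustration: the real entity pool is file-based, and the class and method names here (EntityPool, get_or_create_id, get_string) are hypothetical and not part of the GraphDB API.

```python
class EntityPool:
    """Toy in-memory stand-in for the file-based string <-> ID table."""

    def __init__(self):
        self._id_by_string = {}   # string -> ID
        self._string_by_id = []   # position (ID - 1) -> string

    def get_or_create_id(self, value: str) -> int:
        existing = self._id_by_string.get(value)
        if existing is not None:
            return existing
        new_id = len(self._string_by_id) + 1  # IDs start at 1 in this sketch
        self._id_by_string[value] = new_id
        self._string_by_id.append(value)
        return new_id

    def get_string(self, entity_id: int) -> str:
        return self._string_by_id[entity_id - 1]


pool = EntityPool()
triple = (
    "http://example.org/alice",
    "http://xmlns.com/foaf/0.1/knows",
    "http://example.org/bob",
)

# Statements are stored and joined as integers.
encoded = tuple(pool.get_or_create_id(term) for term in triple)

# Strings are looked up again only when a result has to be returned.
decoded = tuple(pool.get_string(i) for i in encoded)
```

Comparing the integers in `encoded` is a fixed-size operation, whereas comparing the original strings depends on their length, which is why the IDs are used during evaluation.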
Given that the entity pool should perform mappings while loading data and all statements that come from the inferencer should be materialized prior to the end of the transaction, the natural question is: What happens if the transaction is prematurely terminated, for example by an exception originating from the transaction itself or by an interruption?
In previous versions, the entity pool was not transactional: new entities were created at the point of adding statements and remained in the entity pool. Consequently, failed transactions and intentionally rolled-back ones kept accumulating garbage entities in the entity pool, consuming disk space.
Therefore, with version 6.2, we made the entity pool transactional, which means that all entities newly created in a transaction are deleted if the transaction is prematurely terminated. This not only saves disk space but also ensures that the same strings map to the same IDs. This is crucial in a cluster environment because during replications (partial or full) and ordinary commits, which take place asynchronously between worker nodes, the new entities are always transmitted separately from the data. It would be a serious problem if, during a transaction, an entity were missed or an extra one were added on some of the worker nodes. As all new data is shared between the worker nodes in the form of quads of IDs, this would shift the IDs, and all data after a given ID would be interpreted incorrectly.
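The ID shift can be made concrete with a small simulation. This is a sketch under simplifying assumptions: each worker's pool is a plain list, the ID is the 1-based position in that list, and a three-element statement stands in for a quad. The helper names are hypothetical.

```python
def append_entities(pool, entities):
    """Append strings to a worker's pool; the ID is the 1-based position."""
    for entity in entities:
        if entity not in pool:
            pool.append(entity)


def encode(pool, statement):
    return tuple(pool.index(term) + 1 for term in statement)


def decode(pool, ids):
    return tuple(pool[i - 1] for i in ids)


worker_a = ["ex:alice", "ex:knows"]
worker_b = ["ex:alice", "ex:knows"]

# A rolled-back transaction leaves a garbage entity on worker B only
# (the behavior of a non-transactional pool).
append_entities(worker_b, ["ex:garbage"])

# Next commit: the new entities are transmitted separately from the data.
append_entities(worker_a, ["ex:bob"])
append_entities(worker_b, ["ex:bob"])

# The data itself travels as IDs and is decoded locally on each worker.
statement = ("ex:alice", "ex:knows", "ex:bob")
ids_from_a = encode(worker_a, statement)
seen_on_b = decode(worker_b, ids_from_a)
```

Here `seen_on_b` ends with `ex:garbage` instead of `ex:bob`: one stray entity on a single worker is enough to misalign every ID allocated after it.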
The main reason for making the entity pool transactional is to always keep the data on all worker nodes in sync. This demand came from using the connectors in a cluster environment where IDs were frequently falling out of sync.
The transactional implementation is accessed through an entity pool connection, and all data written there is invisible to other connections until the commit. We can create new connections simultaneously and add any kind of entities, even duplicates, because all their IDs are temporary. This means that new entities will have different IDs in different connections, and the entity->ID and ID->entity mappings will be accessible only through the connection used to add them. In other words, the same entity always maps to the same temporary ID within a single connection, but it does not share that ID across connections. All committed entities are visible through all new connections. The connections can be used independently and asynchronously. However, because the IDs are temporary, they cannot be used to represent real statements in the storage or to run inference over them. Committing data therefore takes place in two stages: pre-commit and commit.
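The visibility rules for temporary IDs can be sketched as follows. This is an illustrative model only: the classes, the method names, and the tuple shape used for temporary IDs are assumptions made for the example, not the actual GraphDB implementation.

```python
import itertools


class EntityPool:
    def __init__(self):
        self.committed = {"ex:alice": 1}          # already-committed entity
        self._conn_numbers = itertools.count(1)

    def open_connection(self):
        return EntityPoolConnection(self, next(self._conn_numbers))


class EntityPoolConnection:
    def __init__(self, pool, number):
        self._pool = pool
        self._number = number
        self._temp_ids = {}                       # visible to this connection only

    def get_or_create_id(self, value):
        if value in self._pool.committed:
            return self._pool.committed[value]    # permanent ID, shared by all
        if value not in self._temp_ids:
            # Temporary IDs carry the connection number, so they never
            # coincide across connections in this sketch.
            self._temp_ids[value] = ("tmp", self._number, len(self._temp_ids) + 1)
        return self._temp_ids[value]

    def find_id(self, value):
        """Return the ID visible through this connection, or None."""
        return self._pool.committed.get(value, self._temp_ids.get(value))


pool = EntityPool()
c1, c2 = pool.open_connection(), pool.open_connection()

bob_in_c1 = c1.get_or_create_id("ex:bob")
bob_in_c2 = c2.get_or_create_id("ex:bob")   # duplicate string, different temp ID

c3 = pool.open_connection()                 # sees committed entities only
```

In this model `ex:bob` has a stable temporary ID inside each connection, the two connections disagree on that ID, and a third connection cannot see it at all, while the committed `ex:alice` resolves to the same permanent ID everywhere.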
In the pre-commit stage, the IDs are fixed, so that they become permanent. The IDs can then be inserted into the storage and inference can be performed over them, but the entities are not yet materialized in the entity pool. Entities in this stage are still invisible to the other connections.
The reason for this stage is to ensure that no entities are written to the entity pool until the transaction succeeds. Many operations take place while committing statements: the storage, the inferencer, and the plugins are all involved, and most problems are caused by erroneous data or surface during consistency checking at the end of the commit. These operations are very time-consuming, and it is best to prevent data inconsistencies after transaction failures. Therefore, we want to make sure that everything is processed normally before writing the entities from the current entity pool connection to the entity pool.
The pre-commit stage fixes the entity IDs (mapping them from temporary to permanent ones) so that the store, inference, plugin, and consistency checking operations can proceed normally, as if the IDs were already written to the entity pool. If there is a failure, the transaction is rolled back and all its entities are deleted. While one connection is in the pre-commit stage, we do not yet know whether its transaction will be committed or rolled back, so we cannot fix the IDs in the other connections and guarantee that they will not change over time. For this reason, once the entity IDs of one connection are fixed in the pre-commit stage, no other connection can enter that stage. So, during the pre-commit stage, as in earlier versions, only one transaction can be processed at a time.
The commit phase finalizes the entity pool transaction and writes the entities to disk. From then on, these entities always appear with the IDs with which they were written, so any subsequent entity pool connection that has just started a transaction will see them with their respective IDs. If an entity that has just been written to the entity pool (a new entity) is also present in another connection that is still pending commit, the entity will be resolved within that transaction to the ID with which it was written, so there is no risk of duplicate strings in the entity pool. An entity can be written only once.
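The two stages and the rollback path can be sketched as below. This is a simplified model under stated assumptions: permanent IDs are allocated sequentially, a single lock stands in for the rule that only one connection may be in pre-commit at a time, and all class and method names are hypothetical.

```python
import threading


class EntityPool:
    def __init__(self):
        self.id_by_string = {}                  # entities "written to disk"
        self.precommit_lock = threading.Lock()  # one pre-commit at a time


class Connection:
    def __init__(self, pool):
        self.pool = pool
        self.new_entities = []   # strings first seen in this transaction
        self.fixed = None        # string -> permanent ID, set by precommit()

    def add(self, value):
        if value not in self.pool.id_by_string and value not in self.new_entities:
            self.new_entities.append(value)

    def precommit(self):
        self.pool.precommit_lock.acquire()
        next_id = len(self.pool.id_by_string) + 1
        self.fixed = {}
        for value in self.new_entities:
            if value in self.pool.id_by_string:
                continue  # written by another connection meanwhile: reuse its ID
            self.fixed[value] = next_id
            next_id += 1

    def id_of(self, value):
        return self.pool.id_by_string.get(value, (self.fixed or {}).get(value))

    def commit(self):
        self.pool.id_by_string.update(self.fixed)  # entities become visible
        self._finish()

    def rollback(self):
        self._finish()                             # nothing was written

    def _finish(self):
        held = self.fixed is not None
        self.new_entities, self.fixed = [], None
        if held:
            self.pool.precommit_lock.release()


pool = EntityPool()
c1, c2 = Connection(pool), Connection(pool)
for v in ("ex:bob", "ex:carol"):
    c1.add(v)
for v in ("ex:bob", "ex:dave"):
    c2.add(v)

c1.precommit()
locked_during_precommit = pool.precommit_lock.locked()
c1.commit()

c2.precommit()
bob_in_c2 = c2.id_of("ex:bob")    # resolved to the ID written by c1
dave_in_c2 = c2.id_of("ex:dave")  # fixed, but not yet written
c2.rollback()
```

After the rollback, the pool contains only the entities committed by the first connection, and `ex:dave` leaves no trace. The string `ex:bob`, which both connections added, is written once and resolves to the same ID in both.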
The transactional entity pool fixes many issues related to creating IDs, but it comes at a cost: the entities must be pre-processed, which takes some time, before the usual commit work is done, including adding the entities to the permanent store. Therefore, the new entity pool cannot be faster than the previous non-transactional one. For this reason there are two implementation options: a truly transactional one, which does everything described above, and a classic entity pool, which avoids the overhead. The classic one still uses connections and the same entity pool interface, but the entities are added directly when statements are added and cannot be rolled back, i.e., it behaves just like the entity pool from previous GraphDB versions. The classic entity pool is the default. The implementation can be chosen with the entity-pool-implementation config parameter or the -D command line parameter with the same name (for example, -Dentity-pool-implementation=transactional-simple). There are two possible values: transactional-simple and classic.
In transactional-simple mode, all entities are kept in memory, which should be kept in mind when dealing with large transactions. The memory dedicated to a transaction is released only after its entities have been committed to disk. Transactions of up to 100 million statements should be fine, but much larger ones will consume a large amount of memory. Such bulk loads should be run with the classic option to prevent OutOfMemoryErrors. On the other hand, a large number of small transactions will perform the same as before, because the general per-transaction overhead is much larger than the overhead introduced by the transactional entity pool. For such workloads, the transactional-simple option should be preferred over classic, especially in a cluster environment.
| Parameter | Value | Description |
|---|---|---|
| entity-pool-implementation | transactional-simple | all entities are kept in memory |
| | classic (default) | for bulk loads |