The entity pool is a key component of the GraphDB storage layer. With version 6.2 we introduce a new transactional implementation that improves space usage and cluster behavior.
The entity pool stores the strings of all RDF values (URIs, BNodes and Literals) together with their IDs. The data structure can be considered a file-based table that maps strings to IDs and vice versa. For inference and query evaluation, when the RDF values themselves are not explicitly required, we use their internal representations: 32-bit or 40-bit IDs. Strings are only required by the FILTER and BIND clauses that perform SPARQL string operations. As comparisons between strings are much slower than comparisons between IDs, we use the IDs for all iterations performed to reach the needed result. Once the result is found, the corresponding strings are read back from the entity pool in order to return it.
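The two-way mapping described above can be pictured with a minimal in-memory sketch. The class and method names here (`EntityTableSketch`, `getOrCreateId`, `getValue`) are hypothetical and are not GraphDB APIs. The real entity pool is file-based, and the sequential 1-based IDs are an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the file-based string <-> ID table.
final class EntityTableSketch {
    private final Map<String, Long> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the existing ID for the value, or assigns the next free one. */
    long getOrCreateId(String value) {
        Long existing = idsByValue.get(value);
        if (existing != null) {
            return existing;
        }
        valuesById.add(value);
        long id = valuesById.size(); // assumption: IDs are sequential and start at 1
        idsByValue.put(value, id);
        return id;
    }

    /** Resolves an ID back to its string, or returns null if the ID is unknown. */
    String getValue(long id) {
        if (id < 1 || id > valuesById.size()) {
            return null;
        }
        return valuesById.get((int) (id - 1));
    }
}
```

In such a scheme, query evaluation would compare only the `long` IDs and call `getValue` once, at the end, for the values that are actually returned.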
Given that the entity pool should perform mappings while loading data and all statements that come from the inferencer should be materialized prior to the end of the transaction, the natural question is: What happens if the transaction is prematurely terminated, for example by an exception originating from the transaction itself or by an interruption?
In previous versions, the entity pool was not transactional: new entities were created at the point of adding statements and remained in the entity pool. Consequently, failed transactions and transactions that were intentionally rolled back kept accumulating garbage entities in the entity pool and consuming disk space.
Therefore, with version 6.2, we made the entity pool transactional, which means that all entities newly created in a transaction are deleted if that transaction is prematurely terminated. This not only saves disk space but also ensures that the same strings map to the same IDs. This is crucial in a cluster environment because during replications (partial or full) and ordinary commits, which take place asynchronously between worker nodes, the new entities are always transmitted separately from the data. It would be a disaster if, during a transaction, an entity were missed or an extra one were added on some of the worker nodes. As all new data is shared between the worker nodes in the form of quads of IDs, this would result in a shift in the IDs, and all data after a given ID would be rendered incorrectly.
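The ID shift can be illustrated with a small sketch. It assumes, purely for illustration, that each worker assigns IDs sequentially in insertion order. `IdShiftSketch` and `WorkerPool` are hypothetical names, not GraphDB classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of two worker-side pools drifting out of sync.
final class IdShiftSketch {
    /** Assumption: the ID of an entity is its 1-based insertion position. */
    static final class WorkerPool {
        private final List<String> values = new ArrayList<>();

        long add(String value) {
            values.add(value);
            return values.size();
        }

        String resolve(long id) {
            return values.get((int) (id - 1));
        }
    }

    public static void main(String[] args) {
        WorkerPool a = new WorkerPool();
        WorkerPool b = new WorkerPool();
        a.add("ex:alice");
        b.add("ex:alice");

        // Worker B keeps one garbage entity left over from a failed transaction.
        b.add("ex:garbage");

        // The same new entities then arrive on both workers.
        long knowsOnA = a.add("ex:knows"); // ID 2 on A
        b.add("ex:knows");                 // ID 3 on B
        long bobOnA = a.add("ex:bob");     // ID 3 on A
        b.add("ex:bob");                   // ID 4 on B

        // A statement shared as IDs by worker A is misread on worker B.
        System.out.println(b.resolve(knowsOnA)); // prints "ex:garbage"
        System.out.println(b.resolve(bobOnA));   // prints "ex:knows"
    }
}
```

Every ID after the stray entity is off by one on worker B, which is the kind of divergence the transactional entity pool is meant to rule out.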
The main reason for making the entity pool transactional is to always keep the data on all worker nodes in sync. This demand came from using the connectors in a cluster environment where IDs were frequently falling out of sync.
The transactional implementation is accessed through an entity pool connection, and all data written through it is invisible to other connections until the commit. We can create new connections simultaneously and add any kind of entities, even duplicates, because all their IDs are temporary. This means that new entities will have different IDs in different connections, and the entity->ID and ID->entity mappings will be accessible only through the connection used to add them. In other words, the same entity always maps to the same temporary ID within a single connection, but the same entity added through different connections will not share an ID. All committed entities are visible through all new connections. The connections can be used independently and asynchronously. Because the IDs are temporary, however, they cannot be used to represent real statements in the storage or to perform inference over them. For this reason, committing data takes place in two stages: pre-commit and commit.
The pre-commit stage fixes the IDs so that they become permanent, in the sense that statements using them can be inserted and inference can be performed, but the entities are not yet materialized in the entity pool. Entities in this stage are still invisible to the other connections. The reason for passing through such a stage is that we do not want to write the entities until we are sure that the transaction will succeed. Most of the work takes place while committing statements, when the storage, the inferencer and the plugins are all involved, and most problems arise there, either from erroneous data or during the consistency checking at the end of the commit. These operations are very time-consuming, and to prevent data inconsistencies after transaction failures we want to make sure that everything is processed normally before we write the entities from the current entity pool connection to the entity pool. This is what the pre-commit stage is all about: fixing the entity IDs (mapping them from temporary to permanent ones) so that the storage, inference, plugin and consistency checking operations can run as if the IDs were already written to the entity pool. In case of failure, the transaction is rolled back and all its entities are thrown away.
While one connection is fixing its entity IDs in the pre-commit stage, no other connection can enter that stage, because we do not yet know whether the transaction will be committed or rolled back, so we cannot fix the IDs in the other connections and guarantee that they will not change later. During the pre-commit stage, therefore, only one transaction can be processed at a time (this has always been the case, so it is nothing to worry about).
The commit stage finalizes the entity pool transaction and writes the entities to disk. From then on, these entities will always appear with the IDs with which they were written, so any entity pool connection that starts a transaction afterwards will see them with their respective IDs. If an entity that has just been written to the entity pool (a new entity) is also present in another connection that is still waiting to be committed, it will be resolved within that transaction to the ID with which it has just been written, so there is no risk of duplicate strings in the entity pool. An entity can be written only once.
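The connection lifecycle described above (temporary IDs, pre-commit, commit, rollback) can be sketched as follows. Every name here is hypothetical and the code is not the GraphDB implementation. Negative temporary IDs, sequential permanent IDs and the in-memory maps are assumptions chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a transactional entity pool; not a GraphDB API.
final class TransactionalPoolSketch {
    private final Map<String, Long> committed = new HashMap<>();
    // Only one connection may be between pre-commit and commit/rollback.
    private final ReentrantLock preCommitLock = new ReentrantLock();
    private long nextTempId = -1; // assumption: temporary IDs are negative

    Connection openConnection() { return new Connection(); }

    synchronized Long committedId(String value) { return committed.get(value); }

    private synchronized long newTempId() { return nextTempId--; }

    private synchronized int committedCount() { return committed.size(); }

    private synchronized void write(Map<String, Long> entities) { committed.putAll(entities); }

    final class Connection {
        private final Map<String, Long> tempIds = new LinkedHashMap<>();
        private final Map<String, Long> pending = new LinkedHashMap<>();

        /** Returns the committed ID if one is visible, else a connection-local temporary ID. */
        long add(String value) {
            Long id = committedId(value);
            if (id != null) {
                return id;
            }
            return tempIds.computeIfAbsent(value, v -> newTempId());
        }

        /** Fixes the IDs: returns the temporary -> permanent mapping without writing anything. */
        Map<Long, Long> preCommit() {
            preCommitLock.lock(); // held until commit() or rollback()
            Map<Long, Long> tempToPermanent = new HashMap<>();
            long next = committedCount() + 1L;
            for (Map.Entry<String, Long> e : tempIds.entrySet()) {
                // Another connection may have committed the same string in the meantime.
                Long existing = committedId(e.getKey());
                long permanent = (existing != null) ? existing : next++;
                if (existing == null) {
                    pending.put(e.getKey(), permanent);
                }
                tempToPermanent.put(e.getValue(), permanent);
            }
            return tempToPermanent;
        }

        /** Writes the fixed entities and makes them visible to other connections. */
        void commit() {
            if (!preCommitLock.isHeldByCurrentThread()) {
                throw new IllegalStateException("call preCommit() first");
            }
            write(pending);
            finish();
        }

        /** Discards everything added through this connection. */
        void rollback() { finish(); }

        private void finish() {
            tempIds.clear();
            pending.clear();
            if (preCommitLock.isHeldByCurrentThread()) {
                preCommitLock.unlock();
            }
        }
    }
}
```

In this sketch, a rolled-back connection leaves no trace in `committed`, so the next transaction reuses the ID that would otherwise have been wasted. A connection that added a string committed elsewhere in the meantime resolves it to the already written ID during its own pre-commit.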
Making the entity pool transactional fixes many issues related to creating IDs, but it also adds some overhead: the entities must first be pre-processed, which takes time, and only then can the usual commit work be done, including adding the entities to the permanent store. The new entity pool therefore cannot be faster than the previous non-transactional one. For this reason there are two implementation options: a truly transactional one, which does everything described above, and a classic entity pool, which avoids the overhead. The classic one still uses connections and the same entity pool interface, but the entities are added directly when statements are added and cannot be rolled back, i.e., it behaves just like the entity pool in previous GraphDB versions. The classic entity pool is the default.
The entity pool implementation can be chosen with the 'entity-pool-implementation' config parameter or with the -D command line parameter of the same name. There are two possible values: 'transactional-simple' and 'classic'.
In 'transactional-simple' mode, all entities of a transaction are kept in memory, which should be kept in mind when dealing with large transactions. The memory dedicated to the transaction is released only after the entities have been committed to disk. Transactions of up to 100 million statements should be fine, but much larger ones will put severe pressure on memory. Such bulk loads should be run with the 'classic' option to prevent OutOfMemoryErrors.
On the other hand, a large number of small transactions will perform the same as before, because the general per-transaction overhead is much larger than the overhead introduced by the transactional entity pool. In that case the 'transactional-simple' option should be preferred over 'classic', especially in a cluster environment.
| Parameter | Value | Notes |
|---|---|---|
| entity-pool-implementation | transactional-simple | all entities are kept in memory |
| entity-pool-implementation | classic (default) | for bulk loads |
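When the parameter is passed on the command line as `-Dentity-pool-implementation=transactional-simple`, any JVM program sees it as a system property. The following sketch shows how such a value could be read and validated, falling back to the documented default. `EntityPoolConfigSketch` and `resolveMode` are hypothetical names, and this is not how GraphDB itself necessarily reads the setting.

```java
// Hypothetical helper; illustrates reading the -D parameter as a JVM system property.
final class EntityPoolConfigSketch {
    static final String PARAM = "entity-pool-implementation";

    /** Returns the configured mode, or "classic" (the documented default) if unset. */
    static String resolveMode() {
        String mode = System.getProperty(PARAM, "classic");
        if (!mode.equals("classic") && !mode.equals("transactional-simple")) {
            throw new IllegalArgumentException("Unknown " + PARAM + " value: " + mode);
        }
        return mode;
    }
}
```

Rejecting unknown values early is a design choice of this sketch, so that a mistyped mode fails at startup rather than silently falling back to the default.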