The entity pool is a key component of the GraphDB storage layer. Since version 6.2 there is a new transactional implementation, which improves space usage and cluster behavior.
The entity pool stores the strings of all RDF values: URIs, BNodes and Literals. We don't usually deal with the strings themselves but with their internal representation, which is a 32-bit or 40-bit ID. The IDs are used during inference and query evaluation whenever the RDF values themselves are not explicitly required. The strings are needed in two cases: by FILTER and BIND clauses that perform SPARQL string operations, and when a result is found, at which point we read the strings back from the entity pool in order to return that result. In all other cases IDs are enough. Reaching a result takes a great many iterations, and comparing IDs is much faster than comparing strings, so it makes sense to work with IDs rather than strings. The entity pool can be thought of as a file-based table that maps strings to IDs and vice versa.
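The two-way mapping described above can be sketched as follows. This is only an in-memory illustration, not the GraphDB implementation: the class name `EntityTable`, its methods, and the choice to number IDs from 1 are assumptions made for the example.

```python
class EntityTable:
    """In-memory stand-in for the file-based string <-> ID table (hypothetical)."""

    def __init__(self):
        self._id_by_value = {}  # string -> ID
        self._value_by_id = []  # position (ID - 1) -> string

    def get_or_create_id(self, value: str) -> int:
        """Return the ID of a known string, or assign the next free ID."""
        existing = self._id_by_value.get(value)
        if existing is not None:
            return existing
        self._value_by_id.append(value)
        new_id = len(self._value_by_id)  # IDs start at 1 in this sketch
        self._id_by_value[value] = new_id
        return new_id

    def value_of(self, entity_id: int) -> str:
        """Read the string back, as is done when a result is returned."""
        return self._value_by_id[entity_id - 1]


pool = EntityTable()

# A statement is held as a tuple of IDs, not of strings.
statement = tuple(
    pool.get_or_create_id(value)
    for value in (
        "http://example.org/alice",
        "http://xmlns.com/foaf/0.1/knows",
        "http://example.org/bob",
    )
)

# Comparisons run on the integers; strings are read back only for the result.
result = [pool.value_of(entity_id) for entity_id in statement]
```

Adding the same string twice returns the same ID, which is what lets statements be compared by their IDs alone.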
Given that the entity pool has to perform mappings while data is being loaded, and that all statements coming from the inferencer have to be materialized before the end of the transaction, a natural question follows: what happens if, for some reason, the transaction is terminated prematurely, say by an exception originating within the transaction itself or by an external interruption? Until now, the newly created entities simply remained in the entity pool, because they were created at the moment a statement was added and not when the transaction was committed.
As a result, failed transactions and deliberately rolled-back ones kept the entity pool growing. This is because the entity pool was the only part of the data storage that remained non-transactional (apart from the plugins, which may or may not be transactional, but they are not part of the storage). Making it transactional is a big step forward: any data that is to be thrown away really disappears, without leaving traces that would otherwise consume disk space unnecessarily. Of course, it's not only about disk space, which is plentiful nowadays, but also about making sure that the same strings map to the same IDs, which is crucial in a cluster environment. During replications (partial or full) and ordinary commits, which take place asynchronously between worker nodes, we always transmit the new entities separately from the data, and it would be a disaster if an entity were missed or an extra one were added unexpectedly within the transaction on some worker node. This would introduce a shift in the IDs and, because all new data is shared between the worker nodes in the form of quads of IDs, all data after a given ID would be rendered meaningless. This is the primary reason for making the entity pool transactional: to keep the data on all worker nodes in sync at all times. The demand originates from the use of the connectors in a cluster environment, where IDs frequently fell out of sync.
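The ID shift can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the actual replication protocol: each worker's entity pool is modelled as a plain list, an entity's ID is assumed to be its 1-based position in that list, and the `ex:` names are made up.

```python
def append_entities(pool, new_entities):
    """Append new entity strings; an entity's ID is its 1-based position."""
    pool.extend(new_entities)


def decode(pool, quad):
    """Translate a quad of IDs back into strings using one worker's pool."""
    return tuple(pool[entity_id - 1] for entity_id in quad)


# Both workers start with identical entity pools.
worker_a = ["ex:alice", "ex:knows"]
worker_b = ["ex:alice", "ex:knows"]

# A rolled-back transaction leaves a stray entity on worker B only.
worker_b.append("ex:leftover")

# The next commit ships its new entities separately from the data ...
new_entities = ["ex:bob", "ex:graph1"]
append_entities(worker_a, new_entities)
append_entities(worker_b, new_entities)

# ... and the data itself as a quad of IDs, numbered as on worker A.
quad = (1, 2, 3, 4)  # alice knows bob, in graph1

on_a = decode(worker_a, quad)  # the intended statement
on_b = decode(worker_b, quad)  # shifted by one from ID 3 onwards
```

On worker B, ID 3 now points at the leftover entity, so this quad and every later one that uses an ID of 3 or above decodes to the wrong strings.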
The new implementation is accessed through a connection, and all data written through a connection is invisible to the other connections until commit. We can open several connections simultaneously and add any entities we want, some of which may even be duplicates, but the IDs we receive for all of them are temporary. This means that the same entity added twice in one connection maps to the same temporary ID, while the same entity added in different connections does not necessarily share an ID. The entity->ID and ID->entity mappings of new entities are accessible only through the connection that was used to add them. All entities committed so far are visible through every new connection. All connections can be used independently and asynchronously. However, committing data takes two stages, pre-commit and commit, because temporary IDs can't be used to represent real statements in the storage or to perform inference over them.
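The isolation of temporary IDs can be sketched as below. This is a minimal model, not the real interface: the class and method names are hypothetical, and using negative numbers for temporary IDs is purely an assumption of the example to keep them apart from permanent ones.

```python
import itertools


class EntityPool:
    """Holds committed entities and hands out temporary IDs (hypothetical)."""

    def __init__(self):
        self.committed = {}  # string -> permanent ID, visible to everyone
        # Assumption: temporary IDs are negative so they never clash
        # with permanent ones.
        self._temp_ids = itertools.count(-1, -1)

    def next_temp_id(self):
        return next(self._temp_ids)

    def connect(self):
        return Connection(self)


class Connection:
    def __init__(self, pool):
        self._pool = pool
        self._pending = {}  # string -> temporary ID, private to this connection

    def add(self, value):
        """Return the permanent ID if committed, else a temporary one."""
        if value in self._pool.committed:
            return self._pool.committed[value]
        if value not in self._pending:
            self._pending[value] = self._pool.next_temp_id()
        return self._pending[value]

    def lookup(self, value):
        """Resolve a string using committed data plus this connection's own."""
        if value in self._pool.committed:
            return self._pool.committed[value]
        return self._pending.get(value)


pool = EntityPool()
c1, c2 = pool.connect(), pool.connect()

first = c1.add("ex:alice")
again = c1.add("ex:alice")  # same connection: same temporary ID
other = c2.add("ex:alice")  # other connection: a different temporary ID
c1.add("ex:bob")
hidden = c2.lookup("ex:bob")  # c1's pending entity is invisible to c2
```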
The pre-commit stage fixes the IDs, so that they become permanent in the sense that statements using them can be inserted and inference can be performed, but the entities are not yet materialized in the entity pool. Entities in this stage are still invisible to the other connections. The reason for having such a stage is that we don't want to write the entities until we are sure that the transaction will succeed. Most of the work takes place while committing statements, so the storage, the inferencer and the plugins are all involved in this process, and most problems are caused by erroneous data or surface during consistency checking at the end of the commit. These operations are very time-consuming, and to prevent data inconsistencies after transaction failures we want to make sure that everything will be processed normally before we write the entities from the current entity pool connection to the entity pool. This is what the pre-commit stage is all about: fixing the entity IDs (mapping them from temporary to permanent ones) so that the storage, inference, plugin and consistency checking operations can proceed as if the IDs were already written to the entity pool. In case of failure the transaction is rolled back and all its entities are thrown away. While one connection is fixing its entity IDs in the pre-commit stage, no other connection can enter that stage, because we still don't know whether the transaction will be committed or rolled back, so we can't fix the IDs in the other connections and guarantee that they won't change over time. Only one transaction at a time can therefore be in the pre-commit stage (this has always been the case, so there is nothing to worry about).
The commit phase finalizes the entity pool transaction and writes the entities to disk. From now on these entities will always appear with the IDs with which they were written, so any entity pool connection that starts a transaction afterwards will see them with their respective IDs. If an entity that has just been written to the entity pool (a new entity) is also present in another connection that is waiting to be committed, it will be resolved within that connection's transaction to the ID with which it has just been written, so there is no need to worry about duplicate strings in the entity pool. An entity can be written only once.
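The two stages can be put together in a small model. This is a sketch of the behaviour described above, not the actual implementation: the class and method names, the in-memory dictionary standing in for the on-disk pool, and the use of a lock to serialize the pre-commit stage are all assumptions of the example.

```python
import threading


class EntityPool:
    def __init__(self):
        self.id_by_value = {}  # committed entities: string -> permanent ID
        # Only one connection may be in the pre-commit stage at a time.
        self.precommit_lock = threading.Lock()


class Connection:
    def __init__(self, pool):
        self.pool = pool
        self.pending = []  # new strings, in insertion order
        self.fixed = None  # string -> permanent ID, set by precommit()

    def add(self, value):
        if value not in self.pool.id_by_value and value not in self.pending:
            self.pending.append(value)

    def precommit(self):
        """Fix permanent IDs without writing anything to the pool yet."""
        if self.fixed is not None:
            return self.fixed
        self.pool.precommit_lock.acquire()
        next_id = len(self.pool.id_by_value) + 1
        self.fixed = {}
        for value in self.pending:
            if value in self.pool.id_by_value:
                # Written by another connection in the meantime: reuse its ID.
                self.fixed[value] = self.pool.id_by_value[value]
            else:
                self.fixed[value] = next_id
                next_id += 1
        return self.fixed

    def commit(self):
        """Write the fixed entities; from now on every connection sees them."""
        if self.fixed is None:
            raise RuntimeError("precommit() must be called before commit()")
        self.pool.id_by_value.update(self.fixed)
        self._finish()

    def rollback(self):
        """Throw the pending entities away; the pool stays untouched."""
        self._finish()

    def _finish(self):
        was_in_precommit = self.fixed is not None
        self.pending, self.fixed = [], None
        if was_in_precommit:
            self.pool.precommit_lock.release()


pool = EntityPool()
c1, c2 = Connection(pool), Connection(pool)
c1.add("ex:alice")
c1.add("ex:bob")
c2.add("ex:bob")
c2.add("ex:carol")

ids1 = dict(c1.precommit())
c1.commit()
ids2 = dict(c2.precommit())  # "ex:bob" resolves to the ID c1 has just written
c2.rollback()                # "ex:carol" is discarded and leaves no trace
```

After the rollback the pool contains only the two entities committed by the first connection, and the ID that was reserved for the discarded entity is free to be assigned again by a later transaction.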
Making the entity pool transactional fixes many issues related to creating IDs, but it also adds some overhead: we now have to pre-process the entities, which takes time, and then do the usual commit work, including adding the entities to the permanent store. The new entity pool therefore can't be faster than the previous non-transactional one, so there are two implementation options: a truly transactional one, which does everything described above, and a classic entity pool, which avoids the overhead. The classic one still uses connections and the same entity pool interface, but the entities are added directly when statements are added and can't be rolled back, i.e. it behaves just like the entity pool in previous GraphDB versions. The default is the classic entity pool.
The entity pool implementation is chosen with the 'entity-pool-implementation' config parameter or with the -D command line parameter of the same name. There are two possible values: 'transactional-simple' and 'classic'. In 'transactional-simple' mode all entities are kept in memory, which one should keep in mind when dealing with large transactions. The memory dedicated to the transaction is released only after the entities have been committed to disk. Transactions of up to 100 million statements should be fine, but much larger ones will put severe pressure on memory. Such bulk loads should be run with the 'classic' option to prevent OutOfMemoryErrors. A large number of small transactions, on the other hand, will perform the same as before, because the general per-transaction overhead is much larger than the overhead introduced by the transactional entity pool. In that case the 'transactional-simple' option should be preferred over the 'classic' one, especially in a cluster environment.
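As a small illustration of the command line form, the helper below builds the -D argument from the parameter name and the two values given above. The helper itself and its name are hypothetical; it only assumes the usual Java `-Dname=value` syntax for system properties.

```python
# The two values named in the text; 'classic' is the stated default.
VALID_IMPLEMENTATIONS = ("classic", "transactional-simple")


def entity_pool_jvm_flag(implementation: str = "classic") -> str:
    """Build the -D command line parameter for the chosen implementation."""
    if implementation not in VALID_IMPLEMENTATIONS:
        raise ValueError(
            f"unknown entity pool implementation: {implementation!r}"
        )
    return f"-Dentity-pool-implementation={implementation}"


flag = entity_pool_jvm_flag("transactional-simple")
```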