OWLIM Frequently Asked Questions
OWLIM is a semantic repository - a software component for storing and manipulating huge quantities of RDF data. OWLIM is packaged as a Storage and Inference Layer (SAIL) for the Sesame OpenRDF framework (http://www.aduna-software.com/technology/sesame).
A semantic repository is a software component for storing and manipulating RDF data. It is made up of three distinct components:
The name originally comes from the term "OWL In Memory" and is fitting for what became OWLIM-Lite. However, OWLIM-SE uses a transactional, index-based file-storage layer where "In Memory" is no longer appropriate. Nevertheless, the name has stuck and it is seldom that anyone ever asks where it came from...
OWLIM is packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF framework (http://www.aduna-software.com/technology/sesame). OWLIM can be used in two different ways:
One approach is to use it as a library, an example of which is provided in the release distribution that can be started by using 'example.cmd' in the 'getting-started' folder.
Another approach is to download the full version of Sesame and configure OWLIM-SE as a plug-in. This method uses the Sesame HTTP server hosted in Tomcat (or similar), so that Sesame together with OWLIM can be used as a server application, accessed via the standard Sesame APIs.
Sesame version 2.2 onwards includes the Sesame Workbench - a convenient Web Application for managing repositories, importing/exporting RDF data, executing queries, etc. For more information please check the "doc" folder of the OWLIM-SE archive.
OWLIM-Lite and OWLIM-SE are identical in terms of usage and integration for storing and managing RDF data. They share the same inference mechanisms and semantics (rule-compiler, etc). The different editions of OWLIM use different indexing, inference, and query evaluation implementations, which results in different performance, memory requirements, and scalability.
OWLIM-Lite is designed for medium data volumes (below 100 million statements) and for prototyping. Its key characteristics are as follows:
OWLIM-SE is suitable for handling massive volumes of data and very intensive querying activities. It is designed as an enterprise-grade database management system. This has been made possible through:
See http://ontotext.com/owlim/version-map.html for more details.
All editions of OWLIM support:
The SPARQL 1.1 Graph Store Protocol will be supported in subsequent versions of OWLIM.
There are several types of indices available, *all* of which apply to *all* triples, whether explicit or implicit. These indices are maintained automatically.
The main indexes that are always used are:
There are other optional indices and these have advantages for specific datasets, retrieval patterns and query loads. These are switched off by default.
For some datasets, or when executing queries containing triple patterns with a wild-card for the predicate, a pair of indices can be used that map from resources (subject, object) to predicate, i.e.
This pair of indices is known as 'predicate lists', see enablePredicateList in the user guide.
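As a purely conceptual illustration of what a predicate list provides, the following Python sketch builds two in-memory maps, one from subjects to the predicates they use and one from objects to the predicates that point at them. This is not OWLIM's actual storage structure; the function name and the sample triples are made up for the example.

```python
from collections import defaultdict

# Toy triples; the resource names are illustrative only.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
]


def build_predicate_lists(triples):
    """Return (subject -> predicates, object -> predicates) maps."""
    by_subject = defaultdict(set)
    by_object = defaultdict(set)
    for s, p, o in triples:
        by_subject[s].add(p)
        by_object[o].add(p)
    return by_subject, by_object


by_subject, by_object = build_predicate_lists(triples)

# For a pattern such as (ex:alice ?p ?o), only these predicates need to be
# consulted, rather than every predicate known to the repository.
print(sorted(by_subject["ex:alice"]))  # ['ex:knows', 'ex:worksFor']
print(sorted(by_object["ex:acme"]))    # ['ex:worksFor']
```

The benefit is largest when a dataset has many distinct predicates but each resource uses only a few of them.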
For more efficient processing of named graphs, two other indexes can be used:
These can be switched on using the enable-context-index parameter.
There are also several variations on full-text-search indexes for both Node Search and Lucene-based RDF Search. Details of these can be found in the user guide.
There is no simple answer to this question, since it depends on reasoning complexity (how many inferred triples), how long the URIs are, what additional indices are used, etc. As an example, the following table shows the disk space requirement in bytes per explicit statement when loading the WordNet dataset with various OWLIM-SE configurations:
When planning storage capacity based on input RDF file size, the result depends not only on the OWLIM-SE configuration, but also on the RDF file format used and the complexity of its contents. The following table gives a rough estimate of the expansion to be expected from an input RDF file to OWLIM-SE storage. For example, when using OWL2-RL with all optional indices turned on, OWLIM-SE will need about 6.7 GB of storage space to load a one gigabyte N3 file, whereas with no inference ('empty') and no optional indices it will need about 0.7 GB of storage space to load a one gigabyte TriX file. Again, these results were obtained with the WordNet dataset:
Firstly, note that OWLIM-SE computes inferences as new explicit statements are committed to the repository. The number of inferred statements can be zero (when using the 'empty' rule set) or many multiples of the number of explicit statements (it depends on the chosen ruleset and the complexity of the data).
The disk space required for each statement further depends on the size of the URIs and literals, but for typical datasets around 200 bytes is required with only the default indices, up to about 300 bytes when all optional indices are turned on.
So when using the default indices, a good estimate for the amount of disk space you will need is 200 bytes per statement (explicit and inferred), i.e.
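The rule of thumb above can be turned into a small calculation. The sketch below is only an estimate under the figures quoted in this answer (about 200 bytes per statement with default indices, about 300 with all optional indices); the function name and the example statement counts are hypothetical.

```python
def estimate_disk_bytes(explicit, inferred, bytes_per_statement=200):
    """Rough disk estimate: every statement, explicit or inferred, is stored."""
    return (explicit + inferred) * bytes_per_statement


# Hypothetical workload: 100 million explicit statements and a rule set that
# happens to infer one new statement for every two explicit ones.
explicit = 100_000_000
inferred = 50_000_000

default_gb = estimate_disk_bytes(explicit, inferred) / 1e9
all_indices_gb = estimate_disk_bytes(explicit, inferred, bytes_per_statement=300) / 1e9

print(f"default indices: about {default_gb:.0f} GB")       # about 30 GB
print(f"all optional indices: about {all_indices_gb:.0f} GB")  # about 45 GB
```

The number of inferred statements is the hard part to predict; loading a representative sample with the intended rule set is the most reliable way to obtain that ratio.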
OWLIM-Lite uses a similar mechanism to OWLIM-SE for maintaining a dictionary of internal identifiers that map to resources. This allows the internal identifiers to be used for indexing statements rather than using the full URI/blank-node/literal for every statement where a resource is referenced. Every unique resource requires some disk storage space and a further 12 bytes in RAM.
Explicit statements are stored in RAM and require 13 bytes per statement. Implicit statements require 8 bytes per statement.
Assuming roughly 3 unique explicit statements to every 1 unique resource, the total memory requirement will vary from approximately 17 bytes per explicit statement (empty rule-set) to around 40 bytes per statement (expressive rule-sets). However, this will also vary depending on the 'geometry' of the input data and the extent to which complex (OWL) language features are used.
The above figures give the memory footprint for storing statements. In order to be useful, an OWLIM-Lite instance needs more memory for maintaining these data structures, executing queries, loading and inferencing. A safe figure is to allow twice as much memory, i.e. the storage space of the statements + resources multiplied by 2. Therefore, for practical purposes OWLIM-Lite will need between 34 and 80+ bytes per statement.
OWLIM-Lite is only capable of storing up to 1 billion unique resources. This sets a practical limit of approximately 3 billion explicit statements (assuming 3:1 ratio). Using the above metrics, storing this much data will require between 100 and 240 GB of RAM (or more).
It should be noted that these are theoretical limits and Ontotext have not attempted to load this much data into an OWLIM-Lite instance.
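The arithmetic behind these OWLIM-Lite figures can be reproduced with a short sketch. It uses only the numbers quoted above (12 bytes per resource, 13 per explicit statement, 8 per implicit statement, a doubling for working memory); the function names and the inferred_per_explicit parameter are assumptions introduced for illustration.

```python
BYTES_PER_RESOURCE = 12   # RAM per unique resource
BYTES_PER_EXPLICIT = 13   # RAM per explicit statement
BYTES_PER_IMPLICIT = 8    # RAM per implicit (inferred) statement


def storage_bytes_per_explicit(statements_per_resource=3.0, inferred_per_explicit=0.0):
    """Storage RAM attributable to one explicit statement."""
    resource_share = BYTES_PER_RESOURCE / statements_per_resource
    implicit_share = BYTES_PER_IMPLICIT * inferred_per_explicit
    return BYTES_PER_EXPLICIT + resource_share + implicit_share


def working_ram_bytes(explicit_statements, storage_per_statement, overhead_factor=2.0):
    """Apply the 'allow twice as much memory' rule to the storage figure."""
    return explicit_statements * storage_per_statement * overhead_factor


empty_ruleset = storage_bytes_per_explicit()
print(empty_ruleset)  # 17.0, matching the empty rule-set figure

# Upper bound discussed above: about 3 billion explicit statements.
low_gb = working_ram_bytes(3_000_000_000, 17) / 1e9
high_gb = working_ram_bytes(3_000_000_000, 40) / 1e9
print(low_gb, high_gb)  # 102.0 240.0
```

The 40 byte figure for expressive rule-sets is taken directly from the text rather than derived, since the number of inferred statements per explicit statement depends on the data.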
Yes. Both OWLIM-Lite and OWLIM-SE can process queries concurrently.
Furthermore, when OWLIM-SE is used in a cluster configuration, the throughput of parallel query answering can be scaled (almost) linearly by adding more nodes.
OWLIM supports the read-committed isolation level, i.e. pending updates are not visible to other connected users until the complete update transaction has been committed. However, for efficiency reasons and unlike typical relational database behaviour, uncommitted changes are not 'visible' even using the connection that made the updates.
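The behaviour described above can be pictured with a toy model. The classes below are hypothetical and have nothing to do with the Sesame or OWLIM APIs; they only mimic the visibility rule, namely that reads see committed data only, even on the connection that holds the pending changes.

```python
class ToyStore:
    """Holds the committed statements shared by all connections."""

    def __init__(self):
        self.committed = set()


class ToyConnection:
    """Buffers additions until commit; reads never look at the buffer."""

    def __init__(self, store):
        self.store = store
        self.pending = set()

    def add(self, statement):
        self.pending.add(statement)

    def has(self, statement):
        # Only committed data is consulted, so pending additions are
        # invisible even to the connection that made them.
        return statement in self.store.committed

    def commit(self):
        self.store.committed |= self.pending
        self.pending.clear()


store = ToyStore()
writer = ToyConnection(store)
reader = ToyConnection(store)
stmt = ("ex:alice", "ex:knows", "ex:bob")

writer.add(stmt)
before = (writer.has(stmt), reader.has(stmt))
writer.commit()
after = (writer.has(stmt), reader.has(stmt))

print(before)  # (False, False)
print(after)   # (True, True)
```

In practice this means application code that needs to read its own writes should commit first and query afterwards.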
Yes. Unlike relational databases, a semantic database needs to conduct inference for inserted and deleted statements. This involves making highly unpredictable joins using statements anywhere in the indices for all new/deleted statements. Despite paging as best as possible, a large number of disk seeks can be expected and SSDs perform far better than HDDs in this task.
The difference between performance on SSDs and HDDs is most pronounced when OWLIM-SE is running in the safe transaction mode (the default). In fast mode, updated pages are not flushed to disk at the end of a commit operation - instead they are only swapped to disk when the cache memory is exhausted. In safe mode, however, all updated pages are flushed to disk before a commit operation returns, and the higher number of writes causes a small slow-down when using SSDs, but a large slow-down when using HDDs. In performance tests using LUBM(1000) (approximately 135 million statements) with a 1 GB cache, the gap in load times between fast and safe modes differs markedly between SSDs and HDDs:
The recommended transaction mode is 'safe'. However, if only HDDs are available then there is the option of using the 'fast' transaction mode with the corresponding increase in performance and the increased risk of data loss in the event of an abnormal termination. Switching modes is straightforward and just requires a restart of OWLIM.
RAID-0 gives good performance, but is more likely to suffer problems due to disk failure. RAID-5 is a good balance between resilience/redundancy and cost. Using SSDs, we have (little more than anecdotal) evidence that RAID-0 is fast, RAID-1 is slower, and RAID-5 is slower with fewer than 4 disks and about the same as RAID-0 with 4 or more disks.
When using a LocalRepositoryManager, Sesame will store the configuration data for repositories in its own 'SYSTEM' repository. A Tomcat instance will do the same and you will see 'SYSTEM' under the list of repositories that the instance is managing. To see what configuration data is stored, connect to the SYSTEM repository and execute the following query:
This will return the repository ID and type, followed by name-value pairs of configuration data for SAIL repositories, including the SAIL type - "owlim:Sail" for OWLIM-SE and "swiftowlim:Sail" for OWLIM-Lite. OWLIM-Enterprise master nodes are not SAIL repositories and have the type "owlim:ReplicationCluster".
If you uncomment the FILTER clause you can substitute a repository id to get the configuration just for that repository.
There is no easy generic way of changing the configuration - it is stored in the SYSTEM repository created and maintained by Sesame. However, OWLIM allows overriding of these parameters by specifying the parameter values as JVM options. For instance, by passing -Dcache-memory=1g option to the JVM, OWLIM-SE will read it and use its value to override whatever was configured by the .ttl file. This is convenient for temporary set-ups that require easy and fast configuration change, e.g. for experimental purposes.
Changing the configuration in the SYSTEM repository is trickier, because the configurations are usually structured using blank node identifiers - which are always unique, so attempting to modify a statement with a blank node by using the same blank node identifier will fail. However, this can be achieved with SPARQL UPDATE using a DELETE-INSERT-WHERE command as follows:
Modify the last three lines of the update command to specify the repository ID, the parameter and the new value. Then execute it against the SYSTEM repository. In this example, the enable-context-index parameter is changed, but there are other parameters that cannot be changed once the repository is created, e.g. the rule-set (in the case of OWLIM-SE).
A restart of Sesame/OWLIM is required. If deployed using Tomcat then the easiest way is just to restart Tomcat itself.
Using OWLIM-Lite, the rule-set parameter can be changed like any other parameter. After the repository is restarted, insert and delete a statement in order to trigger a full re-computation of the inferred statements.
Using OWLIM-SE or OWLIM-Enterprise, this is more difficult. It will be necessary to export/backup all explicit statements and recreate a new repository with the required rule-set. Once created, the explicit statements exported from the old repository can be imported in to the new one.
For an existing repository that has already been used:
Both OWLIM-SE and OWLIM-Enterprise worker nodes require license files for long term use. These can be obtained from Ontotext. When purchasing OWLIM-SE, you will receive one license file. When purchasing OWLIM-Enterprise, you will receive a license file for the worker nodes. Master nodes do not require a license file. License files should be stored where they are accessible to the processes that need to read them to validate the software, i.e. Tomcat instances or application software that embeds OWLIM.
When installing OWLIM-SE or OWLIM-Enterprise worker nodes, the license file can be set in several ways:
The maximum CPU count is checked during OWLIM initialisation by checking for the number of CPU cores available to the Java Virtual Machine (JVM) in which OWLIM is running. If the number of CPU cores available is greater than specified in the license file then OWLIM will still run, but it will throttle itself to use the equivalent of the licensed number of CPU cores.
OWLIM can be embedded in a user application or deployed using Tomcat/Sesame. In both cases, the user should restrict the number of CPUs available to the JVM in which OWLIM is running. The method to do this depends upon the operating system:
When deploying with Tomcat/Sesame, the easiest method is to edit the startup script(s) for Tomcat using the above modifications. The scripts can be found in a variety of locations depending on the operating system, distribution and installation method. Some command locations are:
This can be achieved on the command line using a repository configuration file (usually in Turtle format) and curl. The following steps must be followed:
Why am I getting this exception: java.lang.NoSuchMethodError: org.apache.lucene.queryParser.QueryParser?
You must add the Lucene jar file to the Java classpath (the Lucene core jar is included with the distribution). OWLIM-SE 4 is known to work properly with Lucene 3.0.
I am getting this exception: java.lang.NoClassDefFoundError: Could not initialize class com.infomatiq.jsi.rtree.RTreeWithCoords
The jsi, log4j, sil and trove4j jar files (included with the distribution) must be added to the classpath.
Each rule set defines both rules and some schema statements, otherwise known as axiomatic triples. These (read-only) triples are inserted into the repository at initialisation time and count towards the total number of reported 'explicit' triples. The variation may be up to the order of hundreds depending upon the rule set.
Statements that were added during repository initialisation, either because they are asserted in rule files or because they were loaded using the "imports" parameter, are marked read-only. Having read-only statements (especially schema definition statements) is one way to ensure that 'smooth delete' can operate very quickly.
This problem will manifest itself in many ways after deploying the Sesame/OWLIM war files to Tomcat's webapps directory. If you are unable to set the server URL in the Workbench then this is an indication that the problem has occurred. It will likely be due to a permissions problem on the logging directory for the openrdf-sesame server. To check this, point your browser directly at the Sesame server with a URL similar to the following:
If you receive a stack trace containing the following:
then this indicates that Tomcat does not have write permission to its data directory (where it stores configuration, logs and actual repository data). To fix this, log in as root to the server machine and do the following:
Now when you use the server URL in your browser you should see the Sesame server welcome screen.
In order to preserve the context (named graph) when exporting/importing the whole database, a context-aware RDF file format must be used, e.g. TriG. After serialising the repository to a file with this format (this can be done through the Sesame workbench Web application) the file can be imported with the following steps:
The TriX format (an XML-based context-aware RDF serialisation) can also be used.
There is no facility at present to seamlessly back up a repository while it is running. However, several options are available.
The simplest method (that works for a running system) is to export the database contents using the Sesame Workbench. If you want to preserve contexts then choose a suitable output format. However, this can be memory intensive.
Perhaps the best method to make a backup uses the graph store protocol and 'curl'. This can be achieved on the command line in a single step using the graph store protocol (change the repository URL and name of the export file accordingly):
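For readers who prefer a script to the command line, a rough Python equivalent of such a request is sketched below. It assumes the repository's statements are served at the `/statements` path of the repository URL by the Sesame HTTP server and that the server accepts `application/x-trig` as the MIME type for TriG; both should be checked against the Sesame version in use. The server address, repository ID and function names are hypothetical.

```python
import shutil
import urllib.request


def build_export_request(repository_url, mime_type="application/x-trig"):
    """Build a GET request for all statements of a repository."""
    return urllib.request.Request(
        repository_url.rstrip("/") + "/statements",
        headers={"Accept": mime_type},  # assumed MIME type for TriG
        method="GET",
    )


def export_repository(repository_url, out_path):
    """Stream the server response to a file without buffering it in memory."""
    request = build_export_request(repository_url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        shutil.copyfileobj(response, out)


# Hypothetical server location and repository ID.
request = build_export_request(
    "http://localhost:8080/openrdf-sesame/repositories/myrepo"
)
print(request.full_url)
```

Calling `export_repository(...)` with the same URL and a file name such as `export.trig` would perform the actual download; it is not invoked here because it needs a running server.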
A backup can also be done programmatically using the Sesame API. See the RepositoryConnection.exportStatements() method and the example in the next question.
If it is possible to shutdown the repository then a backup can be effected by copying the OWLIM storage directory (and any sub-directories). See the installation section for information about where OWLIM storage folders are located. To restore a repository from a back up, make sure the repository is not running and then replace the entire contents of the storage directory (and any sub-directories) with the backup. Then restart the repository and check the log file to ensure a successful start up.
The Sesame openRDF workbench Web application has an export function that can be used to export the contents of moderately sized repositories. However, using this with large repositories (a hundred million statements or more) causes problems, usually time-outs for the Servlet container (Tomcat) hosting the application. Also, the workbench cannot be used when using OWLIM-SE without Tomcat.
While most versions of OWLIM are backward compatible, some major version number increases use such different data structures that disk images can no longer be automatically updated to the latest version.
The basic procedure is to export the RDF data from the old version of OWLIM-SE and then reload it in to a new repository instance that uses the new version of OWLIM-SE. Exporting is straightforward when using the Sesame workbench – simply click the 'Export' button, choose the format and click 'download'. To import in to a new repository, click 'add', select a format, specify the file and base URI, then click 'Upload'.
The XML parser will generate an error similar to the following:
when it generates more than a specified number of 'entities'. The default limit for the built-in Java XML parser is 64,000, however it can be configured by using a Java system property. To increase the limit, pass the following to the JVM in which OWLIM/Sesame is running. Note that the actual value can be increased as necessary. Don't forget that if running in Tomcat then this must be passed to the Tomcat instance using the CATALINA_OPTS environment variable.
There might be subtle differences between versions of OWLIM-SE that mean that exporting and re-importing explicit statements from the older version of the repository is the best and safest means to upgrade. However, this can be lengthy with large databases.
Probably the fastest way to upgrade is to make a binary copy of the OWLIM storage folder and use the new version 'on top'. This will cause it to automatically upgrade the file formats (there are minor differences between versions) and from then on it should run fine with the new version.
1. Use the new version of OWLIM-SE to create an empty repository (with the right configuration, e.g. ruleset)
There is a good chance that it will take quite a long time to initialise as the storage files are modified, but it should be quicker than re-importing all the data.
In general RDF data can be loaded into a given Sesame repository using the 'load' command in the Sesame console application or directly through the workbench web application. However, neither of these approaches will work when using a very large number of triples, e.g. a billion statements. A common solution would be to convert the RDF data into a line-based RDF format (e.g. N-triples) and then split it into many smaller files (e.g. using the linux command 'split'). This would allow each file to be uploaded separately using either the console or workbench applications.
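Where the 'split' command is not available, the same idea can be sketched in Python. The function below cuts a line-based RDF file into numbered chunks of a fixed number of lines; the function name, chunk naming scheme and sample data are made up for the example.

```python
import os
import tempfile


def split_ntriples(path, lines_per_chunk, out_dir):
    """Split a line-based RDF file into numbered chunk files."""
    chunk_paths = []
    out = None
    with open(path, "r", encoding="utf-8") as src:
        for index, line in enumerate(src):
            if index % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk lines.
                if out is not None:
                    out.close()
                name = f"chunk-{len(chunk_paths):05d}.nt"
                chunk_path = os.path.join(out_dir, name)
                out = open(chunk_path, "w", encoding="utf-8")
                chunk_paths.append(chunk_path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


# Small self-contained demonstration with made-up triples.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "data.nt")
with open(source, "w", encoding="utf-8") as f:
    for i in range(5):
        f.write(f"<http://example.org/s{i}> <http://example.org/p> \"{i}\" .\n")

chunks = split_ntriples(source, 2, work_dir)
print([os.path.basename(c) for c in chunks])
# ['chunk-00000.nt', 'chunk-00001.nt', 'chunk-00002.nt']
```

This works because each N-Triples statement occupies exactly one line. One caveat: blank node labels are scoped to a single file, so data that relies on blank nodes may not survive being split across files unchanged.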
There is a fix in Sesame, but I don't want to wait for the next release. How can I build a snapshot version of Sesame?
The compiled jars/zips will have 'SNAPSHOT' in their name.