OWLIM Frequently Asked Questions
- What is OWLIM?
- What is a Semantic Repository?
- Where does the name "OWLIM" come from?
- How do I use OWLIM?
- What is the difference between OWLIM-Lite and OWLIM-SE?
- What kind of SPARQL conformance is supported?
- How does OWLIM-SE index triples?
- How much disk space does OWLIM-SE require to load my dataset?
- How much disk space does OWLIM-SE need per statement?
- Can OWLIM answer queries in parallel?
- What kind of transaction isolation is supported?
- Are solid-state drives better than hard-disk drives?
- What kind of RAID set-up is best?
- How do I flush the repository contents to disk without shutting down the whole repository?
- I am getting this exception: java.lang.NoSuchMethodError: org.apache.lucene.queryParser.QueryParser
- I am getting this exception: java.lang.NoClassDefFoundError: Could not initialize class com.infomatiq.jsi.rtree.RTreeWithCoords
- Why does my repository report a different number of explicit statements when I change rule sets?
- How do I change the configuration of an OWLIM Sesame repository that was initialized through a .ttl file?
- Why can't I delete some statements?
- How can I retrieve my repository configurations from the Sesame SYSTEM repository?
- How do I set up license files for OWLIM-SE and OWLIM-Enterprise?
- How can I load a large RDF/XML file without getting an "entity expansion limit exceeded" error?
- How can I upgrade to a new version of OWLIM-SE without exporting and reimporting all my data?
What is OWLIM?
OWLIM is a semantic repository - a software component for storing and manipulating huge quantities of RDF data. OWLIM is packaged as a Storage and Inference Layer (SAIL) for the Sesame OpenRDF framework (http://www.aduna-software.com/technology/sesame).
What is a Semantic Repository?
A semantic repository is a software component for storing and manipulating RDF data. It is made up of three distinct components:
- An RDF database for storing, retrieving, updating and deleting RDF statements (triples)
- An inference engine that uses rules to infer 'new' knowledge from explicit statements
- A powerful query engine for accessing the explicit and implicit knowledge
Where does the name "OWLIM" come from?
The name originally comes from the term "OWL In Memory" and is fitting for what became OWLIM-Lite. However, OWLIM-SE uses a transactional, index-based file-storage layer where "In Memory" is no longer appropriate. Nevertheless, the name has stuck and it is seldom that anyone ever asks where it came from...
How do I use OWLIM?
OWLIM is packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF framework (http://www.aduna-software.com/technology/sesame). OWLIM can be used in two different ways:
One approach is to use it as a library. The release distribution includes an example, which can be started by running 'example.cmd' in the 'getting-started' folder.
Another approach is to download the full version of Sesame and configure OWLIM-SE as a plug-in. This method uses the Sesame HTTP server hosted in Tomcat (or similar), so that Sesame together with OWLIM can be used as a server application, accessed via the standard Sesame APIs.
Sesame version 2.2 onwards includes the Sesame Workbench - a convenient Web Application for managing repositories, importing/exporting RDF data, executing queries, etc. For more information please check the "doc" folder of the OWLIM-SE archive.
What is the difference between OWLIM-Lite and OWLIM-SE?
OWLIM-Lite and OWLIM-SE are identical in terms of usage and integration for storing and managing RDF data. They share the same inference mechanisms and semantics (rule-compiler, etc.). The different editions of OWLIM use different indexing, inference, and query evaluation implementations, which results in different performance, memory requirements, and scalability.
OWLIM-Lite is designed for medium data volumes (below 100 million statements) and for prototyping. Its key characteristics are as follows:
- reasoning and query evaluation are performed in main memory
- it employs a persistence strategy that ensures data preservation and consistency
- the loading of data, including reasoning, is extremely fast
- easy configuration
OWLIM-SE is suitable for handling massive volumes of data and very intensive querying activities. It is designed as an enterprise-grade database management system. This has been made possible through:
- file-based indices, which enable it to scale to billions of statements even on desktop machines
- special-purpose index and query optimization techniques, ensuring fast query evaluation against very large volumes of data
- optimized handling of owl:sameAs (identifier equality) to boost efficiency for data integration tasks
- efficient retraction of explicit statements and their inferences, which allows for efficient delete operations
- a range of powerful 'advanced features' including: Full text search (Node search, RDF search), ranking, selection and notifications
See http://ontotext.com/owlim/version-map.html for more details.
What kind of SPARQL conformance is supported?
All editions of OWLIM support:
- SPARQL 1.1 Update (May 2011 draft)
- SPARQL 1.1 Query (May 2011 draft)
- SPARQL 1.1 protocol (January 2010 draft)
- SPARQL 1.1 Federation extensions
The SPARQL 1.1 Graph Store Protocol will be supported in subsequent versions of OWLIM.
How does OWLIM-SE index triples?
There are several types of indices available, *all* of which apply to *all* triples, whether explicit or implicit. These indices are maintained automatically.
The main indexes that are always used are:
- predicate-object-subject (POS)
- predicate-subject-object (PSO)
There are other optional indices and these have advantages for specific datasets, retrieval patterns and query loads. These are switched off by default.
For some datasets, or when executing queries with triple patterns that have a wild-card for the predicate, a pair of indices can be used that map from entities (subject, object) to predicate, i.e.
- subject-predicate (SP)
- object-predicate (OP)
This pair of indices is known as 'predicate lists'; see the enablePredicateList parameter in the user guide.
For more efficient processing of named graphs (and triplesets), two other indexes can be used:
- predicate-context-subject-object-tripleset (pcsot)
- predicate-tripleset-subject-object-context (ptsoc)
These can be switched on using the build-pcsot and build-ptsoc parameters.
There are also several variations on full-text-search indexes for both Node Search and Lucene-based RDF Search. Details of these can be found in the user guide.
How much disk space does OWLIM-SE require to load my dataset?
There is no simple answer to this question, since it depends on the reasoning complexity (how many triples are inferred), how long the URIs are, what additional indices are used, etc. As an example, the following table shows the disk space requirement in bytes per explicit statement when loading the Wordnet dataset with various OWLIM-SE configurations:
|Configuration||Bytes per explicit statement|
|owl2-rl + all optional indices||366|
|owl-horst + all optional indices||290|
|empty + all optional indices||240|
Planning storage capacity from the size of an input RDF file depends not only on the OWLIM-SE configuration, but also on the RDF file format used and the complexity of its contents. The following table gives a rough estimate of the expansion to be expected from an input RDF file to OWLIM-SE storage. For example, when using OWL2-RL with all optional indices turned on, OWLIM-SE will need about 6.7GB of storage space to load a one gigabyte N3 file; with no inference ('empty') and no optional indices, it will need about 0.7GB of storage space to load a one gigabyte Trix file. Again, these results were obtained with the Wordnet dataset:
|owl2-rl + all optional indices||6.7||2.2||4.8||6.6||1.5||6.7|
|owl-horst + all optional indices||5.3||1.7||3.8||5.2||1.2||5.3|
|empty + all optional indices||4.4||1.4||3.1||4.3||1.0||4.4|
How much disk space does OWLIM-SE need per statement?
Firstly, note that OWLIM-SE computes inferences as new explicit statements are committed to the repository. The number of inferred statements can be zero (when using the 'empty' rule set) or many multiples of the number of explicit statements (it depends on the chosen rule set and the complexity of the data).
The disk space required for each statement further depends on the size of the URIs and literals, but for typical datasets around 200 bytes are required with only the default indices, rising to about 300 bytes when all optional indices are turned on.
So when using the default indices, a good estimate for the amount of disk space you will need is 200 bytes per statement (explicit and inferred), i.e.
- 1 million statements => ~200 Megabytes storage
- 1 billion statements => ~200 Gigabytes storage
- 10 billion statements => ~2 Terabytes storage
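The arithmetic above can be sketched as a small helper. This is only a rough illustration: the function names and the `inferred_per_explicit` ratio are assumptions made for the example, decimal units (1 MB = 10^6 bytes) are assumed, and only the 200 and 300 bytes-per-statement figures come from the text.

```python
# Rough disk-space estimate for an OWLIM-SE repository (a sketch, not a sizing tool).
# The bytes-per-statement figures are the approximate values quoted above.
BYTES_DEFAULT_INDICES = 200
BYTES_ALL_OPTIONAL_INDICES = 300


def estimate_storage_bytes(explicit_statements, inferred_per_explicit=0.0,
                           bytes_per_statement=BYTES_DEFAULT_INDICES):
    """Estimate storage for explicit plus inferred statements.

    inferred_per_explicit is a hypothetical ratio: 0.0 models the 'empty'
    rule set; real values depend on the rule set and on the data.
    """
    total_statements = explicit_statements * (1.0 + inferred_per_explicit)
    return round(total_statements * bytes_per_statement)


def human_readable(num_bytes):
    """Format a byte count using decimal units (1 MB = 10**6 bytes)."""
    for unit, size in (("TB", 10**12), ("GB", 10**9), ("MB", 10**6)):
        if num_bytes >= size:
            return f"{num_bytes / size:g} {unit}"
    return f"{num_bytes} bytes"


if __name__ == "__main__":
    for count in (10**6, 10**9, 10**10):
        print(count, "statements ->", human_readable(estimate_storage_bytes(count)))
```

The statement count passed in should include inferred statements, either directly or through the ratio, since the per-statement figure applies to both.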
Can OWLIM answer queries in parallel?
Yes. Both OWLIM-Lite and OWLIM-SE can process queries concurrently.
Furthermore, when OWLIM-SE is used in a cluster configuration, the throughput of parallel query answering can be scaled (almost) linearly by adding more nodes.
What kind of transaction isolation is supported?
OWLIM supports the read-committed isolation level, i.e. pending updates are not visible to other connected users until the complete update transaction has been committed. However, for efficiency reasons and unlike typical relational database behaviour, uncommitted changes are not 'visible' even using the connection that made the updates.
Are solid-state drives better than hard-disk drives?
Yes. Unlike relational databases, a semantic database needs to conduct inference for inserted and deleted statements. This involves making highly unpredictable joins using statements anywhere in the indices for all new/deleted statements. Despite paging as best as possible, a large number of disk seeks can be expected and SSDs perform far better than HDDs in this task.
What kind of RAID set-up is best?
RAID-0 gives good performance, but is more likely to suffer problems due to disk failure. RAID-5 is a good balance between resilience/redundancy and cost. Using SSDs, we have (little more than anecdotal) evidence that RAID-0 is fast, RAID-1 is slower, and RAID-5 is slower with fewer than 4 disks and about the same as RAID-0 with 4 or more disks.
How do I flush the repository contents to disk without shutting down the whole repository?
Commit a statement containing the special predicate http://www.ontotext.com/flush, e.g. <http://www.example.com> <http://www.ontotext.com/flush> "", possibly together with other statements (as part of a single transaction). This forces the repository contents to be flushed to disk during the next commit().
This works only in OWLIM-SE through the Sesame interface (i.e. it is not available in OWLIM-Lite, nor through the ORDI framework).
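As a minimal sketch, the flush statement could be committed from a client by wrapping it in a SPARQL update and posting it to a Sesame HTTP server. This assumes the server accepts SPARQL updates on the repository's statements endpoint and that a flush statement submitted this way is handled like one committed through the Java API; the server URL and repository ID below are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# SPARQL 1.1 Update that inserts the special flush statement from the text.
FLUSH_UPDATE = (
    'INSERT DATA { <http://www.example.com> '
    '<http://www.ontotext.com/flush> "" }'
)


def build_flush_request(server_url, repository_id):
    """Build (but do not send) a POST to the repository's statements endpoint."""
    endpoint = f"{server_url.rstrip('/')}/repositories/{repository_id}/statements"
    body = urlencode({"update": FLUSH_UPDATE}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical server URL and repository ID.
    request = build_flush_request("http://localhost:8080/openrdf-sesame", "my-repo")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```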
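I am getting this exception: java.lang.NoSuchMethodError: org.apache.lucene.queryParser.QueryParser
You must add the Lucene jar file to the Java classpath (the Lucene core jar is included with the distribution). OWLIM-SE 4 is known to work properly with Lucene 3.0.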
I am getting this exception: java.lang.NoClassDefFoundError: Could not initialize class com.infomatiq.jsi.rtree.RTreeWithCoords
The jsi, log4j, sil and trove4j jar files (included with the distribution) must be added to the classpath.
Why does my repository report a different number of explicit statements when I change rule sets?
Each rule set defines both rules and some schema statements, otherwise known as axiomatic triples. These (read-only) triples are inserted into the repository at initialisation time and count towards the total number of reported 'explicit' triples. The variation may be of the order of hundreds of statements, depending on the rule set.
How do I change the configuration of an OWLIM Sesame repository that was initialized through a .ttl file?
There is no easy generic way of changing the configuration - it is stored in the SYSTEM repository created and maintained by Sesame. However, OWLIM-SE allows those parameters to be overridden by specifying the parameter values as JVM options. For instance, if the -Dcache-memory=1g option is passed to the JVM, OWLIM-SE will read it and use its value to override whatever was configured by the .ttl file. This is convenient for temporary setups that require quick and easy configuration changes (e.g. for experimental purposes).
Why can't I delete some statements?
Statements that were added during repository initialisation, either because they are asserted in rule files or because they were loaded using the "imports" parameter, are marked read-only. Having read-only statements (especially schema definition statements) is one way to ensure that 'smooth delete' can operate very quickly.
|OWLIM-SE now allows read-only/schema statements to be modified when the repository is in a special mode. This feature allows fast delete operations while ensuring that schemas can be changed when necessary. Full details on how to do this can be found in the OWLIM-SE user guide.|
How can I retrieve my repository configurations from the Sesame SYSTEM repository?
When using a LocalRepositoryManager, Sesame will store the configuration data for repositories in its own 'SYSTEM' repository. A Tomcat instance will do the same and you will see 'SYSTEM' under the list of repositories that the instance is managing. To see what configuration data is stored, connect to the SYSTEM repository and execute the following query:
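A query along these lines might be used, shown here as a sketch embedded in a small helper that builds the request URL for a Sesame HTTP server. The property names are assumed to follow the Sesame 2 repository configuration vocabulary, and the server URL in the usage line is hypothetical.

```python
from urllib.parse import urlencode

# Assumed vocabulary: the Sesame 2 repository configuration schema.
CONFIG_QUERY = """\
PREFIX sys: <http://www.openrdf.org/config/repository#>
PREFIX sail: <http://www.openrdf.org/config/repository/sail#>
SELECT ?id ?type ?param ?value
WHERE {
  ?rep sys:repositoryID ?id ;
       sys:repositoryImpl ?impl .
  ?impl sys:repositoryType ?type .
  OPTIONAL {
    ?impl sail:sailImpl ?sail .
    ?sail ?param ?value .
  }
}
"""


def build_query_url(server_url):
    """Return a GET URL that runs CONFIG_QUERY against the SYSTEM repository."""
    params = urlencode({"query": CONFIG_QUERY})
    return f"{server_url.rstrip('/')}/repositories/SYSTEM?{params}"


if __name__ == "__main__":
    # Hypothetical server URL.
    print(build_query_url("http://localhost:8080/openrdf-sesame"))
```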
This will return the repository ID and type, followed by name-value pairs of configuration data for SAIL repositories, including the SAIL type - "owlim:Sail" for OWLIM-SE and "swiftowlim:Sail" for OWLIM-Lite. OWLIM-Enterprise master nodes are not SAIL repositories and have the type "owlim:ReplicationCluster".
How do I set up license files for OWLIM-SE and OWLIM-Enterprise?
Both OWLIM-SE and OWLIM-Enterprise worker nodes require license files for long-term use. These can be obtained from Ontotext. When purchasing OWLIM-SE, you will receive one license file. When purchasing OWLIM-Enterprise, you will receive a license file for the worker nodes. Master nodes do not require a license file. License files should be stored where they are accessible to the processes that need to read them to validate the software, i.e. Tomcat instances or application software that embeds OWLIM.
When installing OWLIM-SE or OWLIM-Enterprise worker nodes, the license file can be set in several ways:
- Set owlim:owlim-license in a Turtle configuration/template file, e.g. when using the Sesame console.
- In the 'License file' field when using the Sesame workbench (the repackaged version in the OWLIM distribution).
- Using the CATALINA_OPTS environment variable, i.e. -Dowlim-license=<full_path_to_license>, which will apply to all OWLIM-SE repositories as it overrides each configured repository's license file setting.
|OWLIM-SE and OWLIM-Enterprise worker node licenses are different and will not work if used with the wrong software.|
How can I load a large RDF/XML file without getting an "entity expansion limit exceeded" error?
The XML parser generates an error similar to the following when it expands more than a specified number of 'entities':
Parser has reached the entity expansion limit "64,000" set by the Application.
The default limit for the built-in Java XML parser is 64,000, but it can be configured using a Java system property. To increase the limit, pass the following option to the JVM in which OWLIM/Sesame is running:
-DentityExpansionLimit=1000000
The value shown is only an example and can be increased as necessary. When running in Tomcat, this option must be passed to the Tomcat instance using the CATALINA_OPTS environment variable.
How can I upgrade to a new version of OWLIM-SE without exporting and reimporting all my data?
There might be subtle differences between versions of OWLIM-SE that mean that exporting and re-importing explicit statements from the older version of the repository is the best and safest means to upgrade. However, this can be lengthy with large databases.
Probably the fastest way to upgrade is to make a binary copy of the OWLIM storage folder and use the new version 'on top'. This will cause it to automatically upgrade the file formats (there are minor differences between versions) and from then on it should run fine with the new version.
1. Use the new version of OWLIM-SE to create an empty repository (with the right configuration, e.g. ruleset)
2. Shut down any running OWLIM instance
3. Locate the storage folders for the current instance and the new instance
4. Delete the contents of the new storage folder (will be called something like repo_id/storage)
5. Copy all files (and sub-directories, if present) from the old storage folder to the new storage folder
6. Restart the new instance
There is a good chance that it will take quite a long time to initialise as the storage files are modified, but it should be quicker than re-importing all the data.
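Steps 4 and 5 above can be sketched as a short script. This is only an illustration under stated assumptions: the function name and the example paths are hypothetical, all OWLIM instances are assumed to be shut down already, and `dirs_exist_ok` requires Python 3.8 or later. The old storage folder is left untouched, so it remains available as a backup.

```python
import shutil
from pathlib import Path


def replace_storage(old_storage, new_storage):
    """Empty new_storage, then copy the contents of old_storage into it.

    All OWLIM instances must be shut down before calling this (step 2).
    """
    old_storage, new_storage = Path(old_storage), Path(new_storage)
    if not old_storage.is_dir():
        raise FileNotFoundError(f"old storage folder not found: {old_storage}")
    if old_storage.resolve() == new_storage.resolve():
        raise ValueError("old and new storage folders must be different")
    new_storage.mkdir(parents=True, exist_ok=True)

    # Step 4: delete the contents of the new storage folder.
    for child in new_storage.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()

    # Step 5: copy all files and sub-directories from the old storage folder.
    shutil.copytree(old_storage, new_storage, dirs_exist_ok=True)


if __name__ == "__main__":
    # Hypothetical locations; substitute the real storage folders.
    replace_storage("/data/owlim-old/repo_id/storage", "/data/owlim-new/repo_id/storage")
```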