GraphDB FAQ

Last updated by Nikola Petrov on Apr 24, 2015 19:28.



GraphDB is a semantic repository - a software component for storing and manipulating huge quantities of RDF data. GraphDB is packaged as a Storage and Inference Layer (SAIL) for the Sesame OpenRDF framework ([http://www.aduna-software.com/technology/sesame|http://www.aduna-software.com/technology/sesame]).

Its former name was 'OWLIM', but it was renamed to 'GraphDB' to better represent what it actually does.

h5. What changes are involved in renaming OWLIM to GraphDB?

- the renaming was a cosmetic change - error messages and other OWLIM strings have been renamed to GraphDB
- property files, license files and package/class names remain the same for compatibility reasons (e.g. the owlim.properties file has not been changed)
- OWLIM 5.6 corresponds to GraphDB 6.0


h5. Where does the name "OWLIM" (the former GraphDB name) come from?

The name originally came from the term "OWL In Memory" and was fitting for what later became OWLIM-Lite. However, OWLIM-SE used a transactional, index-based file-storage layer where "In Memory" was no longer appropriate. Nevertheless, the name stuck and it was rarely asked where it came from...

h5. How do I use GraphDB?

GraphDB is packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF framework ([http://www.aduna-software.com/technology/sesame|http://www.aduna-software.com/technology/sesame]) and can be used in two different ways:

One approach is to use it as a library, an example of which is provided in the release distribution that can be started by using 'example.cmd' in the 'getting-started' folder.
Another approach is to download the full version of Sesame and configure GraphDB-SE as a plug-in. This method uses the Sesame HTTP server hosted in Tomcat (or similar) and in this way you can use Sesame together with GraphDB as a server application, accessed via the standard Sesame APIs.

Sesame version 2.2 onwards includes the Sesame Workbench - a convenient Web application for managing repositories, importing/exporting RDF data, executing queries, etc.

For more information please check the "doc" folder of the GraphDB-SE archive.

h5. What indices does GraphDB-SE use?

There are several types of indices available, *all* of which apply to *all* triples, whether explicit or implicit. These indices are maintained automatically.

The main indices that are always used are:
* predicate-object-subject-context (POSC)
* predicate-subject-object-context (PSOC)

There are other optional indices that have advantages for specific datasets, retrieval patterns, and query loads. By default they are disabled.

For some datasets, or when executing queries with triple patterns that have a wild-card for the predicate, a pair of indices can be used that map from resources (subject and object) to predicates. This pair of indices is known as 'predicate lists'; see the enablePredicateList parameter in the user guide.
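For example, this parameter can be set in the repository template (.ttl) file. A sketch, assuming the standard GraphDB parameter namespace ({{http://www.ontotext.com/trree/owlim#}}, usually bound to the {{owlim:}} prefix):

{noformat}
# on the SAIL implementation node of the repository template:
owlim:enablePredicateList "true" ;
{noformat}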

For more efficient processing of named graphs, two further indices can be used:
* predicate-context-subject-object (PCSO)
* predicate-context-object-subject (PCOS)
These can be switched on using the enable-context-index parameter.
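For example, in the repository template (.ttl) file - a sketch, assuming the {{owlim:}} prefix is bound to {{http://www.ontotext.com/trree/owlim#}}:

{noformat}
owlim:enable-context-index "true" ;
{noformat}

Like other repository parameters, this can also be overridden at JVM startup with an option of the same name, e.g. {{\-Denable-context-index=true}}.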

There are also several variations on full-text-search indices, for both node search and Lucene-based RDF search. Details of these can be found in the user guide.

h5. How much disk space does GraphDB-SE require to load my dataset?

There is no simple answer to this question, since it depends on the reasoning complexity (how many triples are inferred), how long the URIs are, what additional indices are used, etc. As an example, the following table shows the required disk space in bytes per explicit statement when loading the Wordnet dataset with various GraphDB-SE configurations:

|| Configuration || Bytes per explicit statement ||
| empty | 171 |

When planning for storage capacity based on the input RDF file size, the required disk space depends not only on the GraphDB-SE configuration, but also on the RDF file format used and the complexity of its contents. The following table gives a rough estimate of the expected expansion from an input RDF file to GraphDB-SE storage requirements. For example, when using OWL2-RL with all optional indices turned on, GraphDB-SE needs about 6.7GB of storage space to load a one-gigabyte N3 file; with no inference ('empty') and no optional indices, it needs about 0.7GB to load a one-gigabyte TriX file. Again, these results were created with the Wordnet dataset:

|| || N3 || N-Triples || RDF/XML || TriG || TriX || Turtle ||
| owl2-rl + all optional indices | 6.7 | 2.2 | 4.8 | 6.6 | 1.5 | 6.7 |
| owl2-rl | 4.3 | 1.4 | 3.1 | 4.2 | 1.0 | 4.3 |
| owl-horst + all optional indices | 5.3 | 1.7 | 3.8 | 5.2 | 1.2 | 5.3 |
h5. How much disk space does GraphDB-SE need per statement?

Firstly, note that GraphDB-SE computes inferences when new explicit statements are committed to the repository. The number of inferred statements can be zero (when using the 'empty' rule set) or many multiples of the number of explicit statements, depending on the chosen rule-set and the complexity of the data.

The disk space required for each statement further depends on the size of the URIs and literals. For typical datasets, around 200 bytes per statement is required with only the default indices, rising to about 300 bytes when all optional indices are turned on.

So when using the default indices, a good estimate for the amount of disk space needed is 200 bytes per statement (explicit and inferred).

In GraphDB-Lite, statements are stored in RAM: explicit statements require 13 bytes per statement and implicit statements require 8 bytes per statement.

Assuming roughly 3 unique explicit statements for every unique resource, the total memory required will vary from approximately 17 bytes per explicit statement (empty rule-set) to around 40 bytes per statement (expressive rule-sets). However, this will also vary depending on the 'geometry' of the input data and the use of complex (OWL) language features.

The above figures give the memory footprint for storing statements. In order to be useful, a GraphDB-Lite instance needs more memory for maintaining these data structures, executing queries, loading data and inferencing. A safe figure is to allow twice as much memory, i.e. the storage space of the statements plus resources, multiplied by 2. Therefore, for practical purposes, GraphDB-Lite needs between 34 and 80\+ bytes to index each statement, plus the actual size of the total sum of unique URIs, blank nodes and literals.

h5. What is the maximum amount of data that can be stored in GraphDB-Lite?
GraphDB-Lite is only capable of storing up to 1 billion unique resources. This sets a practical limit of approximately 3 billion explicit statements (assuming 3:1 ratio). Using the above metrics, storing this much data will require between 100 and 240 GB of RAM (or more).

It should be noted that these are theoretical limits and Ontotext has not attempted to load this much data into a GraphDB-Lite instance.

h5. Can GraphDB answer queries in parallel?
Yes. Both GraphDB-Lite and GraphDB-SE can process queries concurrently.

Furthermore, when GraphDB-SE is used in a cluster configuration, the throughput of parallel query answering can be scaled (almost) linearly by adding more nodes.

h5. What kind of transaction isolation is supported?

GraphDB supports the read-committed isolation level, i.e. pending updates are not visible to other connected users until the complete update transaction has been committed. However, for efficiency reasons and unlike typical relational database behaviour, uncommitted changes are not 'visible' even when using the connection that made the updates.

h5. Are solid-state drives better than hard-disk drives for GraphDB-SE (and GraphDB-Enterprise)?
Yes. Unlike relational databases, a semantic database must compute inferences for inserted and deleted statements. This involves highly unpredictable joins using statements anywhere in the indices for all new/deleted statements. Despite paging as efficiently as possible, a large number of disk seeks can be expected, and SSDs perform far better than HDDs at this task.

The difference between performance on SSDs and HDDs is most pronounced when GraphDB-SE is running in the safe transaction mode (the default). In fast mode, updated pages are not flushed to disk at the end of a commit operation - instead, they are only swapped to disk when the cache memory is exhausted. However, in safe mode all updated pages are flushed to disk before a commit operation returns; the higher number of writes causes a small slow-down when using SSDs, but a large slow-down when using HDDs. In performance tests using LUBM(1000) (approximately 135 million statements) with a 1 GB cache, the difference in load times between fast and safe modes is markedly different for SSDs and HDDs:

* SSD: safe mode loading is between 40% and 50% slower than fast mode
* HDD: safe mode loading is approximately 16 times slower than fast mode

The recommended transaction mode is 'safe'. However, if only HDDs are available, then there is the option of using the 'fast' transaction mode with the corresponding increase in performance and the increased risk of data loss in the event of an abnormal termination. Switching modes is straightforward and just requires a restart of GraphDB.

h5. What kind of RAID set-up is best?

RAID-0 gives good performance, but is more likely to suffer problems due to disk failure. RAID-5 is a good balance between resilience/redundancy and cost. Using SSDs, we have (little more than anecdotal) evidence that RAID-0 is fast, RAID-1 is slower, and RAID-5 is slower with fewer than 4 disks and about the same as RAID-0 with 4 or more disks.

h1. Configuration
h5. How can I retrieve my repository configurations from the Sesame SYSTEM repository?

When using a LocalRepositoryManager, Sesame stores the configuration data for repositories in its own 'SYSTEM' repository. A Tomcat instance will do the same and you will see 'SYSTEM' in the list of repositories that the instance is managing. To see what configuration data is stored for a GraphDB-SE repository, connect to the SYSTEM repository and execute the following query:

{noformat}
PREFIX sys: <http://www.openrdf.org/config/repository#>
PREFIX sail: <http://www.openrdf.org/config/repository/sail#>

select ?id ?type ?param ?value
where {
?rep sys:repositoryID ?id .
?rep sys:repositoryImpl ?impl .
?impl sys:repositoryType ?type .
optional {
?impl sail:sailImpl ?sail .
?sail ?param ?value .
}
# FILTER( ?id = "specific_repository_id" ) .
}
ORDER BY ?id ?param
{noformat}

For GraphDB-Enterprise worker repositories the query is slightly different:

{noformat}
PREFIX sys: <http://www.openrdf.org/config/repository#>
PREFIX sail: <http://www.openrdf.org/config/repository/sail#>

select ?id ?type ?param ?value
where {
?rep sys:repositoryID ?id .
?rep sys:repositoryImpl ?delegate .
?delegate sys:delegate ?impl .
?impl sys:repositoryType ?type .
optional {
?impl sail:sailImpl ?sail .
?sail ?param ?value .
}
# FILTER( ?id = "specific_repository_id" ) .
}
ORDER BY ?id ?param
{noformat}


This will return the repository ID and type, followed by name-value pairs of configuration data for SAIL repositories, including the SAIL type - "owlim:Sail" for GraphDB-SE and "swiftowlim:Sail" for GraphDB-Lite. GraphDB-Enterprise master nodes are not SAIL repositories and have the type "owlim:ReplicationCluster".

If you uncomment the FILTER clause, you can substitute a repository id to get the configuration just for that repository.

h5. How do I change the configuration of a GraphDB Sesame repository after it has been created?

There is no easy generic way of changing the configuration - it is stored in the SYSTEM repository created and maintained by Sesame. However, GraphDB allows these parameters to be overridden by specifying the parameter values as JVM options. For instance, by passing the \-Dcache-memory=1g option to the JVM, GraphDB-SE will read it and use its value to override whatever was configured in the .ttl file. This is convenient for temporary set-ups that require easy and fast configuration changes, e.g. for experimental purposes.

Changing the configuration in the SYSTEM repository is trickier, because the configurations are usually structured using blank node identifiers, which are always unique - attempting to modify a statement with a blank node by using the same blank node identifier will fail. However, this can be achieved with SPARQL UPDATE using a DELETE-INSERT-WHERE command as follows (note that this is valid for GraphDB-SE repositories):

{noformat}
PREFIX sys: <http://www.openrdf.org/config/repository#>
PREFIX sail: <http://www.openrdf.org/config/repository/sail#>
PREFIX onto: <http://www.ontotext.com/trree/owlim#>
DELETE { GRAPH ?g {?sail ?param ?old_value } }
INSERT { GRAPH ?g {?sail ?param ?new_value } }
WHERE {
GRAPH ?g { ?rep sys:repositoryID ?id . }
GRAPH ?g { ?rep sys:repositoryImpl ?impl . }
GRAPH ?g { ?impl sail:sailImpl ?sail . }
GRAPH ?g { ?sail ?param ?old_value . }
FILTER( ?id = "repo_id" ) .
FILTER( ?param = onto:enable-context-index ) .
BIND( "true" AS ?new_value ) .
}
{noformat}

For GraphDB-Enterprise worker repositories, the SPARQL update is slightly different:
{noformat}
PREFIX sys: <http://www.openrdf.org/config/repository#>
PREFIX sail: <http://www.openrdf.org/config/repository/sail#>
PREFIX onto: <http://www.ontotext.com/trree/owlim#>
DELETE { GRAPH ?g {?sail ?param ?old_value } }
INSERT { GRAPH ?g {?sail ?param ?new_value } }
WHERE {
GRAPH ?g { ?rep sys:repositoryID ?id . }
GRAPH ?g { ?rep sys:repositoryImpl ?delegate . }
GRAPH ?g { ?delegate sys:repositoryType ?type . }
GRAPH ?g { ?delegate sys:delegate ?impl . }
GRAPH ?g { ?impl sail:sailImpl ?sail . }
GRAPH ?g { ?sail ?param ?old_value . }
FILTER( ?id = "repo_id" ) .
FILTER( ?param = onto:enable-context-index ) .
BIND( "true" AS ?new_value ) .
}
{noformat}


Modify the last three lines of the update command to specify the repository ID, the parameter, and the new value. Then execute against the SYSTEM repository. In this example, the enable-context-index parameter is changed, but there are other parameters that cannot be changed once the repository is created, e.g. the rule-set (in the case of GraphDB-SE).

A restart of Sesame/GraphDB is required. If deployed using Tomcat, then the easiest way is just to restart Tomcat itself.

h5. How can I find out the exact version number of GraphDB-SE/GraphDB-Enterprise?

The major/minor version and build number make up part of the GraphDB distribution zip file name. The embedded owlim jar file has the major and minor version numbers appended.

In addition, at start up, GraphDB-SE and GraphDB-Enterprise worker nodes will log the full version number in an INFO logger message, e.g. {{OwlimSchemaRepository: version: 5.6, revision: 6864}}

The following DESCRIBE query:
{noformat}
{noformat}

will return pseudo-triples providing information on various aspects of GraphDB's state, including: the number of triples (total and explicit), storage space (used and free), commits (total and whether one is in progress), the repository signature, and the build number of the software.


h5. How do I rename a repository?

# Rename the folder for this repository in the file system
#* mv /usr/share/tomcat6/.aduna/openrdf-sesame/repositories/OLD_NAME /usr/share/tomcat6/.aduna/openrdf-sesame/repositories/NEW_NAME (linux/unix)
#* The location for the 'repositories' folder under Windows varies depending on the version of Windows and the installation method for Tomcat - some possibilities are:
#** {{C:\Users\<username>\AppData\Roaming\Aduna\repositories}} (Windows 7 when running Tomcat as a user)
#** {{C:\Documents and Settings\LocalService\Application Data\Aduna\OpenRDF Sesame\repositories}} (Windows XP when running Tomcat as a service)
# Reselect the SYSTEM repository and press F5

{{[http://www.ontotext.com/trree/owlim#storage-folder]}}

If this parameter is set to an absolute pathname, then moving the repository requires an update of this parameter as well; change its value (to reflect the new name) as described in the question above about modifying a repository configuration.
{info}

h5. How do I set up license files for GraphDB-SE and GraphDB-Enterprise?

Both GraphDB-SE and GraphDB-Enterprise worker nodes require license files for long term use. These can be obtained from Ontotext. When purchasing GraphDB-SE, you will receive one license file. When purchasing GraphDB-Enterprise, you will receive a license file for the worker nodes. Master nodes do not require a license file. License files should be stored where they are accessible to the processes that need to read them to validate the software, i.e. Tomcat instances or application software that embeds GraphDB.

When installing GraphDB-SE or GraphDB-Enterprise worker nodes, the license file can be set in several ways:
* *Environment variable*
** Set the {{GRAPHDB_LICENSE_FILE}} environment variable to point to the license file. This will be overridden by the following methods.
* *Repository configuration parameter*
** Set {{owlim:owlim-license}} in a Turtle configuration/template file, e.g. when using the Sesame console.
** Set the 'License file' field when using the Sesame workbench (the repackaged version in the GraphDB distribution).
* *System property*
** Use {{\-Dowlim-license=<full_path_to_license>}} for the Java virtual machine that is running GraphDB.
** When deployed using Tomcat, the {{CATALINA_OPTS}} or {{JAVA_OPTS}} environment variable can be set to include {{\-Dowlim-license=<full_path_to_license>}}, which will apply to all GraphDB-SE repositories, as it overrides each configured repository's license file setting and the environment variable.
** For Linux installations, you can also set {{JAVA_OPTS}} in the {{/etc/default/tomcat6}} file.

h5. How do I run GraphDB-SE or a GraphDB-Enterprise worker node on a machine with more CPUs than licensed?

The maximum CPU count is checked during GraphDB initialisation by checking for the number of CPU cores available to the Java Virtual Machine (JVM) in which GraphDB is running. If the number of CPU cores available is greater than specified in the license file, then GraphDB will still run, but it will throttle itself to use the equivalent of the licensed number of CPU cores.

GraphDB can be embedded in a user application or deployed using Tomcat/Sesame. In both cases, the user should restrict the number of CPUs available to the JVM in which GraphDB is running. The method to do this depends upon the operating system:
|| Operating system || Command ||
| Windows | {{C:\WINDOWS\system32\cmd.exe /c start /AFFINITY 5 java}} {{{_}rest_of_command_line{_}}} |

When deploying with Tomcat/Sesame, the easiest method is to edit the startup script(s) for Tomcat using the above modifications. The scripts can be found in a variety of locations depending on the operating system, distribution, and installation method. Some common locations are:
* {{/usr/share/tomcat6/bin/startup.sh}}
* {{/etc/init.d/tomcat6}}

{info}
MacOS does not provide an easy means to set processor affinity. The most straightforward method to limit CPU usage is to create a virtual machine (VM) (using VirtualBox, for example) and set the number of processors for this VM to the number of licensed CPU cores. A free Linux distribution can then be installed in the VM and Sesame/GraphDB/Tomcat set up as necessary. Alternatively, just use GraphDB as normal and rely on the internal throttling to manage CPU utilisation.
{info}



h1. Problems

h5. Why does my repository report a different number of explicit statements with different rule sets?

Each rule set defines both rules and some schema statements, otherwise known as axiomatic triples. These (read-only) triples are inserted into the repository at initialisation time and count towards the total number of reported 'explicit' triples. The variation may be up to the order of hundreds, depending upon the rule set.

h5. Why can't I delete some statements?

Statements that were added during repository initialisation, either because they are asserted in rule files or because they were loaded using the "imports" parameter, are marked read-only. Having read-only statements (especially schema definition statements) is one way to ensure that 'smooth delete' can operate very quickly.

{note}GraphDB-SE does now allow read-only/schema statements to be modified when the repository is in a special mode. This feature will allow fast delete operations at the same time as ensuring that schemas can be changed when necessary. Full details on how to do this can be found in the [GraphDB-SE user guide|GraphDB-SE Reasoner#Schemaupdatetransactions] .
{note}

h5. Why won't Sesame start in Tomcat?

This problem can manifest itself in many ways after deploying the Sesame/GraphDB war files to Tomcat's webapps directory. If you are unable to set the server URL in the Workbench, this is an indication that the problem has occurred. It is likely due to a permissions problem on the logging directory for the openrdf-sesame server. To check this, point your browser directly at the Sesame server with a URL similar to the following:

{noformat}
http://localhost:8080/openrdf-sesame
{noformat}
If you receive a stack trace containing the following:

bq. Invocation of init method failed; nested exception is java.io.IOException: Unable to create logging directory /usr/share/tomcat6/.aduna/openrdf-sesame/logs
then this indicates that Tomcat does not have write permission to its data directory (where it stores configuration, logs, and actual repository data). To fix this, log in as root to the server machine and do the following:

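The exact commands depend on the installation, but a typical fix looks like the following (the {{tomcat6}} user/group name is an assumption - check which user actually runs your Tomcat process):

{noformat}
mkdir -p /usr/share/tomcat6/.aduna
chown -R tomcat6:tomcat6 /usr/share/tomcat6/.aduna
{noformat}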

Now when you use the server URL in your browser, you should see the Sesame server welcome screen.


h5. Sesame Workbench starts, but gives a memory error on the 'explore' and 'query' menus

The maximum heap space must be increased, i.e. Tomcat's Java virtual machine must be allowed to allocate more memory. This can be done by setting the environment variable 'CATALINA_OPTS' to include the desired value, e.g. \-Xmx1024m

h5. I cannot copy GraphDB or 3rd party jar files to the openrdf-sesame WEB-INF/lib directory.

This directory will not exist until the Sesame war files have been deployed to the \[WEBAPPS\] directory AND Tomcat is running. If the war files have been deployed, but the directory does not exist, try restarting Tomcat.

h5. I cannot connect the Sesame console to the local Sesame server at [http://localhost:8080/openrdf-sesame].

Make sure that the Sesame war files have been deployed and that Tomcat is running. Restart Tomcat if necessary.

h5. I cannot create a GraphDB repository using the Sesame console.

Make sure that the repository template file(s) have been copied to the 'templates' sub-directory of the Sesame console's data directory. These files are:
* GraphDB-Lite - {{owlim-lite.ttl}}
* GraphDB-SE - {{owlim-se.ttl}}
* GraphDB-Enterprise - {{master.ttl}} and {{worker.ttl}}

h5. I cannot create a GraphDB repository; the Sesame console says 'unknown Sail type'.

The Sesame console cannot find the GraphDB jar file. Make sure it was copied from the distribution zip file to the 'lib' sub-directory of the Sesame installation directory.

h5. I cannot use my custom rule file (pie file); an exception occurred.

To use custom rule files, GraphDB must be running in a Java Virtual Machine (JVM) that has access to the Java compiler. The easiest way to do this is to use the Java runtime from a Java Development Kit (JDK).

h5. I am getting a SocketTimeout when uploading large files

This may happen because the import takes longer than the configured timeout for your application server. In the case of Tomcat, you can change/add a connectionTimeout attribute on the Connector element in *conf/server.xml* like this:
{code}
<Connector executor="tomcatThreadPool"
           port="8080" protocol="HTTP/1.1"
           connectionTimeout="40000"
           redirectPort="8443" />
{code}
or enable the longer upload timeout by setting *disableUploadTimeout* to false. More information can be found in Tomcat's documentation [here|http://tomcat.apache.org/tomcat-7.0-doc/config/http.html#Standard_Implementation].

h5. Slow tomcat startup time on Linux

If you are seeing the following in the logs
{code}
Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [5172] milliseconds.
{code}
then try adding {code}
-Djava.security.egd=file:/dev/./urandom
{code}
to your *CATALINA_OPTS* in *setenv.sh*.
See [this|http://wiki.apache.org/tomcat/HowTo/FasterStartUp#Entropy_Source] page for more information and the implications of this parameter.

h1. Backup/restore/import/export

h5. How do I preserve contexts when exporting and importing data?

In order to preserve the context (named graph) when exporting/importing the whole database, a context-aware RDF file format must be used, e.g. TriG. After serialising the repository to a file with this format (this can be done through the Sesame workbench Web application), the file can be imported with the following steps:
* Go to *\[Add\]*
* Choose Data format: *TriG*
* Choose RDF Data File: e.g. *export.trig*
* Clear the context text field (it will have been set to the URL of the file). If this is not cleared, then all the imported RDF statements will be given a context of <[file://export.trig]> or similar.
* Upload


This method will stream a snapshot of the database's explicit statements into the 'export.trig' file.

A backup can also be done programmatically using the Sesame API. See the [RepositoryConnection.exportStatements()|http://openrdf.callimachus.net/sesame/2.6/apidocs/index.html] method and the example in the next question.

If it is possible to shut down the repository, then a backup can be effected by copying the GraphDB storage directory (and any sub-directories). See the [installation section|GraphDB-SE Installation] for information about where GraphDB storage folders are located. To restore a repository from a backup, make sure the repository is not running and then replace the entire contents of the storage directory (and any sub-directories) with the backup. Then restart the repository and check the log file to ensure a successful start up.

GraphDB-Enterprise has an additional online backup feature that copies a binary database image from a worker node to the cluster master - see the [GraphDB-Enterprise user guide|GraphDB-Enterprise Administration].
h5. How do I dump the contents of a large repository to RDF?

The Sesame openRDF workbench Web application has an export function that can be used to export the contents of moderately sized repositories. However, using this with large repositories (a hundred million statements or more) causes problems - usually time-outs for the Servlet container (Tomcat) hosting the application. Also, the workbench cannot be used when using GraphDB-SE without Tomcat.

Therefore, a more straightforward approach for exporting RDF data from repositories is to do this programmatically. The Sesame {{RepositoryConnection.getStatements()}} method can be called with the {{includeInferred}} flag set to {{false}} (in order not to serialise the inferred statements). Then the returned iterator can be used to visit every explicit statement in the repository and one of the Sesame RDF writer implementations can be used to output the statements in the chosen format. If the data will be re-imported, the N-Triples format is recommended, because this can easily be broken into large 'chunks' that can be inserted and committed separately. The following code snippet shows how an export can be achieved using this approach:

{code:borderStyle=solid}java.io.OutputStream out = ...;
RDFWriter writer = Rio.createWriter(RDFFormat.NTRIPLES, out);
// exportStatements(), called on an open RepositoryConnection, is the
// convenience equivalent of iterating getStatements() with includeInferred = false
connection.exportStatements(null, null, null, false, writer);
{code}

While most versions of GraphDB are backward compatible, some major version number increases use such different data structures that disk images can no longer be automatically updated to the latest version.

The basic procedure is to export the RDF data from the old version of GraphDB-SE and then reload it into a new repository instance that uses the new version of GraphDB-SE. Exporting is straightforward when using the Sesame workbench - simply click the 'Export' button, choose the format, and click 'download'. To import into a new repository, click 'add', select a format, specify the file and base URI, then click 'Upload'.

If not using the Sesame workbench, the export must be done programmatically using the {{RepositoryConnection.getStatements()}} API, because the Sesame console does not have an export function.
NOTE: Only the explicit statements need be exported, as the inferred statements will be recomputed at load time. Fortunately, the Sesame console application does have a 'load' function, which can be used to reload the exported statements.

h5. How can I load a large RDF/XML file without getting an "entity expansion limit exceeded" error?

The built-in Java XML parser will report an error similar to:

bq. Parser has reached the entity expansion limit "64,000" set by the Application.

when it generates more than a specified number of 'entities'. The default limit for the built-in Java XML parser is 64,000; however, it can be configured using a Java system property. To increase the limit, pass the following to the JVM in which GraphDB/Sesame is running (the actual value can be increased as necessary). Don't forget that if running in Tomcat, this must be passed to the Tomcat instance using the CATALINA_OPTS environment variable.
{noformat}
-DentityExpansionLimit=1000000
{noformat}


h5. How can I upgrade to a new version of GraphDB-SE without exporting and reimporting all my data?

There is a good chance that initialisation will take quite a long time as the storage files are modified, but it should be quicker than re-importing all the data.

h5. How do I load large amounts of data in to GraphDB-SE or GraphDB-Enterprise?

In general, RDF data can be loaded into a given Sesame repository using the 'load' command in the Sesame console application or directly through the workbench Web application. However, neither of these approaches works well with a very large number of triples, e.g. a billion statements. A common solution is to convert the RDF data into a line-based RDF format (e.g. N-Triples) and then split it into many smaller files (e.g. using the Linux command 'split'). Each file can then be uploaded separately using either the console or the workbench application.

h1. Developers

# Install ant 1.8 from [http://ant.apache.org/]
# Install Maven 2.2.1 - just download it from [http://maven.apache.org/], install it in a convenient location, and make sure that mvn.bat is on your PATH so you can run it from the command line. No additional configuration is required.
# Check out the desired Sesame branch from SVN - 2.6 is used in these instructions [http://repo.aduna-software.org/svn/org.openrdf/sesame/branches/2.6]
# Open a command line and go to the core subdirectory of your branch working directory.