
This section gives an overview of configuring a GraphDB-SE repository. It also covers the contents of the GraphDB-SE distribution and describes the included 'getting-started' application, which serves as an example of integrating GraphDB-SE into other systems. For a detailed step-by-step guide to installing and setting up GraphDB-SE, see the [installation section|GraphDB-SE Installation].
{toc}
h1. Contents of the Distribution Package
The GraphDB-SE distribution zip file includes the following folders:
\\
|| Folder || Contents ||
| {{doc}} | User guide, quick start guide and the GraphDB primer |
| {{ext}} | All required third-party libraries. The Sesame 2 installation can be downloaded separately from [http://www.openrdf.org/|http://www.openrdf.org/]. The folder also contains a copy of the Lehigh University Benchmark library (lubm.jar), the JUnit library (junit.jar) necessary for executing inference tests, the simple logging framework for Java JAR files and the Lucene full-text search JAR. |
| {{getting-started}} | An example application that uses GraphDB, with all the necessary auxiliary files and folders (see section 8.2). |
| {{lib}} | Contains the binary executable version of GraphDB-SE as a JAR (Java library) file |
| {{lubm}} | Scripts and configuration files to run the LUBM \[16\] benchmarks documented in section 11. |
| {{templates}} | Contains a GraphDB-SE repository template file (.ttl) used by the Sesame 2 framework for creating new repositories. |
The distribution contains the following files in the root directory of the zip file:
\\
|| File || Description ||
| {{\*.pie}} | Rule files containing definitions of the built-in rule-sets, see section 7.1.2. |
| {{GraphDB-SE_license_agreement_xx.pdf}} | The license under which GraphDB-SE is published. |
| {{owlim-se-configurator.xls}} | A memory requirement calculator and configuration tool that can be used to calculate the correct Java heap size, memory allocation and various other configuration parameters. It generates both command line and Turtle configurations. Instructions for using this spreadsheet are given on the first page. |
| {{setvars.cmd}} \\ {{setvars.sh}} | Scripts (Windows and Linux) that define several environment variables used by the scripts that run the test cases and the getting-started application. They should be edited for each installation, as they determine the Java virtual machine to be started and the path to all the relevant JAR files, including those of GraphDB-SE and Sesame. |
h1. Getting-Started Application
The GraphDB distribution comes with a sample application that can be used as a template for building applications that interact with a GraphDB repository. The source code of this application performs a sequence of typical operations: initialisation of the repository, uploading statements, executing queries and obtaining results, deleting statements, etc. This application template comes with:
* Source code and compiled class files;
* Sample ontology and data files;
* Sesame repository template file;
* Scripts which invoke the application.
An easy way to set up an application to use GraphDB is to copy the {{getting-started}} folder and modify the contents as necessary.
There follows a short description of how the getting-started application is organised and what the sample code does. The easiest way to get a good understanding of it is to read the source code of the {{GettingStarted}} class, located in the {{src}} folder - the code is extensively commented. The program accepts a number of parameters as described below.
|| Parameter || Description || Default ||
| | *Repository/communication parameters:* | |
| {{config}} | Specifies the repository description file used to create a repository. Configuration options specified in this file are explained in the following sections. This parameter is ignored if the {{url}} parameter is used. | ./owlim.ttl |
| {{url}} | Used in conjunction with the {{repository}} parameter, this URL specifies the remote Sesame server and will have the form http://<hostname>:<port>/openrdf-sesame/. This parameter overrides the {{config}} parameter. | |
| {{repository}} | The repository ID, used in conjunction with the {{url}} parameter, identifies the repository on the remote Sesame server. | |
| {{username}} | Specifies the username for HTTP authentication (if enabled at the server) | |
| {{password}} | Specifies the password for HTTP authentication (if enabled at the server) | |
| | *Export parameters:* | |
| {{exportfile}} | Dumps the repository contents to the given file name. | |
| {{exporttype}} | Which statements to export: all, explicit or implicit. | explicit |
| {{exportformat}} | The RDF format of the export: N-Triples, N3, Turtle, RDF/XML, TriG or TriX. | |
| | *Data loading parameters:* | |
| {{context}} | If not specified, statements loaded are given the context of the file URL from which the statements were loaded. If specified ({{context=URI}}), then all statements loaded are given this URI for the context. If an empty context is used ({{context=}}) then all statements loaded have no context (default graph) | |
| {{preload}} | Specifies the folder or file containing RDF data that is loaded automatically when the program starts. If the parameter value specifies a folder then it is searched recursively for all files that contain RDF data. | ./preload |
| {{verify}} | Verify the integrity of the RDF data during parsing | true |
| {{stoponerror}} | Whether the parser should stop immediately if it finds an error in the data | true |
| {{preservebnodes}} | Whether the parser should preserve bnode identifiers specified in the source | true |
| {{datatypehandling}} | The data-type handling method, one of: *ignore* (allow any data-type values), *verify* (validate data-type representations) or *normalize* (convert all data-type values to their canonical form) | verify |
| {{chunksize}} | The number of statements to parse/load before inserting a commit instruction | 500000 |
| | *Query and miscellaneous parameters:* | |
| {{queryfile}} | Specifies the file containing queries that are to be executed. The files can contain queries in any format supported by Sesame and can include SPARQL updates. | ./queries/sample.sparql |
| {{showresults}} | Specifies whether the results from queries will be displayed or not. | true |
| {{showstats}} | Indicates whether to show initialisation statistics after loading the selected data files | false |
| {{updates}} | Specifies whether the statement insertion and deletion step is performed. | false |
To run the program, use the {{example.cmd}} / {{example.sh}} script. This script requires the {{JAVA_HOME}} environment variable to be set. Alternatively, the variable can be set directly by editing the {{setvars.cmd}} / {{setvars.sh}} script in the root folder of the GraphDB-SE software distribution. If the program is modified to use a custom rule set, then {{JAVA_HOME}} must point to the Java Runtime Environment (JRE) of a Java Development Kit (JDK) version 1.6 or later. This is so that the new mechanism for locating the Java compiler can be used.
With the example set up, GraphDB-SE loads the example ontology at start up as specified by the {{imports}} parameter in the repository configuration file, i.e. {{owlim.ttl}}. This ontology is {{./ontology/example.rdfs}}. The sample program then loads any other ontologies that it finds in the {{preload}} folder. When start up is complete, the program outputs some statistics and lists the namespaces found.
The next step is to load the specified query file and to execute the queries that it contains. Some example query files are included in the {{queries}} folder. A file can contain several queries. Each query starts with an identifier on a single line, enclosed in {{\^\[}} and {{\]}}; everything between two subsequent query identifiers is treated as a SeRQL or SPARQL query and is evaluated against the contents of the prepared repository. The {{\#}} sign marks a single-line comment, so each line starting with {{\#}} is ignored. Syntax overview:
{noformat}#some comment
^[queryid1]
<query line1>
<query line2>
...
<query lineN>
#some other comment
^[nextqueryid]
<query line1>
...
<EOF>
{noformat}
The queries are always evaluated, but the results are output only if the {{showresults}} parameter is set to {{true}}.
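The query file layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not the parser used by the {{GettingStarted}} class; the function name {{parse_query_file}} is hypothetical, and it assumes that an identifier line contains nothing but {{^[id]}}.

```python
import re

# Matches an identifier line such as "^[queryid1]".
QUERY_ID = re.compile(r"^\^\[(.+)\]\s*$")


def parse_query_file(text):
    """Split query-file text into an ordered list of (query_id, query) pairs."""
    queries = []
    current_id = None
    current_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # single-line comment, ignored
        match = QUERY_ID.match(line)
        if match:
            # A new identifier closes the previous query, if any.
            if current_id is not None:
                queries.append((current_id, "\n".join(current_lines).strip()))
            current_id = match.group(1)
            current_lines = []
        elif current_id is not None:
            current_lines.append(line)
    if current_id is not None:
        queries.append((current_id, "\n".join(current_lines).strip()))
    return queries


sample = """#some comment
^[q1]
SELECT * WHERE {
  ?s ?p ?o
}
#some other comment
^[q2]
ASK { ?s ?p ?o }
"""

for query_id, query in parse_query_file(sample):
    print(query_id, "->", query.replace("\n", " "))
```

Lines that appear before the first identifier are dropped in this sketch; the source does not say how the real application treats them.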
Furthermore, the sample application updates the contents of the repository by inserting a statement using {{RepositoryConnection.addStatement()}} and committing the transaction. The program then fetches some statements from the repository using a direct call to {{RepositoryConnection.getStatements()}}. The set of retrieved statements should contain the newly added statement, since it matches the given pattern. The statement is then removed in a separate transaction.
The application can also be run against a remote repository exposed using the Sesame HTTP server. In this case, the {{url}} parameter specifies the Sesame endpoint and the {{repository}} parameter specifies the repository to use on this server. The use of {{url}} and {{repository}} overrides the {{config}} parameter.
h2. Bulk data loading
Due to its range of functions, the getting-started application makes a useful bulk-loading tool. It can load a single file or traverse a whole directory structure, loading every RDF file it finds. If the files are very large, it automatically inserts commit operations at suitable moments, so it is not necessary to convert and split large files into smaller ones. For example:
{noformat}
./example.sh url=http://192.168.1.31:8080/openrdf-sesame repository=my_repo
preload=/home/me/wordnet/ username=me password=secret queryfile=none
{noformat}
This command loads all RDF files located in {{/home/me/wordnet/}} and its subdirectories into the repository called {{my_repo}} at the Sesame endpoint {{[http://192.168.1.31:8080/openrdf-sesame]}}, secured using HTTP authentication with the above credentials. If an error occurs, it outputs a message and continues on to the next RDF file.
h2. Making a back-up
The export features of the getting-started application allow a reasonable back-up of a GraphDB database to be made.
{noformat}
./example.sh queryfile=none url=http://192.168.1.31:8080/openrdf-sesame repository=my_repo
preload= exportfile=backup.trig exportformat=trig exporttype=explicit
{noformat}
In this example the TriG file format is used, because it preserves named-graph names (it is a quad format).
h2. Wordnet example
[Wordnet|http://wordnet.princeton.edu/] is the most popular lexical knowledge base, developed at Princeton University. It encodes the meanings of about 150,000 English words. The meanings of the words are defined by word-senses, which relate a word to a lexical concept. Lexical concepts are called _synsets_, i.e. synonym sets -- about 115,000 of those appear in Wordnet v.2.0. Numerous lexical semantic relations are formally modelled, e.g.
* Hyponymy (subsumption from a more-general term)
* Antonymy (negation, a term with the opposite meaning)
* Causation and entailment (for verbs)
A standard RDF/OWL representation of Wordnet is available at [http://www.w3.org/TR/wordnet-rdf/|http://www.w3.org/TR/wordnet-rdf/]. It contains about 1.9 million explicit statements (the Full variant), expressed in a fragment of OWL-Lite that further entails 6.3 million implicit statements.
To configure the getting started example program to use the Wordnet data sets and run the included Wordnet queries, one should download the archive of the full version from [http://www.w3.org/2006/03/wn/wn20/download/wn20full.zip|http://www.w3.org/2006/03/wn/wn20/download/wn20full.zip], extract it into a folder, e.g. {{./preload/wordnet}} and provide a path to this folder using the {{preload}} command line parameter when starting the program. Some sample Wordnet queries are provided in the {{wordnet.serql}} and {{wordnet.sparql}} query files. These can be specified on the command line using the {{queryfile}} command line parameter.
h1. Configuration
Sesame 2.0 keeps repository configurations in a SYSTEM repository, in RDF. A new repository can be configured simply by inserting an appropriate graph into the SYSTEM repository. The getting-started application uses the Turtle format for convenience and also because the Sesame console application uses the Turtle format for template files when creating repositories.
The diagram below gives a graphical illustration of an RDF graph that describes a repository configuration:
!sesame_owlim_config.png!
Often it is desirable to ensure that a repository starts with a predefined set of RDF statements, usually one or more schema graphs. This is possible by using the {{owlim:imports}} property. After start up, these files are parsed and their contents are permanently added to the repository. The complete set of configuration parameters, their descriptions and their default and allowed values are listed below. What follows is a short description of those properties specific to GraphDB-SE that are used to set up the repository. For more information about the Sesame 2 configuration schema, refer to the Sesame documentation \[9\].
In short, the configuration is an RDF graph whose root node is of {{rdf:type rep:Repository}}. The root node must be connected through the {{rep:repositoryID}} property to a literal that contains the human readable name of the repository, and via the {{rep:repositoryImpl}} property to a node that describes the configuration. The type of the repository is defined via the {{rep:repositoryType}} property and its value must be {{openrdf:SailRepository}} to allow custom sail implementations (such as GraphDB-SE) to be used in Sesame 2.0. A node that specifies the Sail implementation to be instantiated must then be connected with the {{sr:sailImpl}} property. To instantiate GraphDB-SE, this last node must have a property {{sail:sailType}} with the value {{owlim:Sail}} -- the Sesame framework will locate the correct {{SailFactory}} within the application {{classpath}} that will be used to instantiate the Java implementation class.
Namespaces corresponding to the prefixes used in the above paragraph are:
{noformat}rep: <http://www.openrdf.org/config/repository#>
sr: <http://www.openrdf.org/config/repository/sail#>
sail: <http://www.openrdf.org/config/sail#>
owlim: <http://www.ontotext.com/trree/owlim#>
{noformat}
All properties used to specify GraphDB-SE's configuration parameters use the {{owlim:}} prefix and the local names match up with the parameters listed below, e.g. the value of the {{ruleset}} parameter can be specified using the {{[http://www.ontotext.com/trree/owlim#ruleset]}} property.
Many of the GraphDB specific configuration parameters can be set via the Java Virtual Machine (JVM) system properties passed as command line parameters when starting the JVM. Values for configuration parameters that are given on the command line take precedence over those present in the repository configuration. For instance, the {{ruleset}} parameter can be set from the command line by using:
{noformat}-Druleset=owl-max
{noformat}
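The precedence rule can be pictured as a simple merge in which command line values win. This is a minimal sketch in Python, not GraphDB's implementation; the function name {{effective_parameters}} and both dictionaries are hypothetical and only model the two sources of values.

```python
def effective_parameters(repository_config, system_properties):
    """Return the values in effect when a parameter is set in both places."""
    effective = dict(repository_config)   # start from the repository configuration
    effective.update(system_properties)   # -D values given on the command line win
    return effective


# Values as they might appear in the repository configuration (illustrative).
config = {"ruleset": "owl-horst-optimized", "storage-folder": "storage"}
# Equivalent of starting the JVM with -Druleset=owl-max.
overrides = {"ruleset": "owl-max"}

print(effective_parameters(config, overrides))
```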
h2. Sample Configuration
There follows an example configuration (in Turtle RDF format) of a Sesame 2 repository that uses a GraphDB-SE sail implementation:
{noformat}@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.
[] a rep:Repository ;
rep:repositoryID "owlim" ;
rdfs:label "GraphDB Getting Started" ;
rep:repositoryImpl [
rep:repositoryType "openrdf:SailRepository" ;
sr:sailImpl [
sail:sailType "owlim:Sail" ;
owlim:ruleset "owl-horst-optimized" ;
owlim:base-URL "http://example.org/owlim#" ;
owlim:imports "./ontology/my_ontology.rdf" ;
owlim:defaultNS "http://www.my-organisation.org/ontology#" ;
owlim:entity-index-size "5000000" ;
owlim:cache-memory "4G" ;
owlim:storage-folder "storage" ;
owlim:repository-type "file-repository" ;
]
].
{noformat}
h1. Memory Requirements
Apart from the I/O buffers used for caching, GraphDB-SE keeps the indexes for the nodes of the RDF graph in memory. This design decision improves the overall performance of the repository. Each I/O buffer (page) is exactly 64 KB and the indexing information per node in the graph is 12 bytes, so memory requirements per repository vary with the dataset. To ease the calculation of the amount of Java heap memory required for a GraphDB-SE repository, an Excel spreadsheet is included in the distribution -- {{owlim-se-configurator.xls}}.
The page cache is organised in two sets of buffers, read-only and dirty. Each page is first loaded into the read-only cache. When this gets full, a page (if dirty) is moved to the dirty cache, where it can be later written to the storage.
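The two figures quoted in this section (64 KB per page, 12 bytes per node) allow a rough back-of-the-envelope estimate. The sketch below is in Python and is only an illustration: the function name is hypothetical, 64 KB is assumed to mean 65,536 bytes, and the result ignores every other consumer of heap memory, so the {{owlim-se-configurator.xls}} spreadsheet remains the tool to use for real sizing.

```python
PAGE_SIZE_BYTES = 64 * 1024   # one I/O buffer (page), assuming 64 KB = 65,536 bytes
NODE_INDEX_BYTES = 12         # indexing information per node in the graph


def rough_memory_estimate(node_count, cached_pages):
    """Bytes needed for the node indexes plus the given number of cached pages."""
    return node_count * NODE_INDEX_BYTES + cached_pages * PAGE_SIZE_BYTES


# Illustrative figures only: one million nodes and one thousand cached pages.
estimate = rough_memory_estimate(1_000_000, 1_000)
print(f"{estimate / (1024 * 1024):.1f} MB")
```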
h2. Cache Memory Configuration
There are several components in GraphDB that make use of caching (e.g. FTS indices, predicate list, tuple indices). In different situations certain caches will need more memory than others. GraphDB allows for the configuration of both the total cache memory to be used by a repository and all the separate per-module caches.
h2. Parameters
The following parameters control the amount of memory assigned to each of the different caches:
|| Parameter || Unit || Default || Description ||
| cache-memory | bytes | | The amount of memory to be distributed among different caches |
| tuple-index-memory | bytes | 80M | Memory used for PSO and POS caches |
| predicate-memory | bytes | 80M | Memory used for predicate list cache |
| fts-memory | bytes | 20M | Memory used for full-text index cache (node search) |
All parameters can be specified in bytes, kilobytes, megabytes or gigabytes by using a unit specifier at the end of the integer number. When no unit specifier is given, this is interpreted as bytes, otherwise use k or K - kilobytes, m or M - megabytes and g or G - gigabytes (everything base 2).
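The unit rules above translate directly into a small conversion routine. This is a minimal sketch in Python, not GraphDB's own parser; the function name {{parse_memory_size}} is hypothetical, and rejecting anything other than an integer followed by an optional single-letter unit is an assumption.

```python
import re

# Multipliers for the unit specifiers; everything is base 2.
UNIT_MULTIPLIERS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_memory_size(value):
    """Convert a value such as '80M' or '4G' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", value.strip())
    if match is None:
        raise ValueError(f"not a valid memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_MULTIPLIERS[unit.lower()]


for text in ("1024", "64k", "80M", "4G"):
    print(text, "=", parse_memory_size(text), "bytes")
```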
h2. Memory Distribution
The general rule of thumb is:
{noformat}cache-memory = tuple-index-memory + predicate-memory + fts-memory
{noformat}
However, if some of the modules that use the cache (e.g. full-text search) are turned off, they are excluded from the above equation.
Furthermore, if cache-memory is explicitly configured and some of the other memory parameters are omitted, the missing values are resolved by uniformly distributing the remaining memory after all the explicitly configured memory parameters are subtracted. For example if cache-memory = 100M, fts-memory = 10M and the other memory parameters are missing, then they are implicitly assigned (100M - 10M) / 2 = 45M each.
If cache-memory isn't specified then all the missing memory parameters are assigned their default values.
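The resolution rules in this section can be sketched as follows. The Python below is an illustration under stated assumptions, not GraphDB's code: the function name {{resolve_cache_sizes}} is hypothetical, sizes are plain numbers in megabytes, all three modules are assumed to be enabled, and the case where the explicit values exceed {{cache-memory}} is not handled because the source does not describe it.

```python
# Default per-module cache sizes in megabytes, taken from the parameter table.
DEFAULTS_MB = {"tuple-index-memory": 80, "predicate-memory": 80, "fts-memory": 20}


def resolve_cache_sizes(configured, cache_memory=None):
    """Fill in the per-module cache sizes that were not configured explicitly."""
    sizes = dict(configured)
    missing = [name for name in DEFAULTS_MB if name not in sizes]
    if cache_memory is None:
        # No total given: every missing parameter takes its default value.
        for name in missing:
            sizes[name] = DEFAULTS_MB[name]
        return sizes
    # Total given: share what is left uniformly among the missing parameters.
    remaining = cache_memory - sum(sizes.values())
    for name in missing:
        sizes[name] = remaining / len(missing)
    return sizes


# The example from the text: cache-memory = 100M and fts-memory = 10M.
print(resolve_cache_sizes({"fts-memory": 10}, cache_memory=100))
```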
h1. Configuration Parameters
Almost all GraphDB parameters can be set both in the TTL configuration file (that will populate the SYSTEM repository) and from the command line using the Java {{\-D<param.name>=<value>}} command line option to set system properties. When a parameter is set simultaneously using both methods, the system property overrides the value in the configuration file. Some GraphDB parameters can only be set using system properties.
Once a repository is created, it is possible to change some parameters, either by changing the SYSTEM repository or by overriding values using Java system properties - both methods require a restart of the JVM or Tomcat. Some parameters cannot be changed after a repository has been created. These either have no effect (once the relevant data structures are built, their structure cannot be changed) or changing them will cause inconsistencies (these parameters affect the reasoner). The following table explains the values used in the 'Can be changed' column of the table of configuration parameters:
|| Value of 'Can be changed' \\ || Meaning \\ ||
| Yes \\ | Parameters that can be changed (effective after a restart) |
| No \\ | Parameters where a change has no effect |
| {color:#ff0000}Must NOT change{color} | Parameters that must not be changed once the repository has been created - doing so will likely lead to inconsistent data (unsupported inferred statements, missing inferred statements or inferred statements that cannot be deleted) \\ |
The following table lists all GraphDB configuration parameters:
|| Parameter || TTL || Java \-D || Can be changed \\ || Description ||
| *base-URL* | X | X | Yes \\ | _default_ *<none>*, specifies the default namespace for the main persistence file. Non-empty namespaces are recommended, because their use guarantees the uniqueness of the anonymous nodes that may appear within the repository. |
| *cache-memory* | X | X | Yes \\ | _default_ <_none_>, specifies the total amount of memory to be given to all types of cache. |
| *check-for-inconsistencies* | X | X | Yes \\ | _default_ *false*, turns on or off the mechanism for consistency checking; consistency checks are defined in the rule file and are applied at the end of every transaction, if this parameter is *true*. If an inconsistency is detected, when committing a transaction, then the whole transaction will be rolled back. |
| *debug.level* | | X | Yes \\ | _default_ *0*, defines the level of detail of the GraphDB output, used in *QueryModelConverter*, *SailConnectionImpl* and *HashEntityPool*. The extra logging information is written to the logger at 'DEBUG' level, so in order to see this output the logger properties must be set by adding an entry at the appropriate level. For example, when using GraphDB deployed using Tomcat on Ubuntu Linux, you will need to edit the file {{/usr/share/tomcat6/.aduna/openrdf-sesame/conf/logback.xml}} and add an entry after the <appender>...</appender> section. Exactly how to set the logger depends on which of the classes is being examined, as shown below:
* *QueryModelConverter*: \\
** Logger entry: {{<logger name="com.ontotext.trree.query" level="all"/>}}
** *debug.level* > 2 : Outputs the query optimisation time. \\
** *debug.level* > 3 : Outputs the query plan. \\
* *SailConnectionImpl*: \\
** Logger entry: {{<logger name="com.ontotext.trree.SailConnectionImpl" level="all"/>}}
** *debug.level* > 0 : Outputs "Owlim evaluation strategy" or "Sesame evaluation strategy" when evaluating a query. \\
** *debug.level* > 2 : ThreadPool outputs when a worker thread starts and stops. \\
* *HashEntityPool*: \\
** Logger entry: {{<logger name="com.ontotext.trree.entitypool.HashEntityPool" level="all"/>}}
** *debug.level* > 1 : If version number is less than the current one, outputs "Older Entity storage version found: X, recent one is: Y". |
| *defaultNS* | X | X | No \\ | _default_ *<empty>*, the default namespaces corresponding to each imported schema file, separated by semicolons. The number of namespaces must be equal to the number of schema files in the *imports* parameter. \\
Example: \\
\\
{{owlim:defaultNS "http://www.w3.org/2002/07/owl#;http://example.org/owlim#";}} \\
Note: This parameter cannot be set via a command line argument. |
| *disable-sameAs* | X | X | {color:#ff0000}Must NOT change{color} | _default_ *false*, enables or disables the {{owl:sameAs}} optimisation |
| *enable-context-index* | X | X | Yes \\ | _default_ *false*, if set to 'true' then GraphDB will build and use the context index/indices. |
| *enable-literal-index* | X | X | Yes \\ | _default_ *true*, enables or disables the [literal index|GraphDB-SE Indexing Specifics]. The literal index is always built as data is loaded/modified. This parameter only affects whether the index is used during query-answering. |
| *enable-optimization* | X | X | Yes \\ | _default_ *true*, enables or disables query optimisation. \\
*NOTE* Disabling query optimisation is rarely needed, usually only for debugging purposes. Also be aware that disabling query optimisation will also disable the correct behaviour of plug-ins (full-text search, geo-spatial extensions, RDF Rank, etc.) |