GraphDB-SE LoadRDF tool


The LoadRDF tool resides in the loadrdf/ folder of the GraphDB-SE and GraphDB-Enterprise distributions and is used for fast loading of large data sets.
Note: This tool creates a brand new repository from existing data. It cannot be used to update an existing repository.

Usage

java LoadRDF <config.ttl> <serial|parallel> <files...>

The tool accepts on the command line a standard repository configuration file in Turtle format, the keyword 'serial' or 'parallel', and a list of files to be loaded. The keywords 'serial' and 'parallel' specify how the data is loaded into the repository:

  • 'serial' means parsing is followed by entity resolution, then by loading, and optionally by inference;
  • 'parallel' means that resolving and loading start earlier, i.e. as soon as the parse buffer is filled, which reduces the overall time at the cost of more CPU and memory. Inference, if any, is always performed at the end: sorting and other techniques can be used to boost performance at load time, but the sorted order is inevitably lost at the inference stage, so it cannot be exploited there.

Gzipped files are supported. The format is guessed from the file extension (optionally preceding .gz), e.g. file.nt.gz is expected to be in NTriples format.
If a directory is specified instead of a file, it is processed as a whole, together with all its subfolders.
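
For example, the following run loads a gzipped NTriples dump and everything under a data directory (the file names and paths are illustrative only):

java LoadRDF config.ttl parallel dump.nt.gz /data/more-dumps/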

Java -D command line options

The tool accepts command line options using -D:

  • -Dcontext.file: a text file containing context names, one per line. Lines are trimmed, and an empty line denotes the 'null' context. Each loaded file (transaction) uses the next context line from this file; see the example below.
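
For illustration, assume a file contexts.txt with three lines, the second of which is empty; the first loaded file then goes into the first context, the second into the 'null' context and the third into the second context (the context URIs and file names are made up for this example):

  http://example.org/graph/a

  http://example.org/graph/b

java -Dcontext.file=contexts.txt LoadRDF config.ttl serial a.nt b.nt c.nt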

The following options tune the behaviour of the ParallelParser when the 'parallel' loading mode is used.

  • -Dpool.size: the number of sorting threads to use per index, applied after entities are resolved into IDs and before the data is loaded into the repository. Sorting accelerates data loading. There is a separate thread pool for each index: PSO (sorts in predicate-subject-object order), POS (sorts in predicate-object-subject order) and, if context indices are enabled, PCSO and PCOS.
    The value of this parameter defaults to 1 and is currently more of an experimental option: experience suggests that more than one sorting thread has little effect, because the resolving stage takes much more time, while sorting is an in-memory quick sort that completes very quickly even for large buffers.
  • -Dpool.buffer.size: buffer size (number of statements) for each stage; defaults to 1M statements. Memory usage and the overhead of inserting data can be tuned with this parameter (see the example invocation after this list):
    • a smaller buffer reduces the memory required;
    • a bigger buffer theoretically reduces the overhead, because threads are less likely to have to wait for the operations they depend on and the CPU stays busy most of the time.
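
For example, a parallel load with two sorting threads per index and a smaller buffer of 500,000 statements could be started as follows (the values and file name are illustrative only):

java -Dpool.size=2 -Dpool.buffer.size=500000 LoadRDF config.ttl parallel dump.nt.gz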

ParallelParser

There is a way to specify the context programmatically, described below.

LoadRDF uses the ParallelParser class by passing it an InputStream. The ParallelParser internally parses the file and fills buffers of statements for the next stage (resolving), which in turn prepares resolved statements for the next stage (sorting); the sorted statements are then loaded asynchronously into PSO and POS. Another way to use the ParallelParser is to supply an Iterator<Statement> (parsed from another source, or possibly generated) instead of an InputStream or a File. Both constructors require a context into which statements are loaded. Only statements without a context of their own go into the specified one: when a format supporting contexts is used (TriG, TriX, N-Quads), statements that already carry a context keep it, and only the statements without one go into the supplied context.
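
The two construction styles can be sketched as follows. This is a minimal sketch, assuming the constructors take just the data source and a context URI; the real constructors may require additional arguments (e.g. the RDF format), and the GraphDB package containing ParallelParser is omitted here:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.Iterator;
import org.openrdf.model.Statement;
import org.openrdf.model.URI;
import org.openrdf.model.impl.ValueFactoryImpl;

public class ParallelParserSketch {
    public static void main(String[] args) throws Exception {
        // The fallback context: only statements without a context of their own end up here.
        URI context = ValueFactoryImpl.getInstance().createURI("http://example.org/graph/a");

        // Variant 1: hand the parser an InputStream and let it parse, resolve,
        // sort and load asynchronously into PSO and POS.
        InputStream in = new FileInputStream("data.trig");
        ParallelParser fromStream = new ParallelParser(in, context);              // signature assumed

        // Variant 2: supply already parsed (or generated) statements instead.
        Iterator<Statement> statements = Collections.<Statement>emptyList().iterator();
        ParallelParser fromIterator = new ParallelParser(statements, context);    // signature assumed
    }
}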

There is a third way to use the ParallelParser: instead of constructing it to process specific data, it can be created for general use and its addDataDirectly() method invoked whenever a new data buffer has been prepared. This is how it is used internally in AVLRepositoryConnection (in putStatementParallel(), when the 'parallelInsertion' flag has been set by invoking useParallelInsertion() with a poolSize and a bufferSize). The buffer consists of parsed and resolved statements, 5 longs each (so its size must be a multiple of 5), namely subject-predicate-object-context-status. The buffer is handed to the sorting threads and then loaded asynchronously into PSO and POS (and also into PCSO and PCOS if the context index is enabled).
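
The buffer layout can be illustrated with a short sketch; the entity IDs and the status value below are placeholders, and addDataDirectly() is assumed to accept a long[]:

// 'parser' is a ParallelParser created for general use, as described above.
long subj = 101L, pred = 102L, obj = 103L, ctx = 0L, status = 0L; // resolved entity IDs (placeholder values)
long[] buffer = { subj, pred, obj, ctx, status };                 // length must be a multiple of 5
parser.addDataDirectly(buffer);                                   // sorted, then loaded into PSO/POS (and PCSO/PCOS if enabled)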

The ParallelParser accepts two -D command line options that are used for testing purposes, namely to measure the overhead of parsing and resolving versus loading data into the repository:

  • -Ddo.resolve.entities=false: only parse the data, do not proceed to resolving;
  • -Ddo.load.data=false: parse and resolve the data, but do not proceed to loading into the repository.

By default the data is parsed, resolved, and loaded into the repository. If either option is specified, a descriptive message is printed to System.out.
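
For example, the parsing stage alone could be timed with an invocation such as the following (the file name is illustrative only):

java -Ddo.resolve.entities=false LoadRDF config.ttl parallel dump.nt.gz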
