The LoadRDF tool resides in the loadrdf/ folder of the GraphDB-SE and GraphDB-Enterprise distributions and is used for fast loading of large data sets.
 | LoadRDF creates a brand new repository from existing data; it cannot update an existing repository. |
Usage
As input on the command line, the LoadRDF tool accepts a standard config file in Turtle format, the mode, and a list of files for loading.
The mode specifies the way the data is loaded in the repository:
- serial - parsing is followed by entity resolution, which is then followed by loading, optionally followed by inference;
- parallel - resolving and loading start earlier, i.e. as soon as the parse buffer is filled, which reduces the overall time at the cost of more CPU and memory. Inference, if any, is always performed at the end; sorting and other techniques can be used to boost performance at load time, but they cannot be taken advantage of at the inference stage. Therefore, this mode is effective only without inference;
- fullyparallel - uses the parallel parse and load from the previous mode, but instead of running the inference at the end, it performs the inference in parallel during the load. This can give a significant boost when loading large data sets with inference enabled. (Note that at the moment, this mode does not support the owl:sameAs optimisation.)
 | Gzipped files are supported. The format is guessed from the file extension (optionally preceding .gz), e.g. file.nt.gz has to be in the N-Triples format.
In addition to files, whole directories, if specified, will be processed (together with all their subfolders). |
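For example, an invocation might look like the following sketch (the exact script name and argument order depend on your distribution; config.ttl, statements.nt.gz and more-data/ are placeholder names):

  loadrdf.sh config.ttl parallel statements.nt.gz more-data/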
Java -D command line options
The LoadRDF tool accepts Java command line options, using -D:
- -Dcontext.file - a text file containing context names, one per line. Lines are trimmed and empty lines denote the 'null' context. Each file (transaction) uses one context line from this file.
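For illustration, a context file for a load of three files might look like the sketch below, where the hypothetical URIs name the target contexts and the empty second line denotes the 'null' context:

  http://example.org/graph1

  http://example.org/graph3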
The following options can tune the behaviour of the ParallelLoader when you use the parallel loading modes (a combined example is shown after this list):
- -Dpool.size - how many sorting threads to use per index, after entities are resolved into IDs and before the data is loaded into the repository. Sorting accelerates data loading. There are separate thread pools for each of the indices: PSO (sorts in the pred-subj-obj order), POS (sorts in the pred-obj-subj order), and, if context indices are enabled, PCSO and PCOS.
The value of this parameter defaults to 1. It is currently more of an experimental option, as experience suggests that more than one sorting thread does not have much effect: the resolving stage takes much more time, while sorting is an in-memory operation ('quick sort') that completes very quickly even for large buffers;
- -Dpool.buffer.size - the buffer size (number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:
- a smaller buffer size reduces the memory required;
- a bigger buffer size theoretically reduces the overhead, as the operations performed by the threads are less likely to wait for the operations they depend on, and the CPU is kept busy most of the time.
- -Dignore.corrupt.files - valid values are true/false; the default value is true. If set to true, the process continues even if the data set contains corrupt files (they are skipped).
- -Dlru.cache.type - valid values are synch/lockfree; the default value is synch. It determines which type of 'least recently used' cache is used. The recommended value for LoadRDF is lockfree, as it performs better in the parallel modes, which is why the command line scripts set this option.
- -Dinfer.pool.size - the number of inference threads in fullyparallel mode; the default value is the number of cores of the machine's processor, or 4 as set in the command line scripts. A bigger pool theoretically means a faster load, provided there are enough unoccupied cores and the inference does not wait for the other load stages to complete.
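As a sketch only, these properties could be combined in a single invocation like the one below (it assumes the wrapper script forwards -D options to the JVM; otherwise set them inside the script itself, and treat the values as illustrative rather than as recommendations):

  loadrdf.sh -Dpool.buffer.size=500000 -Dlru.cache.type=lockfree -Dinfer.pool.size=8 config.ttl fullyparallel statements.nt.gz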
ParallelLoader
Because the LoadRDF tool uses the ParallelLoader, there is also a way to use loadrdf programmatically. For example, you can write your own Java tool that uses the ParallelLoader internally.
One option is to give the ParallelLoader an InputStream as a parameter. The ParallelLoader internally parses the input and fills buffers of statements for the next stage (resolving), which in turn prepares the resolved statements for the next stage (sorting); finally, the sorted statements are asynchronously loaded into PSO and POS.
Another way to use the ParallelLoader is to specify an Iterator<Statement> (parsed from another source or possibly generated) or a File instead of an InputStream. Both constructors require the context into which the statements will be loaded to be supplied. If you use a format that supports contexts (TriG, TriX, N-Quads), only statements without a specified context go into the one you supply; statements that already have a context keep their own context rather than the one you have additionally specified.
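A minimal sketch of such programmatic use follows. The ParallelLoader package, the constructor signatures, and the load() method shown here are assumptions inferred from the description above and may differ in your GraphDB version; the sketch only illustrates the shape of the calls - construct the loader with an input source and a target context, then run the load.

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.openrdf.model.Resource;
  import org.openrdf.model.impl.URIImpl;

  public class LoadRdfProgrammatically {
      public static void main(String[] args) throws Exception {
          // The context into which statements without their own context will be loaded.
          Resource context = new URIImpl("http://example.org/mygraph");

          // Variant 1: feed the loader an InputStream (constructor shape is an assumption).
          InputStream in = new FileInputStream("statements.nt.gz");
          ParallelLoader streamLoader = new ParallelLoader(in, context);
          streamLoader.load(); // hypothetical method: parse, resolve, sort and load asynchronously

          // Variant 2: feed the loader a File instead of a stream.
          ParallelLoader fileLoader = new ParallelLoader(new File("more-data.trig"), context);
          fileLoader.load();
      }
  }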
The ParallelLoader accepts two -D command line options used for testing purposes, i.e. to measure the overhead of parsing and resolving vs. loading data into the repository:
- -Ddo.resolve.entities=false - only parse the data, do not proceed to resolving;
- -Ddo.load.data=false - parse and resolve the data, but do not proceed to loading in the repository;
- By default, the data is parsed, resolved, and loaded in the repository.
If any of these options is specified, a descriptive message is printed on the console.
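For example, to time only the parse and resolve stages without touching the repository, the load stage can be switched off (same hedged invocation shape as in the earlier sketches):

  loadrdf.sh -Ddo.load.data=false config.ttl parallel statements.nt.gz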