The LoadRDF tool resides in the loadrdf/ folder of the GraphDB-SE and GraphDB-Enterprise distributions and is used for fast loading of large data sets.
It accepts as command-line input a standard config file in Turtle format, the keyword 'serial' or 'parallel', and a list of files to be loaded. The keywords 'serial' and 'parallel' specify how the data is loaded into the repository.
Gzipped files are also supported. The format is guessed from the file extension (optionally preceding .gz); e.g. file.nt.gz is expected to be in N-Triples format.
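For example, a hypothetical invocation might look like the one below; the loadrdf.sh launcher name and the file names are placeholders for illustration, not taken from the distribution:

    # Load two files into the repository described by config.ttl, using the
    # multi-threaded 'parallel' mode; file.nt.gz is decompressed on the fly
    # and parsed as N-Triples based on its extension.
    ./loadrdf.sh config.ttl parallel dataset.nt file.nt.gz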
The tool accepts command-line options passed with -D:
The following options tune the behaviour of the ParallelParser when the 'parallel' loading mode is used.
There is a way to specify the context programmatically, described below.
LoadRDF uses the ParallelParser class by passing it an InputStream. Internally, the ParallelParser parses the input and fills buffers of statements for the next stage (resolving), which in turn prepares resolved statements for the following stage (sorting); the sorted statements are then loaded asynchronously into PSO and POS. Alternatively, the ParallelParser can be constructed with an Iterator<Statement> (parsed from another source, or generated) instead of an InputStream or a File. Both constructors require a context into which statements are loaded: statements without an explicit context go into the specified one, and when the input format supports contexts (TriG, TriX, N-Quads), statements that carry their own context keep it rather than using the one specified.
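The following sketch illustrates these two construction styles. It assumes Sesame (org.openrdf) types for Statement and URI, and constructor shapes matching the description above; the ParallelParser package and exact signatures are GraphDB internals and are not documented here, so treat this as an outline rather than a verbatim example.

    // Assumptions: ParallelParser is imported from its GraphDB-internal package
    // (omitted here); the (InputStream, context) and (Iterator<Statement>, context)
    // constructors follow the description above.
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Collections;
    import java.util.Iterator;
    import org.openrdf.model.Statement;
    import org.openrdf.model.URI;
    import org.openrdf.model.impl.ValueFactoryImpl;

    public class ParallelParserSketch {
        public static void main(String[] args) throws Exception {
            // Target context: statements that carry no context of their own
            // end up in this named graph.
            URI context = ValueFactoryImpl.getInstance()
                    .createURI("http://example.org/graph1");

            // Style 1: let the ParallelParser parse an InputStream itself.
            InputStream in = new FileInputStream("dataset.trig");
            ParallelParser fromStream = new ParallelParser(in, context);

            // Style 2: feed statements that were parsed elsewhere or generated.
            Iterator<Statement> statements =
                    Collections.<Statement>emptyList().iterator();
            ParallelParser fromIterator = new ParallelParser(statements, context);
        }
    }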
There is a third way to use the ParallelParser: instead of constructing it around specific data, it is created for general use and its addDataDirectly() method is invoked whenever a new data buffer has been prepared. This is how AVLRepositoryConnection uses it internally, in putStatementParallel(), once the 'parallelInsertion' flag has been set by calling useParallelInsertion() with a poolSize and a bufferSize. The buffer consists of parsed and resolved statements, 5 longs each (so its size must be a multiple of 5), namely subj-pred-obj-context-status. The buffer is handed to the sorting threads for processing and then loaded asynchronously into PSO and POS (and also into PCSO and PCOS if the context index is enabled).
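As a rough illustration of the third style, the fragment below builds a single-statement buffer of 5 longs and hands it to addDataDirectly(). The general-purpose construction, the exact addDataDirectly() signature, and the literal ID values are assumptions made for illustration; real code would use entity IDs produced by the resolving stage.

    // Assumption: a ParallelParser created "for general use" and an
    // addDataDirectly(long[]) method, per the description above.
    ParallelParser parser = new ParallelParser(/* general-purpose setup, assumed */);

    // One statement = 5 longs: subj, pred, obj, context, status,
    // so the buffer length must be a multiple of 5.
    long[] buffer = new long[] {
            101L, // subject ID (placeholder)
            102L, // predicate ID (placeholder)
            103L, // object ID (placeholder)
            104L, // context ID (placeholder)
            0L    // status
    };

    parser.addDataDirectly(buffer);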
The ParallelParser accepts two -D command-line options used for testing purposes, namely to measure the overhead of parsing and resolving versus loading the data into the repository:
If either of these options is specified, a descriptive message is printed to System.out.