The Explain Plan is a regular feature in GraphDB versions 6.4.0, 6.4.1, and 6.4.2.
All examples below refer to the LDBC Semantic Publishing Benchmark (SPB) query mix (http://ldbc.eu). This benchmark is chosen because the domain is easy to understand:
- CreativeWorks (journalistic articles)
- topics (entities found in the content of articles, such as people, organizations, locations)
- mentions of topics in works
The GraphDB storage component uses disk-based AVL trees to keep triples in sorted order. GraphDB keeps two such trees as indices. Each contains all statements, sorted by POS or by PSO respectively. Both can return all triples for a fixed predicate, with the subject and object either bound or unbound.
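As a rough illustration of how a sorted index answers such lookups, the sketch below replaces the on-disk AVL trees with sorted in-memory Python lists. The sample triples and the `prefix_scan` helper are hypothetical and are not part of GraphDB. The sketch only shows why a fixed predicate, optionally followed by a bound object (POS) or bound subject (PSO), maps to one contiguous range of a sorted index.

```python
from bisect import bisect_left

# Hypothetical sample data as (subject, predicate, object) triples.
triples = [
    ("alice", "rdf:type", "foaf:Person"),
    ("bob", "rdf:type", "foaf:Person"),
    ("acme", "rdf:type", "foaf:Organization"),
    ("alice", "rdfs:label", "Alice"),
    ("bob", "rdfs:label", "Bob"),
]

# In-memory stand-ins for the two indices: the same statements,
# sorted by (P, O, S) and by (P, S, O).
POS = sorted((p, o, s) for s, p, o in triples)
PSO = sorted((p, s, o) for s, p, o in triples)


def prefix_scan(index, *prefix):
    """Yield every entry of a sorted index whose leading fields equal `prefix`."""
    # A shorter tuple sorts before any longer tuple that starts with it,
    # so bisect_left lands on the first matching entry (if any).
    start = bisect_left(index, prefix)
    for i in range(start, len(index)):
        entry = index[i]
        if entry[:len(prefix)] != prefix:
            break  # left the contiguous range of matches
        yield entry


# P and O bound, S unbound: served by the POS index.
persons = list(prefix_scan(POS, "rdf:type", "foaf:Person"))
# P and S bound, O unbound: served by the PSO index.
alice_labels = list(prefix_scan(PSO, "rdfs:label", "alice"))
# Only P bound: either index works.
all_types = list(prefix_scan(POS, "rdf:type"))
```

With only the predicate bound, the scan covers every entry for that predicate, which is why the size of that collection matters for query cost.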
GraphDB uses the indexed nested loops (INL) join strategy. For example, assume that for the following query:
the optimiser has chosen to execute the ?x rdf:type foaf:Person pattern first:
- ?x rdf:type foaf:Person causes a query to the POS index, which returns all triples with P=rdf:type, O=foaf:Person
- Query execution then loops over this collection, binding ?x to each item X in turn
- A nested query is made to the PSO index where P=rdfs:label, S=X
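The three steps above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionaries `pos` and `pso` are hypothetical stand-ins for the real indices, the sample triples are made up, and the object of the rdfs:label pattern is assumed to be an unbound variable.

```python
from collections import defaultdict

# Hypothetical sample data as (subject, predicate, object) triples.
triples = [
    ("alice", "rdf:type", "foaf:Person"),
    ("bob", "rdf:type", "foaf:Person"),
    ("carol", "rdf:type", "foaf:Person"),  # a person without a label
    ("acme", "rdf:type", "foaf:Organization"),
    ("alice", "rdfs:label", "Alice"),
    ("bob", "rdfs:label", "Bob"),
    ("acme", "rdfs:label", "ACME"),
]

# Stand-ins for the indices: (P, O) -> subjects and (P, S) -> objects.
pos = defaultdict(list)
pso = defaultdict(list)
for s, p, o in triples:
    pos[(p, o)].append(s)
    pso[(p, s)].append(o)


def persons_with_labels():
    """Indexed nested loops join of the two patterns on ?x."""
    # Outer loop: one POS lookup with P=rdf:type, O=foaf:Person.
    for x in pos.get(("rdf:type", "foaf:Person"), []):
        # Inner loop: one PSO lookup per binding, with P=rdfs:label, S=X.
        for label in pso.get(("rdfs:label", x), []):
            yield x, label


results = list(persons_with_labels())
```

The number of nested index lookups equals the size of the outer collection, so the pattern executed first largely determines the total cost.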
Aggregation is done in a single pass over the result set and uses HashMaps to calculate the aggregate values. The aggregation overhead is relatively small compared to the fetch time and is done in linear time over the collection size.
For example, a typical aggregation query is LDBC SPB Q7:
Execution time is ~500ms, fetch time is ~2700ms, and aggregation time is <100ms.
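The single-pass, hash-based aggregation described above can be sketched as follows. The row shape `(group_key, value)` and the function name `aggregate` are assumptions made for illustration and do not reflect the actual LDBC SPB Q7 query or GraphDB internals.

```python
def aggregate(rows):
    """Compute COUNT and SUM per group in one pass over the result set."""
    groups = {}  # hash map: group key -> (count, total)
    for key, value in rows:
        count, total = groups.get(key, (0, 0))
        groups[key] = (count + 1, total + value)
    return groups


# Hypothetical result rows, already fetched by the join stage.
rows = [("topicA", 3), ("topicB", 5), ("topicA", 4)]
summary = aggregate(rows)
```

Each row costs one hash lookup and one update, so the work grows linearly with the number of rows and stays small next to the cost of fetching them.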
To see the query explain plan, use the onto:explain pseudo-graph:
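As a sketch of what such a request might look like, the helper below builds a query string that names onto:explain in a FROM clause. The helper `with_explain` is hypothetical, and the code assumes that the `onto:` prefix resolves to `http://www.ontotext.com/`. The resulting string would be sent to the repository's SPARQL endpoint like any other query.

```python
# Assumption: the onto: prefix maps to this namespace.
ONTO_PREFIX = "PREFIX onto: <http://www.ontotext.com/>"


def with_explain(prefixes, projection, where_body):
    """Build a SELECT query that asks for the explain plan instead of results."""
    lines = [
        ONTO_PREFIX,
        *prefixes,
        f"SELECT {projection}",
        "FROM onto:explain",  # the pseudo-graph that triggers the explain plan
        "WHERE {",
        f"  {where_body}",
        "}",
    ]
    return "\n".join(lines)


query = with_explain(
    ["PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>"],
    "*",
    "?x rdfs:label ?label .",
)
print(query)
```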
Instead of the query result, GraphDB returns an iterator with the explain plan result.
- each row provides information about a single triple pattern
- the order of triple patterns may not correspond to the original query because the order is selected by the GraphDB Query Optimiser
- the indentation in the first column shows the nesting of pattern queries (nested loops)
- the second column shows the collection size, i.e. number of triples matching the pattern
- the next two columns show the number of unique subjects and objects in this collection
- the last two columns show the optimizer's judgement of "complexity" and the time it took to iterate the collection
- the last two rows show the execution time, the fetching time, and the number of results
The query plan is:
The optimizer chose to execute the pattern ?x rdf:type rdfs:Class first, and then ?x rdfs:label ?label. The reason is that the number of classes is much smaller than the number of things that have a label (45 vs 200 in this case), so the outer loop is shorter and fewer nested index lookups are needed.
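The effect of this ordering can be checked with a little arithmetic. The sketch below uses a deliberately simplified heuristic (smallest collection first) that is an assumption for illustration only. The real optimiser also takes into account the unique subject and object counts and its own complexity estimate.

```python
# Collection sizes taken from the example above.
sizes = {
    "?x rdfs:label ?label": 200,
    "?x rdf:type rdfs:Class": 45,
}


def order_patterns(sizes):
    """Simplified heuristic: execute the smallest collection first."""
    return sorted(sizes, key=sizes.get)


def nested_lookups(order, sizes):
    """With two patterns, each item of the outer collection costs one nested lookup."""
    return sizes[order[0]]


chosen = order_patterns(sizes)
reverse = list(reversed(chosen))

lookups_chosen = nested_lookups(chosen, sizes)    # outer loop over the classes
lookups_reverse = nested_lookups(reverse, sizes)  # outer loop over the labelled things
```

Starting from the 45 classes requires 45 nested lookups into the label collection, whereas starting from the 200 labelled resources would require 200.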
The query plan (on a 5M Dataset) is: