View Source

{info}The Explain Plan is the regular feature in GraphDB versions 6.4.0, 6.4.1, 6.4.2.{info}

{toc}

h1. LDBC SPB
All examples below refer to the LDBC Semantic Publishing Benchmark (SPB) query mix ([http://ldbc.eu]). This benchmark is chosen because the domain is easy to understand:
* CreativeWorks (journalistic articles)
* topics (entities found in the content of articles, such as people, organizations, locations)
* mentions of topics in works

h1. GraphDB Execution Strategy

h2. Disk-based AVL Trees

The GraphDB storage component uses disk-based AVL Trees to keep triples in ordered fashion. GraphDB keeps two such trees as indices, which contain all statements, sorted by POS or PSO. They can return all triples for a fixed predicate with either bound or unbound subject & object.

h2. Join strategy

GraphDB uses the indexed nested loops (INL) join strategy. E.g. assume that for the following query:
{noformat}
select * where {
?x rdf:type foaf:Person.
?x rdfs:label ?label.
}
{noformat}
the optimiser has selected to execute the {{?x rdf:type foaf:Person}} pattern first:
- {{?x rdf:type foaf:Person}} causes a query to the POS index, which returns all triples with {{P=rdf:type, O=foaf:Person}}
- Then query execution loops over this collection, binding {{?x}} to items {{X}}
- A nested query is made to the PSO index where {{P=rdfs:label, S=X}}

h2. Aggregation

Aggregation is done in a single pass over the result set and uses HashMaps to calculate the aggregate values. The aggregation overhead is relatively small compared to the fetch time and is done in linear time over the collection size.

For example, a typical aggregation query is LDBC SPB Q7:
{noformat}
# Retrieve the N most popular topics mentioned by CreativeWorks. Limit results to some "primary content"
SELECT ?mentions (COUNT(*) as ?count)
WHERE {
?creativeWork cwork:mentions ?mentions .
{
SELECT ?creativeWork (count(*) as ?pcCount) {
?creativeWork bbc:primaryContentOf ?pc .
}
GROUP BY ?creativeWork
}
}
GROUP BY ?mentions
ORDER BY DESC(?count)
LIMIT 10
{noformat}

Execution time is \~500ms, fetch time is \~2700ms, and aggregation time is <100ms.

h1. Using Explain Plan

To see the query explain plan, use the {{onto:explain}} pseudo-graph:
{noformat}
PREFIX onto: <http://www.ontotext.com/>
select * from onto:explain
. . .
{noformat}

Instead of the query result, GraphDB returns an iterator with the explain plan result.
- each row provides information about a single triple pattern
-- the order of triple patterns may not correspond to the original query because the order is selected by the GraphDB Query Optimiser
-- the identation in the first column shows the nesting of pattern queries (nested loops)
-- the second column shows the collection size, i.e. number of triples matching the pattern
-- the next two columns show the number of unique subjects and objects in this collection
-- the last two columns show the optimizer's judgement of "complexity" and the time it took to iterate the collection
- the last two rows show the execution time, the fetching time, and the number of results

h2. Example plans

h3. Simple Query
{noformat}
select * from onto:explain where {
?x rdf:type rdfs:Class.
?x rdfs:label ?label.
}
{noformat}
The query plan is:
!Screen Shot 2014-08-27 at 4.27.20 PM.png!

The optimizer selected to first execute the join {{?x rdf:type rdfs:Class}}, then {{?x rdfs:label ?label}}. The reason is that the number of classes is much smaller than the number of things that have a label (45 vs 200 in this case).

h3. Query LDBC SPB Q16
{noformat}
SELECT DISTINCT ?thing ?tag ?category ?title FROM onto:explain
WHERE {
?thing rdf:type cwork:CreativeWork .
?thing rdf:type ?class .
?class rdfs:subClassOf cwork:CreativeWork .
?thing cwork:tag ?tag .
?thing ?a ?o .
?a rdfs:subPropertyOf cwork:tag .
?thing cwork:category ?category .
?thing cwork:title ?title .
FILTER (CONTAINS (?title, "policy") && (?tag = ?o)) .
OPTIONAL {
?thing cwork:audience ?audience .
FILTER ( ?audience = cwork:InternationalAudience ) .
}
} ORDER BY ?tag LIMIT 100
{noformat}
The query plan (on a 5M Dataset) is:
!Screen Shot 2014-08-19 at 12.09.40 PM.png!