GraphDB-SE Explain Plan

compared with
Current by Vladimir Alexiev
on Dec 13, 2014 14:52.

Key
This line was removed.
This word was removed. This word was added.
This line was added.

Changes (56)

View Page History
h1. Introduction

GraphDB has a feature called "Explain Plan" which explains how GraphDB executes SPARQL queries and also includes information about unique subjects, predicates and objects collections sizes. It helps GraphDB users improve the query plans leading to better execution performance.
\\
"Explain Plan" is a feature that explains how GraphDB executes a SPARQL query and also includes information about unique subject, predicate and object collection sizes. It can help you improve the query, leading to better execution performance.


h2. Examples - based on LDBC SPB
h2. LDBC SPB
All examples below refer to the LDBC Semantic Publishing Benchmark (SPB) query mix ([http://ldbc.eu]). This benchmark is chosen because the domain is easy to understand:
* CreativeWorks (journalistic articles)
* topics (entities found in the content of articles, such as people, organizations, locations)
* mentions of topics in works

All examples below refer to the LDBC Semantic Publishing Benchmark (SPB) query mix ([http://ldbc.eu]). This benchmark is chosen, because the domain is easy to understand:
* articles (instances of CreativeWork)
* topics
* mentions (entities found in the content of the articles, such as people, organizations, locations)


h1. GraphDB Execution Strategy

h2. Disk-based AVL Trees

GraphDB Storage component uses disk-based AVL Trees to keep triples in ordered fashion. GraphDB keeps two such trees as indices, which contain all statements, sorted by POS or PSO. They can return the triples for a fixed predicate, which either bound or unbound the subject/object.
The GraphDB storage component uses disk-based AVL Trees to keep triples in ordered fashion. GraphDB keeps two such trees as indices, which contain all statements, sorted by POS or PSO. They can return all triples for a fixed predicate with either bound or unbound subject & object.

h2. Join strategy

GraphDB uses indexed nested loops (INL) join strategy. E.g. For when the following query is being executed:
GraphDB uses the indexed nested loops (INL) join strategy. E.g. assume that for the following query:
{code} {noformat}
SELECT select * where {
  ?x rdf:type rdfs:Class. foaf:Person.
  ?x rdfs:label ?label.
}
{code} {noformat}
the optimiser has selected to execute the {{?x rdf:type foaf:Person}} pattern first:
- {{?x rdf:type foaf:Person}} causes a query to the POS index, which returns all triples with {{P=rdf:type, O=foaf:Person}}
- Then query execution loops over this collection, binding {{?x}} to items {{X}}
- A nested query is made to the PSO index where {{P=rdfs:label, S=X}}

* The optimiser decides on the order of the joins;
\* The first join (let’s say the optimiser chooses “?x a rdf:type”) is translated into a query to the POS index, which returns all triples with P=rdf:type, O=rdfs:class;
\* Then in a loop over this collection, ?x is bound to an item X, and a new query is asked to the PSO index, where P=rdfs:label and S=X.


h2. Aggregation

Aggregation is done in a single pass over the result set and uses HashMaps to calculate the aggregate values. The aggregation overhead is relatively small compared to the fetch time and is done in linear time over the collection size.

A typical aggregation query is Q7:
For example, a typical aggregation query is LDBC SPB Q7:
{code} {noformat}
# Retrieve the N most popular topics that creative works mention. Further limiting the results above some primary content limit.
# reasoning : owl:ObjectProperty
# Retrieve the N most popular topics mentioned by CreativeWorks. Limit results to some "primary content"
SELECT ?mentions ((COUNT(*)) (COUNT(*) as ?count)
WHERE {
?creativeWork cwork:mentions ?mentions .
?creativeWork bbc:primaryContentOf ?pc .
}
GROUP BY (?creativeWork) ?creativeWork
}
}
ORDER BY DESC(?count)
LIMIT 10
{code} {noformat}

The “execution time” is \~500ms, the fetch time is \~2700ms, and the aggregation time is <100ms.
Execution time is \~500ms, fetch time is \~2700ms, and aggregation time is <100ms.

h1. Usage
h1. Using Explain Plan

To execute and return query explain plan, the user has to use a system *from* clause, which is:
To see the query explain plan, use the {{onto:explain}} pseudo-graph:
{code} {noformat}
from <http://www.ontotext.com/explain>
{code}

or if we use prefix:
{code}
PREFIX onto: <http://www.ontotext.com/>
select * from onto:explain
. . .
from onto:explain
{code} {noformat}

The server returns iterator with the explain plan result, instead of query result.
Instead of the query result, GraphDB returns an iterator with the explain plan result.
- each row provides information about a single triple pattern
-- the order of triple patterns may not correspond to the original query because the order is selected by the GraphDB Query Optimiser
-- the identation in the first column shows the nesting of pattern queries (nested loops)
-- the second column shows the collection size, i.e. number of triples matching the pattern
-- the next two columns show the number of unique subjects and objects in this collection
-- the last two columns show the optimizer's judgement of "complexity" and the time it took to iterate the collection
- the last two rows show the execution time, the fetching time, and the number of results

h2. Example plans

h3. Simple qQuery
{code} {noformat}
SELECT select * from onto:explain where {
  ?x rdf:type rdfs:Class.
  ?x rdfs:label ?label.
}
{code} {noformat}
The query plan is:
!Screen Shot 2014-08-27 at 4.27.20 PM.png!

h3. Plan
The optimizer selected to first execute the join {{?x rdf:type rdfs:Class}}, then {{?x rdfs:label ?label}}. The reason is that the number of classes is much smaller than the number of things that have a label (45 vs 200 in this case).

!Screen Shot 2014-08-27 at 4.27.20 PM.png|border=1!
h3. Query LDBC SPB Q16
{noformat}
In this example the optimizer selects first to execute the join "\[ x type Class \]", then "\[ x label label \]". This is so, because the number of classes is much smaller than the number of things that have a label (45 vs 200 in this case).


h3. Q16
{code}
SELECT DISTINCT ?thing ?tag ?category ?title FROM onto:explain
FROM ot:explain
WHERE {
WHERE {
?thing rdf:type cwork:CreativeWork .
?thing rdf:type ?class .
}
} ORDER BY ?tag LIMIT 100
{code} {noformat}

h3. Query Plan on 5M Dataset


The query plan (on a 5M Dataset) is:
!Screen Shot 2014-08-19 at 12.09.40 PM.png|border=1!

h3. Reading the plan

Let's explain the result returned from the example.
Every row that contains some information in brackets in the first column provides information about a triple pattern from the SPARQL query. The order of triple patterns is the same as they are fetched during query evaluation. They do not correspond to the original query, because they are modified by the GraphDB Query Optimiser.
\\
The first join selected by the optimiser is "?class subclassOf CreativeWork". The next columns provide information about collection size (all triples matched the pattern), unique subjects and objects from this collection, the pattern complexity and the time for iterating the entire collection.