View Source

{toc}

Full-text search (FTS) concerns retrieving text documents out of a large collection by keywords or, more generally, by tokens (represented as sequences of characters). Formally, the query represents an unordered set of tokens and the result is set of documents, relevant to the query. In a simple FTS implementation, relevance is Boolean: a document is either relevant to the query, if it contains all the query tokens, or not. More advanced FTS implementations deal with a degree of relevance of the document to the query, usually judged on some sort of measure of the frequency of appearance of each of the tokens in the document, normalised versus the frequency of their appearance in the entire document collection. Such implementations return an ordered list of documents, where the most relevant documents come first.

FTS and structured queries, like those in database management systems (DBMS), are different information access methods based on a different query syntax and semantics, where the results are also displayed in a different form. FTS systems and databases usually require different types of indices too. The ability to combine these two types of information access methods is very useful for a wide range of applications. Many relational DBMS support some sort of FTS (which is integrated into the SQL syntax) and maintain additional indices that allow efficient evaluation of FTS constraints. Typically, relational DBMS allow the user to define a query, which requires specific tokens to appear in a specific column of a specific table. In SPARQL, there is no standard way for the specification of FTS constraints. In general, there is neither a well-defined, nor widely-accepted concept for FTS in RDF data. Nevertheless, some semantic repository vendors offer some sort of FTS in their engines. This section documents the FTS supported by GraphDB-SE.

Two approaches are implemented in GraphDB-SE, a Lucene-based implementation called 'RDF Search', and a proprietary implementation called 'Node Search'. The two approaches are collectively referred to in this guide as 'full-text indexing' and both of them enable GraphDB to perform complex queries against character data, which significantly speeds up the query process. To select one of them, one should consider their functional differences, which are outlined in the table below. Furthermore, there can be considerable differences between indexing and search speed of the two FTS implementations. Thus, performance-conscious users are recommended to experiment with the performance of both methods with respect to dataset and queries representative for the intended application.

| || Node Search || RDF Search |
|| FTS query form | List of tokens | List of tokens (with Lucene query extensions) ||
|| Result form | Unordered set of nodes | Ordered list of URIs ||
|| Textual Representation | For literals: the string value. \\
For URIs and B-nodes: tokenized URL | Concatenation of the text representations of the nodes from the molecule (1-step neighbourhood in the graph) of the URI ||
|| Relevance | Boolean, based on presence of the query tokens in the text | Vector-space model, reflecting the degree of relevance of the text and the RDF rank of the URI ||
|| Implementation | Proprietary full-text indexing and search implementation | The Lucene engine is integrated and used for indexing and search ||
\\
The Node Search (with parameter *ftsLiteralsOnly* set to *true*) resembles functionality similar to typical FTS implementations in relational DBMS. However, RDF Search is a novel information retrieval concept, which allows for efficient extraction of RDF resources from huge datasets, where ordering of the results by relevance is crucial.


h1. RDF Search - Full-Text Search using Lucene

[Apache Lucene|http://lucene.apache.org] is a high-performance, full-featured text search engine written entirely in Java. GraphDB-SE supports full-text search capabilities, using Lucene with a variety of indexing options and the ability to simultaneously use multiple, differently configured indices in the same query.

In order to use Lucene full-text search in GraphDB-SE, a Lucene index must first be computed. Before being created, each index can be parameterised in a number of ways, using SPARQL 'control' updates. This provides the ability to:
* select what kinds of nodes are indexed (URIs/literals/blank-nodes);
* select what is included in the 'molecule' (explained below) associated with each node;
* select literals with certain language tags;
* choose the size of the RDF 'molecule' to index;
* choose whether to boost the relevance of nodes using RDF Rank values;
* select alternative analysers;
* select alternative scorers.

In order to use the indexing behaviour of Lucene, a text document must be created for each node in the RDF graph to be indexed. This text document is called the 'RDF molecule' and is made up of other nodes reachable via the predicates that connect nodes to each other. Once a molecule has been created for each node, Lucene creates an index over these molecules. During search (query answering) Lucene identifies the matching molecules and GraphDB uses the associated nodes as variables substitutions, when evaluating the enclosing SPARQL query.
The scope of an RDF molecule includes the starting node and its neighbouring nodes, which are reachable via the specified number of predicate arcs. What type of nodes are indexed and what type of nodes are included in the molecule can be specified for each Lucene index. Furthermore, the size of the molecule can be controlled by specifying the number of allowed traversals of predicate arcs starting from the molecule centre (the node being indexed). Note that blank nodes themselves are never included in the molecule. If a blank node is encountered the search is extended via any predicate to the next nearest entity and so on. Therefore, even when the molecule size is 1, entities reachable via several intermediate predicates can still be included in the molecule if all the intermediate entities are blank nodes.
The parameters are described in more detail as follows:

|| Parameter | *Exclude* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#exclude] ||
|| Description | Provides a regular expression to identify nodes, which will be excluded from to the molecule. Note that the regular expression is applied case-sensitively to literals and URI local names. \\
The example given below will cause matching URIs (e.g. <[http://example.com/uri#helloWorld]> ) and literals (e.g. "hello world\!") not to be included. ||
|| Default | <none> ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:exclude luc:setParam "hello.*" \} ||
\\
|| Parameter | *Exclude entities* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#excludeEntities] ||
|| Description | A comma/semi-colon/white-space separated list of entities that will NOT be included in an RDF molecule. \\
The example below includes any URI in a molecule, except the two listed. ||
|| Default | <none> ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:excludeEntities luc:setParam "http://www.w3.org/2000/01/rdf-schema#Class [http://www.example.com/dummy#E1]" \} ||
\\
|| Parameter | *Exclude predicates* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#excludePredicates] ||
|| Description | A comma/semi-colon/white-space separated list of properties that will NOT be traversed in order to build an RDF molecule. \\
The example below prevents any entities being added to an RDF molecule if they can only be reached via the two given properties. ||
|| Default | <none> ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:excludePredicates luc:setParam "http://www.w3.org/2000/01/rdf-schema#subClassOf [http://www.example.com/dummy#p1]" \} ||
\\
|| Parameter | *Include* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#include] ||
|| Description | Indicates what kinds of nodes are to be included in the molecule. The value can be a list of values from: uri, literal, centre (the plural forms are also allowed: uris, literals, centres). The value of _centre_ causes the node for which the molecule is built to be added to the molecule (provided it is not a blank node). This can be useful, for example, when indexing URI nodes with molecules that contain only literals, but the local part of the URI should also be searchable. ||
|| Default | "literals" ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:include luc:setParam "literal uri" . \} ||
\\
|| Parameter | *Include entities* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#includeEntities] ||
|| Description | A comma/semi-colon/white-space separated list of entities that can be included in an RDF molecule. \\
Any other entities are ignored. The example below builds molecules that only contain the two entities. ||
|| Default | <none> ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:includeEntities luc:setParam "http://www.w3.org/2000/01/rdf-schema#Class [http://www.example.com/dummy#E1]" \} ||
\\
|| Parameter | *Include predicates* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#includePredicates] ||
|| Description | A comma/semi-colon/white-space separated list of properties that can be traversed in order to build an RDF molecule. \\
The example below allows any entities to be added to an RDF molecule, but only if they can be reached via the two given properties. ||
|| Default | <none> ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:includePredicates luc:setParam "http://www.w3.org/2000/01/rdf-schema#subClassOf [http://www.example.com/dummy#p1]" \} ||
\\
|| Parameter | *Index* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#index] ||
|| Description | Indicates what kinds of nodes are to be indexed. The value can be a list of values from: uri, literal, bnode (the plural forms are also allowed: uris, literals, bnodes). ||
|| Default | "literals" ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:index luc:setParam "literals, bnodes" . \} ||
\\
|| Parameter | *Language(s)* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#languages] ||
|| Description | A comma separated list of language tags. Only literals with the indicated language tags are included in the index. To include literals that have no language tag, use the special value 'none'. ||
|| Default | "" (which is used to indicate that literals with any language tag are used, including those with no language tag) ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:languages luc:setParam "en,fr,none" . \} ||
\\
|| Parameter | *Molecule size* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#moleculeSize] ||
|| Description | Set the size of the molecule associated with each entity. A value of zero indicates that only the entity itself should be indexed. A value of 1 indicates that the molecule will contain all entities reachable by a single 'hop' via any predicate (predicates not included in the molecule). Note that blank nodes themselves are never included in the molecule. If a blank node is encountered the search is extended via any predicate to the next nearest entity and so on. Therefore, even when the molecule size is 1, entities reachable via several intermediate predicates can still be included in the molecule, if all the intermediate entities are blank nodes. Molecule sizes of 2 and upwards are allowed, but with large datasets it can take a very long time to create the index. ||
|| Default | 0 ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:moleculeSize luc:setParam "1" . \} ||
\\
|| Parameter | *Use RDF rank* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#useRDFRank] ||
|| Description | Indicates whether the RDF weights (if they have been computed already) associated with each entity should be used as boosting factors when computing the relevance to a given Lucene query. Allowable values are 'no', 'yes' and 'squared'. This last value indicates that the square of the RDF Rank value is to be used. ||
|| Default | "no" ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:useRDFRank luc:setParam "yes" . \} ||
\\
|| Parameter | *Set alternative analyser* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#analyzer] ||
|| Description | Used to set an alternative analyser for processing text to produce terms to index. By default, this parameter has no value and the default analyser used is: \\
org.apache.lucene.analysis.standard.StandardAnalyzer \\
An alternative analyser must be derived from: \\
org.apache.lucene.analysis.Analyzer \\
To use an alternative analyser, use this parameter to identify the name of a Java factory class that can instantiate it. The factory class must be available on the Java virtual machine's classpath and must implement this interface: \\
com.ontotext.trree.plugin.lucene.AnalyzerFactory ||
|| Default | <none> ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:analyzer luc:setParam "com.ex.MyAnalyserFactory" . \} ||
\\

h2. Example
In this example we will create two Java classes (Analyzer and Factory), and then, create Lucene index, using the custom analyzer. This custom analyzer will filter the accents (diacritics), so a search for "Beyonce" will find labels "Beyoncé".

{code}
public class CustomAnalyzerFactory implements com.ontotext.trree.plugin.lucene.AnalyzerFactory {
@Override
public Analyzer createAnalyzer() {
CustomAnalyzer ret = new CustomAnalyzer(Version.LUCENE_36);
return ret;
}

@Override
public boolean isCaseSensitive() {
return false;
}
}
{code}

{code}
public class CustomAnalyzer extends StopwordAnalyzerBase {
public CustomAnalyzer(Version matchVersion){
super(matchVersion, StandardAnalyzer.STOP_WORDS_SET);
}

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
final Tokenizer source = new StandardTokenizer(matchVersion, reader);
TokenStream tokenStream = source;
tokenStream = new StandardFilter(matchVersion, tokenStream);
tokenStream = new LowerCaseFilter(tokenStream);
tokenStream = new StopFilter(matchVersion, tokenStream, getStopwordSet());
tokenStream = new ASCIIFoldingFilter(tokenStream);
return new TokenStreamComponents(source, tokenStream);
}
}
{code}

h3. Create the index
- the two classes need to be put in a package, say "com.example" and packaged in a JAR file
- the JAR needs to be put into the plugins folder (specified by -Dexternal-plugins=..., which by default is under <TOMCAT-WEBAPPS>/openrdf-sesame/WEB-INF/plugins)
- there has to be some data in the repository
- then we need to create the index
{code}
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
luc:analyzer luc:setParam "com.example.CustomAnalyzerFactory" .
luc:index luc:setParam "uris".
luc:moleculeSize luc:setParam "1".
luc:myIndex luc:createIndex "true".
}
{code}


|| Parameter | *Set alternative scorer* ||
|| Predicate | [http://www.ontotext.com/owlim/lucene#scorer] ||
|| Description | Used to set an alternative scorer that provides boosting values, which adjust the relevance (and hence the ordering) of results to a Lucene query. By default, this parameter has no value and no additional scoring takes place, however, if the useRDFRank parameter is set to true, then the RDF Rank scores are used (see section 10.1). \\
An alternative scorer must implement this interface: \\
com.ontotext.trree.plugin.lucene.Scorer \\
In order to use an alternative scorer, use this parameter to identify the name of a Java factory class that can instantiate it. The factory class must be available on the Java virtual machine's classpath and must implement this interface: \\
com.ontotext.trree.plugin.lucene.ScorerFactory ||
|| Default | <none> ||
|| Example | PREFIX luc: <[http://www.ontotext.com/owlim/lucene#]> \\
INSERT DATA \{ luc:scorer luc:setParam "com.ex.MxScorerFactory" . \} ||
\\
Once the parameters for an index have been set, the index is created and named by committing a SPARQL update of this form, where the index name appears as the subject in the triple pattern:
{noformat}PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:myIndex luc:createIndex "true" . }
{noformat}The index name must have the {{[http://www.ontotext.com/owlim/lucene#]}} namespace and the local part can contain only alphanumeric characters and underscores.
Creating an index can take some time, although usually no more than a few minutes when the molecule size is 1 or less. During this process, for each node in the repository its surrounding molecule is computed. Then, each such molecule is converted into a single string document (by concatenating the textual representation of all the nodes in the molecule) and this document is indexed by Lucene. If the RDF Rank weights are used (or an alternative scorer is specified), then the computed values are stored in Lucene's index as a boosting factor that will later on influence the selection order.
To use a custom Lucene index in a SPARQL query use the index's name as the predicate in a statement pattern, with the Lucene query as the object using the full [Lucene query|http://lucene.apache.org/java/3_0_0/queryparsersyntax.html] vocabulary.

The following query produces bindings for {{?s}} from entities in the repository, where the RDF molecule associated with that entity (for the given index) contains terms that begin with "United". Furthermore, the bindings are ordered by relevance (with any boosting factor):
{noformat}PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
SELECT ?s
WHERE { ?s luc:myIndex "United*" . }
{noformat}

The Lucene score for a bound entity for a particular query can be exposed using a special predicate:
{noformat}http://www.ontotext.com/owlim/lucene#score{noformat}
This can be useful when the lucene query results are ordered in a manner based on, but different from, the original Lucene score. For example, the following query orders the results by a combination of the Lucene score and some ontology defined importance value:
{noformat}
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
PREFIX ex: <http://www.example.com/myontology#>
SELECT * {
?node luc:myIndex "lucene query string" .
?node ex:importance ?importance .
?node luc:score ?score .
} ORDER BY ( ?score + ?importance )
{noformat}
The {{luc:score}} predicate works only on bound variables. There is no problem disambiguating multiple indices because each variable is bound from exactly one Lucene index and hence its score.

The combination of ranking RDF molecules together with full-text search provides a powerful mechanism for querying/analysing datasets, even when the schema is not known. This allows for keyword-based search over both literals and URIs with the results ordered by importance/interconnectedness. For example of this kind of 'RDF Search', see [FactForge|http://factforge.net].

h2. Example

The following example configuration shows how to index URIs using literals attached to them by a single, named predicate - in this case {{rdfs:label}}. Assume the following starting data:

{noformat}
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:<http://example.com#>
INSERT DATA {
ex:astonMT rdfs:label "Aston McTalisker" .
ex:astonMartin ex:link "Aston Martin" .
<http://www1.aston.ac.uk/> rdfs:label "Aston University"@EN .
}
{noformat}

Set up the configuration - index URIs by including in their RDF Molecule all literals that can be reached via a single statement using the {{rdfs:label}} predicate:

{noformat}
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
luc:index luc:setParam "uris" .
luc:include luc:setParam "literals" .
luc:moleculeSize luc:setParam "1" .
luc:includePredicates luc:setParam "http://www.w3.org/2000/01/rdf-schema#label" .
}
{noformat}

Create a new index call {{luc:myTestIndex}} \- note that the index name must be in the {{<}}{{[http://www.ontotext.com/owlim/lucene#]}}{{>}} namespace:

{noformat}
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
luc:myTestIndex luc:createIndex "true" .
}
{noformat}

Now use the index in a query - find all URIs indexed using the {{luc:myTestIndex}} index that match the Lucene query "ast*":

{noformat}
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
SELECT * {
?id luc:myTestIndex "ast*"
}
{noformat}

The results of this query are:

|| ?id ||
| [http://example.com#astonMT] |
| [http://www1.aston.ac.uk/] |

showing that {{ex:astonMartin}} is not returned, because it does not have an {{rdfs:label}}, linking it to the appropriate text.

h2. Incremental update

Each Lucene-based FTS index must be recreated from time to time as the indexed data changes. Due to the complex nature of the structure of RDF molecules, rebuilding an index is a relatively expensive operation. However, indices can be updated incrementally on a per resource basis as directed by the user. The following control update:

{noformat}
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { <index-name> luc:addToIndex <resource> . }
{noformat}

updates the FTS index for the given resource and the given index.

{info}
Each index stores the values of the parameters used to define it, e.g. the value of luc:includePredicates, therefore there is no need to set them before requesting an incremental update.
{info}

The following control update:

{noformat}
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { <index-name> luc:updateIndex _:b1 . }
{noformat}

causes all resources not currently indexed by <index-name> to get indexed. It is a shorthand way of batching together index updates for several (new) resources).


h1. Node Search -- Proprietary Full-Text Search

The parameters for GraphDB's full-text index control when/if the index is to be created, the index cache size, and whether literals only or all types of nodes should be indexed. See the parameters *ftsIndexPolicy*, *fts-memory* and *ftsLiteralsOnly* in the [configuration section|GraphDB-SE Configuration].
The following example configures the database engine to create a 20 megabyte cache for the full-text index on start up, which indexes all literals and URIs:
{noformat}owlim:ftsIndexPolicy "onStartup" ;
owlim:fts-memory "20m" ;
owlim:ftsLiteralsOnly "false"
{noformat}Full-text search patterns are embedded in SPARQL and SeRQL queries by adding extra statement patterns that use special system predicates:
{noformat}<String:> <Algorithm predicate> <Binding> .
{noformat}Each of the elements of this triple is explained below:
* {{<String:>}} the search string - a list of tokens separated by colons ':', which use is determined by the choice of predicate, see below;
* {{<Algorithm predicate>}} specifies the search method, i.e. how the tokens in the search string are to be used, see below;
* {{<Binding>}} the variable containing the result, i.e. the values (URIs or literals) that match with the given search string and method.


|| Predicate || Description ||
| {{fts:exactMatch}} | Matches literals that contain all tokens considering the case. For example, searching for {{<United:States>}} will match "The president of the United States", but not "United Statesless", "united states" or "notUnited notStates.". |
| {{fts:matchIgnoreCase}} | Similar to the above but ignores case. {{<United:States>}} will match "The president of the United States", "united states" but not "United Statesless" or "notUnited notStates." |
| {{fts:prefixMatch}} | Matches tokens that begin with the given search tokens considering the case. For example, {{<United:States>}} will match "The president of the United States" and "United Statesless" but not "notUnited notStates" or "united states." |
| {{fts:prefixMatchIgnoreCase}} | Similar to the above but ignores case. For example, {{<United:States>}} will match "The president of the United States", "United Statesless", "united states" but not "notUnited notStates". |

The namespace prefix {{onto}} in the above table {{<}}{{[http://www.ontotext.com/owlim/fts#]}}{{>}}
There follow some query examples for Node search in SPARQL and SeRQL:
* *Example 1:* Get all values that contain a token that matches exactly with 'abstract'
SPARQL query:
{noformat}PREFIX fts: <http://www.ontotext.com/owlim/fts#>
SELECT ?label
WHERE { <abstract:> fts:exactMatch ?label . }
{noformat}SeRQL query:
{noformat}SELECT L
FROM {<abstract:abstract>} fts:exactMatch {L}USING NAMESPACE
fts = <http://www.ontotext.com/owlim/fts#>
{noformat}Note that in SeRQL, *abstract:* is not a valid URI, so *abstract:abstract* is used instead, which works the same and also conforms with what the parser expects.
* *Example 2:* Get all values that contain both tokens 'Remorselessness' and 'books' using case-insensitive search (SPARQL):
{noformat}PREFIX fts: <http://www.ontotext.com/owlim/fts#>
SELECT ?label
WHERE { <Remorselessness:books> fts:matchIgnoreCase ?label. }
{noformat}The corresponding SeRQL query is omitted due to its similarity with the above SPARQL query.
* *Example 3:* Find everything that has a label that starts with "3d", regardless of the language or the case (SPARQL):
{noformat}PREFIX fts: <http://www.ontotext.com/owlim/fts#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
?X rdfs:label ?label .
<3d:> fts:prefixMatchIgnoreCase ?label. }
{noformat}This query cannot be expressed in SeRQL using full-text search predicates, because the SERQL parser won't accept a URI starting with a digit.
The above example is hard to formulate without a full text search capability. For example, the trivial query below won't match an entry with the label {{"3d"@en}}, because this literal is an {{rdf:PlainLiteral}} and not the same as {{"3d"}}, which is a {{xsd:string}}, i.e. the data types are different.
{noformat}PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x
WHERE { ?x rdfs:label "3d" }
{noformat}