Full-text search (FTS) concerns retrieving text documents from a large collection by keywords or, more generally, by tokens (represented as sequences of characters). Formally, the query represents an unordered set of tokens and the result is a set of documents relevant to the query. In a simple FTS implementation, relevance is Boolean: a document is relevant to the query if it contains all the query tokens, and not relevant otherwise. More advanced FTS implementations assign each document a degree of relevance to the query, usually based on some measure of how frequently each token appears in the document, normalised against how frequently it appears in the entire document collection. Such implementations return an ordered list of documents, with the most relevant documents first.
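The difference between Boolean and ranked relevance can be illustrated with a minimal Python sketch. It is not how GraphDB or Lucene compute relevance; the toy collection, the whitespace tokenizer and the plain TF-IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection; ids and texts are illustrative only.
DOCS = {
    "d1": "united states of america",
    "d2": "united kingdom",
    "d3": "kingdom of spain",
}

def tokenize(text):
    """Lowercase and split on whitespace; real engines use richer analyzers."""
    return text.lower().split()

def boolean_search(query, docs):
    """Boolean relevance: ids of documents that contain every query token."""
    wanted = set(tokenize(query))
    return {doc_id for doc_id, text in docs.items()
            if wanted <= set(tokenize(text))}

def ranked_search(query, docs):
    """Ranked relevance: ids ordered by a simple TF-IDF score, best first.

    A document scores if it contains at least one query token.
    """
    n_docs = len(docs)
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: the number of documents each token appears in.
    df = Counter(tok for toks in tokens.values() for tok in set(toks))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = sum((tf[q] / len(toks)) * math.log(n_docs / df[q])
                    for q in set(tokenize(query)) if q in tf)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

With this data, boolean_search("united kingdom", DOCS) keeps only the document that contains both tokens, while ranked_search returns every document sharing a token with the query, ordered by score.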
FTS and structured queries, like those in database management systems (DBMS), are different information access methods with different query syntax and semantics, and their results are presented in different forms. FTS systems and databases usually require different types of indices too. The ability to combine these two types of information access methods is very useful for a wide range of applications. Many relational DBMS support some sort of FTS (integrated into the SQL syntax) and maintain additional indices that allow efficient evaluation of FTS constraints. Typically, a relational DBMS allows the user to define a query that requires specific tokens to appear in a specific column of a specific table. SPARQL has no standard way to specify FTS constraints. In general, there is neither a well-defined nor a widely accepted concept of FTS in RDF data. Nevertheless, some semantic repository vendors offer some sort of FTS in their engines. This section documents the FTS supported by GraphDB-SE.
Two approaches are implemented in GraphDB-SE: a Lucene-based implementation called 'RDF Search' and a proprietary implementation called 'Node Search'. The two approaches are collectively referred to in this guide as 'full-text indexing', and both enable GraphDB to perform complex queries against character data, which significantly speeds up the query process. When choosing between them, consider their functional differences, which are outlined in the table below. There can also be considerable differences between the indexing and search speeds of the two FTS implementations, so performance-conscious users are advised to experiment with both methods using a dataset and queries representative of the intended application.
Apache Lucene is a high-performance, full-featured text search engine written entirely in Java. GraphDB-SE supports full-text search capabilities, using Lucene with a variety of indexing options and the ability to simultaneously use multiple, differently configured indices in the same query.
In order to use Lucene full-text search in GraphDB-SE, a Lucene index must first be computed. Before being created, each index can be parameterised in a number of ways, using SPARQL 'control' updates. This provides the ability to:
In order to use the indexing behaviour of Lucene, a text document must be created for each node in the RDF graph to be indexed. This text document is called the 'RDF molecule' and is made up of other nodes reachable via the predicates that connect nodes to each other. Once a molecule has been created for each node, Lucene creates an index over these molecules. During search (query answering), Lucene identifies the matching molecules and GraphDB uses the associated nodes as variable substitutions when evaluating the enclosing SPARQL query.
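The idea of an RDF molecule can be sketched in a few lines of Python. This is only a conceptual illustration under stated assumptions, not GraphDB's implementation: the triples, the max_hops parameter and the build_molecule function are hypothetical, and only outgoing statements are followed.

```python
from collections import deque

# Hypothetical triples (subject, predicate, object). Literals are written
# with surrounding double quotes; everything else is a node identifier.
TRIPLES = [
    ("ex:uk", "rdfs:label", '"United Kingdom"'),
    ("ex:uk", "ex:capital", "ex:london"),
    ("ex:london", "rdfs:label", '"London"'),
]

def build_molecule(node, triples, max_hops=1):
    """Concatenate the text of every node reachable within max_hops statements."""
    parts, seen = [], {node}
    queue = deque([(node, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for s, _p, o in triples:
            if s == current and o not in seen:
                seen.add(o)
                parts.append(o.strip('"'))  # drop the quotes around literals
                queue.append((o, depth + 1))
    return " ".join(parts)

# One text document per subject node; a text index would be built over these.
MOLECULES = {s: build_molecule(s, TRIPLES) for s, _p, _o in TRIPLES}
```

A search that matches the text stored for a node would then yield that node as a binding for the query variable.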
In this example we will create two Java classes (Analyzer and Factory) and then create a Lucene index using the custom analyzer. This custom analyzer will filter accents (diacritics), so a search for "Beyonce" will find the label "Beyoncé".
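The effect of accent filtering can be shown independently of Lucene. The following Python sketch is not the Java analyzer described above; it only illustrates the folding step, using Unicode decomposition from the standard library. The function names are hypothetical.

```python
import unicodedata

def fold_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches(query, label):
    """Compare query and label after folding accents and case."""
    return fold_accents(query).lower() == fold_accents(label).lower()
```

Applying the same folding to both the indexed text and the query is what allows "Beyonce" to match "Beyoncé".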
The index name must have the http://www.ontotext.com/owlim/lucene# namespace and the local part can contain only alphanumeric characters and underscores.
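The naming rule can be checked before an index is created. The sketch below is a hypothetical helper, not part of GraphDB. It assumes that "alphanumeric" means ASCII letters and digits and that the local part must not be empty.

```python
import re

LUCENE_NS = "http://www.ontotext.com/owlim/lucene#"
# Assumption: ASCII letters, digits and underscores, at least one character.
_LOCAL_PART = re.compile(r"[A-Za-z0-9_]+")

def is_valid_index_name(iri):
    """Return True if iri is in the Lucene namespace with a legal local part."""
    if not iri.startswith(LUCENE_NS):
        return False
    return _LOCAL_PART.fullmatch(iri[len(LUCENE_NS):]) is not None
```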
The following query produces bindings for ?s from entities in the repository, where the RDF molecule associated with that entity (for the given index) contains terms that begin with "United". Furthermore, the bindings are ordered by relevance (with any boosting factor):
The Lucene score for a bound entity for a particular query can be exposed using a special predicate:
This can be useful when the Lucene query results are ordered in a manner based on, but different from, the original Lucene score. For example, the following query orders the results by a combination of the Lucene score and some ontology-defined importance value:
The luc:score predicate works only on bound variables. There is no problem disambiguating multiple indices, because each variable is bound from exactly one Lucene index and therefore has exactly one score.
The combination of ranking RDF molecules together with full-text search provides a powerful mechanism for querying/analysing datasets, even when the schema is not known. This allows for keyword-based search over both literals and URIs with the results ordered by importance/interconnectedness. For an example of this kind of 'RDF Search', see FactForge.
The following example configuration shows how to index URIs using literals attached to them by a single, named predicate - in this case rdfs:label. Assume the following starting data:
Set up the configuration - index URIs by including in their RDF Molecule all literals that can be reached via a single statement using the rdfs:label predicate:
Create a new index called luc:myTestIndex - note that the index name must be in the <http://www.ontotext.com/owlim/lucene#> namespace:
Now use the index in a query - find all URIs indexed using the luc:myTestIndex index that match the Lucene query "ast*":
The results of this query are:
showing that ex:astonMartin is not returned, because it does not have an rdfs:label linking it to the appropriate text.
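Why a resource without an rdfs:label is missed can be reproduced with a small Python sketch. The data below is a hypothetical stand-in, not the starting data of this example: ex:sampleCar and the ex:name predicate are invented for illustration, and the prefix matching only approximates a Lucene "ast*" query.

```python
# Hypothetical stand-in data: only ex:sampleCar carries an rdfs:label.
TRIPLES = [
    ("ex:astonMartin", "ex:name", "Aston Martin"),  # text under another predicate
    ("ex:sampleCar", "rdfs:label", "Astra"),
]

def index_by_predicate(triples, predicate):
    """Build molecules from literals reached via one named predicate only."""
    molecules = {}
    for s, p, o in triples:
        if p == predicate:
            molecules.setdefault(s, []).append(o.lower())
    return molecules

def prefix_search(molecules, query):
    """Return subjects whose molecule has a token starting with the prefix."""
    prefix = query.rstrip("*").lower()
    return sorted(s for s, texts in molecules.items()
                  if any(tok.startswith(prefix)
                         for text in texts for tok in text.split()))

LABEL_INDEX = index_by_predicate(TRIPLES, "rdfs:label")
```

Because the molecule of ex:astonMartin is built only from rdfs:label literals, its text under the other predicate never enters the index, so the prefix query cannot find it.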
Each Lucene-based FTS index must be recreated from time to time as the indexed data changes. Due to the complex nature of the structure of RDF molecules, rebuilding an index is a relatively expensive operation. However, indices can be updated incrementally on a per resource basis as directed by the user. The following control update:
updates the FTS index for the given resource and the given index.
The following control update:
causes all resources not currently indexed by <index-name> to get indexed. It is a shorthand way of batching together index updates for several (new) resources.
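The two kinds of incremental update can be pictured with a minimal Python sketch. It models the index as a plain dictionary and is not GraphDB's mechanism; update_resource, add_missing and the molecule_of callback are hypothetical names.

```python
def update_resource(index, resource, molecule_of):
    """Recompute and store the molecule for a single resource."""
    index[resource] = molecule_of(resource)

def add_missing(index, resources, molecule_of):
    """Index every resource that has no entry yet and return those added."""
    added = [r for r in resources if r not in index]
    for r in added:
        index[r] = molecule_of(r)
    return added

# Hypothetical molecule texts keyed by resource.
TEXTS = {"ex:a": "alpha", "ex:b": "beta", "ex:c": "gamma"}
index = {"ex:a": "stale text"}
update_resource(index, "ex:a", TEXTS.get)      # refresh one existing entry
new_entries = add_missing(index, TEXTS, TEXTS.get)  # batch-add the rest
```

Neither operation rebuilds the whole index, which is the point of updating per resource.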
The parameters for GraphDB's full-text index control when/if the index is to be created, the index cache size, and whether literals only or all types of nodes should be indexed. See the parameters ftsIndexPolicy, fts-memory and ftsLiteralsOnly in the configuration section.
Full-text search patterns are embedded in SPARQL and SeRQL queries by adding extra statement patterns that use special system predicates:
Each of the elements of this triple is explained below:
The namespace prefix onto in the above table stands for <http://www.ontotext.com/owlim/fts#>.