Full-text search (FTS) concerns retrieving text documents out of a large collection by keywords or, more generally, by tokens (represented as sequences of characters). Formally, the query represents an unordered set of tokens and the result is set of documents, relevant to the query. In a simple FTS implementation, relevance is Boolean: a document is either relevant to the query, when it contains all the query tokens, or not. More advanced FTS implementations deal with a degree of relevance of the document to the query, usually judged on some sort of measure of the frequency of appearance of each of the tokens in the document, normalized versus the frequency of their appearance in the entire document collection. Such implementations return an ordered list of documents, where the most relevant documents come first.
FTS and structured queries, like those in database management systems (DBMS), are different information access methods based on a different query syntax and semantics, where the results are also displayed in a different form. FTS and databases usually require different types of indices too. The ability to combine these two types of information access methods is very useful for a wide range of applications. Many relational DBMS support some sort of FTS (which is integrated into the SQL syntax) and maintain additional indices that allow efficient evaluation of FTS constraints. Typically, relational DBMS allow the user to define a query, which requires specific tokens to appear in a specific column of a specific table. In SPARQL there is no standard way for the specification of FTS constraints. In general, there is neither a well defined nor widely accepted concept for FTS in RDF data. Nevertheless, some semantic repository vendors offer some sort of FTS in their engines. This section documents the FTS supported by OWLIM-SE.
Two approaches are implemented in OWLIM-SE, a proprietary implementation called 'Node Search', and a Lucene-based implementation called 'RDF Search'. The two approaches are collectively referred to in this guide as 'full-text indexing' and both of them enable OWLIM to perform complex queries against character data, which significantly speeds up the query process. To select one of them, one should consider their functional differences, which are outlined in the table below. Furthermore, there can be considerable differences between indexing and search speed of the two FTS implementations. Thus, performance-conscious users are recommended to experiment with the performance of both methods with respect to dataset and queries representative for the intended application.
Node Search
RDF Search
FTS query form
List of tokens
List of tokens (with Lucene query extensions)
Result form
Unordered set of nodes
Ordered list of URIs
Textual Representation
For literals: the string value.
For URIs and B-nodes: tokenized URL
Concatenation of the text representations of the nodes from the molecule (1-step neighbourhood in the graph) of the URI
Relevance
Boolean, based on presence of the query tokens in the text
Vector-space model, reflecting the degree of relevance of the text and the RDF rank of the URI
Implementation
Proprietary full-text indexing and search implementation
The Lucene engine is integrated and used for indexing and search
The Node Search (with parameter ftsLiteralsOnly set to true) resembles functionality similar to typical FTS implementations in relational DBMS. However, RDF Search is a novel information retrieval concept, which allows for efficient extraction of RDF resources from huge datasets, where ordering of the results by relevance is crucial.
Node Search – Proprietary Full-Text Search
The parameters for OWLIM's full-text index control when/if the index is to be created, the index cache size, and whether literals only or all types of nodes should be indexed. See the parameters ftsIndexPolicy, fts-memory and ftsLiteralsOnly in the configuration section.
The following example configures the database engine to create a 20 megabyte cache for the full-text index on start up that indexes all literals and URIs:
Full-text search patterns are embedded in SPARQL and SeRQL queries by adding extra statement patterns that use special system predicates:
<String:> <Algorithm predicate> <Binding> .
Each of the elements of this triple is explained below:
<String:> the search string - a list of tokens separated by colons ':', whose use is determined by the choice of predicate, see below;
<Algorithm predicate> specifies the search method, i.e. how the tokens in the search string are to be used, see below;
<Binding> the variable containing the result, i.e. the values (URIs or literals) that match with the given search string and method.
Predicate
Description
fts:exactMatch
Matches literals that contain all tokens considering the case. For example, searching for <United:States> will match "The president of the United States", but not "United Statesless", "united states" or "notUnited notStates.".
fts:matchIgnoreCase
Similar to the above but ignores case. <United:States> will match "The president of the United States", "united states" but not "United Statesless" or "notUnited notStates."
fts:prefixMatch
Matches tokens that begin with the given search tokens considering the case. For example, <United:States> will match "The president of the United States" and "United Statesless" but not "notUnited notStates" or "united states."
fts:prefixMatchIgnoreCase
Similar to the above but ignores case. For example, <United:States> will match "The president of the United States", "United Statesless", "united states" but not "notUnited notStates".
The namespace prefix onto in the above table <http://www.ontotext.com/owlim/fts#>
There follow some query examples for Node search in SPARQL and SeRQL:
Example 1: Get all values that contain a token that matches exactly with 'abstract'
SPARQL query:
SELECT L
FROM {<abstract:abstract>} fts:exactMatch {L}USING NAMESPACE
fts = <http://www.ontotext.com/owlim/fts#>
Note that in SeRQL, abstract: is not a valid URI, so abstract:abstract is used instead, which works the same and also conforms with what the parser expects.
Example 2: Get all values that contain both tokens 'Remorselessness' and 'books' using case-insensitive search (SPARQL):
This query cannot be expressed in SeRQL using full-text search predicates, because the SERQL parser won't accept a URI starting with a digit.
The above example is hard to formulate without a full text search capability. For example, the trivial query below won't match an entry with the label "3d"@en, because this literal is an rdf:PlainLiteral and not the same as "3d", which is an xsd:string, i.e. the data types are different.
Apache Lucene is a high-performance, full-featured text search engine written entirely in Java. OWLIM-SE supports full text search capabilities using Lucene with a variety of indexing options and the ability to simultaneously use multiple, differently configured indices in the same query.
The classpath must include lucene-core-3*.jar (included with the OWLIM distribution) in order for the Lucene-based full-text search to function correctly. This can be substituted with the full Lucene jar file that can be downloaded from the Apache Lucene download page.
In order to use Lucene full-text search in OWLIM-SE a Lucene index must first be computed. Before being created, each index can be parameterised in a number of ways using SPARQL 'control' updates. This provides the ability to:
select what kinds of nodes are indexed (URIs/literals/blank-nodes)
select what is included in the 'molecule' (explained below) associated with each node
select literals with certain language tags
choose the size of the RDF 'molecule' to index
choose whether to boost the relevance of nodes using RDF Rank values
select alternative analysers
select alternative scorers
In order to use the indexing behaviour of Lucene, a text document must be created for each node in the RDF graph to be indexed. This text document is called the 'RDF molecule' and is made up of other nodes reachable via the predicates that connect nodes to each other. Once a molecule has been created for each node, Lucene creates an index over these molecules. During search (query answering) Lucene identifies the matching molecules and OWLIM uses the associated nodes as variables substitutions when evaluating the enclosing SPARQL query.
The scope of an RDF molecule includes the starting node and its neighbouring nodes that are reachable via the specified number of predicate arcs. What type of nodes are indexed and what type of nodes are included in the molecule can be specified for each Lucene index. Furthermore, the size of the molecule can be controlled by specifying the number of allowed traversals of predicate arcs starting from the molecule centre (the node being indexed). Note that blank nodes themselves are never included in the molecule. If a blank node is encountered the search is extended via any predicate to the next nearest entity and so on. Therefore even when the molecule size is 1, entities reachable via several intermediate predicates can still be included in the molecule if all the intermediate entities are blank nodes.
The parameters are described in more detail as follows:
Provides a regular expression to identify nodes that will be excluded from to the molecule. Note that the regular expression will be applied case-sensitively to literals and URI local names.
The example given below will cause matching URIs (e.g. <http://example.com/uri#helloWorld> ) and literals (e.g. "hello world!") not to be included.
A comma/semi-colon/white-space separated list of entities that will NOT be included in an RDF molecule.
The example below will include any URI in a molecule, except the two listed.
A comma/semi-colon/white-space separated list of properties that will NOT be traversed in order build an RDF molecule.
The example below will prevent any entities being added to an RDF molecule if they can only be reached via the two given properties.
Indicates what kinds of nodes are to be included in the molecule. The value can be a list of values from: uri, literal, centre (the plural forms are also allowed: uris, literals, centres). The value of centre causes the node for which the molecule is built to be added to the molecule (provided it is not a blank node). This can be useful, for example, when indexing URI nodes with molecules that contain only literals, but the local part of the URI should also be searchable.
A comma/semi-colon/white-space separated list of entities that can be included in an RDF molecule.
Any other entities will be ignored. The example below will build molecules that only contain the two entities.
A comma/semi-colon/white-space separated list of properties that can be traversed in order build an RDF molecule.
The example below will allow any entities to be added to an RDF molecule, but only if they can be reached via the two given properties.
Indicates what kinds of nodes are to be indexed. The value can be a list of values from: uri, literal, bnode (the plural forms are also allowed: uris, literals, bnodes).
A comma separated list of language tags. Only literals with the indicated language tags will be included in the index. To include literals that have no language tag, use the special value 'none'.
Default
"" (which is used to indicate that literals with any language tag are used, including those with no language tag)
Set the size of the molecule associated with each entity. A value of zero indicates that only the entity itself should be indexed. A value of 1 indicates that the molecule will contain all entities reachable by a single 'hop' via any predicate (predicates not included in the molecule). Note that blank nodes themselves are never included in the molecule. If a blank node is encountered the search is extended via any predicate to the next nearest entity and so on. Therefore even when the molecule size is 1, entities reachable via several intermediate predicates can still be included in the molecule if all the intermediate entities are blank nodes. Molecule sizes of 2 and upwards are allowed, but with large datasets it can take a very long time to create the index.
Indicates whether the RDF weights (if they have been computed already) associated with each entity should be used as boosting factors when computing the relevance to a given Lucene query. Allowable values are 'no', 'yes' and 'squared'. This last value indicates that the square of the RDF Rank value is to be used.
Used to set an alternative analyser for processing text to produce terms to index. By default, this parameter has no value and the default analyser used is:
org.apache.lucene.analysis.standard.StandardAnalyzer
An alternative analyser must be derived from:
org.apache.lucene.analysis.Analyzer
To use an alternative analyser, use this parameter to identify the name of a Java factory class that can instantiate it. The factory class must be available on the Java virtual machine's classpath and must implement this interface:
com.ontotext.trree.plugin.lucene.AnalyzerFactory
Used to set an alternative scorer that provides boosting values that adjust the relevance (and hence the ordering) of results to a Lucene query. By default, this parameter has no value and no additional scoring takes place, however, if the useRDFRank parameter is set to true, then the RDF Rank scores are used (see section 10.1).
An alternative scorer must implement this interface:
com.ontotext.trree.plugin.lucene.Scorer
In order to use an alternative scorer, use this parameter to identify the name of a Java factory class that can instantiate it. The factory class must be available on the Java virtual machine's classpath and must implement this interface:
com.ontotext.trree.plugin.lucene.ScorerFactory
Once the parameters for an index have been set, the index is created and named by committing a SPARQL update of this form, where the index name appears as the subject in the triple pattern:
The index name must have the http://www.ontotext.com/owlim/lucene# namespace and the local part can contain only alphanumeric characters and underscores.
Creating an index can take some time, although usually no more than a few minutes when the molecule size is 1 or less. During this process, for each node in the repository its surrounding molecule is computed. Then each such molecule is converted into a single string document (by concatenating the textual representation of all the nodes in the molecule) and this document is indexed by Lucene. If RDF Rank weights are used (or an alternative scorer is specified) then the computed values are stored in Lucene's index as a boosting factor that will later on influence the selection order.
To use a custom Lucene index in a SPARQL query use the index's name as the predicate in a statement pattern, with the Lucene query as the object using the full Lucene query vocabulary.
The following query will produce bindings for ?s from entities in the repository, where the RDF molecule associated with that entity (for the given index) contains terms that begin with "United". Furthermore, the bindings will be ordered by relevance (with any boosting factor):
The Lucene score for a bound entity for a particular query can be exposed using a special predicate:
http://www.ontotext.com/owlim/lucene#score
This can be useful when lucene query results should be ordered in a manner based on, but different from, the original Lucene score. For example, the following query will order results by a combination of the Lucene score and some ontology defined importance value:
The luc:score predicate will work only on bound variables. There is no problem disambiguating multiple indices because each variable will be bound from exactly one Lucene index and hence its score.
The combination of ranking RDF molecules together with full-text search provides a powerful mechanism for querying/analysing datasets even when the schema is not known. This allows for keyword-based search over both literals and URIs with the results ordered by importance/interconnectedness. For an example of this kind of 'RDF Search', see FactForge.
Example
The following example configuration shows how to index URIs using literals attached to them by a single, named predicate - in this case rdfs:label. Assume the following starting data:
Set up the configuration - index URIs by including in their RDF Molecule all literals that can be reached via a single statement using the rdfs:label predicate:
showing that ex:astonMartin was not returned, because it does not have an rdfs:label linking it to the appropriate text.
Incremental update
Each Lucene-based FTS index must be recreated from time to time as the indexed data changes. Due to the complex nature of the structure of RDF molecules, rebuilding an index is a relatively expensive operation. However, indices can be updated incrementally on a per resource basis as directed by the user. The following control update:
will update the FTS index for the given resource and the given index.
Each index stores the values of the parameters used to define it, e.g. the value of luc:includePredicates, therefore there is no need to set these before requesting an incremental update.
will cause all resources not currently indexed by <index-name> to get indexed. It is a shorthand way of batching together index updates for several (new) resources).