Full-text search (FTS) retrieves text documents from a large collection by keywords or, more generally, by tokens (represented as sequences of characters). Formally, the query is an unordered set of tokens and the result is a set of documents relevant to the query. In a simple FTS implementation, relevance is Boolean: a document is relevant to the query if it contains all the query tokens, and irrelevant otherwise. More advanced FTS implementations assign each document a degree of relevance, usually based on how frequently each token appears in the document, normalized against how frequently it appears in the entire document collection. Such implementations return an ordered list of documents, with the most relevant documents first.
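The difference between Boolean and ranked relevance can be illustrated with a small, self-contained Python sketch. It is not OWLIM's or Lucene's implementation: the whitespace tokenizer, the sample documents and the particular TF-IDF weighting are assumptions chosen only to make the idea concrete.

```python
import math
from collections import Counter

# Hypothetical sample collection, used only for illustration.
DOCS = {
    "d1": "united states of america",
    "d2": "united kingdom united nations",
    "d3": "kingdom of spain",
}

def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()

def boolean_search(docs, query):
    """Boolean relevance: a document matches if it contains every query token."""
    wanted = set(tokenize(query))
    return {doc_id for doc_id, text in docs.items() if wanted <= set(tokenize(text))}

def ranked_search(docs, query):
    """Ranked relevance: term frequency in the document, normalized by how
    common the term is across the collection (a simple TF-IDF variant)."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, doc_counts in counts.items():
        total = sum(doc_counts.values())
        score = 0.0
        for term in set(tokenize(query)):
            doc_freq = sum(1 for c in counts.values() if term in c)
            if doc_freq == 0 or doc_counts[term] == 0:
                continue
            tf = doc_counts[term] / total
            idf = math.log(1 + n_docs / doc_freq)
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    # Most relevant documents first.
    return sorted(scores, key=scores.get, reverse=True)
```

With the sample collection, `boolean_search(DOCS, "united kingdom")` keeps only the document containing both tokens, while `ranked_search(DOCS, "united")` orders the matching documents by score.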
The parameters for OWLIM's full-text index control whether and when the index is created, the size of the index cache, and whether only literals or all types of nodes are indexed. See the parameters ftsIndexPolicy, fts-memory and ftsLiteralsOnly in the configuration section.
Full-text search patterns are embedded in SPARQL and SeRQL queries by adding extra statement patterns that use special system predicates:
Each of the elements of this triple is explained below:
The namespace prefix onto in the above table stands for <http://www.ontotext.com/owlim/fts#>.
Apache Lucene is a high-performance, full-featured text search engine written entirely in Java. OWLIM-SE supports full text search capabilities using Lucene with a variety of indexing options and the ability to simultaneously use multiple, differently configured indices in the same query.
In order to use Lucene full-text search in OWLIM-SE a Lucene index must first be computed. Before being created, each index can be parameterised in a number of ways using SPARQL ASK queries. This provides the ability to:
To use the indexing behaviour of Lucene, a text document must be created for each node in the RDF graph that is to be indexed. This text document is called the 'RDF molecule' and is made up of other nodes reachable via the predicates that connect nodes to each other. Once a molecule has been created for each node, Lucene creates an index over these molecules. During search (query answering), Lucene identifies the matching molecules and OWLIM uses the associated nodes as variable substitutions when evaluating the enclosing SPARQL query.
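The molecule idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample triples, the use of outgoing edges only, and the fixed traversal depth are all hypothetical, and the actual composition of a molecule in OWLIM-SE depends on how the index is parameterised.

```python
def build_molecules(triples, depth=1):
    """Build one text document ('molecule') per node.

    Assumption: a molecule is the node itself plus every node reachable
    by following outgoing predicates up to `depth` hops.
    """
    outgoing = {}
    nodes = set()
    for subject, _predicate, obj in triples:
        outgoing.setdefault(subject, []).append(obj)
        nodes.update((subject, obj))

    molecules = {}
    for node in nodes:
        parts, seen, frontier = [node], {node}, [node]
        for _ in range(depth):
            next_frontier = []
            for current in frontier:
                for neighbour in outgoing.get(current, ()):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        parts.append(neighbour)
                        next_frontier.append(neighbour)
            frontier = next_frontier
        molecules[node] = " ".join(parts)
    return molecules

def search_prefix(molecules, prefix):
    """Return the nodes whose molecule contains a term starting with `prefix`."""
    prefix = prefix.lower()
    return sorted(
        node for node, text in molecules.items()
        if any(term.lower().startswith(prefix) for term in text.split())
    )

# Hypothetical sample graph.
TRIPLES = [
    ("ex:UK", "rdfs:label", "United Kingdom"),
    ("ex:UK", "ex:capital", "ex:London"),
    ("ex:London", "rdfs:label", "London"),
]
MOLECULES = build_molecules(TRIPLES, depth=1)
```

Here `search_prefix(MOLECULES, "united")` returns both the literal node and `ex:UK`, because the molecule of `ex:UK` includes its label. The matching nodes, not the molecule text, are what would be substituted into the enclosing query.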
The index name must have the http://www.ontotext.com/owlim/lucene# namespace and the local part can contain only alphanumeric characters and underscores.
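The naming rule can be checked before an index is created. The following Python sketch is a hypothetical client-side helper, not part of OWLIM; it assumes that "alphanumeric" means ASCII letters and digits and that the local part must not be empty.

```python
import re

LUCENE_NS = "http://www.ontotext.com/owlim/lucene#"

# Assumption: only ASCII letters, digits and underscores are allowed.
_LOCAL_PART = re.compile(r"[A-Za-z0-9_]+")

def is_valid_index_name(iri):
    """Return True if `iri` is in the Lucene namespace and its local part
    consists only of alphanumeric characters and underscores."""
    if not iri.startswith(LUCENE_NS):
        return False
    local_part = iri[len(LUCENE_NS):]
    return _LOCAL_PART.fullmatch(local_part) is not None
```

For example, `is_valid_index_name(LUCENE_NS + "my_index_1")` is accepted, whereas a name containing a hyphen or one in a different namespace is rejected.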
The following query will produce bindings for ?s from entities in the repository, where the RDF molecule associated with that entity (for the given index) contains terms that begin with "United". Furthermore, the bindings will be ordered by relevance (with any boosting factor):
The Lucene score for a bound entity for a particular query can be exposed using a special predicate:
This can be useful when Lucene query results should be ordered in a manner based on, but different from, the original Lucene score. For example, the following query will order results by a combination of the Lucene score and some ontology-defined importance value:
The luc:score predicate works only on bound variables. There is no ambiguity when multiple indices are used, because each variable is bound from exactly one Lucene index and therefore has exactly one score.
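The re-ranking idea described above can also be illustrated outside SPARQL. The Python sketch below is an assumption-laden example: the weighted sum, the `weight` parameter and the sample values are hypothetical, since the way the score and the importance value are combined is left to the query author.

```python
def rerank(hits, weight=0.5):
    """Order entities by a weighted sum of Lucene score and importance.

    `hits` is a list of (entity, lucene_score, importance) tuples.
    The weighted sum is one possible combination, chosen for illustration.
    """
    def combined(hit):
        _entity, lucene_score, importance = hit
        return weight * lucene_score + (1 - weight) * importance

    return [hit[0] for hit in sorted(hits, key=combined, reverse=True)]

# Hypothetical scores and importance values.
HITS = [("ex:a", 0.9, 0.1), ("ex:b", 0.6, 0.8)]
```

With equal weighting, `ex:b` moves ahead of `ex:a` despite its lower Lucene score; with `weight=1.0` the original Lucene ordering is restored.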
The combination of ranking RDF molecules together with full-text search provides a powerful mechanism for querying/analysing datasets even when the schema is not known. This allows for keyword-based search over both literals and URIs with the results ordered by importance/interconnectedness. For an example of this kind of 'RDF Search', see FactForge.
Each Lucene-based FTS index must be recreated from time to time as the indexed data changes. Because RDF molecules have a complex structure, rebuilding an index is a relatively expensive operation. However, indices can be updated incrementally on a per-resource basis as directed by the user. The following control query:
will update the FTS index for the given resource and the given index.
The following control query:
will cause all resources not currently indexed by <index-name> to be indexed. It is a shorthand way of batching together index updates for several (new) resources.
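The difference between the two kinds of incremental update can be modelled with a toy index in Python. The class and method names below are hypothetical and do not correspond to OWLIM predicates or APIs; the sketch only mirrors the behaviour described above, namely that one operation refreshes a single given resource and the other indexes every resource that is not yet in the index.

```python
class ToyIndex:
    """Toy model of an incrementally updated index (illustration only)."""

    def __init__(self, molecules):
        # `molecules` maps each resource to its current molecule text and
        # may be changed by the caller after the index has been created.
        self._molecules = molecules
        self._indexed = {}

    def reindex_resource(self, resource):
        """Refresh the index entry for one given resource."""
        self._indexed[resource] = self._molecules[resource]

    def index_missing(self):
        """Index every resource not currently in the index; return them."""
        missing = [r for r in self._molecules if r not in self._indexed]
        for resource in missing:
            self.reindex_resource(resource)
        return missing

    def indexed_text(self, resource):
        """Return the text currently stored in the index for `resource`."""
        return self._indexed.get(resource)
```

In this model, a resource whose molecule has changed keeps its stale entry after `index_missing()` and is refreshed only by an explicit `reindex_resource()` call, which is why the per-resource update is directed by the user.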