Full-text search (FTS) concerns retrieving text documents from a large collection by keywords or, more generally, by tokens (represented as sequences of characters). Formally, the query is an unordered set of tokens and the result is a set of documents relevant to the query. In a simple FTS implementation, relevance is Boolean: a document is relevant to the query if it contains all the query tokens and irrelevant otherwise. More advanced FTS implementations deal with a degree of relevance of the document to the query, usually based on some measure of how frequently each token appears in the document, normalized against how frequently it appears in the entire document collection. Such implementations return an ordered list of documents, with the most relevant documents first.
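The difference between Boolean and ranked relevance can be illustrated with a minimal Python sketch. The toy documents, the whitespace tokenizer and the particular TF-IDF formula are assumptions made for illustration only; they are not OWLIM's or Lucene's actual scoring.

```python
import math
from collections import Counter

# Hypothetical toy collection; document ids and texts are illustrative only.
DOCS = {
    "d1": "united states of america",
    "d2": "united kingdom",
    "d3": "kingdom of spain",
}

def tokenize(text):
    """Lower-case and split on whitespace; real engines use richer analyzers."""
    return text.lower().split()

def boolean_search(docs, query):
    """Boolean relevance: ids of documents that contain every query token."""
    wanted = set(tokenize(query))
    return {doc_id for doc_id, text in docs.items()
            if wanted <= set(tokenize(text))}

def ranked_search(docs, query):
    """Ranked relevance: order documents by a simple TF-IDF sum."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    query_tokens = set(tokenize(query))
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for token in query_tokens:
            # Document frequency: how many documents contain the token.
            df = sum(1 for t in tokenized.values() if token in t)
            if df == 0 or counts[token] == 0:
                continue
            tf = counts[token] / len(tokens)
            idf = math.log(n_docs / df) + 1.0
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

With the query "united kingdom", the Boolean search returns only the document that contains both tokens, while the ranked search also returns the partially matching documents, placed after it.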
The parameters for OWLIM's full-text index control whether and when the index is created, the index cache size, and whether only literals or all types of nodes are indexed. See the parameters ftsIndexPolicy, fts-memory and ftsLiteralsOnly in the configuration section.
Full-text search patterns are embedded in SPARQL and SeRQL queries by adding extra statement patterns that use special system predicates:
Each of the elements of this triple is explained below:
The namespace prefix onto in the above table is <http://www.ontotext.com/owlim/fts#>.
Apache Lucene is a high-performance, full-featured text search engine written entirely in Java. OWLIM-SE supports full text search capabilities using Lucene with a variety of indexing options and the ability to simultaneously use multiple, differently configured indices in the same query.
In order to use the indexing behaviour of Lucene, a text document must be created for each node in the RDF graph to be indexed. This text document is called the 'RDF molecule' and is made up of other nodes reachable via the predicates that connect nodes to each other. Once a molecule has been created for each node, Lucene creates an index over these molecules. During search (query answering), Lucene identifies the matching molecules and OWLIM uses the associated nodes as variable substitutions when evaluating the enclosing SPARQL query.
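The idea of an RDF molecule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the triples and prefixes are hypothetical, molecules are built from outgoing statements one hop away, and a plain prefix scan stands in for the Lucene index. How OWLIM actually assembles molecules depends on the index configuration.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); names are illustrative.
TRIPLES = [
    ("ex:UK", "rdfs:label", "United Kingdom"),
    ("ex:UK", "ex:capital", "ex:London"),
    ("ex:London", "rdfs:label", "London"),
]

def build_molecules(triples):
    """Build one text document per subject from the nodes one hop away."""
    parts = defaultdict(list)
    for subject, _predicate, obj in triples:
        parts[subject].append(obj)
    return {node: " ".join(values) for node, values in parts.items()}

def search(molecules, prefix):
    """Return the nodes whose molecule has a term starting with the prefix."""
    prefix = prefix.lower()
    return sorted(
        node for node, text in molecules.items()
        if any(term.lower().startswith(prefix) for term in text.split())
    )
```

The nodes returned by `search` play the role of the variable bindings that are handed back to the enclosing query.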
The index name must have the http://www.ontotext.com/owlim/lucene# namespace and the local part can contain only alphanumeric characters and underscores.
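A client can check an index name against this rule before sending a query. The following Python sketch is a hypothetical helper, not part of OWLIM; it assumes that "alphanumeric" means ASCII letters and digits and that an empty local part is not allowed.

```python
import re

LUCENE_NS = "http://www.ontotext.com/owlim/lucene#"

# Assumption: only ASCII letters, digits and underscores are accepted.
_LOCAL_PART = re.compile(r"[A-Za-z0-9_]+")

def is_valid_index_name(uri):
    """Check the namespace and the characters of the local part."""
    if not uri.startswith(LUCENE_NS):
        return False
    local_part = uri[len(LUCENE_NS):]
    return _LOCAL_PART.fullmatch(local_part) is not None
```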
The following query will produce bindings for ?s from entities in the repository, where the RDF molecule associated with that entity (for the given index) contains terms that begin with "United". Furthermore, the bindings will be ordered by relevance (with any boosting factor):
The Lucene score for a bound entity for a particular query can be exposed using a special predicate:
This can be useful when Lucene query results should be ordered in a manner based on, but different from, the original Lucene score. For example, the following query will order results by a combination of the Lucene score and some ontology-defined importance value:
The luc:score predicate will work only on bound variables. There is no problem disambiguating multiple indices because each variable will be bound from exactly one Lucene index and hence its score.
The combination of ranking RDF molecules together with full-text search provides a powerful mechanism for querying/analysing datasets even when the schema is not known. This allows for keyword-based search over both literals and URIs with the results ordered by importance/interconnectedness. For an example of this kind of 'RDF Search', see FactForge.
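As a rough illustration of ordering by both text relevance and importance, the Python sketch below multiplies a text score by a rank value for each entity. The multiplication, the function name and the sample values are assumptions chosen for the example; the documentation does not prescribe how the two values should be combined.

```python
def order_by_combined_score(text_scores, ranks):
    """Sort entities by text score multiplied by rank, highest first.

    Entities without a rank are treated as having rank 0.0 (an assumption).
    """
    combined = {
        entity: score * ranks.get(entity, 0.0)
        for entity, score in text_scores.items()
    }
    return sorted(combined, key=lambda entity: combined[entity], reverse=True)
```

For example, an entity with a strong text match but very low rank can end up below a weaker match that is far better connected.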
OWLIM-SE has special support for 2-dimensional geo-spatial data that uses the WGS84 Geo Positioning RDF vocabulary (World Geodetic System 1984). Special indices can be used for this data that permit the efficient evaluation of special query forms and extension functions that allow:
The WGS84 ontology can be found at: http://www.w3.org/2003/01/geo/wgs84_pos and contains several classes and predicates:
Before the geo-spatial extensions can be used, the geo-spatial index must be built. This is achieved using a special predicate as follows:
If the indexing is successful, the above query will return true, false otherwise. Information about the indexing process and any errors can be found in the log.
The special syntax used to query geo-spatial data makes use of SPARQL's RDF Collections syntax. This syntax uses round brackets as a shorthand for the statements connecting a list of values using rdf:first and rdf:rest predicates with terminating rdf:nil. Statement patterns that use one of the special geo-spatial predicates supported by OWLIM-SE are treated differently by the query engine. The following special syntax is supported when evaluating SPARQL queries (the descriptions all use the namespace omgeo: <http://www.ontotext.com/owlim/geo#>):
At present there is just one SPARQL extension function.
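The round-bracket RDF Collections shorthand used by the geo-spatial predicates stands for a chain of rdf:first and rdf:rest statements that ends in rdf:nil. The Python sketch below shows that expansion for a list of values. The blank node labels and the sample coordinates are illustrative assumptions; a real query engine generates its own internal nodes.

```python
RDF_FIRST = "rdf:first"
RDF_REST = "rdf:rest"
RDF_NIL = "rdf:nil"

def expand_collection(values):
    """Return (head, triples) for a collection written as ( v1 v2 ... )."""
    if not values:
        # The empty collection is simply rdf:nil.
        return RDF_NIL, []
    # Hypothetical blank node labels, one per list cell.
    nodes = [f"_:b{i}" for i in range(1, len(values) + 1)]
    triples = []
    for i, value in enumerate(values):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF_NIL
        triples.append((nodes[i], RDF_FIRST, value))
        triples.append((nodes[i], RDF_REST, rest))
    return nodes[0], triples
```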
Knowledge of the implementation's algorithms and assumptions will allow users to make the best use of the OWLIM-SE geo-spatial extensions. The following points are significant and can affect the expected behaviour during query answering:
RDF Rank is an algorithm that identifies the more important or more popular entities in the repository by examining their interconnectedness. The popularity of entities can then be used to order query results, much as internet search engines do, for example the way Google orders search results using PageRank (http://en.wikipedia.org/wiki/PageRank).
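To give a feel for ranking by interconnectedness, the Python sketch below runs a plain PageRank power iteration over directed links between entities. It is a stand-in for illustration only: the damping factor, the iteration count and the handling of nodes without outgoing links are assumptions, and the exact computation performed by RDF Rank is not described here.

```python
def rank_entities(edges, damping=0.85, iterations=50):
    """PageRank by power iteration over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is shared evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank
```

Entities that are referenced by many other well-connected entities receive the highest values, and the values sum to 1.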
As seen in the example query, RDF Rank weights are made available via a special system predicate. Triple patterns with the predicate http://www.ontotext.com/owlim/RDFRank#hasRDFRank are handled specially by OWLIM, where the object of the statement pattern is bound to a literal containing the RDF Rank of the subject.
The computed weights can be exported to an external file using a query of this form:
The query will return true if the export was successful, false otherwise. If the export failed then an error message will be recorded in the log file.
RDF Priming is a technique that selects a subset of available statements for use as the input to query answering. It is based upon the concept of 'spreading activation' as developed in cognitive science. RDF Priming is a scalable and customisable implementation of the popular connectionist method on top of RDF graphs that allows for the "priming" of large datasets with respect to concepts relevant to the context and to the query. It is controlled using SPARQL ASK queries. This section provides an overview of the mechanism and explains the necessary SPARQL queries used to manage and set up RDF Priming.
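The general idea of spreading activation can be sketched in Python as follows. This is a simplified illustration, not the RDF Priming implementation: the parameter names mirror the decay factor, firing threshold, filter threshold, initial activation level and per-predicate weight factors discussed in this section, but their exact semantics in OWLIM-SE are assumed here, as are the propagation along outgoing statements only and the rule that each node fires at most once.

```python
def spread_activation(triples, seeds, weights, default_weight=1.0,
                      decay=0.5, firing_threshold=0.1,
                      initial_activation=1.0, max_rounds=3):
    """Propagate activation from seed nodes along outgoing statements."""
    activation = {node: initial_activation for node in seeds}
    frontier = set(seeds)
    for _ in range(max_rounds):
        incoming = {}
        for subject, predicate, obj in triples:
            # Only newly activated nodes above the firing threshold fire.
            if subject in frontier and activation[subject] >= firing_threshold:
                pulse = (activation[subject] * decay
                         * weights.get(predicate, default_weight))
                if pulse > 0:
                    incoming[obj] = incoming.get(obj, 0.0) + pulse
        frontier = set()
        for node, pulse in incoming.items():
            if node not in activation:
                frontier.add(node)
            activation[node] = activation.get(node, 0.0) + pulse
        if not frontier:
            break
    return activation

def primed_nodes(activation, filter_threshold):
    """Nodes considered active enough to take part in query answering."""
    return {node for node, level in activation.items()
            if level >= filter_threshold}
```

Setting the weight of a predicate to 0.0 stops activation from crossing statements with that predicate, which is the effect used later in this section to keep a large class taxonomy from being activated.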
To enable RDF Priming over the repository, the repository-type configuration parameter should be set to weighted-file-repository.
RDF Priming is controlled using SPARQL ASK queries, which allows all the parameters and default values to be set. These queries use special system predicates, which are described below:
The following example uses data from DBpedia (http://dbpedia.org/About) that was imported into OWLIM-SE with the RDF Priming mode enabled. The management queries are evaluated through the Sesame console application for convenience. The initial step is to evaluate a demo query that retrieves all the instances of the dbpedia:V8 concept:
The above query returns the following results:
As can be seen, the query returns many engines from different manufacturers. The RDF Priming module can be used to reduce the number of results returned by this query by targeting the query to specific parts of the global RDF graph, i.e. the parts of the graph that have been activated.
Change the default decay factor:
Change the firing threshold parameter:
Change the filter threshold:
The initial Activation Level is changed to reflect the specifics of the data set:
Adjust the Weight factors for a specific predicate so that it activates the relevant sub-set of the RDF graph, in this case the rdfs:subClassOf predicate:
The next step alters the Weight Factor of the rdf:type predicate so that it does not propagate activations to the classes from the activated instances. This is a useful technique when there are a lot of instances and a very large classification taxonomy which should not be broadly activated (as is the case with the DBpedia dataset).
If the example query is executed at this stage, it will return no results, because the RDF graph has no activated nodes at all. Therefore, the next step is to activate two particular nodes: the Ford Motor Company (dbpedia3:Ford_Motor_Company) and one of the cars it builds (dbpedia3:1955_Ford), which came out of the factory with a very nice V8 engine:
Finally, tell the RDF Priming module to spread the activations from these two nodes:
This will normally take 8-10 seconds after which the example query can be re-evaluated with the following results:
As can be seen, the result set is smaller and most of the engines retrieved are made by Ford. However, there is an engine made by Jaguar which is most probably there because Ford owned Jaguar for some time in the past, so both manufacturers are somehow related to each other. This might also be the case for the other non-Ford engines returned, since BMW also owned Jaguar for some time. Of course, these remarks are a free interpretation of the results.
to return to the normal operating mode.
Notifications are a publish/subscribe mechanism for registering and receiving events from an OWLIM-SE repository whenever triples matching a certain graph pattern are inserted or removed. The Sesame API provides such a mechanism, where a RepositoryConnectionListener can be notified of changes to a NotifyingRepositoryConnection. However, the OWLIM-SE notifications API works at a lower level and uses the internal raw entity IDs for subject, predicate and object instead of Java objects. The benefit is that much higher performance is possible. The downside is that the client must do a separate lookup to get the actual entity values and, because of this, the notification mechanism only works when the client is running inside the same JVM as the repository instance. See the next section for the remote notification mechanism.
The subscriber should not rely on any particular order or distinctness of the statement notifications. Duplicate statements might be delivered in response to a graph pattern subscription, in an order that is not necessarily the chronological order in which the statements were inserted into the underlying triple store.
The purpose of the notification services is to enable the efficient and timely discovery of newly added RDF data. Therefore it should be treated as a mechanism for giving the client a hint that certain new data is available and not as an asynchronous SPARQL evaluation engine.
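Because notifications may arrive duplicated and out of order, a subscriber is best written so that handling the same statement twice is harmless. The Python sketch below shows one way to do this with a set of already seen ID triples. The class and method names are hypothetical and are not part of the OWLIM-SE notifications API.

```python
class DedupingSubscriber:
    """Treats each notification as a hint and ignores repeated statements."""

    def __init__(self):
        self._seen = set()
        # Statements still to be looked up, in first-arrival order.
        self.pending = []

    def on_statement(self, subject_id, predicate_id, object_id):
        """Record a notification; return False if it was already seen."""
        key = (subject_id, predicate_id, object_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.pending.append(key)
        return True
```

The client can then resolve the pending IDs to entity values, or re-run its own query, whenever it is convenient, rather than treating the notification stream as a complete and ordered result set.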
OWLIM's remote notification mechanism provides filtered statement add/remove and transaction begin/end notifications for a local or a remote OWLIM-SE repository. Subscribers for this mechanism use patterns of subject, predicate and object (with wildcards) to filter the statement notifications. JMX is used internally as a transport mechanism.
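Filtering by subject, predicate and object patterns with wildcards can be pictured with the following Python sketch. It is a conceptual illustration only: representing a wildcard as None and a statement as a three-element tuple are assumptions made for this example, not the way the OWLIM API expresses patterns.

```python
# Assumption for this sketch: None stands for a wildcard position.
WILDCARD = None

def matches(pattern, statement):
    """True if every non-wildcard position of the pattern equals the statement's."""
    return all(p is WILDCARD or p == value
               for p, value in zip(pattern, statement))

def filter_notifications(patterns, statements):
    """Keep the statements that match at least one subscribed pattern."""
    return [statement for statement in statements
            if any(matches(pattern, statement) for pattern in patterns)]
```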
Registering and deregistering for notifications is achieved through the NotifyingOwlimConnection class, which wraps a RepositoryConnection object connected to an OWLIM repository and provides an API to add/remove notification listeners of type RepositoryNotificationsListener. Here is a simple example of the API usage:
The above example will work when the OWLIM repository is initialized in the same JVM that runs the example (local repository). If a remote repository is used (e.g. HTTPRepository) the notifying repository connection should be initialized differently:
For remote notifications, where the subscriber and the repository are running in different JVM instances (possibly on different hosts), a JMX remote service should be configured in the repository JVM. This is done by adding the following parameters to the JVM command line:
If the repository is running inside a servlet container, then these parameters must be passed to the JVM that runs the container and OWLIM. For Tomcat, this can be done using the JAVA_OPTS or CATALINA_OPTS environment variable.
where N is the consecutive number of the node we want to configure and PORTN is the port number of that node's JMX service. Cluster workers should also have their com.sun.management.jmxremote.* JVM parameters properly configured. OWLIM-Enterprise cluster master nodes will therefore be controlled and emit notifications using the same JMX port number.
In order to control whether only explicit or only implicit statements are considered during SPARQL query evaluation, some special context identifiers can be used with the FROM and FROM NAMED SPARQL constructs. The following table gives details:
Effectively, statements behave as though they have a context of http://www.ontotext.com/implicit or http://www.ontotext.com/explicit independent of whether they have an actual context or not. Various combinations of FROM and FROM NAMED are allowed in alignment with SPARQL semantics.
Internally, OWLIM uses integer identifiers (IDs) to index all entities (URIs, blank nodes and literals). Statement indices are made up of these IDs and a large data structure is used to map from ID to entity value and back. There are occasions, e.g. when interfacing to application infrastructure, when having access to these internal IDs can improve the efficiency of data structures external to OWLIM by allowing them to be indexed by an integer value rather than a full URI.
This section introduces a special OWLIM predicate and function that provide access to these internal IDs. The datatype of internal IDs is <http://www.w3.org/2001/XMLSchema#long>.
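The mapping between entity values and integer IDs can be pictured as a two-way dictionary. The Python sketch below is a minimal in-memory illustration under stated assumptions: the class name is hypothetical, IDs are assigned consecutively starting at 1, and nothing here reflects how OWLIM actually stores or allocates its identifiers.

```python
class EntityPool:
    """Maps entity values to integer IDs and back."""

    def __init__(self):
        self._id_by_value = {}
        self._value_by_id = []

    def get_id(self, value):
        """Return the ID for a value, assigning a new one if needed."""
        existing = self._id_by_value.get(value)
        if existing is not None:
            return existing
        self._value_by_id.append(value)
        new_id = len(self._value_by_id)  # IDs start at 1 in this sketch
        self._id_by_value[value] = new_id
        return new_id

    def get_value(self, entity_id):
        """Return the value for an ID, or raise KeyError if it is unknown."""
        if not 1 <= entity_id <= len(self._value_by_id):
            raise KeyError(entity_id)
        return self._value_by_id[entity_id - 1]
```

An external data structure can then be keyed by the compact integer rather than by the full URI string, which is the efficiency gain described above.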
There are several more special graph URIs used in OWLIM-SE that can be used to control query evaluation.