The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches. Such searches are typically implemented by an external component or service such as Lucene, but the Connectors have the additional benefit of staying automatically up to date with the GraphDB repository data.
The Connectors provide synchronisation at the entity level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.
The main features of the GraphDB Connectors are:
Each feature is described in detail below.
All interactions with the Lucene GraphDB Connector are done through SPARQL queries.
There are three types of SPARQL queries:
In general, this follows the usual SPARQL semantics: INSERT adds or modifies data, and SELECT queries existing data.
Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Lucene GraphDB Connector, this is http://www.ontotext.com/connectors/lucene#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/lucene#createConnector to create a connector instance for Lucene.
Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Lucene, the instance prefix is http://www.ontotext.com/connectors/lucene/instance#.
All examples use the following sample data, which describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.
This version of the Lucene GraphDB Connector uses Lucene version 4.10.4.
Creating a connector instance is done by sending a SPARQL query with the following configuration data:
The configuration data has to be provided as a JSON string representation and passed together with the create command.
The create command is triggered by a SPARQL INSERT with the createConnector predicate. For example, this creates a connector instance called my_index, which synchronises the wines from the sample data above:
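As a minimal sketch, the create command can be assembled programmatically. The helper name create_connector_update is hypothetical, and the property URIs hasSugar and hasYear are assumptions, since only madeFromGrape and rdfs:label are named in the text. The JSON configuration is passed as a long string literal in the object position.

```python
import json

LUCENE = "http://www.ontotext.com/connectors/lucene#"
INSTANCE = "http://www.ontotext.com/connectors/lucene/instance#"
WINE = "http://www.ontotext.com/example/wine#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def create_connector_update(name, config):
    """Return a SPARQL INSERT that passes `config` as a JSON string."""
    body = json.dumps(config, indent=2)
    # A triple-quoted SPARQL literal lets the JSON keep its double quotes.
    return (
        f"PREFIX : <{LUCENE}>\n"
        f"PREFIX inst: <{INSTANCE}>\n"
        "INSERT DATA {\n"
        f"  inst:{name} :createConnector '''\n{body}\n''' .\n"
        "}"
    )


config = {
    "types": [WINE + "Wine"],
    "fields": [
        {"fieldName": "grape",
         "propertyChain": [WINE + "madeFromGrape", RDFS_LABEL]},
        # hasSugar and hasYear are assumed property names.
        {"fieldName": "sugar", "propertyChain": [WINE + "hasSugar"]},
        {"fieldName": "year", "propertyChain": [WINE + "hasYear"]},
    ],
}

update = create_connector_update("my_index", config)
print(update)
```

The printed text can be sent to the repository as a regular SPARQL update.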
The above command creates a new Lucene connector instance.
The "types" key defines the RDF type of the entities to synchronise and, in the example, it is only entities of the type <http://www.ontotext.com/example/wine#Wine> (and its subtypes). The "fields" key defines the mapping from RDF to Lucene. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, three bits of information are mapped - the grape the wines are made of, sugar content, and year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.
Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.
Dropping a connector instance removes all references to its external store from GraphDB as well as all Lucene files associated with it.
The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the connector instance has to be in the subject position, e.g., this removes the connector my_index:
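A minimal sketch of the drop command follows. The helper name is hypothetical, and writing the object as an anonymous blank node is an assumption; the text only requires that the instance name is in the subject position.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)


def drop_connector_update(name):
    """Return a SPARQL INSERT that drops the connector instance `name`."""
    # The instance is the subject; the object is only a placeholder,
    # written here as an anonymous blank node (an assumption).
    return PREFIXES + "INSERT DATA {\n  inst:" + name + " :dropConnector [] .\n}"


print(drop_connector_update("my_index"))
```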
Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors predicate:
?cntUri is bound to the prefixed URI of the connector instance that was used during creation, e.g., <http://www.ontotext.com/connectors/lucene/instance#my_index>, while ?cntStr is bound to a string, representing the part after the prefix, e.g., "my_index".
The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:
?cntUri is bound to the prefixed URI of the connector instance, while ?cntStatus is bound to a string representation of the status of the connector represented by this URI. The status is key-value based.
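The listing and status queries described above can be sketched as two query strings. The variable names follow the text; the constant names are hypothetical.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)

# Lists every connector instance: its URI and the name after the prefix.
LIST_CONNECTORS = PREFIXES + """SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}"""

# Returns the key-value status string of every connector instance.
CONNECTOR_STATUS = PREFIXES + """SELECT ?cntUri ?cntStatus {
  ?cntUri :connectorStatus ?cntStatus .
}"""

print(LIST_CONNECTORS)
print(CONNECTOR_STATUS)
```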
From the user's point of view, all synchronisation happens transparently without using any additional predicates or naming a specific store explicitly, i.e., you should simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a SELECT and providing the Lucene query as the object of the query predicate:
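A minimal sketch of such a query is shown here. The helper name is hypothetical, and typing the search node with the connector instance (a inst:my_index) is an assumption about how the search is bound to an instance.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)


def entity_query(instance, lucene_query):
    """Return a SELECT that binds ?entity to each matching document subject."""
    # The query itself is given before :entities, which fetches the results.
    return PREFIXES + (
        "SELECT ?entity {\n"
        f"  ?search a inst:{instance} ;\n"
        f'      :query "{lucene_query}" ;\n'
        "      :entities ?entity .\n"
        "}"
    )


print(entity_query("my_index", "grape:cabernet"))
```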
The result binds ?entity to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.
It is also possible to provide per query search options by using one or more option predicates. The option predicates are described in detail below.
The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they were made:
The result looks like this:
It is possible to access the match score returned by Lucene with the score predicate. As each entity has its own score, the predicate should come at the entity level. For example:
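A sketch of a query that retrieves the score, assuming the same search pattern as a plain entity query; the constant name is hypothetical. Note that :score is attached to ?entity, not to the search node.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)

SCORE_QUERY = PREFIXES + """SELECT ?entity ?score {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :score ?score .
}"""

print(SCORE_QUERY)
```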
The result looks like this but the actual score might be different as it depends on the specific Lucene version:
Consider the sample wine data and the my_index connector instance described previously. You can also query facets using the same instance:
It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited list of field names. In order to get the faceted results, use the facets predicate. As each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates facetName, facetValue, and facetCount.
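The predicates above can be combined as in the following sketch. The helper name is hypothetical, and the match-all Lucene query *:* is an assumption chosen so that every wine is counted.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)


def facet_query(instance, lucene_query, facet_fields):
    """Return a SELECT that fetches name, value and count for each facet."""
    fields = ",".join(facet_fields)  # comma-delimited list of field names
    return PREFIXES + (
        "SELECT ?facetName ?facetValue ?facetCount {\n"
        f"  ?search a inst:{instance} ;\n"
        f'      :query "{lucene_query}" ;\n'
        f'      :facetFields "{fields}" ;\n'
        "      :facets _:f .\n"
        # The blank node gives access to the three facet components.
        "  _:f :facetName ?facetName ;\n"
        "      :facetValue ?facetValue ;\n"
        "      :facetCount ?facetCount .\n"
        "}"
    )


print(facet_query("my_index", "*:*", ["year", "sugar"]))
```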
The resulting bindings look like the following:
You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines as each facet is computed independently.
It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the orderBy predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:
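A sketch of how the orderBy value can be built and used. Both helper names are hypothetical, and the query year:2013 is an assumption taken from the description of the result.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)


def order_by_value(fields):
    """Turn (field name, descending) pairs into a value such as 'year,-sugar'."""
    return ",".join(("-" if desc else "") + name for name, desc in fields)


def sorted_query(instance, lucene_query, fields):
    """Return a SELECT whose results are sorted on the Lucene side."""
    return PREFIXES + (
        "SELECT ?entity {\n"
        f"  ?search a inst:{instance} ;\n"
        f'      :query "{lucene_query}" ;\n'
        f'      :orderBy "{order_by_value(fields)}" ;\n'
        "      :entities ?entity .\n"
        "}"
    )


print(sorted_query("my_index", "year:2013", [("sugar", True)]))
```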
The result contains wines produced in 2013 sorted according to their sugar content in descending order:
By default, entities are sorted according to their matching score in descending order.
Limit and offset are supported on the Lucene side of the query. This is achieved through the predicates limit and offset. Consider this example in which an offset of 1 and a limit of 1 are specified:
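A sketch of a paged query. The helper name is hypothetical; the Lucene query sugar:dry and passing the numbers as plain string literals are assumptions.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)


def paged_query(instance, lucene_query, offset, limit):
    """Return a SELECT that skips `offset` hits and returns at most `limit`."""
    return PREFIXES + (
        "SELECT ?entity {\n"
        f"  ?search a inst:{instance} ;\n"
        f'      :query "{lucene_query}" ;\n'
        f'      :offset "{offset}" ;\n'
        f'      :limit "{limit}" ;\n'
        "      :entities ?entity .\n"
        "}"
    )


print(paged_query("my_index", "sugar:dry", 1, 1))
```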
The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino will be second in the list:
Snippet extraction is used to extract highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate snippets. It binds a blank node that in turn provides the actual snippets via the predicates snippetField and snippetText. The predicate snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:
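A sketch of a snippet query for the Cabernet search; the constant name is hypothetical and the search pattern is assumed to be the same as for a plain entity query.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)

SNIPPET_QUERY = PREFIXES + """SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}"""

print(SNIPPET_QUERY)
```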
The query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:
It is possible to tweak how the snippets are collected/composed by using the following option predicates:
The option predicates are set on the query instance, much like the :query predicate.
You can get the total number of hits by using the totalHits predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:
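A sketch of such a query, assuming that :totalHits is attached to the search node alongside :query; the constant name is hypothetical.

```python
PREFIXES = (
    "PREFIX : <http://www.ontotext.com/connectors/lucene#>\n"
    "PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>\n"
)

TOTAL_HITS_QUERY = PREFIXES + """SELECT ?totalHits {
  ?search a inst:my_index ;
      :query "year:2012" ;
      :totalHits ?totalHits .
}"""

print(TOTAL_HITS_QUERY)
```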
As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.
The creation parameters define how a connector instance is created by the :createConnector predicate. Some are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.
The Lucene Connector supports custom Analyzer implementations. They may be specified via the analyzer parameter whose value must be a fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer. The class requires either a default constructor or a constructor with exactly one parameter of type org.apache.lucene.util.Version. For example, these two classes are valid implementations:
FancyAnalyzer and SmartAnalyzer can then be used by specifying their fully qualified names, for example:
The RDF types of entities to sync are specified as a list of URIs. At least one type URI is required.
RDF data is often multilingual but you can map only some of the languages represented in the literal values. This can be done by specifying a list of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges maps all existing literals that have matching language tags.
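To illustrate the matching rule, here is a small sketch of RFC 4647 basic filtering extended with the empty range. It is not the connector's code; the function names are hypothetical, and treating untagged literals as matched only by the empty range is an assumption.

```python
def range_matches(language_range, tag):
    """RFC 4647 basic filtering for one language range and one language tag."""
    rng, tag = language_range.lower(), tag.lower()
    if rng == "":
        # Empty range: only literals without a language tag.
        return tag == ""
    if tag == "":
        # Assumption: an untagged literal is matched by the empty range only.
        return False
    # A range matches an identical tag or a tag that extends it with "-...".
    return rng == "*" or tag == rng or tag.startswith(rng + "-")


def literal_is_mapped(ranges, tag):
    """True if the literal's language tag matches any configured range."""
    return any(range_matches(r, tag) for r in ranges)


for tag in ["en", "en-GB", "de", ""]:
    print(repr(tag), literal_is_mapped(["en", ""], tag))
```

With the ranges "en" and "", English literals in any regional variant and untagged literals are mapped, while German literals are skipped.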
The fields define exactly what parts of each entity will be synchronised as well as the specific details on the connector side. The field is the smallest synchronisation unit and it maps a property chain from GraphDB to a field in Lucene. The fields are specified as a list of field objects. At least one field object is required. Each field object has further keys that specify details.
The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name but, to avoid unnecessary escaping (which depends on how Lucene parses its queries), we recommend keeping the field names simple.
The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs where at least one URI must be provided.
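The following sketch shows how a property chain selects values by walking triples one step at a time. It is a model of the definition, not the connector's implementation; the function name and the tiny data set (including the :hasYear property and its value) are illustrative assumptions.

```python
def follow_chain(triples, entity, chain):
    """Return the values reached from `entity` by following `chain`."""
    current = {entity}
    for prop in chain:
        # The objects of one step become the subjects of the next step.
        current = {o for (s, p, o) in triples if s in current and p == prop}
    return current


# Illustrative triples only.
triples = [
    (":Yoyowine", ":madeFromGrape", ":CabernetSauvignon"),
    (":CabernetSauvignon", "rdfs:label", "Cabernet Sauvignon"),
    (":Yoyowine", ":hasYear", "2013"),
]

# A two-element chain and a single-element chain (a direct property).
print(follow_chain(triples, ":Yoyowine", [":madeFromGrape", "rdfs:label"]))
print(follow_chain(triples, ":Yoyowine", [":hasYear"]))
```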
See Copy fields for defining multiple fields with the same property chain.
See Multiple property chains per field for defining a field whose values are populated from more than one property chain.
The default value (defaultValue) provides a means of specifying a default value for the field when the property chain has no matching values in GraphDB. The default value can be a plain literal, a literal with a datatype (xsd: prefix supported), a literal with language, or a URI. The option itself is not set by default.
If indexed, a field is available for Lucene queries. True by default.
This option corresponds to Lucene's field option "indexed".
Fields can be stored in Lucene and this is controlled by the Boolean option "stored". Stored fields are required for retrieving snippets. True by default.
This option corresponds to Lucene's property "stored".
When literal fields are indexed in Lucene, they are analysed according to the analyser settings. If a given field should not be analysed, set "analyzed" to false. This option has no effect for URIs (they are never analysed). True by default.
This option corresponds to Lucene's property "tokenized".
RDF properties and synchronised fields may have more than one value. If "multivalued" is set to true, all values will be synchronised to Lucene. If set to false, only a single value will be synchronised. True by default.
Lucene needs to index data in a special way if it is to be used for faceted search. This is controlled by the Boolean option "facet". True by default. Fields that are not synchronised for faceting are also not available for faceted search.
By default, the Lucene GraphDB Connector uses the datatype of literal values to determine how they should be mapped to Lucene types. For more information on the supported datatypes, see Datatype mapping.
The datatype mapping can be overridden through the parameter "datatype", which can be specified per field. The value of "datatype" can be any of the xsd: types supported by the automatic mapping.
This section provides an overview of additional ways to define a field besides the regular field definitions composed of a field name and a property chain. The following methods are applicable in specific use cases.
Often, it is convenient to synchronise one and the same data multiple times with different settings to accommodate for different use cases, e.g., faceting or sorting vs full-text search. The Lucene GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:
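A sketch of such a field list, together with a small consistency check. The helper names are hypothetical, and the check is only an illustration of the rule that a copy field must refer to a non-copy field.

```python
import json

WINE = "http://www.ontotext.com/example/wine#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

fields = [
    {"fieldName": "grape",
     "propertyChain": [WINE + "madeFromGrape", RDFS_LABEL],
     "analyzed": True},
    # Copy field: a single chain element of the form @otherFieldName.
    {"fieldName": "grapeFacet",
     "propertyChain": ["@grape"],
     "analyzed": False},
]


def copy_target(field):
    """Return the referenced field name for a copy field, else None."""
    chain = field["propertyChain"]
    if len(chain) == 1 and chain[0].startswith("@"):
        return chain[0][1:]
    return None


def check_copy_fields(field_list):
    """Raise ValueError unless every copy field refers to a non-copy field."""
    plain = {f["fieldName"] for f in field_list if copy_target(f) is None}
    for f in field_list:
        target = copy_target(f)
        if target is not None and target not in plain:
            raise ValueError(f"{f['fieldName']} copies unknown field {target}")


check_copy_fields(fields)
print(json.dumps(fields, indent=2))
```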
The snippet creates an analysed field "grape" and a non-analysed field "grapeFacet". Both fields are populated with the same values, and "grapeFacet" is defined as a copy field that refers to the field "grape".
Sometimes you have to work with data models that define the same concept (in terms of what you want to index in Lucene) with more than one property chain, e.g., the concept of "name" could be defined as a single canonical name, multiple historical names and some unofficial names. If you want to index those together as a single field in Lucene, you can define that as a multiple property chains field.
Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single physical field when indexed. Virtual fields are distinguished by the suffix /xyz, where xyz is any alphanumeric sequence of convenience. For example, we can define the fields name/1 and name/2 like this:
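A sketch of two virtual fields and of the merge into one physical field. The property URIs under http://www.example.com/ and both helper names are hypothetical; the merge function only models the described behaviour.

```python
from collections import defaultdict

# Two virtual fields that feed the physical field "name".
fields = [
    {"fieldName": "name/1",
     "propertyChain": ["http://www.example.com/canonicalName"]},
    {"fieldName": "name/2",
     "propertyChain": ["http://www.example.com/historicalName"]},
]


def physical_field(name):
    """'name/1' -> 'name'; a plain field name is returned unchanged."""
    return name.split("/", 1)[0]


def merge_virtual_fields(values):
    """Merge per-virtual-field values into per-physical-field values."""
    merged = defaultdict(list)
    for field, vals in values.items():
        merged[physical_field(field)].extend(vals)
    return dict(merged)


print(merge_virtual_fields({"name/1": ["Sofia"], "name/2": ["Serdica"]}))
```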
The values of the fields name/1 and name/2 will be merged and synchronised to the field name in Lucene.
Filters can be used with fields defined with multiple property chains. Both the physical field values and the individual virtual field values are available:
The Lucene GraphDB Connector maps different types of RDF values to different types of Lucene values according to the basic type of the RDF value (URI or literal) and the datatype of literals. The autodetection uses the following mapping:
The datatype mapping can be affected by the synchronisation options too, e.g., a non-analysed field that has xsd:long values is indexed with a StringField.
The entityFilter parameter is used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values are synchronised to Lucene if, and only if, they pass the filter. The entity filter is similar to a FILTER() inside a SPARQL query but not exactly the same. Each configured field can be referred to, in the entity filter, by prefixing it with a "?", much like referring to a variable in SPARQL. Several operators are supported:
In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:
The construction parent(?var) is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var) can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or in the bound operator like this: bound(parent(?var)).
The construction ?var -> uri (alternatively ?var o uri or just ?var uri) is used to access additional values that are accessible through the property uri. In essence, this construction corresponds to the triple pattern ?value uri ?effectiveValue, where ?value is a value bound by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this: ?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this: bound(parent(?company) -> <urn:hasGroup>).
The URI parameter can be a full URI within < > or the special string rdf:type (alternatively just type), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
The construction graph(?var) is used to access the RDF graph of a field's value. The typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/implicit>). The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).
Entity filters can be combined with default values in order to get more flexible behaviour.
A typical use-case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.
Given the following RDF data:
If you create a connector instance such as:
The entity :beta is not synchronised as it has no value for city.
To handle such cases, you can modify the connector configuration to specify a default value for city:
The default value is used for entity :beta as it has no value for city in the repository. As the value is "London", the entity is synchronised.
Sometimes data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:
Suppose you want to map this data to Lucene so that the property :taggedWith x is mapped to the separate fields taggedWithPerson and taggedWithLocation according to the type of x (events are not of interest here). To do this, you can map taggedWith twice to different fields and then use an entity filter to get the desired values:
The six articles in the RDF data above will be mapped as such:
This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:
If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:
The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labelled arrows.
Even though SPARQL per se is not sensitive to the order of triple patterns, the Lucene GraphDB Connector expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.
The diagram in Overview of connector predicates provides a quick overview of the predicates.
No special procedures are required for upgrading from:
GraphDB prior to 6.2 shipped with version 3.x of the Lucene GraphDB Connector, which had different options and slightly different behaviour and internals. Unfortunately, it is not possible to migrate existing connector instances automatically. To prevent any data loss, the Lucene GraphDB Connector will not initialise if it detects an existing connector in the old format. The recommended way to migrate your existing instances is:
You might also need to change your queries to reflect any changes in field names or extra fields.
Prior to 6.2, a single field in the config could produce up to three individual fields on the Lucene side, based on the field options. For example, for the field "firstName":
The current version always produces a single Lucene field per field definition in the configuration. This means that you have to create all appropriate fields based on your needs. See more under List of creation parameters.
You can easily migrate your existing lucene4 plugin setup to the new connectors interface.
We have changed the syntax for the search queries to be able to match our needs for new features and better design. Here is an example query using the lucene4 plugin:
and here is the connector variant:
For more information on the new syntax and how everything is linked together, see Overview of connector predicates.