The background knowledge needed to make best use of GraphDB comprises some basic concepts and general understanding of the Sesame framework. This chapter provides introductions to both.
The Semantic Web represents a broad range of ideas and technologies that attempt to bring meaning to the vast amount of information available via the Web. The intention is to provide information in a structured form so that it can be processed automatically by machines. The combination of structured data and inferencing can yield much information not explicitly stated.
The Semantic Web solves such issues by adopting unique identifiers for concepts and the relationships between them. These identifiers, called "" (URIs) (a "resource" is any 'thing' or 'concept') are similar to Web page URLs, but do not necessarily identify documents from the Web. Their sole purpose is to uniquely identify objects or concepts and the relationships between them.
The use of URIs removes much of the ambiguity from information, but the Semantic Web goes further by allowing concepts to be associated with hierarchies of classifications, thus making it possible to infer new information based on an individual's classification and relationship to other concepts. This is achieved by making use of – hierarchical structures of concepts-- to classify individual concepts.
The World-Wide Web has grown rapidly and contains huge amounts of information that cannot be interpreted by machines. Machines cannot understand meaning, therefore they cannot understand Web content. For this reason, most attempts to retrieve some useful pieces of information from the Web require a high degree of user involvement – manually retrieving information from multiple sources (different Web pages), 'digging' through multiple search engine results (where useful pieces of data are often buried many pages deep), comparing differently structured result sets (most of them incomplete), and so on.
One approach for attaching semantic information to Web content is to embed the necessary machine-processable information through the use of special meta-descriptors (meta-tagging) in addition to the existing meta-tags that mainly concern the layout.
Within these meta tags, the (the pieces of useful information) can be uniquely identified in the same manner in which Web pages are uniquely identified, i.e. by extending the existing URL system into something more universal – a URI (Uniform Resource Identifier). In addition, conventions can be devised, so that resources can be described in terms of properties and values (resources can have properties and properties have values). The concrete implementations of these conventions (or vocabularies) can be embedded into Web pages (through meta-descriptors again) thus effectively 'telling' the processing machines things like:
[resource] John Doe has a [property] web site which is [value] www.johndoesite.com
The developed by the World Wide Web Consortium (W3C) makes possible the automated semantic processing of information, by structuring information using individual statements that consist of: Subject, Predicate, Object. Although frequently referred to as a 'language', RDF is mainly a data model. It is based on the idea that the things being described have properties which have values, and that resources can be described by making statements. RDF prescribes how to make statements about resources, in particular, Web resources, in the form of subject-predicate-object expressions. The 'John Doe' example above is precisely this kind of statement. The statements are also referred to as "", because they always have the subject-predicate-object structure .
A unique Uniform Resource Identifier (URI) is assigned to any resource or thing that needs to be described. Resources can be authors, books, publishers, places, people, hotels, goods, articles, search queries, and so on. In the Semantic Web, every resource has a URI. A URI can be a URL or some other kind of unique identifier. Unlike URLs, URIs do not necessarily enable access to the resource they describe, that is, in most cases they do not represent actual web pages. For example, the string "http://www.johndoesite.com/aboutme.htm", if used as a URL (Web link) is expected to take us to a Web page of the site providing information about the site owner, the person John Doe; the same string can however be used simply to identify that person on the Web (URI) irrespective of whether such a page exists or not.
To make the information in following sentence
"The web site www.johndoesite.com is created by John Doe."
machine-accessible, it must be expressed in the form of an RDF statement, i.e. a subject predicate object triple:
"[subject] the web site www.johndoesite.com [predicate] has a creator [object] called John Doe."
The respective RDF terms for the various parts of the statement are:
Next, each member of the subject-predicate-object triple should be identified using its URI, for example:
Note that in this version of the statement, instead of identifying the creator of the web site by the character string "John Doe", we used a URI, namely "http://www.johndoesite.com/aboutme". An advantage of using a URI is that the identification of the statement's subject can be more precise, i.e. the creator of the page is neither the character string "John Doe", nor any of the thousands of other people with that name, but the particular John Doe associated with that URI (whoever created the URI defines the association). Moreover, since there is a URI to refer to John Doe, he is now a full-fledged resource and additional information can be recorded about him simply by adding additional RDF statements with John's URI as the subject.
There are several conventions for writing abbreviated RDF statements, as used in the RDF specifications themselves. This shorthand employs an XML qualified name (or QName) without angle brackets as an abbreviation for a full URI reference. A QName contains a prefix that has been assigned to a namespace URI, followed by a colon, and then a local name. The full URI reference is formed from the QName by appending the local name to the namespace URI assigned to the prefix. So, for example, if the QName prefix "foo" is assigned to the namespace URI "http://example.com/somewhere/", then the QName "foo:bar" is shorthand for the URI "http://example.com/somewhere/bar".
Objects of RDF statements can (and very often do) form the subjects of other statements leading to a graph like representation of knowledge (the is defined in ). Using this notation, a statement is represented by:
So the RDF statement above could be represented by the following graph:
This kind of graph is known in the artificial intelligence community as a "semantic net" .
In order to represent RDF statements in a machine-processable way, RDF uses mark-up languages, namely (and almost exclusively) the Extensible Mark-up Language (XML). Because an abstract data model needs a concrete syntax in order to be represented and transmitted, RDF has been given a syntax in XML. As a result, it inherits the benefits associated with XML. However, it is important to understand that other syntactic representations of RDF, not based on XML, are also possible; XML-based syntax is not a necessary component of the RDF model. XML was designed to allow anyone to design their own document format and then write a document in that format. RDF defines a specific XML mark-up language, referred to as RDF/XML, for use in representing RDF information and for exchanging it between machines. Written in RDF/XML, our example will look as follows:
Also observe that the rdf:about attribute of the element rdf:Description is, strictly speaking, equivalent in meaning to that of an ID attribute, but it is often used to suggest that the object about which a statement is made has already been "defined" elsewhere. Strictly speaking, a set of RDF statements together simply forms a large graph, relating things to other things through properties, and there is no such concept as "defining" an object in one place and referring to it elsewhere. Nevertheless, in the serialized XML syntax, it is sometimes useful (if only for human readability) to suggest that one location in the XML serialization is the "defining" location, while other locations state "additional" properties about an object that has been "defined" elsewhere. There is much more to RDF/XML logic and syntax than can be covered here. For a discussion of the principles behind the modelling of RDF statements in XML (known as 'striping'), together with a presentation of the available RDF/XML abbreviations and other details and examples about writing RDF in XML, see the (normative) RDF/XML Syntax Specification from the W3C .
Properties are a special kind of resource: they describe relationships between resources, for example written by, age, title, and so on. Properties in RDF are also identified by URIs (in most cases, these are actual URLs). Therefore, properties themselves can be used as the subject in other statements which allows for an expressive ways to describe properties, e.g. by defining property hierarchies.
A named graph (NG) is a set of triples named by a URI. This URI can then be used outside or within the graph to refer to it . The ability to name a graph allows separate graphs to be identified out of a large collection of statements and further allows statements to made about graphs.
Named graphs represent an extension of the RDF data model, where quadruples <s,p,o,ng> are used to define statements in an RDF multi-graph. This mechanism allows, for example, the handling of provenance when multiple RDF graphs are integrated into a single repository. For information on the semantics and the abstract syntax of named graphs refer to .
From the perspective of GraphDB, named graphs are important, because comprehensive support for SPARQL (the most popular RDF query language and current W3C recommendation – see section 22.214.171.124) requires NG support.
While being a universal model that lets users describe resources using their own vocabularies, RDF does not make assumptions about any particular application domain, nor does it define the semantics of any domain. It is up to the user to do so using an vocabulary.
RDF Schema is a vocabulary description language for describing properties and classes of RDF resources, with a semantics for generalisation hierarchies of such properties and classes. Be aware of the fact that the RDF Schema is conceptually different from the XML Schema, even though the common term schema suggests similarity. XML Schema constrains the structure of XML documents, whereas RDF Schema defines the vocabulary used in RDF data models. Thus, RDFS makes semantic information machine-accessible, in accordance with the Semantic Web vision. RDF Schema is a primitive ontology language. It offers certain modelling primitives with fixed meaning.
The RDFS facilities are themselves provided in the form of an RDF vocabulary; that is, as a specialised set of predefined RDF resources with their own special meanings. The resources in the RDFS vocabulary have URIs with the prefix http://www.w3.org/2000/01/rdf-schema# (conventionally associated with the namespace prefix rdfs). Vocabulary descriptions (schemas) written in the RDFS language are legal RDF graphs. Hence, systems that process RDF information that do not understand the additional RDFS vocabulary can still interpret a schema as a legal RDF graph consisting of various resources and properties. However, such a system will be oblivious to the additional built-in meaning of the RDFS terms. To understand these additional meanings, software that processes RDF information must be extended to include these language features and to interpret their meanings in the defined way.
A class can be thought of as a set of elements. Individual objects that belong to a class are referred to as instances of that class. A class in RDFS corresponds to the generic concept of a type or category, somewhat like the notion of a class in object-oriented programming languages such as Java. RDF classes can be used to represent any category of objects, such as web pages, people, document types, databases or abstract concepts. Classes are described using the RDF Schema resources rdfs:Class and rdfs:Resource, and the properties rdf:type and rdfs:subClassOf. The relationship between instances and classes in RDF is defined using rdf:type.
In addition to describing the specific classes of things they want to describe, user communities also need to be able to describe specific properties that characterise those classes of things (such as 'numberOfBedrooms' to describe an apartment). In RDFS, properties are described using the RDF class rdf:Property, and the RDFS properties rdfs:domain, rdfs:range and rdfs:subPropertyOf.
RDFS also provides vocabulary for describing how properties and classes are intended to be used together. The most important information of this kind is supplied by using the RDFS properties rdfs:range and rdfs:domain to further describe application-specific properties.
These statements indicate that ex:Person is a class, ex:author is a property, and that RDF statements using the ex:author property have instances of ex:Person as objects.
These statements indicate that ex:Book is a class, ex:author is a property, and that RDF statements using the ex:author property have instances of ex:Book as subjects, see .
RDFS provides the means to create custom vocabularies. However, it is generally easier and better practice to use an existing vocabulary created by someone else who has already been describing a similar conceptual domain. Such publicly available vocabularies, called "shared vocabularies" are not only cost-efficient to use, but they also promote the shared understanding of the described domains.
the predicate dc:creator, when fully expanded as a URI, is an unambiguous reference to the creator attribute in the Dublin Core metadata attribute set, a widely-used set of attributes (properties) for describing information of this kind. So this triple is effectively saying that the relationship between the web site (identified by http://www.johndoesite.com/) and the creator of the site (a distinct person, identified by http://www.johndoesite.com/aboutme) is exactly the property identified by http://purl.org/dc/elements/1.1/creator. This way, anyone familiar with the Dublin Core vocabulary or those who find out what dc:creator means (say, by looking up its definition on the Web) will know what is meant by this relationship. In addition, this shared understanding based upon using unique URIs for identifying concepts is exactly the requirement for creating computer systems that can automatically process structured information.
An example of a shared vocabulary that is readily available for reuse is The Dublin Core, which is a set of elements (properties) for describing documents (and hence, for recording metadata). The element set was originally developed at the March 1995 Metadata Workshop in Dublin, Ohio, USA. Dublin Core has subsequently been modified on the basis of later Dublin Core Metadata workshops and is currently maintained by the Dublin Core Metadata Initiative.
In general, an ontology formally describes a (usually finite) domain of related concepts (classes of objects) and their relationships. For example, in a company setting, staff members, managers, company products, offices, and departments might be some important concepts. The relationships typically include hierarchies of classes. A hierarchy specifies a class C to be a subclass of another class C' if every object in C is also included in C'. For example, all managers are staff members.
Ontologies are important because use ontologies as semantic schemata. This makes automated reasoning about the data possible (and easy to implement) since the most essential relationships between the concepts are built into the ontology.
Formal (KR) is about building models. The typical modelling paradigm is mathematical logic, but there are also other approaches, rooted in the information and library science. KR is a very broad term; here we only refer to its mainstream meaning of the world (of a particular state of affairs, situation, domain or problem), which allow for automated reasoning and interpretation. Such models consist of ontologies defined in a formal language. Ontologies can be used to provide formal semantics (i.e. machine-interpretable meaning) to any sort of information: databases, catalogues, documents, Web pages, etc. Ontologies can be used as semantic frameworks: the association of information with ontologies makes such information much more amenable to machine processing and interpretation. This is because ontologies are described using logical formalisms, such as , which allow automatic inferencing over these ontologies and datasets that use them, i.e. as a vocabulary. An important role of ontologies is to serve as schemata or "intelligent" views over information resources. Comments in the same spirit are provided in . This is also the role of ontologies in the Semantic Web. Thus, they can be used for indexing, querying, and reference purposes over non-ontological datasets and systems, such as databases, document and catalogue management systems. Because ontological languages have a formal semantics, ontologies allow a wider interpretation of data, i.e. inference of facts which are not explicitly stated. In this way, they can improve the interoperability and the efficiency of the usage of arbitrary datasets.
Ontologies can be classified as light-weight or heavy-weight according to the complexity of the KR language and the extent to which it is used. Light-weight ontologies allow for more efficient and scalable reasoning, but do not possess the highly predictive (or restrictive) power of more powerful KR languages. Ontologies can be further differentiated according to the sort of conceptualisation that they formalise: upper-level ontologies model general knowledge, while domain and application ontologies represent knowledge about a specific domain (e.g. medicine or sport) or a type of application, e.g. knowledge management systems. Basic definitions regarding ontologies can be found in , , , and .
(KB) is a broader term than ontology. Similar to an ontology, a KB is represented in a KR formalism, which allows automatic inference. It could include multiple axioms, definitions, rules, facts, statements, and any other primitives. In contrast to ontologies, however, KBs are not intended to represent a shared or consensual conceptualisation. Thus, ontologies are a specific sort of a KB. Many KBs can be split into ontology and instance data parts, in a way analogous to the splitting of schemata and concrete data in databases. A broader discussion on the different terms related to ontology and semantics can be found in .
PROTON  is a light-weight upper-level schema-ontology developed in the scope of the SEKT project . It is used in the KIM system  for semantic annotation, indexing and retrieval. We will also use it for ontology-related examples within this section. PROTON  is encoded in OWL Lite  and defines about 542 entity classes and 183 properties, providing good coverage of named entity types and concrete domains, i.e. modelling of concepts such as people, organizations, locations, numbers, dates, addresses, etc. A snapshot of the PROTON class hierarchy is shown above.
The topics that follow take a closer look at the logic that underlies the retrieval and manipulation of semantic data and the kind of programming that supports it.
Logic programming involves the use of logic for computer programming, where the programmer uses a declarative language to assert statements and a reasoner or theorem-prover is used to solve problems. A reasoner can interpret sentences, such as IF A THEN B, as a means to prove B from A. In other words, given a collection of logical sentences, a reasoner will explore the solution space in order to find a path to justify the requested theory. For example, to determine the truth value of 'C' given the following logical sentences:
a reasoner will interpret the IF..THEN statements as rules and determine that C is indeed inferred from the KB. This use of rules in logic programming has led to 'rule-based reasoning' and 'logic programming' becoming synonymous, although this is not strictly the case.
In LP, there are rules of logical inference that allow new (implicit) statements to be inferred from other (explicit) statements, with the guarantee that if the explicit statements are true, so are the implicit statements.
There must also be a reasonable time frame for the entire inference process. To this end, much research has been carried out to determine the complexity classes of various logical formalisms and reasoning strategies. Generally speaking, to reason with Web-scale quantities of data requires a low-complexity approach. A tractable solution is one whose algorithm requires finite time and space to complete.
From a more abstract viewpoint, the subject of the previous topic is related to the foundation upon which logical programming resides, which is logic, particularly in the form of (also known as "first order logic"). Some of the specific features of predicate logic render it very suitable for making inferences over the Semantic Web, namely:
The languages of RDF and OWL (Lite and DL) can be viewed as specialisations of predicate logic. One reason for such specialised languages to exist is that they provide a syntax that fits well with the intended use (in our case, Web languages based on tags). The other major reason is that they define reasonable subsets of logic. This is important because there is a trade-off between the expressive power and the computational complexity of certain logic: the more expressive the language, the less efficient (in the worst case) the corresponding proof systems. As previously stated, OWL Lite and OWL DL correspond roughly to a , a subset of predicate logic for which efficient proof systems exist.
Another subset of predicate logic with efficient proof systems comprises the so-called rule systems (also known as or definite logic programs). A rule has the form:
Both approaches have important applications. The deductive approach, however, is more relevant for the purpose of retrieving and managing structured data. This is because it relates better to the possible queries that one can ask, as well as to the appropriate answers and their proofs.
Description Logic (DL) have historically evolved from a combination of frame-based systems and predicate logic. Its main purpose is to overcome some of the problems with frame-based systems and to provide a clean and efficient formalism to represent knowledge. The main idea of DL is to describe the world in terms of 'properties' or 'constraints' that specific 'individuals' must satisfy. DL is based on the following basic entities :
The family of description logic system consists of many members that differ mainly with respect to the constructs they provide. Not all of the constructs can be found in a single DL system. For a listing of some concrete constructs with a brief explanation of their semantics, refer to .
In order achieve the goal of a broad range of shared ontologies using vocabularies with expressiveness appropriate for each domain, the Semantic Web requires a scalable high-performance storage and reasoning infrastructure. The major challenge towards building such an infrastructure is the expressivity of the underlying standards: RDF , RDFS , OWL  and OWL 2 . Even though RDFS can be considered a simple KR language, it is already a challenging task to implement a repository for it, which provides performance and scalability comparable to those of relational database management systems (RDBMS). Even the simplest dialect of OWL (OWL Lite) is a description logic (DL) that does not scale due to reasoning complexity. Furthermore, the semantics of OWL Lite are incompatible with that of RDF(S), see .
OWL DLP is a non-standard dialect, offering a promising compromise between expressive power, efficient reasoning, and compatibility. It is defined in  as the intersection of the expressivity of OWL DL and logic programming. In fact, OWL DLP is defined as the most expressive sub-language of OWL DL, which can be mapped to . OWL DLP is simpler than OWL Lite. The alignment of its semantics to RDFS is easier, as compared to OWL Lite and OWL DL dialects. Still, this can only be achieved through the enforcement of some additional modelling constraints and transformations. A broad collection of information related to OWL DLP can be found on .
Experience with using OWL has shown that existing ontologies frequently use very few constructs outside the DLP language .
In  ter Horst defines RDFS extensions towards rule support and describes a fragment of OWL, more expressive than DLP. He introduces the notion of of one (target) RDF graph from another (source) RDF graph on the basis of a set of entailment rules R. R-entailment is more general than the used by Hayes  in defining the standard RDFS semantics. Each rule has a set of premises, which conjunctively define the body of the rule. The premises are "extended" RDF statements, where variables can take any of the three positions.
In this document we refer to this extension of RDFS as "OWL Horst". As outlined in , this language has a number of important characteristics:
In Figure 3, the pink box represents the range of expressivity of OWLIM, i.e. including OWL DLP, OWL Horst, OWL 2 RL, most of OWL Lite. However, none of the rule sets include support for the entailment of typed literals (D-entailment); more details on the semantics supported by OWLIM can be found in section 4.6.
OWL 2  is a re-work of the OWL language family by the OWL working group. This work includes identifying fragments of the OWL 2 language that have desirable behavior for specific applications/environments.
The original OWL specification , now known as OWL 1, provides two specific subsets of OWL Full designed to be of use to implementers and language users. The OWL Lite subset was designed for easy implementation and to provide users with a functional subset that provides an easy way to start using OWL.
The OWL DL (where DL stands for "Description Logic") subset was designed to support the existing Description Logic business segment and to provide a language subset that has desirable computational properties for reasoning systems.
In this section, we introduce some query languages for RDF. This may beg the question as to why we need RDF-specific query languages at all instead of using an XML query language. The answer is that XML is located at a lower level of abstraction than RDF. This fact would lead to complications if we were querying RDF documents with an XML-based language. The RDF query languages explicitly capture the RDF semantics in the language itself.
RQL (RDF Query Language) was initially developed by the Institute of Computer Science at Heraklion, Greece, in the context of the European IST project MESMUSES.3. RQL adopts the syntax of OQL (a query language standard for object-oriented databases), and, like OQL, is defined by means of a set of core queries, a set of basic filters, and a way to build new queries through functional composition and iterators.
SPARQL (pronounced "sparkle") is currently the most popular RDF query language; its name is a recursive acronym that stands for "SPARQL Protocol and RDF Query Language". It was standardised by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is now considered a key Semantic Web technology. On 15 January 2008, SPARQL became an official W3C Recommendation.
SeRQL (Sesame RDF Query Language, pronounced "circle") is an RDF/RDFS query language developed by Sesame's developer – Aduna – as part of Sesame. It selectively combines the best features (considered by its creators) of other query languages (RQL, RDQL, N-Triples, N3) and adds some features of its own. As of this writing, SeRQL provides advanced features not yet available in SPARQL. Some of SeRQL's most important features are:
The two principle strategies for rule-based inference are forward-chaining and backward-chaining.
Both of these strategies has its advantages and disadvantages, which have been well studied in the history of KR and expert systems. Attempts to overcome the weak points have led to the development of various hybrid strategies (involving partial forward- and backward-chaining) which have proven efficient in many contexts.
Imagine a repository which performs total forward-chaining, i.e. it tries to make sure that after each update to the KB, the inferred closure is computed and made available for query evaluation or retrieval. This strategy is generally known as . In order to avoid ambiguity with various partial materialisation approaches, let us call such an inference strategy, taken together with the monotonic entailment. When new explicit facts (statements) are added to a KB (repository), new implicit facts will likely be inferred. Under a monotonic logic, adding new explicit statements will never cause previously inferred statements to be retracted. In other words, the addition of new facts can only monotonically extend the inferred closure. Assumption, total materialisation.
The principle advantages and disadvantages of the total materialisation are discussed at length in ; here we provide just a short summary of them:
Probably the most important advantage of the inductive systems, based on total materialisation, is that they can easily benefit from RDBMS-like query optimization techniques, as long as all the data is available at query time. The latter makes it possible for the query evaluation engine to use statistics and other means in order to make 'educated' guesses about the 'cost' and the 'selectivity' of a particular constraint. These optimisations are much more complex in the case of deductive query evaluation.
Over the last decade the Semantic Web has emerged as an area where semantic repositories became as important as HTTP servers are today. This perspective boosted the development, under W3C driven community processes, of a number of robust metadata and ontology standards. Those standards play the role, which SQL had for the development and spread of the relational DBMS. Although designed for Semantic Web, these standards face increasing acceptance in areas like Enterprise Application Integration and life sciences.
Sesame is a framework for storing, querying and reasoning with RDF data. It is implemented in Java by Aduna as an open source project and includes various storage back-ends (memory, file, database), query languages, reasoners and client-server protocols.
Sesame supports the W3C's SPARQL query language and Aduna's own query language SeRQL. It also supports most popular RDF file formats and query result formats.
A schematic representation of Sesame's architecture is shown in Figure 4. Following is a brief overview of the main components.
Figure 4 -- The Sesame architecture
Reproduced from the Sesame 2 online documentation at http://www.openrdf.org
The SAIL API is a set of Java interfaces that support the storage and retrieval of RDF statements. The main characteristics of the SAIL API are:
Other proposals for RDF APIs are currently under development. The most prominent of these are the Jena toolkit and the Redland Application Framework. The SAIL shares many characteristics with both approaches, however an important difference between these two proposals and SAIL, is that the SAIL API specifically deals with RDFS on the retrieval side: it offers methods for querying class and property subsumption, and domain and range restrictions. In contrast, both Jena and Redland focus exclusively on the RDF triple set, leaving interpretation of these triples to the user application. In SAIL, these RDFS inferencing tasks are handled internally. The main reason for this is that there is a strong relationship between the efficiency of inference and the actual storage model being used. Since any particular SAIL implementation has a complete understanding of the storage model (e.g. the database schema in the case of an RDBMS), this knowledge can be exploited to infer, for example, class subsumption more efficiently.
Skip to end of metadata Go to start of metadata