Data Annotation ontology and some design notes
Introduction
Annotations are a unification of several concepts from BM's RS spec:
- links to semantic objects, or their parts
- relations between semantic objects
- annotation of semantic objects
- proposing new versions for data
- discussions (forums), be that global or for a specific semantic object
We call the root-level semantic objects "Museum Objects" (MO), even though this would be a misnomer in some cases.
Below it's understood that when we talk about semantic objects, we can also address their parts.
Data vs Annotation Layer
We distinguish between two layers:
- the Data layer includes factual statements;
eg Rembrandt is the author of Susanna.
RS3.2 imports most data using a Data Migration: only atomic data can be edited by a user
- the Annotation layer includes attribute assignments, comments, who made them etc;
eg Bredius said in 1935 that "Rembrandt is the author of Susanna", qualified this with "or studio of", and made some extra remark.
RS3.2 handles annotations fully: a user enters all annotation fields
Values from the Annotation layer can enter the Data layer only on Approval by Project Admin.
- DO: is this equally true of general annotations? In the BM records there are many has_note literals.
- Vlado: it doesn't matter whether P3_has_note (or a sub-property) is used, but to what it is attached. The distinction "Data vs Annotation" is at a higher level: is this a property of a museum object or of E13_Attribute_Assignment
- DO: I assume that these are different from a research comment and would not enter the data layer.
- Vlado: research comments are in the Annotation layer, factual notes are in the Data layer. (Where you draw the line between them may be a bit fuzzy and is a matter of discussion and decision.) But most of the stuff I've seen in the BM export is at the Data layer
- DO: one question is about what happens when these things are in the public RDF stores and therefore can be queried by the public. What SPARQL query would be used to get not just the original values at the data layer, but also the annotated values at the data layer and all the values and annotations at the annotation layer?
- Vlado: my understanding is that the public only sees final results, i.e. approved data proposals. These are updated in the Data layer upon project admin approval.
If you want to see all data about a particular field, you only need to query the Annotation layer since all migrated data is also represented in (linked from) the Annotation layer
Representation Approach
We considered 3 implementation Alternatives, described in the article at Property Types and Annotations. We follow Alternative2:
- Annotations are represented as crm: E13_Attribute_Assignment. See E13 Attribute Assignment@crm and graphical example at attribute_assignment@crmg.
We use P140_assigned_attribute_to, analogs of P141_assigned, and some additional properties.
- We also use a property inspired by RDF's Reification vocabulary (rdf:predicate)
Open Issues
"TBD" marks items To Be Decided before we proceed to implementation.
- "TBD BM" are business-level and are for the British Museum to decide
- "TBD Onto" (or just "TBD") are technical and are for us to decide
Annotations
An annotation:
- is most of all a comment (note) that can include Rich Text and Links
- can be bound to a semantic object or its part or be free-standing (eg a project-level discussion); and/or reply_to another Annotation
- is made by someone at a certain time (can be a system user recently, or some art researcher in the past)
- can criticise/replace an old value (other_object) and/or propose a new value (object)
- can have various dispositions/types/states, eg:
agree, justify, disagree, original, proposed, approved, promoted, disapproved, deleted, publishable, published
Annotations are represented as E13_Attribute_Assignment, but we don't use P141_assigned, and use several other properties
Annotation Point
Annotation Points (AP) are all places in semantic objects or their parts or properties that researchers can talk about. Consider this data example (property chain).
A few APs that we want to discuss are shown in red.
APs consist of up to 4 fields (URIs). They are more complicated than Web links, because they allow a variety of targets:
- rso:root: the MO (2) (formerly called "main_subject")
- crm: P140_assigned_attribute_to: the previous node (3) (formerly called "subject")
- rso:property: the property (4)
- rso:object: the value (5) (formerly called P141_old_object)
The representations of each AP (and the allowed combinations) are:
AP | root | P140 | property | object | Meaning |
1 | | | | | Unattached comment. No AP Link is involved |
2 | x | | | | MO as a whole. A Semantic Link |
3 | x | x | x | x | Internal node (object). P140 is the previous node. An AP Link |
4 | x | x | x | | Specific role. An AP Link |
5 | x | x | x | x | Specific role and object. An AP Link |
Notes:
- root and P140 may coincide, if the property is right off the MO (eg P57_has_number_of_parts)
- TBD: (3) could have all 4 fields filled out, if we record the previous node and property (P140=<obj/2926/part/1>, property=P108i_was_produced_by, and P141=<obj/2926/part/1/production>). But since we can't propose an object, there's not much point in recording other_object
- TBD: in discussion 20120117 it was suggested that (5) may add unnecessary complexity for the user (too many APs to select from)
Annotation URIs
One option is to create Annotation URIs as random GUIDs.
However, we decided to use URIs that reflect the AP for easier tracking/debugging.
URIs come in two slightly different forms, depending on when they are created:
- created by a user at runtime: uses a unique part (if possible short and numeric) denoted below as "U"
- precreated during data migration: uses a counter
- for most fields the counter is "1", since an AP has at most one Remark field
- for <opm._verblijfplaats> collection remark, use the <collectie> counter, eg for the second collection the URI is:
<obj/2926/P49_has_former_or_current_keeper/a/2>
Created at | AP | root | P140 | property | URI form | URI example |
runtime | 1 | | | | "http://www.researchspace.org/a/" & U | http://www.researchspace.org/a/U |
precreated | 2 | x | | | root & "/a/1" | <obj/2926/a/1> |
runtime | 2 | x | | | root & "/a/" & U | <obj/2926/a/U> |
precreated | 3 | x | x | | P140 & "/a/1" | <obj/2926/part/1/production/a/1> |
runtime | 3 | x | x | | P140 & "/a/" & U | <obj/2926/part/1/production/a/U> |
precreated | 4,5 | x | x | x | P140 & "/" & LNAME(property) & "/a/1" | <obj/2926/part/1/production/P14_carried_out_by/a/1> |
runtime | 4,5 | x | x | x | P140 & "/" & LNAME(property) & "/a/" & U | <obj/2926/part/1/production/P14_carried_out_by/a/U> |
Please note that the presence or absence of "object" does not affect the URI
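The URI forms in the table can be sketched as a small helper. This is only an illustration, not the RS implementation: the function names annotation_uri and local_name are made up here, relative URIs are written without angle brackets, and LNAME is assumed to mean "the part of the property name after the last '/', '#' or ':'".

```python
import re
from typing import Optional

# Base for unattached (AP 1) annotations, taken from the table above.
BASE = "http://www.researchspace.org/"


def local_name(prop: str) -> str:
    """Assumed meaning of LNAME: text after the last '/', '#' or ':'."""
    return re.split(r"[#/:]", prop)[-1]


def annotation_uri(unique: str,
                   root: Optional[str] = None,
                   p140: Optional[str] = None,
                   prop: Optional[str] = None) -> str:
    """Build an annotation URI from the AP fields.

    `unique` is the counter (precreated) or the unique part U (runtime).
    The presence or absence of "object" plays no role.
    """
    if root is None:                      # AP 1: unattached comment
        return f"{BASE}a/{unique}"
    if p140 is None:                      # AP 2: MO as a whole
        return f"{root}/a/{unique}"
    if prop is None:                      # AP 3: internal node
        return f"{p140}/a/{unique}"
    # AP 4 and 5: specific role (with or without object)
    return f"{p140}/{local_name(prop)}/a/{unique}"


print(annotation_uri("1", root="obj/2926",
                     p140="obj/2926/part/1/production",
                     prop="crm:P14_carried_out_by"))
```

When the property is right off the MO, root and P140 coincide, so the collection-remark example is obtained by passing the MO as both root and p140 with the <collectie> counter as `unique`.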
AP Names
RS generates a user-friendly name for each AP.
URI field | text generated from | example |
root | P2_has_type.rdfs:label and P102_has_title[P2_has_type=rst-note:title-primary].P3_has_note | painting "Susanna" |
P140_assigned_attribute_to | P2_has_type.rdfs:label or rdf:type.rdfs:label (if different from root) | Production |
property | rdfs:label | carried out by |
object | "literal" or rdfs:label or P1_is_identified_by.P3_has_note | Rembrandt |
The names of the example APs are (RS adds punctuation as shown here):
2. painting "Susanna"
3. painting "Susanna": Production
4. painting "Susanna": Production: carried out by
5. painting "Susanna": Production: carried out by: Rembrandt
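A minimal sketch of how these names could be assembled, assuming the per-field texts have already been looked up (the lookup itself is not shown). The function name ap_name is hypothetical; the ": " separator is taken from the examples above.

```python
from typing import Optional


def ap_name(root_text: str,
            p140_text: Optional[str] = None,
            property_text: Optional[str] = None,
            object_text: Optional[str] = None) -> str:
    """Join the texts of the AP fields that are present with ': '.

    Pass None for p140_text when P140 coincides with root, since its
    text is generated only "if different from root".
    """
    parts = [root_text, p140_text, property_text, object_text]
    return ": ".join(p for p in parts if p)


print(ap_name('painting "Susanna"', "Production", "carried out by", "Rembrandt"))
```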
- We'll add rdfs:label to each CRM and RSO class and property that we use (RS-322@jira).
Erlangen CRM doesn't include such (it includes rdfs:comment, but these are scope notes, which are overly long and not useful)
- Limitation: this doesn't identify "subject" unambiguously.
Eg in (4) and (5) you can't tell whether Production is about the main part (painting) or another part (frame).
So a link to the frame's maker will sound pretty similar:
painting "Susanna": Production: carried out by: Willem de Vries
This could be fixed if we add Production.has_type and distinguish rst-production:painting vs rst-production:frame
The name is saved in rso: P3_has_title of precreated APs and is used as follows:
- When linking to an AP, the name is displayed in lieu of the link
- When a user makes a New annotation for an AP, or Replies to an existing annotation, has_title is copied. Then the user can edit it
Limitation: if the user changes rso:object, the link name is not updated automatically (the other AP fields cannot be changed)
Annotation Title and Description
- rso: P3_has_title: a short 1-line description without special markup. Printed in the threaded discussion view (tree)
- rso: P3_has_description: the most important part of an annotation, mandatory.
- A multi-line Rich Text field using HTML with special markup for links of various kinds.
- We assume the markup is sufficient to identify the links, so the links don't need to carry extra info (eg text positions)
Links
Links point to various objects and are crucial for several RS use cases. They are:
- Pre-created for parts of semantic objects, so they can be pointed to
- Referenced by invoking the Link tool when editing an annotation Description
- Saved to Data Basket, etc
Link kinds:
- Web Link: the usual URLs to web pages or other internet resources
- DMS Link: TBD: do we need direct links to a digital asset in Nuxeo (using nuxeo_uid)?
For RS3.2 we won't have such, since DMS assets have corresponding semantic objects, so we'll use semantic links
- Semantic Link: to any Annotation Point. Represented the same as an Annotation, but only the AP properties
- Limitation: in RS3, Semantic Links can point only to the RS repository, not to other semantic repositories.
rso:has_link: multi-valued property (array) of Semantic Links embedded in the Annotation Description.
- This is an indexing (caching) array used to find all annotations referring to an AP, a MO, a property-instance etc.
- Does not collect web links, since it's enough that they are stored in Description
- Holds the URI of the AP node or a specific Annotation (in both cases a crm:E13_Attribute_Assignment)
- If the text is edited, all links must be replaced
RS-305
RS-317
Annotation Reply
- rso:reply_to: previous annotation that this is a reply to (i.e. forms threaded discussions).
Copied Fields
When a user makes a New annotation for an AP, or Replies to an existing annotation, several fields are copied:
- rso:root, crm: P140_assigned_attribute_to, rso:property
Cannot be changed, i.e. a discussion thread cannot jump from one AP to another
- rso: P3_has_title
This is similar to email: when you reply, the subject is copied. Can be edited by the user
- rso:object (the value) is not copied, since a reply could be about an alternative value for the same property-instance
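The copying rules can be sketched as follows. The Annotation class and the make_reply function are hypothetical stand-ins for whatever RS uses internally; the field names only mirror the properties listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Illustrative in-memory stand-in for an E13_Attribute_Assignment."""
    uri: str
    root: Optional[str] = None       # rso:root
    p140: Optional[str] = None       # crm:P140_assigned_attribute_to
    prop: Optional[str] = None       # rso:property
    title: str = ""                  # rso:P3_has_title
    obj: Optional[str] = None        # rso:object
    reply_to: Optional[str] = None   # rso:reply_to


def make_reply(parent: Annotation, uri: str) -> Annotation:
    """Start a reply: copy the AP fields and title, but not the object."""
    return Annotation(
        uri=uri,
        root=parent.root,     # fixed: the thread stays on the same AP
        p140=parent.p140,     # fixed
        prop=parent.prop,     # fixed
        title=parent.title,   # copied like an email subject, user may edit it
        obj=None,             # not copied: the reply may propose another value
        reply_to=parent.uri,
    )
```

A New annotation for an AP would copy the same fields from the precreated AP, only without setting reply_to.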
Annotation Author and Date
- crm: P14_carried_out_by: who made the annotation.
- This can be:
- an art researcher (from data migration), i.e. a node from thesaurus rkd-person:
- a system user (recorded by RS).
(RS3.3 or later) Integrate DMS User Management (add/delete) to synchronize to a rso-user: thesaurus
- Pro: the person name is found in the same place: P14_carried_out_by.P131_is_identified_by.P3_has_note
- Pro: if a user's name is changed, it will be updated in all his annotations
- crm: P4_has_time-span: when the annotation was made.
Points to an intermediate node that holds the date. Imprecision (date range) is not supported.
- if by an art researcher in the past: crm: P82_at_some_time_within ^^xsd:gYear (from data migration)
- if within RS: crm: P82_at_some_time_within ^^xsd:dateTime (datetime that is precise to the second, recorded by the system)
Object
An annotation may propose a new value
- rso:object: value proposed by the annotation (formerly called P141_new_object)
Limitation: the user can propose only atomic objects: literals or thesaurus values.
- RS3.2 won't be able to propose a new compound object (eg Exhibition).
- Reason: to enter new compound objects, we need specific data entry forms for each; RForms and CRM flexibility notwithstanding
- TBD: this may be too severe, since there are almost-atomic objects that maybe we should accommodate, eg:
- <obj/MMM/title/N>: title together with type
- <obj/MMM/part/1/production/date>: time-span that can be a single date (P82) or interval (P82a, P82b)
- Constraint: applicable only if rso:property is present
The system helps the user enter an appropriate value:
- Value checking:
- literal: preserving the old type, while allowing some variation (xsd:date vs xsd:gYearMonth vs xsd:gYear)
- thesaurus value: selecting from the old thesaurus (eg rkd-artist)
- To determine the appropriate type, RS uses the old value, since we don't have data type info in the schema
- RS uses any old value at the AP, no matter whether from the Data Layer or another Annotation
- All objects at an AP share the same type or thesaurus (RS-323@jira will verify this)
- We need to correct/verify numeric types emitted by the data migration (RS-324@jira):
- Ensure all numeric quantities (eg width) use xsd:double even if it's a whole number.
- has_number_of_parts should be xsd:positiveInteger (if it were xsd:integer, someone could propose -1)
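The value checks above could look roughly like this. It is a sketch under assumptions: datatypes and thesaurus values are written as prefixed names, the thesaurus is taken to be the prefix before the colon, and all function names are invented for illustration.

```python
# Date-like datatypes between which variation is allowed (from the list above).
DATE_FAMILY = {"xsd:date", "xsd:gYearMonth", "xsd:gYear"}


def compatible_datatype(old_type: str, new_type: str) -> bool:
    """A new literal must keep the old type, except within the date family."""
    if old_type == new_type:
        return True
    return old_type in DATE_FAMILY and new_type in DATE_FAMILY


def same_thesaurus(old_value: str, new_value: str) -> bool:
    """Assumes prefixed names such as 'rkd-artist:...' where the prefix
    names the thesaurus."""
    return old_value.split(":", 1)[0] == new_value.split(":", 1)[0]


def is_positive_integer(lexical: str) -> bool:
    """Check a proposed has_number_of_parts value against xsd:positiveInteger."""
    try:
        return int(lexical) > 0
    except ValueError:
        return False
```

Because RS has no datatype info in the schema, `old_type` and `old_value` would come from any existing value at the AP, whether from the Data Layer or another Annotation.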
Other Object
An annotation may talk about another value (in the Data Layer or proposed by someone else):
- rso:other_object: old value that's being criticised or justified
- Selected amongst the distinct rso:object's at this AP, except the rso:object of the current Annotation
Meaning of different combinations:
other_object | object | meaning |
0 | 0 | Just a comment |
0 | 1 | Add new value |
1 | 0 | Remove existing value |
1 | 1 | Replace value |
- This meaning is enacted upon Approval of the annotation
- The action on "other_object" is enacted only if it's in the Data Layer and disposition is "criticise"
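The combination table and the two rules under it can be sketched as below. The function names are hypothetical, and the string "criticise" stands for the thesaurus value of the same name.

```python
def annotation_meaning(has_other_object: bool, has_object: bool) -> str:
    """Look up the meaning of an annotation from the combination table."""
    table = {
        (False, False): "Just a comment",
        (False, True): "Add new value",
        (True, False): "Remove existing value",
        (True, True): "Replace value",
    }
    return table[(has_other_object, has_object)]


def removes_other_object(in_data_layer: bool, other_disposition: str) -> bool:
    """On Approval, other_object is acted upon only if it is in the
    Data Layer and the disposition towards it is 'criticise'."""
    return in_data_layer and other_disposition == "criticise"
```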
Other Annotation Binding
"Binding" means higher-level entities that the annotation can be attached to (associated with).
The AP fields and reply_to are such, and in the future we may add more:
- (RS3?) rso:project: in which project the annotation is created (we currently have only one project, but future iterations will introduce several projects and address security)
- (RS3?) rso:root could point to an annotation (A1), if we want to record an Approval action as an annotation (A2)
This is different from reply_to, since A2 acts upon A1, instead of merely replying to it
Dispositions
Dispositions are various "flags" (eg types, states): fixed-thesaurus values with special meaning to RS. This means that:
- RS actively checks various constraints before allowing a disposition to be set
- we need to have special business and UI logic for most dispositions. Eg for UI:
- It's best to present dispositions relating to new_object right before the control for entering the new value
- The reply_to dispositions should be next to the button "Reply" (vs Start new discussion)
- Some dispositions are not entered directly through a listbox, but as the result of a certain "workflow operation"
- Some disposition operations are allowed only in certain states, and to a certain user role (the Project Admin)
- Sometimes the display is filtered by disposition, eg
- Project Admin has a special "proposed" view
- "deleted" annotations are shown only to Project Admin, but hidden from normal users
Field | Disposition | Constraint | Meaning |
P2_reply_disposition | agree | reply_to | Agree with the replied to comment |
P2_reply_disposition | disagree | reply_to | Disagree with the replied to comment |
P2_other_disposition | justify | other_object | Embedded link(s) provide justification for other_object ("object" is always supposed to be justified) |
P2_other_disposition | criticise | other_object | Embedded link(s) provide justification against other_object |
P2_annotation_status | original | | Created from the original migrated data |
P2_annotation_status | proposed | | Normal annotation. If "object" then new value is proposed |
A precreated AP with no data (Algorithm to precreate links RS-320@jira) has P2_annotation_status of "invisible". It is neither shown nor indicated to the user.
Disposition Discussions
The rest are under discussion; they won't be implemented in RS3.3 but in future iterations. Jana thinks that deprecated, deleted, promoted are flags related to discussion visibility/moderation, and not statuses; Vlado is not so sure.
Field | Disposition | Constraint | Meaning |
P2_annotation_status | approved | was proposed | Project Admin has approved comment and/or proposed value. If object, it's promoted to Data Layer. If other_object is criticised, it's removed from Data Layer |
P2_annotation_status | disapproved | was proposed | Project Admin has disapproved annotation, so it won't appear in the "proposed" list anymore |
P2_annotation_status | promoted | | Promoted to project-level discussion by Project Admin. (Comments without semantic link start off this way) |
P2_annotation_status | deprecated | | Deprecated by Project Admin. Cannot be replied to (so the discussion is cut off at this point) |
P2_annotation_status | deleted | | Removed from view by Project Admin. Can be viewed only by Project Admin. Is the discussion below also deleted? |
P2_annotation_status | publishable | was approved | Set by Project Admin to indicate high quality of writing and/or importance, worthy of publication |
P2_annotation_status | published | was publishable | Published to project results web site |
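Since these statuses are still under discussion, the following is only a sketch of how the "Constraint" column could be checked. The names REQUIRED_PREVIOUS and can_set_status are invented here, and statuses without a listed constraint are assumed to be settable from any state.

```python
# "was X" constraints from the table above: new status -> required current status.
REQUIRED_PREVIOUS = {
    "approved": "proposed",
    "disapproved": "proposed",
    "publishable": "approved",
    "published": "publishable",
}


def can_set_status(current: str, new: str) -> bool:
    """Return True if the table's constraint allows moving to `new`.

    Role checks (only the Project Admin may perform these operations)
    are deliberately left out of this sketch.
    """
    required = REQUIRED_PREVIOUS.get(new)
    return required is None or current == required


print(can_set_status("proposed", "approved"))   # allowed by the table
print(can_set_status("proposed", "published"))  # needs "publishable" first
```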
TBD BM: the table above is only a sample and the result of a few hours of thinking.
- TBD: DO: I wonder whether projects can add their own Dispositions to the core set?
- Vlado: sure, would be a nice complication
The big question is whether those are actionable (have meaning to RS) or not (have meaning to the users only)
- DO: What happens when someone proposes a new value
- Vlado: project admin looks at the proposal and then approves or rejects
- DO: or assigns another disposition?
- Vlado: such as?
- DO: I assume that his workflow action will also be in the versioning as all other annotations are. The project admin may accept but put his/her own reason or annotation
- DO: this process of approval and rejection is recorded how?
- Vlado: this is for a future iteration, but basically it changes the disposition. Do we need to record who did it and when (sort of Annotation of the Annotation)?
- DO: We should already have the concept of an annotation of an annotation. Decision should be recorded as a normal annotation but attributed to the project admin
- Vlado: This should be possible by following linked data ideas (an Annotation has a URI).
But we have not thought this out. It makes for some recursive structures that I don't have a grasp of (do you?)
The project admin can certainly Reply to his heart's content, but I thought a decision is enacted by data manipulation, not by recording Annotation over Annotation
- DO: Will another "team" be able to see the sequence of "types" applied? (Guess you mean "projects" and "values" respectively?)
- Vlado: My understanding is that other people (general public or other projects) only see published results
- DO: yes - to what extent can projects publish stuff as they go along - worth bearing in mind - at what granularity can this be done?
- Vlado: my understanding was that projects publish only at the end
Guess that's completely wrong
- DO: should the audit trail say: person A proposed, Person B Criticised, Person C proposed something else, etc.
- Vlado: the discussion thread shows the annotations. What's printed for each is a matter of discussion (I think current spec says who, when, title). Do we want to print Dispositions: all or only some of them?
See bigger discussions in last section
Property Qualifiers
Sometimes a value needs to be qualified by an extra type, eg:
- Title type: primary, other...
- Identifier type: RKD_priref, RKD_object_record...
- Attribution (author) qualification: of, studio of, circle of...
- Author role: master craftsman, understudy...
Cases 1,2 are represented in the Data Layer since the respective objects are not shared and there are CRM classes for them (E35_Title, E42_Identifier).
The other cases are more problematic, since they map to a property type P14.1_in_the_role_of@crm over P14_carried_out_by. Property Types and Annotations: sec 1.4 gives some examples and sec.3 discusses implementation alternatives.
- case 3 is represented the same as a disposition (see next section)
- case 4 doesn't occur in Rembrandt data
RS3.2: print amongst all other dispositions (see example in next section)
RS3.3?: allow the user to enter/edit them
- collect Qualifiers to determine applicable thesauri
- skip all Dispositions thesauri (because dispositions cannot be edited directly)
Representation of Dispositions
- Dispositions are saved in sub-properties of P2_has_type (eg P2_annotation_status)
- Disposition values come from respective thesauri rst-* (eg rst-annotation-status)
- Dispositions are printed in a uniform way, using P2_has_type.P3_has_label (from the meta-thesaurus) and P3_has_label (from the respective thesaurus), eg
Attribution Qualification: studio of, Reply: agree, Other value: criticise, Status: proposed
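The uniform printing can be sketched as a simple join, assuming the two labels of each disposition (field label from the meta-thesaurus, value label from the respective thesaurus) have already been fetched. The function name format_dispositions is hypothetical.

```python
from typing import Iterable, Tuple


def format_dispositions(pairs: Iterable[Tuple[str, str]]) -> str:
    """Print each (field label, value label) pair as 'Field: value',
    separated by commas, in the order given."""
    return ", ".join(f"{field}: {value}" for field, value in pairs)


print(format_dispositions([
    ("Attribution Qualification", "studio of"),
    ("Reply", "agree"),
    ("Other value", "criticise"),
    ("Status", "proposed"),
]))
```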
Annotation Example
Annotation Picture
Discussion on Annotations and Data Updates
Dominic 20111230
DO: We need to specify how we deal with the original and have a definition for this to inform the business rule. Does original mean the first version or the version that comes from the owning organisation? If the latter, then how is an update from the owning organisation dealt with? Perhaps the owning version should always be the original or at least marked as original. If this version is updated then does this create a new owner original, in which case what happens to the previous owner's version and the associated annotations?
For example, the British Museum provides the data for the Rosetta Stone. This is indicated as the original. The research team create an annotation against the original. The BM change their record and new data is uploaded. Is this still considered the original? If the new version is the original then something needs to happen to the old original (a new version marked as an old original) and any associated annotations which reference the old original.
06.DAUC03.R2 Again, what happens if a version is annotated by another user? The original user can't just come back and amend because the annotation might then become meaningless. E.g. I state that the object is red. An annotation says that they thought it was blue. The author of the original data goes back and amends from red to blue. The annotation then doesn't make sense. The original user might create another version to say that it is blue and then give it the correct status. Does this mean that what should happen is that the status of the original should be changed (and indeed a prompt given to avoid conflict)? For example, if the red version was given the approved status but this was a mistake and the blue version should be approved, then the status of the red version should change to rejected. The status flags therefore need to have some business rules that are converted into UI components like conflict prompts. E.g. "Another version of the field is also marked as 'approved', do you want to override", y/n. If yes then, "the version X will be changed to ..."
Vlado 20120102
I think that Data Updates are out of scope for RS3.
- I understand it's important, but it's also very complicated.
- You mentioned about a month ago that you had a writeup from someone (Lec? or Ken?), can you send it?
- I think your comments are about a CHANGED original value. But even the basic problem of reattaching annotations to unchanged values is complicated.
- the graph nature of RDF is both a strength and a weakness here:
- pro: unlike a relational schema, you can delete a data value, while annotations will hang around.
When you reinsert the unchanged value, annotations will magically reattach themselves, if URIs are stable.
- cons: but how do you reliably recognize what to delete and what to keep?
Dominic 20120111 on Versioning
We still need to understand exactly how Versioning will work starting from the original entry from the owning institution (if available).
- If all proposals are rejected then what is the default value
- The original (it stays in the data layer)
- Please note something like Cuneiform Digital library (CDLI).
- An owning organisation (BM) might send some data (original).
- The receiving project (CDLI) might then add some data. They might provide a more precise dating for a tablet (This is a practical example I have seen).
- They also add additional fields, eg from other data sources such as a previous catalogue.
- In this event the original data is not the same as the CDLI data, although it is attributed to the British Museum
There are 4 situations:
- The project imports a data record from the owning institution from their collection records system to start the process of research
- The research starts and information about an object is created by the researchers and later the data record from the owning institution is imported
- The owning institution manually inserts the data.
- (Vlado: is this the 4th one?) The data from the owning institution is unavailable.
Vlado 20120113
- We will precreate AP Links for all APs as E13_Attribute_Assignment with no data fields
- But we could record accurate attributions for the original (migrated) data, simply by putting P14_carried_out_by id:BritishMuseum
- Seems we're talking about mixing (migrating) data from several institutions...
TODO Vlado: consider the above and reply more
Dominic 20120113
I think that originals might need a different flag, e.g. OV1, OV2 (original version 1, original version 2) as opposed to project versions V1, V2. Eg V1 may be linked to OV2 but V2 may be linked to OV3.
6 Comments
Dec 23, 2011
Vladimir Alexiev
Jan 16, 2012
jana.parvanova
On dispositions:
I think that approved, proposed, original etc. are part of the approval workflow and are annotation statuses, not old/new dispositions.
In our initial discussions we were looking into the following examples:
1) User may reject an old value and propose a new one with a single annotation. (annotation has both old and new value)
2) User may just propose a new value without rejecting old one (annotation has only new value)
3) User may just reject old value without suggesting new one (annotation has only old value)
In all cases we need to have full set of annotation statuses available without coupling them with old/new value dispositions - that is proposed, approved etc. Admins are not supposed to approve only parts of the annotation (i.e. only new or only old values).
(Same for "original".)
All in all, I think that new/obj dispositions are an overcomplication and not needed at this stage. It is not obvious to me that they add significant value to the application and I am not sure how they relate to actual use cases.
Jan 16, 2012
jana.parvanova
I think we should have the following statuses/dispositions:
1. Workflow statuses at annotation level (not and new_obj): proposed, approved, disapproved. Maybe publishable and published - but I prefer to decide this when we start working on publishing.
2. Dispositions to old_obj: justify, criticize. I am not sure about original.
3. Flags at annotation level related to discussions visibility/moderation - deprecated, deleted, promoted
I am not sure we currently need agree/disagree as these don't have operational semantics.
I also don't think we need invisible/normal - as invisible are just the ones that don't have titles.
Jan 17, 2012
Vladimir Alexiev
2a. Jana: migrated Remarks should go in new_object not old_object since the semantics is "new_object is the thing asserted by the annotation"
Vlado: ok
4. "don't have titles": I think it's "safer" to filter by status, and it's more convenient to filter by status only (which you'll have to do eg for a Project Admin display). I've removed status Invisible: precreated APs will simply have no status
Jan 20, 2012
dominic.oldman
I think that originals work like this.
The data from an owning organisation has a version, say O1. Projects annotate the data and propose new values: v1, v2, v3. If the owning organisation puts up fresh data, then this becomes O2. In this way it is clear which are owning versions and which are project versions: o1, v1, v2, v3, o2, v4 etc.
Feb 06, 2012
jana.parvanova
There are some problems with AP Names when it comes to objects as described: ""literal" or rdf:label or P1_is_identified_by.P3_has_note"
Please, take a look at the following cases: