annotation-paradigm 3.0¶






Contents:
Queries As Annotations¶


Application¶
The application consists of two parts: a set of Perl scripts for data conversion and a Web2Py web application for the end user.
Data conversion¶
A number of Perl scripts extract the data from the Emdros WIVU database by means of MQL queries. The MQL output is then transformed into several SQL files, to be bulk imported into an SQL database.
There are two databases: wivu, for the WIVU textual data, and oannot, for the annotation data. The annotation data consists of feature data, also coming from the WIVU database, and query data, imported from a set of example queries.
History¶
Author: Dirk Roorda, DANS.
This is the motivation, history and explanation of the Queries-Features-As-Annotations (demo)-application.
The underlying idea was first presented at the Biblical Scholarship and Humanities Computing workshop on 2012-02-06/10. See also Lorentz Workshop. Currently the idea is being worked out in a CLARIN-NL project called SHEBANQ.
Hopefully SHEBANQ will spawn a hub for doing qualitative and quantitative analysis on LAF resources. See :ref:`LAF Fabric`.
I have also been working on an idea for dealing with versions of all kinds: :ref:`Portable Annotations`, or: annotations that remain usable across variants of resources.
Another example of the versatility of annotations as a carrier for the results of scholarship is :ref:`Topics As Annotations`, for which I am currently developing the Topics-As-Annotations (demo)-application.
Queries and their results are meaningful objects in the scholarly record, but how do we preserve them? We explore the idea to store queries on a corpus as annotations to that corpus.
Contributors:
- Henk Harmsen, presenter of On the Origin of Scriptures, highlighting the care that is needed to preserve the texts and the research on them through the digital ages.
- Dirk Roorda, presenter of Datastructures for origins, interpretation and tradition of text, also known as provenance, analysis and sharing (presentation available at the workshop site).
These colleagues of mine have assisted in the development of the demo.
The Case¶
Linguistic Text Database¶

The WIVU group in Amsterdam, led by Eep Talstra, has crafted a linguistic database of the Hebrew Old Testament. It is the fruit of decades of work, still continuing.
The acronym WIVU does not officially appear on the website of the Free University Amsterdam. But there is a current NWO project, Data and Tradition. The Hebrew Bible as a linguistic corpus and as a literary composition, initiated by Eep Talstra, that continues the WIVU work. Oliver Glanz is associated with it.
The text with all its features is stored in a special database (EMDROS). This database supports a pragmatic subset of the query language QL, defined in [Doedens]. The subset is called MQL. EMDROS is a front-end on more conventional database technologies such as MySQL or SQLite.
Scholars use MQL queries to shed light on exegetical questions. It is possible to define quite sophisticated queries into text corpora, of which the Hebrew Bible is an example. These queries do not only mine the surface forms of the texts, but also the annotated features that have been added over time. Depending on the amount and detail of these features, the queries will yield a treasure of information, some of which will be fed into the text database in subsequent stages.


The WIVU database is included and integrated in Logos/SESB Bible study software.
SESB 3.0 also contains the Biblia Hebraica Stuttgartensia with text-critical apparatus and the linguistic WIVU database, plus the Biblia Hebraica Stuttgartensia: Workgroep Informatica Constituency Tree Analysis. The new WIVU database allows for a precise overview of how the individual textual elements of a specific text passage are analyzed and hierarchically organized.
The WIVU database is not the only linguistically marked up database of the Hebrew Bible. To get a feel for how these databases are used, have a look at the Logos users’ forum.
While this approach is beneficial for analysis and interpretation, it poses challenges for sharing and tradition.
All these people who have bought individual copies of the WIVU database and run their queries locally: what means of sharing their research do they have? Of course they can publish their results in scholarly articles, but how is one to get an overview of all the research that is going on?
Here we propose an approach where queries and their results are being visualised together with the canonical sources around which they have been constructed.
A first attempt by DANS to share the research done with the WIVU database using the EMDROS tool resulted in a website hosted by DANS, where interested people can experiment with MQL queries on a 2008 version of the WIVU database.
Since then, the WIVU database has been listed in an inventory of language resources by the European CLARIN project.

Open Annotation¶

Annotation is a pervasive research activity. It is the vehicle for enriching the data of science with the insights of experts. Annotations are often locked into the systems that hold the data or the tools by which scientists perform their analysis. Researchers might be using programs on their own personal computers, or working in an institutional environment around a database for which no public access has been set up. Where it is not possible to address the things that annotations are about, the annotations themselves cannot be shared in ways that do justice to their potential.
It is very fitting in our network age that there is an initiative to liberate annotations from their specialised contexts and share them freely on the web: the Open Annotation Collaboration (OAC).
The first task is to identify a general, abstract model to which all annotations conform. The OAC model talks about annotations as entities having one body and one or more targets. The body is what you say about the targets.
The second task is to employ web technology to underpin that model. OpenAnnotation has chosen the emerging Linked Data framework (see also the Linked Data Book) for giving web reality to its abstract concepts of bodies, targets, and what connects them together.
Additional information about an annotation can be supplied as well: metadata. This is the place to record who made the annotation, when, why, in what context, and with reference to which publications.
Metadata is not modeled by the OpenAnnotation Collaboration; it is attached to the annotation by the usual Linked Data means.
Taking all this together: here we have a framework that works on the most pervasive technology of the information age, with a nearly unlimited capacity for making connections where there used to be none. It is suited to express the results of much hard research work. Last but not least, it holds the promise of new discovery through new ways of visualising patterns.
The Idea¶
Queries as Annotations¶
So much for annotations. What about queries into databases? Where annotations are rather passive, static comments, queries are active, dynamic forays into landscapes of data. What do the benefits of OpenAnnotation have to do with queries? If we want to preserve research output, we should preserve queries!
Yes, and here is how. First we must realise how difficult it is to preserve queries in their dynamic form. Several problems stand in our way:
- real-life databases change all the time. Running a query now will give different results from running it tomorrow.
- software itself is difficult to preserve. The whole stack of <operating system - database management system - query language - publishing manager> is changing at such a rate that, if you do nothing special, the query that runs fine now will not run at all in two years' time.
Yet, when a scholar grounds the conclusions of a journal article on the results of a query, it is important that there is some permanent record of the exact results. It matters less that the query cannot be repeated exactly. It matters more that when the query is repeated, the differences with previous runs can be spotted.
Preserving the record of a query (and its results) is a whole lot different from preserving the capability of running that query with exactly the same results for the indefinite future. This record is not dynamic but static, not active but passive. So it makes sense to turn to OpenAnnotation again and see what it can do for query records.
The central idea is:
a query on a text database can be preserved as an annotation to all its result occurrences in that database.
The list below shows the exact correspondence between the annotation concept and the query(-record) concept, with the WIVU database as an example.
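The correspondence can be sketched in code. This is a minimal illustration, not the demo's actual schema: field names and the example values are assumptions.

```python
# Sketch: a query record modeled as an OAC-style annotation.
# Field names and values are illustrative, not the demo's actual schema.
from dataclasses import dataclass, field

@dataclass
class QueryAnnotation:
    # body: the MQL query instruction itself, stored as text
    body: str
    # targets: the anchors (monad numbers) of all result occurrences
    targets: list = field(default_factory=list)
    # metadata: who crafted the query, when, and why
    metadata: dict = field(default_factory=dict)

ann = QueryAnnotation(
    body="... an MQL query instruction ...",
    targets=[28731, 28732, 28733],      # monad numbers of result words
    metadata={"researcher": "D. Roorda", "date": "2012-01-30"},
)
assert len(ann.targets) == 3
```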

Benefits of Queries-As-Annotations¶

Is it worthwhile to preserve queries as annotations? What can you do with it? Well, the best way to find out is to start building it and see what happens.
We (Eko and Dirk) have built a demo web application that shows the surface text of the WIVU database in the left column, and the relevant queries in the right column. A query is relevant if it has results occurring in the text of the left column. You can click on a query to highlight the actual results on the left.
When DANS built the DANS-WIVU website back in 2008, we used 22 real-life example queries. Now, in 2012, we use a new version of the database, called BHS3, and we have adapted those 22 queries to the new data model. Indeed, the data model has changed: most of the queries did not run out of the box.
Users of bible study software are familiar with searching (= querying). But usually only one direction of information is supported: given the query, you get the results. What is more difficult to obtain is the reverse: given text passages, which queries have them as results?
This latter direction only makes sense if there is a set of queries with significant research value. Not every search command or query is equally valuable. Its value depends on the underlying research question, if there is one, the identity of the person or project that crafted the query, and associated publications. Queries-As-Annotations have the potential to make this network visible. And the source texts themselves are part of this network and provide access to it.
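The reverse direction can be sketched as an inverted index from word anchor to queries. The data and names here are illustrative:

```python
# Sketch: inverting the query -> results direction.
# Given query records with their result monads, build an index from
# monad number to the queries that hit it, so that a displayed passage
# can list the queries of which it is a result. (Illustrative data.)
from collections import defaultdict

query_results = {
    "Q1": [101, 102, 250],
    "Q2": [102, 500],
}

queries_by_monad = defaultdict(set)
for query_id, monads in query_results.items():
    for m in monads:
        queries_by_monad[m].add(query_id)

# Which queries have a result at monad 102?
assert queries_by_monad[102] == {"Q1", "Q2"}
```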
The present demo application lets scholars stumble upon each other’s research questions by showing the published queries, including metadata, next to their results.
If there is a significant amount of published queries, an interesting network of researchers, research questions, query instructions and text passages will be revealed.
Anchored Sources¶
The query-as-annotation idea is most easily implemented and most effective if the sources against which the queries are executed, are stable.
If they are not stable, the idea still works if the atomic elements in the sources are addressable by stable addresses.
If the addresses are not stable, the idea can still be put to work if we can translate addresses from one version to another.
This requirement can be summarised by saying that the sources should be anchored:
all atomic elements of the sources should have well-defined addresses.
In the WIVU case, the words are called monads and they have a sequence number in the whole corpus. We have used these sequence numbers as the anchors for our demo application.
There are several versions of the Hebrew texts that collectively comprise the Old Testament. If we do not want to tie our queries to a specific version, several approaches could be followed. We could make databases for the individual versions, assign local monad numbers to the words, and then compute a mapping between the monad numbers in such a way that the numbers of corresponding words are mapped to each other.
A radically different approach is to integrate all versions into a super-version. Each word will carry an extra feature, called version-membership, which specifies the versions to which the word belongs. If the addresses of the words are assigned in such a way that it is possible to add new words between existing words, then we can keep anchors stable even if new versions are discovered and added.
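A minimal sketch of that super-version idea, assuming a gap-leaving address scheme; the actual WIVU addressing may differ:

```python
# Sketch of the super-version idea (illustrative, not the actual WIVU
# scheme): assign word addresses with gaps so new words can be inserted
# between existing ones, and record version membership per word.
words = {
    1000: {"text": "word-a", "versions": {"BHS3", "other-version"}},
    2000: {"text": "word-b", "versions": {"BHS3"}},
}

# A newly discovered variant word fits between the existing addresses
# without renumbering anything:
words[1500] = {"text": "word-a-variant", "versions": {"other-version"}}

# Anchors stay stable: a query record targeting address 2000 still
# points at the same word after the insertion.
assert sorted(words) == [1000, 1500, 2000]
assert "BHS3" not in words[1500]["versions"]
```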
The query-as-annotation approach is suited also to this super version. Even if the assignment of words to versions is not stable, and follows the successive waves of scholarship, the words themselves have fixed addresses, and can be targeted by query records.
Independence of Analytical Tools¶
In order to show the query record, it is not necessary to preserve the analytical machinery in working order.
A positive consequence is that there is no brake on the continuous evolution of analysis tools. There is no legacy to be carried around. When showing query results, there is no need to show all the analytical features carried by the words in the database. That makes it easier to build web interfaces that are optimised for sharing research and discovering larger patterns.
Of course, it remains important to preserve the data of the analytical machinery, i.e. all the features and their organisation by the database. But we do not have to preserve them in a way that lets us run the analytical tools on them. It is sufficient that the data is transparent and well-documented.
Transparent means: open to inspection, not hidden behind opaque binary formats. The tools that see through those binary formats may no longer exist at some point in the future.
Well-documented means that it will be clear to future users of the data what the features mean, what the quality of the value assignments is, and for what purposes the data have been collected and used.
A Solution for Digital Preservation¶
Considering the field of biblical scholarship, the nature of the analysis that takes place, and the kind of research results that are communicated, it can be concluded that we can do a good job in the digital preservation of research data if:
- we use and preserve anchored sources
- we preserve static analytical data in trusted digital repositories
- build and maintain a Linked Data web of Queries-As-Annotations
It is not perfect, in that we will not be able to completely reproduce every detail of past research. But it is still valuable to be able to see a well-documented, well-connected track record of research questions and answers.
The Work¶
Components¶

Annotation Database¶
For maximum interoperability, the Open Annotations should be stored in an RDF triple store and published on the Web.
Then other researchers can discover the queries, view their metadata properties, follow their results. Via the property researcher they can find the one who has crafted the query, and then find other queries by the same person. Through the property research-question they can observe the research programme behind the query, and find useful keywords to search for related material.
Nevertheless, this demo stores the queries in a local database. This might well be a typical situation: in order to use the annotations in a rendering application, it can be useful to import them from a triple store into a local database.
Conversely, when queries are locally added, they must be exported from the local database to a global triple store.
Although we store the queries in a local database, they are modeled as annotations with bodies, targets and separate metadata. It is conceivable that quite different annotations (not derived from queries) on the same source (Hebrew Bible) are added to the store. This will in no way break the application.
Source Database¶
We work with a version of the Hebrew Bible that comes from the WIVU group. The version is called BHS3.
This is a feature-rich source, in the literal sense: words and clusters of words are objects, and objects have many features.
We compiled a source text in XML by extracting the text feature from the words, the verse-, chapter- and book- objects, and the monad numbers (i.e. the sequence numbers of the individual words in the complete text).
These monad numbers are our anchors. They will be used to identify the occurrences of the query results.
Here it is clearly visible that we do almost nothing with the analytical machinery. It’s only the surface text and the anchors that we need.
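The compilation of the XML source can be sketched as follows. The element and attribute names (verse, w, monad) and the placeholder word texts are assumptions for illustration, not the demo's actual format:

```python
# Sketch: compiling a minimal XML source from the surface text plus
# monad anchors. Element and attribute names are illustrative.
import xml.etree.ElementTree as ET

verse = ET.Element("verse", book="Genesis", chapter="1", verse="1")
for monad, text in [(1, "word-one"), (2, "word-two"), (3, "word-three")]:
    w = ET.SubElement(verse, "w", monad=str(monad))  # monad = anchor
    w.text = text

xml_out = ET.tostring(verse, encoding="unicode")
assert 'monad="2"' in xml_out
```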
That being said, it is not that difficult to also export some of the features and show them in the interface, possibly on demand. If we take that too far, the web interface will be a competitor of the commercial bible study software in which the WIVU is packaged.
Whether that is a good thing or a bad thing depends on your perspective. From the perspective of DANS , as an enabler of data re-usage, it is definitely a good thing!
Rendering Application¶

The web-application that renders the sources shows individual chapters of individual books of the Hebrew Bible.
The browser-screen is divided into a left, middle and a right column.
The middle column contains selection controls for books and chapters, and below that displays the selected chapter. The text is rendered in Hebrew characters, from right to left. This is the primary data, the source, and it is pulled from a MySQL database, called wivu.
The left and right columns contain annotations, which are pulled from a different database, called oannot. This database has tables for the bodies, targets and metadata records of annotations. It has a table for annotations as well. An annotation record consists of an identifier only; the body and its targets are connected to it by means of link tables.
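The shape of the oannot database can be sketched like this. The demo uses MySQL; this self-contained sketch uses SQLite, and all table and column names are illustrative, not the demo's actual schema:

```python
# Sketch of the oannot schema: an annotation is a bare identifier, with
# its body, targets and metadata attached via link tables.
# Table and column names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE annotation (id INTEGER PRIMARY KEY);
CREATE TABLE body       (id INTEGER PRIMARY KEY, content TEXT);
CREATE TABLE target     (id INTEGER PRIMARY KEY, monad INTEGER);
CREATE TABLE metadata   (id INTEGER PRIMARY KEY, type TEXT, researcher TEXT);
-- link tables connect an annotation to its body, targets and metadata
CREATE TABLE ann_body   (ann_id INTEGER, body_id INTEGER);
CREATE TABLE ann_target (ann_id INTEGER, target_id INTEGER);
CREATE TABLE ann_meta   (ann_id INTEGER, meta_id INTEGER);
""")
db.execute("INSERT INTO annotation (id) VALUES (1)")
db.execute("INSERT INTO body VALUES (1, '... MQL query text ...')")
db.execute("INSERT INTO ann_body VALUES (1, 1)")

# Fetch the body of annotation 1 through the link table:
row = db.execute("""
    SELECT b.content FROM annotation a
    JOIN ann_body ab ON ab.ann_id = a.id
    JOIN body b ON b.id = ab.body_id
""").fetchone()
assert row[0].startswith("...")
```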
The left column contains highlight controls for features. Features are annotations that have bodies of the form key = value, and they can be distinguished from other annotations through a field called type in the metadata record. The targets of a feature annotation are all the words that carry that feature-value.
The right column contains the queries that have results in the selected chapter. The results can be highlighted per query, and more information about the query can be shown on demand.
Implementation notes¶
The current version is built as a web application inside a web2py framework.
In a previous version, the rendering of the source texts in the middle column was done with the Java-Hibernate framework: the database objects are made accessible to Java and from there to the web server. In the second version, however, we changed to something much more light-weight. The program does not make very extensive use of the data, so it is not really worthwhile to introduce a data abstraction layer. Moreover, such a layer makes the performance of the application as a whole less perspicuous: there is a heavy overhead at first, and smoother behaviour later on. We needed more control with respect to both functionality and performance.
The rendering of the queries in the right hand column is done in a different way. We needed more intricate queries to select the relevant items. There is also another reason not to knit left and right too closely together. The annotations are to come from a different world, on demand, as a kind of stand-off mark-up. They are an overlay over the sources themselves. The evolution of rendering the sources should not be tethered to the evolution of rendering the annotations.
A short remark about the highlighting functionality: it is realised by client-side JavaScript. The query that fetches the query-cum-results gathers enough information to know which words should be highlighted per query. This information is written out to JavaScript associative arrays inside a <script> tag in the body of the HTML that is sent to the browser. For the features in the left column it works in essentially the same way.
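Server-side, writing that highlight data into a script tag can be sketched as follows; the variable name and data shape are illustrative, not the demo's actual code:

```python
# Sketch: serialising per-query highlight monads as a JavaScript
# associative array inside a <script> tag. Names are illustrative.
import json

highlights = {"Q1": [101, 102, 250], "Q2": [102, 500]}
script = "<script>var highlights = %s;</script>" % json.dumps(highlights)
assert '"Q1": [101, 102, 250]' in script
```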
The success of the web2py framework is apparent from the following observation: the amount of our own coding on top of the framework is embarrassingly small. Here is an overview of the number of lines we coded in each formalism (x = not applicable).
language | web app | data preparations |
---|---|---|
sql | 90 | 80 |
python | 250 | x |
perl | x | 650 |
javascript | 300 | x |
html | 50 | x |
css | 60 | x |
shell-script | x | 280 |
Data Statistics¶
Here is an overview of how much source data and how many annotations this application deals with.
quantity | amount | extra info |
---|---|---|
source texts | 90 | 80 |
all annotations | 250 | x |
all targets | x | 650 |
query annotations | 300 | x |
query targets | 50 | x |
feature annotations | 60 | x |
feature targets | x | 280 |
Functionality¶
Basic Features (implemented)¶
The source text is rendered (per individual chapter) with results of selected queries highlighted.
This shows the beginning of the experience that you can navigate over a web of research questions, researchers, query instructions, query results and source texts.
The most critical part is to encounter results in source texts and navigate to the relevant queries.
Extensions (not implemented)¶
- More features of the source text
  - selective highlighting of part-of-speech features of words
  - mark clauses and phrases by brackets
- More navigational aids
  - different colors for different queries
  - better representation of the structure of a query result
- Better display of metadata
  - show full details of a query on demand (e.g. on hover)
- Query management
  - make a straightforward list interface on the metadata records of the queries
  - make all columns sortable
  - allow editing of fields
  - allow adding new queries
Adding queries (not implemented)¶
There are two ways of adding queries: as a generic annotation or as a verified query.
Generic Annotation
Adding as a generic annotation means that the user supplies an annotation body, specifies the targets in the source text, and provides metadata; there is no checking whatsoever. The body does not need to be a query instruction, and the targets do not have to be the result of any query. It is just a free annotation, which can say anything about anything.
Of course, it is perfectly possible that a user runs a query on his system at home, collects the results, and puts everything into an annotation. As long as the anchors for the words (the monad numbers) in his own version of the sources correspond exactly with the anchors in the server version of the sources, all is well.
Verified Query
Adding as a verified query means that the annotation to be added really is interpreted as a query. The user may or may not send his own results along. In any case, the query will be run against the database of the server, and those results will be stored when the query is saved as an annotation. If the results do not agree with the results specified beforehand by the user, he will get a warning.
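The verified-query flow can be sketched as follows. `run_query` is a hypothetical stand-in for executing MQL against the server database; the function and field names are assumptions:

```python
# Sketch of the verified-query flow: run the query server-side, store
# the server's results as the annotation's targets, and warn when they
# differ from results the user sent along. run_query is a stand-in for
# the MQL engine; all names are illustrative.
def save_verified_query(query_text, user_results=None, run_query=None):
    server_results = set(run_query(query_text))
    warning = None
    if user_results is not None and set(user_results) != server_results:
        warning = "server results differ from the results you supplied"
    # the annotation stores the server's results, not the user's
    annotation = {"body": query_text, "targets": sorted(server_results)}
    return annotation, warning

# Illustrative stand-in for the query engine:
fake_engine = lambda q: [101, 102]
ann, warn = save_verified_query("...", user_results=[101],
                                run_query=fake_engine)
assert ann["targets"] == [101, 102] and warn is not None
```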
Linked Data Interoperability¶
Having the annotations in a local database falls short of the goal of sharing the query results freely on the web. So that will be the next thing to do: set up a triple store and export/import mechanisms between the triple store and the local database. The triple store can then be queried by quite different means: SPARQL queries. The annotations become discoverable, and it will be possible to collect annotations from various sources.
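Exporting a locally stored annotation to triples can be sketched with plain string formatting. The `example.org` URIs and the monad URI scheme are assumptions; a real export would settle on the Open Annotation vocabulary and stable URIs:

```python
# Sketch: exporting a locally stored annotation as N-Triples.
# The predicates follow the Open Annotation vocabulary in spirit;
# exact URIs and the target URI scheme are illustrative.
ann = {"id": "q42", "body": "http://example.org/body/q42",
       "targets": [101, 102]}

base = "http://example.org/annotation/"
triples = []
triples.append("<%s%s> <http://www.w3.org/ns/oa#hasBody> <%s> ."
               % (base, ann["id"], ann["body"]))
for monad in ann["targets"]:
    triples.append("<%s%s> <http://www.w3.org/ns/oa#hasTarget> "
                   "<http://example.org/monad/%d> ." % (base, ann["id"], monad))

assert len(triples) == 3
```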
Timeline¶
Below are the most significant events in the history and future (?) of this demo, listed in reverse temporal order.
2013-05-02 Workshop Electronic Tools in Biblical Hebrew Research and Teaching¶
A workshop where the methods of Biblical research are demonstrated and a preview of new ways of sharing the results of that research is given. Eep Talstra, the retiring professor, gives a concrete illustration of sorting, searching and simulation on the basis of the Hebrew text database he and his group have developed. After the workshop, his successor, Wido van Peursen, takes over with his inaugural lecture “Grip op Grillige Gegevens” (getting a grip on capricious/raw data).
2013-05-01 SHEBANQ project starts¶
Today we had the kick-off of the CLARIN-NL project that aims to implement the idea of queries-as-annotations in a production system, running at DANS, which acts as a CLARIN centre here.
2012-12-18 SHEBANQ project granted by CLARIN-NL¶
System for HEBrew text: ANnotations for Queries and markup is a 100 k€ project proposal for call 4 of the Dutch section of the European project CLARIN. With this project, the Old Testament group at the VU University Amsterdam, led by Wido van Peursen, and DANS will represent the renowned WIVU database in the Linguistic Annotation Format. It will then be possible to save meaningful queries on that database as (open) annotations. We will build a query saver as a web application. A project like this is important to get digital biblical scholarship out of local computer systems and into the clouds.
2012-10-29 Paper presented at the ASIST conference¶
Here are the slides. And here is the article in the proceedings (warning: this is a slow link; the result is a 6 MB image). Much more convenient is our final submission, which is a searchable pdf.
2012-07-25 Paper accepted for the ASIST conference¶
With Charles van den Heuvel I submitted a paper, “Annotation as a New Paradigm in Research Archiving”, in which we advertise new practices for archives in order to really support modern research. The showcases are the CKCC project and the Hebrew database. The paper has been accepted and will be presented on 29 October 2012 in Baltimore.
2012-03-26 Towards portable annotations¶
Version 3.0: preparation for portable annotations. The targets in the texts used to be addressed by the word sequence number in the whole WIVU text. Now the address consists of book-chapter-verse plus a local word sequence number, i.e. the sequence number within the verse. We prepared the Westminster version of the Hebrew Bible with the aim of displaying it together with the WIVU version, in such a way that words common to both versions have common addresses.
2012-03-19/20 Interedition Symposium at Huygens ING¶
(abstract accepted; to present an enhanced demo compared to the one shown at the Lorentz Workshop)
2012-03-14 Features as Annotations¶
Version 2.1: after a recent internal bootcamp we have brought new features to the demo: features as annotations. The user can select a number of feature-value pairs and highlight them in a color of his choice.
2012-03-07 Ported to web2py¶
The demo has been ported to the convenient, python-based web2py framework. Ready to receive more functionality.
2012-02-06/10 Lorentz workshop¶
(encouragement to work further on ways of sharing research, see Lorentz Workshop; queries-as-annotations still stands, but the idea must be worked out more compellingly)
2012-01-30/02-03 DANS mini bootcamp: Eko and Dirk¶
Building a demo for the sole purpose of realising the query-as-annotation idea in a real research context. It is not optimised for performance, it has no security measures whatsoever, and the user interface is bleak. We await feedback and suggestions from the participants of the Lorentz workshop. Depending on that, we hope to improve, extend and present an improved version at a following workshop. The demo and its documentation are hosted on a cloud server, hired by DANS.
2012-01-11/14 Interedition Bootcamp¶
Inspiration as to the content and the implementation of the query-as-annotation idea. Dirk participated in a Linked Data subgroup. We built services to automatically annotate place names in Arabic, Greek and English texts. This we did by looking up each and every word in the Geonames database. The hits were translated into Open Annotations.
Topics As Annotations¶


letter from Christiaan Huygens
Application¶
The application consists of two parts: a set of Perl scripts for data conversion and a Web2Py web application for the end user.
Data conversion¶
A number of Perl scripts extract the data from the CKCC corpus. The output is then transformed into several SQL files, to be bulk imported into an SQL database.
There are two databases: ckcc, for the CKCC textual data, and cannot, for the annotation data, which consists of topics and keywords.
Web Application¶
The Web Application is based on the web2py framework. The directory taa_1_0 contains the complete web app. It can be deployed on a web2py installation, but it will only work if it can connect to the relevant databases.
Description¶
Author: Dirk Roorda (References).
This is the motivation and explanation of the Topics/Keywords-As-Annotations (demo)-application (References). I am preparing a joint paper with Charles van den Heuvel (References) and Walter Ravenek (References) about the underlying idea.
I am also working on a new idea: Portable Annotations (References), or: annotations that remain usable across variants of resources. Another example of the versatility of annotations as a carrier for the results of scholarship is the Queries/Features as Annotations (References) demo application (QFA). Queries and their results are meaningful objects in the scholarly record, but how do we preserve them? We explore the idea to store queries on a corpus as annotations to that corpus.
Contributors:
- Walter Ravenek (References), who delivered the data and annotations.
- Charles van den Heuvel (References).
- Eko Indarto (References)
- Vesa Åkerman (References)
- Paul Boon (References)
The TKA (References) demo app is a visualiser of topic annotations on a corpus of 3000+ letters by the 17th-century Dutch scholar Christiaan Huygens.
The Case¶
Topic detection and modelling¶
Extracting topics from texts is as useful as it is challenging. Topics are semantic entities that may not have easily identifiable surface forms, so it is impossible to detect them by straightforward search. Topics live at an abstraction level that does not care about language differences, let alone spelling variations. On the other hand, if you have a corpus with thousands of letters in several historical languages, and you want to know what they are about without actually reading them all, a good topic assignment is a very valuable resource indeed.
There are several ways to tackle the problem of topic detection, and they vary along the dimensions of the quality of what is detected, the cost of detection, and the ratio between manual and automatic work. Several of these methods have been (and are being) tried out in the Circulation of Knowledge project, a.k.a. CKCC (References). See also a paper by Dirk Roorda (References), Charles van den Heuvel (References) and Erik-Jan Bos (References), delivered at the Digital Humanities Conference in 2010 (References), and a paper by Peter Wittek and Walter Ravenek (References) about the topic modelling methods that have been tried out (References).
Preserving (intermediate) results¶

manual keyword assignments
The purpose of the present article and demo is not to delve into topic detection methods. Our perspective is: how can we gather the results of work done and make it reusable for new attempts at topic modelling and detection? Or for other ways to uncover the semantic contents of the corpora involved?
At this moment, CKCC (References) has not obtained fully satisfactory results in topic modelling. But there are:
- results in automatic keyword assignments,
- manual assignments of topics in a subset of the corpus,
- automatic topic detection on the basis of the LDA algorithm.
The outcomes of this work are typically stored in databases, or in collections of plain-text or CSV files. They involve internal identifiers for the letters. They cannot be readily visualised.
I propose to save this work as annotations, targeting the corpus.
The interface¶
TKA (References) is a simple demonstration of what can be done if you store keyword and topic assignments as annotations. Here is an overview of the limited features of this interface. Bear in mind that very different interfaces can be built along the lines of this sources-with-annotations paradigm.
This interface is designed to show the basic information contained in the letters and the keyword/topic assignments.
The columns¶
The rightmost column is either the text of a selected letter, or a list of letters satisfying a criterion.
The other three columns are the keywords and topics that are associated with the displayed letter, or with any of the letters occurring in the list on the right.
The source¶
The source is the complete correspondence of Christiaan Huygens, which consists predominantly of letters in French, with a few Dutch letters as well. The texts derive from the TEI-marked-up texts as used by the CKCC project (References).
Keywords and Topics¶
There are two kinds of keywords. In the leftmost column you see the manually assigned keywords. If you hover with the mouse pointer over the space to the right of them, you see the author of the assignment. If you click on them, you get a list of all letters that have that keyword manually assigned to them.
In the middle column you see the automatically detected keywords (only if you have selected a single letter or a subset of letters).
In the third column you find the topics. These topics are the result of an automatic attempt at topic detection. A topic is a collection of words which together span a semantic field. Which field is often hard to infer, and quite often there does not seem to be a common denominator at all. The words of a topic each have a relative weight with which they contribute to that topic. This weight shows up as a number between 1 and 100, to be interpreted as a percentage.
The topics are represented by their three most heavily contributing words. If you want to see all contributing words, click on the +++ following the words.
Topics have been assigned to letters with a certain confidence. This confidence is visible in the single letter view: it is the number between 1 and 100 before the three words.
Weights and confidence have different scales in the database than on the interface: I have used calibration to a scale from 1 to 100 and rounding for readability purposes.
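The calibration mentioned above can be sketched as follows (a hypothetical helper: the actual database scales are not specified here, so the raw range is an assumption):

```python
def calibrate(value, lo, hi):
    """Map a raw weight/confidence from the database scale [lo, hi]
    to a readable integer between 1 and 100 (a percentage-like value)."""
    if hi == lo:
        return 100
    scaled = 1 + 99 * (value - lo) / (hi - lo)
    return int(round(scaled))

# For example, a raw confidence of 0.42 on an assumed 0.0-1.0 scale:
print(calibrate(0.42, 0.0, 1.0))  # 43
```

The rounding deliberately sacrifices precision for readability on the interface; the exact values remain in the database.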
You can highlight keywords and topics in the text, selectively.
The Idea¶
The main idea is to package the outcomes of digital scholarship into sets of annotations on the sources. Our model for annotations is that of the Open Annotation Collaboration (References): an annotation has one body and one or more targets. Metadata can be linked to annotations.
Keywords as Annotations¶
In the present case, keywords and topics are targeted to letters as a whole. So the granularity of the target space is really coarse.
For keywords, the modeling as annotation turned out to be a straightforward matter. A keyword assignment by an expert or algorithm corresponds to one annotation with the keyword itself as body, and the letters to which that keyword is assigned as the (multiple) targets. The author of the assignment is added as metadata.
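Such a keyword assignment can be sketched as a simple record (the field names, the keyword and the author label here are illustrative, not the actual schema):

```python
# One keyword annotation: one body (the keyword itself), multiple targets
# (letter identifiers from the corpus), the assigner recorded as metadata.
keyword_annotation = {
    "body": "dioptrique",             # hypothetical keyword
    "targets": ["1139a", "1201"],     # letter ids; "1201" is invented here
    "meta": {"author": "expert-1"},   # hypothetical author label
}

# A letter carries the keyword iff its id occurs among the targets:
print("1139a" in keyword_annotation["targets"])  # True
```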
Topics as Annotations¶
The mapping from topics to annotations is not so straightforward. Several alternative ways of modelling are perfectly possible. The complication here is the confidence factor with which a topic is assigned to a letter. This really asks for a three-way relationship between topics, letters and numbers, but the OAC (References) model does not cater for that. We could work around this by adding relationships to our annotation model, but that would defeat the purpose of the whole enterprise: packaging scholarship into annotations to make them more portable. So we want to stick to the OAC (References) model.
Here is a list of remaining options.
Confidence as metadata¶
Topic is body, letters are targets. The confidence is an extra metadata field.
Technically, this is a very sound solution, because the confidence really is a property of the assignment relation.
But the confidence is the outcome of an algorithm and as such a piece of the data. Treating it as metadata will cause severe surprises in processing chains that treat metadata very differently from data.
Topics as target¶
Confidence is body, topic is target, letters are other targets.
Technically, this is doable. But it is fairly complex and it asks for a rather complex interpretation of the targets: in order to read the topic off an annotation, one has to find the one target of it that points to a topic.
With respect to interpretation: under the standard interpretation of annotations we would read that the confidence is what the annotation says about a topic and a set of letters. This is odd, especially since the bodies of keyword annotations do contain the keywords themselves. This makes it much harder for interfaces to show topics and keywords in a uniform way.
Confidence as target¶
Topic is body, confidence is target, letters are other targets.
Technically this is no different than the previous case.
The interpretation is even more odd than in the previous case, since the target is a single number. As if we really have web resources around that only contain one number. Not natural.
Combine topic and confidence into the body¶
This is the option chosen for the demonstrator.
The body is structured, it contains a topic and a confidence, only letters are targets.
The only technical complication is the structured body.
The interpretation is just right: an annotation asserts that this topic - confidence combination applies to this set of letters.
There is another price to pay: we cannot subsume all letters that are assigned a specific topic as the targets of a single annotation. Every letter - assigned topic combination requires a separate annotation, because of the distinct confidence numbers. Any interface that wants to present the letters for a given topic will have to dig into the structure of the annotation bodies. This limits the genericity of the approach.
Express confidence in an annotation on an annotation¶
Topic is body of annotation1, letters are targets of annotation1, confidence is body of annotation2, annotation1 is target of annotation2.
Technically, it is more complicated to retrieve these layered topic assignment annotations, it is a cascade of inner joins. But it is doable.
The interpretation is completely sound. Annotations on annotations is intended usage, and the confidence is really a property of the basic topic assignment.
The Work¶
Annotations in Relational Databases¶
Here I discuss how the annotations have been modelled as relational databases.
Indeed, these annotations have not been coded into RDF, they have no URIs, so they do not conform to OAC (References). Instead, they have been modeled into relational database tables.
There are several reasons for this:
1. the annotations should not be tied to specific incarnations of the sources;
2. the annotations are to drive interfaces in different ways, depending on what it is that is being annotated.
The main reason for 1. is that those sources are not yet publicly online, or, if they are, they do not yet have stable URIs by means of which they can be addressed.
Because of 2. we need random access to the annotations with high performance. The easiest way to achieve that is to have them ready in a relational database.
Nevertheless, there is a sense in which we conform to OAC (References): the annotations reside in a different database than the sources do, and the link between annotations and targets is a symbolic one, not checked by the database as foreign keys. So the addressing of targets is very flexible.
This establishes modularity between sources and annotations, and the intended workflow to deal with real OpenAnnotations (References) is:
1. if you encounter a set of interesting RDF annotations on a source that you have in your repository: import the RDF annotations into a relational database, and translate target identifiers into database identifiers;
2. if you want to export your own set of annotations to the Linked Data web: export the database annotations to a webserver, translating target identifiers to URIs pointing to your sources.
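The target-identifier translation in the import step could be as simple as a lookup table (the URI and the mapping below are invented for illustration; only the letter id 1139a comes from the corpus):

```python
# Map public URIs of letters to the identifiers used inside the local database.
uri_to_local = {
    "http://example.org/huygens/letters/1139a": "1139a",  # hypothetical URI
}

def localize_target(uri):
    """Translate an imported annotation target URI into a local identifier."""
    return uri_to_local[uri]

print(localize_target("http://example.org/huygens/letters/1139a"))  # 1139a
```

The export direction is the inverse mapping: database identifiers back to URIs pointing at the published sources.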
Data model for annotations¶


The data model for annotations containing topics and keywords is as straightforward as possible. Some observations:
1. There are separate databases for topics, which act as bodies of the annotations, and for the annotations themselves.
2. Topics are symbolically linked to the annotation database, not by database-enforced foreign keys.
3. Topics do not have external ids, so the annotations link to them by means of their database id. This is a challenge if you need to export topics as web resources.
4. Bodies have structure: there are three fields:
   1. **bodytext** for bodies that are ordinary text strings, like keywords;
   2. **bodyref** for bodies that are database objects themselves; this field is meant to contain the id of the body object; the application is meant to ‘know’ in which table these objects are stored;
   3. **bodyinfo** for additional information inside the body; here we use it for storing the confidence factor, which is a floating point number stored as a sequence of characters.
5. There is no sharing of targets between annotations: the database model admits only one annotation per target. If we allowed target sharing, we would need an extra cross table between annot and target, which would burden all queries with many more inner joins. Whether efficiency suffers from or improves by this choice I have not investigated. In QFA (References) I have used target sharing, and it worked quite well in a corpus with nearly 500 000 targets. Especially the targets of the features there would have caused the target table to explode if sharing were not used.
6. The letters, which are the sources that are targeted by the annotations, are in a separate database as well. They carry ids given with the corpus. These are the ids that are used as the symbolic targets in the target table of the annotations.
7. There is no split between the annotations and their metadata. The reason for this integration is that the annotation machinery should have decent performance. Most queries sift through annotations by metadata, so for this demo I chose a simple solution.
The main intention behind these choices is to keep the interface between the annotation model and the real-world objects that constitute the bodies and the targets as free as possible from database-specific constraints. We want to use the model for more than one kind of annotation!
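A minimal sketch of this data model in SQL, run here through Python's sqlite3 (table and column names follow the description above; the topic words and values are invented, and the real schema may differ in detail):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- bodies that are database objects themselves (here: topics)
CREATE TABLE topic (id INTEGER PRIMARY KEY, words TEXT);

-- annotations with a structured body; note: no foreign key on bodyref,
-- the link to the topic table is purely symbolic
CREATE TABLE annot (
    id INTEGER PRIMARY KEY,
    metatype TEXT,      -- 'keyword' or 'topic'
    metasubtype TEXT,   -- 'manual' or 'auto'
    bodytext TEXT,      -- plain text bodies (keywords)
    bodyref INTEGER,    -- id of a body object (topics)
    bodyinfo TEXT       -- extra info, e.g. the confidence as a string
);

-- one target per row; no sharing of targets between annotations
CREATE TABLE target (
    id INTEGER PRIMARY KEY,
    annot_id INTEGER,   -- the annotation this target belongs to
    anchor TEXT         -- symbolic identifier of the source fragment
);
""")
db.execute("INSERT INTO topic VALUES (1, 'lens verre lunette')")
db.execute("INSERT INTO annot VALUES (1, 'topic', 'auto', NULL, 1, '0.73')")
db.execute("INSERT INTO target VALUES (1, 1, '1139a')")

# All topic assignments for letter 1139a, with their confidence:
row = db.execute(
    "SELECT t.words, a.bodyinfo FROM annot a "
    "JOIN topic t ON t.id = a.bodyref "
    "JOIN target g ON g.annot_id = a.id WHERE g.anchor = '1139a'"
).fetchone()
print(row)  # ('lens verre lunette', '0.73')
```

Note that the join from annot to topic is written out explicitly by the application; the database itself does not enforce that bodyref points anywhere, which is exactly the flexibility argued for above.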
Data Statistics¶
Since performance is an important consideration, here are some statistics of the sources and annotations of TKA (References).
quantity | amount | extra info |
---|---|---|
Number of letters (in the Christiaan Huygens corpus) | 3090 | 13MB |
Number of topics | 200 | 100 french, 100 dutch |
Number of words in topics | 2202 | |
Total number of annotations (keyword manual, keyword auto, topic) | 18884 | |
Total number of targets | 37468 | |
Keyword (manual) annotations | 801 | |
Keyword (manual) targets | 859 | |
Keyword (auto) annotations | 11721 | |
Keyword (auto) targets | 29547 | |
Topic annotations | 6362 | |
Topic targets | 7062 |
Lessons Learned¶
Not all annotations are equal¶
The annotation model is very generic, and many types of annotation fit into it. Here we saw several kinds of keywords and topics, each with different glitches. In the QFA (References) demo there are linguistic features as annotations and queries as annotations, which require completely different renderings.
So the question arises: what is the benefit of the single annotation model if real world applications treat the annotations so differently?
And: how can you design applications in such a way that they benefit optimally from the generic annotation model? Now that we have interfaces for at least three real world type annotations we are in a position to have a closer look, and to gather the lessons learned.
The benefits of a unified model¶
A basic interface for annotation¶
Interfaces come and go with the waves of fashion in ICT. Most of them will not be sustainable in the long term. If the interface draws from data that is modeled to cater for the needs of the interface, it will be hard to re-use that data when the interface has gone. Moreover, even while the intended interface still exists, it is better if the data can be used in other, unintended applications. If the data conforms to the annotation model, there is at least a generic way to discover, filter and render annotations. This is very good if you are interested in the portability of the scholarly work that is represented in annotations.
Anchors for annotations¶
Annotations point to the resources they comment on. OAC (References) even requires that this pointing is done in the Linked Data way: by proper HTTP URIs. If those resources are stable, maintained by strong maintainers such as libraries, archives and cultural heritage institutions, it becomes possible to harvest many sorts of annotations around the same sources. This is an organizing principle that is quite new, and from which huge benefits for data mining and visualisation are to be expected.
However, this is only interesting if the URIs leading to the resources are stable, and if it is possible to address fragments of the resources as well. To the degree that we have stable anchors for fragments, the OAC (References) targeting approach is nearly ideal.
Absolute addressing versus relative addressing
In real life there are several scenarios in which there is no stable addressing of (fragments of) resources. This happens when resources go off-line into an archive. If we want to restore those resources later on, the means of addressing them from the outside may have changed. Moreover, there might not be a unique, canonical restored incarnation of the resource. How can one use old, archived annotations for such a resource?
The solution adopted in QFA (References) and here in TKA (References) is to work with localized addresses. These are essentially relative addresses that point to (fragments of) local resources that are part of a local corpus.
There is a FRBR (References) consideration involved here. FRBR (References) makes a distinction between work, expression, manifestation and item. A work is a distinct intellectual or artistic creation; as such it is a non-physical entity. Expression, manifestation and item point to increasing levels of concreteness: an item is an object in the physical world. Wikipedia illustrates these four concepts with an example from music:
FRBR (References) concept | example | keyword |
---|---|---|
work | Beethoven’s Ninth Symphony | distinct creation |
expression | musical score | specific form |
manifestation | recording by the London Philharmonic in 1996 | physical embodiment |
item | record disk | concrete entity |
The full refinement of these four FRBR (References) concepts is probably not needed for our purposes. Yet the distinction between the work, which exists in an ideal, conceptual domain, and its incarnations, which exist in a lower, more physical layer of reality, is too important to ignore. It matters for the ways by which we keep identifiers to works and incarnations stable. Identifiers to works identify within conceptual domains; they have no function to physically locate works. These identifiers are naturally free of the ingredients that make a typical hyperlink such a flaky thing. So whenever annotations are about aspects of a resource at the work level, they had better target those resources by means of work identifiers. By the way, the distinction between work and incarnation also applies to fragments of works. Most subdivisions, like volumes, chapters and verses, exist at the work level. Of course, there are some fragments that are typically products of the incarnation level, such as pages.
Portable Annotations
How does this discussion bear on our concrete demo application?
Suppose we have a set of annotations to some well-identified letters in the Christiaan Huygens corpus. These annotations may also be relevant to the same letters in another incarnation of that corpus. This other incarnation might be another text encoding, a different media representation, or even another version with real content differences. With well-chosen, relative addresses, it is possible to make the targeting of annotations more robust against such variations.
Then a package of annotations on the Christiaan Huygens letters, made in the initial stages of the CKCC project (References), can be stored in an archive. Later, it can be unpacked and applied to new incarnations of those letters. If other research groups have curated those letters, chances are good that this annotation package can also be applied to those versions.
This all works best if the sources themselves and their fragments have work-level identifiers that are recognized by whoever is involved with them. Even if this is not the case, it is easier to translate between rival identifier schemes at the work level, than to maintain stable identifiers at the incarnation level.
There are no mathematically defined boundaries between works and incarnations. Even FRBR (References) leaves much to interpretation. With a bit of imagination it is easy to define even more FRBR (References)-like layers in some applications. And then there is the matter of versioning: to what extent are differing versions incarnations of the same work? This is really a complex issue, and I plan to devote a completely new chapter plus demo application to it. See Portable Annotations (References).
Metadata and Annotations¶
The OAC (References) model defines an annotation as something that is a resource in itself. That means that annotations can be the targets of other annotations (and probably of themselves as well, but I do not see a use case for that right now), and that annotations can be linked to metadata. So metadata is not part of the OAC (References) model, but the fact that metadata on annotations is around is well accommodated.
With metadata the divergence sets in. Concrete applications need metadata to filter annotations, but there is no predetermined model for that metadata. So here is a point where applications become sensitive to the specifics of the information around annotations. Alternatively, applications might discover the metadata of annotations and make educated guesses as to the filtering of the annotations they want to display. Very likely this will be so computation-intensive that a preprocessing stage is needed, in which the metadata that is around is indexed according to an application-specific metadata model.

metadata fields in Queries/Features As Annotations

metadata fields in Topics As Annotations
Let me conclude with an account of the metadata that the TKA (References) and QFA (References) demos needed to function. See the screenshots on the right.
Type and Subtype
One of the most important characteristics is the type of the annotation.
In QFA (References) it tells whether the annotation expresses a (linguistic) feature, or a query and its results. As features and queries are displayed differently, it is important to be able to select on type and subtype, and to do it fast (that’s why there is an index on these columns).
In TKA (References) there is a field metatype which is used to distinguish between keywords and topics, and a field metasubtype which distinguishes between manual annotations and automatic annotations, i.e. the results of algorithms.
Provenance
In a world where annotations are universal carriers for scholarship, provenance metadata is of paramount importance. Without it, it would be very difficult to assess the relevance and quality of the annotations that one discovers around a resource. For QFA (References) there are, even in demo setting, a handful of fields: researcher, research question, date_created, date_run, publications.
A typical use case for QFA (References) is this: a researcher in the future comes across some targets of a query annotation by way of serendipity. He navigates to the body of the annotation, which is a query instruction. Very likely he has no means at hand to run that query, but he can look up the other query results. Apart from that, he is able to see why this query has been designed, which question it answers, who has done it and when, and to which publications this has led. Using this information he can find related research questions, queries and results. In this way a quite comprehensive picture of past and ongoing scholarship around these sources can be obtained, without maintaining the engines that once ran all those queries.
Among the possible use cases for TKA (References) is this one: in order to get algorithmic access to the semantics of all those tens of thousands of 17th-century letters, various methods are tried to get keywords and topics. Some methods yield keywords in purely automatic ways. Other methods require training by manual topic assignments by experts. The methods must be tested against test data. Parameters will be tweaked, outcomes must be compared. Precision and recall statistics indicate the success of those methods. Yet something is missing: a view on all those keyword/topic assignments in context, where you can switch on and off different runs of the algorithms, and where you can assess the usefulness of the assigned labels and hit upon the obvious mistakes.
If all topic/keyword assignments are expressed as annotations, then it is the provenance metadata that enables the application to selectively display interesting sets of annotations.
Even more importantly, there is real gold among those annotations, especially the manual ones by experts. They can be used by other projects in subsequent attempts to wrestle semantic information from the data. Good provenance information combined with real portability of annotations will increase the usefulness of those expert hours of manual tagging.
Real applications driven by annotations¶
How do real applications utilize the common aspects of the annotation model, and how do they accommodate the very different roles that different kinds of annotations play in the user interface? Let me share the experience of designing demo applications driven by annotations.
Approach¶
First a few general remarks as to the approach we have chosen.
1. We wanted a broad range of annotations, in order to explore how fit annotations are to express the products of digital scholarship. That is why we considered queries, features, keywords and topics all as annotations.
2. Rather than using OAC (References) annotations with real URIs, we modeled annotations directly in relational databases. We see OAC (References) more as an interchange format, handy for exporting and publishing annotations and importing them into other applications.
3. We took care to separate the sources from the annotations: they are stored in different databases. We think that sources and annotations should be modular with respect to each other: one should be able to add new packages of annotations to an application, and one should be able to port a set of annotations from one source to another incarnation of the same work (in the FRBR (References) sense).
4. Even when we stretched the use of annotations to possibly unintended cases such as queries, we took care that the information we stored in body, target and metadata can be naturally interpreted accordingly.
5. We have not built any facility to create or modify annotations, nor sources for that matter. The motivation for these demos comes from archiving, where modifying resources is not an important use case.
Structure¶
The common denominator of these annotation rendering demo applications is that they let the user navigate to pieces of the source materials. The application retrieves the relevant annotations, and gives the user some control to filter and highlight them. In the case of QFA (References) the query results are not fetched before rendering the page, but only on demand of the user, by means of an AJAX call. The feature results are fetched immediately, together with the sources.
Differences¶
The differences are there where the abstract annotation model hits the reality of the use cases: the contents of the bodies, the addressing of the targets, the modeling of the metadata. Also the visualisation of the annotations differs.
Bodies
Bodies tend to have structure. A choice must be made whether to express that structure in plain text, or to use database modeling for it. Before discussing this choice, let me list what the bodies look like in each case.
Kind of annotation | form of body | interpretation of annotation |
---|---|---|
Query | plain text: statement in a query language | targets are results of the query |
Feature | plain text of the form key=value | targets are source fragments for which the feature key carries the value |
Keyword | plain text: the keyword string | targets are letters in which the keyword occurs |
Topic | structured: a reference to the topic plus a confidence factor | targets are letters to which the identified topic applies with a confidence factor |
Remember that topics are collections of words, where each word has a certain relative weight in that topic. This asks for plain old database modeling.
There are conflicting interests here. In our annotation model we want to accommodate annotations in their full generality. Yet the application will specialize itself around a few annotation types that are known beforehand. If the application does not specialize, its performance will not be up to the task. Still, we have tried to keep the structure of a body as generic as possible. The feature case, for example, has a plain text body, but in fact there is the additional structure of keys and values.
But the topics case really would be awkward if we had to spell out the complete topic information for each annotation with that topic in the body. So we decided to give each body three content fields:
- bodytext for plain text content
- bodyref to contain an identifier pointing to an object in a(nother) database, without foreign key checking
- bodyinfo a bit of extra info of the body on which the application may search
By abstaining from database constraints for bodyref we keep the model very versatile. If the application knows which database to look in, the body object can be found. Hence annotations with bodies that refer to arbitrary databases and tables can be accommodated without changes to the database model.
By having an extra bodyinfo field of type string, we can separate body information in two fields on which the application can efficiently perform filter operations. In the case of topics, the bodyinfo stores the confidence number. If topic bodies had been stored as plain text including this number, it would have been very inefficient to select topics regardless of this confidence number.
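The gain can be seen in a tiny sketch (hypothetical rows; the point is only that the confidence sits in its own column, so selecting by topic needs no text parsing):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Reduced annot table: bodyref points to a topic, bodyinfo holds confidence.
db.execute("CREATE TABLE annot (bodyref INTEGER, bodyinfo TEXT)")
db.executemany("INSERT INTO annot VALUES (?, ?)",
               [(1, "0.73"), (1, "0.41"), (2, "0.90")])

# All assignments of topic 1, whatever the confidence: a plain column filter.
rows = db.execute("SELECT bodyinfo FROM annot WHERE bodyref = 1").fetchall()
print(rows)  # [('0.73',), ('0.41',)]
```

Had the topic and confidence been concatenated into one plain-text body, the same selection would require string matching over every body.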
Probably the best option, but one that we have not implemented, is: express the confidence as body of a new annotation that targets the absolute topic assignment annotation.
The OAC (References) model is really concise, and there are many ways to link to additional data, which results in quite a few options to pursue when one maps topics onto annotations.
Targets

target model for queries/features

target model for keywords/topics
Targets are resources or fragments of resources. As there is no standard way to refer to fragments of resources, we decided to always use a string to denote a target. This is the field anchor, of type string. We do not fill this field with database identifiers, but with identifiers that come from the sources themselves.
As an illustration: before version 3.0 of QFA (References) we used the absolute word number of a word occurrence. The first word of the Old Testament got number 1, and the last word number 430156. Now the first word gets anchor gen_001:001^001 and the last word gets anchor ch2_036:023^038. These anchors refer to books, chapters and verses, and specify a word within a verse by its order number in that verse. The last word of the bible is thus the 38th word of verse 23 of chapter 36 of the second book of Chronicles.
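Such anchors can be taken apart with a simple pattern (a sketch, assuming the format book_chapter:verse^word exactly as in the examples above):

```python
import re

# book _ chapter : verse ^ word, as in gen_001:001^001
ANCHOR = re.compile(
    r"^(?P<book>[a-z0-9]+)_(?P<chapter>\d+):(?P<verse>\d+)\^(?P<word>\d+)$"
)

def parse_anchor(anchor):
    """Split an anchor like 'gen_001:001^001' into its components."""
    m = ANCHOR.match(anchor)
    if not m:
        raise ValueError(f"not a valid anchor: {anchor}")
    return (m.group("book"),
            int(m.group("chapter")),
            int(m.group("verse")),
            int(m.group("word")))

print(parse_anchor("ch2_036:023^038"))  # ('ch2', 36, 23, 38)
```

Because the components are meaningful at the work level, such anchors survive re-incarnations of the source better than absolute word numbers do.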
In TKA (References) only letters as a whole are anchored. For example, the letter of the example screenshot above has identifier 1139a, which is clearly not a database identifier.
An obvious difference between the target model for queries/features on the one hand and keywords/topics on the other hand is that the former allows target sharing between annotations. This is only a pragmatic issue, with no semantic consequences and not too much performance impact. The model without target sharing is definitely simpler, and fewer inner joins are needed to get from target to body, for instance. The model with target sharing does not enforce it. Even if targets are shared, there is still the need for a record in the cross table annot_target per target per annotation. So sharing only gains something if the anchor field requires a significant amount of text per anchor.
Metadata
In fact, we have very little metadata, most of it unstructured. So we can afford to store it in a separate table (in the queries/features case) or even in extra fields in the annotation record itself. As soon as the relevant metadata becomes more complex, it is better to separate it thoroughly from the annotations, and make the connection purely symbolic, like we did with the connection between target anchors and the targets themselves.
Visualisation
Each type of annotation asked for different visualisations on the interface. The common aspect is that targets are highlighted, and bodies are displayed in separate columns, one for each (sub)type of annotation. The differences between queries/features and topics/keywords are:
1. for queries, only those bodies are shown that have targets in the rendered part of the source, whereas for features all feature/value pairs are selectable, regardless of their occurrence in the displayed passage of the source;
2. query targets are simply highlighted, whereas feature targets are highlighted according to display characteristics under user control;
3. queries and features have targets at the word level, whereas keywords and topics are targeted at the letter level; the individual occurrences are highlighted by means of a generic javascript search in element content, which is less precise!
Implementation¶
In order to rapidly implement our ideas concerning annotations and sources we needed a simple but effective framework on which we could build data-driven web applications. Web2py (References) offers exactly that.
We needed very little code on top of the framework, a few hundred lines of python and javascript. Deployment of apps is completely web-based, and takes only seconds.
Most work went into the data preparation stage, where I compiled data from various origins into SQL dumps for sources and annotations. This was done by a few Perl and shell scripts, each a few hundred lines again.
Greek-Collation¶


Application¶
This application is left in an unfinished state. It is a collection of experiments, expressed in Perl scripts.
Some of these scripts have been used to transform the Jude files, as delivered by Tommy Wasserman, into plain Unicode text, which has been deposited at DANS.
History¶
Contributors¶
Dirk Roorda (References) (author of this wiki), Jan Krans (References), Juan Garcés (References) and Matthew Munson (References).
Fabric of Texts¶
A good example of a fabric of texts is the Greek New Testament, because the set of NT manuscripts exceeds 5000, their variability is substantial, and there is a century long tradition of edition making.
The task we have set ourselves is to find models and tools to facilitate new research on text fabrics like the New Testament. How can researchers dissect and recombine the data in new ways, supporting new hypotheses, and how can we exploit this unique selling point of the digital paradigm?
However, the complexity of this all is too daunting to start experimenting. That is why we have chosen a small but interesting subset: the letter of Jude. This is a one page ‘book’ near the end of the NT, of which there are variations that point to interesting differences of interpretation. Linguistic analysis at the manuscript level, and comparing the analyses for the different manuscripts could yield significant interpretation results.
While the NT fabric of texts includes over 5000 manuscripts, there are only 560 or so manuscripts in which portions of the letter of Jude occur. That makes it much more manageable, and still not trivial!
There is another very good reason to single out the letter of Jude: recently all its manuscript data has been transcribed and analysed by Tommy Wasserman (References), and he has deposited them as a dataset (References) in the DANS archive, available on request.
Context and motivation¶
This work attempts to extend a line of research that led to my Queries and Features as Annotations application (References).
My interest, as a researcher at DANS, is to find ways in which digital archives can facilitate researchers who make intensive use of data resources. Stable linking to fragments is a key requirement. We want to combine that with fragment linking across variants.
This move forces us to leave the more or less naïve concepts based on hypertext linking and embrace more involved concepts such as those of FRBR (References).
The Case¶

The data foundation for this case is the work of Tommy Wasserman (References), author of a monograph on the manuscripts of Jude (References). He kindly gave us the transcriptions of 560+ manuscripts that contain passages of Jude for experimentation. Jan Krans and Dirk have started to dissect them, and we now have them character by character in a database (1.7 million records). To be continued.
Here is a nice example of such a manuscript, from the Walters Art Museum in Baltimore. The page shown here is W.533 261, folio 129r.
The Idea¶
In a nutshell, the idea is to separate the information contained in a manuscript into layers. The main text is a layer, and everything else is stored in other layers. The connections between the layers are the anchors: character positions in the main text.
After this step we hope to be able to collate the main text layer of all manuscripts. We could even collate other layers and see what happens.
The results of the collation will be used to relate character positions across manuscripts.
The Work¶
Source¶

The source is a set of transcriptions from microfilm of all the Greek manuscripts that contain passages of the epistle of Jude. Tommy offered his own version of these transcriptions, as-is, to Jan Krans (References) and me (References) for exploratory purposes. They came in the form of text files in a non-standard encoding. Due to the software that was originally used (Collate 2.0), the Greek text was represented in Symbol Greek, not UNICODE. That is absolutely fine when using Collate 2.0, but not very interoperable with UNICODE-aware applications. So we converted all Greek text to real UNICODE Greek. Here is a screenshot.

In his book, Tommy has provided a list of the provenance of the original manuscripts, containing the year of creation and the places where they lie stored now.
Step 1: to UNICODE¶

The first step, by Jan and me, was to transform the original files into one file in UNICODE, with real Unicode Greek characters. We treated comments carefully, in order not to greekify texts that were, after all, Latin. We detected all markup, checked it, and rewrote it as XML tags.
The source code of this conversion plus a complete log and a summary of the activities of this script has been added to the dataset (References): see the folder conversion and in there the files transform.pl and transform.log and summary.log.
Step 2: Layered markup, anchored by position¶
The next step was to split the material into layers. Every passage has a source layer, containing the primary text. Every character in the source layer has a fixed character position, a number. Everything else goes into other layers: markup layers, comment layers and nomina sacra layers. Material in other layers has character positions that correspond to the character positions in the source layer. At the same time, we merged the provenance information into the transcription data. A visual representation of the result is contained in the file graphpaper.txt, in the dataset (References). A screenshot appears at the top of this page.
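A minimal sketch of such a layer split, in Python (the actual conversion was done by the Perl scripts in the dataset; the tag name used here is illustrative, not the real XML vocabulary):

```python
import re

def split_layers(text):
    """Separate a transcription with inline XML-ish markup into a
    source layer (the primary text) and secondary layers anchored by
    character position in the source layer."""
    source = []              # primary characters; position = list index
    layers = {}              # layer name -> {anchor position: content}
    pos = 0
    for token in re.split(r'(<[^>]+>[^<]*</[^>]+>)', text):
        m = re.match(r'<(\w+)>([^<]*)</\1>$', token)
        if m:
            # anchor the annotation at the current source position
            layers.setdefault(m.group(1), {})[pos] = m.group(2)
        else:
            for ch in token:
                source.append(ch)
                pos += 1
    return ''.join(source), layers

src, layers = split_layers('ιουδας <comment>margin note</comment>ιυ χυ')
# src == 'ιουδας ιυ χυ'; the comment is anchored at position 7
```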
Step 3: SQL import file of layered markup¶


The final step has been to transform the numerical representation into a real database model. The text has been divided into passages (verses). The contribution of every source (manuscript) to a passage consists of a set of layers. Each layer contains characters at certain positions. The complete data model is shown in the next screenshot. So every single character in every single manuscript occupies a layerdata record. This layerdata record also contains the address (position) of the character (relative to the character positions in the source layer). Moreover, the layerdata record is linked to the corresponding source record, passage record, and layer record. Below is a screenshot of a small fragment of the layers.sql file, which is included in the dataset (References)
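The model can be sketched with an in-memory SQLite database. Table and column names follow the query in the Exercise section; this is an illustrative reconstruction, not the actual layers.sql schema:

```python
import sqlite3

# every single character of every manuscript is one layerdata record,
# anchored by its address (character position in the source layer)
con = sqlite3.connect(':memory:')
con.executescript('''
    create table source   (id integer primary key, name text);
    create table passage  (id integer primary key, name text);
    create table layer    (id integer primary key, name text);
    create table layerdata (
        id         integer primary key,
        source_id  integer references source(id),
        passage_id integer references passage(id),
        layer_id   integer references layer(id),
        address    integer,   -- position relative to the source layer
        glyph      text
    );
''')
con.execute("insert into source  values (1, '142')")
con.execute("insert into passage values (1, '1')")
con.execute("insert into layer   values (1, 'SRC-NS')")
con.execute("insert into layerdata values (1, 1, 1, 1, 8, 'ι')")
row = con.execute('''
    select layerdata.address, layerdata.glyph, source.name, passage.name
    from layerdata
    inner join source  on layerdata.source_id  = source.id
    inner join passage on layerdata.passage_id = passage.id
    inner join layer   on layerdata.layer_id   = layer.id
    where layer.name = 'SRC-NS'
''').fetchone()
# row == (8, 'ι', '142', '1')
```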
Intermezzo¶
So much for the contents of this dataset. The question is: what can you do with it?
Exercise¶
A first exercise is to get all nomina sacra of the main text. The following sql query will do the trick:
use jude;
select
layerdata.address,
layerdata.glyph,
source.name,
passage.name
from
layerdata inner join source
on layerdata.source_id = source.id
inner join passage
on layerdata.passage_id = passage.id
inner join layer
on layerdata.layer_id = layer.id
where
layer.name = 'SRC-NS'
order by
passage.id,
source.name,
layerdata.address
and the initial part of the result is:
address | glyph | source | passage |
---|---|---|---|
8 | ι | 142 | 1 |
9 | υ | 142 | 1 |
11 | χ | 142 | 1 |
12 | υ | 142 | 1 |
48 | θ | 142 | 1 |
49 | ω | 142 | 1 |
Step 4: Collation¶
The following task is: use collation software (such as CollateX (References)) to collate the source layer. Based on the collation results we can then build a table that links character positions in one source to character positions in another source. This will yield an enormous web of interrelated character positions. If this web is stored as a new database table as well, then we have a starting point to build convenient visualizations of all the material that is relevant to a researcher of the passages of Jude.
But the first thing is: does an automatic collation yield good enough results to serve as foundation for the position linking? And can we do the position linking effectively? If we have to link every pair of manuscripts explicitly, we incur an enormous overhead, since there are more than 125,000 pairs of transcriptions.
So maybe the collation will give us a master source against which we can link all real sources in a bidirectional way. Transpositions are a complicating factor here. An idea could be to remove the concept of order from the master source, so that it becomes a bag-of-words. Since the master source only has to serve as a set of linking points, it is no longer a requirement that we must be able to reconstruct the variants from the master. Why should we, if we have the variants and keep them intact?
Collation with CollateX¶
At the moment I am at the stage that I have seen a reasonably good collation by CollateX (References), even without using the detected transpositions.
Here is a sample of the collation (in a pretty-printed form):
0142 |ιουδας| |ιυ |χυ | | | | |δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θω |πρι | | | | |ηγιασμενοις |και |ιυ |χω | | |τετηρημενοις|κλητοις |
0251 |█ | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
0316 |█ | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
049 |ιουδας|χυ |ιυ | | | | | |δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θω |πρι | | | | |ηγιασμενοις |και |ιυ |χω | | |τετηρημενοις| |
056 |ιουδας| |ιυ |χυ | | | | |δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θω |πρι | | | | |ηγιασμενοις |και |ιυ |χω | | |τετηρημενοις|κλητοις |
1 |ιουδας| | | | | |ιησου|χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
1003 |ιουδας| | | | |χριστου|ιησου| |δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
101 |ιουδας| | | | |χριστου|ιησου| |δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
102 |ιουδας| | | | | |ιησου|χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
1022 |ιουδας| | | | |χριστου|ιησου| |δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
103 |ιουδας| | | | | |ιησου|χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
104 |ιουδας| | | | |χριστου|ιησου| |δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | |και |πατρι |ηγιασμενοις |και | | |ιησου|χριστου|τετηρημενοις|κλητοις |
1040 |ιουδας| | | | | |ιησου|χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
105 |ιουδας| | | | | |ιησου|χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
1058 |ιουδας| | | | | |ιησου|χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
1066 |ιουδας| | | | | |ιησου|χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγιασμενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
1067 | | | | |ιουδας | | |χριστου|δουλος|αδελφος|δε|ιακωβου|τοις | |εν |θεω | | | | |πατρι |ηγαπημενοις |και | | |ιησου|χριστω |τετηρημενοις|κλητοις |
Collation with a new algorithm¶
Now we will research whether a bag-of-words master source is a workable idea. I just found a new way to link corresponding positions across variants. Still a lot of checking and cross-checking has to be done. Pending the verdict whether the new method yields good results, it is certainly an interesting experience to look at this data in completely new ways.
Here is an illustration first: the passage Jude verse 1, just one line of text, but in 560 variants.
Here is the source in just a few variants:
0142 = ιουδας ιυ χυ δουλος αδελφος δε ιακωβου τοις εν θω πρι ηγιασμενοις και ιυ χω τετηρημενοις κλητοις
049 = ιουδας χυ ιυ δουλος αδελφος δε ιακωβου τοις εν θω πρι ηγιασμενοις και ιυ χω τετηρημενοις
056 = ιουδας ιυ χυ δουλος αδελφος δε ιακωβου τοις εν θω πρι ηγιασμενοις και ιυ χω τετηρημενοις κλητοις
1 = ιουδας ιησου χριστου δουλος αδελφος δε ιακωβου τοις εν θεω πατρι ηγιασμενοις και ιησου χριστω τετηρημενοις κλητοις
1003 = ιουδας χριστου ιησου δουλος αδελφος δε ιακωβου τοις εν θεω πατρι ηγιασμενοις και ιησου χριστω τετηρημενοις κλητοις
101 = ιουδας χριστου ιησου δουλος αδελφος δε ιακωβου τοις εν θεω πατρι ηγιασμενοις και ιησου χριστω τετηρημενοις κλητοις
The first step in collating is: cluster similar words according to a similarity measure. I chose a measure that takes insertions and deletions into account, so it is not quite the Levenshtein distance. I compute it based on the longest common subsequence (LCS) in the following way:
sim(w1,w2) = 2*length(LCS(w1,w2)) / (length(w1) + length(w2))
I cluster words by growing clusters from words that have a similarity of at least 0.8 to at least one member of them. The 0.8 is a thing to play with.
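In Python, the measure and the cluster growing look roughly like this (the original scripts are Perl; this is a reconstruction, and the clusters found may depend on the order in which words are processed):

```python
def lcs_len(w1, w2):
    # dynamic-programming length of the longest common subsequence
    prev = [0] * (len(w2) + 1)
    for c1 in w1:
        cur = [0]
        for j, c2 in enumerate(w2):
            cur.append(prev[j] + 1 if c1 == c2 else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def sim(w1, w2):
    # 2 * |LCS| / (|w1| + |w2|): 1.0 for identical words, 0.0 for disjoint
    return 2 * lcs_len(w1, w2) / (len(w1) + len(w2))

def cluster(words, threshold=0.8):
    # grow clusters: a word joins the first cluster in which it has
    # similarity >= threshold to at least one member
    clusters = []
    for w in words:
        for c in clusters:
            if any(sim(w, m) >= threshold for m in c):
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters
```

For example, sim('χριστου', 'χριστω') is 10/13 ≈ 0.77, just below the 0.8 threshold, which matches the separate clusters for the two forms in the tables above.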
The next step is to create a bag of cluster occurrences for all words in all sources. This is the master bag. Any word in a source is a member of a cluster, and this cluster is in the master bag. If a source has repeated occurrences in the same cluster, the master bag will contain several occurrences of that cluster.
The words in the sources are linked (bi-directionally) to the cluster occurrences in the master bag.
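A sketch of the master bag construction, assuming the clusters have already been computed and each word maps to a cluster identifier:

```python
from collections import Counter

def master_bag(sources, cluster_of):
    """sources: {variant name: [word, ...]};
    cluster_of: word -> cluster id.
    Returns the master bag of cluster occurrences plus, per variant,
    the word <-> occurrence links."""
    need = Counter()   # per cluster: maximum occurrences in any variant
    links = {}
    for name, words in sources.items():
        seen = Counter()
        occs = []
        for w in words:
            c = cluster_of[w]
            seen[c] += 1
            occs.append((c, seen[c]))   # ('338', 2) corresponds to 338#2
        links[name] = occs
        need |= seen                    # Counter union keeps the maximum
    bag = [(c, i) for c, n in need.items() for i in range(1, n + 1)]
    return bag, links

bag, links = master_bag({'0142': ['ιυ', 'χυ', 'ιυ']},
                        {'ιυ': '338', 'χυ': '636'})
# the repeated ιυ yields two occurrences of cluster 338 in the bag
```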
Here is the result of assigning clusters to the words in the example sources above. Every cluster has a number, and every cluster occurrence has that number with a # appended and then the occurrence number:
0142 |336#1 ιουδας 542x|338#1 ιυ 11x|636#1 χυ 9x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|325#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|324#1 θω 10x|486#1 πρι 8x|296#1 ηγιασμενοις 484x|343#1 και 524x|338#2 ιυ 10x|637#1 χω 9x|573#1 τετηρημενοις 510x|356#1 κλητοις 537x|
049 |336#1 ιουδας 542x|636#1 χυ 9x|338#1 ιυ 11x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|325#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|324#1 θω 10x|486#1 πρι 8x|296#1 ηγιασμενοις 484x|343#1 και 524x|338#2 ιυ 10x|637#1 χω 9x|573#1 τετηρημενοις 510x|
056 |336#1 ιουδας 542x|338#1 ιυ 11x|636#1 χυ 9x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|325#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|324#1 θω 10x|486#1 πρι 8x|296#1 ηγιασμενοις 484x|343#1 και 524x|338#2 ιυ 10x|637#1 χω 9x|573#1 τετηρημενοις 510x|356#1 κλητοις 537x|
1 |336#1 ιουδας 542x|331#1 ιησου 530x|626#1 χριστου 531x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|325#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θεω 525x|481#1 πατρι 529x|296#1 ηγιασμενοις 484x|343#1 και 524x|331#2 ιησου 510x|627#1 χριστω 378x|573#1 τετηρημενοις 510x|356#1 κλητοις 537x|
1003 |336#1 ιουδας 542x|626#1 χριστου 531x|331#1 ιησου 530x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|325#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θεω 525x|481#1 πατρι 529x|296#1 ηγιασμενοις 484x|343#1 και 524x|331#2 ιησου 510x|627#1 χριστω 378x|573#1 τετηρημενοις 510x|356#1 κλητοις 537x|
101 |336#1 ιουδας 542x|626#1 χριστου 531x|331#1 ιησου 530x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|325#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θεω 525x|481#1 πατρι 529x|296#1 ηγιασμενοις 484x|343#1 και 524x|331#2 ιησου 510x|627#1 χριστω 378x|573#1 τετηρημενοις 510x|356#1 κλητοις 537x|
And here is the master bag of all 560 variants! Every item in the master bag is a list of words from the variants that are linked to that item. The number of variants supporting each word is indicated, and the words are (vertically) ordered by the quantity of their support. The horizontal ordering is just alphabetical:
|12#1 x1|13#1 x3 |23#1 x541 |85#1 x2 |122#1 x2|144#1 x537|169#1 x541 |169#2 x1 |181#1 x23 |200#1 x1 |221#1 x539|221#2 x7|296#1 x538 |297#1 x2 |310#1 x1|320#1 x525|323#1 x1|324#1 x10|325#1 x541 |326#1 x1|331#1 x530|331#2 x510|331#3 x1|332#1 x1|332#2 x1|336#1 x542 |336#2 x1 |338#1 x11|338#2 x10|343#1 x524|343#2 x4|356#1 x538 |364#1 x2 |440#1 x538|440#2 x1|481#1 x529|482#1 x1 |486#1 x8|509#1 x1|573#1 x526 |581#1 x1|597#1 x1|626#1 x531 |626#2 x134 |627#1 x378 |627#2 x2 |634#1 x1|635#1 x1|636#1 x9|636#2 x2|637#1 x9|650#1 x14|651#1 x4|
|1x αγι |3x αγιοις|541x αδελφος|1x ●●●●●○○ |2x ●●●● |537x δε |541x δουλος|1x δουλος|23x εθνεσιν|1x εκλεκτοις|539x εν |7x εν |484x ηγιασμενοις |2x ηγιασμενης|1x ημων |525x θεω |1x θυ |10x θω |540x ιακωβου|1x ιακω |530x ιησου|510x ιησου|1x ιησου|1x ιηυ |1x ιηυ |542x ιουδας|1x ιουδας|11x ιυ |10x ιυ |524x και |4x και |537x κλητοις|2x κυριου|538x τοις |1x τοις |529x πατρι|1x πατρασιν|8x πρι |1x προ |510x τετηρημενοις|1x της |1x τω |531x χριστου|134x χριστου|378x χριστω|2x χριστω|1x χρυ |1x χρω |9x χυ |2x χυ |9x χω |14x █ |4x ░ |
| | | |1x ●●●●●●●●| | | | | | | | |47x ηγαπημενοις | | | | | |1x ιακοβου | | | | | | | | | | | | |1x κλιτοις | | | | | | | |9x τετηριμενοις |
| | | | | | | | | | | | |2x υγιασμενοις | | | | | | | | | | | | | | | | | | | | | | | | | | |2x τετυρημενοις |
| | | | | | | | | | | | |1x ηγισαμενοις | | | | | | | | | | | | | | | | | | | | | | | | | | |1x τηρημενοις |
| | | | | | | | | | | | |1x ηγαποιμενοις | | | | | | | | | | | | | | | | | | | | | | | | | | |1x τιτεμημενοις |
| | | | | | | | | | | | |1x ηγνησμενοις | | | | | | | | | | | | | | | | | | | | | | | | | | |1x τετιρημενοις |
| | | | | | | | | | | | |1x ηγιασμενος | | | | | | | | | | | | | | | | | | | | | | | | | | |1x τετηρημενος |
| | | | | | | | | | | | |1x προηγιασμενοις| | | | | | | | | | | | | | | | | | | | | | | | | | |1x τετημενοις |
Note that transpositions are no problem whatsoever. There are problems, though: very different words on corresponding positions will not get linked in any way.
Here is a remedy: take context into account. Merge cluster occurrences that have similar contexts. Here is how. First I make a skeleton passage by replacing words with little support from the variants by a star; words with sufficient support are replaced by the cluster occurrence in the master bag that they are linked to. The threshold is 0.6, again a value to play with. The context of a cluster occurrence is this skeleton passage with a placeholder for the cluster occurrence in question. I merge two cluster occurrences if they share a significant context. Significant means that at least one of the cluster occurrences occurs in this context often enough; often enough is expressed as a fraction of how often it occurs in all variants. The threshold here is chosen to be 0.0. Again, this is something to play with, or maybe not, since the limit value 0.0 already gives good results.
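A simplified sketch of this context merging; here the context is reduced to the immediate neighbours in the skeleton rather than the whole passage, and a single shared context suffices (the limit threshold 0.0):

```python
def contexts(links, support, n_variants, well_supported=0.6):
    """links: {variant: [cluster occurrence, ...]};
    support[cluster] = number of variants containing that cluster.
    Collect, per cluster occurrence, the skeleton contexts it occurs in."""
    ctx = {}
    for occs in links.values():
        # skeleton: well-supported occurrences stay, the rest become '*'
        skel = [occ if support[occ[0]] / n_variants >= well_supported
                else '*' for occ in occs]
        for i, occ in enumerate(occs):
            left = skel[i - 1] if i > 0 else None
            right = skel[i + 1] if i + 1 < len(skel) else None
            ctx.setdefault(occ, set()).add((left, '_', right))
    return ctx

def mergeable(ctx, a, b):
    # with threshold 0.0, any shared context makes two occurrences merge
    return bool(ctx.get(a, set()) & ctx.get(b, set()))
```

In a two-variant toy example, ιυ and ιησου in the slot between ιουδας and δουλος receive the same context and so become mergeable.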
This is what we get. First the new collation:
0142 |336#1 ιουδας 542x|331#2 ιυ 11x|634#1 χυ 9x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|326#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θω 10x|481#1 πρι 8x|296#1 ηγιασμενοις 484x|343#1 και 524x|320#1 ιυ 10x|320#1 χω 9x|573#1 τετηρημενοις 510x|200#1 κλητοις 537x|0251 |651#1 █ 14x|0316 |651#1 █ 14x|
049 |336#1 ιουδας 542x|634#1 χυ 9x|331#2 ιυ 11x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|326#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θω 10x|481#1 πρι 8x|296#1 ηγιασμενοις 484x|343#1 και 524x|320#1 ιυ 10x|320#1 χω 9x|573#1 τετηρημενοις 510x|
056 |336#1 ιουδας 542x|331#2 ιυ 11x|634#1 χυ 9x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|326#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θω 10x|481#1 πρι 8x|296#1 ηγιασμενοις 484x|343#1 και 524x|320#1 ιυ 10x|320#1 χω 9x|573#1 τετηρημενοις 510x|200#1 κλητοις 537x|
1 |336#1 ιουδας 542x|331#1 ιησου 530x|331#2 χριστου 531x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|326#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θεω 525x|481#1 πατρι 529x|296#1 ηγιασμενοις 484x|343#1 και 524x|331#2 ιησου 510x|320#1 χριστω 378x|573#1 τετηρημενοις 510x|200#1 κλητοις 537x|
1003 |336#1 ιουδας 542x|331#2 χριστου 531x|331#1 ιησου 530x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|326#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θεω 525x|481#1 πατρι 529x|296#1 ηγιασμενοις 484x|343#1 και 524x|331#2 ιησου 510x|320#1 χριστω 378x|573#1 τετηρημενοις 510x|200#1 κλητοις 537x|
101 |336#1 ιουδας 542x|331#2 χριστου 531x|331#1 ιησου 530x|169#1 δουλος 541x|23#1 αδελφος 541x|144#1 δε 537x|326#1 ιακωβου 540x|440#1 τοις 538x|221#1 εν 539x|320#1 θεω 525x|481#1 πατρι 529x|296#1 ηγιασμενοις 484x|343#1 και 524x|331#2 ιησου 510x|320#1 χριστω 378x|573#1 τετηρημενοις 510x|200#1 κλητοις 537x|
And here is the new master bag:
|12#1 x1|13#1 x3 |23#1 x541 |144#1 x537|169#1 x541 |169#2 x1 |200#1 x541 |221#1 x539|221#2 x7|296#1 x540 |310#1 x1|320#1 x1072 |326#1 x542 |331#1 x530|331#2 x1053 |331#3 x1|336#1 x542 |336#2 x1 |343#1 x524|343#2 x4|364#1 x2 |440#1 x540|440#2 x24 |481#1 x539 |573#1 x526 |581#1 x1|627#2 x2 |634#1 x10|651#1 x18|
|1x αγι |3x αγιοις|541x αδελφος|537x δε |541x δουλος|1x δουλος|537x κλητοις|539x εν |7x εν |484x ηγιασμενοις |1x ημων |525x θεω |540x ιακωβου|530x ιησου|531x χριστου|1x ιησου|542x ιουδας|1x ιουδας|524x και |4x και |2x κυριου|538x τοις |23x εθνεσιν|529x πατρι |510x τετηρημενοις|1x της |2x χριστω|9x χυ |14x █ |
| | | | | | |1x ●●●●●●●● | | |47x ηγαπημενοις | |378x χριστω |1x ιακοβου | |510x ιησου | | | | | | |2x ●●●● |1x τοις |8x πρι |9x τετηριμενοις | | |1x χρυ |4x ░ |
| | | | | | |1x ●●●●●○○ | | |2x υγιασμενοις | |134x χριστου|1x ιακω | |11x ιυ | | | | | | | | |1x πατρασιν|2x τετυρημενοις |
| | | | | | |1x κλιτοις | | |2x ηγιασμενης | |10x θω | | |1x ιηυ | | | | | | | | |1x προ |1x τηρημενοις |
| | | | | | |1x εκλεκτοις| | |1x ηγισαμενοις | |10x ιυ | | | | | | | | | | | | |1x τιτεμημενοις |
| | | | | | | | | |1x ηγαποιμενοις | |9x χω | | | | | | | | | | | | |1x τετιρημενοις |
| | | | | | | | | |1x ηγνησμενοις | |2x χυ | | | | | | | | | | | | |1x τετηρημενος |
| | | | | | | | | |1x προηγιασμενοις| |1x ιηυ | | | | | | | | | | | | |1x τετημενοις |
| | | | | | | | | |1x ηγιασμενος | |1x χρω |
| | | | | | | | | | | |1x τω |
| | | | | | | | | | | |1x θυ |
Much better, it seems. Look how many nomina sacra (holy names) are now lumped together. Too many?
Next steps¶
Important questions remain:
- how can we check whether this kind of collation is good?
- what is needed to adjust the thresholds to get optimal results?
- how can we understand the contribution of each of the three threshold settings?
Visualisation¶
A visual check is mandatory. So the question arises: how do we visualize the collective variation of 560 manuscripts for one passage?
Note that the master bag in the previous section is a handy tool to link the corresponding slots in the variants. But the master bag itself is not ordered, and cannot be ordered in such a way that it faithfully reflects the order of all variants. If you want to see order, you have to go back to the variants themselves!
But the master bag does give a clue. For each word in each variant we have all the alternatives from the other variants at hand, complete with the size of their support. We could put the words on dials, like an iOS alarm setter.

In your mind’s eye, replace the numbers with words in the variants, so that on a horizontal row you can read the variants. And replace the AM and PM by the names of the variants. Imagine that the words with much support are bigger/bolder/blacker than the words with little support. By dialing the AM/PM dial, you can move to different variants. You can also dial the words themselves. Imagine that there is an extra dial with the names of the variants that only contains the names of the variants that have the passage exactly as displayed.
A knowledgeable scholar can quickly check whether a purported variant reading is supported by actual sources. And if a reading is supported by multiple variants, he can see what happens when he changes individual words.
These are just first ideas. It might be the first test-tube in a new text-critical laboratory.
Linking to the facsimiles¶
Suppose we get the linking of the transcriptions right, or at least decent. Then here is an exciting perspective: if we can link each word in a manuscript transcription (bi-directionally) with a rectangle on a good facsimile, we can map all our inter-transcription links into inter-facsimile links. You could then use the dial interface above not only with the transcriptions on it, but also with the facsimile fragments on it.
How difficult can this facsimile linking be if you already have the descriptions? That is an open question to me. There is much existing work (NT Manuscript Room Münster); there is OCRopus, Google's open source OCR suite, which can also achieve something on manuscripts. And best of all, we do not have to recognize the characters, but only find rectangles around lines, glyphs, and words. Probably it cannot be done completely automatically, but then an efficient man-machine teaming-up might do the job.
Timeline¶
(See also the timeline in Queries As Annotations)
2012-11-29 DARIAH meeting Vienna¶
In the midst of many discussions about how to plan the marriage between researchers’ requirements and infrastructural provisions, I managed to achieve a promising collation of the Jude manuscripts. It is based on bags of words combined with context analysis. More details will follow shortly.
2012-11-21/22 Open Humanities Hack London¶
This event turned out to be a nice opportunity to get new ideas for linking facsimiles and transcriptions. Here is a report. I participated in a group that experimented with generating annotations to a facsimile (one of the Jude manuscripts from the Walters Art Museum in Baltimore). The annotations generated have as their targets rectangles around the lines of the manuscript and as bodies the line number. Ultimately, we want to use more clever ways of detecting the rectangles around lines, or rather, individual characters. What is also needed is a way to show those glyph-demarcating annotations straight on top of the facsimile. This will require additional functionality in the Pundit system that we used at the hack.
2012-11-20 First results with CollateX look good¶
Today I have applied a recent developer version by Ronald Dekker (References) of CollateX (References) to the 560 manuscripts, with prima facie good results.
2012-11-06 Depositing a dataset at DANS with Tommy Wasserman’s data¶
Tommy has agreed that his data may be used by the Open Humanities Hack, provided proper attribution is given. A nice way to make that possible is to package it in a dataset at DANS (References). In that way the data is referable by means of a persistent identifier, the use of the data is regulated by licences, and the provenance of the data can be explained. The source data are restricted access, which means that Tommy has to grant permission before you can use them. There are also public files: the conversion script and logs from source to SQL database, plus a description of what you can do with the SQL representation.
2012-10-17 Jude on Graph paper¶
Jan and I have studied Tommy's transcriptions, converted them to UNICODE, processed the markup, transformed them into a set of layers, and imported them into a database. The layered text has been transformed into a graph paper representation, in order to visualize and check the layered information.
2012-06-21/22 Jan and Dirk visit Juan and Matthew in Göttingen¶
Discussion of how to code all the data that is inside a manuscript. Towards a layered structure of information. The glyph as unit? See also What is a Glyph?
2012-06-14 The Jude files¶
We receive high quality text material by Tommy Wasserman (References) to experiment with: the transcriptions of all known manuscripts of Jude.
2012-05-03 Jan and Dirk about a text critical laboratory¶
Identification of first steps.
2012-03-01 Dirk visits Juan and Matthew in Göttingen¶
Continuation of the exchange of ideas that started at the Lorentz Workshop on Biblical Scholarship, Leiden, 2012-02-06/10
What is a Glyph?¶


τους (1)

τους (2)

δε

και

παρα

υμας
Consider a fragment of the Stephanus edition (1550) of the letter of Jude. More specifically, look at the separate fragments.
There are two very dissimilar glyphs that in transcription reduce to τους. At the lowest level of description, will we consider these two big ligatures as the atomic entities, or do we decompose them into four letters? The decomposition is definitely a product of interpretation; it is not a product of direct perception.
Similar things hold for the δε and και ligatures, and here the constituent characters are even less recognizable, or even absent. In the παρα ligature the individual components are recognizable, but they are arranged along a non-standard path.
There is also something going on with υμας. It looks like two glyphs, υ (with diacritic) and μας (with diacritic). We see the circumflex on top of the ς, but we know that it belongs to the α. Now, if the μας is a single glyph, how do you express the fact that the circumflex belongs to the α part of it?
So, if we take the glyph as atomic object, and assign addresses to all glyph positions in a manuscript, then we need a method of subaddressing to address the inferred characters from which a ligature glyph has been composed.
This is not too difficult. In our first layer of description we mention the (big) glyphs and their positions, and in a second layer of description we could provide the list of composing characters per glyph. The position in that list will be the subaddress to individual characters.
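A tiny sketch of this two-layer addressing; the glyph inventory here is illustrative:

```python
# layer 1: address -> glyph (the big, directly perceived units)
glyph_layer = {0: 'τους-ligature', 1: 'δε-ligature'}

# layer 2: address -> inferred composing characters; the index in the
# list is the subaddress of the individual character
decomposition = {0: ['τ', 'ο', 'υ', 'ς'], 1: ['δ', 'ε']}

def char_at(address, subaddress):
    return decomposition[address][subaddress]

# char_at(0, 3) -> 'ς': the fourth inferred character of the τους ligature
```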
Now in this case we are helped by the fact that this is not a manuscript at all. We have in fact pictures of the letterbox that Garamond used to typeset this edition. We can see there all those ligatures as separate entities.
But the bulk of the source materials are real manuscripts, where the boundaries between character, glyph, ligature and connected writing are just plain fuzzy.
Juan Garcés, Jan Krans, Matthew Munson and I discussed these matters at length in Göttingen, at the Digital Humanities Centre.

Juan

Jan, Juan, Matthew

a complex glyph with acute accent

a complex glyph with grave accent
Lorentz Workshop¶

Participants on the last day of the workshop
The names of the participants are in the IPTC metadata of this photograph, the GPS location is also included.
The material has migrated from the original wiki at demo.datanetworkservice.nl to readthedocs.org on 2013-11-21.
Workshop Notes¶
Monday¶
Greek Group Bert Jan¶
Discussion with Jan Krans, Alexey Somov, Ulrik Sandborg-Pedersen, Nico de Groot, Matthew Munson, Markus Hartmann, Jon Riding, Bert Jan Lietaert Peerbolte
- The main point we are discussing is how to analyze the text of the Greek NT by computers. Paratext works with algorithms, establishes word pairs (e.g. Greek - English; Greek - Dutch): on the basis of these word pairs, English equivalents are displayed. Occasionally, this leads to mistakes because words function in various semantic domains and their exact meaning is related to the semantic domain in which the word functions in the particular text in which it is found.
- The critical apparatus poses a problem to bible software. How should the apparatus be taken up in your program? The presence of variant readings is important to translators who do not have a paper edition at hand. Adding the committee’s classification from Metzger’s textual commentary informs translators on the value of the readings adopted in the main text. NB: Paratext uses the GNT apparatus, not NA27. The way in which Paratext presents textual evidence now shows that it is digital information that has been published in a printed form, and has then been transformed to a digital environment. In the apparatus a scholarly edition would need full information on manuscript readings. Clicking the abbreviation should result in checking the exact reading referred to.
- A syntactical analysis of the Greek text is lacking so far. Semantics and morphology are covered in Paratext, but syntax is not. Jon Riding is actually questioning the sheer existence of syntax as a discrete system. Eep Talstra (who just entered the room) argues that syntactical patterns are there in the text, and that they can be discerned by a computer. In response to this point, Ulrik Sandborg-Pedersen presents the new syntactical parsing project by Logos. There, the parsing is done by the computer, not by human activity. This parser can generate entire trees of NT texts in Logos. The group are hoping for Rick Brannan to show us the possibilities of this new asset in his presentation later this week. One of the difficulties here appears to be the fact that every computer analysis of syntax is also connected to grammatical categories. To mention but one example: why should δέ be categorized as a “conjunction”?
- An interesting methodological issue pops up: what to do with the phenomenon of krasis? Should the word κἀκεῖνον in Luke 20:11 be read as καὶ ἐκεῖνον or not? If you work on the basis of a database, the word should probably be split up for reasons of classification. The methodological problem is of course whether you can apply grammatical classification as the defining element for building a database. Jan Krans mentions the (non) use of the article in John 1 as an important point of discussion, but the exact point escapes the author of this report...
Notes BJLP
Hebrew Group Wido¶
- How do you organize data of the text
- How do you order the analysis of the text
The WIVU database itself is designed to allow data to be extracted, permitting arbitrary “data mining”: supplementing the data, and writing scripts to manipulate or search the data in ways, or to answer questions, not originally envisioned by the database designers.
Once you decide upon a data model, that makes the database well-defined, but you can no longer accommodate those things that do not fit the model.
Multiple information hierarchies in the text. These hierarchies are not necessarily compatible with one another. How does one preserve the hierarchical relationships and not lose information? E.g., multiple Bible translation alignments: LXX and MT, for example. Emdros is one possible solution.
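The overlapping-hierarchy problem can be made concrete with a small sketch. This is a hypothetical illustration only, not the actual Emdros/WIVU data model: in a standoff approach, each hierarchy stores references to a shared sequence of base tokens, so hierarchies that would clash as nested markup can coexist side by side.

```python
# Hypothetical standoff-annotation sketch: two incompatible hierarchies
# over one shared token sequence, each storing only token references.

tokens = {1: "In", 2: "the", 3: "beginning", 4: "God", 5: "created"}

# Independent hierarchies over the same tokens; their spans may overlap
# in ways a single XML tree could not express.
hierarchies = {
    "versification": [{"label": "verse 1", "tokens": {1, 2, 3, 4, 5}}],
    "syntax": [
        {"label": "PP", "tokens": {1, 2, 3}},
        {"label": "NP", "tokens": {4}},
        {"label": "VP", "tokens": {5}},
    ],
}

def covering_units(token_id, hierarchy):
    """Return the labels of all units in a hierarchy that contain a token."""
    return [u["label"] for u in hierarchies[hierarchy] if token_id in u["tokens"]]
```

Because no hierarchy owns the tokens, adding a third analysis later (the repurposing question above) means adding entries, not restructuring the database.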
Can we design a database that is optimal for incorporating new data as we go along? There is a symbiotic relationship between the database as a research tool and the database as a repository of information. Is a database designed to answer very specific research questions, or do we design it to allow uses which we today cannot predict?
Can databases be “re-purposed”? How can we combine the information of “alien” databases (or parts of them) into a “new” database or dataset for the researcher’s use?
There was some discussion about the practical problem of proprietary databases versus “open” databases.
Different manuscripts and reading traditions. Basic information unit should be an “abstract position” of the text where variants can be linked and associated.
We need some sort of standard for first level data, second level data, etc.
Standards for publication and peer review specifically relating to the digital humanities. Research publishing needs to allow the reader to – as in the natural sciences – repeat the research (data, algorithms) to see if one can get the same results. In this way, research conclusions can be validated.
Notes: Kirk Lowery
Informatics Group Henk¶
Paul Huygen, Juan Garces (Göttingen), Nicolai, Marcel Ausloos, Dirk Roorda, Henk Harmsen, Janet Dyk, Andrea Scharnhorst
- levels of analysis (circularity, ambiguity)
- preserving research results
- anchored sources
- open source communities - non-exclusive licences
Ambiguity
Variance
Granularity - Modularity
Lowest level = texts as transmitted in manuscript traditions, then transcriptions, then markup levels.
Uncertainties at lower levels are often resolved by patterns at higher levels.
Anchoring. Paul: Münster Group is doing that. There are unique identifiers in the WIVU database.
Canonical Text Services (FRBR) (http://wiki.digitalclassicist.org/Canonical_Text_Services)
Marcel: abbreviations?
Juan: physical and logical aspects in manuscript analysis.
Nicolai: we are moving towards collaboration, sharing, multiplicity.
Paul: how do we do this from the beginning, modelling. (looks like SharedCanvas).
Andrea: how you want to deal with ambiguity should be communicated to the informatics people.
Changing classification numbers in evolving systems.
Where is the responsibility for versioning: in the application or in the archive?
The role of mistakes: must we keep them because they are facts, or must we correct them because we want to do analysis?
Keep both.
Janet: we do not only have the glyphs; we also have higher-level patterns. We are doing analysis at different levels, so we cannot rigorously separate the realm of fact from the realm of interpretation.
Nicolai: it’s important to define the purpose of the database. Sometimes you need the mistakes to be present, e.g. if you are interested in the physical texts. But on top of that we need databases, systems for representing other layers of analysis.
Nicolai: bigger players go to open source initiatives. If we in our EU project cannot use the WIVU freely for research, we may have to drop it, which we would do with reluctance.
Relationship WIVU - SESB - Logos.
Juan: BibleWorks might be the most prone to proposals like this.
Notes: Dirk Roorda
Plenary¶
After the sub group summaries we discussed the following points:
- commercial versus/with open source. The panel states that it is good to have anchored sources publicly available, and that for the rest commercial interest can still flourish in the presence of an open source community that is also using those sources.
- syntax trees and collocations provide more information as to the interpretation of the words. Jan Krans: but is that information correct? Eep: could you anchor those analysis results to the sources?
- scope and details of anchored texts
- reference grammars stopped when linguistic database started (Michael Aubrey). The multiplicity of text databases per se is a worthwhile contribution to the field.
- the development of databases. They evolve, data is added, features are added.
- the role of the level of analysis for sharing and collaboration. On the lowest levels (say the facsimile of the Sinaiticus) we can achieve a lot of consensus. At higher levels there will be more controversy. So the lower levels are more amenable for sharing, and we should do that in order to further our research purposes at higher levels.
- scholars want to do their own analysis. They voluntarily incorporate those of their colleagues. The design of sharable data should reflect that. (Wido)
- Juan: also look at what’s happening in comparable disciplines: literary and discourse studies in old texts. Also mind the differences: there are other purposes at play as well.
- sharing should focus on the results of the analysis, not necessarily on the analysis process itself.
- collaboration also means: sharing assumptions. Playing with assumptions as well (counterfactuals).
- role of databases in teaching. Next generations have the equipment for new research. How much do we want to limit the use of our resources by new generations?
- Andrea’s question: why all these questions about text? Andrea: my question is: what can computational methods do for us, how far do they go? Do we use them to order our data, or are we using them to disturb our minds?
- Eep: we are changing from the question “how did the river run from source to editions in the case of the Hebrew Bible?” to “how do rivers run in general?”. From art creation to sociological processes.
Notes: Dirk Roorda
Tuesday¶
Group Visualisation (Andrea)¶
Nicolai, Brenda, Wido, Eep, Marcel, Joris, Paul H.
Old Testament syntactic structures - Visualization of Kirk
New Testament Complexity in the manuscript evidence - 5000 different documents
A: Information about the sources: from which time, on which medium, which type of manuscript, where it is now located; a visualized bibliography.
Lists exist on parts of the sources. Nicolai: a mapping of these sources is not interesting. Wido: links between manuscripts and books. Brenda: where things are found says how accurate they are.
N.: closeness to the sources is the most interesting feature; so any measure which allows the comparison of the similarity of texts; clustered, and then checked manually.
The Münster institute applied multi-dimensional scaling to cluster manuscripts on a textual basis (New Testament people).
Paul H.: genealogy of texts using techniques like in bioinformatics, e.g. minimal spanning trees (M.: there are technical problems with this method, in particular if the data are noisy).
Brenda: it would make most sense to do analysis at the word level for the NT; the Greek New Testament is an authoritative sort of source.
What would you like to see visualized? Why?
Wido: how to visualize the relationship between different texts (e.g., the biblical books of Samuel and Chronicles)? In print: two versions side by side with changes marked in red; but we would like not only to have changes marked, but also what kind of changes.
Wido: we would also like to see how we could detect changes in the sources these authors might have used.
Paul H.: could we use methods of fraud-detection?
B: What can computers do? Often people do guesswork. We have so little data. How much computers can help depends on the quality of the basic data. One approach to this is to test out very different measures at the same time (M).
N: can we develop criteria to detect authorship?
M.: an experiment was done on a Time magazine editorial to find out who wrote it, or who wrote which passages; but influences are so varied that it is almost impossible.
N: can we detect style? But people adapt their style to the context in which they use it, their audience
A: visuals could be used to communicate about your research to outside communities: how complex are your data in time/space/kind, along which dimension you have too much data, and along which dimension you have too little and might want methods to “enrich” the data.
J: the CKCC project has 20,000 letters of scholars in the Enlightenment period; entity recognition. The biggest questions are topics, but names and places are also in focus. Network visualization needs defined matrices!
Wido: different versions of a story can be found, not always with the same words: latent semantic analysis; motifs/themes/semantic links between things which come with a different wording.
A: clustering depends on the seed node with which we start. From some nodes of the network the group sees itself, and all the rest is equally amorphous; other start nodes give a much clearer ring structure and neighbouring-village structure.
M: a network of quotations could be constructed, and then techniques similar to those of citation analysis could be applied.
J: could one create information from the workflows done around the basic data, when storing and coding them? Construct a network of activity of biblical scholars (from the publications, tracking in books): reconstructing the reception history, but then in modern times, at the analyzer level (not the source level).
N: 8,000 words are in the vocabulary: lemmas (?)
B: the timing of words has been used: what the changes are, e.g. spelling; but no visualization.
N: but how would it help us?
B: wordle is used in education to show what a certain passage is about; but beyond this we do not even know the questions.
W: visualizing the vocabulary of the Bible. A.: could we also put this into the semantic web?
A: Can we apply the CRM system of Martin Doerr’s group (http://www.cidoc-crm.org/) to sources (artifacts) around the Bible? Has that been done? CRM allows one to trace the journey of an object across locations and museums.
N/Eep: could we visualize stories using participant analysis, seeing the story happening; visualizing the path dependency of stories and in this way showing and maybe visualizing “alternative histories”? Poems are even more complex in this respect. Track and visualize the consequences of certain decisions, and switching between them; implement different scenarios and play with them. This is difficult to read from raw data; for teaching undergraduates, visuals would be great.
A: could we borrow visualizations from film scenarios or alternative writing with many storylines? Ask Katy.
Eep: for participant tracking, metaphors became important (Iina Hellsten, VU).
N: as we discover new rules in analysis, visualize on the fly; principal component analysis applied to things like verbs, valencies, temporal and locational entities, persons, gender, number, etc.
DANS could do experiments with the database at the attribute level; Andrea and Dirk could check this out.
B: visualize different storylines in one text
Eep: synopsis of the gospel texts - might be similar to the ManyEyes visualization of changes in treaties of US legal systems
J: try to compose the old testament just from the perspective of David -> give this as task to the computer scientists and let them model the data
Eep: applying visualization in education; visualizing the workflow so that students could more easily get used to the work practices: a strong learning system.
N: how to visualize a text? There are experiences in other fields you could rely on
E: how to translate database content into some kind of visualization?
N: bible is not only text, also images and other objects - project on augmented reality around the bible
E: deliberate changes versus changes/variations by accident; purposeful variations of a story with a certain effect on the audience; how to let students experience this?
A: Visuals fly with us because we live in a visual culture, but be careful: they can also be easily misleading, and they can become dull if not done professionally (design).
Software tools:
- Gephi - for network visualization http://gephi.org/
- Katy Boerner’s tool (where the workshop in Amsterdam is on Feb 16; registration via ehumanities.nl) http://sci.slis.indiana.edu/
- ManyEyes, works like wordle: upload data and explore them http://www-958.ibm.com/software/data/cognos/manyeyes/
- Places & Spaces exhibition www.scimaps.org
- other visualization websites for inspiration http://www.visualcomplexity.com/vc/
Website of Clement Levallois, who organized the workshop on visualization (where VOSviewer was presented; Wido mentioned this) together with Stephanie Steinmetz; he is in Leiden and Rotterdam, http://www.clementlevallois.net/index.php
Group Linguistic Analysis (Kirk)¶
Continued discussion of the Groves Center’s Westminster Hebrew Syntax
How did they compile the database? They began manually, starting with Genesis 1 and creating trees manually. Andi Wu (Asia Bible Society) began writing grammar rules. Then the parser generated trees for Genesis 1 and these were checked against the manually created trees. The grammar was modified again and run on Genesis 2 and compared again. By the time Genesis 20 was reached, the grammar was sophisticated enough for the rest of the Hebrew Bible (minus the Aramaic portions).
Most of our conversation dealt with what information was necessary in this database and what could be left out. In a tree, you need to decide which words are related to each other. What information do you need beyond that? S, V, O, IO, NP, PP. Indirect object vs. PP? Subject instead of NP? How to deal with strange situations? Multiple objects for a verb? An object marker in front of what is actually a subject (cf. Neh 9:19)?
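To make the design question tangible, here is a minimal sketch of the kind of labelled tree under discussion. The labels (S, V, O, NP, Det) follow the notes above, but the structure is purely illustrative and not the Westminster Hebrew Syntax format.

```python
# Illustrative only: a minimal labelled tree for clause-level functions.
# The real database encodes far more (and different) information.

class Node:
    def __init__(self, label, children=None, word=None):
        self.label = label          # e.g. "Clause", "S", "V", "O", "NP", "PP"
        self.children = children or []
        self.word = word            # surface form, for leaf nodes only

    def leaves(self):
        """Return the surface words of this subtree in order."""
        if self.word is not None:
            return [self.word]
        return [w for c in self.children for w in c.leaves()]

# "God created the heavens": subject, verb, object
tree = Node("Clause", [
    Node("S", [Node("NP", word="God")]),
    Node("V", word="created"),
    Node("O", [Node("NP", [Node("Det", word="the"), Node("N", word="heavens")])]),
])
```

Every label choice here (Subject vs. NP, Object vs. PP) is exactly the kind of decision the group noted must be documented so that others can understand and disagree.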
There was continued interest in how Masoretic cantillation was used. Rather than summarize the discussion, the reader is referred to the following paper by Wu and Lowery which goes into great detail about this method:
Wu, Andi and Kirk Lowery, “From Prosodic Trees to Syntactic Trees,” pp. 898-904 in Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics: Sydney, July 2006. http://dl.acm.org/citation.cfm?id=1273188
Important insights: make the decisions made clear so that others can understand and disagree. Don’t make your database useless to those who don’t agree with your assumptions/decisions.
Group Dealing with Complex Textual Evidence (Jan)¶
Jan Krans, Juan Garces, Markus Hartmann, Constantijn Sikkel, Rick Brannan, Dirk Roorda, Karina van Dalen, Oliver Glanz, Alexey Somov, Reinoud Oosting, Eep Talstra (part-time), Bert Jan Lietaert Peerbolte.
- there is a link between (linguistic) incoherence and the text’s transmission; the OT side has more of a linguistic approach, the NT side a stronger emphasis on textual criticism -> it’s a matter of research interest
- text base is usually fixed to the standard editions (Textus Receptus, NA27, UBSGNT, SBLGNT) as “the text” - this influences the database structure - which is the right model? - NA27 should not be the de facto default
- is there one model that caters for both OT and NT? Do they attract the same research questions? The same tools can be used for both traditions, e.g. collation tools
- database = data + model + algorithm(s) - do we have to wait for solutions to major questions before we create a database? - XML vs relational databases
- is RDF a possible solution? More of a metadata format than a solution
- versions: relations between source text(s) and translations is complex + quotations of the texts in Church Fathers etc. - how does this impact our research and modelling?
- science is also about abstraction and abstractions are important for scientific progress; are there discoveries & findings that have challenged major abstract models? Western non-interpolations (previously omitted, now included) - this would influence analysis and grammar
- new philology: texts are analysed in parallel -> no need to exclude variants in favour of a standard text; the context of the manuscript is very important, variation is celebrated as part of the text tradition - each manuscript text has its own intrinsic value, different questions are asked; can we undertake such an approach with over 5000 MSS? (vast difference between MSS - from little fragments to large codices etc.)
- editions include disambiguations and interpretations that have been added in the process of transmission or by modern editors
- main challenge: trying to postpone the influence of (scholarly) tradition, departing from a sense of distance from that tradition
- case study: NT transcripts - Lk 2, 1Tim 3:16
- concept of text is more complex than the abstraction
- the digital medium allows for explicitness of decisions and the reasons for these decisions - Q: what do you do with such micro-data? A: find intricate patterns, e.g. scribal habits; make decisions consistently; find patterns
- existing data is still restricted (e.g. Catholic Epistles) - is this data representative? more comprehensive data is necessary
- social revolution: data can only be established as a larger, collaborative effort (partly crowdsourcing) - needed: standards
- base data: sometimes you cannot agree, scholars aren’t sure - manuscript texts are more complex (change of hands, ink etc. - see file)
Plenary¶
The manuscripts of the Greek NT in digital form are low hanging fruit.
Gephi open source visualisation tool
Presentation Andi Wu¶
Showing the Synonym Finder
The program gives a scored list of synonyms for a given Hebrew or Greek word. It also grabs translations in English and Chinese from the database. It can give Greek synonyms of Hebrew words and vice versa.
Janet: where did you harvest the synonyms from?
Question: we have the data to generate networks of meanings, so is this useful for scholars doing dictionary studies?
Showing the Similar Verse Finder
Shows verses with similar words in similar grammatical relationships. Similarity is sensitive to meaning, not in the first place to form, part-of-speech, morphology.
This tool crosses the OT/NT boundary.
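A naive sketch can show what “sensitive to meaning, not form” might look like in its simplest form. This is a hypothetical illustration, not Andi Wu’s implementation: it assumes each word has already been mapped to a sense identifier (here, made-up Strong’s-style numbers), and it ignores the grammatical relationships the actual Similar Verse Finder also weighs.

```python
# Hypothetical meaning-based similarity: compare verses by their sets of
# sense identifiers rather than surface forms, so synonyms and
# cross-language matches can still overlap.

def sense_similarity(verse_a, verse_b):
    """Jaccard similarity over the sense identifiers of two verses."""
    a, b = set(verse_a), set(verse_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Verses as lists of (hypothetical) sense identifiers, not words.
v1 = ["H1254", "H0430", "H8064", "H0776"]   # "created", "God", "heavens", "earth"
v2 = ["H6213", "H0430", "H8064"]            # "made", "God", "heavens"

score = sense_similarity(v1, v2)            # overlap on "God" and "heavens"
```

Because the comparison happens at the sense level, nothing ties it to one language, which is one way a tool like this can cross the OT/NT boundary.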
Showing Multiple Senses for Words
Looking for the senses of Strong’s 639. Computing the senses takes time ... Sense 7 does not have any instances yet. It uses info from the translations.
Shows all occurrences, indicating the sense it has in each occurrence.
Wido on Publication
Digital Humanities Quarterly is interested in having a special issue with contributions from this workshop. Some editorial work will be necessary.
Action Juan
Juan will send out a questionnaire in order to make a map: an inventory of our network.
Karina van Dalen
We have to show the humanists the tools in a bit more comprehensible way.
What are the tools beyond linguistics? How do literary scholars discuss these things, how do we cooperate with programmers?
Agile development paradigm. Frequent reporting by programmers to scholars. Being involved during the complete development process. Discovering new possibilities in time, so that you can use them quickly. Benefit for programmers: they are building something that can be used.
Impressed by the tool developments that were shown here.
Ulrik confesses having done it in the wrong way at one time. The scholar did not want to use his database.
Dirk
Editions and the slide rule.
Ulrik
Socially: There is a spirit of common vision, focused: having databases of the biblical languages. Not too competitive.
Technically: people want to talk across the fields: ICT, Humanities. A lot of common understanding has been generated.
Yet: we are just starting. The real work awaits us: between us and among us.
Showing: The Discovery Bible, with question: is this useful?
Shows NASB text with annotations of verb tenses, linked more generally to a large reference grammar. Same for a lexicon. Plus a concordance.
A scholarship-consumption tool?
Useful for teaching?
Jan: first have to feel it in action myself?
What is added value on top of SESB or Logos?
Ulrik: emphasis is marked in Greek, tense is marked in English.
Ulrik’s solution can be feature-driven on the basis of the wishes of the researchers.
Eep: so far it is an electronic version of a classical object: text, lexicon, grammar.
Oliver: we should have a map of feature lists: who is needing which features. That could function as an interface between scholars and programmers. We could develop larger chunks of functionality, targeted to researchers and teachers or both. What happens in workshops between scholars and programmers should somehow surface into the open.
John Cook: ambivalent. I’m not sure what I want to use in the class-room. We are always waiting for the ultimate (ICT) miracle.
Ulrik: this is a very small program: 2,500 lines of C++ on top of Emdros.
Nicolai: agile development, participatory design that is what I like to see.
Andrea
Whenever you cross a disciplinary boundary you get an epistemic struggle. That is known. But the manifestations are always surprising, and unavoidable.
One’s solutions are another’s problems.
Explain the whys. Find the language for that.
The topics are similar throughout digital scholarship. Do not go for a super-discipline Digital Humanities that eats up the original disciplines.


Workshop materials¶
Rick Brannan¶
I will post the written version of my presentation, “Greek Linguistic Databases: Overview, Recent Work and Future Prospects”, on my personal web site sometime the week of February 13. Check http://www.supakoo.com/rick/papers, it will be at or near the top of the page. If you would like a copy before then, please feel free to email me and I will send it your way.
Update: The paper is available here: http://supakoo.com/rick/papers/Leiden2012-GreekDatabases.pdf
The written version has a section on “Descriptions of Syntactic Analyses of the Greek New Testament Available in Logos Bible Software” that offers brief descriptions of each of the data sets available.
Further, I have posted video examples of the use of Grammatical Relationships, Preposition Use, Query Forms and Syntax Search Templates to YouTube. You can find them:
Grammatical Relationships and Preposition Use: http://youtu.be/MWBDukofiRk
Using Query Forms: http://youtu.be/dmar7jHT4hQ
Syntax Search Templates: http://youtu.be/VJ2mjyxb-Ko
Dirk Roorda¶
Andrea Scharnhorst¶
1. Databases
What can databases do and what can they not do? This is a topic which occurs in very different communities; it deserves further articulation on a more general level, also because it is often “state of the art” in fields. Where are databases helpful, where not, what is left out? Consolidating the community around databases is fine: make a shift from one database, one person (or a group), to shared databases. BUT you also articulated very clearly the constraints that emerge with and due to databases.
2. Open up the database paradigm
Link up with research fronts in CS; linked data and linked open data come to mind. Maybe Dirk, Andrea, Juan and ... can design a workflow from manuscript organization -> open canvas -> tabLinker. Also: how did we traditionally work around text, and where is the place for discourse in formalized processes; reintroduce uncertainty.
3. Visual experimentation
You all already use ‘visuals’ somehow, and you know about their power and drawbacks, for instance when it comes to how to organize the text: vertically or horizontally. Do play around with existing tools, get inspired. Do try to get visualizations around your object of research to the Information Visualization conference (as we did with Wikipedia/UDC). Don’t hide your interfaces!
4. Shared toolbox(es)
Mapping your tools and their diffusion; mutually exploring them; linking them; shared virtual labs (plugin principles used in other communities, e.g. CIShell -> see Katy Boerner’s website); also for the beautiful “toys”, from which one can learn a lot about general principles; but maybe also shared libraries at the code level. Again, open source is the magic word.
5. Complexity
If you want to use “complexity” beyond just a metaphor, and also test some concepts and methods of complexity theory, link up with experts; don’t go for the second best. Or do go for the second best if that is more local and more easily available. BUT do realize: wherever such a collaboration starts, the methods/concepts need to go through a cycle of re-appropriation; in the end they become “biblical scholarship complexity approaches” and might not even be recognizable anymore. It is a long way.
Please do realize: these processes of extending the borders to link with others and move the field, while at the same time trying to keep an identity and perform inner consolidation, will not go without struggles and epistemic fights. There is no innovation, no new idea, without controversy; on the contrary, the more innovative, the more controversial. But your community should be used to these problems; maybe this is a source for tolerance.
Changing the epistemic reference system also means risks: the risk of losing control, of valid knowledge and practices getting forgotten, of getting stuck in redundancy, of losing the epistemic grip on your topics. The solutions of the ones are the problems of the others.
6. Information and documentation - old fashioned library stuff
In the process of consolidation, as an information scientist I think you should have shared resources: a list of sources and editions (and their locations), a list of groups, a list of tools, a list of publications (Mendeley?). Maybe in the way Dirk proposed to share resources; sharing them also means bringing them together and making them refindable, referencable, ... and Juan’s map idea.
See also the section on Andrea at http://demo.datanetworkservice.nl/mediawiki/index.php/LorentzFinal
Nicolai Winther-Nielsen¶
OUTLINE
The presentation deals with PLOTLearner http://www.eplot.eu/project-definition/workpackage-5 as Persuasive Technology for learning, linguistics & interpretation
PLOTLearner is an Emdros database application which is
- a self-tutoring database – interpreters learn from a text (stored in Emdros)
- an application for the study and teaching of a cultural heritage (the Hebrew Bible as a case)
Persuasive technology seeks to move from Computer-assisted Language Learning (CALL) to Task-based Language Learning (TBLT)
Tool development is documented as the move from PLOTLearner 1, for skill training in morpho-syntax, to PLOTLearner 2, for beginners’ learning of text interpretation.
PROBLEM
Some answers to Talstra’s questions on integrating Linguistic and Literary data types.
Main points:
- Analysis should broaden the perspective to include syntax, discourse and labeling of interpretations
- Teaching should shift to learning and we should move from exclusively focusing on forms to a focus on form, which will include semantics, pragmatics and discourse connectivity
Research questions on what kind of knowledge?
- all learning must address all levels of data in the texts, e.g. Hebrew Bible
- linguistic theory has an important role to play in interpretation and teaching
- interpretative labeling is relevant as a supplement to categories and features
Analytical instruments
- New technology must support persuasive language learning
- New needs and tasks are to train scholars and teachers for use of learning technology
- The goal must be to repurpose interpretation into an activation of an engagement with grammar
FLOW OF PRESENTATION
Construction of text databases is not only a matter of how to include historically embedded data and annotation with contextual information, but also of how to learn from and engage with the content of databases, and thus how to develop useful applications for the study and teaching of a cultural heritage stored in a database.
The presentation seeks to move beyond Computer-Assisted Language Learning (CALL) and exploit insights from current Task-Based Language Learning (TBLT) in an attempt to describe the skeleton and functions of a corpus-driven, learner-controlled tool for active learning from a database of the Hebrew Bible.
Or in more simple terms: how will a self-tutoring database enable interpreters to learn from a text?
This is the core question for a presentation of a European Union life-long learning project, EuroPLOT http://www.eplot.eu/home, which is developing Persuasive Learning Objects and Technologies – or in short PLOT – for several pedagogical tasks in 2010-2013. In this project Nicolai Winther-Nielsen and Claus Tøndering have released the first prototype of PLOTLearner http://www.eplot.eu/project-definition/workpackage-5, which is being developed at Aalborg University in Denmark. An early version of this tool was presented at the 2011 Annual Meeting of the Society of Biblical Literature in San Francisco http://europlot.blogspot.com/2011/11/europlot-at-sbl-in-san-francisco.html, and the tool is now being tested in agile development processes in Copenhagen, Madagascar and Gothenburg.
At this stage of development we are already able to offer and test basic skill training generated from the Werkgroep Informatica database of the Hebrew Bible, as well as for the Tischendorf Greek New Testament and the German TIGER corpus of the Frankfurter Rundschau. We are now exploring how the Emdros database management system can be used for more than simple learning skills like writing, reading and parsing of morphology and syntax, beyond the existing tools and systems reported on by Ulrik Sandborg-Petersen and the presenter in the Eep Talstra Festschrift from October 2011 http://europlot.blogspot.com/2011/10/publications-on-data-driven-hebrew.html.
For the Lorentz Colloquium there were two goals. First, I wanted to demonstrate what we can do with Hebrew language learning through the first prototype, and how we are improving the interface and the organization of the grammatical categories. Second, we wanted to get new ideas on how to improve the interface and functionality for the next prototype of PLOTLearner. The tool supports data-driven learning from a corpus, but it is not clear how we should proceed from the current game-based learning technology for skill training to a persuasive visualization of a historical text for learners. We also want to understand the goals and needs of learners and their instructors better, and to explore how we may help them to reuse and repurpose our tool for different groups and styles of learners and for different pedagogical approaches.
To answer this, EuroPLOT is exploring three central issues concerning interpretation, persuasion and pervasiveness. One issue is how problem-based learning activities can be supported by our tool at an introductory learner stage and feed into meaningful tasks. Another issue is how persuasive effects and effectiveness can be triggered and measured through our technology. The third and final issue is how to deal with open educational access to data which are licensed to publishers, and how we can make them available globally for learning in the Majority World.
If we can implement solutions for all three areas we believe we will be able to persuasively support the European Community’s goals for community building through intelligent learning systems and sustainable ecologies.
The presentation Databases for research, training and teaching is available online as a screen-capture video on 3BMoodle http://3BMoodle.dk; click on http://www.3bmoodle.dk/file.php/1/Lorentz-NWN-120210-2.swf.
Links by the participants¶
Hebrew Text Database¶
Jan on Citizen Science¶
Dirk¶
Taverna Open Source Workflow sharing. List of users
Shared Canvas: a web-based page turning application based on Open Annotation.
Matthew¶
RapidMiner Download: http://rapid-i.com/
Andrea¶
Gephi - for network visualization http://gephi.org/
Katy Boerner’s tool (where the workshop in Amsterdam is on Feb 16; registration via ehumanities.nl) http://sci.slis.indiana.edu/
ManyEyes, works like wordle: upload data and explore them http://www-958.ibm.com/software/data/cognos/manyeyes/
Places & Spaces exhibition www.scimaps.org
Visual Complexity: another visualization website for inspiration http://www.visualcomplexity.com/vc/
Website of Clement Levallois, who organized the visualization workshop together with Stephanie Steinmetz (Wido mentioned this; VOSviewer was presented there). He works in Leiden and Rotterdam: http://www.clementlevallois.net/index.php
Ben Fry’s site, from which I took the Darwin example: http://benfry.com/
Drastic Data: web-based exploration of data; website of Olav ten Bosch and his project with DANS on the Dutch census. http://www.drasticdata.nl/DDHome.php and http://www.drasticdata.nl/ProjectVT/ - enjoy playing.
MagnaView: software to visualize databases, as easy as working with chart styles in Excel. The (small) firm is open to academic licensing and is interested in unusual uses of its tool in research contexts: http://www.magnaview.nl/ I have been in contact with Dr. Erik-Jan van der Linden and we used the tool.
References¶
People¶
- Dirk Roorda
- researcher at DANS
- Jan Krans
- researcher at VU University Amsterdam, faculty of theology
- Juan Garcés
- academic coordinator at Göttingen Centre for Digital Humanities
- Matthew Munson
- researcher at Göttingen Centre for Digital Humanities
- Tommy Wasserman
- lecturer at Örebro School of Theology
- Ronald Dekker
- head of ICT at Huygens ING
- Charles van den Heuvel
- head Research Group History of Science and Scholarship at the Huygens Institute for the History of the Netherlands.
- Erik-Jan Bos
- researcher at the Descartes Centre for the History and Philosophy of the Sciences and the Humanities
- Walter Ravenek
- scientific developer for the CKCC project
- Eko Indarto
- developer at DANS
- Vesa Åkerman
- developer at DANS
- Paul Boon
- developer at DANS
Applications¶
- Queries and Features as Annotations.
- Demo application by Dirk Roorda and Eko Indarto. Wiki by Dirk Roorda.
- web2py
- Free open source full-stack framework for rapid development of fast, scalable, secure and portable database-driven web-based applications. Written and programmable in Python. Lead developer: Massimo di Pierro.
- CollateX.
- On GitHub. We are using a developer version by Ronald Dekker. More information at a demo instance of this service.
- Topics and Keywords as Annotations.
- Demo application TKA by Dirk Roorda, Eko Indarto, Vesa Åkerman and Paul Boon. Wiki TKA (this page) by Dirk Roorda.
- web2py
- See above.
Data¶
- Eep Talstra, Constantijn Sikkel, Reinoud Oosting, Oliver Glanz, Janet Dyk
- Text Database of the Hebrew Bible. Data Archiving and Networked Services. Persistent identifier: urn:nbn:nl:ui:13-ukhm-eb
- HuygensING-CKCC (2011-09-09)
- CKCC. Project Circulation of Knowledge and learned practices in the 17th-century Dutch Republic. A web-based Humanities Collaboratory on Correspondences (Geleerdenbrieven) - archived version d.d. 2013-07-23. Data Archiving and Networked Services. Persistent identifier: urn:nbn:nl:ui:13-scpm-ji
- Tommy Wasserman (author), Jan Krans and Dirk Roorda (contributors)
- Transcription of the manuscripts containing the New Testament letter of Jude. Data Archiving and Networked Services. Persistent identifier: urn:nbn:nl:ui:13-qxf4-1v
Projects¶
Literature¶
- Tommy Wasserman
- The Epistle of Jude: Its Text and Transmission. ConBNT 43. Stockholm: Almqvist & Wiksell International, 2006. Eisenbrauns
- Peter Wittek and Walter Ravenek
- Supporting the Exploration of a Corpus of 17th-Century Scholarly Correspondences by Topic Modeling. CLARIN paper 2011
- Dirk Roorda, Charles van den Heuvel
- Annotation as a New Paradigm in Research Archiving (searchable pdf). Proceedings of ASIS&amp;T 2012 Annual Meeting. Final Papers, Panels and Posters. https://www.asis.org/asist2012/proceedings/Submissions/84.pdf
- Dirk Roorda, Erik-Jan Bos, Charles van den Heuvel
- Letters, ideas and information technology: Using digital corpora of letters to disclose the circulation of knowledge in the 17th century. Digital Humanities 2010, London
- OAC
- Open Annotation Collaboration.
- FRBR
- Functional Requirements for Bibliographic Records. Quick introduction on wikipedia.