An API for TEI: Shakespearean examples.

Ben Schmidt, July 10, 2015

I had a couple of conversations at the DH conference in Australia about the potential for Bookworm as an alternative means of access to TEI-encoded texts.

This is something that I’ve been thinking about for a while, and have discussed at substantial length with Julia Flanders and Joe Wicentowski, to the point of even once putting together a grant proposal with Julia. In any case, I wanted to put this up here now as a sort of reference or provocation.

Need

The basic goal here is to tie the rich information embedded in TEI texts to the underlying text as metadata rather than markup, in ways that make it easy to use in computation.

Bookworm generally supports bag-of-ngrams approaches. (The extensions allow many of the Stanford NLP tools, like named-entity extraction, to work easily on these materials as part of pre-processing.)

Essentially what this does is let us much more easily fulfill some of the suggestions in Michael Witmore’s article “Text: A Massively Addressable Object.” (I can’t praise this article enough: it has a too-little-remarked-upon status as one of the central referents for DH textual scholarship, expressing as important an idea as, say, Ramsay’s “Algorithmic Criticism” just as clearly.) As Witmore says, a text can be addressed on many levels. Each of the following is a text that can be usefully analyzed on its own: a novel, a chapter, “every first line of an Emily Dickinson poem,” or “everything published by Scribner and Sons, ever.” You could say that to this point, we have privileged the levels of address that are humanly feasible in the codices we usually encounter; we read books one at a time, we compare authors to themselves and their contemporaries, but we rarely explore the extremely large corpus, or the moderately sized one that is divided across thousands of books. (You could perhaps read every letter in every archive dated March 13, 1730, but the cost of going to see them would be prohibitive.) Digital tools make such reconfigurations much easier.

Bookworm’s existing API can handle many of the important forms of analysis one might want to explore in a TEI-encoded set of texts. The Bookworm API provides a vocabulary for describing queries across these multiple addresses. If you want to return the full set of words in the first lines of Emily Dickinson’s poems with their counts, for example, the query would look like this:

{
  "search_limits":{
    "author":["Dickinson, Emily"],
    "line_number":[1]
  },
  "groups":["unigram"],
  "counttype":["WordCount"]
}
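
To make the shape of such a request concrete, here is a minimal Python sketch that serializes the query above and attaches it to a URL. The host name, the endpoint path, and the parameter name "query" are all assumptions for illustration; a real Bookworm installation may expose its API differently, and this sketch sends nothing over the network.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint: substitute the address of a real Bookworm installation.
BOOKWORM_ENDPOINT = "http://bookworm.example.org/cgi-bin/dbbindings.py"


def build_query_url(query, endpoint=BOOKWORM_ENDPOINT):
    """Serialize a query dictionary to JSON and attach it as a URL parameter.

    The parameter name "query" is an assumption about the server's interface.
    """
    return endpoint + "?" + urlencode({"query": json.dumps(query)})


dickinson_first_lines = {
    "search_limits": {
        "author": ["Dickinson, Emily"],
        "line_number": [1],
    },
    "groups": ["unigram"],
    "counttype": ["WordCount"],
}

url = build_query_url(dickinson_first_lines)

# Decoding the URL again recovers the original query unchanged.
decoded = json.loads(parse_qs(urlparse(url).query)["query"][0])
print(url)
```

Keeping the query as an ordinary dictionary until the last moment means the same structure can be edited, compared, or logged before it is ever sent anywhere.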

But while the Bookworm API affords a language to easily express a query like this, there isn’t necessarily a database on which we could run it. The Hathi Corpus contains most (all?) of Emily Dickinson’s poems, but in many different copies, and not marked up in a way that lets us tell where one poem ends and the next begins, let alone which line is number 1.

That’s where TEI comes in. When TEI-encoded texts exist, they offer a substantially finer-grained level of address than just the individual volume. In the Hathi Trust, computers still don’t know in any reliable general way where chapters begin and end; we don’t know who speaks dialogue; even the rather mundane task of removing running headers requires a fair amount of code that has been replicated and triplicated.

Shakespeare

By telling Bookworm to treat tag intersections as the lowest level of all possible encounters in a TEI text, we can make more interesting comparisons for textual work using the efforts of TEI encoders. Take the following quote as an example.

Then doth it well appear the Salic law

Was not devisèd for the realm of France,

Nor did the French possess the Salic land

Until four hundred one and twenty years

After defunction of King Pharamond,

Idly supposed the founder of this law,

Who died within the year of our redemption

Four hundred twenty-six; and Charles the Great

Subdued the Saxons and did seat the French

Beyond the river Sala in the year

Eight hundred five…

We would usually refer to this by a single identifier: Henry V, Act 1, Scene 2, lines 60-69 (if I’m counting right).

That hierarchy is linear: it presumes the only way you can find a string of text is by an identifier that situates it in the text as a linear string.

Here are some of the attributes that the Folger has for that same set of lines:

{
  "TEI_xmlns": "http://www.tei-c.org/ns/1.0",
  "TEI": true,
  "sp_who": "#BishopOfCanterbury_H5",
  "ab": true,
  "author": "William Shakespeare",
  "text": true,
  "sp": true,
  "body": true,
  "div1_type": "act",
  "div2_n": "2",
  "div2_type": "scene",
  "docname": "TEIfiles/Folger_Digital_Texts_Complete/H5.xml",
  "title": "The Life of Henry V",
  "div1_n": "1",
  "div2": true,
  "div1": true,
  "editor": "Paul Werstine"
}

Most of these are uninteresting; they record things like the author, or specify the location with an incredible degree of precision. It is not just {"act": 1, "scene": 2}, but rather {"div1_type": "act", "div1_n": "1"}.

The markup does contain the information about the line numbers as well; those aren’t included for reasons detailed below. I’ve downloaded all of Shakespeare’s plays from the Folger Library, and then written a parser that recursively tracks through all of the tags to find out, for every word, which levels apply to it. (Most of the code is source-agnostic, meaning it should work on any XML, but there are a number of special cases I know about, and probably many I don’t, that any other source would require incorporating.)
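
The general idea of that recursive walk can be sketched in a few lines of Python. This is an illustration of the technique only, not the parser described above: the function names, the tokenizing regular expression, and the tiny hand-made sample are all assumptions, and none of the Folger special cases are handled.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"\w+|[^\w\s]")


def local_name(tag):
    """Strip any '{namespace}' prefix from an element or attribute name."""
    return tag.rsplit("}", 1)[-1]


def walk(element, inherited=None):
    """Yield (tokens, metadata) pairs for every run of text under `element`.

    Each element adds its own name (as True) and its attributes (as
    "tag_attribute") to the metadata inherited from its ancestors.
    """
    metadata = dict(inherited or {})
    tag = local_name(element.tag)
    metadata[tag] = True
    for name, value in element.attrib.items():
        metadata[tag + "_" + local_name(name)] = value

    tokens = TOKEN.findall(element.text or "")
    if tokens:
        yield tokens, metadata
    for child in element:
        yield from walk(child, metadata)
        # Text after a child's closing tag belongs to the parent, not the child.
        tail_tokens = TOKEN.findall(child.tail or "")
        if tail_tokens:
            yield tail_tokens, metadata


# A hand-made fragment, much simpler than a real Folger file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div1 type="act" n="1"><div2 type="scene" n="2">
<sp who="#BishopOfCanterbury_H5"><ab>Then doth it well appear the Salic law</ab></sp>
</div2></div1></body></text></TEI>"""

chunks = list(walk(ET.fromstring(SAMPLE)))
tokens, metadata = chunks[0]
print(tokens)
print(metadata)
```

Every run of text ends up carrying the full stack of tags and attributes above it, which is what lets any combination of them serve as a search limit or a grouping.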

What does this let us do? First off, we can simply confirm that the plays are there.

(Although these are all charts rather than the raw numbers returned from the API, keep in mind the charts are built using the API and the numbers are, in fact, available for you to see in the tabs. I’m just taking it for granted that no one wants to look at numbers.)

How many tokens are in each play?

{
    "database": "TEIworm",
    "plotType": "barchart",
    "method": "return_json",
    "search_limits": {},
    "aesthetic": {
        "x": "TotalWords",
        "y": "title"
    },
    "counttype": ["TotalWords"],
    "groups": ["title"]
}

If these numbers seem off by even 50% to you, particularly on the high side, that may just be because no one agrees on what a tokenization is. Bookworm’s includes every punctuation mark, for example, and some but currently not all of the paratext from the Folger editions. It’s certainly true that Hamlet is the longest play.
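
A small Python example shows how far two reasonable tokenization rules can diverge on a single line from the speech quoted earlier. Both regular expressions are illustrative assumptions, not Bookworm’s actual tokenizer.

```python
import re

LINE = "Four hundred twenty-six; and Charles the Great"

# Rule A: every run of letters or digits is a token, and so is every
# punctuation mark (including the hyphen inside "twenty-six").
with_punctuation = re.findall(r"\w+|[^\w\s]", LINE)

# Rule B: only words count, and a hyphenated compound is a single word.
words_only = re.findall(r"\w+(?:-\w+)*", LINE)

print(len(with_punctuation), len(words_only))  # prints: 10 7
```

Ten tokens against seven on one line is a gap of more than 40%, which is the scale of disagreement that can make a per-play total look wrong.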

This is not interesting. But we can bring in some of the core Bookworm functionality to make it a little more active: what are the usages of different words in different plays?

Word Usage by plays

{
    "database": "TEIworm",
    "plotType": "barchart",
    "method": "return_json",
    "search_limits": {"word": ["betray", "betrays", "betrayed"]},
    "aesthetic": {
        "x": "WordsPerMillion",
        "y": "title"
    }
}
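
The "WordsPerMillion" measure in this query normalizes raw counts by the size of each group, so that a long play is not favored simply for being long. Assuming it is computed as the matching words divided by all words in the group, scaled to a million, the arithmetic looks like the Python sketch below. The counts are invented for illustration and are not results from the database.

```python
def words_per_million(word_count, total_words):
    """Rate of a word (or set of words) per million tokens in one group."""
    if total_words == 0:
        return 0.0
    return 1_000_000 * word_count / total_words


# Invented counts for two hypothetical plays, purely for illustration.
counts = {
    "Play A": {"WordCount": 6, "TotalWords": 30_000},
    "Play B": {"WordCount": 6, "TotalWords": 20_000},
}

rates = {
    title: words_per_million(c["WordCount"], c["TotalWords"])
    for title, c in counts.items()
}
print(rates)  # prints: {'Play A': 200.0, 'Play B': 300.0}
```

The same six occurrences give a higher rate in the shorter play, which is why the chart uses the rate rather than the raw count.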

But that’s just the highest level of address. The tags also break it out by characters, so that you can see which characters use particular words the most.

Shakespeare’s top 40 characters, by the percentage of time they use a given word.

{
    "database": "TEIworm",
    "plotType": "barchart",
    "search_limits": {
        "word": ["love"],
        "sp_who__id": {
            "$lte": 40
        }
    },
    "aesthetic": {
        "x": "WordsPerMillion",
        "y": "sp_who"
    }
}

The character names here are ugly but should be apparent: #Julia_TGV is “Julia from Two Gentlemen of Verona,” #Antony_JC is “Antony from Julius Caesar,” and so forth. The only trick has to do with the complexities of the Folger markup. “#HenryVI_1H6” is “Henry VI, from Henry VI part 1,” but the tag refers to his actions in any play where Henry VI appears.

Number of tokens per character in Henry IV parts I (left) and II (right)

{
    "database": "TEIworm",
    "plotType": "slopegraph",
    "method": "return_json",
    "search_limits": {
        "title": ["The History of Henry IV, Part 1"]
    },
    "compare_limits": {
        "title": ["Henry IV, Part 2"]
    },
    "aesthetic": {
        "left": "WordCount",
        "right": "TotalWords",
        "label": "sp_who"
    },
    "counttype": ["WordCount", "TotalWords"],
    "groups": ["sp_who"]
}

What would it take to build a fully realized unigram API for TEI?

This is incomplete in a lot of ways.

Search results are not handled well. One of the major strengths of Bookworm is that you can click through to underlying texts; but here, you just get some broad description of the source. We can certainly do better; in theory, any Bookworm result should correspond to an XPath and be compatible with certain forms of stylesheet representation. At worst, we could default to storing the full text in the database, although I’ve been reluctant to allow that in general for security reasons.
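
As a rough Python illustration of that correspondence, the flattened metadata attached to a result can be turned back into a path expression and run against a document. Everything here is an assumption for the sake of the sketch: the function name, the requirement that the caller supply the nesting order, and the namespace-free sample. Attribute values are not escaped, so a value containing a quotation mark would break the expression.

```python
import xml.etree.ElementTree as ET


def metadata_to_xpath(metadata, nesting):
    """Turn flattened keys such as "div1_type" into a descendant XPath.

    `nesting` lists element names from outermost to innermost; the flattened
    metadata does not record that order, so the caller has to supply it.
    """
    steps = []
    for tag in nesting:
        predicates = "".join(
            "[@{}='{}']".format(key[len(tag) + 1:], value)
            for key, value in sorted(metadata.items())
            if key.startswith(tag + "_")
        )
        steps.append(tag + predicates)
    return ".//" + "//".join(steps)


result_metadata = {
    "div1": True, "div1_type": "act", "div1_n": "1",
    "div2": True, "div2_type": "scene", "div2_n": "2",
    "sp": True, "sp_who": "#BishopOfCanterbury_H5",
}

# A hand-made, namespace-free fragment; real TEI files declare a namespace.
SAMPLE = """<body><div1 type="act" n="1"><div2 type="scene" n="2">
<sp who="#BishopOfCanterbury_H5"><ab>Then doth it well appear the Salic law</ab></sp>
<sp who="#HenryV_H5"><ab>May I with right and conscience make this claim?</ab></sp>
</div2></div1></body>"""

xpath = metadata_to_xpath(result_metadata, ["div1", "div2", "sp"])
matches = ET.fromstring(SAMPLE).findall(xpath)
speech_text = "".join(matches[0].itertext())
print(xpath)
print(speech_text)
```

The expression picks out only the Bishop’s speech, which is the kind of click-through a search result would need.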

The metadata contained in TEI headers is complicated and not easily reduced in any routine way. When assigning a universal name across the corpus, can an individual play be used as an authority? Shakespeare has a character the Folger identifies as “HenryV_H5.” Should the English-language name for this character be “Henry V” even when describing the actions of Prince Hal in Henry IV? Is information from Henry V even conceptually appropriate when applied to the same characters in Henry IV? The current approach selects names and sexes for characters more or less at random from all the information we have about them; the result is that some may be known by their less common characteristics.

The handling of back-references in general is still arbitrary. I have a literal directive in the code to link xml:ids from the “who” attributes of “sp” tags; this could be extended to any other set of relationships, but if, for instance, the markup for a speech included both the speaker and the recipient, a fully automatic linking of tags (which I’ve only sketched out here) wouldn’t be possible.

Finally, and most critically, the TEI tags must fully encompass the text for the internal logic of this approach to work. This is where the caveat about lines, above, comes into play. The Folger has chosen to record line information as “milestones” (which do not wrap any text) as opposed to any type of “div” or “l” elements (which include the internal text). As a result, although I can see where a line number begins, individual phrases are not wrapped inside a line. This isn’t a great loss right now, and there are ways around this that may be worth exploring at a later date. (In fact, the line information is included in the “w” and “c” elements.) But it might be worth finding a way to allow milestone attributes to ascribe characteristics to texts, as div attributes do.
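
One possible way around the milestone problem, sketched in Python, is to walk the document in order and let each milestone set a value that applies to every word until the next milestone. The element names, the "unit" value, and the sample are simplified stand-ins chosen for illustration, not the Folger’s actual markup, and this is not something the current code does.

```python
import xml.etree.ElementTree as ET

# A hand-made fragment: milestones mark where a line starts but wrap nothing.
SAMPLE = """<ab>
<milestone unit="line" n="1"/><w>Then</w> <w>doth</w> <w>it</w>
<milestone unit="line" n="2"/><w>Was</w> <w>not</w>
</ab>"""


def words_with_milestones(root, unit="line"):
    """Pair each <w> with the n value of the most recent matching milestone."""
    current = None
    pairs = []
    for element in root.iter():  # iter() visits elements in document order
        if element.tag == "milestone" and element.get("unit") == unit:
            current = element.get("n")
        elif element.tag == "w":
            pairs.append((element.text, current))
    return pairs


pairs = words_with_milestones(ET.fromstring(SAMPLE))
print(pairs)
```

Treating a milestone as a piece of running state rather than a container would let line numbers be attached to words in the same way as the attributes of enclosing tags.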

If the header information is stored in a separate file from the text, things may become yet more complicated.