Arranging the novels from the txtLAB

Ben Schmidt, January 19, 2016

Andrew Piper announced yesterday that the McGill text lab is releasing their corpus of modern novels in three languages. One of first thoughts with any corpus is: what existing Bookworm methods might add some value here? It only took about ten minutes to write the code to import it into a bookworm; the challenge is figuring how methods developed for millions of books can be useful on a set of just 450.

It might work better if cut up into chapters or sections as Piper did for his article on the conversional novel[@piper_novel_2015]. I could wander through a bunch of metadata plots to explore the authors and their language; barcharts of how often “Waverly” uses “the” and the like. There’s one at the end.

But instead, I want to just try one particular plot type, the vectorspace plot, in the Bookworm D3 library. It’s powerful, somewhat opaque, and underused. The scatterplot of classes according to weighted feature counts is one of the most basic tools in the text plotting arsenal. The most powerful of these is the scatterplot by principal components of vocabulary. (I still like my old post explaining how principal components are related to feature counts, if you want more explanation.) These are incredibly common; probably only linecharts over time are more useful. (The first Stanford Literary Lab pamphlet used them heavily, as does CMU docuscope program. The base idea here is an interactive that encompasses thousands of versions of this basic plot that lets us uncover relative positionings of novels in any arbitrary space defined by language usage.

So, here’s a first stab at the chart; I’ll walk through a few different incarnations before giving you the best one to play with.

On the first one, the x axis will be uses of the word “me” and the y axis uses of the word “they”. Authors are shown for each novel, with coloration showing whether it’s first or third person. Mouseover any author to see the title.


{
	"database": "txtLAB450",
	"plotType": "vectorspace",
	"method": "return_json",
	"words_collation": "casesens",
	"search_limits": {
		"word": ["me","they"],
		"language": ["English"]
	},
	"aesthetic": {
		"variable": "WordsPerMillion",
		"dimensions": "*unigram",
		"color": "person",
		"label": "author","group":"title"
	},
    "weights": {
	    "me": {
            "x": 1,
            "y": 0
        },
        "they": {
            "x": 0,
            "y": 1
        }}
	}

Peter Pan uses “they” the most of any novel; Mary Shelley’s ‘Mathilda’ uses “me” more than any other. Just these two words do a half-passable job at separating first and third-person narration; first person narration is in the lower right quadrant, and third person in the upper left. In fact, this visualization quickly foregrounds an ambiguity in the textlab’s metadata: Mary Shelly’s Frankenstein is listed as third person, but shows up as heavily using “me.” It’s actually (according to Cliff’s notes online; I won’t pretend I can remember unassisted) a pastiche of various first-person or epistolary forms.

Suppose we wanted to use a variety of pronouns to find more limit cases like this. We’ve already run out of dimensions for the computer screen at two, but we could pile first-person onto the dimension of “me” and third-person with “they.” But what direction should second-person pronouns go in?

The general solution when using a variety of words is to let principal components do the organizing. Here’s the same plot with several more pronouns, and the positions of each text positioned on the first two principal components of the words.

Pronouns


{
	"database": "txtLAB450",
	"plotType": "vectorspace",
	"method": "return_json",
	"words_collation": "casesens",
	"search_limits": {
		"word": ["I","me","they","you","her","she","he","we","us","them","him"],
		"language": ["English"]
	},
	"aesthetic": {
		"variable": "WordsPerMillion",
		"dimensions": "*unigram",
		"color": "person",
		"label": "author","group":"title"
	}
}

The default layout here places the texts according to the principal components of their use of each of the terms. Those weights are represented by the red dots in below the chart. The small dot in the center is a weight of [0,0]; towards the right is a positive weight in PC1, and towards the top a positive weight in PC2. In mine, for example, “you” shows up lying almost entirely along the first component (x axis) and “they” more along the second (y axis), while “she” and “her” are positioned both towards the south and the west. (You might be seeing a mirror image; principal components are arbitrarily signed, and I don’t know if this Javascript implementation of PCA I’m using always returns the same sign.)

Here we see the ambiguity of choosing “first” or “third” more clearly. “Frankenstein” is still conspicuously off; but Heart of Darkness (which is structured as one first-person narrative, Marlow’s, in the quotation marks of a second first-person narrator) and Willkie Collins’s The Woman in White (like Frankenstein, multiple first person narrators) also both stick out as only tenuously third-person. In the other half of the chart, Amelia Opie’s Adeline Mowbray is flat-out miscategorized, as far as I can tell from looking at the text; Ulysses, as I recall, is more third person than first; and Ethan Frome is only first person in the two bookend chapters. Waverly, too, has a little first-person in the opening from the “author of Waverley,” insofar as he might be a fictional narrator; but really it’s a third person novel.

I go through this not to nitpick the metadata from McGill but to establish that the chart is working; principal components on feature counts of pronouns can help clarify or complicate expert accounts of narrative voice. That’s not necessarily the holy grail of “something we didn’t know before” for discipline-specific publishing, but it does pass the infrastructure bar of highlighting shortcomings in existing resources.

Of course, principal components isn’t the only way to plot this. Looking at narrative voice, for example, we might want to use linear disciminant analysis (AKA, the original LDA) to split the narrative types left from right. (Top to bottom is arbitrary).1

Linear discriminants of third- and first-person narrative


{
	"database": "txtLAB450",
	"plotType": "vectorspace",
	"method": "return_json",
	"words_collation": "casesens",
	"search_limits": {
		"word": ["I","me","they","you","her","she","he","we","us","them","him"],
		"language": ["English"]
	},
	"aesthetic": {
		"variable": "WordsPerMillion",
		"dimensions": "*unigram",
		"color": "person",
		"label": "author","group":"title"
	},"weights":{"he": {"y": 0.5, "x": -0.08489903}, "her": {"y": 0.5, "x": "0.18047290"}, "him": {"y": 0.5, "x": 0.02929207}, "I": {"y": 0.5, "x": -0.16673649}, "me": {"y": 0.5, "x": -0.42734463}, "she": {"y": 0.5, "x": -0.16415530}, "them": {"y": 0.5, "x": -0.13469374}, "they": {"y": 0.5, "x": "0.47925630"}, "us": {"y": 0.5, "x": "0.29900222"}, "we": {"y": 0.5, "x": -1}, "you": {"y": 0.5, "x": 0.38}}
	}

What’s really important here, though, is that the choice of pronouns for the vectorspace is arbitrary. Pronouns matter for personhood; but any object of study might involve a different dimensional space. Here’s a very different vector space: principal components by color.

English novels arranged by their use of color words


{
	"database": "txtLAB450",
	"plotType": "vectorspace",
	"method": "return_json",
	"words_collation": "casesens",
	"search_limits": {
		"word": ["white","black","red","blue","green","yellow","brown","grey","orange","purple","pink"],
		"language":["English"]
	},
	"aesthetic": {
		"variable": "WordsPerMillion",
		"dimensions": "*unigram",
		"color": "gender",
		"label": "author","group":"title"
	}
}

You’d have to be a literary critic to perform a real reading on this, but I think it should be possible. Whiteness dominates the scaling here on both dimensions; the first principal component essentially measures how much “white” and “black” used , while the second contrasts “white” against “blue,” “grey,” “green,” and “red.” Katherine Mansfield’s “The Aloe” is a particularly white-heavy novel; war literature of Dos Passos and Stephen Crane makes use of a wider palate. I set the coloration to gender, but there isn’t much of a gender distinction here. There does, though, seem to be some stylistic consistency by authors; multiple Kipling novels feature a black-and-white use of coloration, while three novels by Virginia Woolfe score highly on the broader spectrum of colors.

“White” and “black” are only sort of colors. As I said, this is a true interactive; you can choose which words you want in the vectorspace. So go up to the text field above that chart, delete “white” and “black,” and hit “enter” on the keyboard. You’ll see that all the novels snap to a line; this is because “white” was taking up a dimension pretty much on its own, and the old weights are retained. Click the button that says “reset to principal components” and you’ll get principal components for the new set. Now the chart is dominated by “grey” and oddly “green” versus “red” and “blue”; bright versus natural colors, perhaps. These succesfully distinguish sharply between Dos Passos, who is highly grey, and Stephen Crane, who is red.

It’s possible that you might also to change the weights yourself. One of the great historical documents of data visualization is John Tukey’s description of the PRIM-9 system for navigating a multidimensional space just like this. When I built this visualization I was inspired by that to make the knobs turnable; so that rather than just being bound by components, you can drag and reset to any arrangement you want. Turn off the effect of “gray” above if you want, or, reduce down just to words to get a different palate.

If you grab one of the red dots and swirl it around, you’ll be watching the projection rotate in higher-dimensional space around your individual word. Think about it as happening in three dimensions.

The full-featured browser

So here’s the fully featured version of the interactive, including controls to switch to French or German 2

I’ve preset it with some random thematic words: ["death","love","money","work","home","eyes","read","father","mother","children"] What happens is interesting; essentially, down and to the left (in my projection) shows novels that draw most heavily from language of motherhood, home, and children; up and to the left are novels that draw most heavily from language of death, love, work, and fatherhood. To the right are novels that don’t draw from either; it might be worthwhile to do some normalization, but this is enough for now.

You can switch them to any others.

That’s all for now. Let me know if you find this useful or want to see it on another corpus. We could do it on the Hathi Trust, for example, with full LC classifications instead of individual novels.


{
	"database": "txtLAB450",
	"plotType": "vectorspace",
	"method": "return_json",
	"words_collation": "casesens",
	"search_limits": {
		"word": ["death","love","money","work","home","eyes","read","father","mother","children"], 
		"language": ["English"]
	},
	"aesthetic": {
		"variable": "WordsPerMillion",
		"dimensions": "*unigram",
		"color": "person",
		"label": "author","group":"title"
	}
}

  1. For the time being, the weight indicators are wrong here. I’ll have to look into that later.

  2. Unfortunately, there’s no really good way to compare all the languages at once other than to see the three groups: it would be nice if there were multi-lingual collation so the “Vienna” and “Wien” would map to the same place, but that requires formalizing some stuff in the Bookworm token backend.