This blog shares the results of quick explorations into the Bookworms hosted on this server.
It's built using Hakyll: if you have a bookworm of your own, you can clone it and write your own set of posts against the Bookworm API without having to edit the templates. I'm leaving plenty of uncompleted or unexplained material up here, so if I haven't linked to a post from Twitter or somewhere similar, no promises that it will even be comprehensible.
Andrew Piper announced yesterday that the McGill text lab is releasing their corpus of modern novels in three languages. One of my first thoughts with any new corpus is: what existing Bookworm methods might add some value here? It only took about ten minutes to write the code to import it into a bookworm; the challenge is figuring out how methods developed for millions of books can be useful on a set of just 450.
It might work better if cut up into chapters or sections, as Piper did for his article on the conversional novel[@piper_novel_2015]. I could wander through a bunch of metadata plots to explore the authors and their language; bar charts of how often “Waverley” uses “the” and the like. There’s one at the end.
But instead, I want to try just one particular plot type, the
vectorspace plot, in the Bookworm D3 library. It’s powerful, somewhat opaque, and underused. The scatterplot of classes according to weighted feature counts is one of the most basic tools in the text-plotting arsenal, and the most powerful of these is the scatterplot by principal components of vocabulary. (I still like my old post explaining how principal components are related to feature counts, if you want more explanation.) These are incredibly common; probably only linecharts over time are more useful. (The first Stanford Literary Lab pamphlet used them heavily, as does CMU’s DocuScope program.) The basic idea here is an interactive that encompasses thousands of versions of this plot, letting us uncover the relative positions of novels in any arbitrary space defined by language usage.
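To make the underlying idea concrete, here is a minimal sketch of what a principal-components-of-vocabulary scatterplot computes: project each document's word-count vector onto its first two principal components and plot the resulting coordinates. This is illustrative only (the toy documents are mine, and the real vectorspace plot is driven by the Bookworm API and D3, not scikit-learn).

```python
# Sketch of the computation behind a PCA-of-vocabulary scatterplot.
# Toy documents stand in for novels; coordinates would feed a scatterplot.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the whale the sea the ship",
    "love marriage the estate",
    "the harpoon sea voyage ship",
    "marriage love letters estate",
]

# Build the document-by-word count matrix (the "weighted feature counts").
counts = CountVectorizer().fit_transform(docs).toarray()

# Reduce vocabulary space to two principal components for plotting.
coords = PCA(n_components=2).fit_transform(counts)
for doc, (x, y) in zip(docs, coords):
    print(f"{x:7.2f} {y:7.2f}  {doc}")
```

Nearby points in this two-dimensional space are documents with similar vocabulary profiles; the interactive version just lets you swap in different metadata groupings and word weightings.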
A first pass at understanding the potential of the Hansard corpus through a Bookworm browser.
Sketching out orthographic normalization with word2vec
Rejecting the gender binary: a vector-space operation
Vector Space Models for the Digital Humanities
Bookworm D3 layouts
Bookworm 0.4 is now released on GitHub. It contains a number of improvements to the code from over the summer. Based on the experience of the many people who have used it so far, it makes the existing code much, much more sensible for anyone wanting to build a bookworm on their own collection of texts. All the stages (installation, configuration, and testing) are now a lot easier. So if you have a collection of texts you wish to explore, I welcome you to test it out. (I’ll explain at more length later, but for the absolute lowest investment of time you can just run a prebuilt bookworm virtual machine using Vagrant.)
This post is just kind of playing around in code, rather than making any particular argument. It shows the outlines of using the features stored in a Bookworm for all sorts of machine learning, by testing how well a logistic regression classifier can predict IMDB genre from the subtitles of television episodes.
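The shape of that experiment can be sketched in a few lines. This is not the actual pipeline from the post: the episodes, words, and genre labels below are toy stand-ins, and a real run would pull per-episode word counts out of the Bookworm database instead.

```python
# Illustrative sketch: bag-of-words features into a logistic regression
# genre classifier. Toy data stands in for subtitle word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fake "subtitles" for four episodes, with IMDB-style genre labels.
texts = [
    "detective murder case clue precinct",
    "joke laugh funny date roommate",
    "murder detective gun witness",
    "funny laugh sitcom date apartment",
]
genres = ["Crime", "Comedy", "Crime", "Comedy"]

# Turn subtitles into a sparse word-count matrix.
X = CountVectorizer().fit_transform(texts)

# Hold out half the episodes to measure prediction accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, genres, test_size=0.5, random_state=0, stratify=genres)

clf = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```

With real data the interesting part is less the accuracy number than which word coefficients the classifier learns for each genre.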
I just saw Matt Wilkens’ talk at the Digital Humanities conference on places mentioned in books; I wanted to put up, mostly for him, a quick stab at running the equivalents on the raw data of my movie bookworm.
This is a quick post to share some ideas for interacting with the data underlying the recent article by Ted Underwood and Jordan Sellers on the pace of change in literary standards for poetry.
The point is to begin to think through, using someone else’s work, what a useful exploratory apparatus for articles based on unigram data might look like.
Here are some interactives I’ve made in preparation for my talk at the Stanford Literary Lab on Tuesday about plot arcs in television shows, based on their underlying language.
This is sort of in lieu of a handout for the talk, so some elements may not make much sense if you aren’t in the room.
These extend two more detailed posts on my Sapping Attention blog: one giving a methodology for topic modeling TV shows, and a second describing the use of principal components to visualize archetypal plots as movements in multidimensional space.
For both of these, I felt that I was bumping up against the limits of writing on Blogger; the underlying data is much, much richer and well worth exploring. So this post is where I’m gathering together some of the transformations I used there.
Even if you think you don’t know Usenet, you probably do. It’s the Cambrian explosion of the modern Internet, among the first places that an online culture emerged, but modern enough that it can seamlessly blend into the contemporary web. (I was recently trying to work out through Google where I might buy a clavichord in Boston; my hopes were briefly raised about one particular seller until I realized that the modern-looking Google Groups page I was reading was actually a presentation of a discussion from the Usenet archives in 1992.)
Usenet persists; it’s also the prototype of the modern digital archive. One of the best available sources for early Usenet is the Internet Archive’s UTZOO collection of about 2 million messages from roughly 1981 to 1991. It’s too vast to read, frustratingly incomplete, and far more significant in the aggregate than in the details. In other words, it’s a perfect candidate for some quantitative textual analysis.
Just a day after launching this blog (the RSS feed, by the way, is now up here) I came across a perfect little example question. The Guardian ran an article about appearance in teaching evaluations that touches on some issues my Rate My Professor Bookworm can answer, with a few new interactive charts.
Though more and more outside groups are starting to adopt Bookworm for their own projects, I haven’t yet written as much as I’d like about how it should work. This blog is an attempt to rectify that, and to begin explaining how a combination of blogging software, interactive textual visualizations, and an exploratory data analysis API for bag-of-words models can make it possible to quickly and usefully share texts through a Bookworm installation.
But this is a difficult task: so much so that I had to completely change my blogging stack to do it. So for a first post on this site, I want to introduce some elements of the API and talk about why I think a platform like this is valuable for exploring a large collection of texts visually and quantitatively. Maybe someone will be persuaded to try it themselves.
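To give a flavor of the API, here is a sketch of how a query is assembled: a single JSON object describing which bookworm to hit, what to count, and how to group the results. The field names follow the public Bookworm query format as I understand it, but treat the exact keys, values, and especially the endpoint URL as illustrative assumptions rather than a definitive reference.

```python
# Hypothetical sketch of constructing a Bookworm API query URL.
# The endpoint (example.org) is a placeholder; field names are
# my best recollection of the Bookworm query format, not gospel.
import json
from urllib.parse import urlencode

query = {
    "database": "movies",                  # which bookworm to query
    "search_limits": {"word": ["whale"]},  # restrict to texts using "whale"
    "counttype": ["WordsPerMillion"],      # normalized usage rate
    "groups": ["year"],                    # one result row per year
    "method": "return_json",               # response format
}

# The whole query travels as one URL-encoded JSON parameter.
url = "http://example.org/cgi-bin/dbbindings.py?" + urlencode(
    {"query": json.dumps(query)})
print(url)
```

The appeal of this shape is that a line chart, a bar chart, and a scatterplot can all be driven by the same query object with only the `groups` and `counttype` fields swapped out.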