The Usenet Archive

Ben Schmidt, May 8, 2015

Even if you think you don’t know Usenet, you probably do. It’s the Cambrian explosion of the modern Internet, among the first places that an online culture emerged, but modern enough that it can seamlessly blend into the contemporary web. (I was recently trying to work out through Google where I might buy a clavichord in Boston; my hopes were briefly raised about one particular seller until I realized that the modern-looking Google Groups page I was reading was actually a presentation of a discussion from the Usenet archives in 1992.)

Usenet persists; it’s also the prototype of the modern digital archive. One of the best available sources for early usenet is the Internet Archive’s UTZOO collection of about 2 million messages from roughly 1981 to 1991. It’s too vast to read, frustratingly incomplete, and far more significant in the aggregate than the details. In other words, it’s a perfect candidate for some some quantitative textual analysis.

First things first; the easiest way to browse this may be through the standard line chart methods. (Although the newsgroup search at the bottom of this post is equally interesting).

Some of usenet’s most lasting contributions have been abbreviations. Here’s the rise of IMHO, for example, starting in around 1987. You can change this to search for any word and click to see examples, although you may prefer to use the bigger interface if you’ll be sticking around a while.

An Bookworm browser for the early Usenet by year

	{ "database": "usenet",
	"plotType": "linechart","words_collation":"Case_Sensitive",
	"search_limits": {"word": ["IMHO"],"date_year":{"$gte":1982,"$lte":1991}},
	"aesthetic": {  "y": "WordsPerMillion",  "x": "date_year"  }
	}

I’m not an expert on usenet. A history of usenet is here; Ian Milligan has written up his explorations of the Canadian Usenet. And I’m not planning to do much more with this collection in the near future.

The major purpose of this is to start testing out some portable code for building a Bookworm on any set of e-mails. Loading this in to Bookworm was pretty easy: I simply took an hour or so and hacked together a Makefile to download the UTZOO files and run an ugly parser to pull out the metadata. (Usenet posts are close enough to e-mails that Python’s email.parser module happily extracts the metadata. Character encoding is a major problem, so I’ve simply dropped everything that fails to parse as Unicode/ASCII). Once parsed, two lines of code write out the files in bookworm format, and the bookworm ingest handles all the messy details of tokenization, charting, and the rest. I later ran a second pass to better deal with some of the e-mail address formats; that will be useful for the next e-mail collections that I run it on. (Probably some particularly interesting e-mail list archives–the Humanist, R-help, perhaps the entirety of H-net if I can download it all.)

The Usenet population

One obvious set of questions is where users are posting from. The top-level domains of posts give some idea of the ways that posters reached usenet. Some are familiar–.uk, .com, .edu, but the Domain Name System didn’t exist until 1985. Dominating the early years are UUCP–usenet’s own transfer protocol–and .arpa, going back to the ARPAnet itself.

Changing top-level domains

{
    "database": "usenet",
    "plotType": "streamgraph",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "tld__id": {
            "$lte": 10
        },
        "date_year": {
            "$gte": 1983,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TotalTexts",
        "fill": "tld"
    }
}

It’s a similar story for the mid-level domains; some easily recognizable present-day institutions, and some almost forgotten companies. MIT, Berkeley, and Hewlett-Packard produced the most usenet posters in this set. bio.net is in the top ten probably only because this is a biology-oriented collection. The Digital Equipment Corporation is one of the big losers; I don’t know what CTS.comis, although it should be easy to look up.

Twenty largest mid-level domains

{
    "database": "usenet",
    "plotType": "barchart",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "mld__id": {
            "$lte": 20
        }
    },
    "aesthetic": {
        "x": "TotalTexts",
        "y": "mld"
    }
}

After the DNS takes over, the general story is of expansion. Both for .com and .edu e-mail addresses, the most common domains (by percentage) fade away.1

The top .com domains, as percentage of all .com posts.

{
    "database": "usenet",
    "plotType": "streamgraph",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
	    "tld":"com",
        "*mld__id": {
            "$lte": 30
        },
        "date_year": {
            "$gte": 1986,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TextPercent",
        "fill": "*mld"
    }
}

Relative uses of the top .edu domains, as percentage of all .edu posts.

{
    "database": "usenet",
    "plotType": "streamgraph",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
	    "tld":"edu",
        "*mld__id": {
            "$lte": 30
        },
        "date_year": {
            "$gte": 1986,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TextPercent",
        "fill": "*mld"
    }
}

Newsgroups

The real heart of usenet are the newsgroups. So what are they?

The most popular ones, it seems, are all computer oriented ones: for Apple, for Macintosh, for X-Windows, and for Atari. (I’m surprised how long the Atari and Amiga usergroups stay fairly active.)

{
    "database": "usenet",
    "plotType": "streamgraph",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "date_year": {
            "$gte": 1985,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TextPercent",
        "fill": "*Newsgroups"
    }
}

There isn’t a really signficant pattern of changes here. It might be worth splitting up the usenets into domains as well, so one could easily run a search across all “comp” fields simultaneously.

Top newsgroups by messages, as percentage of all messages

{
    "database": "usenet",
    "plotType": "barchart",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {"*Newsgroups__id":{"$lte":100},
        "date_year": {
            "$gte": 1981,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "TextPercent",
        "y": "*Newsgroups"
    }
}

One particularly useful feature is being able to search for words by newsgroup. Who uses “paranoid” the most? net.politics, naturally; but the Mac and Amiga groups have a surprisingly large number of uses, too. If you search for “crazy,” you’ll see that at least for Amiga this holds up.

Uses of “paranoid” (or a word of your choosing) in the 50 largest newsgroups

{
    "database": "usenet",
    "plotType": "barchart",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
	    "word":["paranoid"],
	    "*Newsgroups__id":{"$lte":50},
        "date_year": {
            "$gte": 1981,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "TextPercent",
        "y": "*Newsgroups"
    }
}

The next logical step is to juxtapose some of these against each other. So for example, we can look to see which newsgroups used the phrase “IMHO” the earliest.

Which newsgroups use “IMHO” the most and the earliest? Colors show percentage of emails in cell using the phrase; click to read.

{
    "database": "usenet",
    "plotType": "heatmap",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
	    "word":["IMHO"],
	    "Newsgroups__id":{"$lte":50},
		"date_month":{"$gte":3000},
        "date_year": {
            "$gte": 1981,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_month",
        "color": "TextPercent",
        "y": "Newsgroups"
    }
}

But this shows a massive data integrity issue! All of the net newsgroups are from 1983 to 1987; all of the comp newsgroups are from after about Christmas 1987.

Why is this? Maybe I’ve done something wrong; or maybe it has to do with the way that the actual tapes were collected.

In any case, it’s just about enough to get me to give up.

If I were to spend some more time on this, I think the most interesting places would be net.politics, the slowing dying communities around Apple and Amiga products, and some looks at gender based on e-mail addresses particularly in the gender section, because the rise of usenet is at just the same time as women begin to vanish from computer science; even if it’s not causal, there should be some evidence to look at.

One last question: where’s the eternal September?

But let me finish with a strange little question. One of the reasons I set up this blog was to share some of the browser for data that I’m not sure what to do with. Usenet is one of those–I occasionally teach the computer culture of the 70s and 80s in history classes, but not to the degree I feel particularly confident making broad statements about what’s out there. For instance: one of the things that we “know” about Usenet is that AOL ruined it in 1993 with the onset of the “eternal September.” The presumption there is that the character of Usenet changed each September with the influx of college students.

But in fact, I have a hard time finding obvious terms that spike in September. (You’ll see a discontinuity in June: that’s because 1991 is the heaviest-traffic year for the first half, but has no text for after the end of the spring semester.

This means that FAQ, the most obvious search for this field is almost unusable.

Where is the eternal September? Percentage of e-mails using “netiquette” by day of year, aggregate over the period 1982-1991

	{ "database": "usenet",
	"plotType": "linechart",
	"search_limits": {"word": ["netiquette"],"date_year":{"$gte":1982,"$lte":1992},"date_day_year":{"$lte":366}},
	"aesthetic": {  "y": "TextPercent",  "x": "date_day_year"  }
	}

  1. By the way, this is a good example of the use of the asterisk wildcard in the Bookworm API to drop items from a comparison set