Vector Spaces: an interactive explanation

Ben Schmidt, March 15, 2015

The fundamental modeling tool behind much text analysis is the “vector space” model. The underlying idea is that any document can be broken into counts for an arbitrary set of words, and those counts can be used as coordinates to plot the document in space. Here’s an explanation with some interactive charts to help show how this works.

A note on reading. These are memory-chomping interactive charts, so they don’t load with the page. You have to click on them (or the grey box under the code that describes them) and then they’ll load.

We saw this in class last week in R, but I thought it might be helpful to have a dynamic version so that you can see just what this rotation looks like.

The classic document set in machine learning is the Federalist papers, so we’ll start with them. (Next week, we’ll be reading some of the more substantial findings that move beyond simple vector space models.)

The vector space model is easiest to understand in two dimensions, so we’ll start there. This is a plot showing each of the Federalist papers (labeled by number) and colored by author. The x-axis shows how many times each paper uses the word “Congress”; the y-axis shows how many times it uses the word “President”.

Far off to the right you can see Federalist no. 40, which uses the word “Congress” 17 times. (It is not referring to the proposed Congress under the Constitution, though, but to the convention that wrote the Constitution itself.)

{
    "database": "federalist",
    "plotType": "vectorspace",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "word": ["President", "Congress"]
    },
    "aesthetic": {
        "variable": "WordCount",
        "dimensions": "*unigram",
        "label": "fedNumber"
    },
    "counttype": ["WordCount"],
    "groups": ["*unigram", "author", "fedNumber"],
    "weights": {
        "Congress": {
            "x": 1,
            "y": 0
        },
        "President": {
            "x": 0,
            "y": 1
        }
    }
}
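If it helps to see the arithmetic behind a chart like this, here is a minimal Python sketch of what a two-word vector space amounts to. It is not the code that draws the chart: the `word_counts` helper, the simple tokenizer, and the two toy documents are all assumptions made up for illustration.

```python
import re
from collections import Counter


def word_counts(text, words):
    """Count case-sensitive occurrences of each target word in a text."""
    tokens = re.findall(r"[A-Za-z]+", text)  # crude tokenizer: runs of letters
    counts = Counter(tokens)
    return [counts[w] for w in words]


# Toy stand-ins for documents; these are NOT quotations from the Federalist papers.
docs = {
    "doc_a": "Congress may act, and Congress alone; the President may not.",
    "doc_b": "The President shall advise; the President shall execute.",
}

axes = ["Congress", "President"]  # x-axis word, then y-axis word

# Each document becomes one point: [count of "Congress", count of "President"].
points = {name: word_counts(text, axes) for name, text in docs.items()}

for name, (x, y) in points.items():
    print(name, x, y)
```

Each document ends up as a pair of numbers, which is all a point in two-dimensional space is. Adding a third word would just add a third coordinate.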

Multiple words on the same axis.

It’s easy to see one way to improve this: instead of just displaying one word on each axis, we can add some together.

{
    "database": "federalist",
    "plotType": "vectorspace",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "word": ["President", "Congress","legislature","senate","house","executive","presidency","legislative"]
    },
    "aesthetic": {
        "variable": "WordCount",
        "dimensions": "*unigram",
        "label": "fedNumber"
    },
    "counttype": ["WordCount"],
    "groups": ["*unigram", "author", "fedNumber"],
    "weights": {
        "Congress": {"x": 1, "y": 0},
        "legislature": {"x": 1, "y": 0},
        "senate": {"x": 1, "y": 0},
        "house": {"x": 1, "y": 0},
        "legislative": {"x": 1, "y": 0},
        "President": {"x": 0, "y": 1},
        "executive": {"x": 0, "y": 1},
        "presidency": {"x": 0, "y": 1}
    }
}

The red dots below give the weights for each direction. You can drag them to change how much each word counts toward each axis. Try, for instance, dragging “senate” all the way to the left: what happens?
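One way to think about what the weights do: each paper’s position is a weighted sum of its word counts. The sketch below is a rough Python illustration of that idea under my own assumptions, not the chart’s actual implementation. The `project` function and the counts for the single hypothetical paper are made up.

```python
def project(counts, weights):
    """Map a dict of word counts to an (x, y) point using per-word weights."""
    x = sum(n * weights[w]["x"] for w, n in counts.items() if w in weights)
    y = sum(n * weights[w]["y"] for w, n in counts.items() if w in weights)
    return (x, y)


# Same shape as the "weights" object in the chart definitions.
weights = {
    "Congress": {"x": 1, "y": 0},
    "senate": {"x": 1, "y": 0},
    "President": {"x": 0, "y": 1},
    "executive": {"x": 0, "y": 1},
}

# Invented counts for one hypothetical paper (not real Federalist data).
counts = {"Congress": 3, "senate": 4, "President": 2, "executive": 1}

before = project(counts, weights)

# "Dragging senate all the way to the left" corresponds to a negative x weight.
dragged = dict(weights, senate={"x": -1, "y": 0})
after = project(counts, dragged)

print(before, after)
```

With a negative weight, every use of “senate” pushes a paper to the left instead of the right, so papers that talk a lot about the senate slide away from those that mostly say “Congress”.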

Stopwords and authorship.

These topical words can tell you things about the materials covered: it’s probably obvious, though, that this doesn’t seem to be telling you much about the authors (represented here by colors).

It turns out that so-called “stop words” are the best at investigating authorship. You can see the most famous example below: papers by Alexander Hamilton tend to use the word “upon” quite often, while papers by James Madison very rarely do.

{
    "database": "federalist",
    "plotType": "vectorspace",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "word": ["on","upon"]
    },
    "aesthetic": {
        "variable": "WordCount",
        "dimensions": "*unigram",
        "color": "author",
        "label": "fedNumber"
    },
    "weights": {
        "on": {"x": 1,"y": 0},
        "upon": {"x": 0,  "y": 1}
    }
}

A longer list of stopwords gives you some more room for manipulation.

{
    "database": "federalist",
    "plotType": "vectorspace",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "word": ["on","upon","as","our"]
    },
    "aesthetic": {
        "variable": "WordCount",
        "dimensions": "*unigram",
        "color": "author",
        "label": "fedNumber"
    },
    "counttype": ["WordCount"],
    "groups": ["*unigram", "author", "fedNumber"],
    "weights": {
        "on": {"x": 1, "y": 0},
        "upon": {"x": 0, "y": 1},
        "as": {"x": 0, "y": -0.3},
        "our": {"x": 0, "y": 0.5}
    }
}

You can manipulate this chart at the site linked to right here. Try dragging the weights until you get the points into a configuration like this: it gives pretty good discrimination in two dimensions between the two authors.

Image of an optimal loading.


From there, you can add in some more words until you get a really complicated version going.

There are some obvious extensions that we’ll talk about more in class: most importantly, scaling the counts so that differences in document length don’t overly affect where a paper lands, and a few techniques for automatically finding a good set of weights instead of specifying them manually. (Although we watched John Tukey do it in class, it’s not actually something that’s become widespread.)
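To make those two extensions a little more concrete, here is a small Python sketch. It is only one possible reading: scaling is done as a rate per 1,000 words, and the “automatic” weights are just the difference between the two authors’ average rates, a deliberately simple stand-in rather than any of the techniques we’ll actually cover. All function names and numbers are invented for illustration.

```python
def rates(counts, total_words, per=1000):
    """Convert raw counts to occurrences per `per` words of the document."""
    return {w: per * n / total_words for w, n in counts.items()}


def mean_difference_weights(group_a, group_b):
    """One weight per word: mean rate in group_a minus mean rate in group_b."""
    def mean(group, word):
        return sum(doc[word] for doc in group) / len(group)

    return {w: mean(group_a, w) - mean(group_b, w) for w in group_a[0]}


def score(doc, weights):
    """Project a document onto a single axis defined by the weights."""
    return sum(doc[w] * weights[w] for w in weights)


# Invented counts and lengths for two hypothetical authors (not real data).
author_a = [rates({"on": 6, "upon": 6}, 2000), rates({"on": 3, "upon": 4}, 1000)]
author_b = [rates({"on": 14, "upon": 0}, 2000), rates({"on": 8, "upon": 1}, 1000)]

weights = mean_difference_weights(author_a, author_b)
scores_a = [score(doc, weights) for doc in author_a]
scores_b = [score(doc, weights) for doc in author_b]

print(weights)
print(scores_a, scores_b)
```

Because the counts are divided by document length first, a long paper and a short one with the same habits land in roughly the same place. The mean-difference weights then play the role of the red dots, placed by arithmetic rather than by dragging.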

Continuing online