Proteus and Bookworm

Ben Schmidt, March 29, 2015

Here’s a first look, for internal purposes, at some things worth pursuing as part of integrating Bookworm with Viral Texts.

This uses the dump cl20-flover08-rep4.clinfo.gz from September 2014.

Number of texts per year in the version that David and Ryan gave me.

{
    "database": "viral",
    "plotType": "linechart",
    "search_limits": {
        "date_year": {"$gte": 10}
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TextCount"
    }
}

Appearances by decade of four arbitrary clusters

{
    "database": "viral",
    "plotType": "linechart",
    "search_limits": {
        "chunk": [50, 20, 30, 40]
    },
    "aesthetic": {
        "x": "decade",
        "y": "TextCount",
        "color": "chunk"
    }
}

Heatmap of appearances by year for about 50 clusters of text (y axis)

{   "database": "viral",
    "plotType": "heatmap",
    "search_limits": {
        "date_year": {
            "$gt": 1840,
            "$lt": 1901
        },
        "chunk": {
            "$gte": 280,
            "$lte": 330
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "chunk",
        "color": "TextCount"
    }}
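
The search limits in these queries mix MongoDB-style comparison operators ("$gt", "$gte", "$lt", "$lte") with plain lists of allowed values. As a rough sketch of how I read them, assuming the operators mean what they do in MongoDB, here is a small Python filter. The "matches" helper and the sample records are hypothetical, and this is not Bookworm's actual implementation.

```python
import operator

# Comparison operators as they appear in the search_limits blocks.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(record, search_limits):
    """Return True if a record satisfies every limit (hypothetical helper)."""
    for field, limit in search_limits.items():
        value = record[field]
        if isinstance(limit, dict):
            # A dict of operators: every comparison has to hold.
            if not all(OPERATORS[op](value, bound) for op, bound in limit.items()):
                return False
        elif value not in limit:
            # A plain list: the value has to be one of the listed items.
            return False
    return True


# Limits copied from the heatmap query.
limits = {
    "date_year": {"$gt": 1840, "$lt": 1901},
    "chunk": {"$gte": 280, "$lte": 330},
}

# Made-up records, chosen to sit on the boundaries of the limits.
records = [
    {"date_year": 1840, "chunk": 300},
    {"date_year": 1841, "chunk": 280},
    {"date_year": 1900, "chunk": 330},
    {"date_year": 1901, "chunk": 331},
]
selected = [r for r in records if matches(r, limits)]
```

Under this reading, "$gt" and "$lt" exclude the boundary year itself while "$gte" and "$lte" include the boundary cluster number, so only the two middle records are selected.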

Cluster publication across the top 30 newspapers and top 100 clusters.

{
    "database": "viral",
    "plotType": "heatmap",
    "search_limits": {
        "title__id": {"$lte": 30},
        "date_year": {
            "$gt": 1840,
            "$lt": 1901
        },
        "chunk": {
            "$gte": 1,
            "$lte": 100
        }
    },
    "aesthetic": {
        "x": "chunk",
        "y": "title",
        "color": "TextCount"
    }
}

Topic models

{
    "database": "viral",
    "plotType": "barchart",
    "search_limits": {
    },
    "compare_limits": {
    },
    "aesthetic": {
        "x": "WordCount",
        "y": "topic_label"
    }
}

It’s also possible to bridge this with a topic model. For instance, it then becomes possible to see how the language in a topic devoted to home remedies changes over time.

Top words in topic 57 between 1890 and 1900, compared with those between 1860 and 1875.

You must click through to the code to see this one; div methods are not yet supported in Hakyll.

{
    "database": "viral",
    "plotType": "worddiv",
    "search_limits": {
        "date_year": {
            "$gt": 1890,
            "$lt": 1900
        },
        "topic": [57]
    },
    "compare_limits": {
        "date_year": {
            "$gt": 1860,
            "$lt": 1875
        },
        "topic": [57]
    },
    "aesthetic": {
        "label": "unigram",
        "size": "Dunning"
    }
}
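
The "Dunning" size in that query refers to Dunning's log-likelihood score, which measures how surprising a word's frequency is in one set of texts compared with another. Here is a minimal sketch of the usual two-corpus form of the statistic in Python. The function names are hypothetical, the counts are made up for illustration, and the sign convention is my own assumption; Bookworm's internal calculation may differ in detail.

```python
import math


def dunning_g2(count_a, total_a, count_b, total_b):
    """Two-corpus log-likelihood (G^2) for one word.

    count_a, count_b: occurrences of the word in each corpus.
    total_a, total_b: total words in each corpus.
    """
    # Expected counts if the word were equally frequent in both corpora.
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate

    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def signed_dunning(count_a, total_a, count_b, total_b):
    """G^2 made positive when the word is more frequent in corpus A, negative otherwise."""
    g2 = dunning_g2(count_a, total_a, count_b, total_b)
    return g2 if count_a / total_a >= count_b / total_b else -g2


# Made-up counts: a word seen 30 times per 1000 words in the later
# period and 10 times per 1000 words in the earlier one.
score = signed_dunning(30, 1000, 10, 1000)
```

A word with the same relative frequency in both periods scores zero, and the score grows as the two frequencies pull apart, which is why it works as a size for the most distinctive words.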

Geographical location of all publications of cluster 30

{
    "database": "viral",
    "plotType": "map",
    "method": "return_json",
    "search_limits": {
        "chunk": [30]
    },
    "aesthetic": {
        "point": "placeOfPublication_geo",
        "size": "TextCount"
    },
    "projection": "USA"
}

Here’s a cluster about the ways the apostles died, organized by date.

Color shows how often each printing uses the word “St” (the query searches for “St” only, not “Saint”).

Appearances of cluster 15 by date

{
    "database": "viral",
    "plotType": "map",
    "search_limits": {
        "word": ["St"],
        "chunk": [15]
    },
    "aesthetic": {
        "point": "placeOfPublication_geo",
        "size": "TotalTexts",
        "time": "date_year",
        "color": "WordsPerMillion"
    },
    "projection": "USA"
}
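
The color in that map is "WordsPerMillion". As I understand it, that is the number of hits for the search word per million words in the matching texts. A minimal sketch of the arithmetic, with a hypothetical helper name and made-up counts:

```python
def words_per_million(word_count, total_words):
    """Rate of a word per million words; 0.0 when there are no words at all."""
    if total_words == 0:
        return 0.0
    return 1_000_000 * word_count / total_words


# Made-up counts: 3 hits for the search word in 1500 words of text.
rate = words_per_million(3, 1500)
```

Because it is a rate and not a raw count, a short printing with a few hits can be colored as strongly as a long one with many.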

It’s also possible to just drag through a bunch of different clusters.

Drag the top line to see geographical distributions for each cluster number

{
    "database": "viral",
    "plotType": "map",
    "method": "return_json",
    "search_limits": {
        "word": [],
        "chunk": {
            "$lte": 200
        }
    },
    "aesthetic": {
        "point": "placeOfPublication_geo",
        "size": "TotalTexts",
        "color": "WordsPerMillion",
        "time": "chunk"
    },
    "projection": "USA"
}

This is another example that requires clicking through to the code to see the results.

Search for clusters by individual words. (This example covers only the top 1000 clusters.)

{
    "database": "viral",
    "plotType": "barchart",
    "method": "return_json",
    "search_limits": {
        "word": ["rebellion"],
        "chunk": {
            "$lte": 1000
        }
    },
    "aesthetic": {
        "x": "WordsPerMillion",
        "y": "chunk"
    }
}