| 
Back in late 2006, Google released a massive set of web n-gram
data (basically pieces of sentences). A trigram (n=3), for example,
might be "I like food" or "frog is tasty." Each
n-gram is also labeled with the number of times it appeared in Google's
corpus. The entire archive, which is almost 100GB uncompressed,
has unigrams (n=1) through five-grams (n=5). The data set is offered
through the LDC for those who are interested.

As soon as I got my hands on the data, I quickly got to work on
some straightforward visualizations. The first type compares two
sets of trigrams, each starting with a different word. One visualization
compares 'He' with 'She', while the other uses 'I' and 'You'. In
the case of 'He' vs. 'She', the top 120 trigrams for each were
identified. The frequencies of the second word in the trigrams were
combined and sorted, and rendered in decreasing frequency-of-use
order. A similar process was used to create a ranking for the third
(and final) word in the trigrams. Words are sized according to the
square root of their use frequencies. The color-coded lines act
like paths (a tree structure), enumerating all of the trigrams.
The process was identical for the 'I' and 'You' version, except
that only the top 75 trigrams were used.
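
The ranking and sizing steps described above can be sketched roughly as follows. This is only an illustration, not the code behind the visualizations: the function names, the `(w1, w2, w3, count)` tuple layout, and the sample counts are all made up for the example, and the `scale` parameter is an assumed knob for turning a square root into a font size.

```python
import math
from collections import Counter


def top_trigrams(trigrams, first_word, k):
    """Return the k most frequent (w1, w2, w3, count) rows that start with first_word."""
    rows = [t for t in trigrams if t[0] == first_word]
    return sorted(rows, key=lambda t: t[3], reverse=True)[:k]


def rank_position(rows, position):
    """Sum the counts per word at one trigram position, sorted by decreasing total."""
    totals = Counter()
    for row in rows:
        totals[row[position]] += row[3]
    return totals.most_common()


def font_size(count, scale=1.0):
    """Size a word by the square root of its use frequency (scale is an assumption)."""
    return scale * math.sqrt(count)


# Hypothetical counts, not taken from the Google data.
sample = [
    ("He", "said", "that", 900),
    ("He", "argues", "that", 400),
    ("He", "was", "a", 100),
    ("She", "said", "that", 700),
    ("She", "was", "a", 300),
    ("She", "said", "she", 200),
]

# Keep the top k trigrams per subject (120 in the text, 2 here), then combine them.
combined = top_trigrams(sample, "He", 2) + top_trigrams(sample, "She", 2)
second_words = rank_position(combined, 1)  # column of second words
third_words = rank_position(combined, 2)   # column of third (final) words
```

Each trigram kept in `combined` would then be drawn as a colored path from its subject through its second word to its third word.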

These visual comparisons allow us to see differences in how the
two subjects are used, both where they are similar and where they diverge.
For example, among the top 120 trigrams, 'He' and 'She' have many
common second words. However, they differ on some interesting ones:
only 'he' connects to 'argues', while only 'she' connects
to 'love' (within the top 120).

I also created a little series of visualizations that shows how
six common subjects are used. The subject is noted in the top left.
The column immediately to the right is a frequency-ranked list of
the most common secondary words (for example, the "have"
in "we have..."). Each of these is followed by a horizontal
list of the common words that follow (for example "a"
in "I have a..."). Besides the subject text, all fonts
are sized proportionally to the inverse power of their frequencies.
Very simple, but kind of cool. At least I think so.
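
For the curious, the layout of one of these per-subject charts could be computed along these lines. This is a hedged sketch under assumptions: the `subject_rows` function, its `top_second` and `top_third` cutoffs, and the sample counts are invented for illustration, and font sizing is left out.

```python
from collections import defaultdict


def subject_rows(trigrams, subject, top_second=3, top_third=3):
    """For one subject, rank its second words and list the top third words after each."""
    second_totals = defaultdict(int)
    third_counts = defaultdict(lambda: defaultdict(int))
    for w1, w2, w3, count in trigrams:
        if w1 == subject:
            second_totals[w2] += count
            third_counts[w2][w3] += count

    # Frequency-ranked column of second words.
    ranked = sorted(second_totals, key=second_totals.get, reverse=True)[:top_second]
    # Each second word is followed by its own horizontal list of third words.
    return [
        (w2, sorted(third_counts[w2], key=third_counts[w2].get, reverse=True)[:top_third])
        for w2 in ranked
    ]


# Hypothetical counts, not taken from the Google data.
sample = [
    ("I", "have", "a", 50),
    ("I", "have", "been", 80),
    ("I", "am", "a", 60),
    ("we", "have", "a", 40),
    ("I", "am", "not", 30),
]

rows = subject_rows(sample, "I")
```

Each entry in `rows` corresponds to one line of the chart: the second word, then the words that most often follow it.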
|