Chris Harrison

Web Trigrams: Visualizing Google's Tri-Gram Data

Back in late 2006, Google released a massive set of web n-gram data (basically pieces of sentences). A trigram (n=3), for example, might be "I like food" or "frog is tasty." Each n-gram is also labeled with the number of times it appeared in Google's corpus. The entire archive, which is almost 100 GB uncompressed, has unigrams (n=1) through five-grams (n=5). The data set is offered through the Linguistic Data Consortium (LDC) for those who are interested.

As soon as I got my hands on the data, I got to work on some straightforward visualizations. The first type compares two sets of trigrams, each starting with a different word. One visualization compares 'He' with 'She', while the other uses 'I' and 'You'. In the case of 'He' vs. 'She', the top 120 trigrams for each were identified. The frequencies of the second word in the trigrams were combined and sorted, and rendered in decreasing frequency-of-use order. A similar process was used to create a ranking for the third (and final) word in the trigrams. Words are sized according to the square root of their use frequencies. The color-coded lines act like paths (a tree structure), enumerating all of the trigrams. The process was identical for the 'I' and 'You' version, except that only the top 75 trigrams were used.
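The ranking steps above can be sketched in a few lines of Python. This is only an illustration, not the code behind the visualizations: the `(w1, w2, w3, count)` tuple layout, the function names, and the toy counts are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def top_trigrams(trigrams, first_word, n):
    """Return the n most frequent (w1, w2, w3, count) rows starting with first_word."""
    rows = [t for t in trigrams if t[0] == first_word]
    return sorted(rows, key=lambda t: t[3], reverse=True)[:n]

def rank_position(rows_a, rows_b, position):
    """Sum counts of the word at `position` over both sets, most frequent first."""
    totals = Counter()
    for row in rows_a + rows_b:
        totals[row[position]] += row[3]
    return totals.most_common()

def font_size(count, scale=1.0):
    """Size a word by the square root of its frequency (scale is a hypothetical knob)."""
    return scale * sqrt(count)

# Toy data with made-up counts, standing in for the real trigram files.
sample = [
    ("He", "is", "a", 900),
    ("He", "was", "a", 400),
    ("He", "argues", "that", 100),
    ("She", "is", "a", 700),
    ("She", "was", "a", 500),
    ("She", "love", "to", 50),
]

he = top_trigrams(sample, "He", 120)
she = top_trigrams(sample, "She", 120)
second_words = rank_position(he, she, 1)  # ranking for the middle column
third_words = rank_position(he, she, 2)   # ranking for the final column
```

Square-root sizing keeps the most frequent words from dwarfing everything else while still preserving the ordering.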

These visual comparisons let us see how the two subjects are used, both where they are similar and where they diverge. For example, among the top 120 trigrams, 'He' and 'She' have many second words in common. However, they differ on some interesting ones: only 'He' connects to 'argues', while only 'She' connects to 'love' (within the top 120).
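The shared and diverging second words can be pulled out with plain set operations. The sketch below assumes the same hypothetical `(w1, w2, w3, count)` rows as before, with invented counts; `second_word_overlap` is an illustrative name, not something from the original project.

```python
def second_word_overlap(rows_a, rows_b):
    """Return (shared, only_in_a, only_in_b) sets of second words."""
    a = {row[1] for row in rows_a}
    b = {row[1] for row in rows_b}
    return a & b, a - b, b - a

# Toy rows with made-up counts.
he_rows = [
    ("He", "is", "a", 900),
    ("He", "was", "a", 400),
    ("He", "argues", "that", 100),
]
she_rows = [
    ("She", "is", "a", 700),
    ("She", "was", "a", 500),
    ("She", "love", "to", 50),
]

shared, only_he, only_she = second_word_overlap(he_rows, she_rows)
```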

I also created a little series of visualizations that shows how six common subjects are used. The subject is noted in the top left. The column immediately to the right is a frequency-ranked list of the most common secondary words (for example, the "have" in "we have..."). Each of these is followed by a horizontal list of the common words that follow (for example "a" in "I have a..."). Besides the subject text, all fonts are sized proportionally to the inverse power of their frequencies.
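The layout of each of these per-subject charts amounts to a two-level ranking: second words ordered by total frequency, each with its own ordered list of following words. Here is a minimal sketch under assumed names and toy counts; the cutoffs `max_second` and `max_third` are hypothetical, and font sizing is left out.

```python
from collections import defaultdict

def subject_table(trigrams, subject, max_second=10, max_third=5):
    """For one subject, list (second_word, [third words]) in frequency order."""
    second_totals = defaultdict(int)
    followers = defaultdict(lambda: defaultdict(int))
    for w1, w2, w3, count in trigrams:
        if w1 != subject:
            continue
        second_totals[w2] += count
        followers[w2][w3] += count
    ranked = sorted(second_totals, key=second_totals.get, reverse=True)[:max_second]
    return [
        (w2, sorted(followers[w2], key=followers[w2].get, reverse=True)[:max_third])
        for w2 in ranked
    ]

# Toy rows with made-up counts.
sample = [
    ("I", "have", "a", 500),
    ("I", "have", "been", 300),
    ("I", "am", "a", 400),
    ("I", "am", "not", 200),
    ("We", "have", "a", 100),
]

table = subject_table(sample, "I")
```

Each entry in `table` corresponds to one row of the chart: the second word in the left column, followed by its horizontal list of third words.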

Very simple, but kind of cool. At least I think so.

© Chris Harrison