Chris Harrison

Word Spectrum: Visualizing Google's Bi-Gram Data

Using Google's enormous bigram dataset, I produced a series of visualizations that explore word associations. Each visualization pits two primary terms against each other. I then analyze how frequently each word follows these two terms. For example, "war memorial" occurs 531,205 times, while "peace memorial" occurs only 25,699 times. A position for each word is generated from the ratio of the two frequencies. If they are equal, the word is placed in the middle of the scale. However, if there is an imbalance in usage, the word is drawn toward the term it follows more frequently. This process is repeated for thousands of other word combinations, creating a spectrum of word associations. Font size is based on an inverse power function (set uniquely for each visualization, so you can't compare sizes across pieces). Vertical positioning is random.
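As a rough illustration of the positioning step, here is a minimal Python sketch. The exact mapping from ratio to position isn't spelled out above, so the formula below (the right-hand term's share of the combined count) is an assumption, as is the function name `spectrum_position`. It only shows the general idea: equal counts land in the middle, and lopsided counts pull the word toward one end.

```python
def spectrum_position(left_count: float, right_count: float) -> float:
    """Map two bigram counts to a horizontal position in [0, 1].

    0.0 means the word only follows the left term, 1.0 means it only
    follows the right term, and 0.5 means the counts are equal.
    Assumption: position is the right term's share of the combined count.
    """
    total = left_count + right_count
    if total <= 0:
        raise ValueError("at least one count must be positive")
    return right_count / total


# Counts from the text: "war memorial" (left) vs. "peace memorial" (right).
memorial_position = spectrum_position(531_205, 25_699)

# Equal counts place the word in the middle of the scale.
balanced_position = spectrum_position(1_000, 1_000)
```

With these raw counts, "memorial" lands very close to the "war" end of the scale.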

To achieve a more even distribution, I normalized the bigram frequencies by the total frequency of each primary term. For example, in the case of war vs. peace, there are 81,839,381 bigrams starting with war and 31,263,375 bigrams starting with peace. If I render the spectrum without normalization, it ends up lopsided toward war (since its usage total is so much higher). To compensate, I scale down all of war's bigrams so that the overall frequencies are even.
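The normalization can be sketched in Python along the following lines. The totals and the "memorial" counts come from the text; the helper name `scale_to_match` and the position formula (the right-hand term's share of the combined, normalized count) are assumptions made for illustration rather than the code actually used.

```python
WAR_TOTAL = 81_839_381    # bigrams starting with "war" (from the text)
PEACE_TOTAL = 31_263_375  # bigrams starting with "peace" (from the text)


def scale_to_match(counts: dict, own_total: int, target_total: int) -> dict:
    """Scale every count so that own_total would equal target_total."""
    scale = target_total / own_total
    return {word: count * scale for word, count in counts.items()}


war_counts = {"memorial": 531_205}
peace_counts = {"memorial": 25_699}

# Scale war's bigrams down by roughly 0.38 so both terms carry equal weight.
war_normalized = scale_to_match(war_counts, WAR_TOTAL, PEACE_TOTAL)

# Assumed position formula: 0.0 = "war" end, 1.0 = "peace" end.
w = war_normalized["memorial"]
p = peace_counts["memorial"]
normalized_position = p / (w + p)
raw_position = peace_counts["memorial"] / (
    war_counts["memorial"] + peace_counts["memorial"]
)
```

After normalization, "memorial" still sits on the war side, but less extremely than with the raw counts, which is the lopsidedness the scaling is meant to correct.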

A new visualization looking at the same dataset has been posted.

Jeff Clark has created interactive versions for Twitter and news.

Warning: the visualizations use actual word frequencies from the web - foul language is present!

Each thumbnail links to a PDF version.

© Chris Harrison