Chris Harrison

Word Associations: Visualizing Google's Bi-Gram Data

This series uses the same bigram dataset as the word spectrum visualization. Please refer to that page for an extended description of the data and processing.

To eliminate occlusion, I developed an entirely different layout. Now, instead of a continuous spectrum of words, words are bucketed into one of 25 different rays. Each of these represents a different tendency of use (ranging from 0 to 100% in 4% intervals). Words are sorted by decreasing frequency within each ray. I render as many words as can fit onto the canvas. There is a nice visual analogy at play: the "lean" of each ray represents the strength of the tendency towards one of the two terms. As in the word spectrum visualization, font size is based on an inverse power function (uniquely set for each visualization, so you can't compare across pieces). Common words (a, the, for, as, etc.) are not shown.
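
As a rough illustration of the bucketing described above, here is a minimal Python sketch. It is not the code behind these visualizations. It assumes each word comes with two bigram counts (one per term) and that "tendency" means the share of occurrences falling with the second term. The names `ray_index`, `build_rays` and `font_size`, and the specific font-size formula and its parameters, are hypothetical choices made for the example.

```python
from collections import defaultdict

NUM_RAYS = 25  # 0 to 100% in 4% intervals, as described in the text


def ray_index(count_a, count_b, num_rays=NUM_RAYS):
    """Map a word's tendency toward term B to a ray index (0 .. num_rays - 1).

    Assumption: tendency = count_b / (count_a + count_b).
    Integer arithmetic avoids floating-point edge cases at bucket boundaries.
    """
    index = count_b * num_rays // (count_a + count_b)
    # A tendency of exactly 100% is clamped into the last ray.
    return min(index, num_rays - 1)


def build_rays(counts, stopwords=frozenset()):
    """Group words into rays, each sorted by decreasing total frequency.

    counts: {word: (count_with_term_a, count_with_term_b)}
    """
    rays = defaultdict(list)
    for word, (a, b) in counts.items():
        if word in stopwords or a + b == 0:
            continue  # common words are not shown; unseen words have no tendency
        rays[ray_index(a, b)].append((a + b, word))
    return {
        ray: [w for _, w in sorted(items, key=lambda t: (-t[0], t[1]))]
        for ray, items in rays.items()
    }


def font_size(rank, max_size=48.0, exponent=0.5, min_size=4.0):
    """Hypothetical inverse power falloff: size shrinks as rank ** -exponent.

    The exponent would be tuned per visualization, which is why sizes
    cannot be compared across pieces.
    """
    return max(min_size, max_size * (rank + 1) ** -exponent)


# Example with made-up counts for a hypothetical "war" vs. "peace" pairing.
example = {"treaty": (30, 70), "accord": (3, 7), "the": (500, 500)}
rays = build_rays(example, stopwords={"the"})
```

In a real layout, each ray's list would then be truncated to however many words fit on the canvas at the chosen font sizes.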

I was really pleased by how many interesting details get packed into these fairly simple visualizations. I think they offer insight into our language and into which topics are prevalent on the web.

Warning: the visualizations use actual word frequencies from the web - foul language is present!

Each thumbnail links to a PDF version.

© Chris Harrison