Using Google's enormous bigram dataset, I produced a series of
visualizations that explore word associations. Each visualization
pits two primary terms against each other. Then, the usage frequencies
of the words that follow these two terms are analyzed. For example,
"war memorial" occurs 531,205 times, while "peace
memorial" occurs only 25,699. A position for each word is generated
by looking at the ratio of the two frequencies. If they are equal,
the word is placed in the middle of the scale. However, if there
is an imbalance in usage, the word is drawn towards the more frequently
related term. This process is repeated for thousands of other word
combinations, creating a spectrum of word associations. Font size
is based on an inverse power function (uniquely set for each visualization,
so you can't compare across pieces). Vertical positioning is random.
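As a rough illustration of the positioning step, here is a minimal Python sketch. The exact formula is not given above, so the function name `spectrum_position` and the choice of a simple share (one count divided by the sum of both) are assumptions; they merely satisfy the stated property that equal frequencies land in the middle of the scale.

```python
def spectrum_position(count_left: int, count_right: int) -> float:
    """Map two bigram counts to a position in [0, 1].

    0.0 means the word only follows the left term, 1.0 means it only
    follows the right term, and 0.5 means both counts are equal.
    This share-based mapping is an assumed stand-in for the real one.
    """
    total = count_left + count_right
    if total == 0:
        raise ValueError("word never follows either primary term")
    return count_right / total


# Counts quoted in the text: "war memorial" vs. "peace memorial".
war_memorial = 531_205
peace_memorial = 25_699

# With war on the left and peace on the right, "memorial" lands
# very close to the war end of the scale.
position = spectrum_position(war_memorial, peace_memorial)
print(f"memorial: {position:.3f}")
```

A log-ratio would work just as well for this sketch; the share form is used only because it is bounded and easy to read.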
To better achieve an even distribution, I normalized the frequencies
of bigrams based on total primary term frequency. So, for example,
in the case of war vs. peace, there are 81,839,381 bigrams starting
with war and 31,263,375 bigrams starting with peace. If I render
the spectrum without normalization, it ends up lopsided toward war
(since the usage totals are so much higher). To compensate, I scale
down all of war's bigrams so that the overall frequencies are even.
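The normalization can be sketched in Python as follows. The totals and the "memorial" counts are the ones quoted in the text; the helper names `scale_to_match` and `share_right`, and the share-based position used to show the effect, are assumptions for illustration rather than the actual code behind the pieces.

```python
# Total bigrams starting with each primary term (figures from the text).
WAR_TOTAL = 81_839_381
PEACE_TOTAL = 31_263_375


def scale_to_match(count: float, own_total: int, other_total: int) -> float:
    """Scale a count from the larger term so both terms' totals are even."""
    return count * (other_total / own_total)


def share_right(count_left: float, count_right: float) -> float:
    """Assumed position measure: 0.0 = left term only, 1.0 = right term only."""
    return count_right / (count_left + count_right)


war_memorial = 531_205
peace_memorial = 25_699

# Scale down war's bigram count before comparing it with peace's.
war_memorial_scaled = scale_to_match(war_memorial, WAR_TOTAL, PEACE_TOTAL)

raw = share_right(war_memorial, peace_memorial)
normalized = share_right(war_memorial_scaled, peace_memorial)
print(f"raw position:        {raw:.3f}")
print(f"normalized position: {normalized:.3f}")
```

Under these assumptions, normalization nudges "memorial" away from the war end, though it still sits firmly on war's side of the spectrum.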
This is only a subset of possible word pairings. If you have an
interesting idea for a word comparison, email me.
A new visualization looking
at the same dataset has been posted.
Jeff Clark has created
interactive versions for Twitter
and news.
Warning: the visualizations use actual word frequencies from the web,
so foul language is present!
Each thumbnail links to a PDF version.