Chris Harrison

Search Clock

I was curious about how people used the internet. Specifically, I wanted to see how internet behavior changed over the course of a day. Search engines are the gateway to the internet for most people, and so search queries provide insight into what people are doing and thinking. I had several assumptions before I started:

  • Overall, internet usage is highest during the day, tapers off at night, and reaches a lull in the early morning hours.
  • People search for information during the workday (8-6ish).
  • People socialize or look for information of personal interest when they get home from work (6ish to midnight).
  • People look for entertainment (often of the sexual variety) late at night and into the wee hours (midnight-6am).

I was curious to see if data from search engines would support my anecdotal observations. I built a simple clock-like visualization that displays the top search terms over a 24-hour period. Displaying search terms in a cyclical layout (like a clock) allows continuous examination of trends that would otherwise be broken up. The data I had access to was both large and noisy. In response, I combined hourly data into weekly or yearly averages. All search strings were broken up into single words (periods, commas, and similar punctuation were treated as whitespace). This helped pool frequent terms and better illuminate search motivation (e.g. "information about taxes" and "information about chinchillas" counted as two hits for "information"). The top five search terms were shown for each hour, sized to reflect their relative frequency (larger = more popular). A list of stop words was developed to eliminate uninteresting terms (e.g. that, for, an, not, free). Otherwise, I have not modified the data in any way; you see it as it is.
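The word-pooling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the visualization: the function names, the lowercasing, the exact punctuation rule, and the sample queries are hypothetical, and the stop-word set holds only the five examples listed above rather than the full list.

```python
import re
from collections import Counter, defaultdict

# Only the example stop words named in the text; the full list is not given.
STOP_WORDS = {"that", "for", "an", "not", "free"}

def tokenize(query):
    """Lowercase a query and split it into single words, treating periods,
    commas and other punctuation as whitespace (assumed rule)."""
    return [w for w in re.split(r"[^a-z0-9]+", query.lower()) if w]

def top_terms_by_hour(queries, n=5):
    """queries: iterable of (hour, query_string) pairs, hour in 0..23.
    Returns {hour: [(term, count), ...]} with the n most frequent terms."""
    counts = defaultdict(Counter)
    for hour, query in queries:
        for word in tokenize(query):
            if word not in STOP_WORDS:
                counts[hour][word] += 1
    return {hour: c.most_common(n) for hour, c in counts.items()}

# Hypothetical sample: three queries issued during the 9am hour.
sample = [
    (9, "information about taxes"),
    (9, "information about chinchillas"),
    (9, "free chinchillas"),
]
print(top_terms_by_hour(sample)[9])
```

Here the two "information about ..." queries pool into two hits for "information", while "free" is dropped as a stop word.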

Some might be wondering if international users in different time zones affected the search distribution. They probably did. However, my guess is that most users were based in North America (especially for Magellan in the late 90s and AOL in general). The data seems to support this as well, with search activity slowing down at night (Western Hemisphere time).

I ran the visualization with two unique data sets:

Magellan Voyeur Data Visualization

Magellan, a search engine of yesteryear, offered a service called Voyeur, which displayed the last 10 search queries. Brian Amento of AT&T Labs archived this data in 10-minute intervals from 1997 to 2001. There are gaps in the data set from outages and changes to the Voyeur service. However, these events are assumed to be random, and thus to have little impact on the distribution of search terms. Furthermore, because the data spanned a four-year period, I combined hourly data into yearly averages, which further helped to compensate for gaps and noise.
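One way to picture the yearly averaging is sketched below in Python. The text does not say exactly how the averages were computed, so everything here is an assumption: the function name, the (timestamp, term) input format, and in particular the choice to divide each term's count by the number of days that actually have data for that hour, which is one simple way to keep gaps from dragging a year's numbers down.

```python
from collections import Counter, defaultdict
from datetime import datetime

def yearly_hourly_averages(samples):
    """samples: iterable of (datetime, term) pairs.
    Returns {(year, hour): {term: average hits per observed day}}."""
    totals = defaultdict(Counter)   # (year, hour) -> term counts
    days_seen = defaultdict(set)    # (year, hour) -> dates with any data
    for ts, term in samples:
        key = (ts.year, ts.hour)
        totals[key][term] += 1
        days_seen[key].add(ts.date())
    return {
        key: {term: n / len(days_seen[key]) for term, n in counter.items()}
        for key, counter in totals.items()
    }

# Hypothetical archive entries for the 5pm hour on two days in 1997.
samples = [
    (datetime(1997, 3, 1, 17, 0), "chat"),
    (datetime(1997, 3, 1, 17, 10), "chat"),
    (datetime(1997, 3, 2, 17, 0), "chat"),
    (datetime(1997, 3, 2, 17, 20), "pictures"),
]
averages = yearly_hourly_averages(samples)
print(averages[(1997, 17)])
```

Days with no archived data for an hour simply do not enter the denominator, so an outage lowers the sample size rather than the average.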

This data set is interesting for a few reasons. Foremost, it is more than a decade old. People were searching for different things back then, and it shows. Secondly, the data spans a multi-year period, which helps bring out overarching trends. Lastly, and perhaps most importantly, Magellan was used to search for a variety of content by a diverse user group (including people at work, unlike the AOL data set).

Notes:

The innermost ring is the average for 1997. Rings then work outward one year at a time until 2000. 2001 was not included because only a fraction of the year was collected. Font size scales linearly with the number of times the term appeared in that hour (e.g. 100 hits = Courier size 100). Time is EST.
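The ring and font-size rules in these notes could be expressed roughly as follows. This is an illustrative Python sketch, not the original rendering code: the function names, the radii, and the choice to put midnight at the top of the dial and run clockwise are all assumptions.

```python
import math

FIRST_YEAR = 1997  # innermost ring, per the notes above
LAST_YEAR = 2000   # outermost ring; 2001 is not drawn

def ring_index(year):
    """0 for the innermost ring (1997), increasing outward to 3 (2000)."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError("only 1997-2000 are drawn")
    return year - FIRST_YEAR

def font_size(hits):
    """Linear scale from the notes: 100 hits -> size 100."""
    return float(hits)

def hour_angle(hour):
    """Angle in radians, measured clockwise from the top of a 24-hour dial.
    Midnight-at-top and clockwise order are assumptions."""
    return 2 * math.pi * (hour % 24) / 24

def position(year, hour, inner_radius=100.0, ring_spacing=60.0):
    """(x, y) of a term's anchor point, with the y axis pointing up.
    The radius values are arbitrary placeholders."""
    r = inner_radius + ring_index(year) * ring_spacing
    a = hour_angle(hour)
    return (r * math.sin(a), r * math.cos(a))

print(ring_index(1999), font_size(100), position(1997, 6))
```

Because the angle wraps around at 24, the 11pm and midnight slots sit next to each other, which is what makes the late-night trends readable without a break.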

Interesting Trends:

I could explain every trend for you, but half the fun is exploring the data! For those who are lazy, here are some major (and obvious) trends to get you started:

Overall:

  • There appears to be a dramatic shift away from chat and towards information retrieval between 1997 and 2000.
  • People are diurnal - search activity dies down at night and picks up again as people get up for work.

1997:

  • It is clear that chat is most prevalent when people are home (evening). You can see chat frequency starting to grow around 11am, dominating by 5pm, and tapering off around 1am. It is supplanted by sex around 5am.
  • It seems people are curious about adult topics throughout the day. You can see sex jump in frequency around 11pm, reaching a climax around 2am (no pun intended) and dying down to nominal levels by 5am. However, since everyone is in bed, it clings to the top spot until pictures jumps to life, snatching the top spot as people roll out of bed.
  • Secondary terms are interesting as well. Entertainment-oriented terms are popular in the afternoon and evening. University and software make their main appearance during the workday (8am-5pm). Warez makes it into the top five from 5am-7am thanks to late-night pirates and people who can’t get to sleep.

1998:

  • Chat and pictures vie for the top spot starting around 5pm, continuing until 2am. However, mp3s (and download) make a strong appearance, especially at night.

1999 & 2000:

  • These two years are similar, so I've grouped them for brevity. The data shows that chat, mp3s and porn begin to lose out to information, which dominates around the clock. MP3s remain popular in 1999. By 2000, e-commerce has matured; people are increasingly searching for things to buy.



AOL Data Visualization

The AOL data set will live in infamy for its much-hyped breach of privacy. The data is a nice complement to the Voyeur data set, as it differs in several important ways. First, it is significantly larger (~30 million search queries). Secondly, the data was collected from March to May 2006, a three-month period, and for a subset of users. Third, AOL caters to a very different user demographic; it is primarily targeted at home users, and thus search queries seem to reflect more personal and less work-related topics. Adding to this difference is the fact that the population on the internet has changed dramatically since the late 90s.

Notes

  • Each month has four weeks (the first week is days 1-7, the second week is days 8-14, etc.). Months are separated by a gap. The inner four rings make up the month of March, followed by April in the middle, and finally May on the outside.
  • Font size is a non-linear function of the number of times the term appeared in that hour (e.g. (1000 hits)^(0.66) = Courier size 95). This was necessary to dampen very frequent terms, such as myspace, and allow less popular terms to remain readable. Of course, the non-native-resolution versions (like the thumbnail above) had a linear scale applied as well.
  • This is AOL’s data, and I have no idea how they put it together. Thus, I cannot vouch for its reliability or independence. We just have to assume it’s somewhat randomized. At a quick glance, it seems that more data was included for March than the other months (based on term frequency - you can see this in the image above clearly). I chose not to normalize frequency based on total number of searches, as I felt that removed some transparency from the visualization; you see it as it was rendered, straight from the data.
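The weekly rings and the dampened font scale from these notes can be sketched in Python as below. Treat it as a rough illustration rather than the original code: the function names are hypothetical, the weeks are assumed to be seven-day bins, and folding days 29-31 into the fourth week is a guess, since the notes do not say how the end of a month was handled.

```python
def week_of_month(day):
    """Map a day of the month (1-31) to one of four weekly rings (1-4).
    Seven-day bins are assumed; days 29-31 are folded into week 4 (a guess)."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    return min((day - 1) // 7, 3) + 1

def font_size(hits, exponent=0.66):
    """Power-law scale from the notes: 1000 hits -> roughly size 95."""
    return hits ** exponent

# Compare how much larger a 1000-hit term is drawn than a 100-hit term.
linear_ratio = 1000 / 100
damped_ratio = font_size(1000) / font_size(100)
print(week_of_month(8), int(font_size(1000)), round(damped_ratio, 2))
```

With an exponent below 1, a term with ten times the hits is drawn well under ten times as large, which is the dampening effect the notes describe for terms like myspace.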

Interesting Trends

This data only spans nine weeks, and searching trends seem to have changed little over this period (unlike the drastic differences in the multi-year Voyeur data set). However, this is not necessarily bad; it simply shows that searching behavior on a weekly basis is not that volatile.

The most obvious trend is that myspace is popular - searches for the social website increase as people get home from work and fall off as people go to bed.

Perhaps more revealing are the second through fifth search terms. eBay picks up in the afternoon and evening, as one would expect. Entertainment-related terms (lyrics and games) grow from 4pm onwards until bedtime. Sex and other porn-related terms are prevalent at night, starting around 11pm, although their frequency pales in comparison to daytime searches. Civic terms, such as state, county, gov and Florida, are surprisingly ubiquitous, although mostly popular during the workday. Is AOL's average user a retired Floridian?

There are a few week-specific blips. Some are explainable, such as "Easter" and "happy" (see 24:00 hours on the 6th ring out, aka the week before Easter in 2006). I have no clue why "profileedit" and "myspace" become so popular in the 8th week (22-23:00 hours). Adultfriendfinder(.com) is also popular for a week (5th ring out, 3am-6am).

© Chris Harrison