|
Introduction
Wikipedia is an interesting dataset for visualization.
As an encyclopedia, it's articles span millions of topics. Being
a human edited entity, connections between topics are diverse, interesting,
and sometimes perplexing - five hops takes you from subatomic particles
to Snoop Dog. Wikipedia is revealing in how humans organize data
and how interconnected seemingly unrelated topics can be.
During my time at AT&T Labs, which coincidently has a great information
visualization group, I started think about how to visualizing
something as massive as Wikipedia. With roughly 1.5 million articles
(vertices) and tens of millions of article links (edges), a comprehensive
visualization package would have to found or built. After playing
around with GraphViz, but getting frustrated with layout limitations,
I decided on the latter option: build!
I don't buy into the reuse-recycle methodology. That's great for
software development, but not research. I believe great ideas and
innovations come about by redoing and rethinking existing problems.
If you always use what has already been "solved" and built,
how do you know it can't be done better?
And so I leaped head first into my own large-scale data visualization
project, from scratch. I hope to publish the novel graph layout
techniques I developed during this process.
WikiViz
v5
| The fifth incarnation of
WikiViz (v5) saw several significant improvements. Most notable
was a new graph layout algorithm that allowed me to scale
to hundreds of thousands or millions of vertices (tens of
millions of edges). A traditional spring-based layout technique
would choke at such quantities as they typically have (vertex)2
run times for repellent forces. However, much like a spring-based
layout technique, connected vertices attracted each other,
and were drawn together spatially. At a larger scale, this
would cause highly connected clusters of vertices to coalesce.
However, as my output improved, I began to realize that the
data I was using from the Wikipedia category links table was
messy and not representative of Wikipedia's true structure.
The cl_to and cl_sortkey fields are not true links, although
the topics are often related. To remedy this problem for future
layout attempts, I have prepared a true page links input file
with data scraped right from article content. When I get sufficient
time to tinker again, I will attempt to render a full and
accurate Wikipedia graph.
I have included a few sample renderings
for those who are curious. |
Key
 |
The nodes seen in the graph above were selected by starting at Wikipedia’s
Science
category page, and traversing category links to a depth of 100
hops. The resulting data included 73,230 category pages (vertices)
and more than a quarter of a million links (edges). You can click
on any of the tiles to see full resolution versions. I've included
some example locations that can be used as starting points.
| |
Location |
Description |
| |
B6 |
A highly connected set of
British cities and associated items |
| |
E6, F6 and around |
Countries of the world |
| |
D6 |
Year categories from roughly
1600-2007 |
| |
D2, D3 and around |
U.S. States |
| |
E2 and F2 |
Sports |
| |
Left side |
U.S. Counties |
| |
Upper right |
Cluster of music album categories |
WikiViz v1 to v4
Layout is done using a spring model - each edge in the graph acts
like a spring, pushing connected nodes apart if they are too close
and pulling connected nodes together. The spring constant and optimal
spring length can be adjusted. Overlapping nodes are nudged apart
using a separate, non-spring-based method. The code base is under
a thousand lines of Java code, and was whipped together a few hours
at a time over the course of a week. Tweaking the layout code (dampening,
overlap detection, etc.) consumed most of that time. Although called
WikiViz, there is nothing specifically Wikipedia about it. It's
a general graph layout and render engine that can be used with many
sources. The input is a simplified version of GraphViz's already
simple DOT format:
Chris
Harrison -- Student
Chris Harrison -- Human
Chris Harrison -- Animal
Human -- Biology
Biology
-- Animal
... |
 |
Initially,
the graph was rendered real-time on screen. With about 1000 edges,
the graph was interactive. However, as I pushed the edge count into
the tens of thousands, it became clear that watching the graph balance
into an energy minimized state was going to take serious processing
time (hours). Not only was each iteration taking about half a second
(2 fps), but the number of iterations required to properly layout
the graph was large (tens of thousands). It was clear that the computation
needed to be run non-graphically and saved to file when complete.
This also allowed me to escape the low resolution of monitors and
produce high-resolution images
(20,000x20,000 pixels or more).
For the time being, I am using Wikipedia's category links
database (cl_to and cl_sortkey fields), which includes ~178k vertices
and 6.5 million edges. Expanding to Wikipedia's full 1.5 million
vertices is going to be a huge challenge (memory requirements alone!).
However, the former data set is plenty interesting. Currently, I
am working with various subsets of the database: two to fifty hops
out, starting from Humans, History, Politics, Mathematics and Physics.
WikiViz v1 to v4 Images
Below are some visualizations that have been produced. Various
layout duration and spring attributes were used. Beware of the high
resolution links. Some JPEGs are 30+ MB, which can seriously tax
your hardware.
| |
|
|
|
Close
up of graph shown at left.
|
GraphViz
Images
| |
|
|
Two levels deep, centered on Artificial Intelligence (using
GraphViz) |
Two
levels deep, centered on Geography (using GraphViz) |
Go
to Home Page
Go to Projects Page
|