Introduction

Wikipedia is an interesting dataset for visualization. As an encyclopedia, it's articles span millions of topics. Being a human edited entity, connections between topics are diverse, interesting, and sometimes perplexing - five hops takes you from subatomic particles to Snoop Dog. Wikipedia is revealing in how humans organize data and how interconnected seemingly unrelated topics can be.

During my time at AT&T Labs, which coincidently has a great information visualization group, I started think about how to visualizing something as massive as Wikipedia. With roughly 1.5 million articles (vertices) and tens of millions of article links (edges), a comprehensive visualization package would have to found or built. After playing around with GraphViz, but getting frustrated with layout limitations, I decided on the latter option: build!

I don't buy into the reuse-recycle methodology. That's great for software development, but not research. I believe great ideas and innovations come about by redoing and rethinking existing problems. If you always use what has already been "solved" and built, how do you know it can't be done better?

And so I leaped head first into my own large-scale data visualization project, from scratch. I hope to publish the novel graph layout techniques I developed during this process.

 

WikiViz v5

The fifth incarnation of WikiViz (v5) saw several significant improvements. Most notable was a new graph layout algorithm that allowed me to scale to hundreds of thousands or millions of vertices (tens of millions of edges). A traditional spring-based layout technique would choke at such quantities as they typically have (vertex)2 run times for repellent forces. However, much like a spring-based layout technique, connected vertices attracted each other, and were drawn together spatially. At a larger scale, this would cause highly connected clusters of vertices to coalesce. However, as my output improved, I began to realize that the data I was using from the Wikipedia category links table was messy and not representative of Wikipedia's true structure. The cl_to and cl_sortkey fields are not true links, although the topics are often related. To remedy this problem for future layout attempts, I have prepared a true page links input file with data scraped right from article content. When I get sufficient time to tinker again, I will attempt to render a full and accurate Wikipedia graph.

I have included a few sample renderings for those who are curious.


Key

 
wikipedia graph
wikipedia graph

Five levels deep, centered on History (WikiViz v5)
See Medium Resolution Version
See High Resolution Version

Five levels deep, centered on Politics (WikiViz v5)
See Medium Resolution Version
See High Resolution Version


wikipedia graph
wikipedia graph

Five levels deep, centered on Religion (WikiViz v5)
See Medium Resolution Version
See High Resolution Version

Five levels deep, centered on Music (WikiViz v5)
See Medium Resolution Version
See High Resolution Version


A
B
C
D
E
F
G
H
1
2
3
4
5
6
7
8


The nodes seen in the graph above were selected by starting at Wikipedia’s Science category page, and traversing category links to a depth of 100 hops. The resulting data included 73,230 category pages (vertices) and more than a quarter of a million links (edges). You can click on any of the tiles to see full resolution versions. I've included some example locations that can be used as starting points.

  Location Description
  B6 A highly connected set of British cities and associated items
  E6, F6 and around Countries of the world
  D6 Year categories from roughly 1600-2007
  D2, D3 and around U.S. States
  E2 and F2 Sports
  Left side U.S. Counties
  Upper right Cluster of music album categories

 

WikiViz v1 to v4

Layout is done using a spring model - each edge in the graph acts like a spring, pushing connected nodes apart if they are too close and pulling connected nodes together. The spring constant and optimal spring length can be adjusted. Overlapping nodes are nudged apart using a separate, non-spring-based method. The code base is under a thousand lines of Java code, and was whipped together a few hours at a time over the course of a week. Tweaking the layout code (dampening, overlap detection, etc.) consumed most of that time. Although called WikiViz, there is nothing specifically Wikipedia about it. It's a general graph layout and render engine that can be used with many sources. The input is a simplified version of GraphViz's already simple DOT format:

Chris Harrison -- Student
Chris Harrison -- Human
Chris Harrison -- Animal
Human -- Biology
Biology -- Animal
...

Initially, the graph was rendered real-time on screen. With about 1000 edges, the graph was interactive. However, as I pushed the edge count into the tens of thousands, it became clear that watching the graph balance into an energy minimized state was going to take serious processing time (hours). Not only was each iteration taking about half a second (2 fps), but the number of iterations required to properly layout the graph was large (tens of thousands). It was clear that the computation needed to be run non-graphically and saved to file when complete. This also allowed me to escape the low resolution of monitors and produce high-resolution images (20,000x20,000 pixels or more).

For the time being, I am using Wikipedia's category links database (cl_to and cl_sortkey fields), which includes ~178k vertices and 6.5 million edges. Expanding to Wikipedia's full 1.5 million vertices is going to be a huge challenge (memory requirements alone!). However, the former data set is plenty interesting. Currently, I am working with various subsets of the database: two to fifty hops out, starting from Humans, History, Politics, Mathematics and Physics.

 

WikiViz v1 to v4 Images

Below are some visualizations that have been produced. Various layout duration and spring attributes were used. Beware of the high resolution links. Some JPEGs are 30+ MB, which can seriously tax your hardware.

Two levels deep, centered on History (WikiViz v4).
See Medium Resolution Version

Four levels deep, centered on Mathematics (WikiViz v3)
See Medium Resolution Version
See High Resolution Version


wikipedia graph

Three levels deep, centered on Physics (WikiViz v3)
See Medium Resolution Version
See High Resolution Version

Five levels deep, centered on History (WikiViz v3)
See Medium Resolution Version
See High Resolution Version


wikipedia graph

Four levels deep, centered on Politics (WikiViz v3)
See Medium Resolution Version
See High Resolution Version

Three levels deep, centered on History (WikiViz v2)
See Medium Resolution Version


Three levels deep, centered on Humans (WikiViz v2)
See Medium Resolution Version
See High Resolution Version

Close up of graph shown at left.


Three levels deep, centered on Physics (WikiViz v2)
See Medium Resolution Version
See High Resolution Version

Three levels deep, centered on Math (WikiViz v2)
See Medium Resolution Version
See High Resolution Version


Interactive WikiViz version with Geography Dataset (WikiViz v1)
See Medium Resolution Version

 

 

GraphViz Images

Two levels deep, centered on Artificial Intelligence (using GraphViz)
Two levels deep, centered on Geography (using GraphViz)

Two levels deep, centered on Geography (using GraphViz)
See Medium Resolution Version

Two levels deep, centered on Geography (using GraphViz)
See Medium Resolution Version

 

 

 

 

   
chris.harrison@cs.cmu.edu
© Chris Harrison