Experiments in Visualizing Social Networks

Alan B. Scrivener
Human Interface Prototypes
abs@well.com
27 June 2004

0. ABSTRACT

By representing a social matrix as a binary triangular array, and then permuting rows and columns to minimize the perimeter separating ones and zeroes, favoring ones in the upper left corner, a notion of a canonical form for the matrix has been devised. This has then been used as a starting point for adding dynamic display of multivariable information in a variety of ways.

1. KEYWORDS:

binary triangular matrix, dynamic multivariable social network visualization

2. SUPPORT:

This paper is supported by a contract from Mindtel LLC.

3. GOALS

Recent decades have seen many advances in the analysis and visualization of networks, both generalized mathematical networks as well as real-world computer networks, social networks, knowledge networks and the like. [BUCHANAN2003].

Many recent advances have been made by vendors and contributors of software for computer network monitoring. [LINK1]

A cursory review of many of these offerings shows that most use variations on a visualization technique sometimes known as "data constellations," which involves representing nodes as circles or rectangles and the edges connecting them as line segments. Usually the most-connected "super-nodes" are shown near the middle of the diagram, and proximity is used where possible to indicate connectivity as well.

For example, this diagram of disease transmission uses the "data constellations" technique:

figure 1
disease outbreak visualization at PBS web site
disease outbreak visualization at PBS web site [LINK2]

One disadvantage of this technique is that it works best with sparse networks. When the connections become sufficiently dense, the picture becomes so cluttered that it is difficult to trace individual connections.

For example, this diagram of connections between computers gives a rough idea of connection density, but does not easily communicate each link:

figure 2
internet topology visualization at SDSC web site
internet topology visualization at SDSC web site [LINK3]

(This author is reminded of a facetious map of the Los Angeles freeway system published in the humor magazine National Lampoon in March of 1972. [LINK4] As part of an article entitled "The Miracle of California," this map had a key indicating that each freeway was shown as a black line. The map was almost all black, with a few dozen thin triangular slivers of white.)

These traditional techniques have not been very good at the following:

It is the goal of this project to develop techniques which can better solve these visualization problems.

4. APPROACH

In the well-developed field of algebraic graph theory [BIGGS1994], a graph (network of nodes) is represented by a binary matrix, in which all values are either zero or one. Each row and column of the matrix corresponds to a node in the graph. Each matrix element corresponds to a possible edge (connection). If the graph is undirected (i.e., the edges have no arrowheads, and so are bi-directional), the matrix is symmetric along the diagonal from upper left to lower right. Furthermore, if self-connecting edges are disallowed, then all pertinent information can be represented in a triangular matrix of size n by n-1 (where n is the number of nodes), having n*(n-1)/2 elements, each representing a possible edge.

There are many more benefits to this notation. There is a powerful and elegant body of theory that uses algebraic manipulations to answer questions about any given graph, including how many cycles (closed loops) it has, whether it is a spanning tree (visiting each node once and only once), and so on. There are also isomorphisms to the theory of error-correcting codes invented by Richard W. Hamming of Bell Labs in 1940s. [HAMMING1986] [LINK5]

The current explorations are based on the author's flash of intuition, subsequently explored rigorously: that the triangular matrix representation can form the basis for a new family of graph visualization techniques.

One problem that the "data constellation" methods share with the triangular matrix representation is that the position of the nodes have no real geometric meaning. This is problematic because human perception is compelled to seek and find such meanings even where they may not exist. A side-effect is that there are many ways to represent a single graph which may look very different from each other. Therefore, we have been seeking some sort of "canonical representation" for a given graph, to enable comparisons to be made easily.

A second intuitive flash was the idea that it would be better if a matrix is more "blobby." By this we mean that the ones are clustered together. A more rigorous definition of this is that the perimeter between zeroes and ones is minimized. For example, consider this eleven node triangular matrix:


  B C D E F G H I J K 
  0 1 0 0 1 0 0 1 0 0 A
    1 0 1 1 0 1 1 0 0 B
      1 1 1 1 1 1 0 0 C
        0 0 0 1 1 0 0 D
          1 1 1 1 0 0 E
            0 1 1 0 0 F
              1 1 0 0 G
                1 0 0 H
                  1 1 I
                    0 J

First we add lines of just plus sign characters (+) and spaces, to emphasize the spaces between the digits:


  B C D E F G H I J K 
  0 1 0 0 1 0 0 1 0 0 A
   + + + + + + + + + +
    1 0 1 1 0 1 1 0 0 B
     + + + + + + + + +
      1 1 1 1 1 1 0 0 C
       + + + + + + + +
        0 0 0 1 1 0 0 D
         + + + + + + +
          1 1 1 1 0 0 E
           + + + + + +
            0 1 1 0 0 F
             + + + + +
              1 1 0 0 G
               + + + +
                1 0 0 H
                 + + +
                  1 1 I
                   + +
                    0 J
                     +

Then by adding minus (-) and vertical bar (|) characters we draw the perimeter between the zeroes and ones:


  B C D E F G H I J K 
  0|1|0 0|1|0 0|1|0 0 A
   + + +-+ + +-+ + + +
    1|0|1 1|0|1 1|0 0 B
     +-+ + +-+ + + + +
      1 1 1 1 1 1|0 0 C
       +-+-+-+ + + + +
        0 0 0|1 1|0 0 D
         +-+-+ + + + +
          1 1 1 1|0 0 E
           +-+ + + + +
            0|1 1|0 0 F
             + + + + +
              1 1|0 0 G
               + + + +
                1|0 0 H
                 +-+-+
                  1 1 I
                   +-+
                    0 J
                     +

In this case there are 13 minuses and 19 vertical bars, for a total of 34 perimeter units. If you wanted to buy fence for this triangle matrix, to keep the zeros and ones from mingling, you'd need 34 lengths of fence. But what does it mean for a relationship to be near the upper left corner, say, of the triangle? Does this indicate anything? No, it doesn't. The order of listing the nodes both across and down is arbitrary, as long as they match up. What if we permute the order of the nodes ("scramble the letters") -- does it change the meaning? Again, no. But the mind searches for meaning in a visual display, and so we were looking for a way to make this position information in some way meaningful.

Our original planned procedure was to permute the order of the rows and columns ("scramble the letters") through all possible combinations, until the perimeter number was minimized. This would be the "blobbiest" form of the triangle array.

One effect of doing this would be to reduce the data. There are many graphs that have the same topology, i.e., if you remove the labels you can't tell them apart. This approach would cause each of those graphs to be displayed the same way, except for the labels. (There is an analogy here to hashing algorithms. [LINK6])

Once we developed this technique (and accompanying software tools) we planned to use the approach as a springboard to more complex visualizations of multiple types of relationships changing over time.

5. PROCESS

5.1 Analysis

Before we start crunching numbers, let's do some thinking. A graph with one node is trivial: it can't have any edges. Likewise a graph with two nodes is also trivial: its only possible edge is either missing or present. In either case you could argue that it is already in "canonical form." When we get to three nodes things get a little more interesting. First, consider the graph with no edges:


  B C 
  0 0 A
    0 B

It can be argued that it is already in "canonical form" since no matter how you permute it it doesn't change.

The same can be said of the graph with three edges:


  B C 
  1 1 A
    1 B

Next, consider the graph with one edge:


  B C      B C      B C
  1 0 A    0 1 A    0 0 A
    0 B      0 B      1 B

It resembles the graph with two edges, in that if you exchange all zeroes for ones and vice versa the matrices are the same as above:


  B C      B C      B C
  0 1 A    1 0 A    1 1 A
    1 B      1 B      0 B

Taking a page from group theory, if we solve one we've solved the other. From inspection we see that (for example, in the case of one edge) we can permute any matrix into any other. The first and third have shorter perimeters than the second, so we can make an arbitrary choice, and select the first one. Here is how it looks with its perimeter drawn:


  C A 
  1|0 B
   + +
    0 C
     +

Moving on to a graph with four nodes, there are six elements, each of which can be zero or one, so there are two to the sixth power, or 64 possible matrices. Each can be permuted 6! or 24 ways. This yields 24*64 or 1,536 permutations to inspect, so it must be time to stop thinking and start programming.

(Notice how quickly the problem shifted from trivially obvious to effectively intractable without a computer!)

5.2 Software Tool Building

We wrote a Java application called NetworkBlobs which takes as input a list of node pairs, and outputs the text forms above or graphical displays of a matrix, with or without perimeters. Here is a 5 node matrix without the perimeters drawn:

figure 4
5 node graph
5 node graph as triangle matrix;
rows and columns (A-E) are individuals (nodes);
squares in the triangle matrix are potential
relationships (edges, connections);
grey = connected, black = not connected

Here is a 200 node matrix without the perimeters drawn (clearly this method of display can be used until one runs out of pixels, probably at about a thousand nodes):

figure 5
200 node graph
200 node graph as triangle matrix;
rows and columns are individuals (nodes);
squares in the triangle matrix are potential
relationships (connections or edges);
grey = connected, black = not connected

The Java program could also draw the perimeters. Here is a 13 node matrix with the perimeters drawn:

figure 6
13 node graph
13 node graph as triangle matrix;
rows and columns (A-M) are individuals (nodes);
squares in the triangle matrix are potential
relationships (connections or edges);
grey = connected, black = not connected;
orange = perimeter (boundary between 0s and 1s);
light gray = spacer

At this point we decided we needed some real data to operate on before we went much further. The author's 15-plus years experience in scientific visualization [LINK7] have taught a profound respect for real data. It often times can break your code in ways you never imagine when you create "fake" test data.

Our client was scheduled to provide data from a proprietary "massive multi-player online game" (MMOG), but this was months away, and we wanted to begin testing the software right away. We turned to some published social network information, the author's high school yearbook. As a mild sort of "prank," a group of 13 friends -- all graduating seniors -- bought an advertisement in the back of the yearbook with a group photo with whacky props and enigmatic quotations:

13 seniors in high school yearbook ad
13 seniors in high school yearbook ad

The yearbook also had listings for each of the individual students, including which clubs they'd belonged to for each of their four years in high school; here is an example:

typical student listing
typical student listing

From this I derived a list of each student and which clubs they belonged to, and as a consequence who they were connected to by clubs:

Ultimately we boiled this down to a file of connections between all the students, as shown by this partial listing (where A=0, B=1, C=2, D=3, E=4, F=5, G=6, H=7, I=8, J=9, K=10, L=11, M=12):

We could feed this list directly to our Java application, NetworkBlobs, and it would store the triangular matrix as a Java object, capable of writing out its data by several different Java methods, such as the text-only techniques and the colored squares techniques seen above.

We also gave our object the ability to compute its perimeter, or number of orange rectangles in the pictures above.

Next we enhanced the Java program to step through all permutations of a matrix, retaining the form that minimized the perimeter. (We were aided in this by some Java code for generating permutations by Michael Gilleland of Merriam Park Software. [LINK8])

We quickly discovered two things. First, the problem as originally posed is intractable on a PC for numbers of nodes over about 12. See the following table, where:

n n! runtime microsec/n!
7 5,040 1.340s 265.8730
8 40,320 1.835s 45.5109
9 362,880 11.192s 30.8421
10 3,628,800 2m 4.163s 34.2159
11 39,916,800 31m 21.444s 47.1341
12 479,001,600 11h 46m 13.880s 88.4629
13 6,227,020,800 [unknown] [unknown]

Note that the runtime is unknown for n=13; our estimate was about 30 days! For this short-duration exploration project we couldn't use 30 days of processor time for one computational experiment. We have ideas for improving the algorithm's efficiencies; see FUTURE WORK below.

Second of all, we found many cases where there were a number of patterns which "tied" for first place by the scoring we used, based solely on perimeter length. We iterated until we found that adding a "tie-breaking" small score which preferred the upper left corner for ones in the matrix, we could break symmetry and have fewer canonical forms that categorized graphs as similar if they had symmetric relationships.

5.2 Visualization Experiments

Once we had data, though it was a small amount, we went to work finding different ways to visualize it.

5.2.1 Yearbook Data

We visualized the yearbook data for subgroups, mostly of 11 or less, since that would run in about half an hour. Starting at networks with three nodes and working our way up to 11 by throwing away nodes, and looking at the canonical form of each sub-group, we noticed that the most-connected player thus far is always on the top row, and it seems to be close to a popularity contest in general. (If this turns out to be true, and provable, it could save a whole lot of calculation.)

With our best canocal form generator to date, we took the 11-node network data (left) and produced this canonical form (right):

figure 7a
B C D E F G H I J K
A
B
C
D
E
F
G
H
I
J
figure 7b
C H E F B G D A K J
I
C
H
E
F
B
G
D
A
K
11-node graph from yearbook data before and after conversion to canonical form

Improvements in algorithm performance, access to better hardware, and the passage of time will all allow for more, and larger, runs.

In pondering how to proceed from this base, we remembered some previous research done for Mindtel LLC, in which he had vectors of data (i.e., sets of values) we wanted to display at once. In one case we had public health data from three villages on the Big Island of Hawaii, giving us numbers of villagers in each of six categories: Adult M, Adult F, Child M, Child F, Infant M, Infant F. We showed each of the six values as the size of a ring, or toroid, where color indicated deviation from average: red was high deviation, green was low. The base ring showed an average score over the six variables, again with color indicating deviation from average and size being total population. This image is showing how all villages have below-average number of children, and most have below average numbers of women as well:

figure 8
village health data in Hawaii
village health data in Hawaii (rings repesent number of villagers in several age and gender groups) [LINK9]

In a second project, we worked on ways to visualize text data from TIDES, a daily email briefing for the US government that excerpts all English language or translations in electronic (searchable) news media mentioning the Middle East and the U.S. war on terror. We plotted calendar days against time zones (for east/west position) to make a surface whose height was total words in all the TIDES news feeds, and then used toroid shapes to show frequencies of a number of words we were scanning for.

figure 9
TIDES text data as a space-time chart with word-count rings
TIDES text data as space-time chart with word-count rings ( [LINK10]

After some introspection, the idea was hatched to use the toroids to represent types of relationships or connections. The yearbook data told us which clubs or other groups the nodes belonged to. We used the same base toroid on a cylindrical pole, ringed by a number of toroids (some have compared to these palm tree leaves). The position of each "palm leaf" toroid indicates the type of group. The color and size of each "leaf" indicates how "strong" the membership is (e.g., how many years were spent in the club). The NetworkBlobs program was modified so that a triangle matrix object could output the check positions and values in a format that another set of tools could use to make the rings. Our first experiments with this approach were encouraging.

figure 10
rings (toroids)  above checks
rings (toroids) above checks;
far and left edges of triangle base = nodes;
checks = connections, grey = connected, black = not connected;
white cylinders and horizontal rings for calibration only;
angular position of vertical rings = type of group;
size and color of vertical rings = degree of group memberships, small/red = low, large/green = high

The most common feedback which we received from other researchers and associates at this point was, "Very interesting, but what does it mean?" We decided to focus next on mapping this new notation back to known forms.

5.2.2 Multiplayer Online Game Data

At about this point we got access to some of the first data from the multiplayer online game.

5.2.2.1 2D Visualizations of Player Data

We visualized in small pieces at first it by the following approach:

We found events of three types, which we called c, r and w, occurring to a group of 7 simultaneous online players, which we named A-G, over a few minutes of game play. We looked at each event in turn as a triangular matrix and as a string sculpture display, side by side.

figure 11
B C D E F G
A
B
C
D
E
F
players C and E were involved in a type 'r' event at time T1

figure 12
B C D E F G
A
B
C
D
E
F
players F and G were involved in a type 'c' event at time T2

figure 13
B C D E F G
A
B
C
D
E
F
players B, C and E were involved in a type 'w' event at time T3

Combining all the events into one display gave us a collapsed time view of the relationships.

figure 14
B C D E F G
A
B
C
D
E
F
combined relationships between all 7 players over time

Then having the NetworkBlobs program permute (scramble) all of the players' row and column assignment until the perimeter is minimized and the upper left corner is favored in case of tie, gave the canonical form (as we then conceived of it), the "blobbiest" way to organize the players (nodes). Here is that form displayed both ways:

figure 15
C B D A G F
E
C
B
D
A
G
combined relationships between all 7 players in canonical form

Comparing this with figure 14, we see that the "cycles" of interconnected pairs and triplets have been moved closer together, even using the string sculpture technique. Thus is a clue that the notion of canonical form may have usefulness outside of the triangular array display technique, as a pre-processing step for other techniques.

5.2.2.1 3D Visualizations of Player Data

We then moved on to using the 3D techniques on the player data that we'd experimented with previously using the yearbook data.

Here was our approach:

We used the tools we'd built while working with the yearbook data to do some animations of the game player data. Here are some samples:

figure 16




long shot animation:
triangle matrix of relationships;
rows and column represent 7 individuals;
grey=connected, black=not connected;
angles of toroids on white cylinder mounts show how related;
colors and sizes (redundant) of toroids show how much related (red=low, green=high);
animated over time to show change of relationships over time

[Quicktime movie here]

We zoomed in on three of the four active squares, and added some outlines to the toroids for emphasis, and animated again:

figure 17




close-up of similar to above; triangle matrix of relationships;
rows and column represent 7 individuals;
grey=connected, black=not connected;
angles of toroids on white cylinder mounts show how related;
size and color (redundant) of toroids show how much related ;
animated over time
[Quicktime movie here]

After analyzing how informative these displays were, we decided that the hardest to read variable was the type of group, indicated only by the angle of a toroid. We took the color of the ring, which had been redundantly indicating importance of relationship along with size of the toroid, and had it redundantly indicate type as well. (In our data each event was at an instant in time, but we ramped up and down the importance of each connection over time, to reduce the abruptness of the animation.)

figure 18




close-up of similar to above, now colors and toroid angles both indicate how related;
triangle matrix of relationships;
rows and column represent 7 individuals;
grey=connected, black=not connected;
size of toroids show how much related;
animated over time
[Quicktime movie here]

After our colleagues had viewed these animations and given us feedback, after our own studies of the the animations as well, we concluded that this type of real-time monitoring of a large group of interacting nodes would be useful for the spotting of correlated events. As more data becomes available we expect to test this result more thoroughly.

6. RESULTS

We believe we have achieved two significant results to date:

  1. As shown in figures 14 and 15, the canonical form may serve as an effective pre-processor for other visualization techniques.

  2. Animations of toroids can be used as a real-time display of interacting groups, using human pattern recognition to spot correlated activities, as shown in figure 18.

7. FUTURE WORK

(Note: This paper represents a "snapshot" of this research as of June 7, 2004. Some of the tasks listed below me be completed or in progress by now.)

As the process has unfolded, the following additional goals have been suggested by the author, colleagues and others.

7.1 Algorithm Quality Improvements

Some of the 3D visualizations have confirmed that the canonical form would be better if more connections (grey squares) were clustered in the upper left corner. And yet there is already a "gravity" factor with a slight pull towards that corner. Upon reflection we realized that since we weren't counting the outer edges as perimeter elements, single squares would when possible migrate to a corner where they would only have one perimeter element instead of four, and otherwise seek edges for similar reasons. The solution is to count all edges between a 1 and the boundary of the triangle in the perimeter total. This improvement is as yet unimplemented, but we expect it will result in even "blobbier" network displays.

7.2 Algorithm Efficiency Improvements

The current algorithm for computing the canonical form of a triangular matrix involves a chunk of n squared arithmetic steps done n factorial times in a row. We think this may be improved in several ways:

7.3 Additional Computational and Visualization Experiments

More game data is becoming available to us almost every day. We would like to create more animations of game activity using the tools we have already developed.

With larger data sets we will be able to produce longer animations and probe for subtler relationships between more players.

Another experiment we would like to do is to apply our methods to types of random networks described in the literature of network analysis. In Six Degrees (2003) by Duncan Watts, [WATTS2003] the evolution of the idea of biased random networks is described. Using parameters named alpha and beta devised by Steve Strogatz and Duncan Watts, and later a gamma parameter devised by John Kleinberg, various nuances of how model networks are randomly wired can be controlled. We would like to generate some random networks with various values of alpha, beta, and gamma, and attempt to detect the values used with our visualization techniques.

7.4 Enhancements to Techniques

We would like to be able to deal with cases where the values in the matrix are not just one or zero. In the simplest case, a positive integer might, for example in high school data, indicate the number of clubs two individuals share. Our Java code was written to able to minimize perimeters for this kind of data, by minimizing the sum of the differences between each adjacent cell. In a more complex scenario, each cell might contain a list (vector) of values, for example a list of which clubs both students belong to. Our code will need to be enhanced to deal with this case. The rings visualizations of the high school data represents our first attempt at visualizing this type of data. We would like to explore additional analysis and visualization techniques for this type of data as well.

We would also like to explore enhancements to the "string sculpture" technique in which the types of relationships between nodes are represented by colored toroids threaded along the connecting edges, as in this first attempt:

figure 19
string sculture vizualization
string sculture vizualization method; endpoints (not drawn) are nodes, line segements are edges (connections), toroid centered on line segment is transient relationship event, color=type, size=how big (how important)

7.5 New Techniques

Perhaps partly inspired by graph theory pioneer William Hilbert's icosian calculus [HILBERT1885] and graph theory pioneer William Hamilton's related icosian game [HAGGARD1980], both based on the icosahedron, we have considered using the twelve vertices of an icoashedron to represent nodes in a graph, connecting them -- across the interior of the volume where necessary -- with lines representing the graph's edges, as in this somewhat difficult to parse illustration:

figure 20
graph nodes represented as icosahedron vertices in 3D
graph nodes represented as icos icosahedron vertices in 3D

A clearer idea of the resulting three-dimensional shape can be gained by manipulating the 3-D object in an interactive Java applet:

We have also considered applying an algorithm, analogous to the perimeter-minimizing done with the triangular arrays above, in which the nodes are permuted until the minimum distances are found between the connected 3D points. This would cause highly connected nodes to cluster and possibly make interesting and informative patterns of lines as well.

For graphs having more than 12 nodes, other Platonic solids, Archimedian solids, and geodesics can be used.

Another approach we have considered is the visualization of bipartite graphs, i.e., graphs in which all nodes can be divided into two groups, and all edges connect a node in one group with a node in another. The high school data originally came in this form, with the two groups of nodes being students and clubs. From this data we derived the direct connections between students. We believe there may be innovative and useful ways to visualize the bipartite data. One strong hint is that any such graph, represented in the canonical triangular array form, will have all nonzero cells confined to a rectangle in the middle of the triangle.

So far all of our representations of time have been done with animations. It has been suggested that we utilize world lines, i.e., traces in space representing time series events, as another approach.

So for in our analysis of the online game data we have been seeking to discover which players are cooperating, i.e., what the structure of their relationships is. We have been advised that soon we will be able to extract some structure data from the game server. Our sponsors have requested that we find ways to visualize known relationship structures. In the game players are broken into squads, with a military chain of command, or hierarchy, such as this:

figure 21
sample hierarchy in game data
sample hierarchy in game data

One idea we have considered is representing players as toroids, with the highest ranks in the center, and the chain of command radiating outward, such as this (the coloring scheme is one suggestion of several):

figure 22
proposed toroidal visualization of hierarchy
proposed toroidal visualization of hierarchy

Events over time can then be represented by world lines that we have been calling "threads," where each player's behavior is within their toroid:

figure 23
events display as world lines
events display as world lines (color = event type)

We believe this approach may facilitate recognizing correlated events among squad members.

7.6 Borrowing Techniques from Ethology

By happy coincidence, during the research for this paper the author attended an event at the San Diego Wild Animal Park [LINK11] called "Breakfast With the Elephants." A staff zoologist gave a presentation on the birth and early development of an African elephant at the park, and in the process reminded the author that ethologists (animal behavior experts) employ a technique called an ethogram to record and analyze animal behavior. There is a rich literature on these techniques. For example, the breakfast presentation included a "first occcurrence" timeline that showed the first time the baby elephant stood up, walked, nursed, had physical contact with another elephant besides its mother, etc. Realizing that what we have been doing with the online game data is essentially constructing "human ethograms," we would like to borrow techniques from ethology where useful.

REFERENCES AND LINKS

References:

Links:

Last update   3:17 PM PDT Wed. 30-Jun-02004 by ABS.