F/X Visualizations April 30, 2010
The Network Structure of Baseball Blogs: Part 1

Earlier in the week I read about the network structure of Twitter employees' accounts, and that got me thinking about the network structure of baseball blogs. Network theory (or graph theory) looks at the structure of objects joined by pairwise connections. It has been used to study the structure of the internet, email networks, the phone and power grids, epidemiological networks, food webs and tons of other things. In this case you can think of baseball blogs as vertices, connect two of them with an edge if one links the other, then graph out all the connected blogs and see whether there is any structure.

I used the data from BallHype to generate the web. I looked at their top 200 baseball blogs and then went back to each blog's last 100 posts and saw which of the other 200 blogs linked to that post. These are links from posts to posts, not general links from one blog to another. Here are all the blogs with at least one connection to the main component, with an edge drawn whenever one blog links another.

To make the image a little simpler and show only the stronger connections, I redrew this graph with an edge only when one blog linked another three or more times. I dropped blogs that were not connected to the main component under this new edge definition. Each link is directed, with an arrow going from the linking blog to the linked blog.
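A minimal sketch of that edge definition in Python, not the code used for these graphs. It assumes the raw data is a list of (linking blog, linked blog) pairs, one per post-to-post link, and it assumes "connected to the main component" means connected when arrow direction is ignored. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict, deque

def build_edges(post_links, min_links=3):
    """Count links per (linking blog, linked blog) pair; keep pairs with min_links or more."""
    counts = Counter((src, dst) for src, dst in post_links if src != dst)
    return {pair: n for pair, n in counts.items() if n >= min_links}

def main_component(edges):
    """Return the largest group of blogs that are connected, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search to collect everything reachable from this blog.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical sample: one tuple per post-to-post link.
post_links = ([("A", "B")] * 3 + [("B", "C")] * 4 + [("C", "A")] * 1
              + [("D", "E")] * 3 + [("F", "A")] * 2)
edges = build_edges(post_links)      # C->A and F->A fall below the cutoff
component = main_component(edges)    # D and E are connected, but not to the main group
```

Raising or lowering `min_links` changes which blogs survive, which is why the cutoff matters so much for what shows up in the picture.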

The algorithm tries to draw the vertices in positions such that they are close to blogs that linked them and which they linked. So you can sort of see clusters of blogs which should be similar (linked to and from similar blogs). Here I have labeled the top 15 blogs (a cutoff that conveniently includes Baseball Analysts -- BA).

Here you can see BA cluster out with the well-connected center of the network particularly close to its sabermetric brethren: the Hardball Times, Baseball Prospectus, The Book Blog, FanGraphs and Beyond the Box Score.

Next I wanted to see how strongly blogs following the same teams clustered out together in the network. I should say that the vertices are not all of the blogs: because of the cutoff, I am only showing blogs which connect to the main component of this strong-link network (remember, my definition of an edge is three or more links). The Red Sox, Cubs, Cardinals and Angels all have lots of blogs in the top 200, but most of these fell away, presumably because they either did not link out enough or did not have enough links in (I am not saying anything about the quality of these blogs based on that). Some other teams with far fewer blogs had more stay in the network.

The Yankees and Mets are well represented, with many blogs that are well connected and a couple of connections between the two. There are a handful of blogs which cover both the Mets and Yankees, such as Mike Silva's Blog and the New York Times' Bats Blog, and I just randomly assigned those to either the Yankees or the Mets. Having one blog that links to lots of other team blogs really keeps many in the network which would otherwise drop out. Fack Youk is the Yankees blog with many links going out. Amazin Avenue, the main hub of the Mets network, has many connections going out and coming in.

Then you have some surprising teams. Who knew there were so many Nats blogs? You can see this is largely driven by one, Federal Baseball, which regularly links a number of other Nats blogs. On the other hand the Pirates section is driven by one blog, PBC blog, which receives links in from a number of other blogs. There is an interesting blog in there, Call to the Pen, which links to Padres, Mariners and Pirates blogs, as well as many others.
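The difference between a blog that mostly links out and one that mostly gets linked can be put in numbers as out-degree and in-degree. Here is a small Python sketch, assuming the edges are (linking blog, linked blog) pairs that already passed the cutoff. The blog names are placeholders, not the real data.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) counters for a list of directed edges."""
    out_degree = Counter(src for src, _ in edges)   # how many blogs this one links to
    in_degree = Counter(dst for _, dst in edges)    # how many blogs link to this one
    return in_degree, out_degree

# Placeholder network: "Hub" links out to three blogs, which all link to "Target".
edges = [("Hub", "X"), ("Hub", "Y"), ("Hub", "Z"),
         ("X", "Target"), ("Y", "Target"), ("Z", "Target")]
in_degree, out_degree = degree_counts(edges)
```

In this toy example "Hub" has out-degree 3 and in-degree 0, while "Target" is the reverse, even though both hold a three-blog neighborhood together.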

I am not trying to make a value statement that having blogs in this network is better than not (e.g., I am not saying that the Nationals blog community is any better or worse than the Red Sox blog community). I am just showing the network based on my arbitrary way of defining a connection.

This is a first pass at the data and next week I will dig a little deeper into the network structure. How connected is the network? What is the average distance between two random blogs? Do any teams cluster out together?
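For the average-distance question, one common approach is to run a breadth-first search from every blog and average the path lengths over every pair where one blog can reach the other. The Python sketch below does that. It is an illustration under stated assumptions rather than the analysis planned for Part 2: the function name and sample edges are made up, and unreachable pairs are simply left out of the average.

```python
from collections import defaultdict, deque

def average_distance(edges, directed=True):
    """Mean shortest-path length over all ordered pairs (a, b) where b is reachable from a."""
    neighbors = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        neighbors[src].add(dst)
        if not directed:
            neighbors[dst].add(src)
    total, pairs = 0, 0
    for start in nodes:
        # Breadth-first search gives the fewest-links distance from start.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1          # exclude the start node itself
    return total / pairs if pairs else float("nan")

# Made-up example: three blogs linking in a circle.
edges = [("A", "B"), ("B", "C"), ("C", "A")]
following_arrows = average_distance(edges)
ignoring_arrows = average_distance(edges, directed=False)
```

Following the arrows, each blog is one step from one neighbor and two from the other, so the average is 1.5. Ignoring direction, every pair is adjacent and the average is 1.0.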

Is that outlier that is labeled "LL" Lookout Landing? Interesting that it is so isolated and is part of SB Nation.

Holy crap.

Which blog is that unmarked circle right below THT? Seems to be the center.

This is fascinating, Dave. I'd love to see the labels of all the blogs on that second graph where you labeled the top 15.

Very interesting!

Too bad you couldn't do this according to link referrers to find out how the audiences relate between tons of baseball blogs. That would require lots of data from lots of sources though, so it probably can't be done.

Tyler, I was also surprised to see Lookout Landing way out there. I think of them as one of the top blogs and I thought they would be in the thick of it. I think part of it is they do not do big link roundups, which tie you into lots of other blogs. Also, I think they are linked by other blogs, but my 3-or-more-links-from-another-blog-to-draw-the-edge cutoff really cut down on their links in. I think that a better way to do it would be to keep all links but weight the edges based on the number of links between blogs.

Clemente, that is BBTF's Primer Newsblog, which makes sense because they link a ton of stuff.

Mike, here it is. Some of my abbreviations are not super clear, but it is the best I could do. If there is one you are wondering about, ask. When making those abbreviations I noticed that a ton of the SB Nation blogs are two-word alliterations: Amazin Avenue, Lookout Landing, Royals Review, Red Reporter, Camden Chat, Twinkie Town and Bluebird Banter.

Fack Youk, Amazin Avenue and Brew Crew Ball are three relatively separate nodes all sending out tons of links. I also really like how strongly the sabermetric-based non-team blogs cluster out together in the middle.

Thanks, Dave!

I think while there is a lot of idea-sharing that goes on between the USSM/Lookout Landing authors and the broader saber community, there is not a lot of direct linking of articles between the two groups going either way. I think that's a purposeful choice by those authors to develop their communities that way and it seems to work well for them. Sometimes, though, it means that their good saber-related work gets ignored in the broader saber community. In the offseason, I tend to follow LL pretty closely, but during the season, I don't keep up as much.

Dave, did Viva El Birdos show up in your data? I can't seem to find it in the massive web you linked to above.

My blog didn't even make their Yankees list! Someone click on my name so maybe I'll register :)

You should be careful about inferring clusters from graph visualizations, as that is known [in the research literature] to potentially be highly misleading. (Not that the computational algorithms to find cohesive groups don't have their issues as well, but it's still in general worth being careful with such visualizations.)

Very cool. I was surprised to see Camden Crazies (me) included, and it was neat that it showed up in the middle there with BA, The Book, FanGraphs, and the like.

You might want to distinguish in-degree from out-degree here. The in-degree is how many things link to it, while the out-degree is how many things it links to. In-degree is usually considered to be more important when discussing things like authority, respectability, etc. Google uses in-degree for its page rankings, for example.

Nick,

Viva El Birdos is definitely a top 200 blog according to BallHype, but it neither linked any other blog three or more times nor was linked by any other blog three or more times, so it did not show up in the smaller networks (it is one of the nodes in the bigger first network). I think that was a real drawback to my method. In simplifying the network I threw away a lot of the structure.

Phil,

Definitely in-degree and out-degree are very different, and BallHype uses in-degree for their rankings, but here I am not ranking the blogs or saying anything about their authority or respectability. Also, the links in the network are directed; you can see the arrows signifying who is linking whom.