The Network Structure of Baseball Blogs: Part 1
Earlier in the week I read about the network structure of Twitter employees' accounts, and that got me thinking about the network structure of baseball blogs. Network theory (or graph theory) looks at the structure of objects joined by pairwise connections. It has been used to study the structure of the internet, email networks, the phone and power grids, epidemiological networks, food webs and tons of other things. In this case you can think of baseball blogs as vertices, connect two of them with an edge if one links to the other, then graph out all the connected blogs and see whether there is any structure.
I used the data from BallHype to generate the web. I looked at their top 200 baseball blogs and then went back to each blog's last 100 posts and saw which of the other 200 blogs linked to that post. These are links from posts to posts, not general links from one blog to another. Here are all the blogs with at least one connection to the main component, with an edge drawn whenever one blog links another.
To make the image a little simpler and show only the stronger connections, I re-drew this graph with an edge only when one blog linked another three or more times. I dropped blogs which were not connected to the main component under this new edge definition. Each link is directed, with an arrow going from the linking blog to the linked blog.
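The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the graphs in this post: the blog names and link data are made up, the function names are hypothetical, and "connected to the main component" is assumed here to mean connected when the direction of the links is ignored.

```python
from collections import Counter, deque

def count_links(post_links):
    """Count post-to-post links for each ordered (linking, linked) blog pair."""
    counts = Counter()
    for source, target in post_links:
        if source != target:  # ignore a blog linking to its own posts
            counts[(source, target)] += 1
    return counts

def strong_edges(counts, min_links=3):
    """Keep a directed edge only when one blog linked the other min_links+ times."""
    return {pair for pair, n in counts.items() if n >= min_links}

def main_component(edges):
    """Largest group of blogs joined by edges, ignoring direction (an assumption)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search outward from this blog
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Made-up data: each tuple is one post on the first blog linking a post on the second.
post_links = (
    [("A", "B")] * 3 + [("B", "C")] * 4 + [("C", "A")] * 1
    + [("D", "E")] * 5 + [("F", "A")] * 2
)
counts = count_links(post_links)
edges = strong_edges(counts)   # C->A and F->A fall below the cutoff
kept = main_component(edges)   # D and E form their own small island and are dropped
print(sorted(edges))
print(sorted(kept))
```

In this toy data blog F drops out entirely: its two links to A are below the three-link cutoff, which is the same thing that happens to many real blogs in the re-drawn graph.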
The algorithm tries to place each vertex close to the blogs that link to it and the blogs it links to. So you can sort of see clusters of blogs which should be similar (linked to and from similar blogs). Here I have labeled the top 15 blogs (a cutoff that conveniently includes Baseball Analysts -- BA).
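For readers curious how a layout like this can work, here is a toy force-directed layout in Python. The post does not say which algorithm drew the graphs, so treat this as an assumption: it is a bare-bones sketch in the style of the Fruchterman-Reingold method, where every pair of vertices pushes apart and every linked pair pulls together. The function name, parameters and example blogs are all hypothetical, and link direction is ignored.

```python
import math
import random

def spring_layout(nodes, edges, iterations=200, seed=0):
    """Toy force-directed layout: all pairs repel, linked pairs attract."""
    nodes = list(nodes)
    rng = random.Random(seed)  # seeded so the picture is repeatable
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # rough preferred distance between neighbors
    for step in range(iterations):
        temp = 0.1 * (1 - step / iterations)  # max move per step, shrinking to zero
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of vertices.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force
        # Attraction along every link (direction ignored).
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force
        # Move each vertex along its net force, capped by the current temperature.
        for n in nodes:
            length = max(math.hypot(disp[n][0], disp[n][1]), 1e-9)
            move = min(length, temp)
            pos[n][0] += disp[n][0] / length * move
            pos[n][1] += disp[n][1] / length * move
    return pos

# Made-up example: two tight groups of blogs joined by a single link.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("C", "A"),
         ("D", "E"), ("E", "F"), ("F", "D"), ("C", "D")]
positions = spring_layout(nodes, edges)
for name in nodes:
    print(name, [round(v, 2) for v in positions[name]])
```

The pull along links is what makes blogs that share many neighbors drift toward the same region of the picture, which is why the clusters are suggestive of similarity rather than a formal grouping.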
Here you can see BA cluster out with the well-connected center of the network, particularly close to its sabermetric brethren: the Hardball Times, Baseball Prospectus, The Book Blog, FanGraphs and Beyond the Box Score.
Next I wanted to see how strongly blogs following the same teams clustered out together in the network. I should say that the vertices are not all of the blogs: because of the cutoff, I am only showing blogs which connect to this strongly connected component (remember, my definition of an edge is three or more links). The Red Sox, Cubs, Cardinals and Angels all have lots of blogs in the top 200, but most of these fell away, presumably because they either did not link out enough or did not have enough links in (I am not saying anything about the quality of these blogs based on that). Some other teams with a lot fewer blogs had more stay in the network.
The Yankees and Mets are well represented, with many blogs that are well connected and a couple of connections between the two. There are a handful of blogs which cover both the Mets and Yankees, such as Mike Silva's Blog and the New York Times' Bats blog, and I just randomly assigned those to either the Yankees or the Mets. Having one blog that links to lots of other team blogs really keeps lots of blogs in the network which would otherwise drop out. Fack Youk is the Yankee blog with many links going out. Amazing Avenue, the main hub of the Mets network, has many connections going out and coming in.
Then you have some surprising teams. Who knew there were so many Nats blogs? You can see this is largely driven by one, Federal Baseball, which regularly links a number of other Nats blogs. On the other hand the Pirates section is driven by one blog, PBC blog, which receives links in from a number of other blogs. There is an interesting blog in there, Call to the Pen, which links to Padres, Mariners and Pirates blogs, as well as many others.
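The difference between a blog that links out a lot and a blog that gets linked a lot is the difference between out-degree and in-degree. A minimal sketch of how one might count both from a list of directed edges follows. The blog names are placeholders, not the real data, and the function name is hypothetical.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) counters for directed (linking, linked) edges."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return in_degree, out_degree

# Made-up edges: "HubOut" links to three blogs, and those three all link to "HubIn".
edges = {
    ("HubOut", "X"), ("HubOut", "Y"), ("HubOut", "Z"),
    ("X", "HubIn"), ("Y", "HubIn"), ("Z", "HubIn"),
}
in_degree, out_degree = degree_counts(edges)
top_linker = max(out_degree, key=out_degree.get)  # blog with the most links going out
top_linked = max(in_degree, key=in_degree.get)    # blog with the most links coming in
print(top_linker, out_degree[top_linker])
print(top_linked, in_degree[top_linked])
```

A blog like the hypothetical "HubOut" holds a team's cluster together by linking out, while one like "HubIn" does it by attracting links, and either can be enough to keep smaller blogs attached to the network.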
I am not trying to make a value statement that having blogs in this network is better than not (e.g., I am not saying that the Nationals blog community is any better or worse than the Red Sox blog community). I am just showing the network based on my arbitrary way of defining a connection.
This is a first pass at the data and next week I will dig a little deeper into the network structure. How connected is the network? What is the average distance between two random blogs? Do any teams cluster out together?