Designated Hitter April 07, 2005
Picturing Baseball

Since you're here reading the fine Baseball Analysts site, I assume you've read a lot of baseball articles already. Along the way, you've probably seen a lot of tables that look like this:

```
CLUB          W       L       RS      RA   Pyth Diff
STL          105      57     855     659       5
ATL           96      66     803     668       1
LAD           93      69     761     684       4
HOU           92      70     803     698       1
SFG           91      71     850     770       3
CHC           89      73     789     665      -5
SDP           87      75     768     705       0
PHI           86      76     840     781       0
FLO           83      79     718     700       0
CIN           76      86     750     907       9
PIT           72      89     680     744      -2
NYM           71      91     684     731      -5
COL           68      94     833     923      -5
MIL           67      94     634     757       0
MON           67      95     635     769       0
ARI           51     111     615     899      -3
```

This is a pretty important table, actually. It includes the wins and losses of all National League teams last year, plus their runs scored and allowed and their "Pythagorean variance," which is each team's actual wins minus the number of wins you'd expect based on its runs scored and allowed. Arguably, this table contains the most important, fundamental stats of the season for each team.
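For readers who want to check the Pyth Diff column, here is a minimal sketch of the calculation. It is an illustration, not necessarily the exact method behind the table: it assumes the usual Pythagorean formula with an exponent of 1.83 (the classic version uses 2), and the function names are made up for this example. Under that assumption, the rounded results for the three teams shown agree with the table.

```python
# Sketch of the "Pyth Diff" column: actual wins minus Pythagorean expected wins.
# ASSUMPTION: exponent 1.83; the article does not say which exponent was used.

def pythagorean_pct(runs_scored, runs_allowed, exponent=1.83):
    """Expected winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def pyth_diff(wins, losses, runs_scored, runs_allowed, exponent=1.83):
    """Actual wins minus expected wins over the games the team played."""
    games = wins + losses
    expected_wins = games * pythagorean_pct(runs_scored, runs_allowed, exponent)
    return wins - expected_wins


# A few rows from the table above: (W, L, RS, RA)
teams = {
    "STL": (105, 57, 855, 659),
    "CIN": (76, 86, 750, 907),
    "ARI": (51, 111, 615, 899),
}

for club, row in teams.items():
    print(f"{club}: {pyth_diff(*row):+.1f}")
```

A positive number means the team won more games than its runs would suggest; a negative number means it won fewer.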

Let's say you're a Reds fan. Because the list is sorted by number of wins, you can see that the Reds finished seventh from the bottom of the NL, with 76 victories. You can also see that they scored 750 runs, which is in the middle of the pack somewhere, but allowed 907 runs, which looks like the second-worst total. Finally, you see they actually outperformed their Pythagorean projection by nine games.

As I said, that's a lot of information, and it took a bit of work to pull these facts out of the table. And that was just one team -- imagine being a general baseball fan and wanting to understand the big picture; wanting to understand how all sixteen teams relate to each other. If you can imagine how frustrating that would be, then you grasp the essence of my pet peeve: the overwhelmingly terrible use of numbers in baseball writing. I'm not talking about the analysis of the numbers (though that often misses the mark, too). I'm talking about the way the numbers are displayed.

Specifically, I'm talking about articles in which the writer uses stats to make a point. Don't you think that the writer should present the stats in a way that highlights the point? And that doesn't force readers to cross their eyes and furrow their brows?

Yes, baseball stats are an integral part of the game. Yes, the Macmillan Baseball Encyclopedia was probably the best Christmas present I ever received. Yes, our understanding of the game is deepened and strengthened by the insightful use of baseball stats.

But that doesn't mean that every interesting baseball article has to include tables of stats. Just because it works for the Baseball Encyclopedia doesn't mean that it works for a magazine, newspaper or website. In fact, research has pretty conclusively shown that tables do a poor job of making a point. Readers don't take the time to read them and often don't understand them.

Okay, that's the first half of the rant. Now for the second half: too often, publications that do graphs do them completely wrong. Recently, Baseball America ran a graph of contract information that consisted of blue bars on a blue background, which inspired a rant at my website. But I shouldn't just pick on BA. USA Today, that popularizer of the pie chart and other specious graph designs, has probably done more to undermine the notion of good graph design than any publication in the history of publishing.

On the other hand, we have the New York Times, which in my opinion consistently shows a phenomenally insightful touch with its graphics. The Times, as you know, is supposed to be written for high-minded intellectuals who live in New York; USA Today is supposed to be written for the average person who lives everywhere else. As a result, there is a perception that graphics are "devices for showing the obvious to the ignorant." No, no, no. Graphs can convey "complex information, as long as it's done with grace and clarity," in the words of Edward Tufte, the Godfather of Good Graph Design. Let me give you an example.

Here is a graph of runs scored and allowed by each team in the National League last year, adjusted by park factor. Teams that scored a lot of runs are on the right, while teams that gave up the fewest runs are at the top of the graph. Some folks object to this graph layout, because the "Runs Allowed" axis runs from high on the bottom to low on the top, instead of low to high. I've done that for a reason. Most people associate "up" and "right" with a good outcome, because larger numbers are usually good. So I ran the axis such that the best position is in the upper right-hand part of the graph, and the worst position is in the lower left. I've also added dotted lines depicting the average number of runs per game, and added a couple of labels (good offense, bad defense, etc.) to help readers understand the graph. Finally, I've added the Pythagorean variance to the team label, so that you can see how the team's actual won/loss record differed from its relative position on the graph.
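The layout logic described above (flip the Runs Allowed axis, then split the field with league-average lines) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the raw run totals from the table rather than the park-adjusted figures in the graph, so a team near an average line can land in a different quadrant than it does in the picture, and the function names are invented for the example.

```python
# (RS, RA) for the 16 NL teams, copied from the table earlier in the article.
# NOTE: these are raw totals, NOT the park-adjusted values used in the graph.
runs = {
    "STL": (855, 659), "ATL": (803, 668), "LAD": (761, 684), "HOU": (803, 698),
    "SFG": (850, 770), "CHC": (789, 665), "SDP": (768, 705), "PHI": (840, 781),
    "FLO": (718, 700), "CIN": (750, 907), "PIT": (680, 744), "NYM": (684, 731),
    "COL": (833, 923), "MIL": (634, 757), "MON": (635, 769), "ARI": (615, 899),
}

# League averages play the role of the dotted lines on the graph.
avg_rs = sum(rs for rs, _ in runs.values()) / len(runs)
avg_ra = sum(ra for _, ra in runs.values()) / len(runs)


def plot_position(rs, ra):
    """Return (x, y) with runs allowed flipped, so up and right are both good."""
    return rs, -ra


def quadrant(rs, ra):
    """Label a team relative to the league-average lines."""
    offense = "good offense" if rs > avg_rs else "bad offense"
    defense = "good defense" if ra < avg_ra else "bad defense"
    return f"{offense}, {defense}"


for club in ("STL", "COL", "ARI"):
    print(club, plot_position(*runs[club]), quadrant(*runs[club]))
```

Negating runs allowed is just one simple way to put the stingiest teams at the top; a plotting library's axis-inversion option would do the same job.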

The reason to graph this information is that runs scored and allowed are directly related to wins and losses, and the depiction of spatial relationships for each team lets you understand each team's strengths and weaknesses. Cincinnati was not only the worst defensive team in the league, it was WAY worse than every team other than Arizona. And St. Louis was clearly the best team in the league, as evidenced by its position in the upper right corner.

But the graph can be improved by substituting different lines -- ones that show the implied won/loss record of each team. Specifically, the three lines in the following graph now represent projected victory totals at three levels, with winning percentages of .400, .500 and .600. These lines allow you to group the teams by performance levels, regardless of whether they rely on pitching or hitting.
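One way such lines could be computed is to turn the Pythagorean formula around: pick a winning percentage, then solve for the runs allowed that go with each level of runs scored. The sketch below assumes the Pythagorean formula with an exponent of 1.83, which is an assumption rather than something the article states, and the function names are hypothetical.

```python
# Sketch of constant-winning-percentage lines on a runs scored / runs allowed graph.
# ASSUMPTION: Pythagorean formula with exponent 1.83 (not stated in the article).

def pythagorean_pct(runs_scored, runs_allowed, exponent=1.83):
    """Expected winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def runs_allowed_on_line(runs_scored, win_pct, exponent=1.83):
    """Runs allowed at which a team scoring `runs_scored` projects to `win_pct`.

    Solving pct = RS^e / (RS^e + RA^e) for RA gives
    RA = RS * ((1 - pct) / pct) ** (1 / e).
    """
    return runs_scored * ((1 - win_pct) / win_pct) ** (1 / exponent)


# Three points on each line are enough to draw it, since RA is proportional to RS.
for win_pct in (0.400, 0.500, 0.600):
    points = [(rs, round(runs_allowed_on_line(rs, win_pct))) for rs in (650, 750, 850)]
    print(f"{win_pct:.3f}: {points}")
```

Because runs allowed comes out as a fixed multiple of runs scored, each line is straight, and the .500 line is simply the set of points where runs scored equal runs allowed.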

Now you can see that Cincinnati and Montreal were both .400 teams, but one was built on offense and the other on defense (note the ironic use of "built"). St. Louis was the only team clearly over the .600 mark, while the Cubs and Braves were slightly under that line -- the Cubs were hurt by their Pythagorean difference. It's true that you don't have the exact number of wins, losses, runs scored or runs allowed on this graph, but why do you care? You can get that information from lots of other websites. This graph allows you to see the critical relationships in ways a table of numbers can't. Applying some of Tufte's criteria, this graph gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space. That's a mouthful, but that's also the goal.

To paraphrase Edward Tufte one last time, the objective of graphical representations is to facilitate better reasoning about quantitative data. Good graphs allow the reader to focus on the substance of the data and determine the cause and effect relationships. Tables of numbers don't do this well at all, and neither do poorly designed graphs. But well-designed graphs can show the way.

I took up this cause two years ago, with the creation of the Baseball Graphs website. The good news is that lots of other folks are joining in. Here are some other recent examples of excellent baseball graphs:

Since first rolling out the Baseball Graphs site, I've moved on to a lot of other projects at The Hardball Times, such as Win Shares, WPA and even writing weekly columns. But good graphical displays remain my top priority and my raison d'etre on the Web. The cause is gaining steam, but it still has far to go. I thank Bryan and Rich for the chance to get on my soapbox once again.

Studes is a writer at The Hardball Times and the manager of the Baseball Graphs website.