Runs Allowed vs. ERA
I picked up a copy of Baseball Between the Numbers, the Baseball Prospectus tome that debunks conventional baseball wisdom, and I couldn't help but be struck by the chapter in which it damned baseball's Earned Run Average as an archaic statistic. ERA certainly has its problems, but I was surprised that the book's prescription for ERA's shortcomings was to use the simpler Run Average (RA), which is calculated the same way as ERA except using total runs instead of earned runs. This idea was also championed by Michael Wolverton, among others at BP. Disputing the notion was Kevin Shearer on Rob Neyer's blog site.
Both sides make reasonable arguments, but neither convinced me to my satisfaction, so I decided to examine things statistically via simulation. The key questions regarding ERA vs. RA concern the concepts of variance and bias. ERA was created for a reason - to remove the effects of defense from a pitcher's record. The creators knew that defense could be a source of bias in a pitcher's record - a defense which makes a lot of errors will artificially inflate a pitcher's runs allowed, while a good defense will artificially deflate a pitcher's runs allowed.
ERA does remove the bias: by reconstructing the inning without the error, it removes the effects of defense. Over the long run, ERA will be neither helped nor hindered by good or bad defense. The problem is that ERA is not very efficient in removing that bias. In removing the effect of defense, ERA also throws out a great deal of information, namely everything that happens after the third out of the inning should have been made. After a two-out error, it does not matter whether the pitcher strikes out the next batter or gives up 5 runs; ERA discards all of it. (That ERA fails to capture additional outs as well as additional runs seems to be lost on those who simply claim that ERA is too lenient and bails out pitchers with poor defenses.) Simple RA captures this information, and thus reduces variance by effectively increasing the sample size, but it is of course subject to bias from good or bad defense.
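To make the lost information concrete, here is a minimal sketch of the reconstruction idea. It assumes a deliberately simplified scoring rule: an error counts as an out that should have been made, and any run scoring after the third such out is unearned. The real scoring rules have more cases (for example, runners who reached base on an error), and the names score_inning and per_nine are made up for this illustration.

```python
def score_inning(events):
    """Return (runs, earned_runs) for one inning under a simplified rule.

    events is a sequence of "out", "error", or "run" (one run scores).
    Assumption: a run is unearned once the outs that should have been
    made (real outs plus errors) reach three.
    """
    outs = 0            # outs actually recorded
    would_be_outs = 0   # outs if every error had been fielded cleanly
    runs = 0
    earned = 0
    for event in events:
        if outs == 3:
            break       # inning is over; ignore anything after it
        if event == "out":
            outs += 1
            would_be_outs += 1
        elif event == "error":
            would_be_outs += 1   # no out recorded, the inning continues
        elif event == "run":
            runs += 1
            if would_be_outs < 3:
                earned += 1
        else:
            raise ValueError(f"unknown event: {event!r}")
    return runs, earned


def per_nine(runs, innings):
    """Scale a run total to a per-9-innings average (RA or ERA)."""
    return 9 * runs / innings


# Two innings that start the same way: two outs, then an error.
escapes = ["out", "out", "error", "out"]
blows_up = ["out", "out", "error"] + ["run"] * 5 + ["out"]
print(score_inning(escapes))    # (0, 0)
print(score_inning(blows_up))   # (5, 0)
```

Both innings charge the pitcher with zero earned runs, so ERA cannot tell them apart, while RA sees 0 runs in one and 5 in the other.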
What's more interesting is to look at the standard deviation of these numbers. The better statistic will have a smaller SD. This is not quite a fair comparison as it stands, because errors inflate RA and therefore inflate its SD as well. We can deal with this by deflating RA back to the ERA scale, multiplying it by a factor of 4.24/4.59. With the two statistics on the same scale, we can compare their SDs. The statistic with the smaller SD adheres more closely to the true ERA of 4.24 and is thus the more precise statistic. The results? The SD of ERA was .625 runs per 9 IP. The SD of the adjusted RA was slightly smaller at .608 runs per 9 IP. This indicates that RA is indeed superior in this situation: RA was more closely clustered around the true value of 4.24, whereas ERA was slightly more spread out.
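The rescaling step can be sketched in a few lines. The 4.24 and 4.59 figures come from the simulation described here; the sample RA values and the function name deflate_ra are invented purely for illustration. Multiplying every value by a constant multiplies the SD by that same constant, so the adjustment shrinks RA's spread by the factor 4.24/4.59, which is roughly 0.924.

```python
import statistics

TRUE_ERA = 4.24   # true ERA in the simulation (from the post)
MEAN_RA = 4.59    # average RA in the simulation (from the post)


def deflate_ra(ra_values, true_era=TRUE_ERA, mean_ra=MEAN_RA):
    """Rescale RA values to the ERA scale by the constant true_era / mean_ra."""
    factor = true_era / mean_ra
    return [ra * factor for ra in ra_values]


# Hypothetical simulated-season RA values, not taken from the post.
sample_ra = [3.90, 4.40, 4.60, 5.10, 4.95]
adjusted = deflate_ra(sample_ra)

print(round(TRUE_ERA / MEAN_RA, 3))   # 0.924
print(statistics.stdev(sample_ra), statistics.stdev(adjusted))
```

Only after this adjustment is it meaningful to put RA's SD next to ERA's, since both are then measured in runs per 9 IP on the ERA scale.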
In the ERA vs. RA comparison there are two competing forms of variability. With RA, variability increases due to the presence of errors which create random noise. However, as we've just shown, this is more than counteracted by the fact that the RA effectively has a larger sample size to work with than ERA, since it throws out no data.
To account for this, I reran the simulation, this time letting the error rate vary randomly to roughly match the distribution of error rates among major league teams. This change is likely to favor ERA over RA, since extra variability is now being added to RA while none is added to ERA. What were the results? In the end, the extra variability did very little to change them. The SD of ERA was .628 runs per 9 IP, while the SD of RA was .610 runs per 9 IP. The change in the SD of RA is negligible and not significant. Adding this slight amount of variability to the simulation does virtually nothing to change the argument of ERA vs. RA. This indicates to me that the people at Baseball Prospectus are indeed correct and that RA is a better measure of a pitcher's run prevention skills than ERA.
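For readers who want to experiment, here is a toy Monte Carlo sketch of this kind of comparison. It is not the simulation used for the numbers above: the batter model (every plate appearance is either a would-be out or a single run), the uniform error-rate range, the season length, and every parameter value are placeholder assumptions, and all function names are invented. It only shows the shape of the experiment: draw an error rate per season, simulate innings, then compare the SD of ERA with the SD of RA rescaled to the ERA scale.

```python
import random
import statistics


def simulate_inning(rng, p_out, error_rate):
    """Toy inning: return (runs, earned_runs)."""
    outs = 0            # outs actually recorded
    would_be_outs = 0   # outs if every error had been fielded cleanly
    runs = earned = 0
    while outs < 3:
        if rng.random() < p_out:
            would_be_outs += 1
            if rng.random() >= error_rate:
                outs += 1          # play made cleanly
            # otherwise an error: no out recorded, the inning continues
        else:
            runs += 1              # toy assumption: every non-out scores a run
            if would_be_outs < 3:
                earned += 1
    return runs, earned


def simulate_season(rng, innings, p_out, error_rate):
    """Return (RA, ERA) per 9 innings for one simulated season."""
    runs = earned = 0
    for _ in range(innings):
        r, e = simulate_inning(rng, p_out, error_rate)
        runs += r
        earned += e
    return 9 * runs / innings, 9 * earned / innings


def compare_spread(seasons=2000, innings=200, p_out=0.86,
                   err_lo=0.01, err_hi=0.03, seed=0):
    """Return (SD of ERA, SD of RA rescaled to the ERA scale)."""
    rng = random.Random(seed)
    ras, eras = [], []
    for _ in range(seasons):
        error_rate = rng.uniform(err_lo, err_hi)   # defense varies by season
        ra, era = simulate_season(rng, innings, p_out, error_rate)
        ras.append(ra)
        eras.append(era)
    # Rescale RA so its mean matches the mean ERA, mirroring 4.24/4.59.
    scale = statistics.mean(eras) / statistics.mean(ras)
    adjusted = [ra * scale for ra in ras]
    return statistics.stdev(eras), statistics.stdev(adjusted)


if __name__ == "__main__":
    sd_era, sd_adjusted_ra = compare_spread()
    print(f"SD of ERA: {sd_era:.3f}   SD of adjusted RA: {sd_adjusted_ra:.3f}")
```

Setting err_lo equal to err_hi gives the fixed-error-rate version of the experiment, and widening the gap between them adds the extra defensive variability discussed here. Because the batter model is so crude, the numbers it prints should not be expected to match the figures quoted in this post.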
As you can see, RA gets the bulk of the weight, but ERA is useful as well: about 90% of the weight should be given to RA and only 10% to ERA. The regression thus indicates again that RA is a better indicator than ERA. The standard error of the regression indicates that this combined average has a standard error of .604 points of ERA, down from .610 when using RA alone. Obviously this difference is very slight, so the combined measure is of questionable value, especially since, if you are going to use an advanced statistic, there are many other measures better than either RA or ERA.
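As a rough illustration of what the combined measure looks like, the sketch below applies the 90/10 split to a single pitcher. It assumes the weights apply to RA after it has been deflated to the ERA scale and that there is no intercept term; neither detail is spelled out here, and the function name and sample inputs are hypothetical.

```python
def combined_estimate(adjusted_ra, era, ra_weight=0.9):
    """Blend adjusted RA and ERA with weights ra_weight and 1 - ra_weight.

    Assumptions: adjusted_ra is already on the ERA scale, and the blend
    has no intercept. The 0.9 default reflects the roughly 90/10 split.
    """
    if not 0.0 <= ra_weight <= 1.0:
        raise ValueError("ra_weight must be between 0 and 1")
    return ra_weight * adjusted_ra + (1.0 - ra_weight) * era


# Hypothetical pitcher: adjusted RA of 4.00 and ERA of 5.00.
print(round(combined_estimate(4.00, 5.00), 2))   # 4.1
```

With so little weight on ERA, the blend moves only a tenth of the way from adjusted RA toward ERA, which fits the small gain in standard error.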
How about environments where the error rate varies widely between pitchers? The range of error rates among MLB teams is very narrow, and we saw that factoring it in had little effect. But what happens if we widen that range significantly, so that the error rate runs from 0% to 2%? In this situation ERA does indeed become a better statistic, with the standard deviations comparable and the importance of ERA nearly matching the importance of RA in a regression. Of course, this is not a situation that realistically occurs in baseball.