Runs Allowed vs. ERA
I picked up a copy of Baseball Between the Numbers, the Baseball Prospectus tome which debunks conventional baseball wisdom, and I couldn't help but be struck by the chapter in which it damned baseball's Earned Run Average as an archaic statistic. ERA has its problems, of course, but I was surprised that the book's prescription for ERA's shortcomings was to use the simpler Run Average (RA), which is calculated the same way as ERA except using total runs instead of earned runs. This idea was also championed by Michael Wolverton among others at BP. Disputing the notion was Kevin Shearer on Rob Neyer's blog site.
While both sides make plausible arguments, neither convinced me to my satisfaction, so I decided to examine things statistically via simulation. The key questions regarding ERA vs. RA concern the concepts of variance and bias. ERA was created for a reason - to remove the effects of defense from a pitcher's record. The creators knew that defense could be a source of bias in a pitcher's record - a defense which makes a lot of errors will artificially inflate a pitcher's runs allowed, while a good defense will artificially deflate a pitcher's runs allowed.
ERA does remove the bias - by reconstructing the inning without the error it indeed removes the effects of defense. Over the long run, ERA will be neither helped nor hindered by good or bad defense. While it does remove bias, the problem with ERA is that it is not very efficient in doing so. By removing the effect of defense, ERA also throws out a great deal of information - namely everything that happens after the third out of the inning should have been made. After a two-out error, it matters not whether the pitcher strikes out the next batter or gives up 5 runs; all of this information is lost by ERA. (The fact that ERA fails to capture additional outs as well as additional runs seems to be lost on those who simply claim ERA is too lenient and bails out pitchers with poor defenses.) Simple RA captures this information (and thus reduces variance by effectively increasing sample size), but of course is subject to bias due to good or bad defense.
I wrote a baseball simulation program which keeps track of a pitcher's runs allowed and earned runs. The program doesn't take into account different batters, relief pitchers, etc., but that's not the point here. In my simulation I ran 10,000 seasons of pitchers with 200 IP. Pitchers gave up 4.59 runs per 9 IP and 4.24 ER per 9 IP. The same pitchers pitching in an errorless environment also gave up 4.24 ER per 9 IP, which was no surprise - as I said earlier, ERA is an unbiased metric of a pitcher's run prevention skills.
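A minimal sketch of this kind of simulation, in Python, might look like the following. It is not the program described above: the function names, the home run and single/walk probabilities, and the earned-run bookkeeping are all assumptions made for illustration (only the .017 error rate is a figure from this article). The scoring rule is deliberately crude. An error replaces what should have been an out, a runner who reached on an error never counts as earned, a run that scores on the error play itself is unearned, and nothing is earned once an errorless defense would have recorded three outs.

```python
import random
import statistics

# Hypothetical per-PA probabilities; only the error rate is taken from the article.
P_ERROR = 0.017   # batter reaches on an error (the play should have been an out)
P_HOMER = 0.03    # assumption for illustration
P_SINGLE = 0.29   # assumption: single or walk, every runner moves up one base

def simulate_inning(rng, p_error):
    """Return (runs, earned_runs) for one inning under the simplified rules."""
    outs = 0          # outs actually recorded
    clean_outs = 0    # outs an errorless defense would have recorded
    runs = earned = 0
    # Slots are first, second, third. None = empty, True = runner who would be
    # earned if he scores, False = runner who reached on an error.
    bases = [None, None, None]
    while outs < 3:
        roll = rng.random()
        scored = []
        on_error = False
        if roll < p_error:
            on_error = True
            clean_outs += 1                  # this should have been an out
            scored = [bases[2]]
            bases = [False, bases[0], bases[1]]
        elif roll < p_error + P_HOMER:
            scored = bases + [True]          # everyone scores, batter included
            bases = [None, None, None]
        elif roll < p_error + P_HOMER + P_SINGLE:
            scored = [bases[2]]
            bases = [True, bases[0], bases[1]]
        else:
            outs += 1
            clean_outs += 1
        for runner in scored:
            if runner is None:
                continue
            runs += 1
            # Earned only if the runner reached cleanly, the run did not score
            # on an error play, and the inning "should" still be alive.
            if runner and not on_error and clean_outs < 3:
                earned += 1
    return runs, earned

def simulate_season(rng, innings=200, p_error=P_ERROR):
    """Return (RA, ERA), both per 9 innings, for one simulated season."""
    runs = earned = 0
    for _ in range(innings):
        r, e = simulate_inning(rng, p_error)
        runs += r
        earned += e
    return 9 * runs / innings, 9 * earned / innings

rng = random.Random(42)
seasons = [simulate_season(rng) for _ in range(500)]
ra = [s[0] for s in seasons]
era = [s[1] for s in seasons]
print(f"mean RA {statistics.mean(ra):.2f}, mean ERA {statistics.mean(era):.2f}")
print(f"SD RA {statistics.stdev(ra):.3f}, SD ERA {statistics.stdev(era):.3f}")
```

Because the run environment here is invented, the printed averages should not be expected to match the 4.59 and 4.24 figures quoted above. The sketch is only meant to show where ERA stops recording information within an inning while RA keeps counting.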
What's more interesting is to look at the standard deviation of these numbers. The better statistic will have a smaller SD. Of course, comparing the raw numbers is not quite fair: RA is inflated by the errors, so its SD will be larger than the SD of ERA simply because it sits on a larger scale. We can deal with this by deflating the RA back to an ERA scale by multiplying RA by a factor of 4.24/4.59. Now that we have this fair comparison we can take a look at the SD of these averages. The statistic which has the smaller SD will more closely adhere to the true ERA of 4.24, and thus be a more precise statistic. The results? The SD of ERA was .625 runs per 9 IP. The SD of the adjusted RA was slightly smaller at .608 runs per 9 IP. This indicates that indeed RA is superior in this situation - RA was more closely clustered around the true value of 4.24, whereas ERA was slightly more spread out.
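The rescaling step can be sketched in a few lines of Python. The function name compare_spread and the six seasons of toy numbers are hypothetical, included only to show the arithmetic; they are not output from the simulation. With the averages reported above, the factor would be 4.24/4.59, or roughly 0.924.

```python
import statistics

def compare_spread(ra, era):
    """Put RA on the ERA scale, then return (sd_era, sd_adjusted_ra, factor)."""
    # Ratio of the two long-run averages, e.g. 4.24 / 4.59 in the article.
    factor = statistics.mean(era) / statistics.mean(ra)
    adjusted_ra = [factor * x for x in ra]
    return statistics.stdev(era), statistics.stdev(adjusted_ra), factor

# Toy season-level values, for illustration only (not simulation results).
ra = [4.1, 5.3, 4.6, 3.9, 5.0, 4.7]
era = [3.8, 4.9, 4.1, 3.7, 4.8, 4.2]

sd_era, sd_ra_adj, factor = compare_spread(ra, era)
print(f"factor {factor:.3f}, SD of ERA {sd_era:.3f}, SD of adjusted RA {sd_ra_adj:.3f}")
```

Since the adjustment is a single multiplicative constant, it shrinks the SD of RA by exactly that factor and leaves the ranking of pitchers by RA unchanged.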
In the ERA vs. RA comparison there are two competing forms of variability. With RA, variability increases due to the presence of errors which create random noise. However, as we've just shown, this is more than counteracted by the fact that the RA effectively has a larger sample size to work with than ERA, since it throws out no data.
Of course, this assumes that the defense's rate of error is known and RA can be adjusted perfectly to account for unearned runs. In real life, the defensive liability due to errors is largely unknown. We know that teams make roughly .017 errors per PA, but this can vary by team and other factors.
To account for this, I reran the simulation, this time making the error rate vary randomly in order to match the rough error rate distribution among major league teams. This change is likely to favor ERA over RA, since extra variability is now added to RA, while no extra variability is added to ERA. What were the results? In the end, this extra variability did very little to change the results. The SD of ERA was .628 runs per 9 IP while the SD of RA was .610 runs per 9 IP. The change in the SD of RA is almost negligible and is not significant. The end result is that adding this slight amount of variability into the simulation does virtually nothing to change the argument of ERA vs. RA. This indicates to me that indeed the people at Baseball Prospectus are correct and that RA is a better measure of a pitcher's run prevention skills than ERA.
A Combined Measure
But should ERA be thrown out altogether, as BP suggests? Using the simulation data, I ran a regression with a fixed 4.24 ERA as the dependent variable and the pitchers' RA and ERA as independent variables to see the relative weights it would assign ERA and RA. Running the data through the regression (with no intercept), we get the following.
As you can see, RA gets the bulk of the weight, but ERA has usefulness as well. Nevertheless, the regression indicates again that RA is a better indicator than ERA. About 90% of the weight should be given to RA and only 10% should be given to ERA. The standard error on the regression indicates that the standard error of this combined average is .604 points of ERA - down from .610 when using RA alone. However, obviously this distinction is very slight, so using this combined measure is of questionable value, especially since if you are going to go with an advanced statistic, there are many other measures better than either RA or ERA.
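For anyone who wants to try this kind of weighting on their own data, a two-variable regression with no intercept can be solved directly from the normal equations. The sketch below is an assumption about how such a fit could be set up, not the software used for the results above. The name fit_no_intercept is hypothetical, the input is assumed to be RA already rescaled to the ERA scale, and the toy data are invented stand-ins, so the printed weights will not reproduce the roughly 90/10 split.

```python
import random

def fit_no_intercept(x1, x2, y):
    """Least-squares weights (b1, b2) for y ~ b1*x1 + b2*x2, with no intercept."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * t for a, t in zip(x1, y))
    t2 = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("regressors are perfectly collinear")
    # Cramer's rule on the 2x2 normal equations.
    b1 = (t1 * s22 - t2 * s12) / det
    b2 = (t2 * s11 - t1 * s12) / det
    return b1, b2

# Invented stand-in data: both statistics share most of their noise (the same
# pitcher, the same season), plus a little noise of their own.
TRUE_ERA = 4.24
rng = random.Random(0)
adjusted_ra, era = [], []
for _ in range(2000):
    shared = rng.gauss(0, 0.55)
    adjusted_ra.append(TRUE_ERA + shared + rng.gauss(0, 0.2))
    era.append(TRUE_ERA + shared + rng.gauss(0, 0.3))

target = [TRUE_ERA] * len(era)   # the fixed true value is the dependent variable
w_ra, w_era = fit_no_intercept(adjusted_ra, era, target)
print(f"weight on RA {w_ra:.2f}, weight on ERA {w_era:.2f}")
```

The fit gives more weight to whichever input carries less noise of its own, which is the sense in which the regression above favors RA.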
So, we have shown that RA is a better statistic than ERA in the current MLB environment, but what about other environments? When the propensity for errors is much higher, does ERA become more meaningful? In fact, the opposite is true. With more errors, there will be more plays that ERA throws out. When a lot of errors are made, innings run longer, which gives us as observers a larger sample of the pitcher's work. When the % of PA reached on error is jacked up to 8%, the SD of RA actually decreases strongly from .608 to .533, while the SD of ERA remains the same. This makes RA even more superior in environments where many errors are made.
How about environments where the variability of error rate is high between pitchers? The variability in error rate among MLB teams is very narrow and we saw that factoring this in had little effect. However, if we increase this variation in error rate significantly to range from 0% to 2%, what happens? In this situation, ERA does indeed fare better, with the standard deviations comparable and the importance of ERA nearly matching the importance of RA in a regression. Of course, this is not a realistic situation which occurs in baseball.
In conclusion, I hope I have advanced the debate between ERA and RA. The simulation approach is an advantage because it creates a controlled environment. While ERA does have some usefulness, its unbiased nature is more than offset by its problem of throwing out too much information. It's neither too lenient nor too harsh on the pitcher, but is simply inefficient. Of course, some pitchers have certain characteristics which make unearned runs systematically more likely (such as extreme groundball pitchers and knuckleballers), which only RA can capture. The simulation didn't take these types of pitchers into account, but this is yet another reason to use RA over ERA. While I first viewed the use of RA over ERA with skepticism, this simulation convinced me of its merits as the better of two imperfect statistics.