Runs Allowed vs. ERA
I picked up a copy of Baseball Between the Numbers, the Baseball Prospectus tome that debunks conventional baseball wisdom, and I couldn't help but be struck by the chapter in which it damned baseball's Earned Run Average as an archaic statistic. While ERA of course has its problems, I was surprised that the book's prescription for ERA's shortcomings was to use the simpler Run Average (RA), which is calculated the same way as ERA except using total runs instead of earned runs. This idea was also championed by Michael Wolverton among others at BP. Disputing the notion was Kevin Shearer on Rob Neyer's blog site. While there are convincing arguments on both sides, neither settled the question to my satisfaction, so I decided to examine things statistically via simulation.

The key questions regarding ERA vs. RA concern the concepts of variance and bias. ERA was created for a reason: to remove the effects of defense from a pitcher's record. Its creators knew that defense could be a source of bias in a pitcher's record. A defense which makes a lot of errors will artificially inflate a pitcher's runs allowed, while a good defense will artificially deflate them. ERA does remove the bias: by reconstructing the inning without the error, it indeed removes the effects of defense. Over the long run, ERA will be neither helped nor hindered by good or bad defense.

While it does remove bias, the problem with ERA is that it is not very efficient in doing so. By removing the effect of defense, ERA also throws out a great deal of information, namely everything that happens after the third out of the inning should have been made. After a two-out error, it matters not whether the pitcher strikes out the next batter or gives up 5 runs; all of this information is lost by ERA. (The fact that ERA fails to capture additional outs as well as additional runs seems to be lost on those who simply claim ERA is too lenient and bails out pitchers with poor defenses.)
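To make the information loss concrete, here is a minimal sketch of the reconstruction logic. The event model (`'out'`, `'error'`, `'run'`) is my own simplification for illustration, not the official scoring procedure:

```python
def earned_runs(inning_events):
    """Replay an inning, treating each error as the out it should have been.

    Events: 'out', 'error' (a play that should have been an out), or 'run'.
    Once the reconstructed out count reaches 3, every later event -- outs,
    strikeouts, and runs alike -- is discarded. That is exactly the
    information loss described above.
    """
    reconstructed_outs = 0
    earned = 0
    for event in inning_events:
        if reconstructed_outs >= 3:   # the inning "should" already be over
            break                     # everything after this point is lost
        if event in ('out', 'error'):
            reconstructed_outs += 1   # an error counts as the out it should have been
        elif event == 'run':
            earned += 1
    return earned

# Two-out error, then 5 runs score: ERA records none of them as earned,
# and the pitcher's subsequent work is invisible.
print(earned_runs(['out', 'out', 'error', 'run', 'run', 'run', 'run', 'run', 'out']))  # 0
```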
Simple RA captures this information (and thus reduces variance by effectively increasing sample size), but of course is subject to bias due to good or bad defense.

Simulation 1

What's more interesting is to look at the standard deviation of these numbers. The better statistic will have a smaller SD. Of course, this is not quite a fair comparison, because the SD of RA will be larger than the SD of ERA simply because RA is inflated by the errors. We can deal with this by deflating RA back to an ERA scale, multiplying RA by a factor of 4.24/4.59. With this fair comparison in hand, we can look at the SD of these averages. The statistic with the smaller SD will more closely adhere to the true ERA of 4.24, and thus be the more precise statistic. The results? The SD of ERA was .625 runs per 9 IP. The SD of the adjusted RA was slightly smaller at .608 runs per 9 IP. This indicates that RA is indeed superior in this situation: RA was more closely clustered around the true value of 4.24, whereas ERA was slightly more spread out.

In the ERA vs. RA comparison there are two competing forms of variability. With RA, variability increases due to the presence of errors, which create random noise. However, as we've just shown, this is more than counteracted by the fact that RA effectively has a larger sample size to work with than ERA, since it throws out no data.

Simulation 2

To account for this, I reran the simulation, this time letting the error rate vary randomly to match the rough distribution of error rates among major league teams. This change is likely to favor ERA over RA, since extra variability will now be added to RA while none is added to ERA. What were the results? In the end, this extra variability did very little to change things. The SD of ERA was .628 runs per 9 IP, while the SD of RA was .610 runs per 9 IP. The change in the SD of RA is almost negligible and is not significant.
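The flavor of these simulations can be sketched as follows. The event model and the rates (`p_error`, `p_run`, 200 innings) are my own illustrative assumptions; the article's actual simulation parameters are not reproduced here:

```python
import random

def simulate_season(n_innings=200, p_error=0.015, p_run=0.10, seed=None):
    """Return (ERA, RA) per 9 innings for one simulated pitcher-season."""
    rng = random.Random(seed)
    runs = earned = 0
    for _ in range(n_innings):
        outs = 0                 # real outs recorded (errors extend the inning)
        reconstructed_outs = 0   # outs if every error had been converted
        while outs < 3:
            r = rng.random()
            if r < p_error:                  # error: should have been an out
                reconstructed_outs += 1
            elif r < p_error + p_run:        # a run scores
                runs += 1
                if reconstructed_outs < 3:   # inning "should" still be live
                    earned += 1
            else:                            # a clean out
                outs += 1
                reconstructed_outs += 1
    return 9 * earned / n_innings, 9 * runs / n_innings

seasons = [simulate_season(seed=i) for i in range(1000)]
eras = [e for e, _ in seasons]
ras = [r for _, r in seasons]

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Deflate RA to the ERA scale (analogous to the 4.24/4.59 factor above)
# so the two spreads are comparable, then compare standard deviations.
deflate = (sum(eras) / len(eras)) / (sum(ras) / len(ras))
print("SD of ERA:        ", round(sd(eras), 3))
print("SD of deflated RA:", round(sd([r * deflate for r in ras]), 3))
```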
The end result is that adding this slight amount of variability into the simulation does virtually nothing to change the ERA vs. RA argument. This indicates to me that the people at Baseball Prospectus are indeed correct and that RA is a better measure of a pitcher's run prevention skills than ERA. In the regression, RA gets the bulk of the weight, but ERA has some usefulness as well. Nevertheless, the regression indicates again that RA is the better indicator: about 90% of the weight should be given to RA and only 10% to ERA. The regression indicates that the standard error of this combined average is .604 points of ERA, down from .610 when using RA alone. Obviously, though, this distinction is very slight, so the combined measure is of questionable value, especially since, if you are going to use an advanced statistic, there are many measures better than either RA or ERA.

Extreme Situations

How about environments where the variability of error rate between pitchers is high? The variability in error rate among MLB teams is quite narrow, and we saw that factoring it in had little effect. However, what happens if we increase this variation significantly, to range from 0% to 2%? In this situation ERA does indeed become a better statistic, with the standard deviations comparable and the importance of ERA nearly matching that of RA in a regression. Of course, this is not a realistic situation in actual baseball.

Conclusion
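The 90/10 blend described above can be sketched directly; the weights and the 4.24/4.59 deflation factor are taken from the text, while the function name and example inputs are my own:

```python
def blended_estimate(era, ra, deflate=4.24 / 4.59, w_ra=0.90):
    """Combine ERA with defense-deflated RA using the ~90/10 regression weights."""
    return w_ra * (ra * deflate) + (1 - w_ra) * era

# Hypothetical pitcher with a 4.00 ERA and a 4.50 RA:
print(round(blended_estimate(era=4.00, ra=4.50), 3))  # 4.141
```

As the text notes, the gain over plain deflated RA (.604 vs. .610 standard error) is marginal, so this blend is mostly of academic interest.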
Comments
I've discussed this on my blog several times. The two important things are:
1. as you noted: completely discarding everything after the third out in the reconstruction; we count it in his K, BB, HR, H totals, but not his ER... why not EHR, EK, EBB, EH then?
2. Look at the number of career unearned runs by Santana (FB pitcher) and Webb (GB pitcher). Or FB pitchers as a group compared to GB pitchers. More errors are made on GB than FB, so of course the GB pitchers will have more UER, even if they have a fantastic defense.
3. Ugh, errors as a way to measure defense. How very 19th century.
Posted by: TangoTiger at June 1, 2009 8:41 AM
Click on my name. My comments start at post #6.
Posted by: TangoTiger at June 1, 2009 8:43 AM
I agree that at the professional level, RA is probably the better number. But I think when looking at amateur pitchers, where the routine play isn't made as consistently, ERA is better, due to the high variability of the defensive talent backing one pitcher versus the next.
Posted by: billy at June 1, 2009 8:48 AM
Further expanding on my point... For example, high school ball. Yes, it is a "high error environment", but oftentimes it is just one of the two teams creating that "environment". The ability level from player to player, team to team, is so uneven. Hence ERA being of more importance, not less, as you suggested. Perhaps if you wanted to compare two pitchers from the same team, you could use RA.
Posted by: billy at June 1, 2009 8:56 AM
"What's more interesting is to look at the standard deviation of these numbers. The better statistic will have a smaller SD."
This is true if neither statistic is biased. However, RA may be biased by defense, which would suggest (Root) Mean Square Error as the appropriate comparison.
However, the posed question is ill-defined: What qualities are you trying to measure when using ERA or RA?
Posted by: Alex at June 1, 2009 9:27 AM
Tango,
Looks like a lot of great discussion on your site - I hope the article provides a bit more statistical rigor that I haven't seen in the debate so far. The GB/FB pitcher bias is another point in favor of RA, but even WITHOUT that, the simulation shows RA is superior. I was surprised at how strongly it showed that.
Billy,
As you mention, if the error rate widely varies from team to team, ERA becomes more important. But from the simulation, it has to vary quite a lot before ERA approaches RA in importance.
Posted by: Sky Andrecheck at June 1, 2009 9:37 AM
Alex,
In the simulation I control exactly how much defense is causing the unearned runs, so neither statistic is biased. The RMSE and SD are identical in this case.
Posted by: Sky Andrecheck at June 1, 2009 9:42 AM
I came across this game the other day -- the Astros gave up 16 unearned runs vs the Mets.
http://www.baseball-reference.com/boxes/NYN/NYN198507271.shtml
It's funny that allowing 15 hits, 3 walks, 5 doubles and a home run leads to 0 earned runs.
Posted by: Dackle at June 1, 2009 10:39 AM
My reading comprehension skills are apparently horrible. But the question still remains: what skill, attribute, or quality of a player are you trying to condense into a single number?
If you're interested in removing luck and defense from the measure of that quality, why wouldn't we just skip to (x)FIP?
On an unrelated point, the difference in SDs in simulations 1 and 2 is fairly small, on the order of 3%. So should we care?
Posted by: Alex at June 1, 2009 12:17 PM
Sorry I didn't answer your second question Alex. The measure of interest is the "true" number of runs per 9 IP allowed by a pitcher given errorless defense.
Yeah, you could bypass all this and talk about FIP stats, but I wanted to compare ERA and RA, which is an oft debated subject.
You're right that the difference between simulations 1 and 2 is very slight. The point was that even when teams have different levels of fielding ability it doesn't change the results much at all - RA still beats ERA in precision.
Posted by: Sky Andrecheck at June 1, 2009 12:49 PM
I am fairly new to this new-age number thinking, but am fascinated by it and by using the numbers as a tool. I did read an article recently comparing Jason Bay and Manny Ramirez defensively. The author of the article stated that Manny's defensive metric showed he was a superior leftfielder to Bay. The writer claimed he got this information from Baseball Prospectus. As a Red Sox fan since 1975, I can say with near certainty that Bay is a vastly superior leftfielder to Manny. Thoughts?
Posted by: Topher D. at June 2, 2009 10:37 PM
Once I accepted that fielding percentage is a really bad way to evaluate defense, I realized ERA was flawed. FIP, or something like it, is better, because defense is not just about making or not making errors (even leaving aside the importance of range, how many crazy scoring decisions does one need to see to realize how inconsistent/biased/justplainbad MLB official scoring is?).
Topher - I saw that Bay was rating out as awful by UZR. I watch a fair amount of Sox games, since my wife is a fan, and while I wouldn't call him a defensive wizard, the numbers don't match what my eyes tell me. It could be a statistical fluke that will even out with time, or it could be our lying eyes!
Posted by: Rob in CT at June 3, 2009 11:42 AM