Behind the Scoreboard June 01, 2009
Runs Allowed vs. ERA

I picked up a copy of Baseball Between the Numbers, the Baseball Prospectus tome which debunks conventional baseball wisdom, and I couldn't help but be struck by the chapter in which it damned baseball's Earned Run Average as an archaic statistic. While ERA of course has its problems, I was surprised that the book's prescription for ERA's shortcomings was to use the simpler Run Average (RA), which is calculated the same way as ERA except using total runs instead of earned runs. This idea was also championed by Michael Wolverton among others at BP. Disputing the notion was Kevin Shearer on Rob Neyer's blog site.

While there are good arguments on both sides, neither side convinced me to my satisfaction. I decided to examine things statistically via simulation. The key questions regarding ERA vs. RA concern the concepts of variance and bias. ERA was created for a reason - to remove the effects of defense from a pitcher's record. The creators knew that defense could be a source of bias in a pitcher's record - a defense which makes a lot of errors will artificially inflate a pitcher's runs allowed, while a good defense will artificially deflate a pitcher's runs allowed.

ERA does remove the bias - by reconstructing the inning without the error it indeed removes the effects of defense. Over the long run, ERA will be neither helped nor hindered by good or bad defense. While it does remove bias, the problem with ERA is that it is not very efficient in doing so. By removing the effect of defense, ERA also throws out a great deal of information - namely everything that happens after the third out of the inning should have been made. After a two-out error, it matters not whether the pitcher strikes out the next batter or gives up 5 runs; all of this information is lost by ERA. (The fact that ERA fails to capture additional outs as well as additional runs seems to be lost on those who simply claim ERA is too lenient and bails out pitchers with poor defenses.) Simple RA captures this information (and thus reduces variance by effectively increasing sample size), but of course is subject to bias due to good or bad defense.
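To make the "lost information" point concrete, here is a minimal sketch of how a half-inning might be scored for both runs and earned runs. It is an illustration only, not the official scoring rules and not the program used for the simulations in this post: the function name score_inning, the event codes, and the station-to-station baserunning (every runner moves up exactly one base on a single or an error) are all simplifying assumptions.

```python
def score_inning(events):
    """Return (runs, earned_runs) for one half-inning.

    Simplified, hypothetical model: events are "K" (an out), "1B" (single),
    "HR" (home run) and "E" (batter reaches on an error that should have
    been an out). Runners advance exactly one base on a single or an error.
    """
    bases = []            # runners, newest first; True = reached without an error
    outs = 0              # outs actually recorded
    would_be_outs = 0     # outs if every error had been converted
    runs = earned = 0

    def score(clean):
        nonlocal runs, earned
        runs += 1
        # A run is earned only if the runner reached cleanly AND the
        # inning would still be alive had no errors been made.
        if clean and would_be_outs < 3:
            earned += 1

    for ev in events:
        if outs == 3:
            break
        if ev == "K":
            outs += 1
            would_be_outs += 1
        elif ev in ("1B", "E"):
            if ev == "E":
                would_be_outs += 1
            bases.insert(0, ev == "1B")
            if len(bases) > 3:          # runner pushed home from third
                score(bases.pop())
        elif ev == "HR":
            for clean in bases:
                score(clean)
            score(True)                 # the batter himself
            bases = []
        else:
            raise ValueError(f"unknown event: {ev}")
    return runs, earned


# Two-out error, then the pitcher gets out of it: no runs at all.
print(score_inning(["K", "K", "E", "K"]))
# Two-out error, then the pitcher is hit hard: four runs, none earned.
print(score_inning(["K", "K", "E", "HR", "1B", "HR", "K"]))
```

Under these assumptions the two innings look identical to ERA (one inning, zero earned runs) even though the pitcher performed very differently after the error, while RA sees zero runs in one and four in the other.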

Simulation 1
I wrote a baseball simulation program, which keeps track of a pitcher's runs allowed and earned runs. The program doesn't take into account different batters, relief pitchers, etc., but that's not the point here. In my simulation I ran 10,000 seasons of pitchers with 200 IP. Pitchers gave up 4.59 runs per 9 IP and 4.24 ER per 9 IP. The same pitchers pitching in an errorless environment also gave up 4.24 ER per 9 IP, which was no surprise - as I said earlier, ERA is an unbiased metric of a pitcher's run prevention skills.

What's more interesting is to look at the standard deviation of these numbers. The better statistic will have a smaller SD. Of course, this is not quite a fair comparison because the SD of RA will be larger than the SD of ERA because RA is inflated by the errors. We can deal with this by deflating the RA back to an ERA scale by multiplying RA by a factor of 4.24/4.59. Now that we have this fair comparison we can take a look at the SD of these averages. The statistic which has the smaller SD will more closely adhere to the true ERA of 4.24, and thus be a more precise statistic. The results? The SD of ERA was .625 runs per 9 IP. The SD of the adjusted RA was slightly smaller at .608 runs per 9 IP. This indicates that indeed RA is superior in this situation - RA was more closely clustered around the true value of 4.24, whereas ERA was slightly more spread out.
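The comparison described above can be sketched in a few lines of Python. This is not the program behind the numbers in this post; it is a toy version under stated assumptions. The single and home run probabilities are made up, baserunning is station-to-station, and the names simulate_season and compare are hypothetical. Only the .017 errors per PA figure and the 200 IP season length come from this post, so the output should not be expected to match the .625 and .608 figures.

```python
import random
import statistics

# Per-plate-appearance probabilities. P_SINGLE and P_HR are hypothetical;
# everything that is not a single, home run or error is an out.
P_HR, P_SINGLE, P_ERROR = 0.03, 0.26, 0.017


def simulate_season(rng, innings=200):
    """Return (RA, ERA), both per 9 IP, for one simulated season."""
    runs = earned = 0
    for _ in range(innings):
        bases = []                    # newest runner first; True = reached cleanly
        outs = would_be_outs = 0      # would_be_outs counts errors as outs
        while outs < 3:
            u = rng.random()
            scorers = []
            if u < P_HR:
                scorers, bases = bases + [True], []
            elif u < P_HR + P_SINGLE + P_ERROR:
                on_error = u >= P_HR + P_SINGLE
                would_be_outs += int(on_error)
                bases = [not on_error] + bases
                scorers, bases = bases[3:], bases[:3]
            else:
                outs += 1
                would_be_outs += 1
            runs += len(scorers)
            if would_be_outs < 3:     # inning still alive without the errors
                earned += sum(scorers)
    return 9 * runs / innings, 9 * earned / innings


def compare(n_seasons=1000, seed=0):
    """Return (SD of ERA, SD of RA rescaled to the ERA scale)."""
    rng = random.Random(seed)
    seasons = [simulate_season(rng) for _ in range(n_seasons)]
    ra = [s[0] for s in seasons]
    era = [s[1] for s in seasons]
    # Deflate RA to the ERA scale, the analogue of multiplying by 4.24/4.59.
    scale = statistics.mean(era) / statistics.mean(ra)
    adj_ra = [scale * x for x in ra]
    return statistics.stdev(era), statistics.stdev(adj_ra)


if __name__ == "__main__":
    sd_era, sd_adj_ra = compare()
    print(f"SD of ERA: {sd_era:.3f}   SD of adjusted RA: {sd_adj_ra:.3f}")
```

Because the same true talent level underlies every simulated season, whichever statistic has the smaller spread is the more precise estimate of that talent, which is the comparison made in the text.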

In the ERA vs. RA comparison there are two competing forms of variability. With RA, variability increases due to the presence of errors which create random noise. However, as we've just shown, this is more than counteracted by the fact that the RA effectively has a larger sample size to work with than ERA, since it throws out no data.

Simulation 2
Of course, this assumes that the defense's rate of error is known and RA can be adjusted perfectly to account for unearned runs. In real life, the defensive liability due to errors is largely unknown. We know that teams make roughly .017 errors per PA, but this can vary by team and other factors.

To account for this, I reran the simulation, this time making the error rate vary randomly in order to match the rough error rate distribution among major league teams. This change is likely to favor ERA over RA, since extra variability is now added to RA, while no extra variability is added to ERA. What were the results? In the end, this extra variability did very little to change the results. The SD of ERA was .628 runs per 9 IP while the SD of RA was .610 runs per 9 IP. The change in the SD of RA is almost negligible and is not significant. The end result is that adding this slight amount of variability into the simulation does virtually nothing to change the argument of ERA vs. RA. This indicates to me that indeed the people at Baseball Prospectus are correct and that RA is a better measure of a pitcher's run prevention skills than ERA.

A Combined Measure

But should ERA be thrown out altogether, as BP suggests? Using the simulation data, I ran a regression with a fixed 4.24 ERA as the dependent variable and the pitchers' RA and ERA as independent variables to see the relative weights it would assign to each. Running the data through the regression (with no intercept), we get the following.

As you can see, RA gets the bulk of the weight, but ERA has usefulness as well. Nevertheless, the regression indicates again that RA is a better indicator than ERA. About 90% of the weight should be given to RA and only 10% should be given to ERA. The standard error on the regression indicates that the standard error of this combined average is .604 points of ERA - down from .610 when using RA alone. However, obviously this distinction is very slight, so using this combined measure is of questionable value, especially since if you are going to go with an advanced statistic, there are many other measures better than either RA or ERA.
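For readers who want to try this kind of regression themselves, here is a minimal sketch of a two-variable least-squares fit with no intercept, solved directly from the normal equations. The function name no_intercept_weights is hypothetical, and the demo data at the bottom is synthetic noise, not the simulation output used in this post, so it will not reproduce the 90/10 split described above.

```python
import random


def no_intercept_weights(ra, era, target):
    """Least-squares weights (w_ra, w_era) with no intercept.

    Minimises sum((target - w_ra * ra_i - w_era * era_i) ** 2), where
    target is the fixed true value (4.24 in this post).
    """
    s_rr = sum(r * r for r in ra)
    s_re = sum(r * e for r, e in zip(ra, era))
    s_ee = sum(e * e for e in era)
    s_ry = target * sum(ra)
    s_ey = target * sum(era)
    det = s_rr * s_ee - s_re * s_re   # zero only if ra and era are collinear
    w_ra = (s_ee * s_ry - s_re * s_ey) / det
    w_era = (s_rr * s_ey - s_re * s_ry) / det
    return w_ra, w_era


if __name__ == "__main__":
    # Synthetic stand-in data (an assumption for illustration): independent
    # noise around 4.24. Real simulated RA and ERA would be correlated.
    rng = random.Random(0)
    era = [rng.gauss(4.24, 0.63) for _ in range(10_000)]
    ra = [rng.gauss(4.24, 0.61) for _ in range(10_000)]
    print(no_intercept_weights(ra, era, 4.24))
```

The two weights returned say how much of each statistic to blend when estimating the true value; a statistic with less noise around that value receives the larger weight.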

Extreme Situations
So, we have shown that RA is a better statistic than ERA in the current MLB environment, but what about other environments? When the propensity for errors is much higher, does ERA become more meaningful? In fact, the opposite is true. With more errors, there will be more plays that ERA throws out. When a lot of errors are made, RA actually gives us as observers a larger sample with which to evaluate a pitcher. When the % of PA reached on error is jacked up to 8%, the SD of RA actually decreases strongly from .608 to .533, while the SD of ERA remains the same. This makes RA's advantage even larger in environments where many errors are made.

How about environments where the variability of error rate is high between pitchers? The variability in error rate among MLB teams is very narrow and we saw that factoring this in had little effect. However, if we increase this variation in error rate significantly to range from 0% to 2%, what happens? In this situation, indeed ERA does become a better statistic, with the standard deviations comparable and the importance of ERA nearly matching the importance of RA in a regression. Of course, this is not a realistic situation which occurs in baseball.

Conclusion
In conclusion, I hope I have advanced the debate between ERA and RA. The simulation approach is an advantage because it creates a controlled environment. While ERA does have some usefulness, its unbiased nature is more than offset by its problem of throwing out too much information. It's neither too lenient nor too harsh on the pitcher, but is simply inefficient. Of course, some pitchers have certain characteristics which make unearned runs systematically more likely (such as extreme groundball pitchers and knuckleballers), which only RA can capture. The simulation didn't take these types of pitchers into account, but this is yet another reason to use RA over ERA. While I first viewed the use of RA over ERA with skepticism, this simulation convinced me of its merits as the better of two imperfect statistics.

I've discussed this on my blog several times. The two important things are:
1. as you noted: completely discarding everything after the third out in the reconstruction; we count it in his K, BB, HR, H totals, but not his ER... why not EHR, EK, EBB, EH then?

2. Look at the number of career unearned runs by Santana (FB pitcher) and Webb (GB pitcher). Or FB pitchers as a group compared to GB pitchers. More errors are made on GB than FB, so of course the GB pitchers will have more UER, even if they have a fantastic defense.

3. Ugh, errors as a way to measure defense. How very 19th century.

Click on my name. My comments start at post #6.

I agree at the professional level, RA is probably the better number. But I think when looking at amateur pitchers, where the routine play isn't made as consistently, ERA is better, due to the high variability of defensive talent backing one pitcher versus the next.

Further expanding on my point.... For example, high school ball. Yes, it is a "high error environment", but oftentimes it is just one of the two teams creating that "environment". The ability level from player to player, team to team, is so uneven. Hence ERA being of more importance, not less, as you suggested. Perhaps if you wanted to compare two pitchers from the same team, you could use RA.

"What's more interesting is to look at the standard deviation of these numbers. The better statistic will have a smaller SD."

This is true, if neither statistic is biased. However, RA may be biased by defense. This would suggest looking at (Root) Mean Square Error as the appropriate comparison.

However, the posed question is ill-defined: What qualities are you trying to measure when using ERA or RA?

Tango,
Looks like a lot of great discussion on your site - I hope the article provides a bit more statistical rigor that I haven't seen in the debate so far. The GB/FB pitcher bias is another point in favor of RA, but even WITHOUT that, the simulation shows RA is superior. I was surprised at how strongly it showed that.

Billy,
As you mention, if the error rate widely varies from team to team, ERA becomes more important. But from the simulation, it has to vary quite a lot before ERA approaches RA in importance.

Alex,
In the simulation I control exactly how much defense is causing the unearned runs, so neither statistic is biased. The RMSE and SD are identical in this case.

I came across this game the other day -- the Astros gave up 16 unearned runs vs the Mets.

http://www.baseball-reference.com/boxes/NYN/NYN198507271.shtml

It's funny that allowing 15 hits, 3 walks, 5 doubles and a home run leads to 0 earned runs.

My reading comprehension skills are apparently horrible. But the question still remains: what skill, attribute, or quality of a player are you trying to condense into a single number?

If you're interested in removing luck and defense from the measure of that quality, why wouldn't we just skip to (x)FIP?

On an unrelated point, the difference in SDs in simulations 1 and 2 is fairly small, on the order of 3%. So should we care?

Sorry I didn't answer your second question Alex. The measure of interest is the "true" number of runs per 9 IP allowed by a pitcher given errorless defense.

Yeah, you could bypass all this and talk about FIP stats, but I wanted to compare ERA and RA, which is an oft debated subject.

You're right that the difference between simulations 1 and 2 is very slight. The point was that even when teams have different levels of fielding ability it doesn't change the results much at all - RA still beats ERA in precision.

I am fairly new to this new age number thinking, but am fascinated by it and by using the numbers as a tool. I did read an article recently comparing Jason Bay and Manny Ramirez defensively. The author of the article stated that Manny's defensive matrix or metric showed he was a superior leftfielder to Bay. The writer (claimed) he got this information from Baseball Prospectus. As a Red Sox fan since 1975, I can say with near certainty that Bay is a vastly superior leftfielder to Manny. Thoughts?

Once I accepted that fielding percentage is a really bad way to evaluate defense, I realized ERA was flawed. FIP, or something like it, is better, because defense is not just about making or not making errors (even leaving aside the importance of range, how many crazy scoring decisions does one need to see to realize how inconsistent/biased/justplainbad MLB official scoring is?).

Topher - I saw that Bay was rating out as awful by UZR. I watch a fair amount of Sox games, since my wife is a fan, and while I wouldn't call him a defensive wizard, the numbers don't match what my eyes tell me. It could be a statistical fluke that will even out with time, or it could be our lying eyes!