How Best to Measure a Team's True Talent
One of the first sabermetric principles that many people learn is that a team's winning percentage can be predicted from the number of runs it scores and allows. This Pythagorean winning percentage takes the following form: WPCT = RS^1.81/(RS^1.81 + RA^1.81). It was introduced by Bill James, is purported to detect whether a team is underperforming or playing over its head, and is billed as a better guide to a team's true talent. This concept has reached so far into the mainstream that it is even included in the MLB.com standings.
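As a quick illustration, the formula above can be written as a small Python function. The function name and the 750/700 run totals are made up for the example; only the 1.81 exponent comes from the formula itself.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=1.81):
    """Pythagorean winning percentage: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical team that scores 750 runs and allows 700.
example = pythagorean_wpct(750, 700)
print(round(example, 3))
```

A team that scores exactly as many runs as it allows comes out at .500, and the estimate rises as the run differential grows.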
Furthermore, sabermetricians can dig deeper into a team's performance and estimate the number of runs that a team is expected to score or allow, based on the components of hits, walks, and outs tallied by a team or its opponents. Applying these run values to the Pythagorean winning percentage method can supposedly provide an even better guide to a team's true talent level, since even more of the variability is removed from the equation. Talking to some sabermetricians leaves the impression that W-L record should be thrown out altogether and only these deeper metrics should be examined.
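To make the component idea concrete, here is a minimal sketch that plugs estimated runs into the Pythagorean formula. It assumes the basic version of Bill James' Runs Created, (H + BB) * TB / (AB + BB); the article does not say which version of Runs Created was used, and the function names, dictionary layout, and season totals below are invented for illustration.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (hits + walks) * total_bases / (at_bats + walks)


def component_pythagorean_wpct(offense, defense, exponent=1.81):
    """Pythagorean WPCT computed from estimated runs rather than actual runs.

    `offense` holds the team's own batting totals and `defense` holds the
    totals it allowed to opponents (a hypothetical layout for this sketch).
    """
    est_scored = runs_created(**offense) ** exponent
    est_allowed = runs_created(**defense) ** exponent
    return est_scored / (est_scored + est_allowed)


# Invented season totals, for illustration only.
offense = {"hits": 1450, "walks": 550, "total_bases": 2300, "at_bats": 5500}
defense = {"hits": 1400, "walks": 500, "total_bases": 2200, "at_bats": 5500}
component_wpct = component_pythagorean_wpct(offense, defense)
print(round(component_wpct, 3))
```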
But while some claim that the Pythagorean winning percentage or its counterparts are a better guide to a team's ability, is this actually so? This concept has been studied before, but here I take another look at it. Which one of these three metrics (WPCT, Pythagorean WPCT, and component Pythagorean WPCT) is best, and is there some way to combine all three metrics to get the best possible estimator of a team's ability?
Using Retrosheet data going back to 1960, I obtained the statistics to calculate WPCT, Pythagorean WPCT, and component Pythagorean WPCT (based upon Bill James' Runs Created). I then randomly selected 25% of each team's games and calculated these metrics from these games only. From these 40 or so games, I attempted to predict the team's actual WPCT in the remaining games that were not sampled. How did each of these metrics fare?
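The sampling step described above can be sketched roughly as follows. This is not the code used for the study: the game records are randomly generated stand-ins rather than Retrosheet data, and the field names and helper functions are hypothetical.

```python
import random


def split_games(games, fraction=0.25, seed=None):
    """Randomly pick `fraction` of the games; return (sampled, remaining)."""
    rng = random.Random(seed)
    n_sample = int(len(games) * fraction)
    chosen = set(rng.sample(range(len(games)), n_sample))
    sampled = [g for i, g in enumerate(games) if i in chosen]
    remaining = [g for i, g in enumerate(games) if i not in chosen]
    return sampled, remaining


def wpct(games):
    """Share of games in which the team outscored its opponent."""
    wins = sum(1 for g in games if g["runs_scored"] > g["runs_allowed"])
    return wins / len(games)


# Stand-in season of 162 invented games (ties are bumped to a one-run win).
rng = random.Random(1)
season = []
for _ in range(162):
    scored, allowed = rng.randint(0, 10), rng.randint(0, 10)
    if scored == allowed:
        scored += 1
    season.append({"runs_scored": scored, "runs_allowed": allowed})

sampled, remaining = split_games(season, seed=42)
print(len(sampled), len(remaining))
print(round(wpct(sampled), 3), round(wpct(remaining), 3))
```

With a 162-game season this yields 40 sampled games and 122 held-out games, which matches the "40 or so games" mentioned above.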
Fitting a Model
First, using teams' regular WPCT from the sample 25% of games, we can fit a simple model to predict the teams' WPCT in the other 75% of games. To increase the power of our dataset, we can randomly draw many such 25% samples and average the outcomes. I drew 100 such random samples and ran the results. When doing so, we get the following formula:
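A simple model of this kind can be fit with ordinary least squares. The sketch below assumes that is the fitting method (the article does not say), and it does not attempt to show how the 100 random samples were averaged. The function name and the five data points are invented.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns the RMSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # RMSE here divides by n; some definitions divide by n - 2 instead.
    squared_errors = ((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    rmse = sqrt(sum(squared_errors) / n)
    return slope, intercept, rmse


# Invented (sampled WPCT, remaining WPCT) pairs, one per team-season.
sample_wpct = [0.400, 0.450, 0.500, 0.575, 0.650]
rest_wpct = [0.470, 0.500, 0.480, 0.540, 0.550]
slope, intercept, rmse = fit_line(sample_wpct, rest_wpct)
print(round(slope, 3), round(intercept, 3), round(rmse, 4))
```

A slope well below 1 is what regression toward .500 looks like in this setup: extreme records in the sample are pulled back toward average in the held-out games.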
Remaining WPCT = .363*(Current WPCT) + .319.
The RMSE of this estimate of WPCT is .0659, meaning that the winning percentage for the remaining games has a fairly wide range of outcomes - no surprise to any baseball fan. It is also no surprise that a team's WPCT over 25% of its games is regressed toward .500 fairly strongly. A team playing .650 baseball over 40 games has an expected true winning percentage of just .555. The RMSE underscores the uncertainty - a 95% confidence interval has the team's true WPCT somewhere between .424 and .686.
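The arithmetic behind the .650 example can be checked with a few lines of Python. The interval here is taken as the estimate plus or minus two RMSEs, which is an assumption about how the range was built; it comes out within about a thousandth of the range quoted above, with the small gap presumably due to rounding in the published coefficients. The function names are hypothetical.

```python
SLOPE = 0.363      # coefficient on current WPCT, from the formula above
INTERCEPT = 0.319  # intercept, from the formula above
RMSE = 0.0659      # reported RMSE of the WPCT-only model


def predict_remaining_wpct(current_wpct):
    """Regressed estimate of WPCT over the games not yet played."""
    return SLOPE * current_wpct + INTERCEPT


def rough_interval(center, spread, width=2.0):
    """Center plus or minus `width` spreads (width=2 is an assumption)."""
    return center - width * spread, center + width * spread


estimate = predict_remaining_wpct(0.650)
low, high = rough_interval(estimate, RMSE)
print(f"{estimate:.3f} ({low:.3f}, {high:.3f})")
```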
But does this improve at all when using the Pythagorean formula for estimating WPCT from runs scored and runs allowed? The formula for this is as follows:
Remaining WPCT = .440*(Pythagorean WPCT) + .280.
In this case, the teams' WPCT regresses less strongly - a team with a .650 Pythagorean WPCT has an expected true WPCT of .566 rather than .555. The accuracy is improved, but the RMSE is still .0648, only slightly better than using regular WPCT.
How about the component Pythagorean WPCT using Runs Created? The formula is nearly the same as that for the regular Pythagorean WPCT. It performs the best of the three methods, with an RMSE of .0643, though again, the increase in accuracy is small.
So from the above, we see that with 40 games of information, all three methods have similar accuracy, though the Runs Created Pythagorean formula fares best, and real WPCT fares the worst.
Combining All Three Metrics
How about when all three measures are used to try to predict WPCT? Putting all three measures in the model, we get the following formula:
Remaining WPCT = .103*(Current WPCT) + .094*(Pythag WPCT) + .268*(RC WPCT) + .268.
When comparing the three types of estimated WPCT, real WPCT gets about 20% of the weight, another 20% goes to the Pythagorean WPCT, and the final 60% of the weight goes to the Runs Created Pythagorean formula. Of course, this is all regressed back to the mean as well, so that a team with a .650 WPCT in each of the three metrics would be expected to have a true WPCT of .570. The RMSE of this estimate is of course lower than that of each of the three measures separately, but is still high at .0638. This compares to .0659 for using WPCT alone.
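The weight shares and the .570 figure follow directly from the coefficients in the combined formula, as this short sketch shows. The dictionary keys and function names are made up; the numbers are the published coefficients.

```python
# Coefficients from the combined 25%-of-season formula above.
COEFS = {"wpct": 0.103, "pythag": 0.094, "rc": 0.268}
INTERCEPT = 0.268


def predict_remaining_wpct(wpct, pythag, rc):
    """Combined-model estimate of WPCT over the remaining games."""
    return (COEFS["wpct"] * wpct
            + COEFS["pythag"] * pythag
            + COEFS["rc"] * rc
            + INTERCEPT)


def weight_shares():
    """Each coefficient as a share of the sum of the three coefficients."""
    total = sum(COEFS.values())
    return {name: value / total for name, value in COEFS.items()}


shares = weight_shares()
all_650 = predict_remaining_wpct(0.650, 0.650, 0.650)
print({name: round(share, 2) for name, share in shares.items()})
print(round(all_650, 3))
```

The three coefficients sum to less than half, which is the regression to the mean at work: less than half of a team's deviation from .500 in the sample is expected to carry over to the remaining games.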
What can we take from this information? We see that indeed the sabermetricians are right - a team's performance broken down into runs created components is a better gauge of a team's true talent level than just looking at a team's winning percentage alone. However, using all three metrics provides the best estimate of a team's true talent.
But is all of this worth it? The increase in accuracy when looking at all three metrics is very small. We can take the expected random variability out of the RMSE using the formula Variance of the True Talent Prediction = RMSE^2 - Variance of WPCT by Chance, and then take the square root to get a standard error. Doing so, we see that the standard error of our prediction of true talent is .0450 when using the full model, while the standard error is .0479 when using WPCT alone.
This means that a 95% confidence interval around .500 would be (.410,.590) for the full model, while it would be (.404,.596) when using WPCT alone. Is this increase in accuracy really worth all of the trouble? You can make your own judgment, but I think it's fair to say that looking at a team's Pythagorean WPCT or component Runs Created WPCT doesn't necessarily tell you a whole lot more than looking at WPCT alone.
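The adjustment can be reproduced under two assumptions that the article does not state outright: that chance variability is the binomial variance of a .500 team, .5 * .5 / n, and that n is the 122 games left after a 40-game sample of a 162-game season. The intervals are taken as plus or minus two standard errors. With those assumptions the results land on the figures quoted above.

```python
from math import sqrt

GAMES_REMAINING = 122  # assumption: 162-game season minus a 40-game sample


def true_talent_se(rmse, games_remaining, p=0.5):
    """Standard error left after removing binomial chance variance."""
    chance_variance = p * (1 - p) / games_remaining
    return sqrt(rmse ** 2 - chance_variance)


def interval_around(center, se, width=2.0):
    """Center plus or minus `width` standard errors (width=2 is an assumption)."""
    return center - width * se, center + width * se


se_full = true_talent_se(0.0638, GAMES_REMAINING)  # full three-metric model
se_wpct = true_talent_se(0.0659, GAMES_REMAINING)  # WPCT alone
print(round(se_full, 4), round(se_wpct, 4))
print([round(v, 3) for v in interval_around(0.500, se_full)])
print([round(v, 3) for v in interval_around(0.500, se_wpct)])
```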
Extensions of the Model
Of course, this discussion has so far only concerned the case where the team has played just 25% of its games. What happens when the result of more of the season is known? Below is a table of model results, showing the coefficients for each metric after 25%, 50%, and 75% of the season is known respectively.
As you can see from the results above, the weights given to each metric remain relatively stable no matter how many games have been played. The WPCT estimated from Runs Created remains the metric with the most weight in the full formula, while the Pythagorean WPCT and real WPCT are about equal in importance. As we approach the three-quarter mark of the season, we can see that when trying to assess a team based on its performance thus far, about 50% of our estimate should come from the Runs Created measure, and about 25% each should come from the team's real WPCT and its Pythagorean WPCT. The formula is as follows:
Remaining WPCT = .173*(Current WPCT) + .194*(Pythag WPCT) + .365*(RC WPCT) + .135.
This is slightly more accurate than using WPCT alone, decreasing the SE of the true talent estimate from .0399 under WPCT alone to .0365 by using the full model.
As I said earlier, this increase in accuracy is quite small, so this entire debate may be much ado about nothing. Someone simply looking at W-L records is apt to have nearly as good an idea of a team's true talent as someone calculating complicated formulas. Nevertheless, it comes as no surprise that using all three measurements gives a better result than using any one of the metrics.
While the Pythagorean method may be a more accurate measure of a team's true value, it hardly makes a team's actual WPCT obsolete. Simply knowing the components that go into winning a game cannot replace the knowledge of a team's actual record. Things such as bullpen usage, managerial strategy, and player motivation are not modeled in the Pythagorean method. The results of these models show that these factors cannot be ignored and thus, a team's actual W-L record is still relevant.