Behind the Scoreboard August 11, 2009
How Best to Measure a Team's True Talent

One of the first sabermetric principles that many people learn is that a team's winning percentage can be predicted from the number of runs it scores and allows. This Pythagorean winning percentage takes the following form: WPCT = RS^1.81 / (RS^1.81 + RA^1.81). It was introduced by Bill James, is purported to detect whether a team is underperforming or playing over its head, and is billed as a better guide to a team's true talent. This concept has reached so far into the mainstream that it is even included in the MLB.com standings.
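As a quick illustration, the formula can be written as a small Python function. This is a minimal sketch: the function name and the 200/180 run totals are made up for the example, and only the 1.81 exponent comes from the formula above.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=1.81):
    """Pythagorean winning percentage: RS^x / (RS^x + RA^x)."""
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals cannot be negative")
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    if scored + allowed == 0:
        raise ValueError("at least one run total must be positive")
    return scored / (scored + allowed)


# Hypothetical team that has scored 200 runs and allowed 180.
print(round(pythagorean_wpct(200, 180), 3))
```

A team that scores exactly as many runs as it allows comes out at .500, and swapping the two run totals gives one minus the original value.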

Furthermore, sabermetricians can dig deeper into a team's performance and estimate the number of runs that a team is expected to score or allow, based on the components of hits, walks, and outs tallied by the team or its opponents. Applying these run values to the Pythagorean winning percentage method can supposedly provide an even better guide to a team's true talent level, since even more of the variability is removed from the equation. Talking to some sabermetricians leaves the impression that W-L record should be thrown out altogether and only these deeper metrics should be examined.
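One well-known run estimator built from such components is Bill James' basic Runs Created, RC = (H + BB) * TB / (AB + BB). The sketch below assumes that basic version (there are several refinements) and plugs the estimated runs into the Pythagorean formula. The function names and the stat lines are invented for illustration.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (hits + walks) * total_bases / (at_bats + walks)


def component_pythag_wpct(own, opponents, exponent=1.81):
    """Pythagorean WPCT using estimated runs instead of actual runs."""
    est_scored = runs_created(**own) ** exponent
    est_allowed = runs_created(**opponents) ** exponent
    return est_scored / (est_scored + est_allowed)


# Invented season totals for a team's batters and for its opponents' batters.
own = {"hits": 1400, "walks": 500, "total_bases": 2200, "at_bats": 5500}
opponents = {"hits": 1350, "walks": 480, "total_bases": 2100, "at_bats": 5500}

print(round(runs_created(**own), 1))
print(round(component_pythag_wpct(own, opponents), 3))
```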

Hail Pythagoras?

But while some claim that the Pythagorean winning percentage or its counterparts are a better guide to a team's ability, is this actually so? This concept has been studied before, but here I take another look at it. Which one of these three metrics (WPCT, Pythagorean WPCT, and component Pythagorean WPCT) is best, and is there some way to combine all three metrics to get the best possible estimator of a team's ability?

Using Retrosheet data going back to 1960, I obtained the statistics to calculate WPCT, Pythagorean WPCT, and component Pythagorean WPCT (based upon Bill James' Runs Created). I then randomly selected 25% of each team's games and calculated these metrics from these games only. From these 40 or so games, I attempted to predict the team's actual WPCT in the remaining games that were not sampled. How did each of these metrics fare?
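The split-sample step described above might look roughly like this in Python. It is a sketch only: the game tuples, function names, and the randomly generated season are all hypothetical, and real input would come from Retrosheet game logs.

```python
import random


def split_games(games, fraction=0.25, rng=None):
    """Randomly split one team-season into a sample and the remaining games."""
    rng = rng or random.Random()
    n_sample = int(len(games) * fraction)
    chosen = set(rng.sample(range(len(games)), n_sample))
    sample = [g for i, g in enumerate(games) if i in chosen]
    rest = [g for i, g in enumerate(games) if i not in chosen]
    return sample, rest


def wpct(games):
    """Share of games won; each game is (won, runs_scored, runs_allowed)."""
    return sum(1 for won, _, _ in games if won) / len(games)


def pythag_wpct(games, exponent=1.81):
    """Pythagorean WPCT from the total runs in a set of games."""
    scored = sum(rs for _, rs, _ in games) ** exponent
    allowed = sum(ra for _, _, ra in games) ** exponent
    return scored / (scored + allowed)


# Made-up 162-game season with random scores (no ties allowed).
rng = random.Random(2009)
games = []
for _ in range(162):
    rs, ra = rng.randint(0, 10), rng.randint(0, 10)
    while ra == rs:
        ra = rng.randint(0, 10)
    games.append((rs > ra, rs, ra))

sample, rest = split_games(games, rng=rng)
print(len(sample), len(rest))
print(round(wpct(sample), 3), round(pythag_wpct(sample), 3), round(wpct(rest), 3))
```

For a 162-game season this puts 40 games in the sample and leaves 122 to be predicted.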

Fitting a Model

First, using teams' regular WPCT from the sample 25% of games, we can fit a simple model to predict the teams' WPCT in the other 75% of games. To increase the power of our dataset, we can randomly draw many such 25% samples and average the outcomes. I drew 100 such random samples and ran the results. When doing so, we get the following formula:

Remaining WPCT = .363*(Current WPCT) + .319.

The RMSE of this estimate of WPCT is .0659, meaning that the winning percentage for the remaining games has a fairly wide range of outcomes - no surprise to any baseball fan. Also no surprise is the fact that the teams' WPCT over 25% of its games is regressed to .500 fairly strongly. A team playing .650 baseball over 40 games has an expected true winning percentage of just .555. The RMSE underscores the uncertainty - a 95% confidence interval has the team's true WPCT somewhere between .424 and .686.
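The article does not name the fitting method, so the sketch below assumes ordinary least squares. It first fits a line to a handful of invented (sample WPCT, remaining WPCT) pairs to show the mechanics, then applies the coefficients reported above to a .650 team. The interval is taken as the estimate plus or minus two RMSEs, which is also an assumption.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line (dividing by n)."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared) / len(squared))


def predict(x, slope, intercept):
    return slope * x + intercept


# Invented pairs, only to show the fitting step.
sample_wpct = [0.350, 0.425, 0.500, 0.575, 0.650]
remaining_wpct = [0.460, 0.455, 0.510, 0.520, 0.565]
slope, intercept = fit_line(sample_wpct, remaining_wpct)
print(round(slope, 3), round(intercept, 3))

# Coefficients and RMSE as reported in the article.
ARTICLE_SLOPE, ARTICLE_INTERCEPT, ARTICLE_RMSE = 0.363, 0.319, 0.0659
estimate = predict(0.650, ARTICLE_SLOPE, ARTICLE_INTERCEPT)
low, high = estimate - 2 * ARTICLE_RMSE, estimate + 2 * ARTICLE_RMSE
print(round(estimate, 3), round(low, 3), round(high, 3))
```

With the reported coefficients the estimate rounds to .555, and the two-RMSE band lands within about a point of the .424 to .686 interval quoted above.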

But does this improve at all when using the Pythagorean formula for estimating WPCT from runs scored and runs allowed? The formula for this is as follows:

Remaining WPCT = .440*(Pythagorean WPCT) + .280.

In this case, the teams' WPCT regresses less strongly - a team with a .650 Pythagorean WPCT has an expected true WPCT of .566 rather than .555. The accuracy is improved, but the RMSE is still .0648, only slightly better than using regular WPCT.
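To see the difference in regression strength, the two fitted lines can be compared directly. In this small sketch the coefficients come from the article, while the helper names are mine. The "fixed point" is the input value that a line maps back to itself, that is, the level each model pulls teams toward.

```python
MODELS = {
    "WPCT": (0.363, 0.319),
    "Pythagorean WPCT": (0.440, 0.280),
}


def predict(metric, value):
    """Predicted remaining WPCT from one metric's fitted line."""
    slope, intercept = MODELS[metric]
    return slope * value + intercept


def fixed_point(metric):
    """Value the line maps to itself: intercept / (1 - slope)."""
    slope, intercept = MODELS[metric]
    return intercept / (1 - slope)


for metric in MODELS:
    print(metric, round(predict(metric, 0.650), 3), round(fixed_point(metric), 3))
```

Both lines pull toward roughly .500. The larger slope on the Pythagorean line means it keeps more of a team's distance from .500.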

How about the component Pythagorean WPCT using Runs Created? The formula is nearly the same as that for the regular Pythagorean WPCT. It performs the best of the three methods, with an RMSE of .0643, though again, the increase in accuracy is small.

So from the above, we see that with 40 games of information, all three methods have similar accuracy, though the Runs Created Pythagorean formula fares best, and real WPCT fares the worst.

Combining All Three Metrics

How about when all three measures are used to try to predict WPCT? Putting all three measures in the model, we get the following formula:

Remaining WPCT = .103*(Current WPCT) + .094*(Pythag WPCT) + .268*(RC WPCT) + .268.

When comparing the three types of estimated WPCT, real WPCT gets about 20% of the weight, another 20% goes to the Pythagorean WPCT, and the final 60% of the weight goes to the Runs Created Pythagorean formula. Of course, this is all regressed back to the mean as well, so that a team with a .650 WPCT in each of the three metrics would be expected to have a true WPCT of .570. The RMSE of this estimate is of course lower than that of each of the three measures separately, but is still high at .0638. This compares to .0659 for using WPCT alone.
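The weight shares and the .570 figure follow directly from the coefficients. A short sketch, using the article's coefficients with dictionary keys and function names of my own choosing:

```python
COEFS = {"wpct": 0.103, "pythag": 0.094, "rc": 0.268}
INTERCEPT = 0.268


def predict_remaining(wpct, pythag, rc):
    """Full-model prediction of WPCT over the remaining games."""
    return (COEFS["wpct"] * wpct
            + COEFS["pythag"] * pythag
            + COEFS["rc"] * rc
            + INTERCEPT)


def weight_shares():
    """Each coefficient as a share of the sum of the three coefficients."""
    total = sum(COEFS.values())
    return {name: value / total for name, value in COEFS.items()}


print(round(predict_remaining(0.650, 0.650, 0.650), 3))
for name, share in weight_shares().items():
    print(name, round(share, 2))
```

The shares work out to roughly 22%, 20%, and 58%, which the text rounds to 20/20/60.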

What can we take from this information? We see that indeed the sabermetricians are right - a team's performance broken down into runs created components is a better gauge of a team's true talent level than just looking at a team's winning percentage alone. However, using all three metrics provides the best estimate of a team's true talent.

Much Ado?

But is all of this worth it? The increase in accuracy when looking at all three metrics is very small. Taking the expected random variability out of the RMSE, using the formula Variability in the Prediction of True Talent = RMSE^2 - Variability of WPCT by Chance, we see that the standard error of our prediction of true talent is .0450 when using the full model, while the standard error is .0479 when using WPCT alone.

This means that a 95% confidence interval around .500 would be (.410,.590) for the full model, while it would be (.404,.596) when using WPCT alone. Is this increase in accuracy really worth all of the trouble? You can make your own judgment, but I think it's fair to say that looking at a team's Pythagorean WPCT or component Runs Created WPCT doesn't necessarily tell you a whole lot more than looking at WPCT alone.
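The article does not say how the chance variability was computed. One assumption that reproduces the figures above is binomial variance for a .500 team over the roughly 122 games left after a 40-game sample from a 162-game season. The sketch below uses that assumption, along with a band of plus or minus two standard errors.

```python
from math import sqrt

REMAINING_GAMES = 122  # assumption: 162-game season minus a 40-game sample


def chance_variance(n_games, p=0.5):
    """Binomial variance of a winning percentage over n_games."""
    return p * (1 - p) / n_games


def true_talent_se(rmse, n_games=REMAINING_GAMES):
    """sqrt(RMSE^2 - chance variance), per the formula in the text."""
    return sqrt(rmse ** 2 - chance_variance(n_games))


def band(center, se, width=2.0):
    """Rough 95% band: center plus or minus `width` standard errors."""
    return center - width * se, center + width * se


for label, rmse in [("full model", 0.0638), ("WPCT alone", 0.0659)]:
    se = true_talent_se(rmse)
    low, high = band(0.500, se)
    print(label, round(se, 4), round(low, 3), round(high, 3))
```

Under these assumptions the standard errors come out near .0450 and .0479, matching the values quoted above.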

Extensions of the Model

Of course, this discussion has so far only concerned the case where the team has played just 25% of its games. What happens when the result of more of the season is known? Below is a table of model results, showing the coefficients for each metric after 25%, 50%, and 75% of the season is known respectively.

As the results show, the weights given to each metric remain relatively stable no matter how many games have been played. The WPCT estimated from Runs Created remains the metric with the most weight in the full formula, while the Pythagorean WPCT and real WPCT are about equal in importance. As we approach the three-quarters mark of the season, we can see that when trying to assess a team based on its performance thus far, about 50% of our estimate should come from the Runs Created measure, and about 25% each from the team's real WPCT and its Pythagorean WPCT. The formula is as follows:

Remaining WPCT = .173*(Current WPCT) + .194*(Pythag WPCT) + .365*(RC WPCT) + .135.

This is slightly more accurate than using WPCT alone, decreasing the SE of the true talent estimate from .0399 under WPCT alone to .0365 by using the full model.

Conclusion

As I said earlier, this increase in accuracy is quite small, so this entire debate may be much ado about nothing. Someone simply looking at W-L records is apt to have nearly as good an idea of a team's true talent as someone calculating complicated formulas. Nevertheless, it comes as no surprise that using all three measurements gives a better result than using any one of the metrics.

While the Pythagorean method may be a more accurate measure of a team's true value, it hardly makes a team's actual WPCT obsolete. Simply knowing the components that go into winning a game cannot replace the knowledge of a team's actual record. Things such as bullpen usage, managerial strategy, and player motivation are not modeled in the Pythagorean method. The results of these models show that these factors cannot be ignored and thus, a team's actual W-L record is still relevant.

## Comments

Good job, Sky. I haven't attempted a formal study of the issue but my intuition tells me that capping run differentials (at, say, five) on wins and losses on a game-by-game basis might yield the best results in terms of estimating a team's "true talent." This could make for a good study if you were up for it.

Would you care to share the projected winning percentages of all the teams for the rest of this year?

Interesting study. Would you consider there to be any value in refining the predictions based on a "strength of schedule" factor?

At the 41 game mark you probably don't have a very balanced distribution of teams played. Perhaps you could compute expected winning percentages as you've done and then refine based on your estimates of opponents played strength?

Good stuff, Sky. You might be interested in this "old" conversation:

http://www.battersbox.ca/article.php?story=20040923122101999&query=clay%2Bdavenport

By the way, Rich, I once did a study capping the extra runs in games and found that it didn't help. In fact, it hurt a bit. Those extra runs do tell a bit more of the story.

Unfortunately, I have no idea where I posted that.

Thanks for the comments. I think that overall adding strength of schedule would help - I think BP uses this to calculate their "3rd order wins".

Good idea to give the teams' expected WPCTs, Andrew - I may do that if I have time. Though, as the article shows, you're bound to be nearly as accurate by just looking at regular WPCT.

Studes, I think I remember reading your piece somewhere. Unfortunately I don't remember where either!

"knowing the components that go into winning a game cannot replace the knowledge of a team's actual record. Things such as bullpen usage, managerial strategy, and player motivation are not modeled in the Pythagorean method."

Well yeah. What bothers me about the Pythagorean method is that being able to win close games seems to be a skill; some teams do it better than others. It may be a question of having a better bullpen or better game strategy. A team that wins a lot of close games but gets blown out now and then will have a bad Pythagorean winning percentage but a good won-loss record, and actually be a good team.

Pythagorean percentage might be more valuable used over three year periods, where being on the receiving end of occasional blowouts will have less impact on the statistics.

If we are trying to isolate the effect of luck on a season, and I think it actually has a pretty big effect, maybe the place to start is to see if a team had an unusually large or small number of injuries during the season? Sometimes you hear of cases where a bad season is explained due to a large number of injuries, but I have never heard a team's success explained by having no injuries that year, though this has happened.

Another place to look is how well a team's opponents were playing in the month they happened to face that team. Or whether an unusually large percentage of a team's wins came at the beginning of a season. Over the course of a season, a team can adjust and improve and a late season charge up the standings can result. A strong early season but mediocrity the rest of the way probably indicates luck.

This was a great analysis, but I wonder if it obscures the point of using Pythagorean methods to predict win/loss records. From my point of view, the main benefit of these methods is to try to identify teams that may have been particularly lucky or unlucky up to a certain point in a season, so we can get an idea of how we might expect them to perform the rest of the year. By taking a data set as large as the one used in this analysis, it seems that you are simply showing that the majority of teams are not particularly lucky or unlucky at all, even through only 40 games of a season. This is valuable data and an interesting conclusion, but I'm curious how well Pythag would perform through this large dataset if you look only at outliers.

The typical small-scale analysis done during the course of a season is to calculate the difference between Pythag-predicted Win% and actual record and identify outliers. The purpose is that using Pythagorean methods to predict the rest of the season should be better for these outliers than simply using their current Win%, but the assumption is that regular Win% should be just as good for the rest of the "luck-neutral" teams (which your analysis supports), and the number of outliers should be small. The question, then, would be whether or not Pythagorean methods actually do a better job of predicting rest-of-season record for these outliers than simply extrapolating current Win%.

Sky,

I enjoyed your study, especially because I post a weekly feature involving component-based winning percentage estimates. Like a lot of things in sabermetrics, a lot of effort often results in very small improvements, especially when you look at average in a large dataset. Reminds me of the run estimator debates (though, like Will mentioned, it's at the extremes where things get interesting).

At BtB, Colin Wyers brought up the concern of multicollinearity with respect to the stability of the coefficients in your full model. How much does the R^2 (and adjusted R^2) change as you go from your simple models to your full model? When all variables are in your model, are any type-III tests of the coefficients significant? I'm guessing none are, because of the multicollinearity.

This would be consistent with your claim that there's minimal improvement in looking at components. But I'm not sure that you can claim strong support for the 1:1:2 weightings you propose at the end of the article, given how unreliable coefficients can become in multicollinear regression models...

Thoughts?
-j

Nice posts from Ed, Will, and Jinaz - sorry for my tardiness in reply.

Jinaz, indeed there is collinearity between the estimates. Taking just one sample gives fairly wild estimates for the coefficients because of it. However, things stabilize when taking over 100 such split-samples, so I'm fairly confident that those are the correct weights. The R-squared, as you suggest, increases but does not change much.

Will, yes the points where there is a big difference between the Pythag and WPCT are the most interesting. However, this is also what drives the regression analysis. The cases with big differences are automatically given more weight.

Ed, indeed winning close games is a skill, which is why it still has weight in the model. If it was not a skill, but simply random, then using straight PythagWPCT would be best.