F/X Visualizations March 19, 2010
Comparing Team-Win Projections

I love preseason projections: the fact that so many smart people put so much work into them, the promise of the season to come, thinking about my own ideas for the season, and comparing across projections. I hope you will indulge my impulse to do this last one.

Here I am going to compare the projected win totals (it would be very cool to do the same for player projections) across six different projection systems: the FAN projections at FanGraphs; BPro's PECOTA; THT's OLIVER; Rally's CHONE; RLYW's CAIRO; and, though it is not a projection per se, the Vegas over/under lines.

Here is the RMSE (root-mean-square error) between each pair of the six projection systems.

Interestingly, the FANS are the closest to the other projection systems. For PECOTA, CHONE and THT, the most similar projection system is the FANS. On the other end, CAIRO is often the most dissimilar: PECOTA, the FANS and THT all have CAIRO as their most dissimilar projection. That is not to say that makes the FANS 'right' and CAIRO 'wrong.' I don't think similarity to other projection systems makes a system any more or less likely to be right; it is just an interesting thing to notice.
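For anyone who wants to try this at home, here is a minimal sketch of how the pairwise RMSE could be computed. The function name and the toy numbers are my own assumptions for illustration, not the actual projections; the real input would be a 6-by-30 array with one row per system and one column per team.

```python
import numpy as np

def pairwise_rmse(projections):
    """Return an (n_systems, n_systems) matrix of RMSE values.

    projections: array of shape (n_systems, n_teams), one row per
    projection system, one column per team's projected wins.
    """
    p = np.asarray(projections, dtype=float)
    # Broadcast to get every system-vs-system difference for every team.
    diff = p[:, None, :] - p[None, :, :]
    # Root of the mean squared difference across teams.
    return np.sqrt((diff ** 2).mean(axis=2))

# Hypothetical toy data: 3 systems, 3 teams (NOT real projections).
toy = np.array([
    [90.0, 80.0, 70.0],
    [92.0, 80.0, 68.0],
    [85.0, 83.0, 72.0],
])
rmse = pairwise_rmse(toy)
print(np.round(rmse, 2))
```

The result is symmetric with zeros on the diagonal, so each system's most similar partner is simply the smallest off-diagonal entry in its row.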

Another way of analyzing this is to use principal component analysis (PCA). Picture each projection system as a 30-valued vector, one projected win total per team. You could plot each of the six systems in 30-space and see how close they are to each other, but, unfortunately, I cannot display 30-space on the computer screen. PCA is a tool to reduce the dimensionality of a data set. As an example, if all the systems projected the same number of wins for all teams except the Yankees and Red Sox, we could just look at their projections for the Yankees and Red Sox and get all of the information about the variation between the systems. In this case it is not as neat, but we can still find the teams which account for the most variation between the systems. By reducing dimensionality you lose some information, but the hope is that the information lost is largely correlated (redundant) and that much of the variation can be reduced to a handful of dimensions.

Each principal component is a linear combination weighting the importance of each team's projection, so in the example above all teams except the Red Sox and Yankees would be weighted as zero. Principal component one is the component that accounts for the greatest amount of variation. The most heavily weighted teams are the ones that drive each projection's score on the component and are most responsible for producing the variation in the projections. In this case projections that score high on component one project lots of wins for the Yankees and Reds, while those that score low on component one project lots of wins for the Orioles, Royals and Astros.
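To make the idea concrete, here is a rough sketch of PCA done with a singular value decomposition. It assumes the team columns are centered but not rescaled, which is one common choice and not necessarily the setup behind the plot discussed here; the function name and toy numbers are hypothetical. The toy data mimics the example above: the systems disagree on only the first two teams, so the remaining teams get a weight of zero.

```python
import numpy as np

def pca_scores(projections, n_components=2):
    """PCA via SVD with systems as rows and teams as columns.

    Returns (scores, loadings, explained):
      scores    - each system's position on each component
      loadings  - each team's weight in each component
      explained - share of total variance per component
    """
    x = np.asarray(projections, dtype=float)
    # Center each team's column so components describe disagreement only.
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]

# Hypothetical toy data: 4 systems, 4 teams (NOT real projections).
# Only the first two teams differ across systems.
toy = np.array([
    [95.0, 90.0, 81.0, 75.0],
    [100.0, 88.0, 81.0, 75.0],
    [92.0, 93.0, 81.0, 75.0],
    [97.0, 91.0, 81.0, 75.0],
])
scores, loadings, explained = pca_scores(toy)
print(np.round(loadings, 3))   # last two columns are zero
print(np.round(explained, 3))  # two components capture everything
```

Plotting the two columns of `scores` against each other gives the kind of two-dimensional picture that lets you see which systems cluster together.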

Here you can see the FANS and CHONE clustering relatively closely, with Vegas and PECOTA not that far off. THT and CAIRO fall out far away: CAIRO because of its love of the Reds, Twins and Mariners, and THT for its love of the Braves and Rangers, and to a lesser extent the Yankees. Again it would be very cool to do this for player projections and see whether the principal components fall out as particular player types.

Finally I wanted to see which teams had the most disagreement or consensus. Here is the average pair-wise disagreement for each team.

Florida has almost no variation. THT likes them to win 78 games, but everyone else sees them winning 80. On the other end of the spectrum the Yankees' difference is driven by THT, 103 wins, and PECOTA, 89 wins.
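Here is a small sketch of one way to compute an average pair-wise disagreement per team. It assumes "disagreement" means the absolute difference in projected wins averaged over all pairs of systems, which may not be exactly the measure used for the chart. The first column follows the Florida description above (one system at 78, five at 80); the second column is made-up filler for contrast.

```python
from itertools import combinations

import numpy as np

def mean_pairwise_disagreement(projections):
    """Average absolute difference across all system pairs, per team.

    projections: array of shape (n_systems, n_teams).
    Returns an array of length n_teams.
    """
    p = np.asarray(projections, dtype=float)
    pairs = combinations(range(p.shape[0]), 2)
    diffs = [np.abs(p[i] - p[j]) for i, j in pairs]
    return np.mean(diffs, axis=0)

# Column 0: Florida as described in the text (78 once, 80 five times).
# Column 1: hypothetical values for a team with little consensus.
toy = np.array([
    [80.0, 85.0],
    [80.0, 90.0],
    [78.0, 95.0],
    [80.0, 88.0],
    [80.0, 92.0],
    [80.0, 90.0],
])
disagreement = mean_pairwise_disagreement(toy)
print(np.round(disagreement, 2))
```

With six systems there are 15 pairs, and for Florida only the five pairs involving the 78-win projection differ at all, which is why its average comes out so small.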