Findings from the Free Agent Market
Curt Flood really started something with this whole free agency thing, huh? Using ESPN’s Free Agent Tracker, I collected data for all free agents since 2006 and used regression analysis to pick up on some trends.
WAR to Wages
This offseason, Fangraphs unveiled its Wins Above Replacement measure in the Value section of its stats pages. WAR is a statistic that combines offensive, defensive and positional value and sets it against a replacement-level baseline to find the marginal wins a player contributes to his team. There has been debate over how to convert these marginal wins into a marginal value in terms of dollars. One of the first things I looked at was whether the relationship between WAR and salary was linear or nonlinear. I plotted the WAR from each free agent's contract year (excluding those who were injured all year or who came over from Japan) against the average annual value of the contract they signed.
The regression lines look rather similar. It would appear that the nonlinear regression has an advantage at the extremes, since it won’t predict negative salaries for very negative WAR and it better captures the exponential value of superstar players. However, there is little difference between the regression lines for the vast majority of players, those between 0 WAR and 5 WAR. The R² values, which measure the percentage of variance in Average Annual Value that is explained by WAR, are similar, in an impressive .62-.64 range. This affirms that a single year of WAR captures a lot of a player’s value. Keep in mind when looking at these R² values that R² never decreases when a new term is added to a polynomial equation, so we definitely cannot draw any conclusions about either method from this graph alone.
Time 100’s own Nate Silver, in deriving Marginal Value Over Replacement Player, used a nonlinear form of WARP. I have duplicated his graph here, which projects WARP for 2005's free agent class by using three years of WARP from 2002-2004 instead of the one previous year of WAR I used for 2006-2008 free agents. I have superimposed a rough line of best fit to portray the difference between a linear and nonlinear model.
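To make the caveat about R² concrete, here is a minimal sketch in Python that fits a straight line and a second-order polynomial to the same data and compares plain and adjusted R². The data are synthetic (an assumed linear relationship plus noise, not the free agent sample), and the helper names are invented for illustration.
```python
import random


def polyfit_ls(x, y, degree):
    """Least-squares polynomial coefficients (lowest power first)."""
    m = degree + 1
    # Augmented normal equations [X'X | X'y] for the powers 0..degree.
    a = [[sum(xi ** (r + c) for xi in x) for c in range(m)]
         + [sum(yi * xi ** r for xi, yi in zip(x, y))] for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[r][m] / a[r][r] for r in range(m)]


def r_squared(x, y, coeffs):
    """Share of the variance in y explained by the fitted polynomial."""
    fitted = [sum(c * xi ** p for p, c in enumerate(coeffs)) for xi in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for the number of regressors k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Synthetic data: WAR in wins, AAV in millions of dollars (assumed values).
random.seed(0)
war = [random.uniform(-1.0, 8.0) for _ in range(150)]
aav = [1.0 + 2.3 * w + random.gauss(0.0, 2.5) for w in war]

r2_lin = r_squared(war, aav, polyfit_ls(war, aav, 1))
r2_quad = r_squared(war, aav, polyfit_ls(war, aav, 2))

print("linear    R2, adj R2:", r2_lin, adjusted_r_squared(r2_lin, len(war), 1))
print("quadratic R2, adj R2:", r2_quad, adjusted_r_squared(r2_quad, len(war), 2))
```
Because the quadratic model contains the linear one as a special case, its R² cannot come out lower on the same data; the adjusted figure is the fairer basis for comparing the two.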
Phil Birnbaum shows that individual skills in the major leagues may be normally distributed. Anecdotally, this is reaffirmed by the 20-80 scouting scale, which is based on a normal distribution with a mean of 50 and standard deviation of 10. Furthermore, Tom Tango shows that “when you consider the number of opportunities each player gets (in the Major Leagues), the total effective talent distribution is rather typical.” However, when observing only the Major Leagues, we neglect the fact that most subpar baseball talent resides at another level. There is an abundance of freely available talent that could provide marginal upgrades to current Major Leaguers. What this means in terms of player value is that below-average players will be disproportionately underpaid compared to above-average players due to the difference in the supply within each pool.
Bill James once wrote “talent in baseball is not normally distributed. It is a pyramid. For every player who is 10 percent above the average player, there are probably twenty players who are 10 percent below average.” I believe this theory holds if by baseball he means the total baseball universe and by average he means the Major League average. So, Tango may be right that, at the Major League level, talent follows a normal distribution, but when we add talent from all player pools, the curve does begin to look like the right tail of a normal distribution.
Think of it this way: would you rather have the right side of the Cardinals’ infield or the Reds’ infield? The combinations of Albert Pujols/Skip Schumaker and Joey Votto/Brandon Phillips will both produce 8 WAR, give or take. Through the currently dominant model for fair-market evaluation, both sets of players are worth some $35 million if you simply multiply their WAR by $4-5 million. But my intuition tells me that I'd rather have the pair on the Cardinals.
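The linear valuation in that comparison can be sketched in a few lines of Python. The per-win rate below is simply the $35 million figure divided by 8 WAR, and the individual WAR splits are hypothetical numbers chosen only so that each pair totals 8; they are not the players' actual values.
```python
# Linear fair-market model: value = WAR * dollars per win.
# Rate implied by the figures in the text: $35M for 8 WAR.
DOLLARS_PER_WIN = 35_000_000 / 8


def linear_value(wars, dollars_per_win=DOLLARS_PER_WIN):
    """Total market value of a group of players under a linear model."""
    return sum(w * dollars_per_win for w in wars)


# Hypothetical splits (not from the article): one star plus a marginal
# player versus two good players, each pair totalling 8 WAR.
star_pair = [7.0, 1.0]
balanced_pair = [4.0, 4.0]

star_value = linear_value(star_pair)
balanced_value = linear_value(balanced_pair)
print(f"star pair: ${star_value:,.0f}  balanced pair: ${balanced_value:,.0f}")
```
Under a strictly linear model the two pairs are priced identically however the wins are split between the players, which is exactly the property that the intuition above pushes back on.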
The key is that Pujols takes up only one roster spot and provides the same value as a pair of players who take up two. I might be able to upgrade over Schumaker on the cheap eventually. We also must account for the fact that freely available talent is, well, free, while the superstars who bring in 5+ WAR will need to be acquired through trading or bidding.
Furthermore, I found statistically significant evidence that the Type A tag for free agents is correlated with increased pay. In a practical sense, the Type A label decreases a player's value in a free market, since it costs prospective teams a first-round pick to acquire the player, or it costs the player leverage if he tries to re-sign with his former team. However, Type A free agents tend to be the best players in my sample, so it is evident that teams ignore the Type A tag and are willing to spend what it takes to reel in superior players.
Separating position players and pitchers, I find that it is much easier to predict position players' salaries in general, and the nonlinear regression fits better for position players than it does for pitchers. In separating the two pools of players, I decided to test for some skills that do not translate into a hitter’s or pitcher’s WAR, but still might directly relate to his salary.
General Managers dig the fastball
Fangraphs keeps track of pitch usage and velocity for all pitchers since 2002, and all the data can be easily exported to a spreadsheet. This is a good thing for baseball analysts. Dave Allen and Dan Turkenopf both used pitch f/x data to show how velocity relates to production. In these regressions, I account for a player’s WAR, and therefore can try to isolate the effect of a pitcher’s fastball velocity on his salary. Here is the regression output.
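      Source |       SS       df       MS              Number of obs =     149
-------------+------------------------------           F(  4,   144) =   62.82
       Model |  1.7252e+15     4  4.3131e+14           Prob > F      =  0.0000
    Residual |  9.8863e+14   144  6.8655e+12           R-squared     =  0.6357
-------------+------------------------------           Adj R-squared =  0.6256
       Total |  2.7139e+15   148  1.8337e+13           Root MSE      =  2.6e+06
------------------------------------------------------------------------------
         aav |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         WAR |    2399138   153233.6    15.66   0.000      2096260     2702016
         fbv |   164514.8   72588.22     2.27   0.025     21038.76    307990.9
          o7 |  -423055.5   545027.9    -0.78   0.439     -1500344    654233.1
          o8 |   -1365307   508682.7    -2.68   0.008     -2370757   -359857.4
       _cons |  -1.19e+07    6496299    -1.83   0.069    -2.47e+07    954444.2
------------------------------------------------------------------------------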
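For readers who want to see how the summary figures in the pitcher output hang together, here is a short Python check. It uses only numbers copied from the output above and the standard OLS identities; the variable names are mine.
```python
# Sums of squares and degrees of freedom from the pitcher regression.
ss_model, df_model = 1.7252e15, 4
ss_resid, df_resid = 9.8863e14, 144
n_obs = 149

# R-squared is the share of total variation explained by the model.
ss_total = ss_model + ss_resid
r2 = ss_model / ss_total

# Adjusted R-squared penalizes for the number of regressors.
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid

# F statistic: mean square of the model over mean square of the residual.
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

# Root MSE: typical size of a residual, in dollars of AAV.
root_mse = (ss_resid / df_resid) ** 0.5

# A t statistic is the coefficient divided by its standard error (fbv row).
t_fbv = 164514.8 / 72588.22

print(f"R2={r2:.4f} adjR2={adj_r2:.4f} F={f_stat:.2f} "
      f"RootMSE={root_mse:.3g} t(fbv)={t_fbv:.2f}")
```
Each recomputed value should agree with the reported one up to rounding, which is a quick way to confirm that a table has been transcribed correctly.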
I created two player pools, separating those with above-average fastball velocities and those with below-average fastball velocities. The average fastball in my sample of 149 pitchers travels 89.7 miles per hour. The WAR of both player pools is nearly identical, as the harder throwers average .97 WAR compared to .96 WAR for the softer throwers. Yet the harder throwers earned $4.9 million per year in free agency compared to $4.2 million for the latter group. Perhaps fastball velocity predicts future performance, or perhaps there is an allure to signing a player who can light up the radar gun, or maybe fans come out to see fast pitchers. No matter the case, throwing hard gets you paid.
I also included time fixed effects in this regression, setting dummy variables (o7 and o8 in the output) to represent the year during which the pitcher became a free agent. We find statistically significant evidence of deflation in 2008. While 2006 and 2007 appear stable in terms of free agent salaries, pitchers with similar production in 2008 were liable to lose on average more than a million dollars per year on their contract because they hit the market at the wrong time.
General Managers dig the longball
By longball, I don’t mean home runs. I mean actual distance. From Hit Tracker, I included the average true distance in feet of home runs for all players in my dataset. I also included the weight of a player in pounds, which might measure raw power or might measure nothing, but was significant in the regression. Unfortunately, weight is also probably the least accurate data point I could use since there are no reliable sources for it.
      Source |       SS       df       MS              Number of obs =     169
-------------+------------------------------           F(  3,   165) =  123.05
       Model |  2.5996e+15     3  8.6653e+14           Prob > F      =  0.0000
    Residual |  1.1620e+15   165  7.0421e+12           R-squared     =  0.6911
-------------+------------------------------           Adj R-squared =  0.6855
       Total |  3.7616e+15   168  2.2390e+13           Root MSE      =  2.7e+06
------------------------------------------------------------------------------
         aav |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         WAR |    2256088   125521.3    17.97   0.000      2008253     2503923
        true |   28062.52   13259.32     2.12   0.036     1882.712    54242.32
      weight |    24497.9   10709.87     2.29   0.023     3351.842    45643.95
       _cons |  -1.49e+07    4881150    -3.05   0.003    -2.45e+07    -5253468
------------------------------------------------------------------------------
These measures are essentially independent of WAR but do affect salary. I believe home run distance and weight are capturing a known phenomenon: slugging percentage correlates more strongly with salary than almost any other basic statistic does. Weight and True Distance correlate very well with slugging percentage. We can say with confidence that there is a bias toward heavier players who hit for power, all else being equal. For every ten pounds of weight or ten feet in home run distance, a hitter can expect a positive return in average annual value of around 250 grand.
None of this says whether paying these players more for the ability to throw fast or hit long home runs is efficient. I did this analysis to observe trends in the market over the last few years, and I am not trying to comment on any sort of inefficiencies that may exist.
Thanks to all the data sources I used in this study, including ESPN, Fangraphs, Hit Tracker, Forbes, and Fantasypitchfx.
Edit: At Jake's request, I have separated the data series by year and added separate trendlines for each year.
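For anyone who wants to reproduce the per-year trendlines outside of a spreadsheet, here is a minimal Python sketch. The rows are made-up (year, WAR, AAV in millions of dollars) values for illustration only, not the data behind the chart, and fit_line is a hypothetical helper name.
```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (year, WAR, AAV in $ millions) rows; not the article's data.
rows = [
    (2006, 0.5, 2.0), (2006, 2.0, 6.5), (2006, 4.0, 11.0),
    (2007, 0.5, 2.5), (2007, 2.5, 8.0), (2007, 5.0, 14.0),
    (2008, 1.0, 2.0), (2008, 3.0, 6.0), (2008, 4.5, 9.5),
]

# Group the points by the year the player hit the market.
by_year = defaultdict(list)
for year, war, aav in rows:
    by_year[year].append((war, aav))

# Fit one trendline per year.
trendlines = {}
for year, points in sorted(by_year.items()):
    wars, aavs = zip(*points)
    trendlines[year] = fit_line(wars, aavs)
    slope, intercept = trendlines[year]
    print(f"{year}: AAV = {intercept:.2f} + {slope:.2f} * WAR")
```
Fitting each year separately lets both the slope and the intercept move from one offseason to the next, whereas year dummy variables in a single regression shift only the intercept.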

Comments
Easy suggestion: use "millions of dollars" instead of just dollars. It will just make it easier on the eyes.
Posted by: Alex at May 7, 2009 5:57 AM
Why did you regress against the previous year's WAR instead of some sort of projection of next year's WAR? For example, Marcel's 5/3/2 weighting of the last 3 years seems like a sensible choice.
The criticism of regressing salary against WARP was that WARP had a very low replacement level, which made the resulting line have a parabolic shape. Tango's opinion was that with a properly constructed WAR (Fangraphs uses his research, I believe) a linear relationship was better.
Posted by: Anonymous Coward at May 7, 2009 6:16 AM
Alex, thanks for the suggestion. I'll try that out for the graph.
Anonymous, I agree that creating a projection would have been a better method, and a 5/3/2 weighting as Silver and Marcel did would have been easy. It was just that I collected data a couple months ago, and when I decided to use it for these purposes I was too lazy to go back and find players' previous years' worth of WAR. Not really any excuse.
I am aware of the criticism of regressing salary against WARP. I am not sure which relationship is "better," but a parabolic shape is inevitable when using a second-order polynomial equation, as Silver did and as I repeated. A linear shape appears to be less complicated, which is a plus, and seems to get similar results to more complicated models for 90% of players, which is also a point in its favor.
Posted by: Jeremy Greenhouse at May 7, 2009 9:28 AM
Interesting.
In the first chart, I might make a suggestion for the data visualization. Since you lumped several years of contracts into one graph, it misses a chance to explain a whole lot more. A new chart (same basic format) with the contracts grouped by the time of signing would allow us to distinguish year-by-year patterns. I see the chart is from Excel; it would be quite painless to assign different groups different colored dots.
The external environment makes a huge deal as we saw this past free agent period.
Posted by: Jake Russ at May 7, 2009 9:32 AM
The correlation of home run distance and fastball speed with salary is not surprising. For projecting future performance, a player who hits the ball farther or throws faster would appear less likely to regress and have a greater possibility of breaking out than the weaker, but equally valuable player.
Posted by: Doug B. at May 7, 2009 10:54 AM
Doug, agreed. The important thing is that the numbers bear that out.
Posted by: Jeremy Greenhouse at May 7, 2009 12:47 PM
Jeremy,
Interesting.
I have a suggestion for the first chart: if you group the scatter points by years, you can get Excel to plot the groups in different colors.
This would add a lot to the graph. There is a lot going on, and being able to see the patterns based on the years would be very helpful to the message you are trying to get across.
As we saw this free agent period, the external market factors can have a big effect on salaries.
Posted by: Jake Russ at May 7, 2009 1:20 PM
Jake, thanks for the suggestion. I don't currently have the data in front of me, but I'll try playing around with it when I do.
Great article on the height of Major Leaguers. I tested for height but nothing significant came up in its relationship with salary. Maybe I should have broken it down into tall, average, and short players since the relationship with height and production isn't linear apparently.
Posted by: Jeremy Greenhouse at May 7, 2009 11:05 PM
Oops on the double post. Five hours after I'd originally posted the comment, more people's comments had been put up there, so I figured that mine had failed somehow.
Would love to see the chart again after you make whatever adjustments to it.
Appreciate the comment on my article. I'm not surprised height wasn't significant. It hasn't been shown to be a significant influence on any performance metric I'm aware of either. And it's not a case of not being 95% sig; it's not even been close. I've been looking at that question a while now. Plus the data doesn't have enough variation in it because 95% of MLB pitchers are between 6'0" and 6'8". So when we see outcomes that persist in this fashion, I have to chalk it up to natural selection telling us something.
Posted by: Jake Russ at May 8, 2009 6:00 AM