Common Run-Production Formulae Evaluated
A Review of Basics

There are two sets of equations that together constitute the backbone of the art of modern statistical analysis: those that project team games won from runs scored and runs yielded, and those that project team runs scored (or yielded) from some combination of reasonably available team statistics. Since that second type is so important, it is worth taking a look at the many specimens out there--their logical bases and their actual performance. Here we will look at what the more common formulations are and how they stack up against one another. The survey will cover the period of 1955 through 2009. The reason it starts in 1955 and no earlier is that several of these methods use stats that simply weren't available before 1955 (such as IBB or SF).

As an aside, let me say that in the course of preparing this overview I was struck by two things: how few people seem to understand how to write out equations, in particular how to use nested parentheses, and how many seem willing to specify some non-standard statistic without then defining it exactly.

As to writing out equations, first consider this piece of simple arithmetic:

X = 3 x 5 + 7

Is the wanted answer 22 or 36? That depends on whether the writer intended--

X = (3 x 5) + 7

or

X = 3 x (5 + 7)

That is not an artificial example: one of the formulae evaluated below is given (in several places around the web) in exactly this form:
Jolly good luck deciphering that without extrinsic information. On further examination of the associated text, it turned out that what was meant was— R = (A x [B / { B + C } ]) + D — which brings up the other point about writing out equations: there are other enclosure marks than the parenthesis, to wit the bracket and the brace, both of which are illustrated in the preceding example. Using them makes untangling nested expressions very much easier. (In principle, there is an implied order of precedence for arithmetic operations such that parentheses are often not needed, but not only do few people know it—I'd have to look it up—but there is never any guarantee that the writer of a given equation knows it either, or even knows that it exists.) My other peeve is illustrated by these sorts of formulae: R = ( [1B x 3] + [2B x 5] + [3B x 7] + [HR x 9] + [BB x 2] + [SB x 1] - [Outs x 0.61] ) x 0.16 In the first, whatever is "Outs"? In the second, whatever is "OOB" (even when expanded to "Outs on Base")? Is "Outs" all outs made by the team? Outs made only by batters? A particular estimate of all outs (such as [AB - H] + SH + SF + CS + GDP)? And what about OOB? Is it all team outs minus batters' outs? Some particular combination of standard stats (such as GDP + CS)? Or what? Which bodily part experiences the pain if the actual, exact meaning is explicitly stated? (Mind, not every formula presenter is guilty of all, or even any, of those sins; but altogether too many are.) An interesting side question is just what stats is it "fair" to use? For example, one writer states that he means a particular term in a particular formula to signify an out made by a player trying to stretch a single into a double or a double into a triple (or the rare case of a triple into an inside-the-park home run). That's clear, and no doubt meaningful in the context, but whence such data? 
OK, yes, Retrosheet.org has it all there for those with the diligence and patience to mine it, and Baseball-Reference.com has done an awful lot of that mining. But whether a particular stat is "readily" available can be a tough call. I suppose at bottom much depends on ultimate purposes: if the idea is to write up a technical paper examining the mechanisms of run-scoring, then anything that can be extracted from the record is fair dinkum; but if the idea is to make a tool suited for frequent and straightforward work, then using stats not readily available would seem to render the equation containing them unsuited for its purpose. There are, though, a couple of stats that are sort of on the margin. Those are CI, catcher's interference, a typically very small but nonetheless official and significant stat, significant in that it is a component of PA, plate appearances—but is almost universally left out of published PA tallies and almost never published in itself (and suppose there's a Dale Berra or Roberto Kelly on the subject team?). And there's Eb (opponents' errors allowing an otherwise-out batter to reach base, which Baseball-Reference lists as ROE for "Reached On Error"). Omitting CI will—for most teams in most years—have very little, if any, effect, but I am surprised that Eb is so generally unused. (In the one case it is used, estimating it instead of using the exact number decreases average accuracy by about 0.08 of a run, which is about 0.1%; that may not seem like a lot, but wait and you'll see.) Before we get to specifics, we ought also to consider what we are looking for and how to determine if we are getting it. What we want, of course, is accuracy: we want to feed in the stats for a team and, ideally, always get back the exact number of runs actually scored by the team that posted those stats. Obviously, we will not in general be able to get perfect results, so the way we evaluate various equations is by how closely they approximate perfection. 
Formula makers have devised various ingenious ways to measure how well such things do; here, I will use some simple metrics that seem to my possibly naive mind to well express what we are seeking.

The first, and foremost, is simply average percentage error. If formula X estimates R_est runs for a given team in a given year, and that team actually scored R_act runs--so that the absolute error is R_est - R_act runs--the percentage error will be:

E_pct = 100 x ( [R_est - R_act] / R_act )

Expressing error as a percentage is important, because absolute error sizes--actual numbers of runs off--are misleading: an absolute error of 10 runs signifies one level of accuracy for a team that scored 400 runs and quite another for one that scored 800 runs. If we then take the unsigned value of the percentage error (that is, ignore whether it is positive or negative), we have a measure of the relative size of the error. We can then just average all the percentage error sizes over whatever time span we are examining to get an overall average percentage error size. That tells us how closely, on average, the subject formula's estimate of runs came out relative to the actual value.

But average size of error is not the only metric of importance. If a runs predictor is truly modelling run scoring fairly well, then its errors ought to be symmetrical: that is, they should scatter evenly around perfect accuracy. A formula that comes in with a given average size of error but has, say, twice as many over-estimates as under-estimates is clearly not working as well as one of roughly equal size accuracy that comes in with its errors about evenly divided between over and under.

Finally, we would expect that the better a runs-predictor is working, the more nearly its cumulative total error with + and - considered will trend to zero. That is, the cumulative sum of all its errors over the subject time span (with over- and under-estimates cancelling) should be nearly zero.
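The three measures described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `evaluate` and the four team-season figures are invented for the example and are not taken from the survey data.

```python
from statistics import mean


def evaluate(estimated, actual):
    """Summarise an estimator's errors over paired team-seasons."""
    # Signed error in runs for each team-season (estimate minus actual).
    errors = [est - act for est, act in zip(estimated, actual)]
    # Signed percentage error: 100 x (estimate - actual) / actual.
    pct_errors = [100 * err / act for err, act in zip(errors, actual)]
    return {
        # Average size of the percentage error, sign ignored.
        "avg_pct_error": mean([abs(p) for p in pct_errors]),
        # Symmetry check: how many over-estimates versus under-estimates.
        "over": sum(1 for e in errors if e > 0),
        "under": sum(1 for e in errors if e < 0),
        # Cumulative error with sign, so overs and unders cancel.
        "cumulative_error": sum(errors),
    }


# Hypothetical estimates and actual runs for four team-seasons.
estimated = [710, 655, 802, 598]
actual = [700, 670, 790, 600]
summary = evaluate(estimated, actual)
print(summary)
```

With these invented figures the estimator is over twice and under twice, its average percentage error is about 1.38, and its signed errors sum to 5 runs; that last figure is the cumulative error just described.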
This is related to but slightly different from the criterion above. And for completeness, we should still also tabulate the absolute sizes of errors, both as an average error in runs and--to keep the control freaks happy--as a standard deviation in runs.

With all that understood, we can turn to particular run-scoring formulae. All such run-scoring equations fall into two broad classes, which we can call "linear" and "multiplicative"; each has its devotees, and we will take an overview of each class separately.

The Formulae

The Multiplicative Approach

The Theory

The basic idea behind multiplicative approaches is quite simple: run-scoring consists in getting runners on, then driving them in. Equations based on that principle are "multiplicative" because they are probabilistic--that is, they seek to estimate the probability of runs scoring based on the occurrence of certain game events. It is a base fact of probability analysis that the probability of two independent events both occurring is the multiplicative product of the independent probabilities of each one occurring: if the chance of a randomly selected person being male is 50%, and the chance of a randomly selected person being blue-eyed is 16%, then the probability that a randomly selected person is a blue-eyed male is 8% (0.5 x 0.16). In multiplicative run-scoring equations, the factors being multiplied represent the probability of a batter getting on base and the probability of another batter advancing any runners already on base.

For the first term, the chances of a batter getting on base, it might seem that all that is needed is the now-familiar on-base percentage; but the OBP does not take into account the reality that a man who has successfully reached base may then be thrown out on the bases. A man thrown out on the bases may as well have never reached base (as far as the chances of his becoming a run scored), so multiplicative formulae need to in some way estimate net runners on base.
That is not as easy as it might sound, because some data are not so easy to obtain. For example, by definition, total plate appearances equals runs plus left on base plus total outs:

PA = R + LOB + Outs

so that

R + LOB = PA - Outs

(And, of course, R + LOB is the number of men who reached base and were not later thrown out.) But total team Outs made is not so easy a datum to come by, unless one can find lines of "opponents' pitching"; otherwise, one has to assemble it from numerous pitching splits. If one has that capability, then one can use the exact datum; if not, one has to estimate it. (Sidebar: for reasons best known to themselves, few if any stat services any longer tabulate LOB, once one of the fundamental stats: "No runs, two hits, one man left on base, and at the end of five . . . ." It can be derived, using the simple equation above, if one can first assemble a total team Outs datum.)

If one has to estimate, some stats for runners thrown out on base are commonly available: caught stealing (CS) and grounded into a double play (GDP, or GIDP). But there are far more ways than those to be put out on the bases: pickoffs, throwouts trying to extend a hit, and so on. The general approach of multiplicative formulations is either to take the gross OB and multiply by an empirical estimation constant, or to take the gross OB, subtract what is known about outs on base, then apply an empirical estimation constant.

The base-advance component is the trickier of the two, and it is in constructing that component that multiplicative equations most differ from one another. The simplest and most obvious runner-advance stat is hits; moreover, since the more extra bases a hit goes for the more it will advance any runners on, hits in any run-advance component are invariably weighted. The simplest weighting, one commonly used, is the Total Base (TB) value, which assigns each hit a weight equal to the number of bases (that is, for example, 3 for a triple).
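The plate-appearance identity and the Total Base weighting can be put into a short Python sketch. The function names and all the team totals below are hypothetical, chosen only to show the arithmetic; the outs estimate is the one mentioned earlier as an example, ([AB - H] + SH + SF + CS + GDP), not a recommended definition.

```python
def net_runners(plate_appearances, total_outs):
    """R + LOB = PA - Outs: men who reached base and were not later put out."""
    return plate_appearances - total_outs


def estimated_outs(ab, h, sh, sf, cs, gdp):
    """One possible estimate of team outs when the exact datum is unavailable."""
    return (ab - h) + sh + sf + cs + gdp


def total_bases(singles, doubles, triples, home_runs):
    """Weight each hit by the number of bases it is worth."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


# Hypothetical team-season totals, for illustration only.
runners = net_runners(plate_appearances=6200, total_outs=4350)
left_on_base = runners - 750  # subtract (hypothetical) runs scored to get LOB
outs_guess = estimated_outs(ab=5500, h=1450, sh=60, sf=45, cs=50, gdp=120)
tb = total_bases(singles=1000, doubles=280, triples=30, home_runs=160)
print(runners, left_on_base, outs_guess, tb)
```

With an exact outs figure the identity gives net runners directly; the estimate is a fallback and will miss pickoffs, outs stretching a hit, and the like.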
More advanced approaches use different weightings that presumably better represent the effective runner-advance value of a given hit. (To clarify: if one examines the eight possible base-occupancy situations, it is clear that overall a triple will not have 1.5 times the advance value of a double--what the exact relative values may be is something each formulator works out on his own, by such means as seem good to him.)

But, while hits must clearly dominate base-advancing, there are many other stats that reflect actions that can advance runners on base. Those include walks, hit batsmen, and catcher's interference, which will move along any runners on first or in sequence thereafter; stolen bases, which are pure (no batter action) base advances; sac bunts and sac flies; wild pitches and balks; and certain errors. Determining values for these lesser but not negligible actions is another thing each analyst working on the question has to do for himself.

(Note, though--and this applies to the linear methods, too--that while certain of the "lesser" stats may triflingly increase accuracy for a formula that works with actual, historical data, they will be deceptive if used when such formulae are to be tried prospectively (that is, for predicting the future based on the past), because those actions are not under the control or influence of the offense. Such things as balks, wild pitches, and opponents' errors are essentially random happenings, and so a general empirical constant is best used to stand in for those things as a whole.)

The Formulations

I will here just list each and show the equation as I gleaned it from one or more sources on the web. If any of those equations seem to anyone reading this as incorrect expressions of the maker's intent, please email me. The accuracy surveys will come after we have introduced all the equations of both classes.
At least as early as 1964, a run-scoring equation of passable accuracy existed: Earnshaw Cook's "DX", which has an average accuracy of around 3½ percent, and which had a "simplified" form essentially identical to the original famous "Runs Created" formulation Bill James put forth 15 or 20 years later. For this evaluation, I tried to use all the current methods I could find documented around the web. I probably missed some, and would be pleased to hear from anyone who has one or more others to suggest (just email me with the formula—written out nicely, please, as spoken of earlier—and some info on who made it when), and if enough roll in I will try to assemble a follow-up survey. But for now, these are they: Basic Runs Created:
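For reference, the form of basic Runs Created most often published is RC = ( [H + BB] x TB ) / (AB + BB). A minimal Python sketch, assuming that is the form meant here; the function name, argument names, and sample team totals are illustrative and not drawn from the survey data:

```python
def runs_created_basic(h, bb, tb, ab):
    """Commonly published basic Runs Created: on-base factor times
    advancement factor, divided by opportunities."""
    on_base = h + bb          # runners getting on
    advancement = tb          # total bases as the advance measure
    opportunities = ab + bb   # chances taken
    return on_base * advancement / opportunities


# Hypothetical team-season totals, for illustration only.
rc = runs_created_basic(h=1450, bb=550, tb=2300, ab=5500)
print(round(rc, 2))
```

The shape is the multiplicative idea in miniature: an on-base term multiplied by an advance term, scaled by opportunities.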
This (hereafter RCbasic) was Bill James' first opus. Its chief virtue is its extreme simplicity of both form and calculation: one can easily understand it, and one can easily reckon it. Stolen-Bases Runs Created:
This (hereafter RCsb) is a modification of the "Basic" version to account for the value of, yes, stolen bases (and the corresponding caught-stealings). "Technical" Runs Created:
This (hereafter RCtech) is a substantially greater modification of the "Basic" version, to account for all sorts of other lesser data. "Technical" Runs Created, 2nd Version:
This (hereafter RCtech2) is a minor variation of the form above. "Technical" Runs Created, 2nd Version, alternate:
This (hereafter RCtech2a) is another very small variation of the RCtech2 form (0.26 becomes 0.24). "Technical" Runs Created, 3rd Version:
This (hereafter RCtech3) is the most complex yet of the variations on the RC formula; it is the only one to assign non-TB weights to base hits. Base Runs:
A = H + BB + HB - HR - (0.5 x IBB)

This (hereafter BR) is David Smyth's offering in this category. Wikipedia cites Tom Tango as stating that BaseRuns models the reality of the run-scoring process significantly better than any other run estimator. (We shall see.)

Total Offensive Productivity:
This (hereafter TOP) is mine own. It is sufficiently complex that the making of it (above) is split into multiple pieces for comprehensibility, since it uses the y = mx + b method for best-fitting the relation between runners scored and base-advance events.

Total Offensive Productivity, Dumbed-Down:

This (hereafter TOPdd) is as above, but with all coefficients rounded to only two decimal places of accuracy. No recalculating was done (though the coefficients do interact). The point was to see if using three decimal places, which many but not all formulae do, made any material difference.

Total Offensive Productivity, No Error Data:
This (hereafter TOPnoEs) is the full formulation except with opponents' errors (Eb)--and thus net runners on base--estimated by a couple of empirical coefficients. I inserted it here to show how much estimating net on-base does or does not cost accuracy as compared to using exact values (because they are not always simple to obtain). Because this is estimating a datum that should be known exactly, it uses full-accuracy constants (no point in double-crippling it).

The Linear Approach

The Theory

In a sense, there is no theory to linear methods (usually referred to as "linear weights", though that really signifies only one such method). Linear methods are based on what we might call the "ant on a globe" principle: place an ant on the surface of a sufficiently large globe and the surface, though actually curved, will seem flat. Indeed, we humans experience that every day on planet Earth, which is why so many people believed it flat for so long. Linear methods are not concerned with the full shape (and hence describing equation) of the relations between common baseball stats and runs scored: they assume that over the relatively short stretches of such curves that we are in practice concerned with, the relations can be considered to be straight lines (hence "linear"). From that assumption, it follows that one can construct runs by simply adding up the effects of each stat that might have some influence on run scoring, with that stat appropriately "weighted" by an empirical constant derived from experience.

The chiefest objection to linear methods is that they do not actually model run-scoring, which is a non-linear process. Countering that indubitable assertion is the sheer fact that they can and do produce good results. Further, they have this virtue: you can construct team values from individual-player values by simple addition. You cannot do that for multiplicative methods because in general the product of the averages is not equal to the average of the products.
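The claim about products and averages can be checked with a small numerical sketch in Python. The two pairs of rates below are invented for illustration and do not describe real players or any formula in this survey.

```python
from statistics import mean

# Hypothetical per-player rates: chance of getting on base, and an
# "advance" factor, for two players.
on_base = [0.30, 0.40]
advance = [0.35, 0.55]

# Multiply the two team averages together...
product_of_averages = mean(on_base) * mean(advance)

# ...versus averaging each player's own product.
average_of_products = mean([ob * adv for ob, adv in zip(on_base, advance)])

print(product_of_averages, average_of_products)
```

The first figure comes to about 0.1575 and the second to about 0.1625: close, but not equal, which is why individual values from a multiplicative formula do not simply add or average up to the team value.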
What that mouthful means can be shown quite easily:

The Formulations

Estimated Runs:
This (hereafter ER) was created by Paul Johnson and got a nice write-up from Bill James; James seems to despise linear methods, and it is widely reported around the web that he apparently did not recognize Johnson's formulation as a linear method. There are other variants of this method, as described farther below; which version came first I cannot readily ascertain. Estimated Runs a:
This (hereafter ERa) is the above, but with HB and CI included; I just tried those on an off chance, and it much the results, so I include it. Estimated Runs 2:
This (hereafter ER2) is a variation on the method above; as I said, I don't know which came first. Estimated Runs 3:
This (hereafter ER3) is yet another variation on the ER method. (The numbering, again, does not here imply a sequence.) Extrapolated Runs:
This (hereafter XR) is one of Jim Furtado's efforts at a linear formula; there is another one, listed below. I am unsure of their order of creation. Extrapolated Runs 2:
This (hereafter XR2) is a modified version of the above. I am unsure, actually, which version preceded which.

The Shoot-Out

The Results

Just for fun, I also included, as a sort of baseline, what one might call a "worst-possible-way" method. All it does is assign every team in every season the league-average runs for that league and season--that is, it doesn't "predict" at all, but assumes every team is "average". Any way of "projecting" runs that does worse than this is actually "anti-predicting".

The column headings are mostly self-explanatory, but here are notes on a couple. "Cumulative Error" is all actual errors added up, with sign (that is, plus and minus); the lower, the better. "Per Team-Year Error" is just the Cumulative Error divided by the number of team-seasons it was gathered over; it is not terribly important, but helps put the cumulative number in some sort of perspective.

As noted, the data are from the years 1955 through 2009, inclusive. The formulations are listed in order of average percentage accuracy, lowest to highest. The envelope, please . . . .
(The darker lines are multiplicative measures, while the lighter are linear.)

Some Reflections

First off, it is manifest that the best of the multiplicative and the best of the linear methods produce results that are quite close enough for folk music. Second, it is clear that the differences in performance of all these methods are far less consequential than the general accuracy of all. For perspective, let's keep in mind that a difference in accuracy of 0.14% is only about one run per team per season. Look at it: best to worst is only an average difference of less than 4 runs per team per season.

One thing, though, that is clear is that none of the linear methods is really close to a symmetrical distribution of its errors. That is scarcely a fatal flaw, but it does suggest that they are, as is known, not modelling process but empirically matching data. Now there are a lot of empirical constants in the multiplicative methods, too, but the thing is that the linear systems are their constants, and nothing else.

I thought it might be useful to take a look at graphical representations of a couple of these methods. For economy, I chose the best linear and the best multiplicative methods. Here they are:

There are differences, but you've got to look awfully hard to find them. And you will also notice--again, if you look carefully--what a tabled presentation would show better (but is too long for here), which is that these two rather different methods get mostly the same results for the same teams (look at the odd little dots that are fairly isolated), which demonstrates what we already knew: that variations from projection are essentially chance.

My own summing-up is that if you need convenient ease of use, as when doing calculations by hand, the XR method is easiest. If you want the sense that you're really modelling what happens, want best available accuracy, and have the use of a computer to do the heavy lifting of calculation, use the TOP formula.
(The needed stats can be downloaded from various standard sources.) The question of how these various methods can be used to analyze individual players is a fascinating one, but, owing to length, one for another time.
Comments
That sound you heard? My head exploding.
Posted by: quotemeister at November 22, 2009 10:53 PM
X = 3 x 5 + 7 is always (!) 22 and R = A*B/(B+C) + D is totally clear, too.
The rules in math are pretty clear, multiplication and division before addition and subtraction. And the brackets around A*B/(B+C) are superfluous.
After that, the above comment says it all. :)
Posted by: BJ at November 23, 2009 1:39 AM
sv The Linear Approach:The Theory / Sidebar ... 4 x 8 = 24?
Posted by: Stoney Breyer at November 23, 2009 5:37 AM
While the average absolute errors are very similar, the fact that all the linear methods appear to be biased in one direction or the other is a significant finding. My question: if you establish a tolerance around the actual = fitted line (say, +/- 2%) would you still see the same bias?
Posted by: James M. at November 23, 2009 10:22 AM
Thanks to Eric Walker for sharing this with everyone! Much appreciated.
Hope to see how these various methods can be used to analyze individual players, it would indeed be fascinating.
I would also like to see a discussion of the Silly-ball era, as not many people are familiar with this.
Posted by: obsessivegiantscompulsive at November 23, 2009 11:20 AM
BJ: I did note, in a boxed comment: (In principle, there is an implied order of precedence for arithmetic operations such that parentheses are often not needed . . .) Those rules are, indeed, "pretty clear"--in fact, they are definite--but they are not so widely known; as I also noted: there is never any guarantee that the writer of a given equation knows it, so that one can never be sure what was meant. Parens, brackets, etc. are often not needed, but they do no harm and may do some good.
Stoney: yes, embarrassing typo--or thinko--which is, or soon will be, amended, though the point is (I hope) clear.
James: I did not emphasize the point, but I do agree that the asymmetry of linear results is significant. I'm not sure exactly what you mean by establishing a tolerance around the line. Just so it's clear, all I did here was apply the formulae, exactly as shown above (and that's why they're shown, so it's clear what I was testing with and so others could replicate the results), to 55 years of data and record the results (in all cases, projections were rounded off to the nearest whole run).
Posted by: Eric Walker at November 23, 2009 1:16 PM