Designated Hitter
July 01, 2005
Introducing Monte Carlo Win-Loss
By Sean Forman

The flaws in pythagorean Win-Loss percentage (commonly the square of runs scored divided by the sum of the square of runs scored and the square of runs allowed) are fairly well known. 20-2 blowouts count as only one win, but may affect pythagorean win-loss percentages dramatically. One-run wins and losses count as a whole win or loss, while pythagorean win-loss will treat them as nearly half a win and half a loss. All of these are true, but the method is still pretty darn good.

There has been a fair amount of work on just what the best exponent is, and I've settled on 1.83 for Baseball-Reference, but other choices abound and some have even resorted to variable exponents to squeeze out those last three to four wins of error. I'm not going to go down that path here. Instead, I would like to look at a different approach, one that accepts that teams have blowouts and one-run wins and builds that into the method.
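For reference, the pythagorean estimate described above can be sketched in a few lines of Python. The function name and signature are illustrative, not code from Baseball-Reference; the default exponent of 1.83 is the one mentioned above, and passing 2 gives the classic "square" version.

```python
def pythagorean_wl(runs_scored: float, runs_allowed: float, exponent: float = 1.83) -> float:
    """Pythagorean winning percentage: RS^x / (RS^x + RA^x).

    Raises ZeroDivisionError if both run totals are zero.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# A team that scores exactly as many runs as it allows projects to .500.
even_team = pythagorean_wl(700, 700)

# Classic exponent of 2: 800^2 / (800^2 + 700^2)
classic = pythagorean_wl(800, 700, exponent=2)
```

Note that the formula only sees season totals, which is why a single 20-2 blowout can move it so much.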

At the 2004 SABR convention in Cincinnati, I presented a talk on Monte Carlo simulation of pennant races (http://www.bb-ref.com/sabr/). The idea behind Monte Carlo Win-Loss Percentage is similar. (Monte Carlo techniques are a common computational approach to simulating complicated systems. Basically, you run a lot of simulations and aggregate the results.)

  • Get the runs scored and allowed for a team's 162 (or whatever) games.
  • Randomly reorder the runs scored values (shuffling the runs allowed as well would be redundant).
  • Play 162 games by reading down the paired lists of runs scored and runs allowed. When runs scored exceed runs allowed the team "wins", and when runs allowed exceed runs scored the team loses. (Ties do occur and count as half a win and half a loss.)
  • For example (with ten games),

    Team's actual results      Randomly ordered and simulated
    RS  RA  W/L                RS  RA  W/L
     5   4  W                   2   4  L
    11   5  W                   4   5  L
     4   1  W                   5   1  W
     5   3  W                   4   3  W
     5   1  W                   5   1  W
     6   1  W                  10   1  W
    10   7  W                  12   7  W
    12   7  W                  11   7  W
     4   3  W                   6   3  W
     2   0  W                   5   0  W
    Record: 10-0               Record: 8-2

  • Do this 1000 times (an arbitrary choice, but the numbers don't vary much between consecutive 1000-season runs) and aggregate the data.
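The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code used to produce the numbers in this article: the function names, the dictionary keys (borrowed from the table column names further down) and the seed are all assumptions made for the example.

```python
import random
from statistics import mean


def simulate_season(runs_scored, runs_allowed, rng):
    """Shuffle runs scored, pair them with runs allowed in order, return wins."""
    shuffled = list(runs_scored)
    rng.shuffle(shuffled)  # runs allowed stay put; shuffling both is redundant
    wins = 0.0
    for rs, ra in zip(shuffled, runs_allowed):
        if rs > ra:
            wins += 1.0
        elif rs == ra:
            wins += 0.5  # a tie counts as half a win and half a loss
    return wins


def monte_carlo_wl(runs_scored, runs_allowed, n_sims=1000, seed=None):
    """Run n_sims shuffled seasons and aggregate the simulated win totals."""
    if len(runs_scored) != len(runs_allowed):
        raise ValueError("need one runs-scored and one runs-allowed value per game")
    rng = random.Random(seed)
    games = len(runs_scored)
    sims = [simulate_season(runs_scored, runs_allowed, rng) for _ in range(n_sims)]
    avg = mean(sims)
    return {
        "mcW": avg,
        "mcL": games - avg,
        "mcWP": avg / games,
        "HighW": max(sims),
        "LowW": min(sims),
        "sims": sims,
    }


# The ten-game example from the list above.
RS = [5, 11, 4, 5, 5, 6, 10, 12, 4, 2]
RA = [4, 5, 1, 3, 1, 1, 7, 7, 3, 0]
result = monte_carlo_wl(RS, RA, n_sims=1000, seed=2005)
print(round(result["mcWP"], 3), result["HighW"], result["LowW"])
```

Passing a seed makes a run repeatable; leaving it as None gives a different set of 1000 seasons each time, which is one way to see how little the averages move between runs.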

What are the Flaws?

Well, this method assumes that runs scored and runs allowed are independent of each other and that clearly is not the case. Managers manage to the score and the four runs allowed by mop-up relievers in the bottom of the ninth could turn a real win into a monte carlo loss (the same thing happens with pythagorean to a lesser degree). However, I think this method more correctly handles the cases where a team has a lot of one-run wins or many blowouts. Suspended games are somewhat problematic and tie games are troubling, but I think all of this gets evened out over the long run.

Does it work?

Yes, but not well enough to supplant pythagorean win-loss records.

I've computed Monte Carlo Win-Loss Percentages (mcWL%) for every team from 1901 on and it does a little better than Pythagorean Win-Loss Percentage (pythWL%) with a 1.83 exponent.

Root-mean-square error (RMSE) of mcWL% and pythWL% against actual WL% for 2076 team seasons since 1900:

  • RMSE monte carlo method: 0.025023
  • RMSE pythagorean method: 0.026026
  • So the Monte Carlo estimate is better by one measly percentage point (0.001), or about one-sixth of a game over the course of a season. Also, mcWL% was as close as or closer than pythag to the actual WL% in 53% of the cases.

    So not great, but competitive.
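For readers who want to reproduce this kind of comparison, here is a small Python sketch of the RMSE calculation and of the arithmetic behind the "one-sixth of a game" figure. The function name is made up for the example, and the three sample teams (1946 BOS, 1993 NYM, 1947 WSH, taken from the tables below) are only there to show the call; three extreme seasons say nothing about the overall RMSE.

```python
import math


def rmse(estimated, actual):
    """Root-mean-square error between two equal-length lists of percentages."""
    if len(estimated) != len(actual):
        raise ValueError("lists must be the same length")
    squared = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Actual, Monte Carlo and pythagorean WL% for three teams from the tables.
actual_wp = [0.675, 0.364, 0.416]
mc_wp = [0.599, 0.441, 0.409]
pyth_wp = [0.629, 0.454, 0.363]

rmse_mc_sample = rmse(mc_wp, actual_wp)
rmse_pyth_sample = rmse(pyth_wp, actual_wp)

# Gap between the two published RMSE values, converted to games
# over a 162-game season.
gap_games = (0.026026 - 0.025023) * 162
```

The gap works out to roughly 0.16 games, which is where the one-sixth comes from.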

    What can you do with this?

    I have a couple of ideas, but I'll expand on those later. One neat thing about these simulations is that you can count how many times the team's actual wins exceeded the simulated season's wins (I call this fraction the percentile). For instance, a team that exceeded the simulation all 1000 times was probably very lucky to do so, and a team that never did was very unlucky. We can also track the best and worst simulated results along with the average.
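As a rough sketch, the percentile described above could be computed like this in Python. The function name is hypothetical, and counting only strict "exceeded" cases (so a simulated season that exactly matches the actual win total does not count) is an assumption; the article does not say how equal totals were handled.

```python
def luck_percentile(actual_wins, simulated_wins):
    """Fraction of simulated seasons whose win total the actual team exceeded."""
    if not simulated_wins:
        raise ValueError("need at least one simulated season")
    beaten = sum(1 for w in simulated_wins if actual_wins > w)
    return beaten / len(simulated_wins)


# Toy illustration with four simulated seasons: the team's 10 actual wins
# beat the 8-win and 9-win seasons but not the 10-win or 11-win ones.
example = luck_percentile(10, [8, 9, 10, 11])
```

A value near 1.000 corresponds to the "very lucky" teams and a value near 0.000 to the "very unlucky" ones.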

    Luckiest teams by mcWL% - WL%

    team_ID    year_ID    W   L    mcW   mcL  HighW  LowW     WP   mcWP  pythWP  lucky  percentile
    BOS           1946  104  50   93.4  62.6  102.0  84.5  0.675  0.599   0.629  0.076       1.000
    NYG           1909   92  61   83.1  74.9   93.5  74.5  0.601  0.526   0.560  0.075       0.998
    NYG           1913  101  51   92.6  63.4  102.5  83.5  0.664  0.594   0.627  0.070       0.998
    NYY           2004  101  61   89.8  72.2   98.5  78.5  0.623  0.554   0.548  0.069       1.000
    BRO           1954   92  62   81.3  72.7   92.0  71.0  0.597  0.528   0.523  0.069       1.000
    CHW           1959   94  60   84.9  71.1   94.0  74.5  0.610  0.545   0.559  0.065       1.000
    CIN           1981   66  42   59.1  48.9   67.0  51.5  0.611  0.547   0.524  0.064       0.999
    CIN           1944   89  65   80.0  75.0   91.5  71.0  0.578  0.516   0.530  0.062       0.999
    NYY           1943   98  56   88.9  66.1   99.0  80.5  0.636  0.574   0.595  0.062       0.999
    PIT           1908   98  56   88.9  66.1   99.5  79.0  0.636  0.574   0.600  0.062       0.997
    SLB           1902   78  58   71.6  68.4   80.0  62.0  0.574  0.512   0.509  0.062       0.990
    NYM           1972   83  73   73.5  82.5   83.0  65.0  0.532  0.471   0.459  0.061       1.000
    PHA           1931  107  45   98.4  54.6  106.5  88.0  0.704  0.643   0.640  0.061       1.000
    STL           1917   82  70   73.9  80.1   83.5  65.0  0.539  0.480   0.470  0.059       0.996
    NYG           1925   86  66   77.0  75.0   86.0  66.5  0.566  0.507   0.522  0.059       1.000
    CHC           1907  107  45  100.1  54.9  109.5  91.0  0.704  0.646   0.670  0.058       0.992
    NYG           1906   96  56   88.0  65.0   97.5  74.5  0.632  0.575   0.592  0.057       0.997
    PIT           1905   96  57   88.4  66.6   98.0  77.0  0.627  0.570   0.588  0.057       0.996
    PIT           1909  110  42  102.0  51.0  113.5  94.0  0.724  0.667   0.694  0.057       0.997
    BRO           1924   92  62   83.3  70.7   94.0  74.5  0.597  0.541   0.528  0.056       0.999
    
    Unluckiest teams by mcWL% - WL%

    team_ID    year_ID   W    L   mcW    mcL  HighW  LowW     WP   mcWP  pythWP   lucky  percentile
    BSN           1935  38  115  53.3   99.7   64.5  43.0  0.248  0.348   0.327  -0.100       0.000
    NYM           1993  59  103  71.4   90.6   81.5  61.5  0.364  0.441   0.454  -0.077       0.000
    CIN           1937  56   98  68.2   86.8   79.5  59.0  0.364  0.440   0.434  -0.076       0.000
    PHI           1936  54  100  65.6   88.4   74.5  55.0  0.351  0.426   0.416  -0.075       0.000
    STL           1909  54   98  66.1   87.9   76.0  55.0  0.355  0.429   0.398  -0.074       0.000
    SLB           1905  54   99  66.5   89.5   78.0  58.0  0.353  0.426   0.421  -0.073       0.000
    PIT           1917  51  103  63.2   93.8   71.5  54.5  0.331  0.403   0.388  -0.072       0.000
    BSN           1912  52  101  63.6   91.4   73.0  53.5  0.340  0.410   0.402  -0.070       0.000
    DET           1952  50  104  61.3   94.7   71.0  49.5  0.325  0.393   0.374  -0.068       0.001
    PHA           1945  52   98  63.3   89.7   72.5  53.5  0.347  0.414   0.385  -0.067       0.000
    NYM           1962  40  120  50.9  110.1   61.0  41.0  0.250  0.316   0.313  -0.066       0.000
    WSH           1907  49  102  60.3   93.7   69.0  51.0  0.325  0.391   0.361  -0.066       0.000
    BRO           1912  58   95  67.9   85.1   79.0  58.5  0.379  0.444   0.433  -0.065       0.000
    HOU           1975  64   97  74.8   87.2   83.5  64.5  0.398  0.462   0.469  -0.064       0.000
    SDP           1994  47   70  54.5   62.5   63.5  46.0  0.402  0.466   0.453  -0.064       0.003
    PHA           1946  49  105  59.0   96.0   68.0  50.5  0.318  0.381   0.387  -0.063       0.000
    PHI           1930  52  102  62.4   93.6   73.0  50.5  0.338  0.400   0.392  -0.062       0.002
    SLB           1911  45  107  54.5   97.5   63.5  42.5  0.296  0.358   0.341  -0.062       0.002
    PHI           1923  50  104  59.8   95.2   69.5  47.5  0.325  0.386   0.367  -0.061       0.002
    BSN           1911  44  107  54.9  101.1   63.0  43.5  0.291  0.352   0.333  -0.061       0.001
    
    Teams for which pythWL% and mcWL% differ the most

    team_ID    year_ID    W    L    mcW    mcL  HighW  LowW     WP   mcWP  pythWP  percentile
    BRO           1918   57   69   57.0   69.0   64.5  47.0  0.452  0.452   0.387       0.543
    CHC           1905   92   61   96.2   58.8  106.5  86.0  0.601  0.621   0.680       0.083
    BSN           1904   55   98   57.8   97.2   67.5  49.5  0.359  0.373   0.316       0.187
    CHW           1905   92   60   91.8   66.2  100.0  81.5  0.605  0.581   0.636       0.563
    BSN           1906   49  102   53.6   98.4   63.0  45.0  0.325  0.353   0.300       0.059
    STL           1908   49  105   50.5  103.5   63.0  42.0  0.318  0.328   0.277       0.328
    WSH           1903   43   94   49.3   90.7   59.0  40.0  0.314  0.352   0.302       0.015
    CIN           1901   52   87   54.0   88.0   62.5  45.5  0.374  0.381   0.334       0.264
    SDP           1972   58   95   62.6   90.4   73.0  53.0  0.379  0.409   0.362       0.075
    WSH           1947   64   90   63.0   91.0   72.5  51.0  0.416  0.409   0.363       0.660
    CHC           1909  104   49  102.9   52.1  111.5  92.0  0.680  0.664   0.709       0.651
    CLE           1908   90   64   86.9   70.1   96.5  77.5  0.584  0.553   0.598       0.871
    NYY           1939  106   45  104.8   47.2  114.0  94.0  0.702  0.690   0.734       0.683
    BRO           1909   55   98   60.6   94.4   70.0  52.5  0.359  0.391   0.347       0.031
    BSN           1905   51  103   54.6  101.4   63.0  46.0  0.331  0.350   0.306       0.119
    DET           1905   79   74   72.4   81.6   81.5  61.5  0.516  0.470   0.426       0.993
    HOU           1963   66   96   64.8   97.2   73.0  55.0  0.407  0.400   0.357       0.686
    PIT           1918   65   60   64.6   61.4   72.0  55.0  0.520  0.513   0.556       0.581
    STL           1944  105   49  102.8   54.2  112.5  93.5  0.682  0.655   0.697       0.784
    BRO           1910   64   90   68.6   87.4   79.0  57.5  0.416  0.440   0.398       0.065
    

    I've also made available a dump of my simulation results. The fields are tab-delimited. You can import this into Excel easily using the Text to Columns command (the most useful command for any stathead, well after sorting). Simulation Data.

    The columns are straightforward, except for stdW which is the standard deviation of the wins totals across the 1000 simulations, and bstW and wstW are the best and worst win totals of all 1000 simulations.

    [Additional reader comments and retorts at Baseball Primer.]

    Comments

    The big advantage of Pythagorean records is that they're good at predicting future performance. Can the Monte Carlo method be used as a forecasting tool?

    Do you have a hypothesis on why there are so many teams from the first half of the twentieth century? If the opportunity to have a lucky or unlucky season was truly random there would be more teams from recent times. This would be due to the fact that there are more teams in the league in recent years, therefore more teams have an opportunity to be either lucky or unlucky. Whew, that's confusing. I hope you get where I'm coming from.

    >>The big advantage of Pythagorean records is that they're good at predicting future performance. Can the Monte Carlo method be used as a forecasting tool?

    Sean can provide real #s, but it's clear that the pythag and the MC% are very highly correlated (probably .95 or higher). In that case, anything you can predict with pythag you can predict with MC (but maybe a little better or worse).

    But there doesn't seem to be a big advantage of MC over pythag in terms of accuracy and pythag is much easier to calculate. And I'm too lazy to download Sean's simulated data, but I'd bet that the empirical standard deviation from the sims is very close to the expected standard deviation you'd get using the pythag winning percentage.
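One reading of the "expected standard deviation" mentioned in the comment above is the binomial one: treat each game as an independent trial won with probability equal to the pythag winning percentage. That interpretation is an assumption, as is the function name in this Python sketch.

```python
import math


def binomial_win_sd(games, win_pct):
    """Standard deviation of wins if each game is an independent coin flip."""
    return math.sqrt(games * win_pct * (1.0 - win_pct))


# 1946 BOS from the luckiest-teams table: 154 decisions, pythWP of 0.629.
sd_bos_1946 = binomial_win_sd(154, 0.629)
```

For that team the binomial figure comes out at about 6 wins. Whether the spread of the simulated seasons (the stdW column in the data dump) actually matches it is something the data would have to settle.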

    >>Do you have a hypothesis on why there are so many teams from the first half of the twentieth century?

    I was wondering about that myself. My best guess is that either pythag or MC doesn't work that well in the tails, perhaps especially the low-scoring tails, of the distribution. Given 12 of the teams in the last table are from between 1901-1909, it seems that the two methods must diverge in low-scoring eras. Which one is better I have no idea.

    I am curious where the 2001 Seattle Mariners would stand in any discussion about the luckiest team ever. Observationally, I have never seen a team get as many breaks as that team. I remember going to a game that year at Safeco where they scored 5 runs for a win and didn't hit a ball hard all game long. I remember laughing to myself all the way home after that game. Unfortunately, there are not too many laughs in Seattle these days after a Mariner game. The luckiest single performance I have ever seen was in 1965 when Robin Roberts shut out the Phillies in a complete game while giving up a ton of hits and rockets that were turned into outs. The last out of the game was a shot by Dick Allen that I thought was going to hit the roof of the Astrodome but instead settled into the glove of the centerfielder who caught it at the wall in straight-away centerfield.

    Thanks for the comments.

    I agree, pythag is far easier to compute. What I like about this method is that it tends to view things in terms of a distribution of possible outcomes. If I had more time I would have looked a bit more closely at how the outcomes are distributed.

    I also found it interesting that a team's win totals could range from 102 to 84 wins just by randomly re-ordering the runs scored and allowed. That in no way changes how many runs were allowed and scored in a season or even in each game.

    Also, I checked the 1900-1909 seasons and pythag is very slightly better.

    Bill James tried something similar in the 1986 Baseball Abstract, and got similar conclusions; the gain wasn't worth the extra work.


    He computed the won-lost percentage of all NL teams for each number of runs scored. Each team was credited with an offensive won-lost percentage based on the number of runs it scored in each game. For example, since NL teams in 1985 won 60% of the time they scored four runs, a team which scored four runs 20 times would be credited with 12 offensive wins and 8 offensive losses for those games. Defensive wins were calculated similarly.

    The offensive and defensive won-lost percentages were then combined to get an estimated total of wins for each team. The standard error by this method was slightly less than projected by the Pythagorean formula, but the difference was trivial.

    I don't know how to do html, so I'm afraid all this is going to come out in one big paragraph. Sorry.


    Why do a simulation when you can easily compute the result over all pairings?


    Let's look at your 10-game data.


    When they score 5 runs, they'll go 7-1-2 (you note that ties are troubling, but you don't say how you handled them; I'll count them as half-a-win, half-a-loss, though I won't make any effort to defend this choice), which I'll read as 7.5-2.5.


    When they score 11 runs, they go 10-0.


    Here's the full table in the format, (runs scored, record when scoring this many runs, number of times scoring this many runs, total result of scoring this many runs):


    (5, 7.5-2.5, 3, 22.5-7.5),


    (11, 10-0, 1, 10-0),


    (4, 6.5-3.5, 2, 13-7),


    (6, 8-2, 1, 8-2),


    (10, 10-0, 1, 10-0),


    (12, 10-0, 1, 10-0),


    (2, 4-6, 1, 4-6),


    The total is 77.5-22.5, so the expected winning percentage is .775, no simulations necessary.
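The all-pairings tally in the comment above is easy to check in Python. This sketch is not the commenter's code; the function name is made up, and ties are counted as half a win, as in the comment. It works because a uniformly random reordering pairs each runs-scored value with each runs-allowed value equally often on average.

```python
def exact_expected_wl(runs_scored, runs_allowed):
    """Expected shuffled winning percentage, averaged over every RS/RA pairing."""
    total = 0.0
    for rs in runs_scored:
        for ra in runs_allowed:
            if rs > ra:
                total += 1.0
            elif rs == ra:
                total += 0.5  # tie: half a win, half a loss
    return total / (len(runs_scored) * len(runs_allowed))


# The ten-game example from the article.
RS = [5, 11, 4, 5, 5, 6, 10, 12, 4, 2]
RA = [4, 5, 1, 3, 1, 1, 7, 7, 3, 0]
expected_wp = exact_expected_wl(RS, RA)
print(expected_wp)  # 77.5 wins out of 100 pairings
```

This gives the average only; the best, worst and percentile figures in the article's tables still need the distribution of simulated seasons, or a separate calculation.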


    You could probably even find a way to compute standard deviations, and thus compile your tables of lucky & unlucky teams.


    I note that 4 of your luckiest teams were McGraw teams (John, not Tug). If most of McGraw's teams were "lucky" then it's just possible you've stumbled on a good measure of what a manager contributes to a team.


    I guess BOS must be the AL team, but if I hadn't seen BSN farther down the page I wouldn't have known. Similarly the existence of SLB implies STL must be the NL team. It's unfortunate that the reader has to figure these things out.


    Good stuff.

    Well, it came out in one big paragraph in preview, but then it came out in separate paragraphs in real life. Go figure.
