The Baseball Analysts: PitchF/X Detective: Has Bradley's Strike Zone Been Widened?

By Dave Allen

Last weekend Milton Bradley claimed that his strike zone had been expanded in retaliation for his early season run-in with umpire Larry Vanover.

Bradley believes his strike zone is being widened, forcing him to chase pitches he normally doesn't swing at or risk being called out on strikes.

Asked if there have been repercussions from Vanover's fellow umpires since the incident, Bradley didn't mince words.

"There always is," he replied. "No matter what, I'm the type of guy [where] I don't care what somebody does to a colleague of mine. I'm not going to treat him any differently. I do things straight up, because I'm a straight-up, honest individual.

"Unfortunately, I just think it's a lot of 'Oh, you did this to my colleague,' or 'We're going to get him any time we can. As soon as he gets two strikes, we're going to call whatever and see what he does. Let's try to ruin Milton Bradley.'

"It's just unfortunate. But I'm going to come out on top. I always do."

This claim was brought to my attention in Craig Calcaterra's ShysterBall blog where he suggested that someone with "PITCHf/x-fu" could check this assertion. I am not 100% sure what "PITCHf/x-fu" is, but I like to think I have it. Either way I thought this was an exciting new application of the pitchf/x data, so I decided to take Craig up on it and see if Bradley's strike zone has been any different this year.

First off we need the smallest bit of background on the strike zone. It is called differently to right- and left-handed batters; the outside edge is extended out a couple inches to lefties. In addition, its size is count-dependent, expanding in hitter's counts and shrinking in pitcher's counts. These two facts make an assessment of Bradley's claims a little tricky. He is a switch hitter so we have to break up the analysis for him as a LHB and as a RHB. And any differences could be the result of differences in the fraction of time he is in hitter's versus pitcher's counts this year compared to the past.

The pitchf/x system was phased in during 2007 and has been operational in every game since, so I am going to compare pitches Bradley took in the part of 2007 covered and all of 2008 to those he took in 2009 thus far (ignoring the count issue temporarily). Here are the pitches he took as a RHB. Remember, the images are from the catcher's perspective, so negative values of x are inside to a RHB and positive values are inside to a LHB. The gray dots are balls and the black dots are called strikes.

There are too few taken pitches in 2009 as a righty to draw a firm conclusion, but it does not look terribly out of whack. There are two called strikes on the inside edge, but right below them are four balls also along the inside edge.

Here are pitches he took as a LHB.

Bradley has way more at-bats as a lefty and thus there are more taken pitches. These additional pitches allowed me to make called strike contours. These contours are closed lines such that a pitch inside the line is a strike 50% of the time or more and a pitch outside the line is a ball 50% of the time or more. Here you can see how the outside edge of the strike zone is shifted farther outside to Bradley as a lefty, as is the case to all LHBs. The inside edges of the pre-2009 and 2009 zones are almost exactly the same. Up and outside the pre-2009 zone is larger, but down and outside the 2009 zone is larger. As a whole the two are almost exactly the same size.

To make this conclusion statistically explicit, and correct for the count, I ran a binomial logistic regression. This is a regression in which the dependent variable only takes two values, in this case 1 if a taken pitch is called a strike and 0 if it is called a ball. The dependent variable is regressed against any number of numeric and/or categorical variables. In effect this binomial logistic model uses these regressors to calculate the probability a taken pitch is called a strike, and tells you which of the regressors are statistically significant in determining that probability. The technique is identical to that taken in my earlier strike zone post, but this time I restrict the analysis to just Bradley's data.

I regressed Bradley's strike/ball taken pitches against the horizontal distance between that pitch and the horizontal middle of zone (with a different middle for Bradley as a LHB and RHB), the vertical distance from that pitch and the vertical middle of zone, the interaction of these two distances, the number of balls and strikes (to control for the count) and a categorical factor of pre-2009 or 2009.

 Binomial Logistic Regression
+-----------------+----------+------------+---------+------------+
|                 | Estimate | Std. Error | z Value |    P(>|z|) |
+-----------------+----------+------------+---------+------------+
| (Intercept)     |    5.995 |      0.370 |   16.21 |  < 2e-16 * |
| x Dist.         |   -0.364 |      0.022 |  -16.37 |  < 2e-16 * |
| y Dist.         |   -0.526 |      0.031 |  -17.48 |  < 2e-16 * |
| x*y Interaction |    0.012 |      0.000 |   13.87 |  < 2e-16 * |
| Num. Strikes    |   -0.897 |      0.178 |   -5.03 |  4.8e-07 * |
| Num. Balls      |    0.251 |      0.085 |    2.96 |    0.003 * |
| 2009            |   -0.023 |      0.217 |   -0.10 |    0.914   |
+-----------------+----------+------------+---------+------------+

Regressors with a negative estimate decrease the likelihood of a pitch being called a strike. So as the x or y distance increases the probability of a strike decreases, as expected. As the number of strikes increases the probability of a strike decreases (the strike zone shrinks in pitcher's counts) and as the number of balls increases the probability of a strike increases (the strike zone expands in hitter's counts). All of these effects are strongly significant and mirror the results for all hitters.
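To make the table concrete, the fitted coefficients can be plugged into the logistic function to get a called-strike probability for any taken pitch. The sketch below is only an illustration: the function name `strike_prob`, the example pitch location, and the assumption that the interaction term is the plain product of the two distances are all assumptions made here. The post does not state the distance units, so the inputs are in whatever units the regression used.

```python
import math

# Coefficients copied from the regression table above.
COEF = {
    "intercept": 5.995,
    "x_dist": -0.364,
    "y_dist": -0.526,
    "xy": 0.012,
    "strikes": -0.897,
    "balls": 0.251,
    "is_2009": -0.023,
}

def strike_prob(x_dist, y_dist, strikes, balls, is_2009):
    """Predicted probability that a taken pitch is called a strike.

    Assumptions (not stated in the post): distances are in the same
    units used in the fit, and the interaction is x_dist * y_dist.
    """
    logit = (COEF["intercept"]
             + COEF["x_dist"] * x_dist
             + COEF["y_dist"] * y_dist
             + COEF["xy"] * x_dist * y_dist
             + COEF["strikes"] * strikes
             + COEF["balls"] * balls
             + COEF["is_2009"] * (1 if is_2009 else 0))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical pitch location, chosen only for illustration.
x, y = 8.0, 6.0
for strikes, balls in [(0, 0), (2, 0), (0, 3)]:
    p = strike_prob(x, y, strikes, balls, is_2009=False)
    print(f"{balls}-{strikes} count: P(called strike) = {p:.3f}")
```

Holding location fixed, adding strikes lowers the predicted probability and adding balls raises it, which is the count effect described above.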

The difference between the pre-2009 and 2009 zone is very slight, and if anything the 2009 zone is slightly smaller. Taken pitches in 2009, correcting for distance and count, are slightly less likely to be strikes. But this effect is nowhere near significant. If the pre-2009 and 2009 zones were truly identical, a difference at least this large would show up by chance alone over 90% of the time. There is no statistical difference between Bradley's zone this year and his zone in 2007 and 2008.
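The "2009" row of the table can be checked by hand. This is a minimal sketch, assuming the z value and p-value come from the usual Wald test (estimate divided by standard error, compared against a standard normal). The post does not say which software produced the table, and the rounded table entries will not reproduce the last digit exactly.

```python
import math

estimate = -0.023   # "2009" coefficient from the table
std_error = 0.217   # its standard error from the table

# Wald statistic: how many standard errors the estimate is from zero.
z = estimate / std_error

# Two-sided p-value under a standard normal: P(|Z| >= |z|).
p_value = math.erfc(abs(z) / math.sqrt(2))

# Multiplicative effect of the 2009 indicator on the odds of a called strike.
odds_ratio = math.exp(estimate)

print(f"z = {z:.2f}, two-sided p = {p_value:.3f}, odds ratio = {odds_ratio:.3f}")
```

An odds ratio this close to 1, with a p-value this large, is what "no statistical difference" means in practice.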

I can understand Bradley was frustrated on Sunday. The Cubs had just lost seven straight games, and in five of those games they scored either zero or one run. He is hitting a meager .196/.322/.373 this season, but he has his decreased BABIP and LD% and increased GB% to blame for it, not the umpires.

Comments

Good stuff, Dave. Sutcliffe last night said that Tex has seen more fastballs since ARod has returned. That is a fairly easy study to do it would seem, can someone attack that one?

Posted by: Joe at May 28, 2009 4:19 AM

Joe, yeah that is very easy to check with the pitchf/x data. It looks to me like before May 8th Tex was thrown 288 fastballs out of 479 pitches, about 60%. Since May 8th he has seen 167 fastballs of 287 pitches, 58%. He has seen fewer fastballs since A-Rod has been back, but the difference is small.
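Whether a two-point drop in fastball share is more than noise can be checked with a pooled two-proportion z-test. This is a sketch added for illustration, not a calculation from the thread. The function name `two_proportion_z` is hypothetical, and the test treats each pitch as an independent draw, which is only an approximation.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Fastball counts quoted in the comment above.
z, p_value = two_proportion_z(288, 479, 167, 287)
print(f"before: {288 / 479:.1%}, after: {167 / 287:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```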

Posted by: Dave Allen at May 28, 2009 9:05 AM

Thanks, Dave. Appreciate you looking for me.

Posted by: Joe at May 28, 2009 9:13 AM

Great stuff. I was hoping someone would use the PitchF/X data to determine if Bradley had a legitimate gripe. Well done, I enjoyed this.

Posted by: Tyler at May 28, 2009 9:59 AM

Excellent, Dave. Too bad it isn't this easy to do a binomial regression on the historical existence of dinosaurs. :P

Posted by: nightfly at May 28, 2009 11:01 AM

The day Bradley complained I said to myself "Oh Milton, the internet will investigate. I can only hope that your perception is accurate."

I happen to like rooting for the guy, so I am disappointed that he made an apparently baseless claim.

Posted by: juan at May 28, 2009 11:17 AM

Could you tell if Teix had more pitches in the strikezone? Or could you show me where to check? Thanks and great work again!

Posted by: bpasinko at May 28, 2009 12:16 PM

bpasinko,

It does look like he has seen more pitches in the zone since Arod's return:

Before May 8th: 219 in the zone of 479 (45.7%)
May 8th and on: 141 in the zone of 287 (49.1%)

Posted by: Dave Allen at May 28, 2009 12:28 PM

This site is quickly becoming my favorite baseball place. I was hoping someone would look at this. Like juan, I expected to find there was something here, I'm heartened to find there isn't. It looks like, over time, tools like this are going to leave the umpires more trusted rather than less. Which can only be good for baseball.

Posted by: LarryinLA at May 28, 2009 2:28 PM

Sorry for the double post, but I'm curious about the contour algorithm. How is it generated? I assume angular slices of some width are taken and then a curve is fit to the strike cumulative-density-function (CDF) and the location where the curve crosses 50% is used. I'm not sure you get the precision implied by your charts. At any rate, beating a dead horse, the plots should give some description of their precision.

Posted by: LarryinLA at May 28, 2009 2:35 PM

LarryinLA,

great point about indicating the level of confidence in the contours. For example, there is little confidence in the location of the up-and-in portion of the 09 contour, because there are very few pitches in that location. On the other hand, we have lots of confidence in the location of the outside portion of the pre-09 contour since there are lots of pitches in that location. I probably should think about making the width of the line proportional to the confidence in its precision.

I sorta feel like I cheated in my method making the contours. Maybe you or someone else can reassure me my method is valid or tell me why it is not and suggest a better one. Here is what I did:

1) Consider the lattice of points 1 inch apart.
2) For each one of these points I found the distance-weighted average probability a pitch within 6 inches was a strike.
3) Now for each of these lattice points I have a guess at the probability of a taken pitch in that location being called a strike.
4) Draw the line that separates lattice points with a value greater than 0.5 from those less than 0.5

There are probably better ways of doing it and I would love to hear them.

Posted by: Dave Allen at May 28, 2009 3:04 PM

Dave,

As usual excellent and exciting post!

Posted by: snowball2 at May 28, 2009 3:26 PM

Dave,

That is a sensible approach, though six inches seems both somewhat arbitrary and also a little large. I don't think you can pinpoint the zone boundaries any more accurately than the 6" cell size you are effectively using with that approach. An adaptive cell size approach might be better. I have to think about it a little more, but I'm pretty certain whatever the approach, the up-and-in portion of the '09 LHB strike zone should be some indeterminate shade of grey.

Any rigorous method here is likely to give depressingly large uncertainty. Better to acknowledge it though.

Posted by: LarryinLA at May 28, 2009 5:40 PM

Another way of looking at the question of whether Milton Bradley's strike zone has been widened might be to compare the percent of out-of-the-rule-book-strike-zone pitches called strikes with Bradley at bat to the percent called when other Cubs players are at bat.

You could form a 2x2 table with columns for Bradley and 'other Cubs' and rows for out-of-strike-zone pitches called strikes and those called balls. You could then calculate an odds ratio to see if OOSZ pitches are more likely to be called strikes against Bradley than it is against other Cubs. You could control for home plate umpire by forming a 2x2 table for each umpire and calculating the common odds ratio over the separate 2x2 tables.
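The 2x2 comparison could be sketched as follows. The counts are invented placeholders, not PITCHf/x data, and pooling the per-umpire tables with the Mantel-Haenszel estimator is an assumption about what "common odds ratio" means here. The function names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a = Bradley OOSZ pitches called strikes, b = Bradley OOSZ called balls,
    c = other Cubs OOSZ called strikes,      d = other Cubs OOSZ called balls.
    """
    return (a * d) / (b * c)

def mantel_haenszel(tables):
    """Common odds ratio across several 2x2 tables (one per umpire)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Made-up counts for two hypothetical umpires, for illustration only.
tables = [
    (6, 44, 30, 270),
    (4, 36, 25, 225),
]
for t in tables:
    print("per-umpire odds ratio:", round(odds_ratio(*t), 3))
print("common odds ratio:", round(mantel_haenszel(tables), 3))
```

An odds ratio above 1 would mean out-of-zone pitches are called strikes more often against Bradley than against his teammates; a value near 1 would mean no difference.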

As for the contour, couldn't you use your logistic regression with only the x and y values and their interaction? Predicted values from the regression (using a range of x and y values with small increments) could be used to plot a response surface (like the 'heat maps' I've seen, perhaps even at this site).

Posted by: Keith Karcher at May 28, 2009 5:47 PM

I only skimmed the article, but perhaps he has more GB's and fewer LD's because they're calling a lower zone on him?

Posted by: dk at May 28, 2009 6:13 PM

Larry

You are definitely right that the choice of six inches is arbitrary, but I don't think that it means the boundaries are only accurate to six inches. I take the distance-weighted average of the pitches within six inches. So a pitch 0.5 in. away from a given lattice point is weighted by 1/exp(0.5) and a pitch 6 in. away is weighted by 1/exp(6). The closer pitch is worth 244 times more in determining that lattice point's value. I think this in effect gives me adaptive cell sizes. The value of lattice points with lots of nearby pitches will be swamped out by the value of those pitches. But lattice points with only far away pitches will take those values.
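The lattice method from the earlier comment, combined with these weights, might look like the sketch below. It is a reconstruction under assumptions, not the code behind the charts: the function name `lattice_strike_prob`, the toy pitches, and the choice to return `None` where no pitch falls within six inches are all assumptions made for illustration.

```python
import math

RADIUS = 6.0  # inches; only pitches within this distance contribute

def lattice_strike_prob(px, py, pitches):
    """Distance-weighted called-strike rate at lattice point (px, py).

    pitches: list of (x, y, is_strike) with locations in inches.
    Each pitch within RADIUS gets weight 1/exp(distance).
    """
    num = den = 0.0
    for x, y, is_strike in pitches:
        d = math.hypot(x - px, y - py)
        if d <= RADIUS:
            w = math.exp(-d)
            num += w * is_strike
            den += w
    return num / den if den > 0 else None

# Toy data, invented for illustration: strikes near the origin, balls farther out.
pitches = [(0, 0, 1), (1, 1, 1), (-1, 2, 1), (9, 0, 0), (10, 1, 0), (11, -1, 0)]

# Lattice of points 1 inch apart; "#" marks an estimate of 0.5 or more,
# "o" an estimate below 0.5, and "." a point with no pitch within RADIUS.
for py in range(3, -4, -1):
    row = ""
    for px in range(-3, 15):
        p = lattice_strike_prob(px, py, pitches)
        row += "." if p is None else ("#" if p >= 0.5 else "o")
    print(row)

# Relative weight of a pitch 0.5 in. away versus one 6 in. away.
print(round(math.exp(-0.5) / math.exp(-6), 1))
```

The boundary between the "#" and "o" cells plays the role of the 50% contour in step 4. Counting the total weight at each lattice point (the `den` value) would be one simple way to flag regions where the contour rests on very few pitches.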

Still I agree with you that the method is suspect and I should show the uncertainty in the contour lines in some way.

Keith,

The only issue with the first suggestion is if Bradley's OOSZ pitches are, on average, closer to or farther from the zone than the rest of the Cubs' OOSZ pitches. Differences in called strike percentage for OOSZ pitches could be because of that rather than the zones being called differently.

There have been heat maps here before and presenting a strike-percentage heat map rather than just the one contour is a great idea!

Posted by: Dave Allen at May 28, 2009 6:22 PM

Dave,

I guess I skipped over the weighting a little too quickly in thinking about the method. I agree, this does give something of an adaptive cell size. I was thinking of sparsely populated regions when I suggested the uncertainty approached the cell size. The uncertainty at a given location is clearly going to be related to the pitch density in the area.

I also had thought plotting heat maps from the regression (and the data) would be interesting. Harder to compare, though a difference map could do that. The slopes/color gradients would give a feel for the uncertainty there.

Posted by: LarryinLA at May 28, 2009 8:38 PM

Thanks Dave,

The whole lineup protection thing really interests me, in regards to Arod and Teix like Joe asked. So he's seeing less fastballs but more pitches in the zone, somewhat contradictory to saying Arod's really helping him out. Is a 3.4% increase in strikes a significant amount, or enough to say Arod plays a large part in Teix's recent performance?

I struggle with this one, not sure if it's an old baseball adage that is proven or proven wrong with statistics.

Posted by: bpasinko at May 29, 2009 7:34 AM

Keep in mind Tex was hurt (wrist) and had been quoted that it was almost completely healed about a week before ARod returned.

Posted by: John at May 29, 2009 9:25 AM

bpasinko,

The probability of seeing 141 or more pitches in the zone of 287 pitches if the underlying in-zone percentage is 45.7% is just over 11%. So the difference between before and after A-Rod's return does not meet the p=0.05 level commonly used in statistics.
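The tail probability can be computed exactly from the binomial distribution. This is a minimal sketch, assuming the pre-May 8th rate (219 of 479) is treated as the known baseline; the helper name `binom_tail` is hypothetical. It prints both the inclusive tail (141 or more) and the strict tail (more than 141), since the two differ by the probability of exactly 141 and are easy to mix up.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k = 287, 141          # pitches since May 8th, and how many were in the zone
baseline = 219 / 479     # in-zone rate before May 8th

p_at_least = binom_tail(k, n, baseline)       # 141 or more
p_more_than = binom_tail(k + 1, n, baseline)  # strictly more than 141

print(f"P(X >= {k}) = {p_at_least:.3f}")
print(f"P(X >  {k}) = {p_more_than:.3f}")
print("below 0.05?", p_at_least < 0.05)
```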

The lineup protection issue is very interesting. I am sure there would be a cool way to look at it with the pitchf/x data.

Posted by: Dave Allen at May 29, 2009 10:39 AM

Just a quick question - if you're looking at taken pitches, isn't that incorrect if Bradley is doing what he says he's doing - being forced to swing at bad pitches? Meaning your populations are not necessarily the same criterion - 2007 - 2008 you're looking at taken pitches. 2009 you're looking at taken pitches, but Bradley is claiming he's now swinging at bad pitches because the umps are calling them strikes. There's obviously not enough data to draw a firm conclusion, but if the umps *were* widening the zone on him early in the season and he's now swinging at bad pitches, that would explain why the zones look the same. You should probably look at swings at pitches outside the zone.

Posted by: Dennis at May 29, 2009 2:44 PM

Good point, Dennis. Here is the percentage of out-of-zone pitches that Bradley swung at:

2007   20.5%
2008   20.8%
2009   18.5%

So Bradley is actually swinging at fewer pitches outside the zone this year than the past two.

Posted by: Dave Allen at May 29, 2009 7:57 PM

Taking the 219 out of 479 and 141 out of 287 figures as samples, I find no statistically significant difference at the 5% level (two tails). I was lazy and used the bootstrap (B = 50,000), so someone using the exact binomial test may get a different result.
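A bootstrap comparison along these lines could be sketched as below. This is not the commenter's code: it uses a percentile interval for the difference in proportions, far fewer resamples (B = 2,000) to keep it quick, and a fixed seed, all of which are assumptions made for illustration. The function name `bootstrap_diff_ci` is hypothetical.

```python
import random

def bootstrap_diff_ci(k1, n1, k2, n2, B=2000, seed=0):
    """Percentile 95% interval for p2 - p1 by resampling each sample."""
    rng = random.Random(seed)
    p1, p2 = k1 / n1, k2 / n2
    diffs = []
    for _ in range(B):
        # Resampling n 0/1 outcomes with replacement is a binomial draw
        # at the observed rate, so each pitch is redrawn independently.
        r1 = sum(rng.random() < p1 for _ in range(n1)) / n1
        r2 = sum(rng.random() < p2 for _ in range(n2)) / n2
        diffs.append(r2 - r1)
    diffs.sort()
    return diffs[int(0.025 * B)], diffs[int(0.975 * B) - 1]

lo, hi = bootstrap_diff_ci(219, 479, 141, 287)
print(f"observed difference: {141 / 287 - 219 / 479:.3f}")
print(f"95% bootstrap interval: ({lo:.3f}, {hi:.3f})")
print("interval contains zero:", lo < 0 < hi)
```

An interval that contains zero corresponds to no significant difference at the 5% level (two tails).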

Posted by: Will Dwinnell at June 5, 2009 4:36 AM