How Can I Get My Hands on the Pitchf/x Data?
I often get emails from my readers here and at FanGraphs asking how they can access the Pitchf/x and batted-ball location data I use in my posts. In the past couple of months a host of new tools have become available online that make the data much more accessible. So in this post I thought I would highlight the online tools for accessing the data, both the new ones and the longstanding ones.
First off, Major League Baseball Advanced Media (MLBAM) releases the GameDay data (pitchf/x, batted ball, boxscore, etc.) every day in .xml files. For the casual fan it is a bit tricky to find these data. Even once you find them, each game has its own series of files, so pulling out all the data by hand would be a Herculean task. And finally, once you have all the data, over a million pitches each with tens of values (start speed, end speed, break, pfx_x, pfx_z, the nine fit parameters,…), it is just too much data to handle in Excel, so a database is necessary.
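To give a feel for what pulling values out of one of these .xml files involves, here is a minimal Python sketch. The sample XML is made up and heavily simplified: the `inning`/`atbat`/`pitch` nesting and the `pitcher`/`batter` attributes are assumptions for illustration, and only the four pitch values named above (start_speed, end_speed, pfx_x, pfx_z) are included. Check the real files for the exact layout before relying on this.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified sample. The real GameDay files carry many more
# attributes per pitch; the nesting and ID values here are assumptions.
SAMPLE_XML = """
<inning num="1">
  <atbat pitcher="100001" batter="200002">
    <pitch start_speed="92.4" end_speed="84.9" pfx_x="-5.1" pfx_z="9.8"/>
    <pitch start_speed="81.3" end_speed="75.0" pfx_x="4.2" pfx_z="-3.6"/>
  </atbat>
</inning>
"""

# The pitch values we want as numbers rather than strings.
NUMERIC_FIELDS = ("start_speed", "end_speed", "pfx_x", "pfx_z")


def parse_pitches(xml_text):
    """Return one dict per <pitch>, tagged with its at-bat's pitcher and batter."""
    root = ET.fromstring(xml_text)
    rows = []
    for atbat in root.iter("atbat"):
        for pitch in atbat.iter("pitch"):
            row = {"pitcher": atbat.get("pitcher"), "batter": atbat.get("batter")}
            for field in NUMERIC_FIELDS:
                value = pitch.get(field)
                # Leave a gap as None rather than guessing a value.
                row[field] = float(value) if value is not None else None
            rows.append(row)
    return rows


pitches = parse_pitches(SAMPLE_XML)
for row in pitches:
    print(row)
```

Multiply this by every file of every game in a season and it becomes clear why the tools below are so welcome.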
So let's look at the online tools that address each of these potential stumbling blocks. First, actually finding the .xml files and making sense of them. The best place for this is Alan Nathan's tutorial. He directs you to the site and then clearly defines each of the values in the pitchf/x data set.
Still, the .xml files might not be of much use to everyone. If you want to look at one pitcher's pitchf/x numbers over the course of a single game, there is a great tool that has been around for a while. Brooks Baseball displays pitch statistics, pitch speed over the course of a pitcher's appearance, a strike zone plot, and a number of pitch identification (movement vs. speed) plots. The site makes it very easy to see, and download, an individual pitcher's data for a single game.
Another easy resource is the pitcher pages at FanGraphs. Each pitcher page has a 'PitchFX' section that, like Brooks Baseball, gives charts for individual games (they do not have the strike zone plots like Brooks but add a release point chart). Beyond the individual game section they have an overview section with the percentage thrown, average velocity, and horizontal and vertical spin deflection for each pitch the pitcher throws. Finally, they have season-long velocity charts for each pitch type. So you can see, for example, how Jon Lester gained speed on his fastball through 2008 and kept those gains in 2009.
Recently two new tools have appeared that let you slice the data a little finer. The F/X tool by TexasLeaguers allows you to split out any pitcher's data by batter handedness, count, and date range. It produces plots similar to those at Brooks (pitch location, horizontal by vertical spin deflection, also release point and pitch trajectory) but for the range of dates considered rather than a single game. In addition it gives results (percent swing, whiff, in play) for each pitch type. This site also has pitch data for batters: the percentage of each pitch type seen and statistics against each of them. For batters it also creates graphs with batted-ball locations and swing/take/called strike zone charts. Again you can split out by pitcher handedness, count, and date range.
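For anyone curious what this kind of split looks like under the hood, here is a rough Python sketch of a swing/whiff/in-play summary by pitch type, filtered by batter handedness. This is not how the TexasLeaguers tool is implemented. The records, the field names (`pitch_type`, `stand`, `result`), and the result labels are all made up for illustration, and the percentages are taken out of all pitches of that type, which is an assumption about the definition.

```python
from collections import defaultdict

# Made-up pitch records; field names and result labels are hypothetical.
PITCHES = [
    {"pitch_type": "FF", "stand": "R", "result": "whiff"},
    {"pitch_type": "FF", "stand": "R", "result": "ball"},
    {"pitch_type": "FF", "stand": "R", "result": "in_play"},
    {"pitch_type": "FF", "stand": "R", "result": "foul"},
    {"pitch_type": "FF", "stand": "L", "result": "ball"},
    {"pitch_type": "SL", "stand": "R", "result": "whiff"},
    {"pitch_type": "SL", "stand": "R", "result": "ball"},
]

# Results that count as the batter having swung.
SWING_RESULTS = {"whiff", "foul", "in_play"}


def summarize(pitches, stand=None):
    """Percent swing, whiff, and in play per pitch type.

    If stand is "L" or "R", only pitches to batters of that handedness count.
    Percentages are out of all pitches of that type (an assumed definition).
    """
    counts = defaultdict(lambda: {"n": 0, "swing": 0, "whiff": 0, "in_play": 0})
    for pitch in pitches:
        if stand is not None and pitch["stand"] != stand:
            continue
        tally = counts[pitch["pitch_type"]]
        tally["n"] += 1
        if pitch["result"] in SWING_RESULTS:
            tally["swing"] += 1
        if pitch["result"] == "whiff":
            tally["whiff"] += 1
        if pitch["result"] == "in_play":
            tally["in_play"] += 1
    return {
        pitch_type: {
            "n": tally["n"],
            "swing_pct": 100 * tally["swing"] / tally["n"],
            "whiff_pct": 100 * tally["whiff"] / tally["n"],
            "in_play_pct": 100 * tally["in_play"] / tally["n"],
        }
        for pitch_type, tally in counts.items()
    }


vs_righties = summarize(PITCHES, stand="R")
print(vs_righties)
```

Splitting by count or date range would work the same way: add the field to each record and one more condition to the filter.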
But if you would rather get the data into Excel and create your own charts or do your own statistics, you can use Joe Lefkowitz's pitchf/x tool. Here you can slice and dice the data in innumerable ways (pitcher, batter, pitching team, batting team, umpire, date, pitch type, runners on …) and then choose which pitchf/x numbers you want spit out into an Excel file.
Another new tool for viewing the batted-ball data (whose locations come from MLBAM's GameDay), including the ability to overlay an individual player's or park's locations on a different park's outline, can be found here. Peter Jensen showed that these batted-ball locations are not terribly out of line with those from BIS and STATS, which, unlike MLB's, are not free. But that does not mean we should take them as gospel. There is a great discussion of the limitations of this type of data overlay over at the Book Blog; particularly germane are the concerns of Nick Steiner and Greg Rybarczyk. Still, it is a very cool site that promises more in the future.
Getting the Raw Data
Still, some people are going to want even more unfettered access to the data, and if that is you, you will most likely need computer skills beyond the ability to use Excel and a web browser. If so, you could head over to Darrell Zimmerman's Pitchf/x database. It is in MySQL format (MySQL is a very popular open source database system). This way you get all the data in a nice database without having to scrape it off MLBAM's site yourself.
Still, if you want to have the data updated daily you need to scrape it yourself. That brings us to Mike Fast's instructions for scraping the data with a perl script and then getting it into a MySQL database. This incredibly helpful set of instructions has been around since almost the beginning of the pitchf/x era and has helped many current analysts, including this one, get access to the data. Nick Steiner used them as a guide to show how to do it on a Mac.
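To show why a database is worth the trouble once the data are scraped, here is a small Python sketch. It is not Mike Fast's perl script, and it swaps in SQLite (through Python's built-in sqlite3 module) for MySQL so that it runs with no server to set up. The table layout, the pitcher IDs, and the numbers are all made up for illustration.

```python
import sqlite3

# Made-up rows: (game_date, pitcher, start_speed, end_speed, pfx_x, pfx_z).
ROWS = [
    ("2009-06-01", "100001", 92.0, 84.5, -5.1, 9.8),
    ("2009-06-01", "100001", 94.0, 86.0, -4.7, 10.2),
    ("2009-06-02", "100002", 88.0, 81.0, 3.9, 6.4),
]


def build_db(rows):
    """Create an in-memory SQLite database holding a small pitches table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pitches ("
        "game_date TEXT, pitcher TEXT, start_speed REAL, "
        "end_speed REAL, pfx_x REAL, pfx_z REAL)"
    )
    # Parameterized inserts: one row per pitch.
    conn.executemany("INSERT INTO pitches VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def average_start_speed(conn):
    """Return (pitcher, number of pitches, average start speed) per pitcher."""
    cursor = conn.execute(
        "SELECT pitcher, COUNT(*), AVG(start_speed) "
        "FROM pitches GROUP BY pitcher ORDER BY pitcher"
    )
    return cursor.fetchall()


conn = build_db(ROWS)
summary = average_start_speed(conn)
for pitcher, n_pitches, avg_speed in summary:
    print(pitcher, n_pitches, avg_speed)
```

With a full season loaded, the same kind of one-line query replaces what would be hours of copying and pasting in a spreadsheet. A daily update would amount to running the scrape and the insert step for each new day's games.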
Finally, as of just days ago Josh Hermsmeyer, who brought us the injury database, has a pitchf/x and MiLB data extractor for Mac users. The extractor is built on PHP rather than perl and has a GUI that probably makes it easier to use than command-line based systems. I have not tried it yet, but it looks great to me and I would love to hear how it works.
Anyway, I hope that helps. If there are any other tools I am missing, please mention them in the comments, and if I have incorrectly stated what one of these data sources offers, please email me or tell me in the comments so I can correct it.