F/X Visualizations
March 05, 2010
How Can I Get My Hands on the Pitchf/x Data?
By Dave Allen

I often get emails from readers here and at FanGraphs asking how they can access the Pitchf/x and batted-ball location data I use in my posts. In the past couple of months a host of new tools have become available online that make the data much more accessible. So in this post I thought I would highlight the online tools for accessing the data, both the new ones and the longstanding ones.

The Basics

First off, Major League Baseball Advanced Media (MLBAM) releases the GameDay data (pitchf/x, batted ball, boxscore, etc.) every day in .xml files. For the casual fan these data are a bit tricky to find. Even once you find them, each game has its own series of files, so pulling out all the data by hand would be a Herculean task. And finally, once you have all the data, you are looking at over a million pitches, each with tens of values (start speed, end speed, break, pfx_x, pfx_z, the nine fit parameters, …). That is just too much data to handle in Excel, so a database is necessary.

So let's look at the online tools that address each of these potential stumbling blocks. The first is actually finding the .xml files and making sense of them. The best place for this is Alan Nathan's tutorial. He directs you to the site and then clearly defines each of the values in the pitchf/x data set.
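To give a feel for what "making sense of" one of these files involves, here is a minimal sketch in Python that pulls a few of the values named above out of a pitch record. The XML fragment is made up for illustration: the element layout and the `pitcher` and `batter` attributes are assumptions about the file's shape, and the numbers are not real pitches. Only `start_speed`, `end_speed`, `pfx_x` and `pfx_z` are names taken from the data set described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a per-game file. The layout and the
# values are assumptions for illustration, not real GameDay output.
SAMPLE_XML = """
<inning num="1">
  <atbat pitcher="111" batter="222">
    <pitch start_speed="92.4" end_speed="84.1" pfx_x="-5.2" pfx_z="9.8"/>
    <pitch start_speed="78.9" end_speed="72.3" pfx_x="4.1" pfx_z="-6.0"/>
  </atbat>
</inning>
"""

# Numeric pitch attributes to extract (names as listed in the post).
FIELDS = ("start_speed", "end_speed", "pfx_x", "pfx_z")


def parse_pitches(xml_text):
    """Return one dict per <pitch> element, with numeric fields as floats."""
    root = ET.fromstring(xml_text.strip())
    pitches = []
    for atbat in root.iter("atbat"):
        for pitch in atbat.iter("pitch"):
            row = {
                "pitcher": atbat.get("pitcher"),
                "batter": atbat.get("batter"),
            }
            for name in FIELDS:
                value = pitch.get(name)
                # Keep None when an attribute is missing from a pitch.
                row[name] = float(value) if value is not None else None
            pitches.append(row)
    return pitches


pitches = parse_pitches(SAMPLE_XML)
for p in pitches:
    print(p)
```

Doing this for every file of every game is where the by-hand approach breaks down, which is why the tools below exist.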

Web Tools

Still, the raw .xml files might not be of much use to everyone. If you want to look at one pitcher's pitchf/x numbers over the course of a single game, there is a great tool that has been around for a while. Brooks Baseball displays pitch statistics, pitch speed over the course of a pitcher's appearance, a strike zone plot, and a number of pitch identification (movement vs. speed) plots. The site makes it very easy to see, and download, an individual pitcher's data for a single game.

Another easy resource is the pitcher pages at FanGraphs. Each pitcher page has a 'PitchFX' section that, like Brooks Baseball, gives charts for individual games (they do not have the strike zone plots like Brooks, but they add a release point chart). Beyond the individual game section they have an overview section with the percentage thrown, average velocity, and horizontal and vertical spin deflection for each pitch the pitcher throws. Finally, they have season-long velocity charts for each pitch type. So you can see, for example, how Jon Lester gained speed on his fastball through 2008 and kept those gains in 2009.

Recently two new tools have appeared that allow you to slice the data a little finer. The F/X tool by TexasLeaguers allows you to split out any pitcher's data by batter handedness, count, and date range. It produces plots similar to Brooks's (pitch location, horizontal by vertical spin deflection, plus release point and pitch trajectory), but for the range of dates considered rather than a single game. In addition it gives results (percent swing, whiff, in play) for each pitch type. This site also has pitch data for batters: the percentage of each pitch type seen and statistics against each of them. For batters it also creates graphs with batted-ball locations and swing/take/called strike zone charts. Again, you can split out by pitcher handedness, count, and date range.
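For anyone curious what a split like this amounts to, here is a rough Python sketch of computing swing, whiff and in-play percentages per pitch type, with an optional batter-handedness filter. It is not how any of these sites do it. The rows and the result labels are made up, and each rate is taken as a share of all pitches of that type, which is an assumption (a whiff rate could just as well be per swing).

```python
from collections import defaultdict

# Made-up rows for illustration: (pitch_type, batter_hand, result).
# The result labels are hypothetical, not MLBAM's own descriptions.
PITCHES = [
    ("FF", "R", "called_strike"),
    ("FF", "R", "whiff"),
    ("FF", "L", "ball"),
    ("FF", "L", "in_play"),
    ("SL", "R", "whiff"),
    ("SL", "R", "foul"),
    ("SL", "L", "ball"),
    ("SL", "R", "whiff"),
]

# Results that count as the batter swinging.
SWINGS = {"whiff", "foul", "in_play"}


def rates_by_pitch_type(pitches, batter_hand=None):
    """Return swing/whiff/in-play rates per pitch type, as shares of pitches."""
    counts = defaultdict(lambda: {"n": 0, "swing": 0, "whiff": 0, "in_play": 0})
    for pitch_type, hand, result in pitches:
        # Skip pitches to the other side when a handedness split is requested.
        if batter_hand is not None and hand != batter_hand:
            continue
        c = counts[pitch_type]
        c["n"] += 1
        c["swing"] += result in SWINGS
        c["whiff"] += result == "whiff"
        c["in_play"] += result == "in_play"
    return {
        pitch_type: {k: c[k] / c["n"] for k in ("swing", "whiff", "in_play")}
        for pitch_type, c in counts.items()
    }


all_rates = rates_by_pitch_type(PITCHES)
vs_right = rates_by_pitch_type(PITCHES, batter_hand="R")
print(all_rates)
print(vs_right)
```

Adding count or date range as further filters would follow the same pattern as the handedness check.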

But if you would rather get the data into Excel and create your own charts or do your own statistics, you can use Joe Lefkowitz's pitchf/x tool. Here you can slice and dice the data in innumerable ways (pitcher, batter, pitching team, batting team, umpire, date, pitch type, runners on …) and then choose which pitchf/x numbers you want spit out into an Excel file.

Another new tool for viewing the batted-ball data (whose locations come from MLBAM's GameDay), including the ability to overlay an individual player's or park's locations on a different park's outline, can be found here. Peter Jensen showed that these batted-ball locations are not terribly out of line with those from BIS and STATS, which unlike MLB's are not free. But that does not mean we should take them as gospel. There is a great discussion of the limitations of this type of overlaying of data over at the Book Blog; particularly germane are the concerns of Nick Steiner and Greg Rybarczyk. Still, it is a very cool site that promises more in the future.

Getting the Raw Data

Still, some people are going to want even more unfettered access to the data. If that is you, you will most likely need computer skills beyond the ability to use Excel and a web browser. If you have them, you could head over to Darrell Zimmerman's Pitchf/x database. It is in MySQL format (MySQL is a very popular open source database system). This way you get all the data in a nice database without having to scrape it off MLBAM's site yourself.

Still, if you want to have the data updated daily, you need to scrape it for yourself. That brings us to Mike Fast's instructions for scraping the data using a Perl script and then getting it into a MySQL database. This incredibly helpful set of instructions has been around since almost the beginning of the pitchf/x era and has helped many current analysts, including this one, get access to the data. Nick Steiner used them as a guide to show how to do it with a Mac.
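To show the "get it into a database" step in miniature, here is a sketch in Python. It swaps in SQLite (through Python's built-in `sqlite3` module) for MySQL, only so that the example runs without a database server. It is not Mike Fast's script, and the table layout, column names and rows are all made up for illustration.

```python
import sqlite3

# Hypothetical parsed rows: (game_date, pitcher, start_speed, end_speed,
# pfx_x, pfx_z). The values are made up, not real pitches.
ROWS = [
    ("2010-03-05", "111", 92.4, 84.1, -5.2, 9.8),
    ("2010-03-05", "111", 91.6, 83.5, -4.9, 10.1),
    ("2010-03-05", "333", 78.9, 72.3, 4.1, -6.0),
]


def load_pitches(conn, rows):
    """Create the (assumed) pitches table if needed and insert the rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pitches (
               game_date   TEXT,
               pitcher     TEXT,
               start_speed REAL,
               end_speed   REAL,
               pfx_x       REAL,
               pfx_z       REAL
           )"""
    )
    # Parameterized inserts avoid building SQL strings by hand.
    conn.executemany("INSERT INTO pitches VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


# An in-memory database keeps the sketch self-contained; a file path
# would make the data persist between daily runs.
conn = sqlite3.connect(":memory:")
load_pitches(conn, ROWS)

avg_speed = conn.execute(
    "SELECT pitcher, AVG(start_speed) FROM pitches "
    "GROUP BY pitcher ORDER BY pitcher"
).fetchall()
print(avg_speed)
```

Once the pitches are in a table, questions like average velocity by pitcher become one-line queries, which is the whole appeal of the database route.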

Finally, as of just days ago, Josh Hermsmeyer, who brought us the injury database, has a pitchf/x and MILB data extractor for Mac users. The extractor is built on PHP rather than Perl and has a GUI that probably makes it easier to use than command-line based systems. I have not tried it yet, but it looks great to me and I would love to hear how it works.

Anyway, I hope that helps. If there are any other tools I am missing, please mention them in the comments. And if I have incorrectly stated what one of these data sources offers, please email me or tell me in the comments so I can correct it.

Comments

I've done some work with the MLBAM XML feeds and created my own little boxscore site at http://boxscore-junkie.appspot.com

I'm working on some other tools to have a crack at the pitch database, but am looking for some user requirements in helping me stay on a path or direction with my features (otherwise I keep meandering around making little bits here and there). I'd be happy to share the database file with people who are interested.

If anyone has questions about the Boxscore Junkie website, let me know. It is coded in python and leverages the Google App Engine.

Dave,

The link to the extractor using my name as an anchor is to a page that doesn't exist.

The tool url is here: http://blog.rotobase.com/2010/03/rotobase-pitch-fx-and-milb-data-extractor/

Thanks for the link love.

Also the link to Mike Fast's page is broken as well. :-(

Hey Dave,

A while back at BtB, I kinda sorta wrote (ripped off Mike Fast) a tutorial for a Mac. Mike also popped up in the comments with a lot of helpful suggestions and common troubleshooting errors, as well as posting all of his scripts, so people don't have to mess around with changing Adler's.

http://www.beyondtheboxscore.com/2009/8/19/994666/saberizing-a-mac-4-pitch-f-x

Thanks for this post BTW. The more people who can process through the data, the better Pitch f/x analysis will get (especially if it forces lazy people like me to write an article already so studes doesn't fire me).

Josh,

Thanks for pointing this out, ugh, what I had was a total mess. The problem with Mike's link actually led to the sentence that appears now about Nick's pitchf/x-ing the Mac getting left out. So Nick, your post was definitely intended to be in the piece if I had not screwed up the link.

Anyway I think it is all okay now. Nick and Josh thanks for stopping by and keep up your great work.

Greg,

That looks very cool. We can also use that as a further PSA that two spring training parks, Surprise (Rangers & Royals) & Peoria (Padres & Mariners), have the pitchf/x system installed. So going to Greg's page for the game today between the Padres and Mariners, you can see the pitchf/x numbers for each pitch.

Nick,

Were you ever able to get Perl parsing the xml scripts on your Mac?

I tried for a week or two to get that thing running but the DBI plug-in for Perl is complete pants on a Mac. Backflips included installing MacPorts, recompiling from source, having 2 instances of Perl installed on my machine and fooling around with trying to figure out which one was getting called from the command line.

Made me want to take my toaster in the bath.

If you ever did get it going I'd love to see a follow up post. Getting Perl talking to mysql is the Mac holy grail.

Did you try installing via CPAN instead? Once DBI is installed, all should be good. Can also use sqlite if you want. Much simpler to install and use than MySQL.

For parsing XML with Perl, my favorite module has always been XML::Simple.

Josh,

I was able to get the parsing scripts working on the Mac, about 5 months ago. I had tried to deal with the stupid defunct DBI:SQL for like 2 weeks, but I found out it was broken. I got an email from a BtB reader, and he told me in very distinct steps how to use XML::Simple to parse the data.

I may do a follow up post, although with all of the new sources out there it might not be worth it...

Dave,

I usually have the previous day's data up in the pitchf/x MySQL database. Sometimes the automatic process errors out, but daily during the season I automatically get the XML files, import the data into MySQL, and then export it for download.


On a side note, thanks for your presentation on creating contour and "heat map" graphs to display PITCHf/x data. It has helped me in my quest to make run value heat maps available to everyone, based on the article "More Run Values" by Joe P. Sheehan.

Darrell

BBOS can get Pitch F/X and all game related data in MySQL on a daily basis if that is your goal.

http://sourceforge.net/projects/baseballonastic/

Also, if one did want to pursue a Perl version, the old versions of BBOS ran in Perl using DBI and MySQL, but the newer versions are in Python. If you are able to install Python and MySQL on your computer, then you can have the pitch f/x data updated daily with minimal effort. If anyone is interested in improving an existing and functional pitch f/x reader, feel free to contact me. It seems like people are reinventing the download process left and right. There is also a pitch f/x webservice that may make the retrieval even simpler, if you have followed Wells Oliver's work.