Showing posts with label baseball. Show all posts

Tuesday, September 1, 2015

Simulating a baseball game

As many of you know, I'm an avid baseball fan. Whenever I come across a topic involving Operations Research and baseball, I can't simply ignore it. A while ago, I discovered this website that analyzes the strategies of a team throughout a baseball game. It's an interesting approach that uses dynamic programming to maximize the probability of winning the overall game: it breaks each inning into the situations that are possible and then picks the optimal strategy for each situation.
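To give a feel for the dynamic programming idea, here is a toy sketch in Python. It is not the website's MATLAB code or my R translation, and it rests on loud simplifying assumptions: it values a single half-inning by expected runs rather than the whole game by win probability, every batter is identical, and every runner advances exactly as many bases as the batter. The function names (`step`, `expected_runs`), the strategy names, and all probabilities are made up for illustration.

```python
from itertools import product

# Bases gained by the batter on each hit. ASSUMPTION: every runner
# advances exactly as many bases as the batter does.
ADVANCE = {"single": 1, "double": 2, "triple": 3, "homer": 4}


def step(outs, bases, outcome):
    """Return (next_outs, next_bases, runs) after one plate appearance.

    bases is a tuple (first, second, third) of 0/1 flags.
    """
    if outcome == "out":
        return outs + 1, bases, 0
    if outcome == "walk":
        # Only forced runners move on a walk.
        first, second, third = bases
        runs = 1 if all(bases) else 0
        new_third = 1 if (third or (first and second)) else 0
        new_second = 1 if (second or first) else 0
        return outs, (1, new_second, new_third), runs
    n = ADVANCE[outcome]
    # Batter starts at position 0, runners at positions 1..3; 4+ means home.
    positions = [0] + [i + 1 for i, occupied in enumerate(bases) if occupied]
    moved = [p + n for p in positions]
    runs = sum(1 for p in moved if p >= 4)
    new_bases = tuple(1 if b in moved else 0 for b in (1, 2, 3))
    return outs, new_bases, runs


def expected_runs(strategies, tol=1e-10, max_iter=10_000):
    """Value iteration over (outs, bases) states of one half-inning.

    strategies maps a strategy name to {outcome: probability}.
    Returns (value, best): expected runs and the best strategy per state.
    """
    states = [(o, b) for o in range(3) for b in product((0, 1), repeat=3)]
    value = {s: 0.0 for s in states}
    best = {s: None for s in states}
    for _ in range(max_iter):
        delta = 0.0
        for outs, bases in states:
            scores = {}
            for name, probs in strategies.items():
                total = 0.0
                for outcome, p in probs.items():
                    n_outs, n_bases, runs = step(outs, bases, outcome)
                    # Three outs ends the half-inning, so nothing more scores.
                    future = value[(n_outs, n_bases)] if n_outs < 3 else 0.0
                    total += p * (runs + future)
                scores[name] = total
            choice = max(scores, key=scores.get)
            delta = max(delta, abs(scores[choice] - value[(outs, bases)]))
            value[(outs, bases)] = scores[choice]
            best[(outs, bases)] = choice
        if delta < tol:
            break
    return value, best


# Hypothetical outcome probabilities for two made-up strategies.
strategies = {
    "swing_away": {"out": 0.68, "walk": 0.08, "single": 0.15,
                   "double": 0.05, "triple": 0.01, "homer": 0.03},
    "patient": {"out": 0.70, "walk": 0.14, "single": 0.12,
                "double": 0.03, "triple": 0.00, "homer": 0.01},
}
value, best = expected_runs(strategies)
start = (0, (0, 0, 0))  # no outs, bases empty
print(round(value[start], 3), best[start])
```

A full treatment would replace expected runs with win probability and carry the score and inning in the state, but the structure stays the same: enumerate the situations, then pick the strategy with the best value in each one.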

My first reaction was, "Cool! I want to play with it!" But wait, it's in MATLAB?! I don't have MATLAB at home. I had a couple of options: I could try to get a copy of MATLAB (which is not very cheap) or translate the code to a language I had access to on my laptop. I thought about R, Python, or potentially learning Octave, but eventually sided with R. Below are links to the GitHub repository that has the translated code in R. Note that this version is strictly a translation and has not been optimized for the R language.

This is really a work in progress since I'd like to build it out a bit. Here's what I see as future work I hope to accomplish on this problem:
  1. Optimize for the R language and decrease overall runtime for a particular run.
  2. Adjust for a team of different players (currently assumes the same player at all nine hitting spots). 
  3. Create a nice way to visualize the results.
  4. Link to more recent data. 
    1. Through the Lahman package.
    2. Through "live" data.
  5. Add in ability to manipulate lineup order and compare the pros and cons of various strategies.
  6. Add in ability for lineup substitutions.
Links:
GitHub repository
Original Problem website

Friday, March 21, 2014

Similar Players in MLB: Comparison against Baseball-Reference

As many of you are aware, Baseball-Reference calculates a similarity score for each player against other players. Although this is an accepted way to calculate the similarity between two players, I wanted to see how my methodology compares. I ran my methodology for Hank Aaron (a player we all know, which makes the comparisons easy to judge) and compared the list I got against the list Baseball-Reference posted. First, let's look at what Baseball-Reference has:
  1. Willie Mays (782) 
  2. Barry Bonds (748) 
  3. Frank Robinson (667) 
  4. Stan Musial (666) 
  5. Babe Ruth (645) 
  6. Ken Griffey (629) 
  7. Carl Yastrzemski (627) 
  8. Rafael Palmeiro (611) 
  9. Alex Rodriguez (610) 
  10. Mel Ott (602) 
I don't think anyone would argue with any of these players. However, what's the list of players I came up with? Well, here they are:
  1. Willie Mays 
  2. Frank Robinson 
  3. Al Kaline 
  4. Ernie Banks 
  5. Billy Williams 
  6. Brooks Robinson 
  7. Roberto Clemente 
  8. Ken Boyer 
  9. Norm Cash 
  10. Carl Yastrzemski 
What's interesting about my list is that it certainly includes players who don't seem comparable to Hank Aaron. The question then becomes, how did they make it here? Looking quickly at the numbers, you can see that I included more statistics for comparison than Baseball-Reference. In addition, I used a weighting scheme for comparing various statistics. Here's the full list of comparisons. Note that the lower the number, the closer the players are in comparison.


So which one is right? I think it's easy to say that Baseball-Reference seems more accurate, but I am continuously looking to improve this methodology and see how that impacts the results. Stay tuned for the final version of the code and methodology.

Thursday, March 20, 2014

Similar Players in MLB: Calculating Similarity

As I mentioned in Similar Players in MLB, I want to be able to see how similar players can be. I decided to take a somewhat different approach by looking at how a player's career compares against another player's career. In order to put things in terms of a career, I didn't want to simply sum up their statistics or normalize given the number of years they played. I wanted to be able to compare a player's second year against another player's second year. This isn't a simple problem though.

The first part of the problem: what statistics do you use to compare two different players? Using the Lahman Database, I had easy access to the common statistics like games played, at bats, runs scored, hits, doubles, triples, etc. However, this database doesn't simply hold the counting stats by year. It is a compiled record of stats by league and team. In order to simplify the data collection, I aggregated the information up to the combination of player and year.
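The aggregation step can be sketched in a few lines of Python. This is only an illustration, not the code I ran: the sample rows are invented, the column names (`playerID`, `yearID`, `stint`, `teamID`, `G`, `AB`, `H`) follow the Lahman Batting table, and the helper name `aggregate_by_player_year` is hypothetical.

```python
from collections import defaultdict

# Invented sample rows shaped like the Lahman Batting table, where a player
# traded mid-season has one row per stint (team) within the same year.
rows = [
    {"playerID": "playerA", "yearID": 2001, "stint": 1, "teamID": "AAA",
     "G": 60, "AB": 200, "H": 55},
    {"playerID": "playerA", "yearID": 2001, "stint": 2, "teamID": "BBB",
     "G": 40, "AB": 150, "H": 40},
    {"playerID": "playerA", "yearID": 2002, "stint": 1, "teamID": "BBB",
     "G": 150, "AB": 580, "H": 170},
]

COUNTING_STATS = ("G", "AB", "H")


def aggregate_by_player_year(rows, stats=COUNTING_STATS):
    """Sum counting stats over all stints for each (playerID, yearID) pair."""
    totals = defaultdict(lambda: dict.fromkeys(stats, 0))
    for row in rows:
        key = (row["playerID"], row["yearID"])
        for stat in stats:
            totals[key][stat] += row[stat]
    return dict(totals)


for key, stats in sorted(aggregate_by_player_year(rows).items()):
    print(key, stats)
```

Summing only makes sense for counting stats; rate stats such as batting average would need to be recomputed from the summed totals.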

The second part of the problem: how accurate is it to compare a player's first year against another player's first year? Is it possible to correct for a player being sent to the majors a little early or taking a couple of years to develop? This means being able to compare a player's first year against another player's first, second, or perhaps third year to determine the closest match. How do you adjust for a player's third year against a player's first year? While I use SAS at work, I don't have access to its functions at home. I noticed that SAS's PROC SIMILARITY can calculate the minimum distance between two time series. Consider the example below of two players' games played.



Note that they're not exactly the same, but you can see the similarity between the two. Using dynamic programming, you can find the minimum distance between these two time series. The distance between two time series can be a simple Euclidean distance or something slightly different. Since I don't have access to SAS at home, I was lucky to find a package in R that has this capability. Using the "dtw" package, you can easily calculate the distance between two different time series. Applying this package to the above example gives you the results below. This three-way plot shows each player's data in the margins and how each point maps. The closer the plotted path is to the line Y=X, the more closely one series maps onto the other.
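For a rough idea of what the dynamic program does under the hood, here is a textbook dynamic time warping recurrence in Python. It is not the code of the "dtw" package, whose default step pattern and normalization may differ, so the numbers need not match what the package reports. The local cost is assumed to be the absolute difference, and the function name `dtw_distance` and the two sample series are made up.

```python
def dtw_distance(a, b):
    """Return (distance, path) aligning series a and b.

    cost[i][j] is the cheapest cumulative cost of aligning the first i
    points of a with the first j points of b. path lists the matched
    (index_in_a, index_in_b) pairs from start to end.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # both series advance
                cost[i - 1][j],      # only a advances
                cost[i][j - 1],      # only b advances
            )
    # Walk back from the end to recover which points were matched.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        i, j = min(
            ((i - 1, j - 1), (i - 1, j), (i, j - 1)),
            key=lambda ij: cost[ij[0]][ij[1]],
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


# Made-up games-played series for two hypothetical players; the second
# player takes an extra season to become a regular.
player_1 = [45, 130, 155, 150, 148]
player_2 = [20, 50, 128, 152, 151, 149]
distance, path = dtw_distance(player_1, player_2)
print(distance)
print(path)
```

The recovered path is what the three-way plot draws: a path hugging the diagonal means the seasons line up one for one, while horizontal or vertical runs mean one season of one player is matched to several seasons of the other.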



The next problem is how to use all of this similarity to figure out what players are actually similar to each other. Keep an eye out for my next blog on using Social Network Analysis to cluster these players together and incorporating more than just Games Played.

Wednesday, March 19, 2014

Similar Players in MLB

Comparing baseball players against each other is a common exercise. The question is, what do you actually compare? Their playing styles? Positions they played? Teams they played for? Eras they played in? There are several dimensions along which this problem's complexity increases dramatically. In fact, several people (like Mark Saxon) now try to compare Yasiel Puig against Mike Trout. However, how would you compare them?

I'm interested in developing an analytical technique that will remove the manual labor of looking at statistics. With that tedious task removed, it would be interesting to see how players' careers compare against one another. I'm hoping that in the end, I can use this as a significant factor in predicting whether a player will end up in the Hall of Fame.

More to come on this topic, but here's a little tease:
  • Data sourced from Sean Lahman
  • Techniques include: 
    • Dynamic Programming 
    • Social Network Analysis 
    • Logistic Regression 
  • Programmed entirely in R and RStudio