Showing posts with label dynamic programming. Show all posts

Tuesday, September 1, 2015

Simulating a baseball game

As many of you know, I'm an avid baseball fan, so whenever I come across a topic that combines Operations Research and baseball, I can't ignore it. A while ago, I discovered this website that analyzes a team's strategies throughout a baseball game. It's an interesting approach: it uses dynamic programming to maximize the probability of winning the game by breaking each inning into its possible situations and then picking the optimal strategy for each one.
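To make the idea concrete, here is a minimal sketch in Python (not the original MATLAB, and not my R translation). It is deliberately simplified: it maximizes expected runs in a single half-inning rather than the probability of winning the whole game, it assumes the same batter at every plate appearance, and the two strategies, their outcome probabilities, and the runner-advancement rules are all made up for illustration.

```python
from itertools import product

# Hypothetical outcome probabilities per plate appearance for two made-up
# strategies; these numbers are illustrative, not taken from real data.
STRATEGIES = {
    "swing": {"out": 0.68, "walk": 0.08, "single": 0.16, "double": 0.05, "homer": 0.03},
    "bunt": {"sacrifice": 0.70, "out": 0.20, "single": 0.10},
}

BASES = list(product((0, 1), repeat=3))  # (first, second, third) occupancy


def advance(outs, bases, event):
    """Return (new_outs, new_bases, runs) under simplified runner rules."""
    b1, b2, b3 = bases
    if event == "out":
        return outs + 1, bases, 0
    if event == "sacrifice":  # batter is out, every runner moves up one base
        runs = b3 if outs + 1 < 3 else 0  # no run scores on the third out
        return outs + 1, (0, b1, b2), runs
    if event == "walk":  # only forced runners advance
        if b1 and b2 and b3:
            return outs, bases, 1
        if b1 and b2:
            return outs, (1, 1, 1), 0
        if b1:
            return outs, (1, 1, b3), 0
        return outs, (1, b2, b3), 0
    if event == "single":  # assumption: every runner advances exactly one base
        return outs, (1, b1, b2), b3
    if event == "double":  # assumption: every runner advances exactly two bases
        return outs, (0, 1, b1), b2 + b3
    if event == "homer":
        return outs, (0, 0, 0), 1 + b1 + b2 + b3
    raise ValueError(f"unknown event: {event}")


def solve_half_inning(strategies, tol=1e-10, max_iter=100_000):
    """Value iteration over (outs, bases) states; three outs is worth zero runs."""
    states = [(o, b) for o in range(3) for b in BASES]
    value = {s: 0.0 for s in states}
    policy = {}
    for _ in range(max_iter):
        delta = 0.0
        for outs, bases in states:
            best_name, best_val = None, float("-inf")
            for name, probs in strategies.items():
                total = 0.0
                for event, p in probs.items():
                    n_outs, n_bases, runs = advance(outs, bases, event)
                    future = value[(n_outs, n_bases)] if n_outs < 3 else 0.0
                    total += p * (runs + future)
                if total > best_val:
                    best_name, best_val = name, total
            delta = max(delta, abs(best_val - value[(outs, bases)]))
            value[(outs, bases)] = best_val
            policy[(outs, bases)] = best_name
        if delta < tol:
            break
    return value, policy
```

Calling solve_half_inning(STRATEGIES) returns the expected runs and the chosen strategy for each of the 24 base-out states. Repeated sweeps are needed because some plays leave the state unchanged (a bases-loaded walk, for example), so a single backward pass is not enough.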

My first reaction was, "Cool! I want to play with it!" But wait, it's in MATLAB?! I don't have MATLAB at home. I had a couple of options: get a copy of MATLAB (which is not cheap) or translate the code to a language I had access to on my laptop. I thought about R, Python, or potentially learning Octave, but eventually settled on R. Below are links to the GitHub repository with the code translated to R. Note that this version is strictly a translation and has not been optimized for the R language.

This is really a work in progress since I'd like to build it out a bit. Here's what I see as future work I hope to accomplish on this problem:
  1. Optimize for the R language and decrease overall runtime for a particular run.
  2. Adjust for a team of different players (currently assumes the same player at all nine hitting spots). 
  3. Create a nice way to visualize the results.
  4. Link to more recent data. 
    1. Through the Lahman package.
    2. Through "live" data.
  5. Add in ability to manipulate lineup order and be able to compare pros and cons behind various strategies.
  6. Add in ability for lineup substitutions.
Links:
GitHub repository
Original Problem website

Thursday, March 20, 2014

Similar Players in MLB: Calculating Similarity

As I mentioned in Similar Players in MLB, I want to be able to see how similar players can be. I decided to take a somewhat different approach by looking at how one player's career compares against another's. To put things in terms of a career, I didn't want to simply sum up each player's statistics or normalize by the number of years they played. I wanted to be able to compare a player's second year against another player's second year. This isn't a simple problem, though.

The first part of the problem: what statistics do you use to compare two different players? Using the Lahman Database, I had easy access to common statistics like games played, at bats, runs scored, hits, doubles, triples, etc. However, this database doesn't just have one line of counting stats per player per year. It is a compiled record of stats by league and team, so a player who changed teams mid-season has more than one row for that year. To simplify the data collection, I aggregated the information up to the combination of player and year.
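The aggregation step can be sketched as follows. This is plain Python rather than the R I actually used, and the sample rows and player IDs are invented; only the column names (playerID, yearID, stint, teamID, G, AB, H) follow the Lahman batting table.

```python
from collections import defaultdict

# Counting stats to sum; extend this tuple with more columns as needed.
COUNTING_STATS = ("G", "AB", "H")


def aggregate_player_years(rows):
    """Sum counting stats over all team stints for each (player, year) pair."""
    totals = defaultdict(lambda: dict.fromkeys(COUNTING_STATS, 0))
    for row in rows:
        key = (row["playerID"], row["yearID"])
        for stat in COUNTING_STATS:
            totals[key][stat] += row[stat]
    return dict(totals)


# Invented example: one player traded mid-season in 2010, one team in 2011.
rows = [
    {"playerID": "player01", "yearID": 2010, "stint": 1, "teamID": "AAA", "G": 60, "AB": 200, "H": 55},
    {"playerID": "player01", "yearID": 2010, "stint": 2, "teamID": "BBB", "G": 45, "AB": 150, "H": 40},
    {"playerID": "player01", "yearID": 2011, "stint": 1, "teamID": "BBB", "G": 150, "AB": 560, "H": 160},
]
by_player_year = aggregate_player_years(rows)
```

Only counting stats should be summed this way; rate stats such as batting average need to be recomputed from the summed totals.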

The second part of the problem: how accurate is it to compare a player's first year against another player's first year? Is it possible to correct for a player being sent to the majors a little early or taking a couple of years to develop? That means being able to compare a player's first year against another player's first, second, or perhaps third year to determine the closest match. How do you line up one player's third year with another player's first year? While I use SAS at work, I don't have access to its functions at home. I noticed that SAS's PROC SIMILARITY can calculate the minimum distance between two time series. Consider the example below of two players' games played.



Note that they're not exactly the same, but you can see the similarity between the two. Using dynamic programming, you can find the minimum distance between these two time series. The distance between two time series can be a simple Euclidean distance or something slightly different. Since I don't have SAS at home, I was lucky to find a package in R with this capability: using the "dtw" package, you can easily calculate the distance between two time series. Applying this package to the above example gives the results below. The three-way plot shows each player's data in the margins and how each point of one series maps to the other. The closer the mapping curve is to the line Y=X, the more closely the two series line up year for year.
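For readers curious about what the dynamic program is doing, here is a minimal dynamic time warping sketch in plain Python. It is not the "dtw" package: it uses the absolute difference as the local cost and the simplest unweighted step rule, so its numbers should not be expected to match the package's output. The two games-played series are invented.

```python
def dtw_distance(a, b):
    """Return (minimum cumulative distance, warping path) between two series.

    cost[i][j] is the cheapest way to align the first i points of `a` with
    the first j points of `b`, where each step moves forward in one or both
    series.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Walk back from the end to recover which points were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Invented games played by career year; player_b takes longer to develop.
player_a = [40, 120, 150, 155, 148]
player_b = [10, 45, 118, 152, 150, 149]
distance, path = dtw_distance(player_a, player_b)
```

Each (i, j) pair in path says that year i of the first player was matched to year j of the second, which is the same mapping the three-way plot draws.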



The next problem is how to use all of this similarity to figure out which players are actually similar to each other. Keep an eye out for my next blog post on using Social Network Analysis to cluster these players together and incorporating more than just Games Played.

Wednesday, March 19, 2014

Similar Players in MLB

It's common to compare baseball players against each other. The question is, what do you actually compare? Their playing styles? Positions they played? Teams they played for? Eras they played in? There are several dimensions along which this problem's complexity increases dramatically. In fact, several people (like Mark Saxon) now try to compare Yasiel Puig against Mike Trout. However, how would you compare them?

I'm interested in developing an analytical technique that will remove the manual labor of looking at statistics. With that tedious task out of the way, it would be interesting to see how players' careers compare against one another. I'm hoping that in the end I can use this as a significant factor in predicting whether a player will end up in the Hall of Fame.

More to come on this topic, but here's a little tease:
  • Data sourced from Sean Lahman
  • Techniques include: 
    • Dynamic Programming 
    • Social Network Analysis 
    • Logistic Regression 
  • Programmed entirely in R and RStudio