Feasible Analytics: dynamic time warping

Friday, March 21, 2014

Similar Players in MLB: Comparison against Baseball-Reference

As many of you are aware, Baseball-Reference calculates a similarity score for each player against other players. Although this is an accepted way to measure the similarity between two players, I wanted to see how my methodology compares. I ran my methodology for Hank Aaron (a player we all know, which makes the comparisons easy to judge) and compared the list I got against the list Baseball-Reference posted. First, let's look at what Baseball-Reference has:

Willie Mays (782)
Barry Bonds (748)
Frank Robinson (667)
Stan Musial (666)
Babe Ruth (645)
Ken Griffey (629)
Carl Yastrzemski (627)
Rafael Palmeiro (611)
Alex Rodriguez (610)
Mel Ott (602)

I don't think anyone would argue with any of these players. However, what's the list of players I came up with? Well, here they are:

Willie Mays
Frank Robinson
Al Kaline
Ernie Banks
Billy Williams
Brooks Robinson
Roberto Clemente
Ken Boyer
Norm Cash
Carl Yastrzemski

What's interesting about my list is there are certainly players that don't seem comparable to Hank Aaron. The question then becomes, how did they make it here? Quickly looking at the numbers you can see that I included more statistics for comparison than Baseball-Reference. In addition, I used a weighting scheme for comparing various statistics. Here's the full list of comparisons. Note that the lower the number, the closer they are in comparison.
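To make the idea of a weighted comparison concrete, here is a minimal sketch in Python. It is not the actual scheme behind the list above: the candidate names, the per-statistic distances, and the weights are all made up for illustration. It only shows how per-statistic distances might be combined into one score where lower means more similar.

```python
# Hypothetical per-statistic distances between a target player and two
# candidates (lower = more similar). All names and numbers are invented.
distances = {
    "candidate_a": {"G": 40.0, "H": 120.0, "HR": 30.0},
    "candidate_b": {"G": 25.0, "H": 200.0, "HR": 90.0},
}

# Assumed weights: how much each statistic counts toward the overall score.
weights = {"G": 0.2, "H": 0.5, "HR": 0.3}


def weighted_score(stat_distances, weights):
    """Combine per-statistic distances into a single weighted sum."""
    return sum(weights[stat] * dist for stat, dist in stat_distances.items())


# Rank candidates from most similar (lowest score) to least similar.
ranked = sorted(distances, key=lambda name: weighted_score(distances[name], weights))

for name in ranked:
    print(name, round(weighted_score(distances[name], weights), 1))
```

One thing a sketch like this makes obvious: statistics on larger scales (hits versus home runs, say) will dominate the score unless the weights or some rescaling account for it.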

So which one is right? I think it's easy to say that Baseball-Reference seems more accurate, but I am continuously looking to improve this methodology and see how that impacts the results. Stay tuned for the final version of the code and methodology.

Thursday, March 20, 2014

Similar Players in MLB: Calculating Similarity

As I mentioned in Similar Players in MLB, I want to be able to see how similar players can be. I decided to take a somewhat different approach by looking at how a player's career compares against another player's career. In order to put things in terms of a career, I didn't want to simply sum up their statistics or normalize given the number of years they played. I wanted to be able to compare a player's second year against another player's second year. This isn't a simple problem though.

The first part of the problem: what statistics do you use to compare two different players? Using the Lahman Database, I had easy access to the common statistics like games played, at bats, runs scored, hits, doubles, triples, etc. However, this database doesn't just hold the counting stats by year. It is a compiled record of stats by league and team, so a player who changed teams mid-season has more than one row for that year. To simplify the data collection, I aggregated the information up to the combination of player and year.
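The aggregation step can be sketched in a few lines of Python. This is an illustration rather than the code I actually ran: the rows are invented, and the column names (playerID, yearID, stint, G, AB, H) are assumed to follow the Lahman batting table.

```python
from collections import defaultdict

# Invented rows in the shape of a batting table: one row per player, year,
# and stint, so a player who moved teams mid-season has two rows that year.
rows = [
    {"playerID": "player01", "yearID": 1954, "stint": 1, "G": 60, "AB": 210, "H": 58},
    {"playerID": "player01", "yearID": 1954, "stint": 2, "G": 62, "AB": 258, "H": 73},
    {"playerID": "player01", "yearID": 1955, "stint": 1, "G": 153, "AB": 602, "H": 189},
]


def aggregate_by_player_year(rows, stats=("G", "AB", "H")):
    """Sum the counting stats over all stints for each (player, year) pair."""
    totals = defaultdict(lambda: dict.fromkeys(stats, 0))
    for row in rows:
        key = (row["playerID"], row["yearID"])
        for stat in stats:
            totals[key][stat] += row[stat]
    return dict(totals)


season_totals = aggregate_by_player_year(rows)
print(season_totals[("player01", 1954)])
```

Summing only makes sense for counting stats. Rate stats such as batting average would have to be recomputed from the summed totals instead.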

The second part of the problem: how accurate is it to compare a player's first year against another player's first year? Is it possible to correct for a player being sent to the majors a little early or taking a couple of years to develop? This means being able to compare a player's first year against another player's first, second, or perhaps third year to determine the closest match. How do you adjust for a player's third year against a player's first year? While I use SAS at work, I don't have access to its functions at home. I noticed that SAS's PROC SIMILARITY can calculate the minimum distance between two time series. Consider the example below of two players' games played.

Note that they're not exactly the same, but you can see the similarity between the two. Using dynamic programming, you can find the minimum distance between these two time series. The distance between two time series can be a simple Euclidean distance or something slightly different. Since I can't use SAS at home, I looked for an alternative and was lucky enough to find a package in R that has this capability. Using the "dtw" package, you can easily calculate the distance between two different time series. Applying this package to the above example gives you the results below. This three-way plot shows each player's data in the margins and how each point in one series maps to a point in the other. The closer the path is to the line Y = X, the more directly one series maps onto the other.
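For anyone curious what the dynamic programming looks like, here is a minimal sketch in plain Python. It is not the R "dtw" package and not PROC SIMILARITY, just the textbook recurrence with absolute difference as the point-to-point cost and no windowing or step-pattern options. The two games-played series are invented for illustration.

```python
def dtw_distance(a, b):
    """Return (minimum cumulative distance, warping path) for two series."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Walk back from the end to recover which points were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Invented games-played series: player_b takes an extra year to settle in.
player_a = [50, 120, 150, 155, 150]
player_b = [20, 60, 125, 150, 152, 148]

distance, path = dtw_distance(player_a, player_b)
print(distance)
print(path)
```

Each pair in the path is (year index for player A, year index for player B). A path that runs straight down the diagonal means year one matched year one, year two matched year two, and so on. Steps off the diagonal are where one player's season was matched against an earlier or later season of the other.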

The next problem is how to use all of these similarity measurements to figure out which players are actually similar to each other. Keep an eye out for my next post on using Social Network Analysis to cluster these players together and incorporating more than just Games Played.