Feasible Analytics: dplyr

Tuesday, October 13, 2015

DFS and Optimization: Data

As with any analytics problem, let's start by getting our hands on data. For the optimization problem, we'll need at least three pieces of information:

Salary information per player
Player information (position, league)
Player statistics / metrics to measure value

Ideally we could download this information straight from DraftKings. However, I didn't want to create an account and it didn't seem straightforward. I took the easier route and used Google to find someone who was already posting some of the data I needed.

DraftKings Salaries

It was a bit difficult to access the salary information directly on DraftKings, but RotoGuru is nice enough to post the daily data for us. Using the httr, dplyr and stringr packages, it was easy enough to scrape the site and pull down the salary data.

ESPN Game Score

Next up were some metrics and statistics for each player. My first thought was to go to ESPN; they have everything, right? Well, yes, but it wasn't easy to grab. Their daily notes section gives lots of tips on who to pick up, including a nice metric called Game Score for pitchers. Here's some code that we'll use to grab that data.

Fangraphs Advanced Metrics

Well, game score is certainly handy, but it'd be nice to have a great metric for hitters too. Since I'm a SABR person, I figured why not go for some advanced metrics. Fangraphs is a great site with articles discussing baseball in terms of advanced metrics and hosting an accompanying glossary for those unfamiliar with them. Here's the code for downloading that data:

Sunday, November 16, 2014

R Package of the Month: dplyr

I'm taking a bit of a hiatus from optimization and going to start a series of posts about R packages. I've been discovering a few useful packages lately that I think are worth sharing. This month I'll share my thoughts around "dplyr" and how it's really sped up my coding in R.

Through work and my side projects, I do a lot of data cleansing and high level analysis of random datasets. I tend to have the typical data problems most people get: not knowing what's inside or how it's structured. R has been a great language for picking datasets apart and cleaning them. However, I had been writing lengthy code that was usually hard to follow. The "dplyr" package has streamlined my code so that it is easier to read later on and to pass on to colleagues without having to spend time explaining it.

Here's a list of the functions/operators I use the most:

select: Select the columns I want for analysis.
left_join: Join two data frames together, keeping all of the values in the first dataset. Similar to the "left join" you find in SQL.
group_by: Within the data frame, choose which columns to aggregate by.
ungroup: Remove the grouping set by "group_by" so that later steps operate on the whole data frame again.
summarize: Combined with "group_by", this allows you to create summary columns like sum, average or even concatenate text fields.
mutate: Create a calculated field across columns for each row in the data frame. Unlike summarize, it keeps every row rather than collapsing them.
filter: Select the rows of interest by matching a condition.
arrange: Order the rows in a data frame by column(s).
slice: Select the rows of interest by position in the data frame.
%>%: This operator allows you to build up a sequence of commands without having to repeat data frame names in each function or save out intermediate data frames after each step.
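
To make the list above a little more concrete, here's a minimal sketch that chains each of these verbs together on two tiny data frames. The data, the column names and the player names are all made up for illustration; nothing here comes from a real dataset.

```r
library(dplyr)

# Made-up data, for illustration only
batting <- data.frame(
  player  = c("a01", "a01", "b02", "b02", "c03"),
  year    = c(2003, 2004, 2004, 2004, 2004),
  team    = c("NYA", "NYA", "BOS", "CHN", "SEA"),
  triples = c(2, 5, 3, 4, 1),
  stringsAsFactors = FALSE
)
players <- data.frame(
  player = c("a01", "b02", "c03"),
  name   = c("Al Able", "Bo Baker", "Cy Cole"),
  stringsAsFactors = FALSE
)

result <- batting %>%
  filter(year == 2004) %>%                        # keep rows matching a condition
  group_by(player) %>%                            # aggregate per player
  summarize(triples = sum(triples),               # one row per group
            teams = paste(team, collapse = ", ")) %>%
  ungroup() %>%                                   # drop any remaining grouping
  mutate(many_triples = triples >= 5) %>%         # new column, same number of rows
  left_join(players, by = "player") %>%           # keep every row on the left side
  select(name, teams, triples, many_triples) %>%  # choose the columns to keep
  arrange(desc(triples)) %>%                      # sort, largest first
  slice(1:2)                                      # first two rows by position

print(result)
```

Each step receives the data frame produced by the previous one, so no intermediate objects are needed and the pipeline reads top to bottom.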

Below is a simple example using baseball data provided by the Lahman package. First we read in the batting statistics and the player data. Starting with the Batting data frame, we aggregate by player ID and year to calculate the total number of triples per player per year. I also concatenate team names and count the number of distinct leagues each player appeared in. I then sort by triples and keep only records from 2004. The next two steps make the output a bit nicer looking. Rather than keeping the player ID, I create a full name for each player in the Master dataset and join it to my working data frame. A select step then takes a subset of columns and renames them. Finally, rather than output the entire list, I keep only the top 5 records.
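
Here's a rough sketch of how the steps just described could look. It assumes the Lahman and dplyr packages are installed. The source columns (playerID, yearID, teamID, lgID, X3B, nameFirst, nameLast) come from Lahman; the intermediate and output names (triples, teams, leagues, fullName, Player and so on) are my own choices for illustration. Recent Lahman releases call the player table People rather than Master, so the sketch checks for either.

```r
library(dplyr)
library(Lahman)

# Newer Lahman releases ship the player table as `People`;
# older releases called it `Master`.
players <- if (exists("People")) People else Master

top_triples <- Batting %>%
  group_by(playerID, yearID) %>%
  summarize(
    triples = sum(X3B, na.rm = TRUE),                  # total triples per player-year
    teams   = paste(unique(teamID), collapse = ", "),  # concatenate team names
    leagues = n_distinct(lgID)                         # number of distinct leagues
  ) %>%
  ungroup() %>%
  arrange(desc(triples)) %>%                           # sort by triples
  filter(yearID == 2004) %>%                           # keep only 2004
  left_join(
    players %>%
      mutate(fullName = paste(nameFirst, nameLast)) %>%  # build a full name
      select(playerID, fullName),
    by = "playerID"
  ) %>%
  select(Player = fullName, Year = yearID, Teams = teams,  # subset and rename
         Leagues = leagues, Triples = triples) %>%
  slice(1:5)                                           # top 5 records

print(top_triples)
```

The order of the steps follows the description above. On a larger dataset it would likely be cheaper to move the filter on yearID ahead of group_by so that only one season is aggregated.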

For more information around dplyr, I recommend checking out the vignettes on the package site and the walkthrough by RStudio.