Sunday, November 16, 2014

R Package of the Month: dplyr

I'm taking a bit of a hiatus from optimization to start a series of posts about R packages. I've been discovering a few useful packages lately that I think are worth sharing. This month I'll share my thoughts on "dplyr" and how it's really sped up my coding in R.

Through work and my side projects, I do a lot of data cleansing and high-level analysis of random datasets. I run into the typical data problems most people face: not knowing what's inside a dataset or how it's structured. R has been a great language for picking datasets apart and cleaning them. However, my code had been getting lengthy and was usually hard to follow. The "dplyr" package has helped streamline my code so that it's easier to read through later and to pass on to colleagues without having to spend time explaining it.

Here's a list of the functions/operators I use the most:
  • select: Select the columns I want for analysis.
  • left_join: Join two data frames together, keeping all of the values in the first dataset. Similar to the "left join" you find in SQL.
  • group_by: Within the data frame, choose which columns to aggregate by.
  • ungroup: Remove the grouping set by "group_by", so that later steps operate on the whole data frame again.
  • summarize: Combined with "group_by", this allows you to create summary columns such as sums, averages or even concatenated text fields, with one row per group.
  • mutate: Unlike summarize, which collapses each group to a single row, mutate adds a calculated column and keeps every row in the data frame.
  • filter: Select the rows of interest by matching a condition.
  • arrange: Order the rows in a data frame by column(s).
  • slice: Select the rows of interest by position in the data frame.
  • %>%: This operator allows you to build up a sequence of commands without having to repeat data frame names in each function or save intermediate data frames after each step.
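
To make the list above concrete, here is a minimal sketch that chains most of these functions together with %>%. The orders and customers data frames, and all of their columns, are made up for illustration; they are not from any real dataset.

```r
library(dplyr)

# Hypothetical toy data: a few orders and a small customer lookup table.
orders <- data.frame(
  order_id = 1:6,
  customer = c("a", "b", "a", "c", "b", "a"),
  amount   = c(10, 25, 5, 70, 15, 50),
  stringsAsFactors = FALSE
)
customers <- data.frame(
  customer = c("a", "b"),
  region   = c("east", "west"),
  stringsAsFactors = FALSE
)

summary_tbl <- orders %>%
  select(customer, amount) %>%               # keep only the columns needed
  filter(amount >= 10) %>%                   # drop the small orders
  group_by(customer) %>%                     # aggregate per customer
  summarize(
    n_orders = n(),                          # rows per customer
    total    = sum(amount)                   # total amount per customer
  ) %>%
  ungroup() %>%                              # drop any grouping that remains
  mutate(avg = total / n_orders) %>%         # calculated column, one value per row
  left_join(customers, by = "customer") %>%  # unmatched customers get NA for region
  arrange(desc(total)) %>%                   # largest totals first
  slice(1:2)                                 # keep the first two rows by position

print(summary_tbl)
```

With this toy data, customer "c" ends up in the first row with a missing region, since it has no match in the customers table, which is the "keep everything from the first dataset" behavior of left_join.
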

Below is a simple example using baseball data provided by the Lahman package. First we read in the batting statistics and the player data. Starting with the Batting data frame, we aggregate by player ID and year to calculate the total number of triples per player per year. I also concatenate the team names and count the number of distinct leagues each player played in. I then sort by triples and keep only records from 2004. The next two steps make the output a bit nicer looking. Rather than keeping the player ID, I create a full name for each player in the Master dataset and join it to my working data frame. A select step then takes a subset of columns and renames them. Finally, rather than output the entire list, I keep only the top 5 records.
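
The steps described above can be sketched roughly as follows. To keep the sketch runnable without downloading anything, it uses two small made-up data frames, batting and master, that borrow the column names I'd expect from Lahman's Batting and Master tables (playerID, yearID, teamID, lgID, X3B, nameFirst, nameLast). The players, teams and numbers are invented, and the output column names are just one possible choice.

```r
library(dplyr)

# Made-up stand-in for Lahman's Batting table (same column names, fake values).
batting <- data.frame(
  playerID = c("p01", "p01", "p02", "p03", "p04", "p05", "p06", "p06", "p01"),
  yearID   = c(2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2003),
  teamID   = c("AAA", "BBB", "CCC", "AAA", "DDD", "BBB", "CCC", "DDD", "AAA"),
  lgID     = c("AL", "NL", "NL", "AL", "NL", "NL", "NL", "NL", "AL"),
  X3B      = c(4, 6, 12, 7, 3, 9, 1, 1, 8),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the player table (Master in the post).
master <- data.frame(
  playerID  = c("p01", "p02", "p03", "p04", "p05", "p06"),
  nameFirst = c("Al", "Ben", "Cal", "Dan", "Ed", "Flo"),
  nameLast  = c("Ash", "Birch", "Cedar", "Dogwood", "Elm", "Fir"),
  stringsAsFactors = FALSE
)

# Build a full name for each player, keeping only what the join needs.
players <- master %>%
  mutate(fullName = paste(nameFirst, nameLast)) %>%
  select(playerID, fullName)

top_triples <- batting %>%
  group_by(playerID, yearID) %>%
  summarize(
    triples  = sum(X3B),                       # total triples per player-year
    teams    = paste(teamID, collapse = ", "), # concatenated team names
    nLeagues = n_distinct(lgID)                # number of distinct leagues
  ) %>%
  ungroup() %>%                                # remove the remaining grouping
  arrange(desc(triples)) %>%                   # most triples first
  filter(yearID == 2004) %>%                   # keep only the 2004 records
  left_join(players, by = "playerID") %>%      # attach the full names
  select(Player = fullName, Year = yearID, Triples = triples,
         Teams = teams, Leagues = nLeagues) %>% # subset and rename columns
  slice(1:5)                                   # top 5 rows only

print(top_triples)
```

To run this against the real data, load the Lahman package and swap the stand-ins for its Batting and player tables. Newer Lahman releases may name the player table People rather than Master, so check which one your installed version provides.
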


For more information about dplyr, I recommend checking out the vignettes on the package site and the walkthrough by RStudio.
