Well, it was another fun race this year, especially with the added surprise of no water on the gas pipes. I will try to post results here from now on, along with some added analysis. I'll probably take a look at how runners progress as they get older, and perhaps how course conditions affect times.
Either way, we had 15 participants this year, with their times listed below. Please let me know if I misspelled any names.
Sunday, November 30, 2014
Wednesday, November 26, 2014
ThinkUp: Personal Twitter Analytics
As a fan of various TWiT.TV shows, I learned about a website that will analyze your Twitter account. Given my curiosity around social media analytics, I figured I'd give it a shot. Although I'm not the most active Twitter user, I was curious what it could tell me and what its developers thought other people would find interesting.
Their analysis seems to be split into a few categories:
- Analysis of your tweet contents
- Analysis of responses to your tweets
- Analysis of your friends' and followers' profiles
Overall, it's pretty interesting. It helps me manage my "brand" as it comes across on Twitter. For example, I can make sure I vary the content of my tweets rather than mostly talking about myself. I also enjoy knowing what times I should tweet in order to get the best response.
Being the curious analytics person I am, I'd like to see a few other facts and insights:
- When are my followers most active?
- Who else should I follow?
- What is the general sentiment of my tweets?
You can view my ThinkUp page for a nice example, although there are some better ones out there.
Monday, November 24, 2014
Geocoding in R with Google Maps API
There are some occasions where you might need to translate an address into geospatial coordinates for analysis. Here's a quick post on how to do this using R and the Google Maps API.
I will be using two libraries commonly used for sending requests to and retrieving information from web APIs:
- RJSONIO: Useful for parsing JSON objects
- RCurl: Used in sending a web request to an API
With these two packages, we'll build the URL for the request, send it, and parse the results (a short sketch follows the list below). There are several fields returned in the results, but I really care about four pieces of information:
- A formatted version of the address
- The latitude and longitude coordinates for the address
- The accuracy for the given coordinates
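Putting that together, here is a minimal sketch of the workflow. The geocode_address helper and the sample address are my own for illustration (not part of any official API), and depending on when you run this Google may also require an API key appended to the URL.

```r
# Sketch: build the request URL, send it with RCurl, parse the JSON with RJSONIO
library(RCurl)
library(RJSONIO)

geocode_address <- function(address) {
  # Build the request URL (the address needs to be URL-encoded)
  url <- paste0("https://maps.googleapis.com/maps/api/geocode/json?address=",
                URLencode(address, reserved = TRUE))

  # Send the request and parse the JSON that comes back
  response <- fromJSON(getURL(url))
  if (response$status != "OK") return(NULL)

  result   <- response$results[[1]]
  location <- result$geometry$location

  # Keep the four pieces of information described above
  data.frame(formatted_address = result$formatted_address,
             lat               = location[["lat"]],
             lng               = location[["lng"]],
             accuracy          = result$geometry$location_type,
             stringsAsFactors  = FALSE)
}

geocode_address("1600 Pennsylvania Ave NW, Washington, DC")
```

The accuracy column in this sketch is the "location_type" value from the response, which is the categorization described next.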
The formatted address can be helpful in making sure Google correctly interpreted your address. Unfortunately, Google can't pinpoint where every address is exactly. They classify their accuracy into four different categories. The descriptions below have been taken from the API website:
- "ROOFTOP" indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
- "RANGE_INTERPOLATED" indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
- "GEOMETRIC_CENTER" indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
- "APPROXIMATE" indicates that the returned result is approximate.
From my testing, it appears that normal home addresses are almost always "RANGE_INTERPOLATED" whereas landmarks are marked as "ROOFTOP".
For more information, see the Google Maps API documentation. This post was inspired by a post done by Jose Gonzalez.
Sunday, November 16, 2014
R Package of the Month: dplyr
I'm taking a bit of a hiatus from optimization and going to start a series of posts about R packages. I've been discovering a few useful packages lately that I think are worth sharing. This month I'll share my thoughts around "dplyr" and how it's really sped up my coding in R.
Through work and my side projects, I do a lot of data cleansing and high-level analysis of random datasets. I tend to have the typical data problems most people run into: not knowing what's inside a dataset or how it's structured. R has been a great language for pulling datasets apart and cleaning them. However, I had been writing lengthy code that was usually hard to follow. The "dplyr" package has streamlined my code in a way that makes it easier to read later and to hand off to colleagues without spending time explaining it.
Here's a list of the functions/operators I use the most:
- select: Select the columns I want for analysis.
- left_join: Join two data frames together, keeping all of the values in the first dataset. Similar to the "left join" you find in SQL.
- group_by: Within the data frame, choose which columns to aggregate by.
- ungroup: Remove the grouping from a data frame so later operations apply to all rows.
- summarize: Combined with "group_by", this allows you to create summary columns like sums, averages, or even concatenated text fields.
- mutate: Similar to summarize, create a calculated field across columns for each row in the data frame.
- filter: Select the rows of interest by matching a condition.
- arrange: Order the rows in a data frame by column(s).
- slice: Select the rows of interest by position in the data frame.
- %>%: This operator allows you to build up a sequence of commands without having to repeat data frame names in each function or save out intermediate data frames after each step.
Below is a simple example using baseball data provided by the Lahman package. First we read in data around batting statistics and around players. Starting with the Batting data frame, we aggregate by player ID and year to calculate the total number of triples per player per year. I also concatenate team names and calculate the number of distinct leagues each player appeared in. I then sort by triples and keep only records from 2004. The next two steps make the output a bit nicer looking: rather than keeping the player ID, I create a full name for each player from the Master dataset and append it to my working data frame. A later step takes a subset of columns and renames them. Finally, rather than output the entire list, I keep only the top five records.
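Since the original code block didn't survive the move, here is a sketch that follows the same steps. The Batting and Master tables and columns like X3B (triples), teamID, and lgID come from the Lahman package; the rest is my reconstruction, not the original script.

```r
# Sketch of the steps described above, using dplyr with the Lahman data
library(dplyr)
library(Lahman)

Batting %>%
  group_by(playerID, yearID) %>%
  summarize(Triples = sum(X3B, na.rm = TRUE),                 # total triples per player per year
            Teams   = paste(unique(teamID), collapse = ", "), # concatenate team names
            Leagues = n_distinct(lgID)) %>%                   # number of distinct leagues
  ungroup() %>%
  arrange(desc(Triples)) %>%                                  # sort by triples
  filter(yearID == 2004) %>%                                  # keep only 2004 records
  left_join(Master, by = "playerID") %>%                      # bring in player names
  mutate(Player = paste(nameFirst, nameLast)) %>%             # build a full name per player
  select(Player, Year = yearID, Teams, Leagues, Triples) %>%  # subset and rename columns
  slice(1:5)                                                  # keep the top 5 records
```

The %>% chain reads top to bottom, which is exactly the readability benefit described above: no intermediate data frames and no repeated data frame names.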
For more information around dplyr, I recommend checking out the vignettes on the package site and the walkthrough by RStudio.
