Monday, November 24, 2014

Geocoding in R with Google Maps API

There are occasions when you might need to translate an address into geospatial coordinates for analysis. Here's a quick post on how to do this using R and the Google Maps API.

I will be using two libraries commonly used for reaching out to and getting information from various APIs on the web.
  • RJSONIO: Useful for parsing JSON objects
  • RCurl: Used to send a web request to an API
With these two packages, we'll build the URL for the request, send the request, and parse the results. The response contains several fields, but I really care about four pieces of information: 
  • A formatted version of the address
  • The latitude and longitude coordinates for the address
  • The accuracy for the given coordinates
The formatted address can be helpful in making sure Google correctly interpreted your address. Unfortunately, Google can't pinpoint every address exactly. They classify their accuracy into four different categories; the descriptions below are taken from the API documentation:
  • "ROOFTOP" indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
  • "RANGE_INTERPOLATED" indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
  • "GEOMETRIC_CENTER" indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
  • "APPROXIMATE" indicates that the returned result is approximate.
From my testing, normal home addresses almost always come back as "RANGE_INTERPOLATED", whereas landmarks are marked as "ROOFTOP". 
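The original code for this post isn't shown here, but the steps above can be sketched as follows. This is a minimal reconstruction, not the post's exact code: the helper names `geocode_url` and `geocode` are my own, while the endpoint and response field names (`formatted_address`, `geometry$location`, `geometry$location_type`) follow the public Geocoding API.

```r
library(RCurl)    # sends the web request
library(RJSONIO)  # parses the JSON response

# Build the request URL for a given address
geocode_url <- function(address) {
  base <- "https://maps.googleapis.com/maps/api/geocode/json"
  paste0(base, "?address=", URLencode(address, reserved = TRUE))
}

# Send the request and pull out the four pieces of information above
geocode <- function(address) {
  response <- fromJSON(getURL(geocode_url(address)), simplify = FALSE)
  if (response$status != "OK") return(NULL)  # address not found or request failed
  first <- response$results[[1]]             # take the top-ranked match
  data.frame(
    formatted_address = first$formatted_address,
    lat               = first$geometry$location$lat,
    lng               = first$geometry$location$lng,
    accuracy          = first$geometry$location_type,  # ROOFTOP, RANGE_INTERPOLATED, ...
    stringsAsFactors  = FALSE
  )
}

# Example usage (requires an internet connection):
# geocode("1600 Pennsylvania Ave NW, Washington, DC")
```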


For more information, see the Google Maps API documentation. This post was inspired by a post by Jose Gonzalez. 

Sunday, November 16, 2014

R Package of the Month: dplyr

I'm taking a bit of a hiatus from optimization and starting a series of posts about R packages. I've been discovering a few useful packages lately that I think are worth sharing. This month I'll share my thoughts on "dplyr" and how it's really sped up my coding in R.

Through work and my side projects, I do a lot of data cleansing and high-level analysis of random datasets. I tend to have the typical data problems most people get: not knowing what's inside a dataset or how it's structured. R has been a great language for pulling datasets apart and cleaning them. However, I had been writing lengthier code that was usually hard to follow. The "dplyr" package has helped streamline my code so that I can read through it later and pass it on to colleagues without having to spend time explaining it.

Here's a list of the functions/operators I use the most:
  • select: Select the columns I want for analysis.
  • left_join: Join two data frames together, keeping all of the values in the first dataset. Similar to the "left join" you find in SQL.
  • group_by: Within the data frame, choose which columns to aggregate by.
  • ungroup: Remove the grouping from a data frame.
  • summarize: Combined with "group_by", this allows you to create summary columns like sum, average or even concatenate text fields.
  • mutate: Similar to summarize, create a calculated field across columns for each row in the data frame.
  • filter: Select the rows of interest by matching a condition.
  • arrange: Order the rows in a data frame by column(s).
  • slice: Select the rows of interest by position in the data frame.
  • %>%: This operator allows you to build up a sequence of commands without having to repeat data frame names in each function or save out intermediate data frames after each step.
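As a quick illustration of the pipe, the two snippets below are equivalent; the piped version reads top to bottom instead of inside out. (The built-in mtcars dataset is used here purely for illustration.)

```r
library(dplyr)

# Nested calls: read from the inside out
arrange(filter(mtcars, cyl == 4), desc(mpg))

# Piped: read top to bottom, no intermediate data frames
mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg))
```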
Below is a simple example using baseball data provided by the Lahman package. First we read in data on batting statistics and on players. Starting with the Batting data frame, we aggregate by player ID and year to calculate the total number of triples in a year per player. I also concatenate team names and calculate the number of distinct leagues each player played in. I then sort by triples and keep only records from 2004. The next two steps make the output a bit nicer looking. Rather than keeping the player ID, I create a full name for each player from the Master dataset and join it onto my working data frame. Line 18 takes a subset of columns and renames them. Finally, rather than output the entire list, I keep only the top 5 records. 
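The code screenshot from the original post isn't shown here, so the pipeline below is a sketch of the steps just described rather than the exact original. It assumes the Lahman package's Batting and Master tables, where X3B holds triples, lgID the league, and nameFirst/nameLast the player's name.

```r
library(dplyr)
library(Lahman)

Batting %>%
  group_by(playerID, yearID) %>%
  summarize(
    triples = sum(X3B, na.rm = TRUE),                  # total triples per player-year
    teams   = paste(unique(teamID), collapse = ", "),  # concatenate team names
    leagues = n_distinct(lgID)                         # distinct leagues played in
  ) %>%
  ungroup() %>%
  arrange(desc(triples)) %>%
  filter(yearID == 2004) %>%
  left_join(Master, by = "playerID") %>%               # bring in player details
  mutate(fullName = paste(nameFirst, nameLast)) %>%    # build a readable name
  select(Player = fullName, Year = yearID,
         Triples = triples, Teams = teams, Leagues = leagues) %>%
  slice(1:5)                                           # keep only the top 5 records
```

Note how every verb from the list above appears once, chained with %>% so the data frame flows from one step to the next without intermediate variables.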


For more information about dplyr, I recommend checking out the vignettes on the package site and the walkthrough by RStudio.