Sunday, December 14, 2014

DR Alumni Race Results: First attempt with Shiny

I've always touted how quickly Tableau can produce interactive graphics, although it requires a license to work with "large" datasets or to connect to anything beyond local text files. As far as I'm concerned, Tableau is still great for quickly prototyping visualizations and has a relatively cheap license. But what do you do when your data is online and Tableau isn't an option?

That's when I picked up RStudio's Shiny. Shiny is a web application framework that lets you host your R scripts online. After about two nights of working through their awesome tutorial, I had a basic working visualization using ggplot2: a pretty nice static chart that could redraw based on an input string (someone's name). To recreate the interactivity of Tableau, I needed a different package and chose rCharts, which provides a nice interface to D3-based JavaScript charting libraries for interactive charts. Below is a simple example extending Shiny to redraw an interactive scatter plot using the Polychart.JS library. What do you think?
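As a minimal sketch of the Shiny half of that pattern, here's an app that redraws a ggplot2 chart from a name typed into a text box. The results data frame and its columns are made up for illustration; the rCharts version simply swaps the plot output for a chart widget.

library(shiny)
library(ggplot2)

# Hypothetical race results; the real app would read the actual alumni data
results <- data.frame(
  name = rep(c("Alice", "Bob"), each = 3),
  year = rep(2012:2014, 2),
  time = c(20.1, 19.8, 19.5, 22.3, 21.9, 21.7)
)

ui <- fluidPage(
  textInput("runner", "Runner name:", value = "Alice"),
  plotOutput("trend")
)

server <- function(input, output) {
  output$trend <- renderPlot({
    # Redraw whenever the name in the text box changes
    d <- subset(results, name == input$runner)
    ggplot(d, aes(x = year, y = time)) +
      geom_line() +
      geom_point() +
      labs(x = "Year", y = "Time (minutes)")
  })
}

shinyApp(ui = ui, server = server)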

Sunday, November 30, 2014

2014 DR XC Alumni Race Results

Well, it was another fun race this year, especially with the added surprise of no water on the gas pipes. I will try to post results here from now on, along with some added analysis. I'll probably take a look at how runners progress as they get older, and perhaps how course conditions affect times.

Either way, we had 15 participants this year with their times below. Please let me know if I misspelled any names.



Wednesday, November 26, 2014

ThinkUp: Personal Twitter Analytics

As a fan of various TWiT.TV shows, I learned about a website that will analyze your Twitter account. Given my curiosity around social media analytics, I figured I'd give it a shot. Although I'm not the most active Twitter user, I was curious what it could tell me and what it thought other people would find interesting.

Their analysis seems to be split into a few categories:
  • Analysis of your tweet contents
  • Analysis of responses to your tweets
  • Analysis of your friends' and followers' profiles
Overall, it's pretty interesting. It helps me manage my "brand" as it comes across on Twitter. For example, I can make sure I vary the content of my tweets and not just talk about myself in the majority of tweets. I also enjoy knowing what times I should tweet in order to get the best response.

Being the curious analytics person I am, there are a few other facts / insights I would like to see:
  • When are my followers most active?
  • Who else should I follow? 
  • What is the general sentiment of my tweets?
You can view my ThinkUp page for a nice example, although there are some better ones out there.  

Monday, November 24, 2014

Geocoding in R with Google Maps API

There are some occasions where you might need to translate an address into geospatial coordinates for analysis. Here's a quick post on how to do this using R and Google Maps API.

I will be using two libraries commonly used for requesting and parsing information from web APIs.
  • RJSONIO: Useful for parsing JSON objects
  • RCurl: Used to send web requests to an API
With these two packages, we'll build the URL for the request, send the request, and parse the results. There are several fields returned in the results, but I really care about four pieces of information:
  • A formatted version of the address
  • The latitude and longitude coordinates for the address
  • The accuracy for the given coordinates
The formatted address can be helpful in making sure Google correctly interpreted your address. Unfortunately, Google can't pinpoint where every address is exactly. They classify their accuracy into four different categories. The descriptions below have been taken from the API website:
  • "ROOFTOP" indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
  • "RANGE_INTERPOLATED" indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
  • "GEOMETRIC_CENTER" indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
  • "APPROXIMATE" indicates that the returned result is approximate.
From my testing, it appears that normal home addresses are almost always "RANGE_INTERPOLATED" whereas landmarks are marked as "ROOFTOP". 
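Here's a rough sketch of that request-and-parse flow using the two packages above. The endpoint and field names follow the geocoding documentation as of this writing (no API key was required then); treat the exact response handling as an assumption rather than production code.

library(RCurl)
library(RJSONIO)

geocode <- function(address) {
  # Build the request URL for the geocoding service
  url <- paste0("http://maps.googleapis.com/maps/api/geocode/json?address=",
                URLencode(address), "&sensor=false")
  # Send the request and parse the JSON response
  response <- fromJSON(getURL(url))
  if (response$status != "OK") return(NULL)
  result <- response$results[[1]]
  data.frame(formatted.address = result$formatted_address,
             lat = result$geometry$location["lat"],
             lng = result$geometry$location["lng"],
             accuracy = result$geometry$location_type,
             stringsAsFactors = FALSE)
}

geocode("1600 Pennsylvania Ave NW, Washington, DC")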


For more information, see the Google Maps API documentation. This post was inspired by a post done by Jose Gonzalez. 

Sunday, November 16, 2014

R Package of the Month: dplyr

I'm taking a bit of a hiatus from optimization and starting a series of posts about R packages. I've been discovering a few useful packages lately that I think are worth sharing. This month I'll share my thoughts on dplyr and how it's really sped up my coding in R.

Through work and my side projects, I do a lot of data cleansing and high-level analysis of random datasets. I tend to run into the typical data problems most people face: not knowing what's inside a dataset or how it's structured. R has been a great language for parsing datasets apart and cleaning them. However, my code had been getting lengthier and harder to follow. The dplyr package has helped streamline my code so that I can read through it later on and pass it on to colleagues without having to spend time explaining it.

Here's a list of the functions/operators I use the most:
  • select: Select the columns I want for analysis.
  • left_join: Join two data frames together, keeping all of the values in the first dataset. Similar to the "left join" you find in SQL.
  • group_by: Within the data frame, choose which columns to aggregate by.
  • ungroup: Remove the grouping from a data frame.
  • summarize: Combined with "group_by", this allows you to create summary columns like sum, average or even concatenate text fields.
  • mutate: Similar to summarize, create a calculated field across columns for each row in the data frame.
  • filter: Select the rows of interest by matching a condition.
  • arrange: Order the rows in a data frame by column(s).
  • slice: Select the rows of interest by position in the data frame.
  • %>%: This operator allows you to build up a sequence of commands without having to repeat data frame names in each function or save out intermediate data frames after each step.
Below is a simple example using baseball data provided by the Lahman package. First we read in data around batting statistics and player details. Starting with the Batting data frame, we aggregate by player ID and year to calculate the total number of triples per player per year. I also concatenate team names and calculate the number of distinct leagues each player appeared in. I then sort by triples and keep only records from 2004. The next two steps make the output a bit nicer looking: rather than keeping the player ID, I build a full name for each player from the Master dataset and join it onto my working data frame, then take a subset of columns and rename them. Finally, rather than output the entire list, I keep only the top 5 records.
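A sketch of what that pipeline might look like is below. The table and column names (Batting, Master, X3B, teamID, lgID) come from the Lahman package; the exact column selection and renaming are my reconstruction rather than the original code.

library(Lahman)
library(dplyr)

top.triples <- Batting %>%
  group_by(playerID, yearID) %>%
  summarize(Triples = sum(X3B),
            Teams   = paste(unique(teamID), collapse = "/"),
            Leagues = n_distinct(lgID)) %>%
  ungroup() %>%
  filter(yearID == 2004) %>%
  arrange(desc(Triples)) %>%
  # Attach a readable player name from the Master table
  left_join(Master %>%
              mutate(Player = paste(nameFirst, nameLast)) %>%
              select(playerID, Player),
            by = "playerID") %>%
  select(Player, Year = yearID, Triples, Teams, Leagues) %>%
  slice(1:5)

top.triples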


For more information about dplyr, I recommend checking out the vignettes on the package site and the walkthrough by RStudio.

Wednesday, July 30, 2014

LocalSolver License

I spent the last few days translating the lpSolve code from my previous optimization problem to the LocalSolver API. I'll admit, I quite like the solver's syntax and the accompanying R package. I found it incredibly easy to import data, set up constraints, and solve the problem. However, there was one big downside: I'm not a student or teacher, and I can't afford to buy a commercial license.

Anyway, here's the code in case you have a license that isn't subject to the free version's limits.

Monday, July 28, 2014

Better Fantasy Football Data for Optimization

It's been a while, but I finally got around to parsing a better data source. FFToday provides projections for multiple statistics per position, which fills the gap for Special Teams and Kickers that ESPN was not providing. In addition, you can filter to get the projected total points according to several different websites' default scoring and rules. The link I used calculates total points for standard ESPN leagues, but you can easily adjust it to pull for your own league or site.

Well, let's get into the fun part. In general, pulling the data is relatively simple for every position, but the columns vary by position. We'll only use two libraries: one for parsing the website and one for combining the data frames.

require("XML")
require("plyr")
options(stringsAsFactors = FALSE)

Next up, let's pull the raw data. As you'll note, the first column and first row are not useful for us, so we'll drop them and then change the column names. The last step removes some stray characters that end up inside the Name field.

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=10&LeagueID=26955"

x<-readHTMLTable(link)
qb <- x[11]$'NULL'
qb <- qb[-1,-1]
names(qb)<-c("Name","Team","Bye.Week","Passing.Completions","Passing.Attempts",
             "Passing.Yards","Passing.TDs","Passing.INTs",
             "Rushing.Attempts","Rushing.Yards","Rushing.TDs","FFPTs")

qb$Name<-sapply(qb$Name,function(x) substring(x,3))

We then repeat this for the other positions.

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=20&LeagueID=26955"
x<-readHTMLTable(link)
rb <- x[11]$'NULL'
rb <- rb[-1,-1]
names(rb)<-c("Name","Team","Bye.Week","Rushing.Attempts","Rushing.Yards","Rushing.TDs",
             "Receiving.Catches","Receiving.Yards","Receiving.TDs","FFPTs")
rb$Name<-sapply(rb$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=30&LeagueID=26955"
x<-readHTMLTable(link)
wr <- x[11]$'NULL'
wr <- wr[-1,-1]
names(wr)<-c("Name","Team","Bye.Week","Receiving.Catches","Receiving.Yards","Receiving.TDs",
             "Rushing.Attempts","Rushing.Yards","Rushing.TDs","FFPTs")
wr$Name<-sapply(wr$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=40&LeagueID=26955"
x<-readHTMLTable(link)
te <- x[11]$'NULL'
te <- te[-1,-1]
names(te)<-c("Name","Team","Bye.Week","Receiving.Catches","Receiving.Yards","Receiving.TDs","FFPTs")
te$Name<-sapply(te$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=80&LeagueID=26955"
x<-readHTMLTable(link)
k <- x[11]$'NULL'
k <- k[-1,-1]
names(k)<-c("Name","Team","Bye.Week","Field.Goals.Made","Field.Goals.Attempts","Field.Goals.Pct",
             "Extra.Points.Made","Extra.Points.Attempts","FFPTs")
k$Name<-sapply(k$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=99&LeagueID=26955"
x<-readHTMLTable(link)
def <- x[11]$'NULL'
def <- def[-1,-1]
names(def)<-c("Name","Bye.Week","Sacks","Fumble.Recovered","Interceptions",
            "Defensive.TDs","Points.Against","Passing.Yards.per.Game","Rushing.Yards.per.Game",
            "Safety","Kicking.TDs","FFPTs")
def$Name<-sapply(def$Name,function(x) substring(x,3))

Lastly, we'll add a field with the position and then combine the datasets. Note that since each data frame has different columns, we need a function from the plyr package to combine them. This leaves NAs wherever a column wasn't present in a given data frame, which we can easily fix with the last statement.

qb$pos <- as.factor("QB")
rb$pos <- as.factor("RB")
wr$pos <- as.factor("WR")
te$pos <- as.factor("TE")
def$pos <- as.factor("DEF")
k$pos <- as.factor("K")

d <- rbind.fill(qb,rb,wr,te,def,k)
d[is.na(d)]<-0

Hooray, better data to work with in our optimization. I'm looking into a different solver that appears to be a bit more user friendly. Keep an eye out for a post on LocalSolver up next!

Tuesday, July 1, 2014

Fantasy Football Optimization Part 1

I decided to break the problem up into baby steps. This first part deals with building out the initial structure of the optimization problem. For those who read my other post on optimization in R, I'll be using the same libraries and style for setting up this problem.

First up, let's read in the data we created in the last post. We'll add a simple column that creates a numeric ID per player.
d <- read.csv(file=paste(getwd(),"/Data/ESPN-Projections.csv", sep=""))
d$id <- as.integer(factor(paste(d$name,d$team)))

Now that the data is all set, we can load the required solver libraries.
require("lpSolve");require("lpSolveAPI");

Next, we set the number of teams in the league and use it to build a vector of team IDs.
num.teams <- 10
teams <- seq(1,num.teams)

Similarly, we can grab the number of players in our dataset and create a vector of the ids.
num.players <- length(unique(d$id))
players <- unique(d$id)

I'm going to create a data frame with the decision variables for our problem. First up is creating the cross product of all players and teams. We'll then merge in our player data and add in a team ID.
vars <- data.frame(player.id=rep(players,num.teams))
vars <- merge(x=vars,y=d,by.y="id",by.x="player.id")
vars <- vars[,c("player.id","pos","name")]
vars$team.id <- rep(seq(1,num.teams),num.players)

The data is set up and it's time to create the actual integer programming problem. Note that these decision variables are binary: either a player is assigned to that team or he isn't.
ip <- make.lp(0,num.players*num.teams)
set.type(ip,seq(1,num.players*num.teams),type="binary")

The objective function is simply to maximize the total projected points.
set.objfn(ip,rep(d$total.points,num.teams))
lp.control(ip,sense="max")

We need a constraint for each player ensuring that if he is assigned to a team, he is assigned to one and only one team.
for (p in players) {
  add.constraint(ip,
                 rep(1,num.teams),
                 "<=",
                 1,
                 which(vars$player.id==p)
                 )
}

Now for the team constraints. First up, the positions required for each team. For simplicity, I'm using the lineup ESPN uses in its standard league. Here are the minimum numbers of players to draft at each position:

  • 1 QB
  • 2 RB
  • 2 WR
  • 1 RB/WR/TE (Flex player)
  • 1 TE
  • 1 DEF
  • 1 K


for (t in teams) {
  #This constraint covers having at least 1 QB  
  add.constraint(ip,
                 rep(1,sum(vars$pos=="QB")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="QB")
  )
  #This constraint covers having at least 2 WR
  add.constraint(ip,
                 rep(1,sum(vars$pos=="WR")/num.teams),
                 ">=",
                 2,
                 which(vars$team.id==t & vars$pos=="WR")
  )
  #This constraint covers having at least 2 RB
  add.constraint(ip,
                 rep(1,sum(vars$pos=="RB")/num.teams),
                 ">=",
                 2,
                 which(vars$team.id==t & vars$pos=="RB")
  )
  #This constraint covers having at least 1 DEF
  add.constraint(ip,
                 rep(1,sum(vars$pos=="DEF")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="DEF")
  )
  #This constraint covers having at least 1 K
  add.constraint(ip,
                 rep(1,sum(vars$pos=="K")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="K")
  )
  #This constraint covers having at least 1 TE
  add.constraint(ip,
                 rep(1,sum(vars$pos=="TE")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="TE")
  )
  #This constraint covers having at least 1 flex player. Note that the other constraints require at least 1 TE, 2 RB, 2 WR. In order to cover a flex player, the total sum of players from those positions needs to be at least 6.
  add.constraint(ip,
                 rep(1,sum(vars$pos=="TE",vars$pos=="RB",vars$pos=="WR")/num.teams),
                 ">=",
                 6,
                 which(vars$team.id==t & (vars$pos=="TE" | vars$pos=="RB" | vars$pos=="WR"))
  )
  #This constraint covers each team having 16 players
  add.constraint(ip,
                 rep(1,num.players),
                 "=",
                 16,
                 which(vars$team.id==t)
  )
}

Well that's it for our basic set of constraints. If you're interested in seeing what the model formulation looks like, execute the "write.lp" statement below.
write.lp(ip,paste(getwd(),"/modelformulation.txt",sep=""),type="lp",use.names=T)

Now the fun part: solving the integer program. Assuming the problem is feasible (and it is), we can retrieve the objective function value and the solution.
solve(ip)
get.objective(ip)
get.variables(ip)

Although the raw solution vector looks relatively complex, we can simply keep the selected assignments and print them out.

sol<-vars[get.variables(ip)==1,c("name","team.id","pos")]
View(sol[order(sol$team.id,sol$pos),])

One huge downside to this approach is the lack of any actual drafting strategy or complications. This problem simply looks at dividing talent evenly across teams. I particularly dislike that some teams end up with more than one kicker. No one should ever own more than one kicker.

My next step is to either improve the formulation of this problem, probably by using some options mentioned in this Fantasy Football Analytics post, or to look at applying a different algorithm to solving this problem.

Tuesday, June 24, 2014

Downloading ESPN Fantasy Football Projections


I'm starting another series of posts that will look at determining the best fantasy football draft using optimization and game theory. However, first we need some data to play with. Starting from Fantasy Football Analytics' code, I adjusted it to download stats from the ESPN rankings page.

First up, let's load some useful packages.

require("XML")
require("stringr")
require("ggplot2")
require("plyr")


Next, I want to make sure we read in all strings as strings and not create factors from them.
options(stringsAsFactors = FALSE)

Here's the hard part. We need to go out to the website, download the data and clean it up.
clean <- function(pos) {
    link<-"http://games.espn.go.com/ffl/tools/projections?&seasonTotals=true&seasonId=2014&slotCategoryId="
    x <- readHTMLTable(paste(link,pos,sep=""))$playertable_0
    #update the column names
    names(x) <- c("position.rank","name.team.pos", "passing.completions.per.attempt", "passing.yards", "passing.tds", "passing.interceptions", "rushing.rushes", "rushing.yards", "rushing.tds", "receiving.catches", "receiving.yards", "receiving.tds", "total.points")

    #remove header row
    x <- x[-1,]

    #separate out the completions and attempts
    x$passing.completions <- str_sub(x$passing.completions.per.attempt, end=str_locate(string=x$passing.completions.per.attempt, '/')[,1]-1)
    x$passing.attempts <- str_sub(x$passing.completions.per.attempt, start=str_locate(string=x$passing.completions.per.attempt, '/')[,1]+1)
    x[,"passing.completions.per.attempt"] <- NULL

  if (pos!="16") {
    x$name <- str_sub(x$name.team.pos, end=str_locate(string=x$name.team.pos, ',')[,1]-1)
    x$team <- str_sub(x$name.team.pos, 
                      start=str_locate(string=x$name.team.pos, ',')[,1]+2, 
                      end = str_locate(string=x$name.team.pos, ',')[,1]+4)
    x$team <- str_trim(x$team, side="right")
    x$team <- toupper(x$team) 
  }  else {
    x$name <- str_sub(x$name.team.pos, end=str_locate(string=x$name.team.pos, ' ')[,1]-1)
    team.lookup<-as.data.frame(matrix(c("DEN","Broncos","GB","Packers","NO","Saints","CAR","Panthers",
                                        "WSH","Redskins","DET","Lions","IND","Colts","PHI","Eagles","OAK","Raiders",
                                        "SEA","Seahawks","SF","49ers","DAL","Cowboys","ATL","Falcons","NE","Patriots",
                                        "SD","Chargers","MIN","Vikings","CHI","Bears","KC","Chiefs","CIN","Bengals",
                                        "PIT","Steelers","NYG","Giants","ARI","Cardinals","MIA","Dolphins",
                                        "BAL","Ravens","TB","Bucaneers","CLE","Browns","HOU","Texans","STL","Rams",
                                        "BUF","Bills","NYJ","Jets","TEN","Titans","JAC","Jaguars"),ncol=2,byrow=TRUE))
    names(team.lookup)<-c("team","name")
    x<-merge(x=x,y=team.lookup,by="name")
  } 
    x[,"name.team.pos"] <- NULL

    #change from character to numeric
    for (c in names(x)) {
        if (!(c %in% c("pos","name","team"))) x[,c] <- as.numeric(x[,c])
        #replace NAs with 0
        x[is.na(x[,c]),c] <- 0
    }
    return(x)
}


Now that the function is ready, we just need to fill in the appropriate tail for each position's URL. Note that since we'll need more than 40 RBs and 40 WRs, we grab the top 120 players at those positions.

qb <- clean("0")
rb <- rbind(clean("2"),clean("2&startIndex=40"),clean("2&startIndex=80"))
wr <- rbind(clean("4"),clean("4&startIndex=40"),clean("4&startIndex=80"))
te <- clean("6")
def <- clean("16")
k <- clean("17")


Now let's add in a variable for each position.

qb$pos <- as.factor("QB")
rb$pos <- as.factor("RB")
wr$pos <- as.factor("WR")
te$pos <- as.factor("TE")
def$pos <- as.factor("DEF")
k$pos <- as.factor("K")


Finally, let's combine all of the position specific datasets.

d <- rbind(qb,rb,wr,te,def,k)

We can now rank each player by the point totals.

d$rank <- rank(-d$total.points, ties.method="min")

Finally, we can order the dataset by that rank.

d <- d[order(d$rank),]

Well this is a great start. Note that we have total points for Defense and Kicker, but no accompanying statistics. I'm going to explore the ESPN website to see what I can find but this will do in the meantime.

Saturday, March 29, 2014

Measuring Accuracy of a Naive Bayes Classifier in Python

In the last post, I built a Naive Bayes classifier that uses a training dataset to classify tweets into a particular sentiment. One thing I didn't mention is how to measure the accuracy of that classifier. In order to determine how large a training dataset needs to be, we first need a metric that can be used to measure the classifier.

In the previous program, I had read the tweets into a list of tuples. The first step is to change this into a list of lists, which can be done with the following list comprehension:

tweets = [list(t) for t in taggedtweets]

Now that we have a list of tweets, we can simply iterate through each tweet and add the result of classifying each tweet.

for t in tweets:
    t.append(classifier.classify(feature_extractor(t[0])))

Now the list has the tweet, the training classification and the scoring classification. We can then simply calculate how many of them are the same and divide by the number of tweets. Since I like to view the results in the shell, I added a nice print statement that will format the accuracy as a percent.

accuracy = sum(t[1]==t[2] for t in tweets)/float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)

Using the training dataset I provided, this comes out to an accuracy of 46%. Not exactly the most accurate classifier. Ideally I would compare this against another dataset, but I don't have one available at the moment. Looking at the results raises another question: is this the best metric to use? The "classify" method chooses the classification with the highest probability. Is it worth taking into account the full probability distribution calculated by the classifier? For example, a tweet could have this probability distribution:

P(sentiment==positive)=0.000000
P(sentiment==negative)=0.620941
P(sentiment==neutral)=0.379059

In this example, the tweet's actual label is neutral, but the classifier picks negative, so the metric above marks it as inaccurate. However, shouldn't the model's 0.38 probability of neutral count in some way? I think it should get some partial credit. Instead of counting it as a miss (score of 0), I use the probability the model assigned to the tweet's actual category (score of 0.379059). Although this increases the credit given for misses, it decreases the credit given for matches. This calculation changes the code slightly:

tweets = [list(t) for t in taggedtweets]

for t in tweets:
    t.append(classifier.classify(feature_extractor(t[0])))
    pc = classifier.prob_classify(feature_extractor(t[0]))
    t.append(pc.prob(t[1]))

accuracy = sum(t[1]==t[2] for t in tweets)/float(len(tweets))
weighted_accuracy = sum(t[3] for t in tweets)/float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)
print "Classifier weighted accuracy is {:.0%}".format(weighted_accuracy)

The results aren't quite what I expected. It comes out to roughly 46% like the simple accuracy. I think this weighted accuracy will be a better measurement. Using this metric, we can now work on improving the model and determining the size of a training set required to feed into the model.

Sunday, March 23, 2014

Sentiment Analysis using Python and NLTK

Keeping with the trend of using Python to analyze Twitter data, here's a sample program for analyzing those tweets. In general, this program uses a training dataset (referenced below as "training data.csv") that classifies tweets as "Positive", "Neutral", or "Negative". This dataset is then used to train a Naive Bayes classifier that will score future tweets.

The first step in any program is to import the libraries it needs. We'll be using the NLTK package for analyzing the text and the csv package for reading in the training dataset.
import nltk
from nltk.corpus import stopwords
import csv

This is a small list of custom stop words that I can use to eliminate from the tweets.
customstopwords = ['band', 'they', 'them']

Next up, let's open the training dataset and parse it. I don't particularly like using '1' or '-1' to represent positive and negative, so I recode these as "positive" and "negative". I'm sure there's a more elegant way of doing this, but I found this to be a simple one.
ifile = open('training data.csv', 'rb')
reader = csv.reader(ifile)

taggedtweets = []

rownum=0
for row in reader:
    if rownum == 0:
        header = row
    else:
        colnum = 0
        txt = ''
        for col in row:
            if colnum == 0:
                txt = col
            else:
                if col == '1':
                    taggedtweets.append((txt,'positive'))
                elif col == '0':
                    taggedtweets.append((txt,'neutral'))
                elif col == '-1':
                    taggedtweets.append((txt,'negative'))
            colnum+=1
    rownum+=1

ifile.close()

Now that we have our training dataset, let's create an empty list and insert a tuple of the tweet with its respective sentiment.
tweets = []

for (word, sentiment) in taggedtweets:
    word_filter = [i.lower() for i in word.split()]
    tweets.append((word_filter, sentiment))

I found two functions useful for analyzing the results. The first collects all of the words across the tweets; the second orders that list of words by frequency.
def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

def getwordfeatures(listoftweets):
    wordfreq = nltk.FreqDist(listoftweets)
    words = wordfreq.keys()
    return words

We will now use the list of tweets in the training dataset to create a comprehensive list of words. At the same time, we will remove the English stop words. These words often provide noise when determining sentiment of a tweet. The function afterwards will be used for identifying the features within each tweet. 
wordlist = [i for i in getwords(tweets) if not i in stopwords.words('english')]

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features

We feed the cleaned training dataset into a classifier, which can then be used to score future tweets. We can also print out some interesting information about the classifier, like the top 20 most informative features. In this case, they are the 20 terms that most strongly predict one sentiment value over another.
training_set = nltk.classify.apply_features(feature_extractor, tweets)

classifier = nltk.NaiveBayesClassifier.train(training_set)
print "These are the top 20 most informative features\n"
print classifier.show_most_informative_features(n=20)

Now that we have a classifier, we can use it for multiple purposes. I think my next blog will be about how to improve this classifier while simultaneously scoring a stream of tweets captured by a search. I'd also like to put some numbers on the appropriate size for a training dataset. Keep an eye out for future blog entries.

Friday, March 21, 2014

Similar Players in MLB: Comparison against Baseball-Reference

As many of you are aware, Baseball-Reference calculates a similarity score for each player against other players. Although this is an accepted way to calculate the similarity between two players, I wanted to see how my methodology compares. I ran my methodology for Hank Aaron (someone we all know, so the comparisons are easy to judge) and compared the list I got against the list Baseball-Reference posted. First, let's look at what Baseball-Reference has:
  1. Willie Mays (782) 
  2. Barry Bonds (748) 
  3. Frank Robinson (667) 
  4. Stan Musial (666) 
  5. Babe Ruth (645) 
  6. Ken Griffey (629) 
  7. Carl Yastrzemski (627) 
  8. Rafael Palmeiro (611) 
  9. Alex Rodriguez (610) 
  10. Mel Ott (602) 
I don't think anyone would argue with any of these players. However, what's the list of players I came up with? Well, here they are:
  1. Willie Mays 
  2. Frank Robinson 
  3. Al Kaline 
  4. Ernie Banks 
  5. Billy Williams 
  6. Brooks Robinson 
  7. Roberto Clemente 
  8. Ken Boyer 
  9. Norm Cash 
  10. Carl Yastrzemski 
What's interesting about my list is that there are certainly players who don't seem comparable to Hank Aaron. The question then becomes: how did they make it here? A quick look at the numbers shows that I included more statistics in the comparison than Baseball-Reference does. In addition, I used a weighting scheme when comparing the various statistics. Here's the full list of comparisons; note that the lower the number, the closer the two players are.


So which one is right? I think it's easy to say that Baseball-Reference seems more accurate, but I am continuously looking to improve this methodology and see how that impacts the results. Stay tuned for the final version of the code and methodology.

Thursday, March 20, 2014

Similar Players in MLB: Calculating Similarity

As I mentioned in Similar Players in MLB, I want to be able to see how similar players can be. I decided to take a somewhat different approach by looking at how a player's career compares against another player's career. In order to put things in terms of a career, I didn't want to simply sum up their statistics or normalize given the number of years they played. I wanted to be able to compare a player's second year against another player's second year. This isn't a simple problem though.

The first part of the problem: what statistics do you use to compare two different players? Using the Lahman Database, I had easy access to the common statistics like games played, at bats, runs scored, hits, doubles, triples, etc. However, this database doesn't simply have the counting stats by year; it is a compiled record of stats by league and team. To simplify the data collection, I aggregated the information up to the combination of player and year.
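A sketch of that aggregation, assuming the Lahman package's Batting table and a handful of counting stats:

library(Lahman)

# Roll the league/team rows up to one record per player and year
player.year <- aggregate(cbind(G, AB, R, H, X2B, X3B, HR) ~ playerID + yearID,
                         data = Batting, FUN = sum)

head(player.year)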

The second part of the problem: how accurate is it to compare a player's first year against another player's first year? Is it possible to correct for a player being sent to the majors a little early or taking a couple of years to develop? That means being able to compare a player's first year against another player's first, second, or perhaps third year to determine the closest match. How do you adjust for matching a player's third year against another player's first year? While I use SAS at work, I don't have access to its functions at home. I noticed that SAS's PROC SIMILARITY can calculate the minimum distance between two time series. Consider the example below of two players' games played.



Note that they're not exactly the same, but you can see the similarity between the two. Using dynamic programming, you can find the minimum distance between these two time series, where the distance between points can be a simple Euclidean distance or something slightly different. Since I don't have access to SAS at home, I was lucky enough to find an R package with this capability: the "dtw" package lets you easily calculate the distance between two time series (a rough sketch of the call is below). Applying this package to the above example gives the results below. This three-way plot shows each player's data in the margins and how each point maps to the other series; the closer the warping path is to the line y = x, the closer the mapping of one series is to the other.
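The two games-played series here are made up for illustration rather than taken from the players in the plot.

library(dtw)

player.a <- c(120, 145, 150, 148, 155, 160, 158)  # games played by season
player.b <- c(100, 130, 149, 151, 154, 159)

# keep = TRUE retains the internals needed for plotting the alignment
alignment <- dtw(player.a, player.b, keep = TRUE)
alignment$distance                  # minimum cumulative distance between the series
plot(alignment, type = "threeway")  # the three-way plot described above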



The next problem is how to use all of this similarity to figure out what players are actually similar to each other. Keep an eye out for my next blog on using Social Network Analysis to cluster these players together and incorporating more than just Games Played.

Wednesday, March 19, 2014

Similar Players in MLB

It's a common question to compare baseball players against each other. The question is what do you actually compare? Their playing styles? Positions they played? Teams they played for? Eras they played in? There are several dimensions along which this problem's complexity increases dramatically. In fact, several people now try to compare Yasiel Puig against Mike Trout (like Mark Saxon). But how would you compare them?

I'm interested in developing an analytical technique that removes the manual labor of poring over statistics. With that tedious task out of the way, it would be interesting to see how players' careers compare against one another. I'm hoping that, in the end, I can use this as a significant factor in predicting whether a player will end up in the Hall of Fame.

More to come on this topic, but here's a little tease:
  • Data sourced from Sean Lahman
  • Techniques include: 
    • Dynamic Programming 
    • Social Network Analysis 
    • Logistic Regression 
  • Programmed entirely in R and RStudio

Tuesday, March 18, 2014

Downloading Twitter with Python

I've been working recently on downloading social data on my own through R. However, I kept seeing posts about how easy this is to do in Python, so I figured I would give it a shot and write some code. As it turns out, it was incredibly simple to download tweets and write them out to a CSV file. There are a couple of requirements:
  1. A Twitter Development account: Free through Twitter
  2. Python Library rauth: Free through Rauth
Next the actual code. Note that I'm not a Python expert so feel free to leave a comment on a more efficient method if there is one.

Load the Python libraries.
from rauth import OAuth1Service
import string
import csv

Define a function to keep only printable characters.
def printable(s):
    return filter(lambda x: x in string.printable, s)
Here's a list of fields that I want to keep. Note that there is a lot more information in a single tweet, but many fields are often blank or missing.

fieldnames = ['handle',
              'text',
              'coordinates',
              'created_at',
              'tweet_id',
              'favorite_count',
              'retweet_count']


Initialize the CSV file to write Tweets into.

writer = csv.DictWriter(open('tweets.csv','wb'),fieldnames=fieldnames)
writer.writeheader()


Get a real consumer key & secret from https://dev.twitter.com/apps/new.

twitter = OAuth1Service(
    name='twitter',
    consumer_key='consumerkey',
    consumer_secret='consumersecret',
    request_token_url='https://api.twitter.com/oauth/request_token',
    access_token_url='https://api.twitter.com/oauth/access_token',
    authorize_url='https://api.twitter.com/oauth/authorize',
    base_url='https://api.twitter.com/1.1/')

Initialize the request token and retrieve from your web browser.

request_token, request_token_secret = twitter.get_request_token()
authorize_url = twitter.get_authorize_url(request_token)

print('Visit this URL in your browser: {url}'.format(url=authorize_url))
pin = input('Enter PIN from browser: ')

After entering the pin, your session will start with this section of code.

session = twitter.get_auth_session(request_token,
                                   request_token_secret,
                                   method='POST',
                                   data={'oauth_verifier': pin})

Each method in Twitter has some parameters to it. This example will grab the timeline for the authenticating user. There are two parameters that we will use to include retweets and limit it to 200 tweets.

params = {'include_rts': 1,
          'count': 200}
Now let's get the actual JSON object with the information!
r = session.get('statuses/user_timeline.json', params=params, verify=True)


Last but not least, let's iterate through the tweets and write the information out to the CSV. Note that there can be issues if you don't strip out non-printable characters or change the encoding of the text.

for i, tweet in enumerate(r.json(), 1):
    t = {}
    t['handle'] = printable(tweet['user']['screen_name'])
    t['text'] = printable(tweet['text'])
    t['coordinates'] = tweet['coordinates']
    t['created_at'] = printable(tweet['created_at'])
    t['tweet_id'] = printable(tweet['id_str'])
    t['favorite_count'] = tweet['favorite_count']
    t['retweet_count'] = tweet['retweet_count']

    print(u'{0}. @{1}: {2}'.format(i, t['created_at'], t['text']))
    writer.writerow(t)


I liked this approach a lot better than several versions I saw using R. Next on my list is to look at the NLTK package for natural language processing in Python. I want to do a nice comparison against some equivalent R packages to see which is easier to use, more comprehensive, and more efficient at processing. Stay tuned for the results.

Wednesday, March 5, 2014

Optimization using R and LPSolve

I saw a recent post on OR-Exchange about which programming language is best for optimization. While I agree with @JFPuget that languages are just wrappers for various solvers, there is a learning curve behind how to use each wrapper. My previous entries show how to program this in SAS using OPTMODEL. I took this opportunity to write up the same example in R using lpSolve.

First up, let's setup the data.

products<-data.frame(
   product=c("TRAVEL","CASH Rewards","HOTEL"),
   volume=rep(500,3))
prices<-data.frame(
   price=c("10.99"),
   volume=c(500))
set.seed(123)
customers<-data.frame(
   id=seq(1,1000),
   customer_status=rbinom(1000,1,0.25))
set.seed(123)
require("triangle")
model.scores<-rbind(
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("TRAVEL",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)),
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("CASH Rewards",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)),
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("HOTEL",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)))

Next up, the optimization code. I will note that this is somewhat confusing, but I will break it into sections similar to what I did with the SAS version. I will be using these two libraries for lpSolve: the first provides access to the solver itself, but the API package is what makes this relatively easy to code.

require("lpSolve");require("lpSolveAPI")

The first step is making an empty optimization problem named ip. I start with zero constraints and add in the number of decision variables required.

ip <- make.lp(0,nrow(customers)*nrow(products)*nrow(prices))

Next I declare each decision variable as binary.

set.type(ip,seq(1,nrow(customers)*nrow(products)*nrow(prices)),type="binary")

This next part will seem like a relatively pointless step, but it will help in adding constraints. This data.frame contains every combination of customer, product, and price available. This will provide the index for assigning values in constraints and the objective function.

combos <- expand.grid(prices$price,products$product,customers$id)
names(combos)<-c('price','product','id')
rownames(combos)<-do.call(paste, c(combos[c("id", "product","price")], sep = "_"))
rownames(model.scores)<-do.call(paste, c(model.scores[c("id", "product","price")], sep = "_"))
model.scores.ordered<-model.scores[order(match(rownames(model.scores),rownames(combos))),]

By default, the objective is minimized. First I set the coefficients of the decision variables in the objective function and then change the sense to maximization.

set.objfn(ip,model.scores.ordered$expected.profit*model.scores.ordered$likelihood.to.apply)
lp.control(ip,sense="max")

Here are two constants we will use in defining the constraints.

prod.per.person<-1
price.per.product<-1

First up, let's add in the constraints around the number of products per person.

for (c in customers$id) {
   add.constraint(ip,
                  rep(1,nrow(products)*nrow(prices)),
                  "<=",
                  prod.per.person,
                  which(combos$id==c)
   )
}

The next set of constraints will limit the number of price points per product per customer.

for (c in customers$id) {
   for (p in products$product) {
      add.constraint(ip,
                     rep(1,nrow(prices)),
                     "<=",
                     price.per.product,
                     which(combos$product==p & combos$id==c)
      )
   }
}

The next set of constraints assign the volume constraints for price points.

for (q in prices$price) {
   add.constraint(ip,
                  rep(1,nrow(customers)*nrow(products)),
                  "<=",
                  prices$volume[which(q==prices$price)],
                  which(combos$price==q)
   )
}

The next set of constraints assign the volume constraints for products.

for (p in products$product) {
   add.constraint(ip,
                  rep(1,length(which(combos$product==p))),
                  "<=",
                  products$volume[which(p==products$product)],
                  which(combos$product==p)
   )
}

Finally, let's clean up the formulation a little and assign names to the decision variables and constraints.

colnames <- character()
for (c in customers$id) {
   for (p in products$product) {
      for (q in prices$price) {
         colnames<-c(colnames,paste(c,p,q,sep="_"))
      }
   }
}
rownames <- character()
for (c in customers$id) {
   rownames<-c(rownames,paste("prod_per_cust",c,sep="_"))
}
for (c in customers$id) {
   for (p in products$product) {
      rownames<-c(rownames,paste("price_per_prod",c,p,sep="_"))
   }
}
for (q in prices$price) {
   rownames<-c(rownames,paste("price_vol",q,sep="_"))
}
for (p in products$product) {
   rownames<-c(rownames,paste("prod_vol",p,sep="_"))
}

dimnames(ip) <- list(rownames, colnames)

It's possible to write out the formulation in case you would like to review it. Note that this file can become overwhelming as any of the three dimensions of the problem grows.

write.lp(ip,"modelformulation.txt",type="lp",use.names=T)

Last, but not least, let's solve the problem. After the solve call are commands to view the objective value, the decision variable values, and the constraint values.

solve(ip)
get.objective(ip)
get.variables(ip)

get.constraints(ip)

I'm sure there are more elegant ways for writing up this code, but it still remains a relatively painful way to code an optimization problem. I'll skip the visualization part for now. I'm going to spend time writing up a Python version next and then hope to write a comparison between the three versions.

Thursday, January 23, 2014

Optimization in SAS: Results

After spending all that time writing code, next comes the fun part, and what we really get paid for in the consulting world. I'm going to walk through three pieces of the results after hitting "run". The three pieces are:
  1. The Log: This contains information behind the compilation of the code, including errors and high level results.
  2. The Results Window (Tab if you are in SAS EG): Contains the default results of having written a basic PROC OPTMODEL.
  3. Graphing the results: Some bonus code at the end to make some sense of the solution.
First up, the log produced from the program. SAS uses the log in order to print out the model description and the high level results. I'll skip the boring parts of the log where it reprints the code and get right to the important part. This set of "notes" in SAS describes what was created as the optimization formulation. It will also give the stats around "presolving" the problem and the solution status. SAS "presolves" the optimization problem by removing extraneous information like variables and non-binding constraints.

NOTE: The problem has 3000 variables (0 free, 0 fixed).
NOTE: The problem has 3000 binary and 0 integer variables.
NOTE: The problem has 4004 linear constraints (4004 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 12000 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver removed 0 variables, 3003 linear constraints, and 0 nonlinear constraints.
NOTE: The OPTMODEL presolved problem has 3000 variables, 1001 linear constraints, and 0 nonlinear constraints.
NOTE: The OPTMODEL presolver removed 8000 linear constraint coefficients, leaving 4000.
NOTE: The OPTMILP presolver value AUTOMATIC is applied.
NOTE: The OPTMILP presolver removed 0 variables and 0 constraints.
NOTE: The OPTMILP presolver removed 0 constraint coefficients.
NOTE: The OPTMILP presolver modified 0 constraint coefficients.
NOTE: The presolved problem has 3000 variables, 1001 constraints, and 4000 constraint coefficients.
NOTE: The MIXED INTEGER LINEAR solver is called.
          Node  Active    Sols    BestInteger      BestBound      Gap    Time
             0       1       2  41528.0754238  41528.0754238    0.00%       1
             0       0       2  41528.0754238              .    0.00%       1
NOTE: Optimal.
NOTE: Objective = 41528.0754.

Hooray, the model not only solved but found an optimal solution. The next part is to check the Results window, which is where SAS puts all printouts from the procedures used in the program. There are three sections in the results for this problem. The first is the test print we used to verify that our data was read into the model correctly.


The next part shows the problem summary. You'll notice that this looks like the output in the log, just forced into a table format. I don't use this table very often other than to verify the size of the problem, which can be helpful to relate to the run time.

Problem Summary

The last piece shows the solution summary. This gives you some of the relevant information you need to evaluate your model.

Solution Summary


The last step for me is taking a stab at visualizing the results produced by the optimization model. Since this optimization problem is relatively simple, I want to color code who was mailed and who wasn't for each of the products. Since my objective function is the product of Profit and the Likelihood To Apply, I use each as an axis inside of the chart. Below is the simple macro I use to build the chart for each product. First I use SQL to generate a dynamic header for the chart with the total expected profit for that product. I then use SGPLOT to draw the scatter plot. 

proc format;
    value mail
        1="Mail"
        0="No Mail"
    ;
run;

%macro graph_results(product);
goptions device=svg;
ods graphics on / antialiasmax=10000;
proc sql noprint;
    select sum(e_profit*lta) format=dollar12. 
        into :profit 
        from results 
        where mail=1
        and product = "&product"
    ;
quit;

proc sgplot data=results;
    title1 "Optimization Results for &product.";
    title2 "Total Expected Profit = &profit.";
    where product = "&product";
    format mail mail.;
    scatter x=e_profit y=lta / 
        group=mail 
        markerattrs=(size=5 symbol=circlefilled)
        transparency=0.7
        name="scatter"
    ;
    keylegend "scatter" / across=1 position=topleft noborder location=inside ;
    xaxis label="Expected Profit" values=(0 to 100 by 20) display=(noticks);
    yaxis label="Likelihood to Apply" values=(0 to 1 by .1) display=(noticks);
run;
title;
ods graphics off;
goptions reset=all;
%mend;

%graph_results(TRAVEL);




Note that it isn't the greatest picture, but it gives you an idea of the solution. I hope to give the same visualization a shot in Tableau to demonstrate that without coding up a chart, you can still get a nice visualization.