Saturday, March 29, 2014

Measuring Accuracy of a Naive Bayes Classifier in Python

In the last post, I built a Naive Bayes classifier that uses a training dataset to classify tweets by sentiment. One thing I didn't mention is how to measure the accuracy of that classifier. In order to determine how large a training dataset needs to be, we first need a metric for measuring the classifier.

In the previous program, I had read the tweets into a list of tuples. The first step is to change this into a list of lists, which can be done with a simple list comprehension:

tweets = [list(t) for t in taggedtweets]

Now that we have a list of lists, we can iterate through it and append each tweet's predicted classification.

for t in tweets:
    # Tokenize and lowercase the tweet text so the features match training.
    tokens = [w.lower() for w in t[0].split()]
    t.append(classifier.classify(feature_extractor(tokens)))

Now each list holds the tweet, its training classification, and its scored classification. We can then calculate how many of them match and divide by the number of tweets. Since I like to view the results in the shell, I added a print statement that formats the accuracy as a percent.

accuracy = sum(t[1]==t[2] for t in tweets)/float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)

Using the training dataset I provided, this comes out to an accuracy of 46%. Not exactly the most accurate classifier. Ideally I would test this against a separate dataset, but I don't have one available at the moment. Looking at the results brings up another question: is this even the best metric to use? The "classify" method chooses the classification with the highest probability. Is it worth taking into account the full probability distribution calculated by the classifier? For example, a tweet could have this probability distribution:

P(sentiment==positive)=0.000000
P(sentiment==negative)=0.620941
P(sentiment==neutral)=0.379059

In this example, the tweet's training classification is actually neutral, so the metric above marks it as inaccurate. However, should the model's probability for neutral count in some way? I think it deserves some partial credit. Instead of scoring it as a non-match (score of 0), I use the probability the classifier assigned to the tweet's training classification (score of 0.379059). Although this increases the credit for non-matches, it decreases the credit given for matches. This calculation changes the code slightly:

tweets = [list(t) for t in taggedtweets]

for t in tweets:
    # Tokenize/lowercase to match training, and reuse one probability distribution.
    pc = classifier.prob_classify(feature_extractor([w.lower() for w in t[0].split()]))
    t.append(pc.max())
    t.append(pc.prob(t[1]))

accuracy = sum(t[1]==t[2] for t in tweets)/float(len(tweets))
weighted_accuracy = sum(t[3] for t in tweets)/float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)
print "Classifier weighted accuracy is {:.0%}".format(weighted_accuracy)

The results aren't quite what I expected: the weighted accuracy also comes out to roughly 46%, just like the simple accuracy. I still think this weighted accuracy will be the better measurement. Using this metric, we can now work on improving the model and determining the size of the training set required to feed into it.

Sunday, March 23, 2014

Sentiment Analysis using Python and NLTK

Keeping on the trend of using Python to analyze Twitter data, here's a sample program for analyzing those tweets. In general, this program utilizes a training dataset (referenced below as "training data.csv") that classifies tweets as "positive", "neutral", or "negative". This dataset is then used to train a Naive Bayes classifier that can score future tweets.

The first step in any program is to import the libraries it needs. We'll be using the NLTK package for analyzing the text and the csv package for reading in the training dataset.
import nltk
from nltk.corpus import stopwords
import csv

Here is a small list of custom stop words to filter out of the tweets.
customstopwords = ['band', 'they', 'them']

Next up, let's open the training dataset and parse it. I don't particularly like using '1' or '-1' to represent positive and negative, so I recode these as "positive" and "negative". I'm sure there's a more elegant way of doing this (a more compact version follows the code below), but I found this to be a simple approach.
ifile = open('training data.csv', 'rb')
reader = csv.reader(ifile)

taggedtweets = []

rownum=0
for row in reader:
    if rownum == 0:
        header = row
    else:
        colnum = 0
        txt = ''
        for col in row:
            if colnum == 0:
                txt = col
            else:
                if col == '1':
                    taggedtweets.append((txt,'positive'))
                elif col == '0':
                    taggedtweets.append((txt,'neutral'))
                elif col == '-1':
                    taggedtweets.append((txt,'negative'))
            colnum+=1
    rownum+=1

ifile.close()
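
For what it's worth, here's a more compact version of the same parsing. This sketch assumes a simple two-column layout (tweet text, then a 1/0/-1 label), which is how I read the file above:

labelmap = {'1': 'positive', '0': 'neutral', '-1': 'negative'}

taggedtweets = []
with open('training data.csv', 'rb') as ifile:
    reader = csv.reader(ifile)
    next(reader)  # skip the header row
    for row in reader:
        txt, label = row[0], row[1]
        if label in labelmap:
            taggedtweets.append((txt, labelmap[label]))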

Now that we have our training dataset, let's create an empty list and insert a tuple of each tweet's lowercased words along with its sentiment.
tweets = []

for (word, sentiment) in taggedtweets:
    word_filter = [i.lower() for i in word.split()]
    tweets.append((word_filter, sentiment))

I found two functions useful for analyzing the results. The first collects all of the words across the tweets. The second orders a list of words by their frequency.
def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

def getwordfeatures(listoftweets):
    wordfreq = nltk.FreqDist(listoftweets)
    words = wordfreq.keys()
    return words

We will now use the list of tweets in the training dataset to create a comprehensive list of words, removing the English stop words (plus the custom ones defined earlier) along the way, since these words mostly add noise when determining the sentiment of a tweet. The function that follows identifies the features within each tweet.
stoplist = set(stopwords.words('english') + customstopwords)
wordlist = [i for i in getwords(tweets) if i not in stoplist]

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features
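
As a quick sanity check, we can run one made-up tokenized tweet through the extractor and count how many of the features come back True:

sample = feature_extractor(['i', 'love', 'this', 'band'])
print "{0} of {1} features are True".format(sum(1 for v in sample.values() if v), len(sample))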

We now feed the cleaned training dataset into a classifier, which can then be used to score future tweets. We can also print out some interesting information about this classifier, like the top 20 most informative features. In this case, they are the 20 terms that most strongly predict one sentiment value over another.
training_set = nltk.classify.apply_features(feature_extractor, tweets)

classifier = nltk.NaiveBayesClassifier.train(training_set)
print "These are the top 20 most informative features\n"
classifier.show_most_informative_features(n=20)  # prints its own output
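
With the classifier trained, scoring a new tweet only requires running it through the same lowercasing, tokenizing, and feature extraction steps. The tweet below is made up purely for illustration:

newtweet = "I love this band and their new album"
tokens = [w.lower() for w in newtweet.split()]
print "Predicted sentiment: " + classifier.classify(feature_extractor(tokens))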

Now that we have a classifier, we can use it for multiple purposes. I think my next blog will be about how to improve this classifier while simultaneously scoring a stream of tweets pulled from a search. I'd also like to throw out some numbers on an appropriate size for a training dataset. Keep an eye out for future blog entries.

Friday, March 21, 2014

Similar Players in MLB: Comparison against Baseball-Reference

As many of you are aware, Baseball-Reference calculates a similarity score for each player against other players. Although this is an accepted way to calculate the similarity between two players, I wanted to see how my methodology compares. I ran my methodology for Hank Aaron (someone we all know, so the comparisons are easy to judge) and compared the list I got against the list Baseball-Reference posted. First, let's look at what Baseball-Reference has:
  1. Willie Mays (782) 
  2. Barry Bonds (748) 
  3. Frank Robinson (667) 
  4. Stan Musial (666) 
  5. Babe Ruth (645) 
  6. Ken Griffey (629) 
  7. Carl Yastrzemski (627) 
  8. Rafael Palmeiro (611) 
  9. Alex Rodriguez (610) 
  10. Mel Ott (602) 
I don't think anyone would argue with any of these players. However, what's the list of players I came up with? Well, here they are:
  1. Willie Mays 
  2. Frank Robinson 
  3. Al Kaline 
  4. Ernie Banks 
  5. Billy Williams 
  6. Brooks Robinson 
  7. Roberto Clemente 
  8. Ken Boyer 
  9. Norm Cash 
  10. Carl Yastrzemski 
What's interesting about my list is that there are certainly players who don't seem comparable to Hank Aaron. The question then becomes: how did they make it here? Quickly looking at the numbers, you can see that I included more statistics for comparison than Baseball-Reference does. In addition, I used a weighting scheme for comparing the various statistics. Here's the full list of comparisons. Note that the lower the number, the closer the two players are.


So which one is right? I think it's easy to say that Baseball-Reference seems more accurate, but I am continuously looking to improve this methodology and see how that impacts the results. Stay tuned for the final version of the code and methodology.

Thursday, March 20, 2014

Similar Players in MLB: Calculating Similarity

As I mentioned in Similar Players in MLB, I want to be able to measure how similar two players are. I decided to take a somewhat different approach by looking at how one player's career compares against another's. To put things in terms of a career, I didn't want to simply sum up their statistics or normalize by the number of years they played; I wanted to be able to compare a player's second year against another player's second year. This isn't a simple problem, though.

The first part of the problem: what statistics do you use to compare two different players? Using the Lahman Database, I had easy access to the common statistics like games played, at bats, runs scored, hits, doubles, triples, etc. However, this database doesn't simply have the counting stats by year; it is a compiled record of stats by league and team. To simplify the data collection, I aggregated the information up to the combination of player and year.

The second part of the problem: how accurate is it to compare a player's first year against another player's first year? Is it possible to correct for a player being sent to the majors a little early, or taking a couple of years to develop? That means being able to compare a player's first year against another player's first, second, or perhaps third year to determine the closest match. How do you adjust for one player's third year against another player's first year? While I use SAS at work, I don't have access to its functions at home. I noticed that SAS's PROC SIMILARITY can calculate the minimum distance between two time series. Consider the example below of two players' games played by season.



Note that they're not exactly the same, but you can see the similarity between the two. Using dynamic programming, you can find the minimum distance between these two time series, where the distance between points can be a simple Euclidean distance or something slightly different. Since SAS isn't an option at home, I was lucky enough to find a package in R with this capability: using the "dtw" package, you can easily calculate the distance between two time series. Applying this package to the above example gives the results below. The three-way plot shows each player's data in the margins and how each point maps; the closer the mapping is to the line y = x, the more closely one series aligns with the other.
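
To make the dynamic programming concrete, here is a bare-bones sketch of the DTW recurrence, written in Python purely for illustration (the R "dtw" package handles this, plus step patterns, windowing, and the plots). The two sequences are made-up games-played careers of different lengths:

def dtw_distance(a, b):
    # cost[i][j] = minimum cumulative distance aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    cost = [[float('inf')] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between two seasons
            cost[i][j] = d + min(cost[i - 1][j],      # skip a season of a
                                 cost[i][j - 1],      # skip a season of b
                                 cost[i - 1][j - 1])  # match the two seasons
    return cost[n][m]

print(dtw_distance([131, 153, 154, 156, 145], [116, 122, 153, 154]))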



The next problem is how to use all of these distances to figure out which players are actually similar to each other. Keep an eye out for my next blog on using Social Network Analysis to cluster these players together and incorporating more than just Games Played.

Wednesday, March 19, 2014

Similar Players in MLB

It's a common question to compare baseball players against each other. The question is, what do you actually compare? Their playing styles? Positions they played? Teams they played for? Eras they played in? There are several dimensions along which this problem's complexity increases dramatically. In fact, several people now try to compare Yasiel Puig against Mike Trout (like Mark Saxon). However, how would you compare them?

I'm interested in developing an analytical technique that removes the manual labor of poring over statistics. With that tedious task out of the way, it would be interesting to see how players' careers compare against one another. I'm hoping that, in the end, I can use this as a significant factor in predicting whether a player will end up in the Hall of Fame.

More to come on this topic, but here's a little tease:
  • Data sourced from Sean Lahman
  • Techniques include: 
    • Dynamic Programming 
    • Social Network Analysis 
    • Logistic Regression 
  • Programmed entirely in R and RStudio

Tuesday, March 18, 2014

Downloading Twitter with Python

I've been working recently with downloading social data on my own through R. However, I kept seeing posts on how easy this was to download through Python so I figured I would give it a shot and write some code. As it turns out, it was incredibly simple to download and write out to a CSV file. There are a couple of requirements:
  1. A Twitter Development account: Free through Twitter
  2. Python Library rauth: Free through Rauth
Next up is the actual code. Note that I'm not a Python expert, so feel free to leave a comment if there's a more efficient method.

Load the Python libraries.
from rauth import OAuth1Service
import string
import csv

Define a function to keep only printable characters.
def printable(s):
    return filter(lambda x: x in string.printable, s)
Here's a list of fields that I want to keep. Note that there is a lot more information attached to a single Tweet, but those extra fields are often blank or missing.

fieldnames = ['handle',
              'text',
              'coordinates',
              'created_at',
              'tweet_id',
              'favorite_count',
              'retweet_count']


Initialize the CSV file to write Tweets into.

writer = csv.DictWriter(open('tweets.csv','wb'),fieldnames=fieldnames)
writer.writeheader()


Get a real consumer key & secret from https://dev.twitter.com/apps/new.

twitter = OAuth1Service(
    name='twitter',
    consumer_key='consumerkey',
    consumer_secret='consumersecret',
    request_token_url='https://api.twitter.com/oauth/request_token',
    access_token_url='https://api.twitter.com/oauth/access_token',
    authorize_url='https://api.twitter.com/oauth/authorize',
    base_url='https://api.twitter.com/1.1/')

Initialize the request token and retrieve the authorization PIN from your web browser.

request_token, request_token_secret = twitter.get_request_token()
authorize_url = twitter.get_authorize_url(request_token)

print('Visit this URL in your browser: {url}'.format(url=authorize_url))
pin = raw_input('Enter PIN from browser: ')  # raw_input keeps the PIN as a string in Python 2

After entering the PIN, this section of code starts your session.

session = twitter.get_auth_session(request_token,
                                   request_token_secret,
                                   method='POST',
                                   data={'oauth_verifier': pin})

Each Twitter API method takes some parameters. This example grabs the timeline for the authenticating user, using two parameters to include retweets and limit the result to 200 tweets.

params = {'include_rts': 1,
          'count': 200}
Now let's get the actual JSON object with the information!
r = session.get('statuses/user_timeline.json', params=params, verify=True)
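
As an aside, the same session can hit other endpoints. For example, a hypothetical search query would look like the sketch below; if memory serves, the 1.1 search endpoint takes a 'q' query parameter and nests the tweets under a 'statuses' key. The rest of this post sticks with the user timeline.

search_params = {'q': 'mlb', 'count': 100}
search = session.get('search/tweets.json', params=search_params, verify=True)
searched_tweets = search.json().get('statuses', [])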


Last but not least, let's iterate through the tweets and write the information out to the CSV. Note that there can be issues if you don't strip out non-printable characters or change the encoding of the text.

for i, tweet in enumerate(r.json(), 1):
    t = {}
    t['handle'] = printable(tweet['user']['screen_name'])
    t['text'] = printable(tweet['text'])
    t['coordinates'] = tweet['coordinates']
    t['created_at'] = printable(tweet['created_at'])
    t['tweet_id'] = printable(tweet['id_str'])
    t['favorite_count'] = tweet['favorite_count']
    t['retweet_count'] = tweet['retweet_count']

    print(u'{0}. @{1}: {2}'.format(i, t['handle'], t['text']))
    writer.writerow(t)


I liked this approach a lot better than several versions I saw using R. Next on my list is to look at the NLTK package for natural language processing in Python. I want to do a nice comparison against some equivalent packages in R to see which is easier to use, more comprehensive, and more efficient in processing. Stay tuned for the results.

Wednesday, March 5, 2014

Optimization using R and LPSolve

I saw a recent post on OR-Exchange about which programming language is best for optimization. While I agree with @JFPuget that languages are just wrappers for various solvers, there is a learning curve behind each wrapper. My previous entries are about how to program this in SAS using optmodel. I took this opportunity to write up the same example in R using lpSolve.

First up, let's setup the data.

products<-data.frame(
   product=c("TRAVEL","CASH Rewards","HOTEL"),
   volume=rep(500,3))
prices<-data.frame(
   price=c("10.99"),
   volume=c(500))
set.seed(123)
customers<-data.frame(
   id=seq(1,1000),
   customer_status=rbinom(1000,1,0.25))
set.seed(123)
require("triangle")
model.scores<-rbind(
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("TRAVEL",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)),
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("CASH Rewards",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)),
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("HOTEL",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)))

Next up, the optimization code. I will note that this is somewhat confusing, so I will break it up into sections similar to what I did with the SAS version. I will be using these two libraries: lpSolve provides access to the solver itself, but lpSolveAPI is what actually makes this relatively easy to code.

require("lpSolve");require("lpSolveAPI")

The first step is making an empty optimization problem named ip. I start with zero constraints and add in the number of decision variables required.

ip <- make.lp(0,nrow(customers)*nrow(products)*nrow(prices))

Next I declare each decision variable as binary.

set.type(ip,seq(1,nrow(customers)*nrow(products)*nrow(prices)),type="binary")

This next part will seem like a relatively pointless step, but it will help in adding constraints. This data.frame contains every combination of customer, product, and price available. This will provide the index for assigning values in constraints and the objective function.

combos <- expand.grid(prices$price,products$product,customers$id)
names(combos)<-c('price','product','id')
rownames(combos)<-do.call(paste, c(combos[c("id", "product","price")], sep = "_"))
rownames(model.scores)<-do.call(paste, c(model.scores[c("id", "product","price")], sep = "_"))
model.scores.ordered<-model.scores[order(match(rownames(model.scores),rownames(combos))),]

By default, the objective function is set to minimize. First I set the coefficients of the decision variables in the objective function and then change the problem to a maximization.

set.objfn(ip,model.scores.ordered$expected.profit*model.scores.ordered$likelihood.to.apply)
lp.control(ip,sense="max")

Here are two constants we will use in defining the constraints.

prod.per.person<-1
price.per.product<-1

First up, let's add in the constraints around the number of products per person.

for (c in customers$id) {
   add.constraint(ip,
                  rep(1,nrow(products)*nrow(prices)),
                  "<=",
                  prod.per.person,
                  which(combos$id==c)
   )
}

The next set of constraints will limit the number of price points per product per customer.

for (c in customers$id) {
   for (p in products$product) {
      add.constraint(ip,
                     rep(1,nrow(prices)),
                     "<=",
                     price.per.product,
                     which(combos$product==p & combos$id==c)
      )
   }
}

The next set of constraints assigns the volume limits for price points.

for (q in prices$price) {
   add.constraint(ip,
                  rep(1,nrow(customers)*nrow(products)),
                  "<=",
                  prices$volume[which(q==prices$price)],
                  which(combos$price==q)
   )
}

The next set of constraints assigns the volume limits for products.

for (p in products$product) {
   add.constraint(ip,
                  rep(1,length(which(combos$product==p))),
                  "<=",
                  products$volume[which(p==products$product)],
                  which(combos$product==p)
   )
}

Finally, let's clean up the formulation a little and assign names to the decision variables and constraints.

colnames <- character()
for (c in customers$id) {
   for (p in products$product) {
      for (q in prices$price) {
         colnames<-c(colnames,paste(c,p,q,sep="_"))
      }
   }
}
rownames <- character()
for (c in customers$id) {
   rownames<-c(rownames,paste("prod_per_cust",c,sep="_"))
}
for (c in customers$id) {
   for (p in products$product) {
      rownames<-c(rownames,paste("price_per_prod",c,p,sep="_"))
   }
}
for (q in prices$price) {
   rownames<-c(rownames,paste("price_vol",q,sep="_"))
}
for (p in products$product) {
   rownames<-c(rownames,paste("prod_vol",p,sep="_"))
}

dimnames(ip) <- list(rownames, colnames)

It's possible to write out the formulation in case you would like to review it. Note that this can become overwhelming as any of the three dimensions of the problem grows.

write.lp(ip,"modelformulation.txt",type="lp",use.names=T)

Last but not least, let's solve the problem. After solving, the following commands display the objective value, the decision variable values, and the constraint values.

solve(ip)
get.objective(ip)
get.variables(ip)

get.constraints(ip)

I'm sure there are more elegant ways of writing this code, but it remains a relatively painful way to set up an optimization problem. I'll skip the visualization part for now. I'm going to spend time writing up a Python version next and then hope to write a comparison between the three versions.
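
As a rough teaser for that comparison, here is a sketch of what the same formulation might look like in Python using the PuLP package. This is only a sketch under the assumption that PuLP is installed and behaves as I remember; the data frames above are replaced with tiny hypothetical stand-ins, and the scores are collapsed into a single value per combination.

from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, value

# Tiny hypothetical stand-ins for the customers, products, prices, and scores above.
customers = [1, 2, 3]
products = {"TRAVEL": 2, "CASH Rewards": 2, "HOTEL": 2}   # product -> volume cap
prices = {"10.99": 5}                                     # price -> volume cap
score = {(c, p, q): 10.0 + c for c in customers
         for p in products for q in prices}               # expected profit * likelihood

ip = LpProblem("offer_assignment", LpMaximize)
x = LpVariable.dicts("assign", list(score.keys()), cat=LpBinary)

# Objective: maximize the total expected value of the assignments.
ip += lpSum(score[k] * x[k] for k in score)

# At most one product per customer, and one price point per product per customer.
for c in customers:
    ip += lpSum(x[(c, p, q)] for p in products for q in prices) <= 1
    for p in products:
        ip += lpSum(x[(c, p, q)] for q in prices) <= 1

# Volume caps per price point and per product.
for q, cap in prices.items():
    ip += lpSum(x[(c, p, q)] for c in customers for p in products) <= cap
for p, cap in products.items():
    ip += lpSum(x[(c, p, q)] for c in customers for q in prices) <= cap

ip.solve()
print(value(ip.objective))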