
Saturday, March 29, 2014

Measuring Accuracy of a Naive Bayes Classifier in Python

In the last post, I built a Naive Bayes classifier that uses a training dataset to classify tweets by sentiment. One thing I didn't mention is how to measure the accuracy of that classifier. Before we can determine how large a training dataset needs to be, we first need a metric for measuring the classifier's performance.

In the previous program, I had read the tweets into a list of tuples. The first step is to change this into a list of lists, which can be done with the following list comprehension:

tweets = [list(t) for t in taggedtweets]

Now that we have a list of lists, we can iterate through the tweets and append the classifier's prediction to each one.

for t in tweets:
    # Tokenize the raw text the same way the training data was tokenized;
    # feature_extractor expects a list of words, not a raw string.
    t.append(classifier.classify(feature_extractor(t[0].lower().split())))

Each inner list now holds the tweet, its training classification, and the classifier's prediction. We can then count how many of the two classifications match and divide by the number of tweets. Since I like to view the results in the shell, I added a print statement that formats the accuracy as a percentage.

accuracy = sum(t[1]==t[2] for t in tweets)/float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)

Using the training dataset I provided, this comes out to an accuracy of 46%. Not exactly the most accurate classifier. Ideally I would compare this against another dataset, but I don't have one available at the moment. Looking at the results raises another question: is this even the best metric to use? The classify method simply chooses the classification with the highest probability. Is it worth taking into account the full probability distribution calculated by the classifier? For example, a tweet could have this probability distribution:

P(sentiment==positive)=0.000000
P(sentiment==negative)=0.620941
P(sentiment==neutral)=0.379059
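
This distribution comes from the classifier's prob_classify method. Here's a minimal sketch of how to print one, assuming the classifier and feature_extractor from the previous post; text stands in for the raw tweet text.

# Sketch: print the full probability distribution for one tweet.
# Assumes `classifier` and `feature_extractor` from the previous post;
# `text` is a placeholder for the raw tweet text.
dist = classifier.prob_classify(feature_extractor(text.lower().split()))
for label in dist.samples():
    print "P(sentiment==%s)=%f" % (label, dist.prob(label))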

In this example, the tweet was actually tagged as neutral in the training data, so under the metric above it gets marked as inaccurate. However, should the probability the model assigned to neutral count in some way? I think it should get some partial credit. Instead of counting a non-match as 0, I use the probability the model assigned to the tweet's true classification (a score of 0.379059 here). While this increases the credit given for non-matches, it also decreases the credit given for matches, since a correct prediction rarely carries a probability of 1. This changes the code slightly:

tweets = [list(t) for t in taggedtweets]

for t in tweets:
    # Extract the features once and reuse them for both calls.
    features = feature_extractor(t[0].lower().split())
    t.append(classifier.classify(features))
    # prob_classify returns the full probability distribution;
    # record the probability assigned to the true classification.
    pc = classifier.prob_classify(features)
    t.append(pc.prob(t[1]))

accuracy = sum(t[1] == t[2] for t in tweets) / float(len(tweets))
weighted_accuracy = sum(t[3] for t in tweets) / float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)
print "Classifier weighted accuracy is {:.0%}".format(weighted_accuracy)

The results aren't quite what I expected: the weighted accuracy also comes out to roughly 46%, the same as the simple accuracy. Even so, I think the weighted accuracy will be the better measurement. Using this metric, we can now work on improving the model and determining the size of the training set required to feed into it.

Sunday, March 23, 2014

Sentiment Analysis using Python and NLTK

Keeping with the trend of using Python to analyze Twitter data, here's a sample program for analyzing tweets. In short, the program uses a training dataset (referenced below as "training data.csv") in which tweets have been hand-classified as "Positive", "Neutral", or "Negative". This dataset is then used to train a Naive Bayes classifier that can score future tweets.

The first step in any program is to import the libraries it needs. We'll be using the NLTK package for analyzing the text and the csv module for reading in the training dataset.
import nltk
from nltk.corpus import stopwords
import csv

Here is a small list of custom stop words to eliminate from the tweets (one way to fold them into the filtering step is sketched after the feature extractor below).
customstopwords = ['band', 'they', 'them']

Next up, let's open the training dataset and parse it. I don't particularly like using '1' or '-1' to represent positive and negative, so I recode these as "positive" and "negative". I'm sure there's a more elegant way of doing this, but I found this to be simple; one tidier alternative is sketched after the code.
ifile = open('training data.csv', 'rb')
reader = csv.reader(ifile)

taggedtweets = []

rownum=0
for row in reader:
    if rownum == 0:
        header = row
    else:
        colnum = 0
        txt = ''
        for col in row:
            if colnum == 0:
                txt = col
            else:
                if col == '1':
                    taggedtweets.append((txt,'positive'))
                elif col == '0':
                    taggedtweets.append((txt,'neutral'))
                elif col == '-1':
                    taggedtweets.append((txt,'negative'))
            colnum+=1
    rownum+=1

ifile.close()
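
As a hedged alternative, the same parsing can be written with a label map, assuming the same layout of tweet text in the first column, a 1/0/-1 tag in the second, and a header row.
# Sketch: the same parsing with a dictionary mapping tags to labels.
# Assumes two columns (tweet text, then a 1/0/-1 tag) and a header row.
labels = {'1': 'positive', '0': 'neutral', '-1': 'negative'}

taggedtweets = []
with open('training data.csv', 'rb') as ifile:
    reader = csv.reader(ifile)
    next(reader)  # skip the header row
    for row in reader:
        txt, tag = row[0], row[1]
        if tag in labels:
            taggedtweets.append((txt, labels[tag]))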

Now that we have our training dataset, let's create an empty list and insert, for each tweet, a tuple of its lowercased words and its sentiment.
tweets = []

for (word, sentiment) in taggedtweets:
    word_filter = [i.lower() for i in word.split()]
    tweets.append((word_filter, sentiment))

I found two functions useful for analyzing the results. The first collects all of the words across a list of tweets. The second orders the list of words by their frequency.
def getwords(tweets):
    # Flatten the (words, sentiment) tuples into a single list of words.
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

def getwordfeatures(listofwords):
    # Return the words ordered by decreasing frequency.
    wordfreq = nltk.FreqDist(listofwords)
    words = wordfreq.keys()
    return words
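
As a quick sanity check, we can print the ten most frequent words. One caveat: in older NLTK releases, FreqDist.keys() returns the samples in decreasing frequency order, but in NLTK 3+ it does not, and FreqDist.most_common() is the safer call there.
# Quick check: the ten most frequent words in the training tweets.
# On NLTK 3+, use nltk.FreqDist(...).most_common(10) instead, since
# keys() is no longer frequency-ordered there.
print getwordfeatures(getwords(tweets))[:10]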

We will now use the list of tweets in the training dataset to create a comprehensive list of words. At the same time, we will remove the English stop words. These words often provide noise when determining sentiment of a tweet. The function afterwards will be used for identifying the features within each tweet. 
wordlist = [i for i in getwords(tweets) if i not in stopwords.words('english')]

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features
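
One note: the customstopwords list from the top of the post is never actually applied above. Here's a hedged sketch of one way to fold it in, which also avoids re-reading NLTK's stop word list for every word.
# Sketch: combine NLTK's stop words with the custom list into one set,
# so membership tests are fast and the custom words get filtered too.
stoplist = set(stopwords.words('english')) | set(customstopwords)
wordlist = [w for w in getwords(tweets) if w not in stoplist]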

Using the cleaned training dataset, we feed it into a classifier that can then be used to score future tweets. We can also print some interesting information about the classifier, like the top 20 most informative features: the 20 terms whose presence most strongly predicts one sentiment over another.
training_set = nltk.classify.apply_features(feature_extractor, tweets)

classifier = nltk.NaiveBayesClassifier.train(training_set)
print "These are the top 20 most informative features\n"
print classifier.show_most_informative_features(n=20)
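
With the classifier in hand, scoring a new tweet takes one line. A minimal sketch, using a made-up tweet:
# Sketch: classify a new (made-up) tweet, tokenized the same way
# as the training data.
newtweet = "I really enjoyed the show last night"
print classifier.classify(feature_extractor(newtweet.lower().split()))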

Now that we have a classifier, we can use it for multiple purposes. I think my next post will be about how to improve this classifier while simultaneously scoring a stream of tweets pulled in by a search. I'd also like to put some numbers on the appropriate size for a training dataset. Keep an eye out for future entries.