Keeping with the trend of using Python to analyze Twitter data, here's a sample program for analyzing tweets. The program relies on an initial training dataset (referenced below as "training data.csv") in which tweets have already been classified as positive, neutral, or negative. That dataset is used to train a Naive Bayes classifier, which can then be used to score future tweets.
The first step in any program is to import the libraries it needs. We'll be using the NLTK package for analyzing the text and the csv module for reading in the training dataset.
import nltk
from nltk.corpus import stopwords
import csv
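One quick note: the NLTK stop word list ships as a separate corpus, so the call to stopwords.words('english') later on will fail with a lookup error unless you've downloaded it once before:

# one-time download of the stop word corpus (skip if you already have it)
nltk.download('stopwords')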
Here's a small list of custom stop words to eliminate from the tweets.
customstopwords = ['band', 'they', 'them']
Next up, let's open the training dataset and parse it. I don't particularly like using '1', '0', or '-1' to represent sentiment, so I recode these into "positive", "neutral", and "negative". I'm sure there's a more elegant way of doing this, but I found this to be a simple one.
ifile = open('training data.csv', 'rb')
reader = csv.reader(ifile)
taggedtweets = []
rownum = 0
for row in reader:
    if rownum == 0:
        # skip the header row
        header = row
    else:
        # first column is the tweet text; the next column holds the 1/0/-1 code
        colnum = 0
        txt = ''
        for col in row:
            if colnum == 0:
                txt = col
            else:
                if col == '1':
                    taggedtweets.append((txt, 'positive'))
                elif col == '0':
                    taggedtweets.append((txt, 'neutral'))
                elif col == '-1':
                    taggedtweets.append((txt, 'negative'))
            colnum += 1
    rownum += 1
ifile.close()
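For reference, the loop above assumes the CSV has a header row, the tweet text in the first column, and the 1/0/-1 sentiment code in the second. The rows below are made up, just to show the shape:

tweet,sentiment
"loved the show last night",1
"waiting in line at the venue",0
"the sound mix was terrible",-1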
Now that we have the labeled tweets, let's build a list of tuples, each pairing a tweet's lowercased words with its sentiment.
tweets = []
for (word, sentiment) in taggedtweets:
    # split the tweet into lowercased words
    word_filter = [i.lower() for i in word.split()]
    tweets.append((word_filter, sentiment))
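To make that concrete, each entry in tweets is a tuple of the tweet's lowercased words and its label. The output below is only illustrative, since it depends on whatever your first training tweet happens to be:

print tweets[0]
# e.g. (['loved', 'the', 'show', 'last', 'night'], 'positive')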
I found two functions useful for analyzing the results. The first collects all of the words across the tagged tweets. The second orders that list of words by their frequency.
def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

def getwordfeatures(listoftweets):
    wordfreq = nltk.FreqDist(listoftweets)
    # in NLTK 2.x, keys() come back sorted by decreasing frequency
    words = wordfreq.keys()
    return words
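As a quick example of the two functions together (assuming the older NLTK this post uses, where FreqDist.keys() is sorted by decreasing frequency), this prints the ten most common words in the training tweets:

print getwordfeatures(getwords(tweets))[:10]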
We will now use the training tweets to create a comprehensive list of words. At the same time, we will remove the English stop words and the custom stop words defined earlier, since these words mostly add noise when determining the sentiment of a tweet. The function afterwards will be used for identifying the features within each tweet.
wordlist = [i for i in getwordfeatures(getwords(tweets)) if not i in stopwords.words('english')]
wordlist = [i for i in wordlist if not i in customstopwords]

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features
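To see what the extractor produces, here's a check on a made-up tweet. The returned dictionary has one contains(...) entry per word in wordlist, so only a couple of lookups are shown:

sample = feature_extractor('loved the new single'.split())
print sample.get('contains(loved)', False)   # True if 'loved' shows up anywhere in the training data
print sample.get('contains(zzz)', False)     # False for words the training data never saw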
We now feed the cleaned training dataset into a Naive Bayes classifier, which can then be used to score future tweets. We can also print some interesting information about the classifier, like its top 20 most informative features. In this case, these are the 20 terms whose presence most strongly predicts one sentiment over another.
training_set = nltk.classify.apply_features(feature_extractor, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)
print "These are the top 20 most informative features\n"
# show_most_informative_features prints its table itself, so it doesn't need to be wrapped in print
classifier.show_most_informative_features(n=20)
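With the classifier trained, scoring a new tweet is just a matter of running it through the same feature extractor. The tweet below is made up for illustration:

newtweet = 'their new album is fantastic'
feats = feature_extractor(newtweet.lower().split())
print classifier.classify(feats)    # most likely label, e.g. 'positive'
probs = classifier.prob_classify(feats)
print probs.prob('positive')        # probability the classifier assigns to 'positive'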
Now that we have a classifier, we can use it for multiple purposes. I think my next blog entry will be about how to improve this classifier while simultaneously scoring a stream of tweets caught by a search. I'd also like to throw out some numbers on an appropriate size for a training dataset. Keep an eye out for future blog entries.