Wednesday, November 26, 2014

ThinkUp: Personal Twitter Analytics

As a fan of various TWiT.TV shows, I learned about a website that will analyze your Twitter account. Given my curiosity around social media analytics, I figured I'd give it a shot. Although I'm not the most active Twitter user, I was curious what it could tell me and which insights it thinks other people would find interesting.

Their analysis seems to be split into a few categories:
  • Analysis of your tweets' contents
  • Analysis of responses to your tweets
  • Analysis of your friends' and followers' profiles
Overall, it's pretty interesting. It helps me manage my "brand" as it comes across on Twitter. For example, I can make sure I vary the content of my tweets rather than talking about myself in the majority of them. I also enjoy knowing what times I should tweet in order to get the best response.

Being the curious analytics person I am, there are a few other facts and insights I would like to see:
  • When are my followers most active?
  • Who else should I follow? 
  • What is the general sentiment of my tweets?
You can view my ThinkUp page for a nice example, although there are better ones out there.

Sunday, March 23, 2014

Sentiment Analysis using Python and NLTK

Keeping with the trend of using Python to analyze Twitter data, here's a sample program for analyzing tweets. The program uses a training dataset (referenced below as "training data.csv") in which tweets have already been classified as "Positive", "Neutral", or "Negative". This dataset is then used to train a Naive Bayes classifier that can score future tweets.
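
For reference, here's the rough shape I'm assuming for "training data.csv": a header row, the tweet text in the first column, and the numeric label in the second. This is an illustrative sample, not the actual file.
text,sentiment
"I love this band",1
"The concert is tonight",0
"That was the worst show I have ever seen",-1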

The first step in any program is importing the libraries it needs. We'll use the NLTK package for analyzing the text and the csv package for reading in the training dataset.
import nltk
from nltk.corpus import stopwords
import csv

This is a small list of custom stop words to eliminate from the tweets; we'll apply it alongside NLTK's English stop words further down.
customstopwords = ['band', 'they', 'them']

Next up, let's open the training dataset and parse it. I don't particularly like using '1' or '-1' to represent positive and negative, so I recode these as "positive" and "negative". I'm sure there's a more elegant way of doing this, but I found this to be simple.
ifile = open('training data.csv', 'rb')
reader = csv.reader(ifile)
header = next(reader)  # skip the header row

taggedtweets = []

# Recode the numeric labels into readable sentiment tags
for row in reader:
    txt = row[0]
    label = row[1]
    if label == '1':
        taggedtweets.append((txt, 'positive'))
    elif label == '0':
        taggedtweets.append((txt, 'neutral'))
    elif label == '-1':
        taggedtweets.append((txt, 'negative'))

ifile.close()

Now that we have our training dataset, let's create an empty list and fill it with tuples pairing each tokenized tweet with its sentiment.
tweets = []

# Lowercase and tokenize each tweet, keeping its sentiment label
for (txt, sentiment) in taggedtweets:
    words = [i.lower() for i in txt.split()]
    tweets.append((words, sentiment))

I found two functions useful for analyzing the results. The first collects all of the words across the tweets; the second orders that word list by frequency.
def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

def getwordfeatures(listoftweets):
    # In NLTK 2.x, FreqDist.keys() returns words ordered by decreasing
    # frequency; in newer NLTK versions use wordfreq.most_common() instead
    wordfreq = nltk.FreqDist(listoftweets)
    words = wordfreq.keys()
    return words

We will now use the list of tweets in the training dataset to create a comprehensive list of words, removing the English stop words (plus our custom ones) at the same time, since these words often add noise when determining the sentiment of a tweet. The function that follows will be used to identify the features within each tweet.
wordlist = [i for i in getwords(tweets) if i not in stopwords.words('english') + customstopwords]

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features

We now feed the cleaned training dataset into a classifier, which can then be used to score future tweets. We can also print some interesting information about the classifier, like the top 20 most informative features. In this case, these are the 20 terms with the strongest tendency to predict one sentiment value over the others.
training_set = nltk.classify.apply_features(feature_extractor, tweets)

classifier = nltk.NaiveBayesClassifier.train(training_set)
print "These are the top 20 most informative features\n"
print classifier.show_most_informative_features(n=20)
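
With the classifier trained, scoring a new tweet is just a matter of running it through the same feature extractor. Here's a minimal sketch using a made-up tweet:
newtweet = 'I love this band'
print classifier.classify(feature_extractor([i.lower() for i in newtweet.split()]))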

Now that we have a classifier, we can use it for multiple purposes. I think my next blog post will be about how to simultaneously improve this classifier while scoring a stream of tweets captured by a search. I'd also like to throw out some numbers on an appropriate size for a training dataset. Keep an eye out for future blog entries.

Tuesday, March 18, 2014

Downloading Twitter with Python

I've been working recently with downloading social data on my own through R. However, I kept seeing posts about how easy this is in Python, so I figured I'd give it a shot and write some code. As it turns out, it was incredibly simple to download tweets and write them out to a CSV file. There are a couple of requirements:
  1. A Twitter Development account: Free through Twitter
  2. Python Library rauth: Free through Rauth
Next, the actual code. Note that I'm not a Python expert, so feel free to leave a comment if there's a more efficient method.

Load the Python libraries.
from rauth import OAuth1Service
import string
import csv

Define a function to keep only printable characters.
def printable(s):
    return filter(lambda x: x in string.printable, s)
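
As a quick illustration (with a made-up byte string), non-printable bytes are simply dropped:
print(printable('hello \xe2\x9c\x93 world'))  # prints 'hello  world'
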
Here's a list of the fields that I want to keep. Note that a single tweet carries a lot more information, but those extra fields are often blank or missing.

fieldnames = ['handle',
              'text',
              'coordinates',
              'created_at',
              'tweet_id',
              'favorite_count',
              'retweet_count'
              ]


Initialize the CSV file to write Tweets into.

writer = csv.DictWriter(open('tweets.csv','wb'),fieldnames=fieldnames)
writer.writeheader()


Get a real consumer key & secret from https://dev.twitter.com/apps/new.

twitter = OAuth1Service(
    name='twitter',
    consumer_key='consumerkey',
    consumer_secret='consumersecret',
    request_token_url='https://api.twitter.com/oauth/request_token',
    access_token_url='https://api.twitter.com/oauth/access_token',
    authorize_url='https://api.twitter.com/oauth/authorize',
    base_url='https://api.twitter.com/1.1/')

Get a request token, then visit the authorization URL in your web browser to retrieve a PIN.

request_token, request_token_secret = twitter.get_request_token()
authorize_url = twitter.get_authorize_url(request_token)

print('Visit this URL in your browser: {url}'.format(url=authorize_url))
pin = raw_input('Enter PIN from browser: ')  # raw_input() keeps the PIN as a string on Python 2

After entering the PIN, your session starts with this section of code.

session = twitter.get_auth_session(request_token,
                                   request_token_secret,
                                   method='POST',
                                   data={'oauth_verifier': pin})

Each Twitter API method takes some parameters. This example grabs the timeline of the authenticating user, using two parameters to include retweets and limit the pull to 200 tweets.

params = {'include_rts': 1,
          'count': 200}
Now let's get the actual JSON object with the information!
r = session.get('statuses/user_timeline.json', params=params, verify=True)
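
Before parsing the response, it's worth a quick sanity check that the request succeeded. The session is built on the requests library, so the response exposes status_code and text:
if r.status_code != 200:
    raise RuntimeError('Twitter API error {0}: {1}'.format(r.status_code, r.text))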


Last but not least, let's iterate through the tweets and write the information out to the CSV. Note that there can be issues if you don't strip out non-printable characters or change the encoding of the text.

for i, tweet in enumerate(r.json(), 1):
    t = {}
    t['handle'] = printable(tweet['user']['screen_name'])
    t['text'] = printable(tweet['text'])
    t['coordinates'] = tweet['coordinates']
    t['created_at'] = printable(tweet['created_at'])
    t['tweet_id'] = printable(tweet['id_str'])
    t['favorite_count'] = tweet['favorite_count']
    t['retweet_count'] = tweet['retweet_count']

    print(u'{0}. @{1}: {2}'.format(i, t['handle'], t['text']))
    writer.writerow(t)


I liked this approach a lot better than several versions I saw using R. Next on my list is the NLTK package for natural language processing in Python. I want to do a nice comparison against some equivalent packages in R to see which is easier to use, more comprehensive, and more efficient at processing. Stay tuned for the results.