Tuesday, March 18, 2014

Downloading Twitter with Python

I've recently been downloading social data on my own through R. However, I kept seeing posts about how easy this is to do in Python, so I figured I would give it a shot and write some code. As it turns out, it was incredibly simple to download tweets and write them out to a CSV file. There are a couple of requirements:
  1. A Twitter developer account: free through Twitter
  2. The Python library rauth: free, and installable with pip install rauth
Next, the actual code. Note that I'm not a Python expert, so feel free to leave a comment if there's a more efficient method.

Load the Python libraries.
from rauth import OAuth1Service
import string
import csv

Define a function to keep only printable characters.
def printable(s):
    # Keep only printable ASCII characters; everything else is dropped
    return ''.join(filter(lambda x: x in string.printable, s))
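
As a quick sanity check (my own example, not part of the download flow), anything outside the ASCII range simply disappears rather than being replaced:

printable('café \u2603 tweet')   # returns 'caf  tweet' -- the é and the snowman are dropped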
Here's the list of fields that I want to keep. Note that a single tweet carries a lot more information than this, but many of those extra fields are often blank or missing.

fieldnames = ['handle',
              'text',
              'coordinates',
              'created_at',
              'tweet_id',
              'favorite_count',
              'retweet_count'
              ]


Initialize the CSV file to write Tweets into.

writer = csv.DictWriter(open('tweets.csv', 'w', newline=''), fieldnames=fieldnames)
writer.writeheader()


Get a real consumer key & secret from https://dev.twitter.com/apps/new.

twitter = OAuth1Service(
    name='twitter',
    consumer_key='consumerkey',
    consumer_secret='consumersecret',
    request_token_url='https://api.twitter.com/oauth/request_token',
    access_token_url='https://api.twitter.com/oauth/access_token',
    authorize_url='https://api.twitter.com/oauth/authorize',
    base_url='https://api.twitter.com/1.1/')

Get a request token, then retrieve the authorization PIN from your web browser.

request_token, request_token_secret = twitter.get_request_token()
authorize_url = twitter.get_authorize_url(request_token)

print('Visit this URL in your browser: {url}'.format(url=authorize_url))
pin = input('Enter PIN from browser: ')

After you enter the PIN, this section of code starts your session.

session = twitter.get_auth_session(request_token,
                                   request_token_secret,
                                   method='POST',
                                   data={'oauth_verifier': pin})
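
You only have to do the browser dance once per account. If you hold on to the token pair, rauth can rebuild the session directly on later runs via get_session — at least as I read the rauth docs, so verify against your version:

# Stash the negotiated token pair somewhere safe after the first run
access_token = session.access_token
access_token_secret = session.access_token_secret

# On later runs, skip the PIN step and rebuild the session from the saved pair
session = twitter.get_session((access_token, access_token_secret))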

Each Twitter API method takes its own parameters. This example grabs the timeline of the authenticating user, using two parameters to include retweets and to cap the result at 200 tweets (the maximum for a single request).

params = {'include_rts': 1,
          'count': 200}
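
One optional tweak: according to the Twitter 1.1 docs, the same endpoint accepts a screen_name parameter, so you can pull another account's public timeline instead of your own (the handle below is just a placeholder):

# Optional: point the request at another account's public timeline
params['screen_name'] = 'twitterapi'   # placeholder handle; use any public account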
Now let's get the actual JSON object with the information!
r = session.get('statuses/user_timeline.json', params=params, verify=True)
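
Since rauth rides on top of requests, r here is a standard requests response, so it's cheap to add a guard before parsing — Twitter answers with HTTP 429 when you hit the rate limit (a small defensive sketch of my own):

# Fail loudly instead of trying to parse an error payload (429 = rate limited)
if r.status_code != 200:
    raise Exception('Twitter returned status {0}: {1}'.format(r.status_code, r.text))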


Last but not least, let's iterate through the tweets and write the information out to the CSV. Note that the CSV writer can run into trouble if you don't strip out non-printable characters or otherwise handle the text encoding.

for i, tweet in enumerate(r.json(), 1):
    t = {}
    t['handle'] = printable(tweet['user']['screen_name'])
    t['text'] = printable(tweet['text'])
    t['coordinates'] = tweet['coordinates']
    t['created_at'] = printable(tweet['created_at'])
    t['tweet_id'] = printable(tweet['id_str'])
    t['favorite_count'] = tweet['favorite_count']
    t['retweet_count'] = tweet['retweet_count']

    print('{0}. @{1}: {2}'.format(i, t['handle'], t['text']))
    writer.writerow(t)
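
A single request tops out at 200 tweets, but the 1.1 API lets you page further back with max_id (up to roughly the most recent 3,200 tweets). Here's a rough sketch of the pattern — my own addition, following the Twitter docs' description of max_id, so treat it as a starting point:

# Page backwards through the timeline, 200 tweets per request
tweets = r.json()
while tweets:
    oldest_id = min(t['id'] for t in tweets)
    params['max_id'] = oldest_id - 1   # ask only for tweets strictly older than what we have
    r = session.get('statuses/user_timeline.json', params=params, verify=True)
    tweets = r.json()
    # ...run the same writing loop as above on each batch...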


I liked this approach a lot better than the several versions I saw using R. Next on my list is the NLTK package for natural language processing in Python. I want to do a nice comparison against some equivalent packages in R to see which is easier to use, more comprehensive, and more efficient. Stay tuned for the results.
