Next, the actual code. Note that I'm not a Python expert, so feel free to leave a comment if you know a more efficient way to do any of this.
Load the Python libraries.
from rauth import OAuth1Service
import string
import csv
Define a function to keep only printable characters.
def printable(s):
    return ''.join(filter(lambda x: x in string.printable, s))
Here's the list of fields that I want to keep. Note that there is a lot more information in a single Tweet, but many of the other fields are often blank or missing.
fieldnames = ['handle',
'text',
'coordinates',
'created_at',
'tweet_id',
'favorite_count',
'retweet_count'
]
Initialize the CSV file to write Tweets into. (This is Python 2 code; on Python 3 the csv module wants the file opened with 'w' and newline='' instead of 'wb'.)
writer = csv.DictWriter(open('tweets.csv', 'wb'), fieldnames=fieldnames)
writer.writeheader()
Get a real consumer key & secret from https://dev.twitter.com/apps/new.
twitter = OAuth1Service(
name='twitter',
consumer_key='consumerkey',
consumer_secret='consumersecret',
request_token_url='https://api.twitter.com/oauth/request_token',
access_token_url='https://api.twitter.com/oauth/access_token',
authorize_url='https://api.twitter.com/oauth/authorize',
base_url='https://api.twitter.com/1.1/')
Initialize the request token, then authorize in your web browser to retrieve the PIN.
request_token, request_token_secret = twitter.get_request_token()
authorize_url = twitter.get_authorize_url(request_token)
print('Visit this URL in your browser: {url}'.format(url=authorize_url))
pin = raw_input('Enter PIN from browser: ')  # raw_input, since plain input() would eval the PIN on Python 2
After you enter the PIN, this section of code starts your authenticated session.
session = twitter.get_auth_session(request_token,
request_token_secret,
method='POST',
data={'oauth_verifier': pin})
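One convenience worth mentioning (this isn't part of the flow above, so treat it as a sketch): rauth can rebuild a session from a saved access token pair with get_session(), which saves you from repeating the PIN dance on every run.
# Sketch: stash the access token pair after the first authorization.
access_token = session.access_token
access_token_secret = session.access_token_secret
# ...then in a later run, skip the PIN flow entirely:
session = twitter.get_session((access_token, access_token_secret))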
Each Twitter API method takes its own parameters. This example grabs the timeline of the authenticating user, with two parameters: one to include retweets and one to request 200 tweets, the maximum a single call will return.
params = {'include_rts': 1,
'count': 200}
Now let's get the actual JSON object with the information!
r = session.get('statuses/user_timeline.json', params=params, verify=True)
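If you want to page further back than those 200 tweets (the API serves roughly the most recent 3,200), the usual trick is the max_id parameter: ask again with max_id set just below the oldest id you've seen. Here's a rough sketch of that loop; the all_tweets and page names are mine, not part of the API.
all_tweets = r.json()
page = all_tweets
while page:
    # max_id is inclusive, so step just below the oldest tweet seen so far
    params['max_id'] = min(tw['id'] for tw in page) - 1
    page = session.get('statuses/user_timeline.json',
                       params=params, verify=True).json()
    all_tweets.extend(page)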
Last but not least, let's iterate through the tweets and write each one out to the CSV. Note that Python 2's csv writer can run into trouble if you don't strip out non-printable characters or fix up the encoding of the text first.
for i, tweet in enumerate(r.json(), 1):
    t = {}
    t['handle'] = printable(tweet['user']['screen_name'])
    t['text'] = printable(tweet['text'])
    t['coordinates'] = tweet['coordinates']
    t['created_at'] = printable(tweet['created_at'])
    t['tweet_id'] = printable(tweet['id_str'])
    t['favorite_count'] = tweet['favorite_count']
    t['retweet_count'] = tweet['retweet_count']
    print(u'{0}. @{1}: {2}'.format(i, t['handle'], t['text']))
    writer.writerow(t)
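If you'd rather keep accents and emoji instead of stripping them, Python 2's csv writer is happy with UTF-8 encoded byte strings. Here's a sketch of a helper you could use in place of printable(); utf8 is my own name for it, not something from the libraries above.
def utf8(s):
    # Encode unicode to UTF-8 bytes for Python 2's csv module;
    # pass non-string values (counts, coordinates) through untouched.
    return s.encode('utf-8') if isinstance(s, unicode) else s
t['text'] = utf8(tweet['text'])  # keeps accents and emoji intact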
I liked this approach a lot better than several versions I saw using R. Next on my list is to look at the NLTK package for natural language processing in Python. I want to do a nice comparison against some equivalent packages in R to see which is easier to use, more comprehensive, and more efficient in processing. Stay tuned for the results.