Mini-Tutorial: Saving Tweets to a Database with Python and CouchDB, a free NoSQL database.
Accessing the Database
Let's begin by creating a class TweetStore for the purpose of initializing the database, writing records to the database, and reading records from the database. All the real work interacting with the CouchDB server is done via couchdb-python. To keep things simple, the class initializer will either create a new database or open an existing database if there already is one with the given name.
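import couchdb
import couchdb.design  # provides ViewDefinition, used in _create_views

class TweetStore(object):
    def __init__(self, dbname, url='http://127.0.0.1:5984/'):
        try:
            self.server = couchdb.Server(url=url)
            # Create a new database and set up its views...
            self.db = self.server.create(dbname)
            self._create_views()
        except couchdb.http.PreconditionFailed:
            # ...or open the existing database with that name.
            self.db = self.server[dbname]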
CouchDB exposes all database requests through a REST API. The default port is 5984, so we use http://127.0.0.1:5984 to specify the CouchDB server running on our local machine. When TweetStore is instantiated for a database that does not yet exist, __init__ creates the database and calls the _create_views method. Here is what that looks like.
    def _create_views(self):
        # View 1: count all tweets with a map function and a reduce function.
        count_map = 'function(doc) { emit(doc.id, 1); }'
        count_reduce = 'function(keys, values) { return sum(values); }'
        view = couchdb.design.ViewDefinition('twitter', 'count_tweets', count_map, reduce_fun=count_reduce)
        view.sync(self.db)

        # View 2: emit every tweet, keyed by its zero-padded id.
        get_tweets = 'function(doc) { emit(("0000000000000000000"+doc.id).slice(-19), doc); }'
        view = couchdb.design.ViewDefinition('twitter', 'get_tweets', get_tweets)
        view.sync(self.db)
The _create_views method initializes the database with two views. As the names imply, the count_tweets view returns a total count of tweets, and the get_tweets view returns all the stored tweet documents. The views themselves are also stored as documents inside the database. A full treatment of views is beyond the scope of this tutorial. I will just point out that count_tweets is composed of two JavaScript functions that work together to perform MapReduce. Also, views accept parameters that let you further refine a query, so get_tweets does not necessarily have to return all tweets.
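To make the MapReduce idea concrete, here is a pure-Python analogue of the count_tweets view. It is only an illustration: CouchDB runs the JavaScript functions on the server, and the helper run_view and the sample documents are hypothetical names invented for this sketch. It also ignores details of CouchDB's real reduce step, such as re-reducing partial results.

```python
# Pure-Python analogue of the count_tweets view (illustration only).

def count_map(doc):
    """Mirror of the map function: emit one (key, value) pair per document."""
    yield (doc['id'], 1)

def count_reduce(keys, values):
    """Mirror of the reduce function: add up the emitted values."""
    return sum(values)

def run_view(docs, map_fun, reduce_fun):
    """Hypothetical helper: map every document, then reduce all rows at once."""
    rows = [row for doc in docs for row in map_fun(doc)]
    keys = [key for key, _ in rows]
    values = [value for _, value in rows]
    return reduce_fun(keys, values)

# Made-up documents standing in for stored tweets.
sample_docs = [
    {'id': 101, 'text': 'first'},
    {'id': 102, 'text': 'second'},
    {'id': 103, 'text': 'third'},
]

tweet_count = run_view(sample_docs, count_map, count_reduce)
print(tweet_count)
```

Each document contributes a 1 in the map step, and the reduce step sums those values, which is why the view yields a total count.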
CouchDB requires that views emit a key along with each document. The key can be anything. Often, it is the document key doc._id. Looking at the get_tweets definition you will see that it instead emits doc.id (no underscore), the long integer id that Twitter assigns to a tweet.
So, why does get_tweets pad doc.id with 19 zeros, then slice off everything except the right-most 19 characters?
The first problem is that long integers max out at about 2^63 (19 digits), while JavaScript numbers can represent integers exactly only up to 2^53 (16 digits). The solution to this is to represent the id as a string. But, if we require that get_tweets outputs tweets sorted by id (i.e. in chronological order), a new problem arises: the id may have 1 to 19 digits, and as a string the id no longer sorts numerically but alphanumerically. ("77" incorrectly sorts after "1000".) So, the zero-padding normalizes the id to 19 characters, and alphanumeric sorting then gives the same order as numeric sorting.
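The effect of the padding is easy to check outside CouchDB. The sketch below reproduces the pad-and-slice trick in Python. The function name pad_id and the sample ids are made up for illustration.

```python
def pad_id(tweet_id, width=19):
    """Left-pad an id with zeros to a fixed width, like the get_tweets key."""
    return ('0' * width + str(tweet_id))[-width:]

# Sample ids as strings, including the largest 19-digit value (2**63 - 1).
ids = ['1000', '77', '9223372036854775807']

plain_sorted = sorted(ids)               # plain string comparison
padded_sorted = sorted(ids, key=pad_id)  # compare the zero-padded forms
numeric_sorted = sorted(ids, key=int)    # reference: true numeric order

print(plain_sorted)
print(padded_sorted)
```

Sorting the raw strings puts "1000" ahead of "77", while sorting by the padded form matches the numeric order.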
The TweetStore class is almost complete. All that is left is writing and reading. CouchDB stores each record as a document. A document is simply a JSON object with one required field, the document key, named _id. Conveniently, Twitter returns each tweet as JSON, so it goes into the document unchanged. For the document key we simply reuse the tweet's id, taken from the id_str field so that it is already a string. Here is the method that saves a tweet.
    def save_tweet(self, tw):
        # Reuse the tweet's string id as the CouchDB document key.
        tw['_id'] = tw['id_str']
        self.db.save(tw)
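Note that save_tweet adds the _id field to the dictionary it is given. If you would rather leave the caller's tweet untouched, one option is to build the document from a copy first. This is only a sketch of that idea: the helper name tweet_to_document and the sample tweet are hypothetical, and nothing here talks to CouchDB.

```python
def tweet_to_document(tweet):
    """Return a copy of the tweet with '_id' set to the tweet's string id."""
    doc = dict(tweet)  # shallow copy, so the caller's dict is not modified
    doc['_id'] = tweet['id_str']
    return doc

# A made-up, heavily trimmed tweet for illustration.
tweet = {'id': 1234567890123456789, 'id_str': '1234567890123456789', 'text': 'pizza!'}

doc = tweet_to_document(tweet)
print(doc['_id'])
```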
The last thing to expose as methods are the views for counting and getting tweets.
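    def count_tweets(self):
        # The reduced view yields a single row whose value is the total.
        for doc in self.db.view('twitter/count_tweets'):
            return doc.value
        # No rows are returned when the database holds no tweets.
        return 0

    def get_tweets(self):
        return self.db.view('twitter/get_tweets')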
Mining for Tweets
Twitter divides its API into two types of calls: REST API calls which return a result then close the connection, and Streaming API calls which keep returning results until you close the connection. We employ TwitterAPI, a lightweight Python package that supports both types of API calls.
Before you can mine tweets you must create your own Twitter application credentials. Go to apps.twitter.com, create an application, and generate your API key and access token.
Now we are ready to download tweets!
You can familiarize yourself with the many API calls, or endpoints, that Twitter offers. You will find them in Twitter's API documentation. Most endpoints have both required and optional parameters. For example, the search/tweets endpoint has one required parameter, q, that sets the word or phrase for filtering downloaded tweets. And, since it is a REST API endpoint, it downloads a finite number of tweets, which you specify with the optional count parameter.
The following example uses statuses/filter, a Streaming API endpoint that also downloads tweets. Since it works over a continuous streaming connection, it downloads tweets until you close the connection. Alternatively, you may substitute any Twitter endpoint that downloads tweets, including search/tweets. test_db is the name of the CouchDB database where the tweets are saved.
from TweetStore import TweetStore
from TwitterAPI.TwitterAPI import TwitterAPI

# Your Twitter authentication credentials...
API_KEY = 'XXX'
API_SECRET = 'XXX'
ACCESS_TOKEN = 'XXX'
ACCESS_TOKEN_SECRET = 'XXX'

storage = TweetStore('test_db')
api = TwitterAPI(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

for item in api.request('statuses/filter', {'track': 'pizza'}):
    if 'text' in item:
        # A tweet: print it and store it in the database.
        print('%s -- %s\n' % (item['user']['screen_name'], item['text']))
        storage.save_tweet(item)
    elif 'message' in item:
        # An error message from Twitter.
        print('ERROR %s: %s\n' % (item['code'], item['message']))
The parameters for the two endpoints are slightly different. With statuses/filter you specify filter words with track instead of q. Or, if you would prefer to select tweets from a specific geographic location, this endpoint provides a locations parameter as well. Just make sure you replace the XXX placeholders with your own credentials.
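As a sketch of the locations option: Twitter's documentation describes it as a comma-separated bounding box of longitude,latitude pairs, with the southwest corner first. The helper name locations_param is hypothetical, and the coordinates are an example box roughly covering San Francisco.

```python
def locations_param(sw_lon, sw_lat, ne_lon, ne_lat):
    """Build a statuses/filter parameter dict for a single bounding box.

    The order is longitude,latitude for the southwest corner, followed by
    longitude,latitude for the northeast corner.
    """
    box = ','.join(str(v) for v in (sw_lon, sw_lat, ne_lon, ne_lat))
    return {'locations': box}

# Example bounding box (approximately San Francisco).
params = locations_param(-122.75, 36.8, -121.75, 37.8)
print(params)
```

The resulting dictionary could be passed to api.request('statuses/filter', params) in place of {'track': 'pizza'}.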
If everything is working for you up to this point, you are ready to retrieve tweets from the database. But, before trying out the Python code below, you might want to see the results right away in a browser. Type this URL into your browser's address bar.
http://127.0.0.1:5984/test_db/_design/twitter/_view/get_tweets
You should see in your browser the entire contents -- meta-data and all -- of the downloaded tweets. Using Python it is just a little more work. The following code prints only the text field from those tweets to the console.
from TweetStore import TweetStore

storage = TweetStore('test_db')

# Print only the text of each stored tweet.
for doc in storage.get_tweets():
    print('%s\n' % doc.value['text'])
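Since views accept query parameters, the browser URL shown earlier can be narrowed down in the same way. The sketch below, written for Python 3, only builds such a URL and does not contact the server. The helper name view_url is hypothetical, while limit and descending are standard CouchDB view options.

```python
import json
from urllib.parse import urlencode

def view_url(dbname, design, view, base='http://127.0.0.1:5984/', **options):
    """Build the URL of a CouchDB view, with optional query parameters."""
    url = '%s%s/_design/%s/_view/%s' % (base, dbname, design, view)
    if options:
        # View option values are JSON-encoded (True becomes true, and so on).
        encoded = {name: json.dumps(value) for name, value in sorted(options.items())}
        url += '?' + urlencode(encoded)
    return url

# The five most recent tweets: highest ids first, at most five rows.
url = view_url('test_db', 'twitter', 'get_tweets', limit=5, descending=True)
print(url)
```

From Python, couchdb-python's view method accepts the same options as keyword arguments, so something like storage.db.view('twitter/get_tweets', limit=5, descending=True) should give the equivalent result.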
That's it! Try other Twitter endpoints that download tweets, such as getting a user's timeline.