19 Mar 2012

I am on the way to launching tweetpie (a Python Twitter library built on top of requests). While the streaming API work was in progress, I tried `statuses/filter` and found that the connection stayed open for a very long time. So I decided to put the code on my production web server and analyse it.

Actual script

import requests
import json
import couchdbkit

class Tweet(couchdbkit.Document):
    user = couchdbkit.StringProperty()
    created_at = couchdbkit.StringProperty()
    text = couchdbkit.StringProperty()


def store_live_tweets():
    track = ['sachin', 'sachinism', 'Sachin', 'Sachinism', 'tendulkar',
             'Tendulkar', 'sachintendulkar', 'SachinTendulkar']
    s = couchdbkit.Server("https://username:password@username.cloudant.com")
    d = couchdbkit.Database("https://username:password@username.cloudant.com/db")
    Tweet.set_db(d)
    r = requests.post('https://stream.twitter.com/1/statuses/filter.json',
        data={'track': track}, auth=('username', 'password'))
    for line in r.iter_lines():
        if line:
            result = json.loads(line)
            tweet = Tweet(user=result['user']['name'],
                created_at=result['created_at'],
                text=result['text'])
            tweet.save()


if __name__ == '__main__':
    try:
        store_live_tweets()
    except Exception as e:
        with open('sachin_log.txt', 'a') as f:
            f.write(e.message)


The program is simple Python (even if you don't know Python it doesn't matter, it reads almost like plain English). It uses requests, json and couchdbkit. The Tweet class has three attributes, user, created_at and text, which are mapped from each fetched tweet. requests fetches the tweets from Twitter and the details are stored in CouchDB (google for more). Cloudant is used since it has a free plan and I wanted to test their service.

Why track = ['sachin']?

"When Sachin bats against Pakistan, the number of people watching the match on television is greater than Europe's population" - this has been proved.

Analysis

LAST TWEET
—————
created_at: Sun Mar 18 19:2
text: “In the defence of sachin: When I started a blog f”
user: New Globa

FIRST TWEET
—————
created_at: Sun Mar 18 14:5
text: “I feel bad that d entire nation still banks on sachin fo”
user: sandy aka

Twitter's servers are located in San Francisco, so my assumption was that the created_at field is a San Francisco timestamp.


Time difference between India and San Francisco: India is 12 hours 30 minutes ahead.


Total tweets retrieved: 8003


Total time the script ran (until the connection was closed by Twitter): almost one hour

Script started at 8.00 PM IST, which is 7.30 AM San Francisco time.
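The offset arithmetic can be checked with a quick snippet (a standalone sketch; the 12.5-hour offset is hardcoded here rather than looked up from a timezone database):

```python
from datetime import datetime, timedelta

# India (IST, UTC+5:30) is 12 hours 30 minutes ahead of
# San Francisco (PDT, UTC-7) in March 2012.
ist_start = datetime(2012, 3, 18, 20, 0)               # 8.00 PM IST
sf_start = ist_start - timedelta(hours=12, minutes=30)

print(sf_start.strftime("%I.%M %p"))                   # 07.30 AM
```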

At this point it seemed Twitter stores the user's local timestamp (the Q&A below corrects this: created_at is UTC).

Why were only 8003 tweets retrieved?

According to the docs, only 1% of real-time tweets are delivered. After 8003 tweets the program was rerun; the response status was 200 but no data was received.
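Since the stream can go silent or be closed at any time, a reconnect loop with exponential backoff is the usual remedy. This is a minimal sketch, not the script I actually ran, and the helper names are my own:

```python
import time
import requests

def next_backoff(current, cap=240):
    # Double the wait after each failed attempt, capped at `cap` seconds.
    return min(current * 2, cap)

def stream_with_backoff(url, data, auth):
    """Yield non-empty lines from the stream forever, reconnecting after
    drops (hypothetical helper; url/data/auth are whatever you pass in)."""
    backoff = 1
    while True:
        try:
            r = requests.post(url, data=data, auth=auth, stream=True)
            for line in r.iter_lines():
                if line:
                    backoff = 1          # data flowing again: reset the backoff
                    yield line
        except requests.RequestException:
            pass                         # network error: fall through to sleep
        time.sleep(backoff)              # back off before reconnecting
        backoff = next_backoff(backoff)
```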

[EDIT]

Questions:

  • Does Twitter have a time limit after which it closes an open connection?

      Answer: You can leave a connection open for as long as you like/can. Server restarts happen and you can cause yourself to get disconnected by falling behind on the stream (not being able to computationally keep up with the volume being sent to you). Make sure that your code expects disconnects and error conditions and that you back off appropriately.

  • How is the 1% of real-time tweets determined?

Answer: When using filter, you’ll get all the tweets matching your terms if the sum of tweets you would receive in a given second is less than 1% of the total content of the public firehose. If all the tweets about “sachin” made up less than 1% of the total, you’d get all the tweets. If “sachin” became popular enough to account for more than 1% of the firehose total, you’d get the tweets up to that 1% along with rate limit messages telling you how many you missed.

  • What time stamp is stored for created_at in tweet?

Answer: The created_at timestamp on tweets is in UTC.
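UTC also matches the data above: the script started at 14.30 UTC, and the first tweet's created_at begins with 14:5. The timestamp can be parsed with the standard library (the full sample string below is invented for illustration, since my logs only kept a truncated prefix):

```python
from datetime import datetime

# created_at arrives as e.g. "Sun Mar 18 14:51:37 +0000 2012" (always UTC);
# the seconds in this sample are made up for the example.
raw = "Sun Mar 18 14:51:37 +0000 2012"
created_at = datetime.strptime(raw, "%a %b %d %H:%M:%S +0000 %Y")
print(created_at)  # 2012-03-18 14:51:37
```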

Thanks to http://twitter.com/episod

Next Research:

  • Continue the same test with time bound(tweets between time range).

After reading the `j2labs` post on the same subject - http://j2labs.tumblr.com/post/6929393728/a-twitter-nozzle-class - and some feedback from the reddit community, I modified my code (https://gist.github.com/2130913):

import requests
import time
import json
import zmq
from multiprocessing import Process
import couchdbkit


s = couchdbkit.Server("https://usr:passwd@usr.cloudant.com")
db = couchdbkit.Database("https://usr:passwd@usr.cloudant.com/db")

def worker():
    # Pull tweets off the ZeroMQ pipe and persist them to CouchDB,
    # so slow database writes never block the streaming connection.
    context = zmq.Context()
    work_receiver = context.socket(zmq.PULL)
    work_receiver.connect("tcp://127.0.0.1:5557")
    while True:
        d = json.loads(work_receiver.recv())
        try:
            db.save_doc(doc=d)
        except Exception as e:
            with open('sachin_omq_log.txt', 'a') as f:
                msg = " ".join([e.message, " ", time.ctime(), "\n"])
                f.write(msg)


def store_live_tweets():
    # Stream tweets from Twitter and push each raw line to the worker.
    Process(target=worker, args=()).start()
    context = zmq.Context()
    send = context.socket(zmq.PUSH)
    send.bind("tcp://127.0.0.1:5557")
    track = ['sachin', 'sachinism', 'Sachin', 'Sachinism', 'tendulkar',
             'Tendulkar', 'sachintendulkar', 'SachinTendulkar']
    r = requests.post('https://stream.twitter.com/1/statuses/filter.json',
        data={'track': track}, auth=('usr', 'passwd'))
    for line in r.iter_lines():
        if line:
            print line
            send.send(line)

if __name__ == '__main__':
    try:
        store_live_tweets()
    except Exception as e:
        with open('sachin_omq_log.txt', 'a') as f:
            msg = " ".join([e.message, " ", time.ctime(), "\n"])
            f.write(msg)




After the modification, the Cloudant account received between 1.6 and 3.2 requests/sec. The test was carried out on a WebFaction shared hosting plan.
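For comparison, the first run's numbers land in the same ballpark:

```python
# 8003 tweets over roughly one hour of streaming (the first run).
tweets = 8003
seconds = 60 * 60
rate = tweets / float(seconds)   # float division works in Python 2 and 3
print(round(rate, 2))            # ~2.22 tweets/sec
```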

I would be glad if anyone could run the same test on AWS and Rackspace and share at least 10 minutes of results.