Twitter is a good source for recent news, updates, and stories from across the web. Although each tweet is limited to 140 characters, it can still pack in a good deal of useful information. Twitter is especially handy for near-real-time updates on time-sensitive topics like public transportation.

Living on the Peninsula and working in San Francisco, I take Caltrain almost every day to get to work. Although it's pretty reliable, there are still circumstances that cause major delays, such as train breakdowns or accidents. In my experience, the best sources for finding out about delays are the various Twitter accounts for Caltrain. There's an official account, @Caltrain_News, as well as unofficial ones like @caltrain and @CaltrainStatus. These accounts provide up-to-date information on delays and other problems.

Instead of manually checking these Twitter accounts for delays, I decided to try checking them automatically.

Twitter pages are pretty structured as far as content goes. Each page is basically a list of tweets containing text and links. Because of this, parsing Twitter pages actually turns out to be fairly straightforward.

Using my scripting language of choice, Python, I looked at several different HTML parsing libraries. The built-in HTMLParser module seems like an obvious choice. It works well enough, but requires a decent amount of manual handling of tags. After a bit more research, I came across Beautiful Soup, which takes care of just about everything for you.
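To see why the manual approach gets tedious, here's a rough sketch of what extracting tweet text with HTMLParser alone might look like. The parser class and the sample HTML are illustrative, not real Twitter markup:

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class TweetTextParser(HTMLParser):
    """Collect the text inside <p> tags that carry the tweet-text class."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_tweet = False
        self.tweets = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; we have to track state by hand
        if tag == 'p' and 'tweet-text' in dict(attrs).get('class', ''):
            self.in_tweet = True
            self.tweets.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_tweet = False

    def handle_data(self, data):
        if self.in_tweet:
            self.tweets[-1] += data

parser = TweetTextParser()
parser.feed('<p class="js-tweet-text tweet-text">No delays today</p>')
print(parser.tweets)  # ['No delays today']
```

All that bookkeeping just to pull text out of one kind of tag is what Beautiful Soup does away with.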

With Beautiful Soup, you can easily navigate, search, and modify HTML and XML content by tag, id, or regular expression. It was easy to set up via apt-get (apt-get install python-bs4).
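To give a flavor of those search options, here's a small self-contained example. The snippet of HTML is a simplified stand-in for the real page structure shown later:

```python
import re
from bs4 import BeautifulSoup

snippet = """
<ol id="stream-items-id">
  <li class="js-stream-item">first tweet</li>
  <li class="js-stream-item">second tweet</li>
</ol>
"""
soup = BeautifulSoup(snippet, 'html.parser')

# Search by id
print(soup.find(id='stream-items-id').name)        # ol
# Search by tag name and class
print(len(soup.find_all('li', 'js-stream-item')))  # 2
# Search by regular expression on tag names
print([tag.name for tag in soup.find_all(re.compile('^li$'))])
```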

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('https://twitter.com/CaltrainStatus')
html = response.read()
soup = BeautifulSoup(html)

Looking at the Twitter page, each tweet is in its own li tag.

<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item expanding-stream-item" data-item-id="373536647133921280" id="stream-item-tweet-373536647133921280" data-item-type="tweet">
<li class="js-stream-item stream-item stream-item expanding-stream-item" data-item-id="373536102327410689" id="stream-item-tweet-373536102327410689" data-item-type="tweet">
</ol>

Using this information, we can find all li tags with the class js-stream-item.

tweets = soup.find_all('li', 'js-stream-item')

This gets all 20 tweets that are initially displayed on the page. However, each result also includes a lot of other junk. If you dig into each li tag a bit, you can find finer-grained tags that contain just the text, timestamp, and link of each tweet.

<a href="/CaltrainStatus/status/373536647133921280" class="tweet-timestamp js-permalink js-nav" title="1:04 PM - 30 Aug 13"><span class="_timestamp js-short-timestamp " data-time="1377893051" data-long-form="true">30 Aug</span></a>
<p class="js-tweet-text tweet-text"><a href="/sanfranpsychol" class="twitter-atreply pretty-link" dir="ltr"><s>@</s><b>sanfranpsychol</b></a> This report was not useful.  Please keep the guidelines (<a href="http://t.co/82mgfoqsaG" rel="nofollow" dir="ltr" data-expanded-url="http://caltrainstatus.wordpress.com/guidelines/" class="twitter-timeline-link" target="_blank" title="http://caltrainstatus.wordpress.com/guidelines/"><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">caltrainstatus.wordpress.com/guidelines/</span><span class="invisible"></span><span class="tco-ellipsis"><span class="invisible">&nbsp;</span></span></a>) in mind for future reports.</p>
<a class="details with-icn js-details" href="/CaltrainStatus/status/373536647133921280">

It's pretty simple to extract this finer-grained info with Beautiful Soup.

tweet_text = soup.find_all('p', 'js-tweet-text')
tweet_timestamps = soup.find_all('a', 'tweet-timestamp')
tweet_links = soup.find_all('a', 'js-details')

This is much better, but there is still a decent amount of HTML in there. Beautiful Soup has a solution to this problem as well. The text of the tweet is the actual text inside the p tag. The timestamp of the tweet is stored in the title attribute of its a tag, and the link to the tweet can be found in the href attribute of its a tag.
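For example, the timestamp string that Twitter puts in the title attribute, like '1:04 PM - 30 Aug 13' above, parses into a proper datetime like this:

```python
import datetime

# %I = 12-hour clock, %p = AM/PM, %b = abbreviated month, %y = 2-digit year
stamp = datetime.datetime.strptime('1:04 PM - 30 Aug 13', '%I:%M %p - %d %b %y')
print(stamp)  # 2013-08-30 13:04:00
```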

import datetime

for i in range(len(tweet_text)):
    text = tweet_text[i].get_text().encode('ascii', 'ignore')
    timestamp = datetime.datetime.strptime(tweet_timestamps[i]['title'], '%I:%M %p - %d %b %y')
    link = 'https://twitter.com' + tweet_links[i]['href']

Now we have the text, timestamp, and link for each of the last 20 tweets. All that's left is to do something with them. Since we're checking for train delays, let's scan the last 4 hours of tweets for any mention of 'northbound' or 'nb'.

for i in range(len(tweet_text)):
    text = tweet_text[i].get_text().encode('ascii', 'ignore').lower()
    timestamp = datetime.datetime.strptime(tweet_timestamps[i]['title'], '%I:%M %p - %d %b %y')
    link = 'https://twitter.com' + tweet_links[i]['href']

    if datetime.datetime.now() - timestamp < datetime.timedelta(hours=4):
        if 'northbound' in text or 'nb' in text:
            print 'Found relevant tweet'
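One caveat: a plain substring check for 'nb' will also match inside longer words ('unbelievable' contains 'nb', for instance). A word-boundary regex sidesteps that; the helper name here is hypothetical:

```python
import re

def mentions_northbound(text):
    """True if the text mentions 'northbound' or the abbreviation 'nb'
    as a whole word, avoiding false matches inside longer words."""
    return re.search(r'\b(northbound|nb)\b', text.lower()) is not None

print(mentions_northbound('NB train 233 delayed 20 min'))   # True
print(mentions_northbound('Service is unbelievably slow'))  # False
```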

That's interesting, but it would be more useful to have an email sent when such tweets are found.

import base64
import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

text = "Twitter alert for CaltrainStatus"
html = "<h1>Twitter alert for <a href='http://twitter.com/CaltrainStatus'>CaltrainStatus</a></h1>"

num_relevant_tweets = 0

for i in range(len(tweet_text)):
    tweet = tweet_text[i].get_text().encode('ascii', 'ignore').lower()
    timestamp = datetime.datetime.strptime(tweet_timestamps[i]['title'], '%I:%M %p - %d %b %y')
    link = 'https://twitter.com' + tweet_links[i]['href']

    if datetime.datetime.now() - timestamp < datetime.timedelta(hours=4):
        if 'northbound' in tweet or 'nb' in tweet:
            print 'Found relevant tweet'
            num_relevant_tweets += 1

            time_str = timestamp.strftime('%m/%d %I:%M %p')
            text += "\n\n" + time_str + " - " + tweet
            html += "<p><b>" + time_str + "</b> - " + tweet + " <a href='" + link + "'>Link</a></p>"

if num_relevant_tweets > 0:
    message = MIMEMultipart('alternative')
    message['Subject'] = 'Twitter Alert - CaltrainStatus'
    message['From'] = EMAIL_SENDER
    message['To'] = EMAIL_RECIPIENT

    part1 = MIMEText(text, 'plain')
    part2 = MIMEText(html, 'html')
    message.attach(part1)
    message.attach(part2)

    mail = smtplib.SMTP(SMTP_SERVER, SMTP_PORT)
    mail.starttls()
    mail.login(base64.b64decode(SMTP_USERNAME), SMTP_PASSWORD)
    mail.sendmail(EMAIL_SENDER, [EMAIL_RECIPIENT], message.as_string())
    mail.quit()

That's all it takes to extract information from Twitter automatically. Beautiful Soup works on any web page, not just Twitter; I've also used it to check for website updates and as a price checker for several online stores.
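The update-checking idea boils down to hashing the content you care about and comparing the hash on the next run. Here's a minimal sketch of that approach; the function name and sample HTML are illustrative:

```python
import hashlib

def content_fingerprint(html):
    """Hash the page content so later fetches can be compared cheaply."""
    return hashlib.sha256(html.encode('utf-8')).hexdigest()

old = content_fingerprint('<p>Price: $19.99</p>')
new = content_fingerprint('<p>Price: $17.99</p>')
print(old != new)  # True: the page changed
```

In practice you'd store the fingerprint between runs and, for a price checker, hash just the extracted price element rather than the whole page so layout tweaks don't trigger false alarms.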