Work the Shell

Still Parsing the Twitter Stream

Dave Taylor

Issue #191, March 2010

How do you keep track of which tweets you've already answered?

Last month, as you'll hopefully remember, we took the big step in our Twitter stream parsing program of actually having it parse the incoming messages and strip out quotes and other HTML noise. I also republished the send-tweet script, which we'll use this month.

The biggest challenge we face with the tweet-parser is knowing which messages we've already answered and which are new since the last time the program was run. The solution? To go back and tweak the original script a bit. It turns out that each and every tweet has a unique ID value, as you can see here:


<id>2541771</id>

You'll recall that early in the script we have this grep command:


grep -E '(<screen_name>|<text>)' | \

Simple enough. We'll tweak it to include |<id> and grab that value too. Except, of course, it's not that simple. It turns out that two <id> strings show up in the XML data from Twitter: one that's the ID of the account sending the message, and another that's the ID of the message itself—both conveniently labeled the same. Ugh!
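To see the duplicate tag in action, here is a minimal sketch run against a hand-made fragment. The element layout and the values are assumptions for illustration, not captured Twitter output; only the tag names come from the discussion above.

```shell
# Hypothetical, trimmed-down status entry; a real feed carries many more fields.
sample='<status>
  <id>6507034680</id>
  <text>@DaveTaylor hours</text>
  <user>
    <id>12345</id>
    <screen_name>KateC</screen_name>
  </user>
</status>'

# Same filter as the script, with <id> added to the alternation.
matches=$(printf '%s\n' "$sample" | grep -E '(<screen_name>|<text>|<id>)')
printf '%s\n' "$matches"

# Count the matching lines: four per tweet, two of them <id>.
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
echo "lines per tweet: $count"
```

The count of four is what forces the change to the awk step described next.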

Timestamps and Tricky XML

I can kvetch and wish Twitter would fix its XML to have USERID or similar, but what's the point? They have the same thing with the overloaded <created_at> tag too, so we're going to have to bite the bullet and accept that we are now grabbing four data fields from the XML feed, only three of which we care about.

Once we know that we're going to have four lines of output, cyclically, we simply can decide which of those are actually important and tweak them in the awk statement:


$curl -u "davetaylor:$pw" $inurl | \
  grep -E '(<screen_name>|<text>|<id>)' | \
  sed 's/@DaveTaylor //;s/  <text>//;s/<\/text>//' | \
  sed 's/ *<screen_name>//;s/<\/screen_name>//' | \
  sed 's/ *<id>//;s/<\/id>//' | \
  awk '{ if (NR % 4 == 0) {
           printf ("name=%s; ", $0) }
         else if (NR % 4 == 1) {
           printf("id=%s; ",$0) }
         else if (NR % 4 == 2) {
           print "msg=\"" $0 "\"" }
       }' > $temp

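That's a pretty complicated sequence, so let's look at the awk conditional statement more closely. We have four input records (lines) per tweet that we're stepping through. The value of NR is the number of records read so far, starting at 1, so NR mod 4 cycles through 1, 2, 3 and 0. The line where it equals 0 is printed as the name value, the line where it equals 1 as the id, and the line where it equals 2 as the msg. The line where it equals 3 matches none of the tests, so it is silently dropped; that's the extra <id> we don't care about.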
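If the modulo cycling is hard to picture, this small sketch prints which label the conditional would assign to each record. The function name nr_labels and the eight placeholder input lines are made up for the demonstration; the tests themselves mirror the ones in the script.

```shell
# Hypothetical helper: eight placeholder lines stand in for two tweets'
# worth of grep output, and awk reports how each one would be labeled.
nr_labels() {
  seq 1 8 | awk '{
    r = NR % 4
    if (r == 0)      label = "name"
    else if (r == 1) label = "id"
    else if (r == 2) label = "msg"
    else             label = "(skipped)"
    printf("NR=%d  NR%%4=%d  -> %s\n", NR, r, label)
  }'
}

nr_labels
```

Every fourth record lands on the name branch, and every record with remainder 3 falls through all three tests.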

Did you see that two lines have printf, and the third uses a simpler print statement? Since we want each set of variables on a separate line, we use the print statement, because it automatically appends a newline to the output. Of course, the same effect could be achieved by including \n in the format string passed to printf. Example output follows:

name=thattalldude; id=6507045947; msg="Rates?"
name=KateC; id=6507034680; msg="hours"
name=pbarbanes; id=6507033698; msg="thanks"
name=jodie_nodes; id=6507022063; msg=" $$?"
name=KateC; id=6507019757; msg="price"
name=tarahn; id=6507008559; msg="impact"
name=GaryH2UK; id=6507004771; msg="directions"
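The print versus printf point is easy to check in isolation. A minimal sketch, with a made-up id value of 1 and the variable names with_print and with_printf chosen only for this demonstration:

```shell
# print appends the newline itself; printf needs an explicit \n in the format.
with_print=$(echo hours | awk '{ printf("id=%s; ", 1); print "msg=\"" $0 "\"" }')
with_printf=$(echo hours | awk '{ printf("id=%s; ", 1); printf("msg=\"%s\"\n", $0) }')

echo "$with_print"
echo "$with_printf"
```

Both commands should emit the same single line, so the choice between them is a matter of taste.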

Again, we're going to hand these, line by line, to the eval statement to set the three variables: name, id and msg. Then, it's a simple parsing problem, comparing msg to the known queries we have. Basically, it's what we did last month, except this time, every single tweet also has a unique ID value associated with it.
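As a reminder of how that hand-off works, here is a minimal sketch of the read-and-eval loop. It uses two of the sample lines shown above and a throwaway temp file, so it is an illustration under those assumptions rather than the exact loop from the full script.

```shell
# Two sample lines in the format the awk step produces.
temp=$(mktemp)
cat > "$temp" <<'EOF'
name=KateC; id=6507034680; msg="hours"
name=tarahn; id=6507008559; msg="impact"
EOF

while read -r line ; do
  # eval runs the line as shell code, which sets name, id and msg.
  # Caution: eval also expands things like $(...) and $VAR inside msg,
  # so tweet text should be sanitized before it gets this far.
  eval "$line"
  echo "name=$name id=$id msg=$msg"
done < "$temp"

rm -f "$temp"
```

Because the loop reads from a redirect rather than a pipe, the variables are still set in the current shell after it finishes.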

A typical test might now look like this:

if [ "$msg" == "hours" ] ; then
  echo "@$name asked what our hours are in tweet $id"
fi

Nice! It's simple, straightforward and well worth the preprocessing hoops we've jumped through.

Working with IDs Included

Indeed, I run that against my Twitter stream (after asking people to send me sample queries), and here's what I see:

@TheNose100 asked what our hours are in tweet 6507436100
@crepeauf asked what our hours are in tweet 6507187325
@jdscott asked what our hours are in tweet 6507087136
@KateC asked what our hours are in tweet 6507034680
@inspiremetoday asked what our hours are in tweet 6506966654

I bet you can see how to proceed from here. We write static responses, calculate values as needed and use send-tweet to respond to the user:

$tweet "@$name our hours are Mon-Fri 9-5, Sat 10-4."

For fun, I'll let people send the query “time” and get the current output of the date command too, just to demonstrate how that might work. Here's the code block:

if [ "$msg" == "time" ] ; then
  echo "@$name asked for the time in tweet $id"
  $tweet "@$name the local time on our server is $(date)"
fi
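Once there are more than a couple of queries, a case statement can be easier to extend than a chain of if tests. This is only a sketch of that idea: the respond function is a hypothetical name, and tweet is set to a plain echo so the example runs without sending anything.

```shell
# Stand-in for the real send-tweet script, so the sketch can run offline.
tweet="echo SEND:"

# Hypothetical dispatcher: $1 = screen name, $2 = message text.
respond() {
  case "$2" in
    hours ) $tweet "@$1 our hours are Mon-Fri 9-5, Sat 10-4." ;;
    time  ) $tweet "@$1 the local time on our server is $(date)" ;;
    *     ) echo "no canned answer for \"$2\" from @$1" ;;
  esac
}

respond KateC hours
respond tarahn impact
```

Adding a new canned answer then means adding one more pattern line.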

Great. Got it all, except for where we started out. How do you track which tweets you've already answered?

But What Have We Already Seen?

The answer isn't that hard. The stream is newest to oldest, and the message ID values are assigned sequentially by the server, so all we need to do is cache the most recent message ID we've seen after we have answered all queries. Then, on subsequent invocations, we compare each query ID to the cached one as we work down the stream. Anything that appears before the cached ID is newer and needs an answer; once we reach the cached ID, it and everything after it have already been handled. Like this:

if [ "$id" == "$previouslatestid" -o "${answered:-0}" -eq 1 ] ; then
  echo "already answered query \"$msg\" from $name: skipped"
  answered=1
else
  ...
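Because the IDs are sequential, the same decision could also be made with a numeric comparison, which would keep working even if the cached tweet has dropped out of the stream. A minimal sketch, assuming bash (whose -gt test handles integers this large) and a hypothetical helper named is_new; the IDs are taken from the sample output earlier:

```shell
# Hypothetical helper: succeeds when tweet ID $1 is newer than cached ID $2.
is_new() {
  [ "$1" -gt "$2" ]
}

previouslatestid=6507034680
for id in 6507045947 6507034680 6507033698 ; do
  if is_new "$id" "$previouslatestid" ; then
    echo "$id: new, needs an answer"
  else
    echo "$id: already answered"
  fi
done
```

This is an alternative to the equality test, not what the script in this column does.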

The previouslatestid is what's cached. We'll also capture the most recent ID of the current wave of queries like this:

if [ -z "$latestid" ] ; then
  latestid=$id        # store most recent ID
fi

Of course, there are a few more steps. We need to grab the cached value at the beginning of the script:

if [ -f "$lastidcache" ] ; then
  previouslatestid="$(cat "$lastidcache")"
else
  previouslatestid="0"
fi

And, we need to save it at the end:

echo "$latestid" > "$lastidcache"
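Here is how those pieces might fit together in one pass. It is a sketch under several assumptions: the cache lives in a temp file, the stream is three sample lines from earlier in the column, an echo stands in for the real responses, and newcount is a counter added only so the result is visible.

```shell
lastidcache=$(mktemp)               # hypothetical cache file location
echo 6507034680 > "$lastidcache"    # pretend an earlier run stopped here

# Sample stream, newest first, in the awk step's output format.
stream='name=thattalldude; id=6507045947; msg="Rates?"
name=KateC; id=6507034680; msg="hours"
name=pbarbanes; id=6507033698; msg="thanks"'

previouslatestid=$(cat "$lastidcache")
answered=0
latestid=""
newcount=0

while read -r line ; do
  eval "$line"
  # The first ID seen is the newest; remember it for the next run.
  [ -z "$latestid" ] && latestid=$id
  if [ "$id" = "$previouslatestid" ] || [ "$answered" -eq 1 ] ; then
    echo "already answered query \"$msg\" from $name: skipped"
    answered=1
  else
    echo "new query \"$msg\" from $name"
    newcount=$((newcount + 1))
  fi
done <<EOF
$stream
EOF

echo "$latestid" > "$lastidcache"
echo "answered $newcount new; cache now holds $(cat "$lastidcache")"
```

With this sample data only the first tweet counts as new, and the cache moves forward to its ID.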

That's it. I've run out of space, but the full script is available at ftp.linuxjournal.com/pub/lj/listings/issue191/10695.tgz. Next month, we'll polish it a bit and see what fun we can have with a tweetbot!