It's not easy to determine whether someone's worth following on Twitter, but Dave takes on the task with a shell script that extracts account stats for a given Twitter account, then calculates their follow value. He also explains the philosophy behind the project and finds that Twitter has some weirdnesses in its HTML that make parsing the results interesting.
So, you've been using Twitter since it was all about the fail whale and not about the corporate sponsorships and back-end analytics. Me too. The problem is, Twitter has become crazier and harder to understand as it has gained its millions of users and its utility ecosystem has expanded and contracted variously.
One thing that's always interested me, though, is whether there's a way to calculate a numeric value for given Twitter users based on both their visibility and engagement. How do you measure those? Visibility could be calculated simply by looking at how many followers someone has, but most Twitter users follow lots of random people in the hope of being followed back, so that they can have lots of followers.
This behavior is based on what Dr. Robert Cialdini calls the Principle of Reciprocity in his brilliant book Influence, wherein he observes that if someone does something for you, you feel an inherent obligation to return the favor. Think Hare Krishnas at the airport giving you a flower before they ask for a donation. Think of the self-appointed pundits and gurus telling you their rules of netiquette, or of your own reactions: “if this person's following me on Twitter, I should follow them back. It's only polite, after all.”
The upshot is that if you just look at how many followers someone has without also checking how many people they follow, you can be duped into thinking something along the lines of “25,000 followers? Impressive.” without ever noticing that the person follows 30,000 people in turn.
One way to tell these types of Twitter users apart, therefore, is to calculate the ratio of followers to following. That's half the calculation.
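As a rough sketch of that ratio, the following uses awk for the division, since plain shell arithmetic is integer-only. The function name follow_ratio is made up for illustration and isn't part of the script developed in this column. The numbers are the “25,000 followers” example and the @DaveTaylor counts quoted later on.

```shell
#!/bin/sh
# follow_ratio FOLLOWERS FOLLOWING
# Prints followers divided by following, to two decimal places.
# (Hypothetical helper name; not from the original script.)
follow_ratio() {
  awk -v followers="$1" -v following="$2" 'BEGIN {
    if (following == 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.2f\n", followers / following
  }'
}

follow_ratio 25000 30000   # prints 0.83
follow_ratio 10187 566     # prints 18.00
```

A ratio below 1 means the account follows more people than follow it back. A ratio well above 1 suggests the followers weren't gained purely through reciprocity.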
Engagement is trickier to calculate, but if you examine someone's Twitter stream, you can separate out broadcast messages from those that are either an at-reply (as in “@DaveTaylor nice column!”) or a retweet.
It's another ratio. If the majority of tweets from someone are broadcast tweets, their level of engagement is low, whereas a Twitter user whose messages almost always are responses is high on the engagement scale.
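Here's one way that second ratio might be sketched, assuming the tweets are already available as plain text, one per line. That's a big assumption, because nothing so far fetches them. The detection rules are also assumptions: a reply is taken to start with “@” and a retweet with “RT @”. The function name engagement_pct and the sample tweets (other than the at-reply quoted above) are invented for illustration.

```shell
#!/bin/sh
# engagement_pct: reads one tweet per line on stdin and prints the
# percentage (truncated to a whole number) that are replies or retweets.
# Assumptions, not from the column: a reply starts with "@" and a
# retweet starts with "RT @".
engagement_pct() {
  awk '
    /^@/ || /^RT @/ { engaged++ }
    END {
      if (NR == 0) { print "n/a"; exit }   # no tweets, no ratio
      printf "%d%%\n", (100 * engaged) / NR
    }'
}

# Made-up sample stream: two of the four tweets are responses.
engagement_pct <<'EOF'
@DaveTaylor nice column!
New blog post is up
RT @linuxjournal new issue is out
Coffee time
EOF
```

With this sample the function prints 50%, because one at-reply and one retweet count as engagement and the other two tweets are broadcasts.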
One more criterion: gross numbers. How many followers does someone have overall? How many tweets has the user sent? An account with high engagement but only seven tweets in the last six months is less interesting than one with lower engagement but an average of 20 tweets a day. Agreed?
So, how do we calculate these sorts of figures?
Twitter offers up quite a bit of information for its public profiles (and just about every Twitter profile is public), including the key stats we want to start with: follower count and following count.
To get them, we don't even need to negotiate the OAuth login. We can just use curl from the command line:
$ curl -s http://twitter.com/davetaylor | grep 'stats_count numeric'
<span id="following_count" class="stats_count numeric">566 </span>
<span id="follower_count" class="stats_count numeric">10,187 </span>
<span id="lists_count" class="stats_count numeric">790 </span>
You can see that my Twitter account, @DaveTaylor, has 10,187 followers, while I'm following 566 people. The “list” figure suggests popularity too, but since most Twitter users I know eschew lists, let's just ignore that for now.
We'd also like to grab the raw tweet count to see if it's an account that actually has sent some tweets or is dormant. Examining the HTML closely reveals that although the previous items are put into the class stats_count, the number of tweets sent is put in a similar, but not quite identical, class called stat_count. Typo? Maybe. Meanwhile, it forces us to tweak our regular expression:
$ curl -s http://twitter.com/davetaylor | grep -E '(stats_count|stat_count)'
<span id="following_count" class="stats_count numeric">566 </span>
<span id="follower_count" class="stats_count numeric">10,187 </span>
<span id="lists_count" class="stats_count numeric">790 </span>
<li id="profile_tab"><a href="/DaveTaylor" accesskey="u"> <span id="update_count" class="stat_count">30,055</span> <span>Tweets</span></a></li>
It's a bit ugly, but it's not much work to extract and reformat the data in a script. The challenge really is just to strip away all the HTML junk, because once we've used it to select the lines in question, we don't actually need it any more.
My first attempt is this:
$ echo "<test me>hello<test 2>" | sed 's/<.*>/-/g'
-
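That didn't work. The .* is greedy, so the pattern matched everything from the first < to the last >, swallowing “hello” along with the tags. We want “hello” as the result, because we don't want to lose the non-HTML values. Here's my second try: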
$ echo "<test me>hello<test 2>" | sed 's/<[^>]*>/-/'
-hello<test 2>
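Aha! That's what we need: a regular expression that basically says “a < followed by as many characters as are present other than the '>' character, and then the closing >”. Because [^>]* can't run past the first >, each match stops at the end of a single tag.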
To strip all the HTML, make it a global search and replace by appending a “g” to the sed statement, and empty out the replacement string so the tags are deleted rather than turned into dashes:
$ echo "<test me>hello<test 2>" | sed 's/<[^>]*>//g'
hello
That's great. Now we can turn the mess of results into something hopefully a bit more useful:
$ curl -s http://twitter.com/davetaylor | grep -E '(stats_count|stat_count)' | sed 's/<[^>]*>/ /g'
566 10,187 790 30,055 Tweets
We still need to get rid of those pesky commas, but that's a small addition to the sed statement, right? Let's use this instead: sed 's/<[^>]*>/ /g;s/,//g'.
The results are ready to be parsed:
566 10187 790 30055 Tweets
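Since the live page can change at any time, it can be handy to rehearse the pipeline against a saved copy of the HTML. The sketch below feeds the sample lines shown earlier through the same grep and sed. The function names sample_html and extract_stats are invented for illustration, and the line breaks in the sample are a guess at how the page was laid out.

```shell
#!/bin/sh
# sample_html: the lines from the curl output shown earlier in this column.
# (Where the original page broke its lines is an assumption.)
sample_html() {
cat <<'EOF'
<span id="following_count" class="stats_count numeric">566 </span>
<span id="follower_count" class="stats_count numeric">10,187 </span>
<span id="lists_count" class="stats_count numeric">790 </span>
<li id="profile_tab"><a href="/DaveTaylor" accesskey="u"> <span id="update_count" class="stat_count">30,055</span> <span>Tweets</span></a></li>
EOF
}

# extract_stats: the same grep and sed as above, reading HTML on stdin.
extract_stats() {
  grep -E '(stats_count|stat_count)' | sed 's/<[^>]*>/ /g;s/,//g'
}

stats=$(sample_html | extract_stats)
echo $stats    # unquoted on purpose: the shell collapses runs of whitespace
```

The result is the same single line of figures, which still has to be split into individual fields.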
That can be done with one of my favorite scripting commands, cut. The wrinkle, however, is that when we drop this into a shell script, the results are a bit surprising if we look at my @FilmBuzz movie news Twitter profile. First, the script snippet:
stats="$(curl -s $twitterurl/$username | grep -E '(stats_count|stat_count)' | sed 's/<[^>]*>/ /g;s/,//g')"
echo $stats
And, the results:
$ ./tstats.sh filmbuzz
#side .stats a:hover span.stats_count #side .stats a span.stats_count 1701 4529 303 13034 Tweets
Although this just impacts the field number of the cut command, it turns out to be trickier than it first seems. I run the same tstats script against @DaveTaylor, and look what happens:
$ ./tstats.sh davetaylor
566 10187 790 30055 Tweets
Different output. Jeez—there's always something.
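To make the problem concrete, here's a small sketch that runs cut over the two output lines above, pasted in as literal strings. The last part shows one possible workaround, which is to throw away every token that isn't purely digits. The helper name numbers_only is invented for illustration, and this isn't necessarily the approach the next column takes.

```shell
#!/bin/sh
# The two output lines shown above, as literal strings.
dave='566 10187 790 30055 Tweets'
film='#side .stats a:hover span.stats_count #side .stats a span.stats_count 1701 4529 303 13034 Tweets'

# With fixed field positions, field 2 is the follower count for one
# account and a stray piece of CSS for the other.
echo "$dave" | cut -d' ' -f2    # 10187
echo "$film" | cut -d' ' -f2    # .stats

# numbers_only: hypothetical helper that keeps only all-digit tokens,
# then joins them back onto one space-separated line.
numbers_only() {
  tr ' ' '\n' | grep -E '^[0-9]+$' | paste -s -d ' ' -
}

echo "$film" | numbers_only | cut -d' ' -f2    # 4529
echo "$dave" | numbers_only | cut -d' ' -f2    # 10187
```

Filtering first means the follower count lands in field 2 for both accounts, at the cost of assuming that no unwanted token is ever a bare number.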
Let's stop here with this small dilemma, and next month we'll pick up the parsing challenge and then proceed to calculating some numeric scores for Twitter users. Stay tuned!