Work the Shell

More Twitter User Stats

Dave Taylor

Issue #213, January 2012

Can a formula quantify whether someone is worth following on Twitter? Dave tackles this complex subject with a nifty shell script and some math.

In my last article, I started a script that identified user stats for Twitter accounts, with the intention of being able to analyze those stats and come up with an engagement or popularity score. Yeah, it's kind of like Klout, but without the privacy implications or cross-platform sniffing.

The motivation behind creating the script is to have a tool that lets you quickly differentiate between Twitter users who are spammers or bots and those who are influencers—for example, users who have more followers than people they ostensibly follow.

With surprisingly little work, I created a short script snippet that extracted basic Twitter figures: followers, following, number of tweets and the number of Twitter lists that include the Twitter account in question:

stats="$(curl -s $twitterurl/$username |
  grep -E '(stats_count|stat_count)' |
  sed 's/<[^>]*>/ /g;s/,//g')"
echo $stats

The problem is, I ran out of space after realizing that some accounts were presented in one format while others were in another, as shown in these two differing results:

$ sh tstats.sh gofatherhood
3 0 0 0 Tweets
$ sh tstats.sh filmbuzz
#side .stats a:hover span.stats_count #side .stats a span.stats_count 1698 4664 301 13258 Tweets

That's not good, so let's start by fixing it.

Filters Rely on Low-Level Page Format

The problem, of course, is that my complicated grep sequence relies on the page being formatted in a very specific manner. If Twitter changes it even the slightest bit, things might well require updates and tweaks. Next time, we'll just get a supercomputer and some AI. For now though, I'll make the—rash—assumption that I've found both possible output formats between @FilmBuzz and my new @GoFatherhood Twitter accounts (the former tied to my film blog, www.DaveOnFilm.com, and the latter tied to my new dad blog www.GoFatherhood.com, in case you're curious).

To normalize the output, I can simply filter out the “.stats” line:

twitterurl="http://twitter.com"   # no trailing slash 
if [ $# -ne 1 ] ; then
  echo "Usage: $0 TWITTERID"
  exit 1
fi 
username="$1" 
stats="$(curl -s $twitterurl/$username |
  grep -E '(stats_count|stat_count)' |
  sed 's/<[^>]*>/ /g;s/,//g' | grep -v '\.stats')"
echo $stats

The result is exactly as desired now:

$ sh tstats.sh filmbuzz
1698 4664 301 13259 Tweets

The next logical step is to identify each of those fields, so we can do some basic calculations and screening.

With a set of numbers separated by spaces, there are a couple ways to pull them into variables, but my favorite is to use sed to turn the set of values into a name=value sequence, as illustrated in this simple example:

eval $(echo 1 2 3 | sed 's/^/a=/;s/ /;b=/;s/ /;c=/')

The intermediate output of this is a=1;b=2;c=3, and when it's evaluated by the shell (the eval and the $() subshell working together), the result is that there are now three new variables in the shell, a, b and c, with the values 1, 2 and 3, respectively:

$ echo b = $b
b = 2

To apply this in our Twitter script, I'll just make the smallest tweaks:

eval $(echo $stats | cut -d\  -f1-4 |
  sed 's/^/fwing=/;s/ /;fwers=/;s/ /;lists=/;s/ /;tweets=/')
echo "$1 has sent $tweets tweets and follows $fwing, 
has $fwers followers and is on $lists lists."

Note that I had to add a cut invocation to get rid of the word “Tweets” (see the earlier script output) to ensure that eval doesn't get confused with its variable assignments. The result is nice:

filmbuzz has sent 13259 tweets and follows 1698,
has 4664 followers and is on 301 lists.

Trying a different user?

davetaylor has sent 30282 tweets and follows 567,
has 10284 followers and is on 791 lists.

Good. Now let's talk numbers.

Lightweight Numbers, Lightweight Results

Before I proceed, yes, I realize that the only outcome we can have from trying to analyze these most basic of stats is going to be a very simplistic score of whether someone is “interesting” or has any authority in the Twitterverse. Useful additional stats would be how many times they're re-tweeted (others rebroadcast their messages), what percentage of their tweets include a URL (which can indicate whether they're simply disseminating Web content or actually participating on Twitter) and what percentage of their tweets reference another Twitter account or, ideally, are actually replies to other Twitter users.

We could calculate some of these figures by pulling the 100 most-recent tweets from an account and quickly scanning for the @ symbol, an http: sequence and so forth, but I'll leave that as an exercise for you, the reader, and look forward to someone submitting the improved code to our archives at Linux Journal.
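
To get you started on that exercise, here's a rough sketch of the scanning step. It assumes you've already saved recent tweets one per line, and it doesn't cover fetching them from Twitter at all. The sample tweets, the example.com URLs and the variable names are all made up for illustration:

```shell
# Hypothetical sample data: one tweet per line. In practice this would
# come from an account's most recent tweets.
tweets_sample='Reading the new issue of Linux Journal
@linuxjournal great column this month
New post: http://www.example.com/review
RT @filmbuzz: opening weekend numbers http://www.example.com/box
Coffee time'

# grep -c counts matching lines, so each tweet is counted at most once.
total=$(printf '%s\n' "$tweets_sample" | wc -l | tr -d ' ')
with_url=$(printf '%s\n' "$tweets_sample" | grep -cE 'https?://')
with_mention=$(printf '%s\n' "$tweets_sample" | grep -c '@[A-Za-z0-9_]')
replies=$(printf '%s\n' "$tweets_sample" | grep -c '^@')

echo "tweets=$total urls=$with_url mentions=$with_mention replies=$replies"
# prints: tweets=5 urls=2 mentions=2 replies=1

# Integer percentages via shell arithmetic.
echo "URLs: $((100 * with_url / total))%  mentions: $((100 * with_mention / total))%"
# prints: URLs: 40%  mentions: 40%
```

Treating a tweet that starts with @ as a reply is a simplification, but it's close enough for a first pass.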

For now, I'm going to posit that an interesting tweet value can be calculated like this:

(followers / following) * (lists/1000) * (tweets/1000)

It's not perfect. Indeed, my friend F. Andy Seidl points out that 100/10 isn't necessarily only half as influential as 200/10 and suggests we use logarithms, but let's work with this basic calculation first and see what we get.
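
To see what Andy is getting at, here's a small sketch that takes the base-10 logarithm of the followers/following ratio. This is only one possible reading of his suggestion, not a formula he gave me. The function name log_ratio is hypothetical, and I've used awk here because its log() function makes the one-liner easy:

```shell
# log_ratio FOLLOWERS FOLLOWING
# Prints log10(followers / following) to two decimal places.
# awk's log() is the natural log, so divide by log(10) to change base.
log_ratio() {
  awk -v fwers="$1" -v fwing="$2" \
    'BEGIN { printf "%.2f\n", log(fwers / fwing) / log(10) }'
}

log_ratio 100 10    # prints 1.00
log_ratio 200 10    # prints 1.30
```

On this scale, doubling the ratio from 10 to 20 moves the value from 1.00 to 1.30 rather than doubling it, which is the dampening effect he's after.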

For my @DaveTaylor account, here's the base math:

(10284 / 567) * (793/1000) * (30285/1000)

which solves down to 434. By comparison, @FilmBuzz with a much closer ratio of followers to following solves down to the value 11, and the brand-new, zero value @GoFatherhood solves—unsurprisingly—to zero.

Robert Scoble of Rackspace is an interesting case to examine here. His stats: scobleizer has sent 56,157 tweets and follows 32,527, has 216,782 followers and is on 19,134 lists. Impressive. His score? 7,161.

One more example before we implement the formula: @linuxjournal has sent 3,208 tweets and follows 5,050, has 12,050 followers and is on 1,165 lists. Score: 9.

Suffice it to say, it's a weak analysis system. Still, it's at least something to explore and, as I suggested earlier, there are lots of ways to refine and improve the formula once you can extract individual data points easily from the Twitter stream.

Coding the Score

Math is most commonly implemented using the bc program, and since we have nicely named variables, it's a breeze to implement in the script:

echo "scale=2;($fwers / $fwing) * ($lists/1000) * ($tweets/1000)" | bc

Fully implementing it with some friendly output involves a slight tweak of the earlier echo statement, coupled with the /bin/echo version of the command, which supports the -n flag (no line break at the end). You'll see why:

/bin/echo -n "$1: $tweets tweets sent, follows $fwing, has $fwers followers, is on $lists lists. SCORE: "

echo "scale=2;($fwers / $fwing) * ($lists/1000) * ($tweets/1000)" | bc

With this in hand, a few quick test calculations:

$ sh tstats.sh davetaylor

davetaylor: 30285 tweets sent, follows 567, has 10283 followers, 
is on 793 lists. SCORE: 433.60

$ sh tstats.sh linuxjournal

linuxjournal: 3208 tweets sent, follows 5050, has 12050 followers, 
is on 1165 lists. SCORE: 8.83

$ sh tstats.sh arrington

arrington: 9163 tweets sent, follows 1852, has 100477 followers, 
is on 7729 lists. SCORE: 3836.29

It's not unreasonable that Mike Arrington, with 100,477 followers against the 1,852 that he follows, should have a high Twitter influence score, while Linux Journal, with its 12,050 followers against the 5,050 it's following, is ostensibly less popular or influential.
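
One edge case worth a guard: an account that follows nobody makes the formula divide by zero, and bc will complain. Here's a sketch of one way to wrap the calculation. The function name score and the "n/a" placeholder are my own choices for illustration, not part of the script above:

```shell
# score FOLLOWERS FOLLOWING LISTS TWEETS
# Same formula as in the article, wrapped in a function that refuses
# to divide by zero.
score() {
  fwers=$1 fwing=$2 lists=$3 tweets=$4
  if [ "$fwing" -eq 0 ]; then
    echo "n/a"        # placeholder for accounts that follow no one
    return
  fi
  echo "scale=2;($fwers / $fwing) * ($lists/1000) * ($tweets/1000)" | bc
}

score 12050 5050 1165 3208    # prints 8.83
score 500 0 10 200            # prints n/a
```

Note that with scale=2, bc truncates each intermediate result to two decimal places, so the final figure can drift slightly from what a calculator shows.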

Anyway, I've run out of space here. I hope this has been interesting, and I highly encourage you to push on this idea and see both what additional numbers you can glean from Twitter and how they can all be combined into a single numeric score.