LJ Archive

Work the Shell

Simple Scripts to Sophisticated HTML Forms, Take II

Dave Taylor

Issue #195, July 2010

Parsing HTML files.

We've been digging into the Yahoo Movies database for the past few months, as you'll recall, building a command called findmovie that will have the following usage:

USAGE: findmovie -g genre -k keywords -nrst title

However, we slammed into a wall at 100kph last month in the simplest of calculations: how many titles match a given combination of query elements?

For example, how many action films are there that have “death” in the title? That'd look like findmovie -g act death, but making that count actually work is tricky, because the Yahoo Movies database output is different depending on whether there are zero matches, less than a page of matches or more than a page of matches. Examples of each output are “Sorry, no matches were found”, “(All results shown)” and “< Prev | 1 - 20 of 143 | Next 20 >”, respectively.

Oh, and it gets worse. Sometimes when there's less than a full page of results, you'll see something like this: “< Prev | 1 - 3 of 3 | Next >” instead.

It's pretty much a huge pain in the booty, and even if you crack open the source, there's no handy spot that says “0” or “4” or “143”. So, that's what I want to focus on this month—parsing an HTML file to isolate and identify this particular data point.

Caching the Results

The first observation I have about identifying a solution is that we're going to need to cache (or save) the results, so we can parse them more than once to see what we find. This brings up the old shell scripting challenge of choosing a good, unique, temporary filename.

I'm old-school. I'm used to using .$$ to use the process ID as the basis of the temp file, but in fact, there are better solutions on modern systems. Check out mktemp, which ships with most Linux distros and with BSD-based systems, Mac OS X included. If that's not available, use man smartly: man -k temp | grep '(1' will list the section 1 commands that your distro offers instead. Here's a typical use of mktemp:

appname=$(basename $0)
TMPFILE=$(mktemp /tmp/${appname}.XXXXXX) || exit 1 

It looks pretty similar to the $$ approach, but mktemp replaces that run of X characters with a unique suffix built from the PID and/or random letters, and it creates the file for you, making the temp file's name very hard for a hacker to guess or anticipate. The version of this script I've been developing on my Mac OS X system had the following code snippet:


if [ $dump -eq 1 ] ; then
  exec /usr/bin/curl --silent "$baseurl${params}\&p=$pattern"
else
  exec open -a safari "$baseurl${params}\&p=$pattern"
fi 

The problem here is that using exec to invoke a command replaces the shell script with the command in question, so the script never gets control back and can't do anything with the output. That isn't going to work. Instead, it's time to rewrite it:


if [ $dump -eq 1 ] ; then
  appname=$(basename "$0")
  TMPFILE=$(mktemp /tmp/${appname}.XXXXXX) || exit 1
  # inside double quotes, & needs no backslash
  /usr/bin/curl --silent "$baseurl${params}&p=$pattern" \
     > "$TMPFILE"
else
  exec open -a safari "$baseurl${params}&p=$pattern"
fi 

That looks good. If we're dumping the file source, it'll go to the temporary file for later analysis. If it's a request that is supposed to launch the search results in a browser, it still uses the Mac OS X open command.
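
Two points from this section can be tried in isolation: exec never hands control back to the script, and a file made by mktemp sticks around until something removes it. What follows is a minimal sketch, not part of findmovie itself. The demo prefix, the trap line and the variable names are assumptions added for illustration.

```shell
#!/bin/sh
# Sketch only: "demo" is a placeholder prefix, not the real findmovie script.

# 1. exec replaces the running shell, so the second echo never runs.
exec_output=$(sh -c 'exec echo "replaced by echo"; echo "never reached"')
echo "$exec_output"

# 2. mktemp creates the file and prints its name. The trap (an addition,
#    not in the column's script) removes the file when this script exits.
TMPFILE=$(mktemp /tmp/demo.XXXXXX) || exit 1
trap 'rm -f "$TMPFILE"' EXIT

# Without exec, the script keeps running after the command finishes
# and can go on to read what was saved.
echo "saved page source" > "$TMPFILE"
saved=$(cat "$TMPFILE")
echo "read back: $saved"
```

Whether a cleanup trap belongs in findmovie is a judgment call. While you're still debugging the parser, leaving the temp file in place to poke at by hand can be handy.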

Parsing the Results

To figure out what's going on, we need to account for three different possibilities, each of which has a different “fingerprint” in the source file. Here's a rough template:

if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  echo there are zero results for that search.
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  echo got some results with case two.
else
  echo more than a page of results
fi 

Here, I'm showing only output echo statements to give you a sense of the algorithm, but you can see that we're just testing for a known string that hopefully won't show up in other situations. Note the second test, though: Next&nbsp;&gt; is some HTML weirdness. “nbsp” is a non-breaking space, and “gt” is the > symbol. Wrap 'em in “&” and “;”, and you have HTML character entities.
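
The same fingerprint test can also be written with grep -q, which prints nothing and reports a match through its exit status, so no string comparison is needed. This is a rough sketch of that alternative rather than the code used in this column. The classify_page function is a hypothetical helper, and the sample fragments are assumptions modeled on the strings quoted above, not captured Yahoo Movies output.

```shell
#!/bin/sh
# classify_page is a hypothetical helper. The fragments fed to it below are
# assumptions modeled on the strings quoted in the column, not real output.
classify_page() {
  # grep -q is silent; the if statement tests its exit status directly.
  if grep -qi "no matches were found" "$1" ; then
    echo zero
  elif grep -qi "Next&nbsp;&gt;" "$1" ; then
    echo partial
  else
    # As in the template, anything else is treated as more than a page.
    echo many
  fi
}

sample=$(mktemp /tmp/classify.XXXXXX) || exit 1
trap 'rm -f "$sample"' EXIT

echo 'Sorry, no matches were found' > "$sample"
case_zero=$(classify_page "$sample")

echo '<b>1 - 3</b>&nbsp;of&nbsp;<b>3</b>&nbsp;|&nbsp;Next&nbsp;&gt;' > "$sample"
case_partial=$(classify_page "$sample")

echo '<b>1 - 20</b>&nbsp;of&nbsp;<b>143</b>&nbsp;|&nbsp;Next&nbsp;20&nbsp;&gt;' > "$sample"
case_many=$(classify_page "$sample")

echo "$case_zero $case_partial $case_many"
```

The third sample falls through to the final branch because “Next&nbsp;20&nbsp;&gt;” doesn't contain the exact string “Next&nbsp;&gt;”.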

Ascertaining the total match count requires yet more parsing of the output. Search for “death race”, and you'll find three matches, with the count itself ending up looking like this:


<b>3</b> 

Unfortunately, it's rather buried in a more complicated pattern, because here's a typical match:

<td align=right><font face=arial size="-2"><nobr>
↪&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 3</b>
↪&nbsp;of&nbsp;<b>3</b>&nbsp;... 

I have to admit, I was stumped for a bit, which is why having geeky friends like Martin and Lucretia M. Pruitt is so darn helpful. I posed this puzzle on Twitter (I'm @DaveTaylor if you want to follow me), and after some false starts, they suggested a simple and logical solution: turn the <b> and </b> into individual character delimiters, then simply use cut to pull out the field we seek. Smart!

Here's how that looks as a simple command sequence:


grep -i "1 - " $TMPFILE |
   sed 's/<b>/~/g;s/<\/b>/~/g' |
   cut -d\~ -f4 

Armed with this, the ugly HTML sequence above quickly reduces down to the value 3, which is exactly what we want. One nuance, though. It turns out that this data appears both before and after the matches, so we need to slip in | head -1 to ensure that we're parsing only one line and not duplicating the data entry or confusing the new parser. This means we can create the following code:


if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi 

You can see how I'm differentiating the three cases and how the resultant code is identical in the second and third cases. In fact, they don't need to be separate cases, so the count is more easily calculated like this:


if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi 

If you initialize matches to zero, you actually can flip the logic of the first conditional and prune it down even further:


matches=0 
if [ -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi 

Nice. It's a simple, straightforward example of how thinking about what you're really accomplishing with complex conditionals often lets you not only simplify them, but speed them up too.
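
To see the whole extraction work without hitting the network, the pipeline can be pointed at a hand-built file. This is only a sketch under assumptions: count_matches is a hypothetical wrapper, the sample page reuses the navigation line shown earlier (with its trailing “...” left as is), and the filler line between the two copies is made up.

```shell
#!/bin/sh
# count_matches is a hypothetical wrapper around the column's pipeline.
count_matches() {
  if grep -qi "no matches were found" "$1" ; then
    echo 0
  else
    # Turn <b> and </b> into ~ so cut can treat them as field delimiters,
    # then take field 4, the total that follows "of".
    grep -i "1 - " "$1" | head -n 1 |
      sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d'~' -f4
  fi
}

page=$(mktemp /tmp/count.XXXXXX) || exit 1
trap 'rm -f "$page"' EXIT

# The navigation line appears before and after the results, so the sample
# includes it twice. The middle line is invented filler.
cat > "$page" <<'EOF'
<td align=right><font face=arial size="-2"><nobr>&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 3</b>&nbsp;of&nbsp;<b>3</b>&nbsp;...
<p>(result rows would go here)</p>
<td align=right><font face=arial size="-2"><nobr>&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 3</b>&nbsp;of&nbsp;<b>3</b>&nbsp;...
EOF
three=$(count_matches "$page")

echo 'Sorry, no matches were found' > "$page"
zero=$(count_matches "$page")

echo "$three $zero"
```

Field 4 is the right one because the text before the first <b> counts as field 1, “1 - 3” is field 2, the “of” in between is field 3, and the total lands in field 4.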

Next Month

While writing these columns on working with Yahoo Movies, I've found my interest has been pulled in a different direction: a “name that tune” game. That's what we'll start working on next month. If you want to get a sneak peek at it and see how it evolves in real time (rather than here in Linux Journal), jump on Twitter and follow @SongTitle. It's going to be fun!

Dave Taylor has been hacking shell scripts for a really long time, 30 years. He's the author of the popular Wicked Cool Shell Scripts, and he can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
