Parsing HTML files.
We've been digging into the Yahoo Movies database for the past few months, as you'll recall, building a command called findmovie that will have the following usage:
USAGE: findmovie -g genre -k keywords -nrst title
However, we slammed into a wall at 100kph last month in the simplest of calculations: how many titles match a given combination of query elements?
For example, how many action films are there that have “death” in the title? That'd look like findmovie -g act death, but making that count actually work is tricky, because the Yahoo Movies database output is different depending on whether there are zero matches, less than a page of matches or more than a page of matches. Examples of each output are “Sorry, no matches were found”, “(All results shown)” and “< Prev | 1 - 20 of 143 | Next 20 >”, respectively.
Oh, and it gets worse. Sometimes when there's less than a full page of results, you'll see something like this: “< Prev | 1 - 3 of 3 | Next >” instead.
It's pretty much a huge pain in the booty, and even if you crack open the source, there's no handy spot that says “0” or “4” or “143”. So, that's what I want to focus on this month—parsing an HTML file to isolate and identify this particular data point.
The first observation I have about identifying a solution is that we are going to need to cache (or save) the results, so we can parse them more than once to see what we find. This brings up the old shell scripting challenge of choosing a good, unique, temporary filename.
I'm old-school. I'm used to tacking .$$ onto a filename so the process ID makes the temp file unique, but in fact, there are better solutions in modern Linux systems. Check out mktemp if you're on a BSD-based system. If that's not available, use man smartly: man -k temp | grep '(1' will list the replacement that your distro has instead. Here's a typical use of mktemp:
appname=$(basename $0)
TMPFILE=$(mktemp /tmp/${appname}.XXXXXX) || exit 1
It looks pretty similar, but by using that many X characters, the program fills them in using the PID and random letters, making the temp filename very hard for a hacker to guess or anticipate. The version of this script I've been developing on my Mac OS X system had the following code snippet:
if [ $dump -eq 1 ] ; then
  exec /usr/bin/curl --silent "$baseurl${params}\&p=$pattern"
else
  exec open -a safari "$baseurl${params}\&p=$pattern"
fi
The problem here is that using exec to invoke a command replaces the shell script with the command in question, so nothing after the exec ever runs and the script never gets a chance to parse what curl returns. Instead, it's time to rewrite it:
if [ $dump -eq 1 ] ; then
  appname=$(basename $0)
  TMPFILE=$(mktemp /tmp/${appname}.XXXXXX) || exit 1
  /usr/bin/curl --silent "$baseurl${params}\&p=$pattern" \
    > $TMPFILE
else
  exec open -a safari "$baseurl${params}\&p=$pattern"
fi
That looks good. If we're dumping the file source, it'll go to the temporary file for later analysis. If it's a request that is supposed to launch the search results in a browser, it still uses the Mac OS X open command.
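To see why the exec had to go, and what capturing to a temp file buys us, here is a minimal sketch that runs offline. The echo commands stand in for the curl call, and the trap that deletes the temp file on exit is my own assumption rather than part of the script above.

```shell
#!/bin/sh
# exec replaces the shell with the command, so "after" is never printed.
with_exec=$(sh -c 'exec echo "page source"; echo "after"')
echo "with exec: $with_exec"

# Capturing to a temp file keeps the script alive for later parsing.
appname=$(basename "$0")
TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1

# Assumption: cleaning up on exit is an addition, not in the column's script.
trap 'rm -f "$TMPFILE"' EXIT

# Stand-in for the curl call so the sketch needs no network access.
echo "page source" > "$TMPFILE"
saved=$(cat "$TMPFILE")
echo "saved to temp file: $saved"
```

The first echo shows only the output of the exec'd command, because the shell that would have run the second command no longer exists.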
To figure out what's going on, we need to account for three different possibilities, each of which has a different “fingerprint” in the source file. Here's a rough template:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  echo there are zero results for that search.
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  echo got some results with case two.
else
  echo more than a page of results
fi
Here, I'm showing only output echo statements to give you a sense of the algorithm, but you can see that we're just testing for a known string that hopefully won't show up in other situations. Note the second test, though: Next&nbsp;&gt; is some HTML weirdness. “nbsp” is a non-breaking space, and “gt” is the > symbol. Wrap 'em in “&” and “;”, and you have HTML character entities.
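The same fingerprint idea can be tried against canned input. This is a rough sketch, not the script itself: classify_page is a hypothetical helper, the three sample lines are invented stand-ins for real page source, and it assumes the raw source spells the marker as Next&nbsp;&gt;. It also swaps in grep -q, which tests grep's exit status instead of checking for empty output.

```shell
#!/bin/sh
# classify_page is a hypothetical helper: it reports which of the three
# page layouts a saved file matches, using grep's exit status (-q).
classify_page() {
  if grep -qi "no matches were found" "$1"; then
    echo "zero"
  elif grep -qi "Next&nbsp;&gt;" "$1"; then
    echo "partial"
  else
    echo "more"
  fi
}

sample=$(mktemp /tmp/classify.XXXXXX) || exit 1

# Invented sample lines, modeled on the strings quoted in the column.
echo 'Sorry, no matches were found' > "$sample"
case_zero=$(classify_page "$sample")

echo '&lt;&nbsp;Prev | 1 - 3 of 3 | Next&nbsp;&gt;' > "$sample"
case_partial=$(classify_page "$sample")

echo '&lt;&nbsp;Prev | 1 - 20 of 143 | Next 20&nbsp;&gt;' > "$sample"
case_more=$(classify_page "$sample")

rm -f "$sample"
printf '%s\n' "$case_zero" "$case_partial" "$case_more"
```

Because the pattern is matched against the raw source, the entities have to be spelled out; searching for a literal > would miss them.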
Ascertaining the total match count requires yet more parsing of the output. Search for “death race”, and you'll find three matches, with the count ending up looking like this:
<b>3</b>
Unfortunately, it's rather buried in a more complicated pattern, because here's a typical match:
<td align=right><font face=arial size="-2"><nobr>< Prev | <b>1 - 3</b> of <b>3</b> ...
I have to admit, I was stumped for a bit, which is why having geeky friends like Martin and Lucretia M. Pruitt is so darn helpful. I posed this puzzle on Twitter (I'm @DaveTaylor if you want to follow me), and after some false starts, they suggested a simple and logical solution: turn the <b> and </b> into individual character delimiters, then simply use cut to pull out the field we seek. Smart!
Here's how that looks as a simple command sequence:
grep -i "1 - " $TMPFILE | sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4
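To make the field numbering concrete, here is a small sketch that runs the sed and cut steps on the sample line shown earlier and prints the intermediate result. The variable names are mine, chosen only for illustration.

```shell
#!/bin/sh
# The sample line from the "death race" search, as shown in the column.
line='<td align=right><font face=arial size="-2"><nobr>< Prev | <b>1 - 3</b> of <b>3</b>'

# Step 1: turn every <b> and </b> into a ~ delimiter.
delimited=$(printf '%s\n' "$line" | sed 's/<b>/~/g;s/<\/b>/~/g')

# Step 2: the total is the fourth ~-separated field.
count=$(printf '%s\n' "$delimited" | cut -d\~ -f4)

echo "$delimited"
echo "count: $count"
```

After the sed step the line ends in ~1 - 3~ of ~3~, so field 1 is everything before the first tilde, field 2 is the range, field 3 is " of " and field 4 is the total.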
Armed with this, the ugly HTML sequence above quickly reduces down to the value 3, which is exactly what we want. One nuance, though. It turns out that this data appears both before and after the matches, so we need to slip in | head -1 to ensure that we're parsing only one line and not extracting the same value twice or confusing the new parser. This means we can create the following code:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
You can see how I'm differentiating the three cases and how the resultant code is fairly similar in the second and third cases. In fact, they don't need to be separate cases, so the count is more easily calculated like this:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
If you initialized matches to zero, you actually can flip the logic of the first conditional and prune it down even further:
matches=0
if [ -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
Nice. It's a simple, straightforward example of how, if you keep thinking about what you're really accomplishing with complex conditionals, they often can be not only simplified, but sped up too.
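For anyone who wants to try the final logic without hitting the live site, here is one possible way to wrap it in a function and feed it canned pages. count_matches is a hypothetical name, the sample pages are invented, and the numeric check at the end is my own assumption about how to react if the page layout ever changes.

```shell
#!/bin/sh
# count_matches is a hypothetical wrapper around the column's final logic.
count_matches() {
  matches=0
  if ! grep -qi "no matches were found" "$1"; then
    matches=$(grep -i "1 - " "$1" | head -1 |
      sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)
  fi
  # Assumption: anything that isn't a plain number is a parse failure.
  case $matches in
    ''|*[!0-9]*) echo "could not find a match count" >&2; return 1 ;;
  esac
  echo "$matches"
}

page=$(mktemp /tmp/count.XXXXXX) || exit 1

echo 'Sorry, no matches were found' > "$page"
none=$(count_matches "$page")

# Invented sample: the counter line appears above and below the results,
# which is why the pipeline needs head -1.
counter='<nobr>< Prev | <b>1 - 20</b> of <b>143</b> | Next 20 >'
printf '%s\n<p>results</p>\n%s\n' "$counter" "$counter" > "$page"
many=$(count_matches "$page")

rm -f "$page"
echo "none=$none many=$many"
```

Returning a non-zero status when no number turns up lets the caller tell "zero matches" apart from "the parser found nothing it recognized".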
While writing these columns on working with Yahoo Movies, I've found my interest has been pulled in a different direction: a “name that tune” game. That's what we'll start working on next month. If you want to get a sneak peek at it and see how it evolves in real time (rather than here in Linux Journal), jump on Twitter and follow @SongTitle. It's going to be fun!