LJ Archive

Work the Shell

Movie Trivia and Fun with Random Numbers

Dave Taylor

Issue #172, August 2008

Use the shell to manipulate a list of movies from the Internet Movie Database (IMDb).

Last month, we had a lot of fun digging around within the Internet Movie Database, producing a set of scripts that together make it easy to generate a list of the top 250 movies on the site with release dates. The format of the output is:

All About Eve | 1950

Hotel Rwanda | 2004

Sin City | 2005

City Lights | 1931

This month, I take a look at how you can break those two fields up and randomly generate some likely release dates close to the actual date, then send it as a question on Twitter. For example, it might ask, “Hotel Rwanda was released in: 2000, 2001, 2004 or 2007?”

Splitting Up the Fields

Okay, this should be super easy for anyone reading this column. There are a bunch of ways to take a two-field data record and split it up, but my favorite tool for this sort of task is cut. So, we can do this:

moviename="$(echo "$entry" | cut -d\| -f1)"
releasedate="$(echo "$entry" | cut -d\| -f2)"

That was easy, right? Now, of course, if you want to be fancy about it, you'll want to strip any leading or trailing spaces too, which can be done with this sed command:

sed 's/^ *//;s/ *$//'
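To see the split and the trim working together, here is a minimal sketch. The trim function is a hypothetical helper name of my own, and the sample entry is just one line from the list above; treat it as an illustration rather than the finished script.

```shell
#!/bin/bash
# Sample record in the "title | year" format shown earlier.
entry="Hotel Rwanda | 2004"

# trim is a hypothetical helper: strip all leading and trailing spaces.
trim() {
  sed 's/^ *//;s/ *$//'
}

# cut splits on the pipe character, then trim cleans up each field.
moviename="$(echo "$entry" | cut -d\| -f1 | trim)"
releasedate="$(echo "$entry" | cut -d\| -f2 | trim)"

echo "title=[$moviename] year=[$releasedate]"
```

The square brackets in the echo make any stray spaces easy to spot.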

But, how do you get a random line out of a text file?

If you recall from previous columns, one of the secret features of the Bash shell's built-in mathematical capabilities—accessible with $(( )) notation—is the ability to get a random integer without any further fuss, like this:

echo $(( $RANDOM ))

Try it in your own command shell a few times, and you'll get a series of random integer values, like 29408 and 17501. To constrain it to the size of the file, we could do something fancy with wc -l to identify the number of lines in the actual data file, but because we already know we're grabbing 250 film titles from IMDb, it's easy just to use that value. Here's the first stab:

pickline="$(( $RANDOM % 250 ))"

It's not quite right though, because we'll get values 0–249, and there is no line 0 in the file. You can verify that the remainder can be zero by entering the command echo $(( 5 % 5 )), for example. So, we need to shift things up one:

pickline="$(expr $(( $RANDOM % 250 )) + 1 )"
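If you would rather not hard-code 250, the wc -l idea mentioned above might look something like this. It is a sketch under assumptions: the temporary file stands in for the real movie list, and the variable names datafile and linecount are mine.

```shell
#!/bin/bash
# Build a tiny sample file so the sketch runs on its own;
# in practice this would be the 250-line movie list.
datafile="$(mktemp)"
printf '%s\n' "All About Eve | 1950" "Hotel Rwanda | 2004" \
  "Sin City | 2005" "City Lights | 1931" > "$datafile"

# Reading from stdin keeps the file name out of wc's output.
# The $(( )) wrapper also drops any padding some versions of wc add.
linecount="$(( $(wc -l < "$datafile") ))"

# RANDOM % linecount gives 0..linecount-1, so add 1 to get a valid line number.
pickline="$(( RANDOM % linecount + 1 ))"

echo "picked line $pickline of $linecount"
rm -f "$datafile"
```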

That produces a random number. To extract that value from a file of lines, there are a number of solutions, but I'll stick with sed. In that case, the solution for pulling out line 33, as an example, is:

sed -n 33p

If you change the value to a variable name, however, there's a problem:

sed -n $picklinep

You can't put a space between the variable name and the p, but if you don't, you have a bad variable name, because it's pickline, not picklinep. The solution is a secret notational convention you can use in scripts when there's any sort of ambiguity like this—curly brackets. So, the line ends up as follows:

sed -n ${pickline}p

That does the trick, and in an application like this, sed is lightning fast too.
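Here is the curly-bracket form in a small self-contained sketch. The sample file is a stand-in for the real list, and pickline is fixed at 2 so the result is predictable; in the real script, it would hold the random value.

```shell
#!/bin/bash
# Stand-in data file with four of the entries shown earlier.
datafile="$(mktemp)"
printf '%s\n' "All About Eve | 1950" "Hotel Rwanda | 2004" \
  "Sin City | 2005" "City Lights | 1931" > "$datafile"

pickline=2   # fixed for the demonstration; normally a random line number

# ${pickline}p expands to 2p, so sed prints only line 2.
entry="$(sed -n "${pickline}p" "$datafile")"

echo "$entry"
rm -f "$datafile"
```

Without the braces, the shell would look for a variable called picklinep, find nothing, and hand sed an empty expression.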

At this point, we have a data file of interesting information, we can extract a random line from the file, and we can split the resultant data into the film title and release year. How about coming up with plausible alternative release years?

Calculating Random Years

My first inclination with generating random years was to add and subtract 1–3 years and then use those as the alternate values. If we were looking at, say, Shaun of the Dead, released in 2004, we might end up with 2001 and 2007 as the options. Match a film that's more recent though, such as 2007's Grindhouse (though why that's on the IMDb top 250 films list is beyond me), and we have a problem. Suggesting 2009 as a possible release date would be daft.

More important, it wouldn't take long for people to realize that it's the middle value that's always correct on the quiz—not good. Just like with the SAT and GMAT, it's important to avoid any possible patterns in answers.

As a result, we can try something a bit more complicated. Each possible year is the actual year of release plus or minus a random value of 1–5—close enough that it'll be challenging to remember the right year.

Here's the beginning of the script:

add="$(( $RANDOM % 2 ))"
delta="$(expr $(( $RANDOM % 5 )) + 1)"

Here, add will be 0 (false) or 1 (true) for later conditional testing, and delta is a value between one and five, just as we need. They can be applied as follows:

if [ $add -eq 1 ] ; then
  newvalue=$(expr $1 + $delta )
else
  newvalue=$(expr $1 - $delta )
fi

This snippet can be tested easily by dropping it into a simple script, which I'll call random-years.sh. The result of running it six times with the starting year 2000 is 2002, 1998, 2005, 2001, 2003, 2004. Seems sufficiently random, yes?
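For reference, a test harness along these lines might look like the following. This is my own sketch of what random-years.sh could contain, not necessarily the exact file; the function name random_year is hypothetical, and the output differs on every run.

```shell
#!/bin/bash
# random_year (hypothetical name): print the given year plus or minus 1..5.
random_year() {
  local year="$1"
  local add="$(( RANDOM % 2 ))"        # 0 means subtract, 1 means add
  local delta="$(( RANDOM % 5 + 1 ))"  # a value from 1 to 5

  if [ "$add" -eq 1 ] ; then
    echo "$(( year + delta ))"
  else
    echo "$(( year - delta ))"
  fi
}

# Generate six candidate years around 2000.
for i in 1 2 3 4 5 6 ; do
  random_year 2000
done
```

Because the delta is never zero, the function cannot return the real release year by accident.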

Now, let's consider some nuances. First, we need to ensure that a generated year is never past the current year, which can be done by grabbing that value from the date command with a format string: date +%Y (learn more about the many, many format strings that the date command understands with man strftime).
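One way to apply that check is sketched below. The choice to flip a too-late year into a subtraction is my assumption, not something settled in this column, and bounded_year is a hypothetical function name.

```shell
#!/bin/bash
thisyear="$(date +%Y)"

# bounded_year (hypothetical name): year plus or minus 1..5, but an
# addition that would land in the future becomes a subtraction instead.
bounded_year() {
  local year="$1"
  local delta="$(( RANDOM % 5 + 1 ))"
  local newvalue

  if [ "$(( RANDOM % 2 ))" -eq 1 ] ; then
    newvalue="$(( year + delta ))"
  else
    newvalue="$(( year - delta ))"
  fi

  # Never suggest a release year later than the current one.
  if [ "$newvalue" -gt "$thisyear" ] ; then
    newvalue="$(( year - delta ))"
  fi
  echo "$newvalue"
}

# A film released this year can only get earlier alternatives.
bounded_year "$thisyear"
```

A side effect of this approach is that very recent films get alternatives that are all earlier than the real year, which is itself a pattern worth keeping in mind.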

Second, here's a more interesting thought. If the movie came out a long time ago, we should have a bigger delta than if it's a recent release. In other words, if the movie is Casablanca, it came out in 1942, 66 years ago. Iron Man, which is also on the top 250 list, came out in 2008, 0 years ago. For Casablanca, we could have possible values of 1938 and even 1951, and it'd be a good quiz question for anyone who isn't a complete film nut. But, that wide a spread for Iron Man makes no sense. No one's going to think it might have come out in 1999.

What I'm thinking about in this situation then is that the delta might be a percentage of the age of the movie, normalized so that we always have some sort of spread. Maybe 20%? That'd give us a delta of 13.2 for Casablanca and 0 for Iron Man. That could work.
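As a rough sketch of that idea, the arithmetic might go like this. The floor of 2 years is purely my assumption for the "normalized" part, max_delta is a hypothetical name, and the current year is passed in as an argument so the numbers match the 2008 examples above.

```shell
#!/bin/bash
# max_delta (hypothetical name): 20% of the film's age in whole years,
# with an assumed floor of 2 so recent films still get some spread.
max_delta() {
  local year="$1" thisyear="$2"
  local age="$(( thisyear - year ))"
  local delta="$(( age * 20 / 100 ))"   # integer math, so 13.2 becomes 13

  if [ "$delta" -lt 2 ] ; then
    delta=2
  fi
  echo "$delta"
}

max_delta 1942 2008   # Casablanca: prints 13
max_delta 2008 2008   # Iron Man: prints 2
```

The shell only does integer arithmetic here, which is fine for this purpose because release years are whole numbers anyway.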

Ah, but I've run out of space. Next month, we'll go back to the random adjacent year function to wrap it up, and then look at how to get these questions out on Twitter rather than just on the Linux command line. Until then, “here's lookin' at you, kid.”

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com, and he also offers up tech support at AskDaveTaylor.com. Follow him on Twitter if you'd like: twitter.com/DaveTaylor.
