LJ Archive

Work the Shell

Finishing Up the Content Spinner

Dave Taylor

Issue #268, August 2016

In which {writer|columnist|hacker} Dave Taylor helps you become a spammer. Sort of.

You'll recall that in my last article I shared a long, complex explanation for why spam email catches my attention and intrigues me, perhaps more than it should. Part of it is that I've been involved in email forever—I even wrote one of the most popular old-school email programs back in the day. But, there's also just the puzzle factor of taking a massive data set of millions of records and trying to produce “personalized” messages on such a large scale.

The easy version of this is to have named data fields like ${firstname}, so you can open your email with “Dear ${firstname}, I heard you went to ${college}? Me too!” and so on.

But, I'm more interested in the “spinning” side of things—the production of prose that has built-in synonyms, as exemplified by:

The {idea|concept|inspiration} is that each time you'd use a 
{word|phrase} you instead list a set of {similar words|synonyms|
alternative words} and the software automatically picks one 
{randomly|at random} and is done.

I know, you're likely shaking your head and wondering “what the deuce happened to Dave?”, but humor me, let's explore this together as a text-processing puzzle.

In my June 2016 column, I presented the core building blocks of the article spinner, a script that could identify the {}-delimited choices, isolate them, count how many options were present and display the results to the user as debugging output.

So, the above would be displayed as:

$ sh spinner.sh spinme.txt
The
3 options, spinning --- idea|concept|inspiration
is that each time you'd use a
2 options, spinning --- word|phrase
you instead list a set of
3 options, spinning --- similar words|synonyms|alternative words
and the software automatically picks one
2 options, spinning --- randomly|at random
and is done.

That's a good start, but this time, let's finish the job and actually pick randomly from the set of choices each time, output only the selected option and reflow the text to make it all look good.

Pick a Card, Any Card

The basic way to work with random numbers in Bash is to use the special $RANDOM variable. Each time it's referenced, it returns a randomly chosen number between 0 and MAXINT (32767). I constrain it to a specific range by using the modulus operator, so this will generate a random number between 0 and MAXVALUE - 1:

randomnum=$(( $RANDOM % $MAXVALUE ))

The double-parentheses notation triggers mathematical evaluation, but you already know that, right?

To make the range start at 1 instead of zero (so the result runs from 1 to MAXVALUE), I just add a bit more math to the equation:

randomnum=$(( $RANDOM % $MAXVALUE + 1 ))
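As a quick sanity check of that range arithmetic, here's a minimal sketch that wraps the expression in a function. The name roll is my own invention for illustration and isn't part of spinner.sh:

```shell
#!/usr/bin/env bash
# roll MAX: print a random integer between 1 and MAX inclusive.
# "roll" is a hypothetical helper name, not part of spinner.sh.
roll()
{
  local max=$1
  # RANDOM % max yields 0..(max-1); adding 1 shifts that to 1..max
  echo $(( RANDOM % max + 1 ))
}

# Example: simulate ten rolls of a six-sided die on one line.
for i in 1 2 3 4 5 6 7 8 9 10
do
  printf '%s ' "$(roll 6)"
done
echo
```

Because $RANDOM has 32768 possible values and that isn't an exact multiple of most range sizes, the lowest few results come up very slightly more often than the rest. For picking synonyms, that tiny bias doesn't matter.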

The script already can identify how many choices are in a specific cluster (for example, “{one|two|three}”), and now we have a simple one-liner to help randomly pick one of the values. The challenge, of course, is to pick the actual string value, not just show a number!

I know, I know—work, work, work.

Halfway through the spinline() function (which I'll show in its entirety in just a sec), $choices stores the count of how many options are in the cluster, and $source is the set of choices, minus the open and close curly brackets.

Here's my first attempt at the random word extraction:

pick=$(( $RANDOM % $choices ))
wordpick=$( echo $source | cut -d\| -f$pick )

But, that generates an error message whenever the random pick comes out as 0. It's not because of a typo (it's legit to use cut and specify the pipe symbol as the field delimiter) but because I haven't compensated for the 0..(n-1) range of the modulus result: request field -f0 from cut, and it complains because, well, there is no field zero.

That's easily fixed now that I understand the problem, however, and so here's version two:

pick=$(( $RANDOM % $choices + 1 ))
wordpick=$( echo $source | cut -d\| -f$pick )

Remember that modulus returns 0..(n-1) for its values, so when there are three choices, for example, $RANDOM % 3 returns 0, 1 or 2. Add one to each, and it's back on track with the values 1, 2 and 3.
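To see the off-by-one adjustment concretely, here's a small sketch that walks through every possible remainder for a three-choice cluster and shows which field cut hands back. The show_picks function and its hardcoded test string are assumptions for illustration only:

```shell
#!/usr/bin/env bash
# show_picks is a hypothetical helper for illustration only.
show_picks()
{
  local source="one|two|three"
  local choices=3 remainder pick
  # Step through each value that RANDOM % choices could produce.
  for (( remainder = 0; remainder < choices; remainder++ ))
  do
    pick=$(( remainder + 1 ))   # cut numbers fields starting at 1
    echo "remainder $remainder -> field $pick ->" \
         "$(echo "$source" | cut -d'|' -f"$pick")"
  done
}

show_picks
# remainder 0 -> field 1 -> one
# remainder 1 -> field 2 -> two
# remainder 2 -> field 3 -> three
```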

With a few useful debugging lines, here's the function in its entirety:


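function spinline()
{
  source="$*"
  # count the | separators; n separators means n+1 choices
  choices=$(grep -o '|' <<< "$source" | wc -l)
  choices=$(( choices + 1 ))
  echo "$choices options, spinning --- $source"
  # modulus yields 0..(choices-1); add 1 because cut numbers fields from 1
  pick=$(( RANDOM % choices + 1 ))
  wordpick=$( echo "$source" | cut -d\| -f"$pick" )
  echo "I pick choice $pick which is $wordpick"
}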

Yeah, code. Let's see what happens when I run it with the test sentence as input:

$ sh spinner.sh spinme.txt 
The
3 options, spinning --- idea|concept|inspiration
I pick choice 2 which is concept
is that each time you'd use a
2 options, spinning --- word|phrase
I pick choice 1 which is word
you instead list a set of
3 options, spinning --- similar words|synonyms|alternative words
I pick choice 2 which is synonyms
and the software automatically picks one
2 options, spinning --- randomly|at random
I pick choice 2 which is at random
and is done.

It's close, actually—really close!

In fact, let's get rid of those superfluous debugging echo statements. (Actually, I always just comment them out by prepending # to each line, so that if I develop the script further and things start to go sideways, I can simply uncomment the lines and figure out what's going on.)

Here's the result:

$ sh spinner.sh spinme.txt 
The
idea
is that each time you'd use a
word
you instead list a set of
synonyms
and the software automatically picks one
at random
and is done.

The magic really becomes apparent when the entire output is piped through the handy fmt command to put all the puzzle pieces back together on the line:

$ sh spinner.sh spinme.txt | fmt
The idea is that each time you'd use a word you instead list a set of 
synonyms and the software automatically picks one randomly and is done.

Run it a second time, and it's the same concept being discussed, but the specific word choices are different:

$ sh spinner.sh spinme.txt | fmt
The idea is that each time you'd use a phrase you instead list a set of
alternative words and the software automatically picks one randomly and 
is done.

So that's the program—mission accomplished.

Don't Bug Me, Man!

It turns out there's a bug in the script, a subtle one that's nonetheless tricky to solve: if the text to spin includes a word cluster followed immediately by punctuation, a stray space ends up in front of the punctuation.

For example, consider if I slightly modified the spinme text like this:

The {idea|concept|inspiration} is that each time you'd 
use a {word|phrase}, you instead list a 
set of {similar words|synonyms|alternative words} and the 
software automatically picks one 
{randomly|at random} and is done.

See the added punctuation immediately after the word cluster on the second line? Here's what happens if I run this through the spinner script:

The inspiration is that each time you'd use a phrase , you instead list 
a set of similar words and the software automatically picks one randomly 
and is done.

See the problem? There shouldn't be a space before the comma. That's easily fixed with a sed statement, but it's an instance of a bigger problem, so rather than sed 's/ ,/,/g', I'm going to leave it to you, dear reader, to try to come up with a more generalized solution that takes into account all punctuation, including sequences like:

({cat|dog})

so that they'll be formatted properly in the final output.
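If you'd like a nudge, here's one possible direction, offered as a rough sketch rather than the definitive answer. Instead of cleaning up stray spaces afterward, it swaps each cluster for its chosen word in place, so whatever sits next to the braces (commas, parentheses and so on) is never disturbed. The spin_inline name is hypothetical, and the sketch assumes Bash, no nested braces and one paragraph of input at a time:

```shell
#!/usr/bin/env bash
# spin_inline TEXT: replace every {a|b|c} cluster in TEXT with one randomly
# chosen option, leaving the surrounding characters untouched.
# Hypothetical function; assumes no nested braces.
spin_inline()
{
  local text=$1 match cluster bars choices pick word
  local pattern='[{]([^{}]*)[}]'
  text=${text//$'\n'/ }          # join lines; fmt can reflow them later
  while [[ $text =~ $pattern ]]  # leftmost remaining cluster
  do
    match=${BASH_REMATCH[0]}     # the cluster including its braces
    cluster=${BASH_REMATCH[1]}   # the choices without the braces
    bars=${cluster//[^|]/}       # keep only the | separators
    choices=$(( ${#bars} + 1 ))
    pick=$(( RANDOM % choices + 1 ))
    word=$(printf '%s\n' "$cluster" | cut -d'|' -f"$pick")
    # Rebuild the text around the first occurrence of the cluster.
    text=${text%%"$match"*}$word${text#*"$match"}
  done
  printf '%s\n' "$text"
}

spin_inline "use a {word|phrase}, then pet the ({cat|dog})."
```

To spin a whole file this way, something like spin_inline "$(cat spinme.txt)" | fmt should do the trick.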

And, that's a wrap for this article. For my next article, I'll look at, um, something or other. Perhaps it's time to start another game script...

Dave Taylor has been hacking shell scripts on Unix and Linux systems for a really long time. He's the author of Learning Unix for Mac OS X and the popular shell scripting book Wicked Cool Shell Scripts (new edition coming out this summer!). He can be found on Twitter as @DaveTaylor, and you can reach him through his tech Q&A site: www.AskDaveTaylor.com.
