LJ Archive

Work the Shell

Spinning and Text Processing

Dave Taylor

Issue #267, July 2016

Dave delves into complex string processing to write a tool spammers use.

I have a dirty secret to share, and I hope you won't think less of me once you learn it. I used to be in the internet marketing world and pitched my coaching programs and DVD sets from stages around the United States. Yes, for $999, I'd teach you how to make money online, and if you were one of the first three to sign up, I'd even throw in my friend's dynamite ebook absolutely free!

Truth is, I didn't last long in that space because I'm much more of a do-er than a salesperson, and it would bug me to no end when people would buy my coaching package—at 20% off, but only if you sign up right now!—and then never actually open it and use it to at least try their hand at creating an online business.

That's all in the past, fortunately, but I've retained an interest in those business opportunity pitches and what they're actually selling. Just like the cliché envelope-stuffing job (you know: “Send me $200 in an envelope, and I'll show you how to ask people to send you money!”), it turns out that a lot of online businesses still are predicated on gaming search engines to gain traffic to pages selling daft and usually worthless things.

And, one way that these entrepreneurs game Google and other search engines is by “spinning” to produce lots and lots of content from a single article that they've paid someone a few bucks to write in the first place.

It's all rather uninspiring, except the spinning idea itself is rather interesting, and I've been toying with writing a shell script to allow easy article spinning for quite a long time. There are more prosaic, less questionable uses for this technology too, such as programs or even games whose text messages benefit from some variety.

The {idea|concept|inspiration} is that each time you'd use a {word|phrase} you instead list a set of {similar words|synonyms|alternative words} and the software automatically picks one {randomly|at random}.

So the previous sentence would come out of the spinner as “The idea is that each time you'd use a phrase you instead list a set of alternative words and the software automatically picks one at random.” Got it? Easy enough.

A more advanced spinner might actually tap a thesaurus, and each time it sees a word, push out a set of synonyms automatically, which the other script then randomly simplifies each time it's invoked.

In fact, go read spam blog comments or spam email, and you'll see the output of this sort of contextless sentence manipulation. They can be...weird, like this:

she's got arriving in can easily dresses, still Beth may be 36 yr old men's city servant, outdoors of waking time 'en femme'. she's single, symmetrical in addition thinks to achieve marital, "Eventually..."

But hey, just because there are bad uses doesn't mean it's not an interesting project to try to code, right? I trust you to exercise good judgment of your own when you explore this script, okay?

Spinning Out the Spinner

The basic tasks of the script are straightforward: parse the input, isolate each word-choice block, pick one at random, then reassemble everything and display it.

To make things a bit easier, I'm going to start by using fmt to make each paragraph one really long line. That way, I then can break the input into lines that don't have a word-choice block and those that do:

fmt -w$bigwidth "$1" | tr '{' '\n' | tr '}' '\n'

For example, this input line:

An input line like {this|demo} would then transform.

comes out as three separate lines:

An input line like
this|demo
would then transform.

See how that works? I'm going to use fmt again at the end of the process to clean up the output.
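To watch the splitting step in isolation, here's a minimal sketch that runs just the tr pair on the sample sentence. The helper name split_blocks and the variable names are assumptions made for this illustration, and fmt is left out because the sample is already a single line (the article never gives a value for bigwidth).

```shell
# Hypothetical helper: split text at { and } the way the article's tr pair does.
split_blocks() {
  tr '{' '\n' | tr '}' '\n'
}

sample='An input line like {this|demo} would then transform.'

# Capture the result so it can be inspected line by line.
split_output=$(printf '%s\n' "$sample" | split_blocks)
printf '%s\n' "$split_output"
```

Note that the pieces keep whatever spaces sat next to the braces: the first piece ends with a space and the third begins with one. A plain read in a later loop trims that whitespace.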

One facet of shell script programming that most people don't realize is that a loop structure can sit in a pipeline just like any other command (in bash, it then runs in its own subshell), so rather than waste space and time with a temporary file, I'll pipe the output of the fmt|tr sequence directly into a while loop:

fmt -w$bigwidth "$1" | tr '{' '\n' | tr '}' '\n' | \
while read -r line
do
  # A line containing | came from inside a {...} block
  if [ $( echo "$line" | grep -c '|' ) -gt 0 ] ; then
    echo "SPIN THIS: $line"
  else
    echo "$line"
  fi
  lines=$(( $lines + 1 ))   # running count of lines processed
done

See how the fmt line ends with | \, and that feeds directly into the while loop? Very handy structure!
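One consequence of piping into a loop is worth seeing in a small experiment. This is a sketch that assumes bash with default settings (dash behaves the same way); the variable names are made up for the demonstration. A counter updated inside a piped loop lives in the subshell, so the change is gone once the loop ends, while a loop fed by redirection runs in the current shell and keeps it.

```shell
# Loop on the receiving end of a pipe: runs in a subshell in bash and dash.
piped_count=0
printf 'a\nb\nc\n' | while read -r line
do
  piped_count=$(( piped_count + 1 ))
done
echo "after piped loop: $piped_count"            # 0 in bash with default settings

# Same loop fed by a here document: runs in the current shell.
redirected_count=0
while read -r line
do
  redirected_count=$(( redirected_count + 1 ))
done <<EOF
a
b
c
EOF
echo "after redirected loop: $redirected_count"  # 3
```

That matters only if a value computed inside the loop, such as a line counter, is needed after the loop finishes. Output echoed from inside the loop reaches the terminal either way.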

Now I'm going to run this code snippet with the sample input file to see what happens:

$ sh spinner.sh spinme.txt
The
SPIN THIS: idea|concept|inspiration
is that each time you'd use a
SPIN THIS: word|phrase
you instead list a set of
SPIN THIS: similar words|synonyms|alternative words
and the software automatically picks one
SPIN THIS: randomly|at random
.

That pesky period on its own line is a glitch that'll need to be fixed later, but the basic structure of the script is sound: you can parse and break down the input file data and identify which new lines are selector lines.

The Spinning Function

Instead of just prepending SPIN THIS: before a line that has choices, that's a perfect place to put in a function call to a separate block of code that does the actual work.

One of the most interesting parts of the function is how it figures out how many options there are in the given string. It's a specific instance of the general question “how many occurrences of X are in string Y?”, and it exploits the little-known -o flag to grep, which prints each match on a line of its own instead of printing the whole matching line:


grep -o '|' <<< "$*" | wc -l

Take a deep breath; I can talk you through this one! The <<< notation is a here string, a bash variation on the here document (<<) you've hopefully already seen in scripts. The difference is that the word following <<< is fed to the command as a single string on stdin.

The "$*" expands to all the arguments given to the function in the main block of the script, joined into a single string; the | is the character being counted; and because grep -o prints each match on its own line, wc -l produces the number of matches (in this case, the number of delimiters in the line).
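The same counting trick can be wrapped up as a general "how many X in Y" helper. The sketch below is illustrative only: the function name count_char is an assumption, printf stands in for the here string so the snippet also runs under a plain sh, and -F is added so the character is always matched literally.

```shell
# Hypothetical helper: count occurrences of a literal character in a string.
count_char() {
  # $1: the literal character to count; $2: the string to search.
  # grep -o emits one line per match, wc -l counts those lines,
  # and tr strips the padding some wc implementations add.
  printf '%s\n' "$2" | grep -o -F -- "$1" | wc -l | tr -d ' '
}

delims=$(count_char '|' 'similar words|synonyms|alternative words')
echo "delimiters: $delims"    # delimiters: 2
```

When the character doesn't appear at all, grep prints nothing and wc -l reports 0, so the helper needs no special case for lines without choices.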

All that, and it's not quite what I want, because a line like word|phrase has one delimiter, but two choices. Here's how I solve that in this first, skeletal version of the function:


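function spinline()
{
  source="$*"
  # grep -o prints one line per | found, so wc -l counts the delimiters
  choices=$(grep -o '|' <<< "$*" | wc -l)
  # n delimiters separate n+1 choices
  choices=$(( $choices + 1 ))
  echo "$choices options, spinning --- $source"
}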

In use (run it with bash, since here strings and the function keyword aren't part of plain POSIX sh):

$ bash spinner.sh spinme.txt
The
3 options, spinning --- idea|concept|inspiration
is that each time you'd use a
2 options, spinning --- word|phrase
you instead list a set of
3 options, spinning --- similar words|synonyms|alternative words
and the software automatically picks one
2 options, spinning --- randomly|at random
.
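For the curious, here is one possible way the selection step could go. It is a rough sketch under stated assumptions, not necessarily how the finished script will do it: the function name pick_one is hypothetical, and awk's rand() is used as the source of randomness.

```shell
# Hypothetical helper: print one randomly chosen field from a |-separated list.
pick_one() {
  # $1: the choices, separated by |
  # rand() returns a value in [0,1), so int(rand() * NF) + 1 is a
  # field number between 1 and NF inclusive.
  # srand() seeds from the clock, so calls made within the same second
  # may all return the same choice.
  printf '%s\n' "$1" | awk -F'|' 'BEGIN { srand() } { print $(int(rand() * NF) + 1) }'
}

picked=$(pick_one 'idea|concept|inspiration')
echo "picked: $picked"
```

Splitting on the delimiter inside awk means the count of choices (NF) comes for free, which is a different route from counting delimiters with grep -o and adding one.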

That's it for this month. Next month, I'll finish up the function, including implementing a way to pick one entry randomly from a set of n choices, then output the cleaned-up copy, ready to use in whatever program or utility you'd like.

Dave Taylor has been hacking shell scripts since the dawn of the computer era. Well, not really. But still, 30 years is a long time! He's the author of the newly revised Learning Unix for Mac OS X and the popular shell scripting book Wicked Cool Shell Scripts. He can be found on Twitter as @DaveTaylor, and you can reach him through his tech Q&A site: www.AskDaveTaylor.com.
