LJ Archive

Work the Shell

Scrabble and Words With Friends

Dave Taylor

Issue #214, February 2012

Got a burning desire to cheat on your next crossword or Scrabble game? If so, you'll want to explore Dave's latest shell script.

It's been a while since I've delved into the world of games and game programming in this column, so I thought it'd be interesting to look at writing a word-finding or matching script from the perspective of crossword puzzles and Scrabble-like games.

You know the drill. You're playing and have encountered a six-letter word with the clue “cat, family”, and you know only a few letters: _ E _ I _ _. You've already figured out “feline”, but can you write a script that you could give that sort of partial match and have it figure out possibilities? Let's give it a whirl.

Finding a Word List

Linux includes the slick GNU aspell program, but the dictionary is compiled, and it's also not designed for word games but is instead based on common English language usage—not optimal.

Fortunately, you can grab some free word-list files off the Internet to use as the basis for this particular script. The one I work with here is the English Open Word List, which contains almost 129K words. Grab a copy for yourself at dreamsteep.com/projects/the-english-open-word-list.html.

The download is a ZIP archive, and it's a bit of a mess, sprawling across multiple directories and ultimately creating a set of 26 files called A Words.txt, B Words.txt and so on. I dumped it all into a single file called words.txt by using the following:

cat *Words.txt > words.txt

The resultant file is not tiny:

$ wc ~/words.txt
128985  128985 1123743 /Users/taylor/words.txt

That's okay though. On a modern Linux system, it'll still be lightning fast, even with 128,985 lines of data.

Looking for Words

Let's start by looking up that partial match for “feline” from earlier. Fortunately, we don't need to look much further than grep, because we can express what we need with a regular expression:

$ grep -E '.e.i..' words.txt

The problem is, a quick glance at the output reveals that this pattern isn't good enough:

abbreviate
aberdevine
abiogenist
abstemious
academical

To root the search so that it's an exact match, not a portion within a larger word, we need to root the regular expression. This is done with ^ to denote the beginning of the line and $ to denote the end:

$ grep -E '^.e.i..$' words.txt
aecium
aedile
aerial
aerier

That's better, except there are a lot of words that match this pattern. In fact, it's surprisingly common. How common? Let's see:

$ grep -E '^.e.i..$' words.txt | wc -l
     264

Sheesh, there are 264 possibilities. Drop in another letter though, and it's a bit more manageable:

$ grep -E '^.e.in.$' words.txt | wc -l
      58

One more letter, and although theoretically we should be able to guess it anyway, at least the choices are reduced to a viewable list:

$ grep -E '^.elin.$' words.txt
feline
heling
reline

That's actually the problem with crossword-puzzle-finder apps. There simply are too many words in the English language. Still, it's quite possible that for some letter combinations, even having most letters missing still produces a short list:

$ grep -E '^...cc.t.$' words.txt
braccate
placcate
spiccato
staccato
stoccata

This should be motivation to learn how to solve as much of that darn crossword puzzle as possible before turning to our little shell solution!

Words With Friends

Games like Scrabble and Words With Friends present an opposite problem. We know all the letters; we just want to figure out what words we can create with them.

On first blush, the solution could be as simple as searching the words database for all words that contain zero or one occurrence of each letter in the sequence. Not so fast though—look at this:

$ grep -E '^s*t*a*c*c*a*t*o*$' words.txt
aa
act
at
cat
coo
oo
sac
sat
scat
scatt
so
st
staccato
ta
taco
tact
tao
tat
tatt
tattoo
to
too

There are two problems here. The first is that the * actually is shorthand for zero or more matches, and the second is that the letters in the pattern are analyzed in order of occurrence. So, although “tao” is a valid word extracted from the letters “STACCATO”, the word “tots” never will be found, because the letter s appears before the others in the pattern.

The first problem can be solved by making a more complex regular expression: X{0,1} means zero or exactly one occurrence of the letter X in the pattern. It's definitely more complex:

$ grep -E '^s{0,1}t{0,1}a{0,1}c{0,1}c{0,1}a{0,1}t{0,1}o{0,1}$' 
 ↪words.txt
aa
act
at
cat
sac
sat
scat
so
st
staccato
ta
taco
tact
tao
tat
to

The second problem of letter order is a lot harder to solve. We could generate random sequences, but some quick math shows that if we have ten characters, that's 10*9*8*7*6*5*4*3*2 possible combinations—a lot!

However, a more interesting way to try to address this is to take a complete left turn and lose this increasingly complicated regular-expression approach by using a chained set of simple one-letter grep patterns instead.

That's not quite what we'll do though, because the first thing we need to do is screen out all words that contain letters not in our acceptable set. That's done with the set notation:

$ grep -vE '[^staccato]' words.txt
aa
accoast
accoasts
accost
accosts

That's still not quite right, because we're not taking into account the frequency of individual letters (note that “accoasts” has two occurrences of the letter s, but there's only one in our original letter set).

Let's wrap this up here for this month, however. Next month, I'll dig deeper into the problem and see if I can come up with some sort of smart solution.

Dave Taylor has been hacking shell scripts for more than 30 years. Really. He's the author of the popular Wicked Cool Shell Scripts and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.

LJ Archive