LJ Archive

Work the Shell

A Word Finder for Words With Friends

Dave Taylor

Issue #215, March 2012

Dave continues to explore how to create a Scrabble suggestion engine and learns that regular expressions sometimes aren't the best path to a solution.

In my last article, I looked at a simple crossword-puzzle word finder, something that requires a word list and a basic understanding of grep. Then, I switched to looking at Scrabble and the popular on-line equivalent Words With Friends.

In this latter case, the problem turns out to be quite a bit more difficult. Say you have seven tiles (one or more of which could be a blank or wild-card tile, but I'll address that later) that are a random set of letters, and from that, you want to combine them to create dictionary words. For example, I'm playing a game of Words With Friends with my sister, and the tiles I have to work with are T E C Y Z S X. But, of course, any good Scrabble player knows that you also need to work with the letters already on the board, so although I can make a word like “YET” or “SEX” from these letters, I still have to interlock the word onto the board somehow. It's harder, but that's where big-point word play comes from.

Still, let's stick with the basics: enter a set of letters, and the script will offer a set of possible words using those letters. What about all these nuances? Yeah, they're going to make this way more complicated!

Words from Your Letters

Tapping the word list we downloaded in the last column, the most obvious search is for the letters as a regular expression:

$ grep '^t*e*c*y*z*s*x$' words.txt
ex

Ah, that doesn't work well because the order of the letters is important to grep although it's not important to us. Instead, a complicated pipe offers an interesting alternative:

grep t words.txt | grep e | grep c | grep y 
 ↪| grep z | grep s | grep x

But, that doesn't produce any results because it's looking for words that have all the letters in our hand, and that's basically impossible.

So, what about this:

grep 't*' words.txt | grep 'e*' | grep 'c*' | grep 'y*' 
 ↪| grep 'z*' | grep 's*' | grep 'x*'

The x* notation is “zero or more occurrences of letter x”, and that is clearly not going to work because every word matches this complex search pattern if you think about it.

Let's flip this around and screen out all words that contain letters not in our set of letters instead:

$ grep -vE '[^tecyzsx]' words.txt
cee
cees
cess

There's another problem. The words match, except we're not taking into account the frequency of each letter. That is, although “cess” indeed comprises only letters in our set, we have one “s”, not two, so it's not actually a valid match.

Then again, at least it's a step in the right direction, which is more than the previous examples have demonstrated, so let's run with it.

Analyzing Length and Filtering Results

With seven letters, the first simple secondary filter is that any words longer than seven letters can be axed. How to test? The wc command works, but awkwardly. Still, I have a sense we're going to end up with each possible match going into a more complex test, so let's start building it:

#!/bin/sh
# Findword -- find matching dictionary words for 
 ↪Scrabble given a set of letters
datafile="words.txt"
maxlength=7
minlength=4 
if [ -z "$1" ] ; then
  echo "Usage: $(basename $0) letters"; exit 1
fi 
for possibility in $(grep -vE "[^$1]" words.txt)
do
  length=$(echo $possibility | wc -c)
  if [ $length -gt $maxlength ] ; then
    echo "$possibility is too long"
  elif [ $length -lt $minlength ] ; then
    echo "$possibility is too short"
  else
    echo $possibility is a possibility -- length = $length
  fi
done

There's a built-in problem with this script that you'll realize if you're familiar with how wc counts letters. To demonstrate:

$ echo linux | wc -c
       6

Six? Shouldn't that be five?

The fix is to add the following:

# adjust lengths because our fast wc-c adds 1 char
maxlength=$(( $maxlength + 1 ))
minlength=$(( $minlength + 1 ))

You might find it faster simply to add one to each of the default settings, but because down the road, I am anticipating letting the user specify min/max length words, the compensatory code seems a better solution.

With that added code, we can find five-, six- or seven-letter words that are made up of only the letters in our hand by simply commenting out the “too long/too short” message:

$ findword.sh tecyzsx
cees is a possibility -- length = 5
cess is a possibility -- length = 5
cesse is a possibility -- length = 6
cesses is a possibility -- length = 7
cete is a possibility -- length = 5
cetes is a possibility -- length = 6
cyeses is a possibility -- length = 7

Now, we can't sidestep any longer; it's time to figure out how to test for the frequency of each letter to ensure that the words actually can be formed by the tiles we hold. Note that in the above example, none of the above words are actually a match when letter frequency is taken into account.

Counting Letter Occurrences

The first piece of this puzzle is to figure out how many times a letter occurs in a given word or sequence. Although there probably is a regular expression to do just that, I settled on the -o flag to grep, as demonstrated:

$ echo test | grep -o t
t
t

Append a wc -l, and you can write a simple function that returns the number of times a specified letter occurs in a given word or sequence:

occurrences()
{
  # how many times does 'letter' occur in 'word'? 
  local count=$( echo $2 | grep -o $1 | wc -l )
  echo $count
}
echo occurrences "t" "test"
occurrences "t" "test"

Testing will demonstrate that the result is “2”, as hoped for. This'll be the starting point for us in my next article when we continue this epic scripting journey into the world of Scrabble, Words With Friends and other word games.

Dave Taylor has been hacking shell scripts for more than 30 years. Really. He's the author of the popular Wicked Cool Shell Scripts and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.

LJ Archive