Work the Shell

Words—We Can Make Lots of Words

Dave Taylor

Issue #259, November 2015

In this article, Dave Taylor shows complicated script code to complete the findwords script. Now you'll be ready to crush everyone in Scrabble and Words with Friends.

It was a dark and stormy night when I started this series here in Linux Journal—at least two months ago, and in Internet terms, that's quite a while. And just wait until our robot overlords are running the show, because then two months will be 10–20 generations of robot evolution and quite frankly, the T-2000 probably could have solved this problem already anyway.

Puny humans!

But, we haven't yet reached the singularity—at least, I don't think so. I asked Siri, and she said we hadn't, so that's good enough, right? Let's dive back in to this programming project because the end is nigh! Well, for this topic at least.

The challenge started out as trying to make words from a combination of letter blocks. You know, the wooden blocks that babies play with (or, alternatively, hurl at you if you're within 20 feet of them). Those give you six letters per space, but I simplified the problem down to the Scrabble tiles example: you have a set of letters on your rack; what words can you make with them?

I've talked about algorithms for the last few months, so this time, let's really dig in to the code for findwords, the resultant script. After discarding various solutions, the one I've implemented has two phases:

Identify a list of all words that are composed only of the letters started with (so “axe” wouldn't match the starting letters abcdefg).
For each word that matches, check that the number of letters needed to spell the word match up with the occurrences of letters in the starting pattern (so “frogger” can't be made from forger—but almost).

Let's have a look at the code blocks, because it turns out that this is non-trivial to implement, but we have learned to bend The Force to do our bidding (in other words, we used regular expressions).

First we step through the dictionary to identify n-letter words that don't contain letters excluded from the set, with the additional limitation that the word is between (length–3) and (length) letters long:

unique="$(echo $1 | sed 's/./&\
/g' | tr '[[:upper:]]' '[[:lower:]]' | sort | uniq | \
fmt | tr -C -d '[[:alpha:]]')"

while [ $minlength -lt $length ]
do
  regex="^[$unique]{$minlength}$"
  if [ $verbose ] ; then
    echo "Raw word list of length $minlength for \
    letterset $unique:"
    grep -E $regex "$dictionary" | tee -a $testwords
  else
    grep -E $regex "$dictionary" >> $testwords
  fi
  minlength="$(( $minlength + 1 ))"
done

I explained how this block works in my column in the last issue (October 2015), if you want to flip back and read it, but really, the hard work involves the very first line, creating the variable $unique, which is a sorted, de-duped list of letters from the original pattern. Given “messages”, for example, $unique would be “aegms”.

Indeed, given “messages”, here are the words that are identified as possibilities by findwords:

Raw word list of length 6 for letterset aegms:
assess
mammas
masses
messes
sesame
Raw word list of length 7 for letterset aegms:
amasses
massage
message
Raw word list of length 8 for letterset aegms:
assesses
massages
messages

Clearly there's more work to do, because it's not possible to make the word “massages” from the starting pattern “messages”, since there aren't enough occurrences of the letter “a”. That's the job of the second part of the code, so I'm just going to show you the whole thing, and then I'll explain specific sections:

pattern="$(echo $1 | sed 's/./&\
/g' | tr '[[:upper:]]' '[[:lower:]]' | sort | fmt 
 ↪| sed 's/ //g')"
for word in $( cat $testwords )
do
  simplified="$(echo $word | sed 's/./&\
/g' | tr '[[:upper:]]' '[[:lower:]]' | sort | fmt 
 ↪| sed 's/ //g')"

  ## PART THREE: do all letters of the word appear 
  # in the pattern once and exactly once? Easy way: 
  # loop through and remove each letter as used, 
  # then compare end states

  indx=1; failed=0
  before=$pattern
  while [ $indx -lt ${#simplified} ]
  do
    ltr=${simplified:$indx:1}
    after=$(echo $before | sed "s/$ltr/-/")
    if [ $before = $after ] ; then
      failed=1
    else
      before=$after
    fi
    indx=$(( $indx + 1 ))
  done
  if [ $failed -eq 0 ] ; then
    echo "SUCCESS: You can make the word $word"
  fi
done

The first rather gnarly expression to create $pattern from the specified starting argument ($1) normalizes the pattern to all lowercase, sorts the letters alphabetically, then reassembles it. In this case, “messages” would become “aeegmsss”. Why? Because we can do that to each of the possible words too, and then the comparison test becomes easy.

The list of possible words was created in part one and is stored in the temporary file $testwords, so the “for” loop steps us through. For each word, $simplified becomes a similarly normalized pattern to check. For each letter in the proposed word, we replace that letter with a dash in the pattern, using two variables, $before and $after, to stage the change so we can ensure that something really did change for each letter. That's what's done here:

after=$(echo $before | sed "s/$ltr/-/")

If $before = $after, then the needed letter from the proposed word wasn't found in the pattern, and the word can't be assembled from the pattern. On the other hand, if there are extra letters in the pattern after we're done analyzing the word, that's fine. That's the situation where we can make, for example, “games” from “messages”, and that's perfectly valid, even with the leftover letters.

I've added some debugging statements so you can get a sense of what's going on in this example invocation:

$ sh findwords.sh messages
Raw word list of length 5 for letterset aegms:
amass
asses
eases
games
gamma
gases
geese
mamma
sages
seams
seems
Raw word list of length 6 for letterset aegms:
assess
mammas
masses
messes
sesame
Raw word list of length 7 for letterset aegms:
amasses
massage
message
Raw word list of length 8 for letterset aegms:
assesses
massages
messages

created pattern aeegmsss
SUCCESS: You can make the word asses
SUCCESS: You can make the word eases
SUCCESS: You can make the word games
SUCCESS: You can make the word gases
SUCCESS: You can make the word sages
SUCCESS: You can make the word seams
SUCCESS: You can make the word seems
SUCCESS: You can make the word masses
SUCCESS: You can make the word messes
SUCCESS: You can make the word sesame
SUCCESS: You can make the word message
SUCCESS: You can make the word messages

So, we can make a dozen different words out of the word “messages”, including the word messages itself. What about the original pattern we were using in previous columns: “chicken”? For this one, let's skip the potential words and just look at the solution:

SUCCESS: You can make the word chic
SUCCESS: You can make the word chin
SUCCESS: You can make the word heck
SUCCESS: You can make the word hick
SUCCESS: You can make the word hike
SUCCESS: You can make the word inch
SUCCESS: You can make the word neck
SUCCESS: You can make the word nice
SUCCESS: You can make the word nick
SUCCESS: You can make the word check
SUCCESS: You can make the word chick
SUCCESS: You can make the word chink
SUCCESS: You can make the word niche
SUCCESS: You can make the word chicken

Impressive!

To make this work a bit better, I've added some error checking, included an -f flag so we can have the script also output failures, not just successes, and left in some additional debugging output if $verbose is set to true.

See Listing 1 for the complete code. It's also available at www.linuxjournal.com/extra/findwords.

That's it. Now we have a nice tool that can help us figure out what to play the next time we're stuck on Scrabble, Words with Friends, or even looking at a big stack of letter blocks.

Next month, I'll turn my attention to a different scripting challenge. Do you have an idea? Send it to info@linuxjournal.com.

Listing 1. findwords.sh

#!/bin/sh

# Findwords -- given a set of letters, try to find all the words you can
# spell

dictionary="/Users/taylor/Documents/Linux Journal/dictionary.txt"

testwords=$(mktemp /tmp/findwords.XXXXXX) || exit 1

if [ -z "$1" ] ; then
  echo "Usage: findwords [sequence of letters]"
  exit 0
fi

if [ "$1" = "-f" ] ; then
  showfails=1
  shift
fi

## PART ONE: make the regular expression

length="$(echo "$1" | wc -c)"
minlength=$(( $length - 4 ))    # we can ignore a max of 2 letters

if [ $minlength -lt 3 ] ; then
  echo "Error: sequence must be at least 5 letters long"
  exit 0
fi

unique="$(echo $1 | sed 's/./&\
/g' | tr '[[:upper:]]' '[[:lower:]]' | sort | uniq | fmt | \
tr -C -d '[[:alpha:]]')"

while [ $minlength -lt $length ]
do
  regex="^[$unique]{$minlength}$"

  if [ $verbose ] ; then
    echo "Raw word list of length $minlength for letterset $unique:"
    grep -E $regex "$dictionary" | tee -a $testwords
  else
    grep -E $regex "$dictionary" >> $testwords
  fi
  minlength="$(( $minlength + 1 ))"
done

## PART TWO: sort letters for validity filter

pattern="$(echo $1 | sed 's/./&\
/g' | tr '[[:upper:]]' '[[:lower:]]' | sort | fmt | sed 's/ //g')"

for word in $( cat $testwords )
do
  # echo "checking $word for validity"

  simplified="$(echo $word | sed 's/./&\
/g' | tr '[[:upper:]]' '[[:lower:]]' | sort | fmt | sed 's/ //g')"

  ## PART THREE: do all letters of the word appear in the pattern
  #              once and exactly once? Easy way: loop through and
  #              remove each letter as used, then compare end states

  indx=1
  failed=0
  before=$pattern

  while [ $indx -lt ${#simplified} ]
  do
    ltr=${simplified:$indx:1}
    after=$(echo $before | sed "s/$ltr/-/")
    if [ $before = $after ] ; then
      # nothing changed, so we don't have that 
      # letter available any more
      if [ $showfails ] ; then
        echo "FAILURE: came close, but can't make $word"
      fi
      failed=1
    else
     before=$after
    fi

    indx=$(( $indx + 1 ))
  done

  if [ $failed -eq 0 ] ; then
    echo "SUCCESS: You can make the word $word"
  fi

done

/bin/rm -f $testwords

exit 0