Dave continues to explore how to create a Scrabble suggestion engine and learns that regular expressions sometimes aren't the best path to a solution.
In my last article, I looked at a simple crossword-puzzle word finder, something that requires a word list and a basic understanding of grep. Then, I switched to looking at Scrabble and the popular on-line equivalent Words With Friends.
In this latter case, the problem turns out to be quite a bit more difficult. Say you have seven tiles (one or more of which could be a blank or wild-card tile, but I'll address that later) that are a random set of letters, and from that, you want to combine them to create dictionary words. For example, I'm playing a game of Words With Friends with my sister, and the tiles I have to work with are T E C Y Z S X. But, of course, any good Scrabble player knows that you also need to work with the letters already on the board, so although I can make a word like “YET” or “SEX” from these letters, I still have to interlock the word onto the board somehow. It's harder, but that's where big-point word play comes from.
Still, let's stick with the basics: enter a set of letters, and the script will offer a set of possible words using those letters. What about all these nuances? Yeah, they're going to make this way more complicated!
Tapping the word list we downloaded in the last column, the most obvious search is for the letters as a regular expression:
$ grep '^t*e*c*y*z*s*x$' words.txt ex
Ah, that doesn't work well because the order of the letters is important to grep although it's not important to us. Instead, a complicated pipe offers an interesting alternative:
grep t words.txt | grep e | grep c | grep y ↪| grep z | grep s | grep x
But, that doesn't produce any results because it's looking for words that have all the letters in our hand, and that's basically impossible.
So, what about this:
grep 't*' words.txt | grep 'e*' | grep 'c*' | grep 'y*' ↪| grep 'z*' | grep 's*' | grep 'x*'
The x* notation is “zero or more occurrences of letter x”, and that is clearly not going to work because every word matches this complex search pattern if you think about it.
Let's flip this around and screen out all words that contain letters not in our set of letters instead:
$ grep -vE '[^tecyzsx]' words.txt cee cees cess
There's another problem. The words match, except we're not taking into account the frequency of each letter. That is, although “cess” indeed comprises only letters in our set, we have one “s”, not two, so it's not actually a valid match.
Then again, at least it's a step in the right direction, which is more than the previous examples have demonstrated, so let's run with it.
With seven letters, the first simple secondary filter is that any words longer than seven letters can be axed. How to test? The wc command works, but awkwardly. Still, I have a sense we're going to end up with each possible match going into a more complex test, so let's start building it:
#!/bin/sh # Findword -- find matching dictionary words for ↪Scrabble given a set of letters datafile="words.txt" maxlength=7 minlength=4 if [ -z "$1" ] ; then echo "Usage: $(basename $0) letters"; exit 1 fi for possibility in $(grep -vE "[^$1]" words.txt) do length=$(echo $possibility | wc -c) if [ $length -gt $maxlength ] ; then echo "$possibility is too long" elif [ $length -lt $minlength ] ; then echo "$possibility is too short" else echo $possibility is a possibility -- length = $length fi done
There's a built-in problem with this script that you'll realize if you're familiar with how wc counts letters. To demonstrate:
$ echo linux | wc -c 6
Six? Shouldn't that be five?
The fix is to add the following:
# adjust lengths because our fast wc-c adds 1 char maxlength=$(( $maxlength + 1 )) minlength=$(( $minlength + 1 ))
You might find it faster simply to add one to each of the default settings, but because down the road, I am anticipating letting the user specify min/max length words, the compensatory code seems a better solution.
With that added code, we can find five-, six- or seven-letter words that are made up of only the letters in our hand by simply commenting out the “too long/too short” message:
$ findword.sh tecyzsx cees is a possibility -- length = 5 cess is a possibility -- length = 5 cesse is a possibility -- length = 6 cesses is a possibility -- length = 7 cete is a possibility -- length = 5 cetes is a possibility -- length = 6 cyeses is a possibility -- length = 7
Now, we can't sidestep any longer; it's time to figure out how to test for the frequency of each letter to ensure that the words actually can be formed by the tiles we hold. Note that in the above example, none of the above words are actually a match when letter frequency is taken into account.
The first piece of this puzzle is to figure out how many times a letter occurs in a given word or sequence. Although there probably is a regular expression to do just that, I settled on the -o flag to grep, as demonstrated:
$ echo test | grep -o t t t
Append a wc -l, and you can write a simple function that returns the number of times a specified letter occurs in a given word or sequence:
occurrences() { # how many times does 'letter' occur in 'word'? local count=$( echo $2 | grep -o $1 | wc -l ) echo $count } echo occurrences "t" "test" occurrences "t" "test"
Testing will demonstrate that the result is “2”, as hoped for. This'll be the starting point for us in my next article when we continue this epic scripting journey into the world of Scrabble, Words With Friends and other word games.