LJ Archive

Work the Shell

Analyzing Your Search Keywords

Dave Taylor

Issue #154, February 2007

Screen the unwanted results out of your access log searches.

Last month, we started exploring how you can use a shell script to extract and analyze the HTTP_REFERER values out of your Web server log and identify the most common terms and phrases that people used to find your pages. Sounds useful, doesn't it?

The problem is, the script is more nuanced than it initially seems. Last month, we wrapped up with the following shell script:

#!/bin/sh

ACCESSLOG="/var/logs/httpd.logs/access_log"

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -5

When I run this, here's what I see:

$ sh google-searches.sh
  94 hl=en
  18 client=safari
   6 client=firefox-a
   4 sourceid=navclient
   4 client=opera

That's weird, because it's not search terms, it's other variables that are included with search strings sent from sites like Google (hl=en says that you've constrained searches to English-language sites only, client=safari identifies the user's Web browser as Apple's Safari and so on).

Screening Out False Matches

The problem is revealed when we look at the first ten matches rather than only the first five:

$ sh google-searches.sh | head -10
  94 hl=en
  18 client=safari
   6 client=firefox-a
   4 sourceid=navclient
   4 client=opera
   3 wicked cool scripts
   3 hl=zh-CN
   2 num=100
   2 hs=wNy
   2 barbara nelson%2Bpurses

Ah, so we can see that there are two valid searches here, one for “wicked cool scripts” and one for “Barbara nelson%2Bpurses”. Not sure what the latter one's about, but it's useful and important to see. Fortunately, screening out the bogus matches is as simple as using grep to remove fields that include an equal sign: grep -v '='.

Rather than have that at the very end of the long pipe in the script, however, I'll place it immediately after the sed invocation to strip out false results as soon as possible in the pipeline to speed up the entire script. Now it looks like this:

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  grep -v '=' | \
  sort | \
  uniq -c | \
  sort -rn

Notice that the sed statement itself strips out the name= part of the search (q=), so that it's not incorrectly matched in the new grep statement.

Now we have the results we want:

$ sh google-searches.sh | head -10
   3 wicked cool scripts
   2 barbara nelson%2Bpurses
   1 wsj%20password
   1 why did animal kingdom introduce expedition everest
   1 what makes a great speaker%3F
   1 university of phoenix center of writing excellence
   1 ubuntu x problem
   1 triboot osx ubuntu ydl
   1 the best dvd players
   1 symbol html heart

This site doesn't get a huge amount of traffic, so let's run the very same script against my far-busier AskDaveTaylor.com site. The results are more interesting:

$ sh google-searches.sh | head -10
   5 standalone player
   4 psp help
   4 create a myspace
   4 Documents and Settings"
   4 %24NtUninstall
   3 view myspace accounts that are set to private
   3 i cant hear music on runescape
   2 transfer files to psp
   2 sync v3 motorola mac
   2 running unix in windows xp

Much more interesting. Oh, and if you want to know how many searches you're exploring, it's another simple tweak to the script, an invocation of wc:

$ sh google-searches.sh | wc -l
     501

So out of 501 searches, the single-most common search is “standalone player”, which represents only five out of 500, or 1% of my search traffic.

Stripping Out Unwanted Characters

One more step before we walk away from this script for the month: let's get rid of the strange characters that have been carried over from the original URL encoding of the user's Web browser. What am I talking about? The %24, the closing double quote in Documents and Settings and the %2B in the earlier search for purses.

You can figure out all the mappings and convert everything as appropriate, but I'm lazy at the end of the day and will instead simply find all %xx sequences and replace them with a single space.

This sounds hard, but it's a perfect job for sed because it allows you to do pattern matching and then replace the matched material with whatever else you desire. Here's how I'd do that:

sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g'

Let's look at this closely before you panic. A set separated by square brackets is a set in regular expression terminology, so [0-9] will match any of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 0. It turns out that URL encoding uses hexadecimal, so not only can the values be 0-9, but they can also be A, B, C, D, E and F, in upper- or lowercase letters—hence 0-9 and a-f and A-F in the pattern. The overall pattern form is % followed by any of these possible values, followed by any of these possible values again. Now you can see the full pattern.

Finally, before we beat this completely into the ground, note that the bigger structure here in the sed statement is s/old/new/g, which replaces old with new throughout the line, whether it occurs once or 15 times.

We're not quite done yet, however, because we also need to strip the stray double quotes. Again, this is easily added to the sed statement:

sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g'

Here's the final script:

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g' | \
  grep -v '=' | sort | uniq -c | sort -rn

And, the final results:

$ sh google-searches.sh | head -15
   6  NtUninstall
   5 standalone player
   4 psp music
   4 psp help
   4 creat a myspace
   4 Documents and Settings
   3 view myspace accounts that are set to private
   3 i cant hear music on runescape
   2 transfer files to psp
   2 sync v3 motorola mac
   2 running unix in windows xp
   2 rss feed reader shell
   2 reinstall windows xp hp
   2 psp transfer music
   2 psp internet

Note that once we remove the stray material, things organize slightly differently (for example, here you can see that psp music is one of the top searches, but earlier we had different variations of psp music, and it didn't make it to a top search value).

Okay, enough torturing of the Apache log file. Let's wrap this up and we'll switch to something completely different in the next column! Suggestions? Please e-mail them to me!

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.

LJ Archive