LJ Archive

Work the Shell

What Day Is That Date in the Past?

Dave Taylor

Issue #207, July 2011

Parsing cal output.

Last month, we started a script that worked backward from a day and month date and figured out the most recent year—including possibly the current year—that would match that date occurring on that particular day. For example, April 1st as a Friday was most recently in this year, 2011, but April 1st as a Tuesday? When did that last occur?

To make things interesting, our script is focused on tapping in to one of the unsung utilities of Linux, cal, and parsing its output to identify a day for a given date.

As is typical with a shell script, much of the work so far has been involved in normalizing the input data so that what we hand to the cal program will work and be understood by the program.

The bigger challenge, however, was to figure out whether a possible date could be in the current year. Since the program always is looking backward, it needs to know the current date to compare. That is, I'm writing this on April 3, 2011. If I check for the most recent April 1 being a Friday, it should say 2011, but if I check for the most recent May 1 being a Sunday, it should not suggest 2011. That's in the future and isn't a valid answer.

That's all shown in my previous column, so let's get on to something new: figuring out how to parse the cal output.

Parsing cal Calendars

For any given month and year, cal produces output similar to this:

    August 2008
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31

Let's say we're looking for August 3rd. To search for it in this output, we need to specify that there should not be a digit before or after the date. This is doable with a simple regex:

$ cal aug 2008 | grep -e '[^0-9]3[^0-9]'
 3  4  5  6  7  8  9

(As you'll learn later, this is insufficient as a regular expression. If you're really paying attention, you're already suspecting it's going to end up being a bit more complicated.)

Now, we need to figure out which digit matches.

awk to the Rescue

The basic approach is to have awk use a for loop to step through each field on the lines that match the specified pattern:


{ for (i=1;i<=NF;i++) if ($i~/regex/) print i}

We could use this with the grep statement above, but let's save a command by letting awk do the conditional test too:


$ cal aug 2008 | awk '/regex/ { for (i=1;i<=NF;i++)
  if ($i~/regex/) print i }'

To test this, let's use a regular expression that tests for the 5th day of the month:

[^0-9]5[^0-9]
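As a quick check, here is a minimal sketch that feeds one hard-coded row of the August 2008 calendar to awk instead of calling cal. The variable names `week` and `pos` are made up for this illustration, and the field test is tightened to `/^5$/` so that only a field that is exactly 5 counts.

```shell
# One row of the August 2008 calendar, hard-coded so the check is repeatable.
week=' 3  4  5  6  7  8  9'

# Select rows containing a stand-alone 5, then print the number of the
# field that is exactly 5.
pos=$(printf '%s\n' "$week" |
  awk '/[^0-9]5[^0-9]/ { for (i = 1; i <= NF; i++) if ($i ~ /^5$/) print i }')

echo "field $pos"   # prints "field 3"
```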

This kind of works, but there's a problem. If we search for the 10th, because it appears at the very beginning of the line, it doesn't match the regular expression fragment [^0-9]10. The solution means our regex becomes more complicated, but here it is—one that works for the situation where it's possibly either the beginning of the line or the end of the line:

[^0-9]10[^0-9]|^10[^0-9]|[^0-9]10$

The | is a logical “or”, so the expression now reads: the earlier pattern, or the date at the beginning of the line followed by a non-digit (the ^ anchor), or the date at the end of the line preceded by a non-digit (the $ anchor).
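To see each alternative do its job, the sketch below runs the pattern through `grep -E` against a few hard-coded rows of the August 2008 calendar. The helper names `rows`, `find_week` and `find_week_grouped` are assumptions for this illustration, not part of the column's script. It also shows one edge case: a row that holds nothing but the date itself, such as a bare 31 with no padding after it, matches none of the three alternatives. Whether cal pads such rows with trailing blanks may vary between implementations, so the grouped variant is offered only as one possible hedge.

```shell
# A few rows of the August 2008 calendar, hard-coded so the demo is repeatable.
rows() {
  printf '%s\n' \
    '                1  2' \
    ' 3  4  5  6  7  8  9' \
    '10 11 12 13 14 15 16' \
    '24 25 26 27 28 29 30' \
    '31'
}

# find_week DAY: the three-way pattern from the column.
find_week() {
  rows | grep -E "[^0-9]$1[^0-9]|^$1[^0-9]|[^0-9]$1\$"
}

# find_week_grouped DAY: hypothetical variant that folds both anchors into
# groups, so a date that is alone on its row also matches.
find_week_grouped() {
  rows | grep -E "(^|[^0-9])$1([^0-9]|\$)"
}

find_week 3                           # matched by [^0-9]3[^0-9]
find_week 10                          # matched by ^10[^0-9]
find_week 30                          # matched by [^0-9]30$
find_week 31 || echo "31: no match"   # a bare "31" row slips through
find_week_grouped 31                  # the grouped variant catches it
```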

Fortunately, we're writing a script so we won't have to type this in more than once. Just as well!

There's another wrinkle in this output. We need to know not only in which field the matching number appears, but also how many fields in total are on the matching line. Why? Otherwise, a 2nd that falls on a Monday (field 2 of a full seven-day first week) would look exactly like the 2nd above, which falls on a Saturday (field 2 of a two-day first week).

Here's our test script fragment, so far:

expr="[^0-9]${day}[^0-9]|^${day}[^0-9]|[^0-9]${day}\$"
cal aug 2008 | awk "/$expr/ { print \$0 }"

Notice that we need to use double quotes so that the variable $day is expanded when expr is assigned, and $expr is expanded in the awk command. That also means we need to escape the $0 in this test, so the shell leaves it for awk.

That's not what we want though. The awk statement needs to be more sophisticated, because we want to know the matching field number (for example, day of week 1–7) along with the total number of fields in the matching line. Ready?


expr="[^0-9]${day}[^0-9]|^${day}[^0-9]|[^0-9]${day}\$"
cal aug 2008 | awk "/$expr/ { for (i=1;i<=NF;i++) {
     if (\$i~/${day}/) { print \"i=\"i\", NF=\"NF }}}"

The double quotes add a tiny bit of complication, but really, this is just a complicated script.

The output, against our August 2008 calendar, looks like this:

$ sh match.sh 2
i=2, NF=2
$ sh match.sh 10
i=1, NF=7
$ sh match.sh 19
i=3, NF=7
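For comparison, here is a sketch of the same lookup written without backslash-escaped dollar signs. It passes the day to awk with `-v` and keeps the awk program in single quotes. This is an alternative idiom, not the column's script: the function names `aug2008` and `match_day` are made up, the calendar is a hard-coded copy rather than live cal output, and the regex uses a grouped form of the three alternatives.

```shell
# Hard-coded copy of the August 2008 calendar (an assumption for repeatability;
# the column pipes real cal output instead).
aug2008() {
  cat <<'EOF'
    August 2008
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
EOF
}

# match_day DAY: print the field number of DAY and the field count of its row.
match_day() {
  aug2008 | awk -v day="$1" '
    $0 ~ ("(^|[^0-9])" day "([^0-9]|$)") {
      for (i = 1; i <= NF; i++)
        if ($i == day) print "i=" i ", NF=" NF
    }'
}

match_day 2    # i=2, NF=2
match_day 10   # i=1, NF=7
match_day 19   # i=3, NF=7
```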

That all makes sense. The next challenge is to figure out what day of the week we've matched for a given day and number of days in the week. Remember, day #1 on a three-day week is Thursday, while day #1 in a seven-day week is Sunday. Confusing, eh?

Day Of Week as an Array

The fast way to calculate this is to, well, pre-calculate it by creating a bunch of arrays. Like this:

if NF=1 days=[Sat]
if NF=2 days=[Fri,Sat]
if NF=3 days=[Thu,Fri,Sat]

and so on. There's a formula at play here, but more important, there's a pattern: (7-NF)+i is consistent. So day #1 on a three-day week is (7-3)+1 = 5 = Thursday, while day #1 on a seven-day week is (7-7)+1 = 1 = Sunday.

Let's double-check: in Aug 2008, Aug 1 is (7-2)+1 = 6 = Friday, Aug 4 is (7-7)+2 = 2 = Monday and Aug 31 is (7-1)+1 = 7 = Saturday.
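The same arithmetic can be sketched in shell. The helper name `weekday` is an assumption for this illustration. It takes the field number and the field count, applies (7-NF)+i with Sunday as day 1, and prints the day name.

```shell
# weekday I NF: apply (7-NF)+I and print the day name, with Sunday as day 1.
weekday() {
  n=$(( (7 - $2) + $1 ))
  set -- Sunday Monday Tuesday Wednesday Thursday Friday Saturday
  shift $(( n - 1 ))   # after the shift, $1 is the n-th day name
  echo "$1"
}

weekday 1 2   # Aug 1:  prints Friday
weekday 2 7   # Aug 4:  prints Monday
weekday 1 1   # Aug 31: prints Saturday
```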

Uh-oh, that last one's wrong: August 31, 2008, was a Sunday. We need to differentiate between the first week of the month, where the days are right-aligned (as it were!), and the last week of the month, where they're left-aligned.
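One possible way to handle the two alignments, offered only as a sketch and not as where this script is headed, is to treat the first row of dates differently from the rest. It assumes the layout shown earlier: a title row, a row of weekday names, then the dates, with Sunday in the first column. Under that assumption the first week is row 3 and uses (7-NF)+i, while every later row is left-aligned, so the field number is already the weekday number. The names `aug2008` and `dow` are hypothetical, and the calendar is a hard-coded copy rather than live cal output.

```shell
# Hard-coded copy of the August 2008 calendar, so the result is repeatable.
aug2008() {
  cat <<'EOF'
    August 2008
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
EOF
}

# dow DAY: print the weekday number of DAY (1 = Sunday ... 7 = Saturday).
dow() {
  aug2008 | awk -v day="$1" '
    NR > 2 {    # skip the title and weekday-name rows
      for (i = 1; i <= NF; i++)
        if ($i == day) print (NR == 3 ? (7 - NF) + i : i)
    }'
}

dow 1    # prints 6 (Friday)
dow 4    # prints 2 (Monday)
dow 31   # prints 1 (Sunday)
```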

Ah, another nuance. Crikey, this is rather tricky to write, isn't it?

Next month, we'll continue to build the script. Meanwhile, experiment with awk and regular expressions and see if you can find a more streamlined solution.

Dave Taylor has been hacking shell scripts for a really long time, 30 years. He's the author of the popular Wicked Cool Shell Scripts and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
