Book HomeLearning Perl, 3rd EditionSearch this book

15.4. Advanced Sorting

Earlier, in Chapter 3, "Lists and Arrays ", we showed that you could sort a list in ascending ASCIIbetical order by using the builtin sort operator. What if you want a numeric sort? Or a case-insensitive sort? Or maybe you want to sort items according to information stored in a hash. Well, Perl lets you sort a list in whatever order you'd need; we'll see all of those examples by the end of the chapter.

You'll tell Perl what order you want by making a sort-definition subroutine, or sort subroutine for short. Now, when you first hear the term "sort subroutine," if you've been through any computer science courses, visions of bubble sort and shell sort and quick sort race through your head, and you say, "No, never again!" Don't worry; it's not that bad. In fact, it's pretty simple. Perl already knows how to sort a list of items; it merely doesn't know which order you want. So the sort-definition subroutine simply tells it the order.

Why is this necessary? Well, if you think about it, sorting is putting a bunch of things in order by comparing them all. Since you can't compare them all at once, you need to compare two at a time, eventually using what you find out about each pair's order to put the whole kit'n'caboodle in line. Perl already understands all of those steps except for the part about how you'd like to compare the items, so that's all you have to write.

This means that the sort subroutine doesn't need to sort many items after all. It merely has to be able to compare two items. If it can put two items in the proper order, Perl will be able to tell (by repeatedly consulting the sort subroutine) what order you want for your data.

The sort subroutine is defined like an ordinary subroutine (well, almost). This routine will be called repeatedly, each time checking on a pair of elements from the list to be sorted.

Now, if you were writing a subroutine that's expecting to get two parameters that need sorting, you might write something like this to start:

sub any_sort_sub {  # It doesn't really work this way
  my($a, $b) = @_;  # Get and name the two parameters
  # start comparing $a and $b here
  ...
}

But the sort subroutine will be called again and again, often hundreds or thousands of times. Declaring the variables $a and $b and assigning them values at the top of the subroutine will take just a little time, but multiply that by the thousands of times that the routine will be called, and you can see that it contributes significantly to the overall execution speed.

We don't do it like that. (In fact, if you did it that way, it wouldn't work.) Instead, it is as if Perl has done this for us, before our subroutine's code has even started.[340] You'll really write a sort subroutine without that first line; both $a and $b have been assigned for you. When the sort subroutine starts running, $a and $b are two elements from the original list.

[340]To be honest, it's closer to being as if Perl has used local($a, $b) in a private block around the sort invocation, because these variables are really globals rather than lexical variables. (Unless you do something unusual, though, you can't tell the difference inside the sort subroutine; you can pretend these are my variables. use strict makes a special exception for these two globals, so you don't need to declare them in any way.) This means that if your program already has its own $a or $b, you won't be able to access those while Perl is sorting the list. Also, be sure to note that the two items to be sorted are not passed in via @_ (unless you use a subroutine prototype, which we won't cover in this book, but see the documentation for the full details). Inside the sort subroutine, just use $a and $b, and try not to worry too much about where they came from. And as if that wasn't enough, if there's a lexical $a or $b somewhere in scope, the subroutine definition doesn't work either. Whew!

The subroutine returns a coded value describing how the elements compare (like C's qsort(3) does, but it's Perl's own internal sort implementation). If $a should appear before $b in the final list, the sort subroutine returns -1 to say so. If $b should appear before $a, it returns 1.

If the order of $a and $b doesn't matter, the subroutine returns 0. Why would it not matter? Perhaps you're doing a case-insensitive sort and the two strings are fred and Fred. Or perhaps you're doing a numeric sort, and the two numbers are equal.

We could now write a numeric sort subroutine like this:

sub by_number {
  # a sort subroutine, expect $a and $b
  if ($a < $b) { -1 } elsif ($a > $b) { 1 } else { 0 }
}

To use the sort subroutine, just put its name (without an ampersand) between the keyword sort and the list to be sorted. This example puts a numerically sorted list of numbers into @result:

my @result = sort by_number @some_numbers;

We called the subroutine by_number because that describes how it's sorting. But more importantly, you can read the line of code that uses it with sort as saying "sort by number," as you would in English. Many sort-subroutine names begin with by_ to describe how they sort. Or we could have called this one numerically, for a similar reason, but that's more typing and more chance to mess something up.

Notice that we don't have to do anything in the sort subroutine to declare $a and $b, and to set their values -- and if we did, the subroutine wouldn't work right. We just let Perl set up $a and $b for us, and so all we need to write is the comparison.

In fact, we can make it even simpler (and more efficient). Since this kind of three-way comparison is frequent, Perl has a convenient shortcut to use to write it. In this case, we use the spaceship operator (<=>).[341] This operator compares two numbers and returns -1, 0, or 1 as needed to sort them numerically. So we could have written that sort subroutine better, like this:

[341]We call it that because it looks like one of the Tie-fighters from Star Wars. Well, it looks like that to us, anyway.

sub by_number { $a <=> $b }

Since the spaceship compares numbers, you may have guessed that there's a corresponding three-way string-comparison operator: cmp. These two are easy to remember and keep straight. The spaceship has a family resemblance to the numeric comparison operators like >=, but it's three characters long instead of two because it has three possible return values instead of two. And cmp has a family resemblance to the string comparison operators like ge, but it's three characters long instead of two because it also has three possible return values instead of two.[342]

[342]This is no accident. Larry does things like this on purpose, to make Perl easier to learn and remember. Remember, he's a linguist at heart, so he's studied how people think of languages.

Of course, cmp by itself provides the same order as the default sort. You'd never need to write this subroutine, which yields merely the default sort order:[343]

[343]You'd never need to write this unless, of course, you were writing an introductory Perl book and needed it for an example.

sub ASCIIbetically { $a cmp $b }
my @strings = sort ASCIIbetically @any_strings;

But you can use cmp to build a more complex sort order, like a case-insensitive sort:

sub case_insensitive { "\L$a" cmp "\L$b" }

In this case, we're comparing the string from $a (forced to lowercase) against the string from $b (forced to lowercase), giving a case-insensitive sort order.

Note that we're not modifying the elements themselves; we're merely using their values. That's actually important: for efficiency reasons, $a and $b aren't copies of the data items. They're actually new, temporary aliases for elements of the original list, so if we changed them we'd be mangling the original data. Don't do that -- it's neither supported nor recommended.

When your sort subroutine is as simple as the ones we show here (and most of the time, it is), you can make the code even simpler yet, by replacing the name of the sort routine with the entire sort routine "in line," like so:

my @numbers = sort { $a <=> $b } @some_numbers;

In fact, in modern Perl, you'll hardly ever see a separate sort subroutine; you'll frequently find sort routines written inline as we've done here.

Suppose you want to sort in descending numeric order. That's easy enough to do with the help of reverse:

my @descending = reverse sort { $a <=> $b } @some_numbers;

But here's a neat trick. The comparison operators (<=> and cmp) are very nearsighted; that is, they can't see which operand is $a and which is $b, but only which value is on the left and which is on the right. So if $a and $b were to swap places, the comparison operator would get the results backwards every time. That means that this is another way to get a reversed numeric sort:

my @descending = sort { $b <=> $a } @some_numbers;

You can (with a little practice) read this at a glance. It's a descending-order comparison (because $b comes before $a, which is descending order), and it's a numeric comparison (because it uses the spaceship instead of cmp). So, it's sorting numbers in reverse order.

15.4.1. Sorting a Hash by Value

Once you've been sorting lists happily for a while you'll run into a situation where you want to sort a hash by value. For example, three of our characters went out bowling last night, and we've got their bowling scores in the following hash. We want to be able to print out the list in the proper order, with the game winner at the top, so we want to sort the hash by score:

my %score = ("barney" => 195, "fred" => 205, "dino" => 30);
my @winners = sort by_score keys %score;

Of course, we aren't really going to be able to sort the hash by score; that's just a verbal shortcut. You can't sort a hash! But when we've used sort with hashes before now, we've been sorting the keys of the hash (in ASCIIbetical order). Now, we're still going to be sorting the keys of the hash, but the order is now defined by their corresponding values from the hash. In this case, the result should be a list of our three characters' names, in order according to their bowling scores.

Writing this sort subroutine is fairly easy. What we want is to use a numeric comparison on the scores, rather than the names. That is, instead of comparing $a and $b (the players' names), we want to compare $score{$a} and $score{$b} (their scores). If you think of it that way, it almost writes itself, as in:

sub by_score { $score{$b} <=> $score{$a} }

Let's step through this and see how it works. Let's imagine that the first time it's called, Perl has set $a to barney and $b to fred. So the comparison is $score{"fred"} <=> $score{"barney"}, which (as we can see by consulting the hash) is 205 <=> 195. Remember, now, the spaceship is nearsighted, so when it sees 205 before 195, it says, in effect: "No, that's not the right numeric order; $b should come before $a." So it tells Perl that fred should come before barney.

Maybe the next time the routine is called, $a is barney again but $b is now dino. The nearsighted numeric comparison sees 30 <=> 195 this time, so it reports that that they're in the right order; $a does indeed sort in front of $b. That is, barney comes before dino. At this point, Perl has enough information to put the list in order: fred is the winner, then barney in second place, then dino.

Why did the comparison use the $score{$b} before the $score{$a}, instead of the other way around? That's because we want bowling scores arranged in descending order, from the highest score of the winner down. So you can (again, after a little practice) read this one at sight as well: $score{$b} <=> $score{$a} means to sort according to the scores, in reversed numeric order.

15.4.2. Sorting by Multiple Keys

We forgot to mention that there was a fourth player bowling last night with the other three, so the hash really looked like this:

my %score = (
  "barney" => 195, "fred" => 205,
  "dino" => 30, "bamm-bamm" => 195,
);

Now, as you can see, bamm-bamm has the same score as barney. So which one will be first in the sorted list of players? There's no telling, because the comparison operator (seeing the same score on both sides) will have to return zero when checking those two.

Maybe that doesn't matter, but we generally prefer to have a well-defined sort. If several players have the same score, we want them to be together in the list, of course. But within that group, the names should be in ASCIIbetical order. But how can we write the sort subroutine to say that? Again, this turns out to be pretty easy:

my @winners = sort by_score_and_name keys %score;

sub by_score_and_name {
  $score{$b} <=> $score{$a}  # by descending numeric score
    or
  $a cmp $b                  # ASCIIbetically by name
}

How does this work? Well, if the spaceship sees two different scores, that's the comparison we want to use. It returns -1 or 1, a true value, so the low-precedence short-circuit or will mean that the rest of the expression will be skipped, and the comparison we want is returned. (Remember, the short-circuit or returns the last expression evaluated.) But if the spaceship sees two identical scores, it returns 0, a false value, and thus the cmp operator gets its turn at bat, returning an appropriate ordering value considering the keys as strings. That is, if the scores are the same, the string-order comparison breaks the tie.

We know that when we use the by_score_and_name sort subroutine like this, it will never return 0. (Do you see why it won't? The answer is in the footnote.[344]) So we know that the sort order is always well-defined; that is, we know that the result today will be the same as the result with the same data tomorrow.

[344]The only way it could return 0 would be if the two strings were identical, and (since the strings are keys of a hash) we already know that they're different. Of course, if you passed a list with duplicate (identical) strings to sort, it would return 0 when comparing those, but we're passing a list of hash keys.

There's no reason that your sort subroutine has to be limited to two levels of sorting, of course. Here the Bedrock library program puts a list of patron ID numbers in order according to a five-level sort.[345] This example sorts according to the amount of each patron's outstanding fines (as calculated by a subroutine &fines, not shown here), the number of items they currently have checked out (from %items), their name (in order by family name, then by personal name, both from hashes), and finally by the patron's ID number, in case everything else is the same:

[345]It's not unusual in the modern world to need a five-level sort like this, although it was quite infrequent in prehistoric times.

@patron_IDs = sort {
  &fines($b) <=> &fines($a) or
  $items{$b} <=> $items{$a} or
  $family_name{$a} cmp $family_name{$a} or
  $personal_name{$a} cmp $family_name{$b} or
  $a <=> $b
} @patron_IDs;


Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.