Stathis Sideris — Blog

published: 23 Jul 2007

Perl sort tutorial

Perl's sort function is very useful in its simplest form, but it can become very powerful quickly if you learn how to use the extra functionality that it provides. Here we look at the simplest cases of sorting, then we go on to examine ways to perform custom sorting, and finally we look at an example that uses sort on a more complex data structrure.

Prerequisites

Different parts of this article assume different levels of expertise, but in order to understand everything, you need to know about arrays and references. Also, it would be nice (but not essential) to have encountered the join function before.

Also, see the official documentation for sort.

The Simplest Case

The simplest case of sort would be something like that:

@theList = (4, 1, 2, 6, 93, 2, 65);
@theSortedList = sort @theList;
print join(', ', @theSortedList)."\n";

Which produces the following output:

1, 2, 2, 4, 6, 65, 93

Note that Perl will sort a list in the "natural order" of the data, so if you have a list that contains both numbers and strings, it will be sorted in alphanumeric order:

@theList = (3, "pear", "four", "apple", 7);
@theSortedList = sort @theList;
print join(', ', @theSortedList)."\n";

Produces:

3, 7, apple, four, pear

A common misconception about sort has to do with the fact that the function does not have any effect on the original list. sort does not sort the original list, instead it takes the elements of a list, sorts them and returns them as a new list. So, the following does nothing:

sort @theList;

...and you probably meant to say that:

@theList = sort @theList;

Custom Sorting

The default behaviour of sort is to sort numbers and strings in ascending order. This not always desirable, sometimes we would like to order things in descending order. Also, we might want to override the default behaviour of sorting in the most "natural" way: we may want to sort a numerical list in an aplhanumerical manner (to look at the numbers as text).

The behaviour of sort can be modified by providing a, usually short, block of Perl code before the list. This is how a descending sort would look like:

@theList = (4, 1, 2, 6, 93, 2, 65);
@theSortedList = sort {$b <=> $a} @theList;
print join(', ', @theSortedList)."\n";

Produces:

93, 65, 6, 4, 2, 2, 1

The block of code that customises the sorting is {$b <=> $a}. The two variables $a and $b are not declared elsewhere in our program, they are both internal variables declared by sort and they have a special meaning: they represent the two elements from list @theList that are compared in pairs by sort to determine how to order the list. The order of the two variables in this block of code is significant: $b and $a comes after and that causes the sorting to happen in descending order. The <=> operator causes the comparison to happen numerically: $a and $b are compared as numbers.

Here is an example of descending aplhanumerical (string) sorting. The string comparison cmp operator sould be used instead:

@theList = (3, "pear", "four", "apple", 7);
@theSortedList = sort {$b cmp $a} @theList;
print join(', ', @theSortedList)."\n";
pear, four, apple, 7, 3

To get a better feel of the differences between numerical and aplhanumeric sorting, look what happens when you sort numbers as if they were strings (in ascending order this time):

@theList = (205, 4, 100, 1, 2, 2000, 6, 93, 2, 65);
@theSortedList = sort {$a cmp $b} @theList;
print join(', ', @theSortedList)."\n";
1, 100, 2, 2, 2000, 205, 4, 6, 65, 93

Yes, it looks crazy, but aplhanumerically, the string '205' comes before the string '4' because it starts with the character '2' which is a 'smaller' than character '4'.

Also, you can process the passed elements $a and $b before sorting them. This would prove useful for sorting a list of strings by ignoring the case of the strings involved by converting to lowercase (lc()) before the comparison:

@theSortedList = sort { lc($a) cmp lc($b) } @theList;

Finally, $a and $b can be used as lookups to another data structure. Here is an example of sorting the keys of a hash according to the values these keys are pointing to:

%h = {
    'a' => 40,
    'b' => 45,
    'c' => 7,
    'd' => 100,
    'e' => 2
};

@sortedKeys = sort { $h{$a} <=> $h{$b} } keys %h;
print join(', ', @sortedKeys)."\n";
e, c, a, b, d

The sorting behaviour can also be overidden by a named subroutine rather than an anonymous block of code (OK, 'anonymous subroutine' for the pedantics among you). This is useful for reusing the same comparison logic, and if you give reasonable names to your sorting subroutines, it can also lead to very readable code:

sub backwards { $b cmp $a }

@theList  = qw( z g b aa c );
@theSortedList = sort backwards @theList;

print join(', ', @sortedKeys)."\n";
z, g, c, b, aa

Dissecting custom sort

Before we continue, we have to take a closer look at the mechanism of customising sort. The block of code that is used to compare the elements, is a Perl expression that is required to evaluate to -1, 0 or 1. This is very similar to how the cmp and <=> operators work. According to the Perl manual:

Binary "cmp" returns -1, 0, or 1 depending on whether the left argument is stringwise less than, equal to, or greater than the right argument.

and:

Binary "<=>" returns -1, 0, or 1 depending on whether the left argument is numerically less than, equal to, or greater than the right argument.

This essentially means that (10 <=> 2) evaluates to 1, (4 <=> 4) evaluates to 0, and (5 <=> 7) evaluates to -1. The same applies to strings.

This inherent compatibility of sort and the cmp and <=> operators means that usually the code block that customises sort is just one comparison using one of the two operators, but actually the block can be any Perl expression that evaluates to -1, 0 or 1.

For example, suppose that we have a list where the elements are made up of a letter followed by a digit, and we would like to sort by digit, and in the cases where the digits are equal we would like to sort by the prefixing letter:

1 @theList = qw( f2 b1 o9 a1 c2 g9 d8 );
2 @theSortedList = sort {
3         substr($a, 1, 1) <=> substr($b, 1, 1)
4         ||
5         substr($a, 0, 1) cmp substr($b, 0, 1)
6     } @theList;
7 print join(', ', @theSortedList)."\n";
a1, b1, c2, f2, d8, g9, o9

In this example, the comparison logic is implemented in lines 3, 4 and 5. On line 3, the second characters (the numbers) of each element are extracted using the substr function, and they are compared numerically. If the two numbers are found to be different, line 3 and in fact the whole block of code evaluates to -1 or 1, and the comparison is complete. Because of the or (||) operator on line 4, the comparison continues only if the two numbers are equal, in which case line 3 evaluates to 0. In this case, line 5 is evaluated, which extracts the first characters of the each of the two elements and proceeds to compare them alphanumerically. This example demostrates that any sorting logic can be implemented as long as the result of the comparison is returned as -1, 0 or 1.

Sorting an array of arrays

Suppose you have the following array or arrays:

$table = [
    ["aa", "cc", "b"],
    ["cc", "dd", "ye"],
    ["dd", "cc", "hoho"],
    ["dd", "aa", "tq"],
    ["aa", "cc", "a"]
];

Conceptually, this array is equivalent to the following table:

aaccb
ccddye
ddcchoho
ddaatq
aacca

We would like to sort the array of arrays according to all the columns of the above table, giving priority to the first column, then looking at the second column and finally the third. This kind of sorting can be achieved with the following script:

 1 #!/usr/bin/perl -w
 2 use strict;
 3 use Data::Dumper;
 4      
 5 my $table = [
 6     ["aa", "cc", "b"],
 7     ["cc", "dd", "ye"],
 8     ["dd", "cc", "hoho"],
 9     ["dd", "aa", "tq"],
10     ["aa", "cc", "a"]
11 ];
12      
13 my @sorted_table = sort tableSorter @$table;
14  
15 print Dumper(@sorted_table);
16  
17 sub tableSorter ($$) {
18     my($row1, $row2) = @_;
19     
20     my $column1comparison = $row1->[0] cmp $row2->[0];
21     if($column1comparison != 0){
22         return $column1comparison;
23     }
24     
25     my $column2comparison = $row1->[1] cmp $row2->[1];
26     if($column2comparison != 0){
27         return $column2comparison;
28     }
29  
30     return $row1->[2] cmp $row2->[2];
31 }

On line 14 we sort using the tableSorter subroutine which implements our comparison logic. Note that the prototype of this function has to be $$ (line 18). The two elements are passed to this subroutine by sort on line 19, and these are no others than $a and $b, the special variables we discussed before. In this particular case the elements are references to the rows of our table. The first comparison is performed on lines 21 to 24 and if the sub-elements from column 1, the subroutine proceeds to use columnn 2 and then column 3 for subsequent comparisons.

The module Data::Dumper is being used here (line 16) to produce a summary of the resulting sorted data structure, and it produces the following output:

$VAR1 = [
          'aa',
          'cc',
          'a'
        ];
$VAR2 = [
          'aa',
          'cc',
          'b'
        ];
$VAR3 = [
          'cc',
          'dd',
          'ye'
        ];
$VAR4 = [
          'dd',
          'aa',
          'tq'
        ];
$VAR5 = [
          'dd',
          'cc',
          'hoho'
        ];

Which is conceptually equivalent to the following correctly sorted table:

aacca
aaccb
ccddye
ddaatq
ddcchoho

Now, let's see if we can express the same logic, using less and more concise code. Lines 21 to 31 of the above listing, can be summarised as:

return
    $row1->[0] cmp $row2->[0]
    || $row1->[1] cmp $row2->[1]
    || $row1->[2] cmp $row2->[2];

...and if would like to be extremely brief, we can skip the use of a named function altogether, and do everything on one line by replacing line 14:

my @sorted_table = sort tableSorter @$table;

...with:

my @sorted_table = sort {
        $a->[0] cmp $b->[0]
        || $a->[1] cmp $b->[1]
        || $a->[2] cmp $b->[2];
    } @$table;

Where is the return statement?

One final clarification about sorter subroutines: Why does the above tableSorter subroutine need a return statement, while the anonymous sorter block of code (also a subroutine) does not require a return? They both must return a value, so why the different syntax?

The syntax is in fact different, but it means exactly the same. There are two ways to return a value from a subroutine in Perl: you can either do it explicitly using a return statement, or, if there is no return statement, the value of the last statement of the subroutine is used as the return value. So the following subroutines are exactly equivalent:

sub add {
    my($a, $b) = @_;
    
    return $a + $b;
}
sub add {
    my($a, $b) = @_;
    
    $a + $b;
}

The last statement of the second subroutine is $a + $b and this is what is returned, irrespective of the presence of a return statement. The following sort statement uses a custom subroutine which returns the value of $b <=> $a, since $b <=> $a is the last (and only) statement of the anonymous subroutine:

sort {$b <=> $a} @theList;

The following statement is exactly equivalent (but longer):

sort {return $b <=> $a} @theList;

So the answer is that the syntax is different, but semantically the same: it makes no difference at all. We just tend to use the shorthand that omits the return statement in the case of the anonymous sorter subroutine because it makes the sort line shorter and it fits the overall brevity. On the other hand, when you write full named subroutines, the most common way to return a value is by using the return statement.

Conclusion

As with almost every aspect of Perl, the sort function can be used in various degrees of complexity. With sort you can start small by using it in its most simple form, customise it a bit to suit your needs (for descending sorts etc), or go all the way to implement a sorting logic for more complex data structures.

Some points to remember: Don't forget to do something with the sorted list (e.g. store it somewhere), because sort itself does not affect the original list. Lists are sorted in the "natural" order of data, unless this behaviour is overridden by you. Custom sorter subroutines have to have the $$ prototype and they have to return -1, 0 or 1.

Good luck with sorting!

blog comments powered by Disqus