Perl is excellent at finding patterns in text. It does this with regular expressions, similar to the ones used by grep and awk. Any scalar can be matched against a regular expression with the matching binding operator, =~. For example:

    if ($user =~ /jjohn/) {
        print "I know you";
    }

Without the matching binding operator, regular expressions match against the current value of $_. For example:

    while (<>) {
        if (/quit/i) {
            print "Looks like you want out.\n";
            last;
        }
    }

In this code, each line of input is examined for the character sequence quit. The /i modifier at the end of the regular expression makes the matching case-insensitive (i.e., Quit matches as well as qUIT).
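The same implicit match against $_ can be tried without typing anything on standard input. The following is a minimal sketch: the @lines array and its contents are made up for illustration and stand in for whatever <> would read.

```perl
use strict;
use warnings;

# Hypothetical input lines standing in for what <> would read.
my @lines = ("hello\n", "please QUIT now\n", "never reached\n");

my @seen;
for (@lines) {                # each element is aliased to $_ in turn
    push @seen, $_;
    if (/quit/i) {            # shorthand for: $_ =~ /quit/i
        print "Looks like you want out.\n";
        last;
    }
}

my $count = @seen;
print "Read $count line(s) before stopping.\n";
```

Because of /i, the uppercase QUIT on the second line is enough to end the loop, so the third line is never examined.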
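As with regular expressions in other utilities, Perl looks for the leftmost match for your pattern in a given string. Its quantifiers are greedy by default, so each one takes as much text as it can while still letting the whole pattern succeed; alternation, however, settles for the first alternative that works rather than the longest. Patterns are made up of characters (which normally match themselves) and special metacharacters, including those found in Table 41-8.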
Operator | Description |
---|---|
^ | Pattern must match at the beginning of the line. |
$ | Pattern must match at the end of the line. |
. | Match any character (except the newline). |
\| | Alternation: match the pattern on either the left or right. |
( ) | Group this pattern together as one (good for quantifiers and capturing). |
[ ] | Define a new character class: any of the symbols given can match one character of input (e.g., /[aeiou]/ matches a string with at least one lowercase vowel). |
\w | Match a letter, digit, or underscore. |
\d | Match a digit. |
\s | Match a whitespace character: space, tab, \n, \r. |
* | Match 0 or more consecutive occurrences of pattern. |
+ | Match 1 or more consecutive occurrences of pattern. |
? | Optionally match pattern. |

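A quick way to get a feel for these metacharacters is to run a handful of patterns against sample strings. The sketch below does that; the strings and patterns are invented for illustration, and each comment names the metacharacter being exercised.

```perl
use strict;
use warnings;

# Each entry: [ pattern, sample string ]. Samples are made up for illustration.
my @tests = (
    [ qr/^eth/,    'eth0 is up' ],   # ^ anchors at the start
    [ qr/up$/,     'eth0 is up' ],   # $ anchors at the end
    [ qr/^up/,     'eth0 is up' ],   # fails: "up" is not at the start
    [ qr/e.h/,     'eth0 is up' ],   # . matches any one character
    [ qr/down|up/, 'eth0 is up' ],   # | tries either alternative
    [ qr/(ab)+c/,  'xxababc'    ],   # ( ) groups "ab" for the + quantifier
    [ qr/[aeiou]/, 'rhythm and' ],   # [ ] matches any one listed character
    [ qr/\w\d\s/,  'eth0 is up' ],   # word character, digit, whitespace
    [ qr/colou?r/, 'color'      ],   # ? makes the u optional
    [ qr/ab*c/,    'ac'         ],   # * allows zero b's
);

my @results;
for my $t (@tests) {
    my ($re, $str) = @$t;
    my $ok = $str =~ $re ? 1 : 0;
    push @results, $ok;
    printf "%-16s %-14s %s\n", $re, "'$str'", $ok ? 'match' : 'no match';
}
```

Every pattern here matches its sample except the third, which is included to show an anchor rejecting a string that contains the right characters in the wrong place.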
A very common use of regular expressions is extracting specific information from a line of text. Suppose you wanted to get the first dotted quad that appears in the output of this ifconfig command:

    $ ifconfig eth0
    eth0      Link encap:Ethernet  HWaddr 00:A0:76:C0:1A:E1
              inet addr:192.168.1.50  Bcast:192.168.1.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:365079 errors:0 dropped:0 overruns:0 frame:0
              TX packets:426050 errors:0 dropped:0 overruns:0 carrier:0
              collisions:3844 txqueuelen:100
              Interrupt:9 Base address:0x300

The output of a command can be captured into an array using the backtick operator. When the result is assigned to an array, each line of the command's output becomes one element. One way to extract the IP address from that output is with the following code:

    my @ifconfig = `/sbin/ifconfig eth0`;
    for (@ifconfig) {
        if ( /(\d+\.\d+\.\d+\.\d+)/ ) {
            print "Quad: $1\n";
            last;
        }
    }

This regular expression looks for one or more digits (\d+) followed by a literal dot (\., escaped so that it is not read as the metacharacter that matches any character), followed by two more digit/dot pairs, followed by one or more digits. If this pattern is found in the current line, the part that was matched is captured (thanks to the parentheses) into the special variable $1. You can capture more parts of a match by adding more parentheses. Each captured substring lands in the next numbered scalar, counting opening parentheses from the left (i.e., the second pair of parentheses fills $2).
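To see several captures at once, the inet addr line from the ifconfig output can be matched with three pairs of parentheses. This is only a sketch: the $quad helper pattern and the variable names are introduced here for illustration, and the line is pasted in as a string so that the example runs without ifconfig.

```perl
use strict;
use warnings;

# The "inet addr" line from the ifconfig output, pasted in as a string.
my $line = 'inet addr:192.168.1.50  Bcast:192.168.1.255  Mask:255.255.255.0';

# Hypothetical helper: one dotted quad, precompiled with qr// for reuse.
my $quad = qr/\d+\.\d+\.\d+\.\d+/;

my ($addr, $bcast, $mask);
if ($line =~ /addr:($quad)\s+Bcast:($quad)\s+Mask:($quad)/) {
    # $1, $2 and $3 follow the order of the opening parentheses.
    ($addr, $bcast, $mask) = ($1, $2, $3);
    print "Address:   $addr\n";
    print "Broadcast: $bcast\n";
    print "Netmask:   $mask\n";
}
```

Copying $1, $2 and $3 into named variables right away is a common habit, since the next successful match overwrites them.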
Sometimes, you need to find all the matches for your pattern in a given string. This can be done with the /g regular expression modifier. If you wanted to find all the dotted quads in the ifconfig output, you could use the following code:

    my @ifconfig = `/sbin/ifconfig eth0`;
    for (@ifconfig) {
        while ( /(\d+\.\d+\.\d+\.\d+)/g ) {
            print "Quad: $1\n";
        }
    }

Here, the if block is replaced with a while loop, which is what lets /g do its work: each time the loop condition is evaluated, the match resumes just past the point where the previous one ended. If the current line has something that looks like a dotted quad, that value is captured in $1, just as before, and the loop repeats until no further match is found in the line.
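The position that /g remembers can be inspected with Perl's built-in pos function, and in list context a /g match returns all of the captured substrings in one go. The sketch below shows both against a pasted-in copy of the inet addr line; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

# The "inet addr" line from the ifconfig output, pasted in as a string.
my $line = 'inet addr:192.168.1.50  Bcast:192.168.1.255  Mask:255.255.255.0';

# Scalar context: each pass resumes where the previous match ended.
my @ends;
while ($line =~ /(\d+\.\d+\.\d+\.\d+)/g) {
    push @ends, pos($line);
    print "Quad: $1 (match ended at offset ", pos($line), ")\n";
}

# List context: /g returns every captured substring at once.
# The failed final match above reset the position, so this starts over.
my @quads = $line =~ /(\d+\.\d+\.\d+\.\d+)/g;
print "Found ", scalar(@quads), " quads: @quads\n";
```

The list-context form is handy when you only need the values; the while loop is the better fit when each match needs its own processing.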
Perl's regular expression support has set the standard for other languages. The subject is far too large to cover comprehensively here; for more, see O'Reilly's Mastering Regular Expressions or the perlre manpage.
Copyright © 2003 O'Reilly & Associates. All rights reserved.