Learning Perl

Chapter 7
Regular Expressions

7.3 Patterns

A regular expression is a pattern. Some parts of the pattern match a single character of a particular type in the string. Other parts of the pattern match multiple characters. First, we'll visit the single-character patterns and then the multiple-character patterns.

7.3.1 Single-Character Patterns

The simplest and most common pattern-matching character in regular expressions is a single character that matches itself. In other words, putting a letter a in a regular expression requires a corresponding letter a in the string.

The next most common pattern-matching character is the dot ".". This matches any single character except newline (\n). For example, the pattern /a./ matches any two-character sequence that starts with a and is not "a\n".
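As a minimal sketch of the dot's behavior, the following script tries /a./ against a few strings. The helper name dot_check and the sample strings are illustrative assumptions, not part of the original text.

```perl
use strict;
use warnings;

# Report whether /a./ finds an a followed by any non-newline character.
# dot_check is a hypothetical helper used only for this illustration.
sub dot_check {
    my ($string) = @_;
    return $string =~ /a./ ? "match" : "no match";
}

print dot_check("ab"),  "\n";   # a followed by b
print dot_check("cat"), "\n";   # a followed by t, in mid-string
print dot_check("a\n"), "\n";   # only a newline follows the a
print dot_check("a"),   "\n";   # nothing follows the a
```

The first two strings match; the last two do not, because the dot needs a character after the a and refuses a newline.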

A pattern-matching character class is represented by a pair of open and close square brackets and a list of characters between the brackets. One and only one of these characters must be present at the corresponding part of the string for the pattern to match. For example,

/[abcde]/

matches a string containing any one of the first five letters of the lowercase alphabet, while

/[aeiouAEIOU]/

matches any of the five vowels in either lower- or uppercase. If you want to put a right bracket (]) in the list, put a backslash in front of it, or put it as the first character within the list. Ranges of characters (like a through z) can be abbreviated by showing the end points of the range separated by a dash (-); to get a literal dash in the list, precede the dash with a backslash or place it at the end. Here are some other examples:

[0123456789]    # match any single digit
[0-9]           # same thing
[0-9\-]         # match 0-9, or minus
[a-z0-9]        # match any single lowercase letter or digit
[a-zA-Z0-9_]    # match any single letter, digit, or underscore

There's also a negated character class, which is the same as a character class, but has a leading up-arrow (or caret: ^) immediately after the left bracket. This character class matches any single character that is not in the list. For example:

[^0-9]        # match any single non-digit
[^aeiouAEIOU] # match any single non-vowel
[^\^]         # match any single character except an up-arrow
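The classes above can be tried on single-character strings, as in this small sketch. The helper classes_for and its labels are assumptions made up for the example.

```perl
use strict;
use warnings;

# Return the labels of the character classes that a one-character
# string matches. classes_for is a hypothetical helper.
sub classes_for {
    my ($char) = @_;
    my @hits;
    push @hits, "digit-or-minus" if $char =~ /[0-9\-]/;
    push @hits, "non-digit"      if $char =~ /[^0-9]/;
    push @hits, "non-vowel"      if $char =~ /[^aeiouAEIOU]/;
    push @hits, "not-caret"      if $char =~ /[^\^]/;
    return @hits;
}

for my $char ("7", "-", "e", "^") {
    print "'$char': ", join(", ", classes_for($char)), "\n";
}
```

Note that a negated class still has to match one character: it is "some character that is not in the list," not "nothing here."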

For your convenience, some common character classes are predefined, as described in Table 7.1.


Table 7.1: Predefined Character Class Abbreviations

Construct         Equivalent Class   Negated Construct   Equivalent Negated Class
\d (a digit)      [0-9]              \D (digits, not!)   [^0-9]
\w (word char)    [a-zA-Z0-9_]       \W (words, not!)    [^a-zA-Z0-9_]
\s (space char)   [ \r\t\n\f]        \S (space, not!)    [^ \r\t\n\f]

The \d pattern matches one "digit." The \w pattern matches one "word character," although what it is really matching is any character that is legal in a Perl variable name. The \s pattern matches one "space" (whitespace), here defined as spaces, carriage returns (not often used in UNIX), tabs, line feeds, and form feeds. The uppercase versions match the complements of these classes. Thus, \W matches one character that can't be in an identifier, \S matches one character that is not whitespace (including letters, punctuation, control characters, and so on), and \D matches any single nondigit character.

These abbreviated classes can be used as part of other character classes as well:

[\da-fA-F] # match one hex digit
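Here is a short sketch of the predefined classes applied to plain ASCII characters. The helper names describe and is_hex_digit are assumptions for illustration; on non-ASCII input, newer versions of Perl may match more characters than the ASCII classes listed in Table 7.1.

```perl
use strict;
use warnings;

# Classify a one-character ASCII string with the predefined classes.
# \d is tested before \w because every digit is also a word character.
sub describe {
    my ($char) = @_;
    return "digit"          if $char =~ /\d/;
    return "word character" if $char =~ /\w/;
    return "whitespace"     if $char =~ /\s/;
    return "other";
}

# An abbreviation such as \d can sit inside a larger character class.
sub is_hex_digit {
    my ($char) = @_;
    return $char =~ /[\da-fA-F]/ ? 1 : 0;
}

print describe($_), "\n" for ("7", "_", "\t", "!");
print is_hex_digit($_), "\n" for ("F", "g");
```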

7.3.2 Grouping Patterns

The true power of regular expressions comes into play when you can say "one or more of these" or "up to five of those." Let's talk about how this is done.

7.3.2.1 Sequence

The first (and probably least obvious) grouping pattern is sequence. This means that abc matches an a followed by a b followed by a c. Seems simple, but we're giving it a name so we can talk about it later.

7.3.2.2 Multipliers

We've already seen the asterisk (*) as a grouping pattern. The asterisk indicates zero or more of the immediately previous character (or character class).

Two other grouping patterns that work like this are the plus sign (+), meaning one or more of the immediately previous character, and the question mark (?), meaning zero or one of the immediately previous character. For example, the regular expression /fo+ba?r/ matches an f followed by one or more o's followed by a b, followed by an optional a, followed by an r.
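To see + and ? at work, this sketch runs /fo+ba?r/ against a handful of made-up strings. The helper name matches_fobar and the word list are assumptions for the example.

```perl
use strict;
use warnings;

# True (1) if the string contains f, one or more o's, b, an optional a, and r.
sub matches_fobar {
    my ($string) = @_;
    return $string =~ /fo+ba?r/ ? 1 : 0;
}

for my $word (qw(fobar foobr foooobar fbar fobaar)) {
    printf "%-9s %s\n", $word, matches_fobar($word) ? "matches" : "does not match";
}
```

The string fbar fails because + demands at least one o, and fobaar fails because ? allows at most one a.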

In all three of these grouping patterns, the patterns are greedy. If such a multiplier has a chance to match between five and ten characters, it'll pick the 10-character string every time. For example,

$_ = "fred xxxxxxxxxx barney";
s/x+/boom/;

always replaces all consecutive x's with boom (resulting in fred boom barney), rather than just one or two x's, even though a shorter set of x's would also match the same regular expression.

If you need to say "five to ten" x's, you could get away with putting five x's followed by five x's each immediately followed by a question mark. But this looks ugly. Instead, there's an easier way: the general multiplier. The general multiplier consists of a pair of matching curly braces with one or two numbers inside, as in /x{5,10}/. The immediately preceding character (in this case, the letter "x") must be found within the indicated number of repetitions (five through ten here).[1]

[1] Of course, /\d{3}/ doesn't only match three-digit numbers. It would also match any number with more than three digits in it. To match exactly three, you need to use anchors, described later in Section 7.3.3, "Anchoring Patterns."

If you leave off the second number, as in /x{5,}/, it means "that many or more" (five or more in this case), and if you leave off the comma, as in /x{5}/, it means "exactly this many" (five x's). To get five or fewer x's, you must put the zero in, as in /x{0,5}/.

So, the regular expression /a.{5}b/ matches the letter a separated from the letter b by any five non-newline characters at any point in the string. (Recall that a period matches any single non-newline character, and we're matching five here.) The five characters do not need to be the same. (We'll learn how to force them to be the same in the next section.)
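The following sketch contrasts the three forms of the general multiplier on a string of twelve x's, and then checks /a.{5}b/. The variable names and sample strings are assumptions chosen for the illustration.

```perl
use strict;
use warnings;

my $text = "fred " . ("x" x 12) . " barney";   # twelve consecutive x's

# Copy the string, then substitute in the copy.
(my $capped = $text) =~ s/x{5,10}/boom/;   # five to ten: takes ten, leaves two
(my $open   = $text) =~ s/x{5,}/boom/;     # five or more: takes all twelve
(my $exact  = $text) =~ s/x{5}/boom/;      # exactly five: leaves seven

print "$capped\n$open\n$exact\n";

# /a.{5}b/ needs exactly five non-newline characters between a and b.
my $five = "a12345b" =~ /a.{5}b/ ? 1 : 0;
my $four = "a1234b"  =~ /a.{5}b/ ? 1 : 0;
print "five between: $five, four between: $four\n";
```

Because the patterns are not anchored, /x{5,10}/ still succeeds on twelve x's; it simply stops after ten.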

We could dispense with *, +, and ? entirely, since they are completely equivalent to {0,}, {1,}, and {0,1}. But it's easier to type the equivalent single punctuation character, and more familiar as well.

If two multipliers occur in a single expression, the greedy rule is augmented with "leftmost is greediest." For example:

$_ = "a xxx c xxxxxxxx c xxx d";
/a.*c.*d/;

In this case, the first ".*" in the regular expression matches all characters up to the second c, even though matching only the characters up to the first c would still allow the entire regular expression to match. Right now, this doesn't make any difference (the pattern would match either way), but later when we can look at parts of the regular expression that matched, it'll matter quite a bit.

We can force any multiplier to be nongreedy (or lazy) by following it with a question mark:

$_ = "a xxx c xxxxxxxx c xxx d";
/a.*?c.*d/;

Here, the a.*?c now matches the fewest characters between the a and c, not the most characters. This means the leftmost c is matched, not the rightmost. You can put such a question-mark modifier after any of the multipliers (?, +, *, and {m,n}).
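To make the difference visible, this sketch wraps the first multiplier in parentheses and reads back what it matched. Parentheses as memory are covered in the next section, and reading the captured text through a list assignment is an assumption that goes slightly beyond what has been introduced so far.

```perl
use strict;
use warnings;

my $string = "a xxx c xxxxxxxx c xxx d";

# In list context, a match returns the text remembered by the parentheses.
my ($greedy) = $string =~ /a(.*)c.*d/;    # runs up to the last c
my ($lazy)   = $string =~ /a(.*?)c.*d/;   # stops at the first c

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

Both patterns match the same overall string; only the division of labor between the two multipliers changes.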

What if the string and regular expression were slightly altered, say, to:

$_ = "a xxx ce xxxxxxxx ci xxx d";
/a.*ce.*d/;

In this case, if the .* matches the most characters possible before the next c, the next regular expression character (e) doesn't match the next character of the string (i). We then get automatic backtracking: the multiplier is unwound and retried, stopping somewhere earlier (here, at the earlier c, next to the e).[2] A complex regular expression may involve many such levels of backtracking, leading to long execution times. Here, making the match lazy (with a trailing "?") would actually reduce the work that Perl has to perform, so you may want to consider that.

[2] Well, technically there was a lot of backtracking of the * operator to find the c's in the first place. But that's a little trickier to describe, and it works on the same principle.
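As a sketch of the outcome of that backtracking, the script below captures what the first multiplier ends up matching in both the greedy and the lazy form. The capturing parentheses and list assignment are assumptions borrowed from later sections; the amount of work Perl does internally is not shown, only the result.

```perl
use strict;
use warnings;

my $string = "a xxx ce xxxxxxxx ci xxx d";

# Greedy: .* first overshoots, then backs up past "ci" until it finds "ce".
my ($greedy) = $string =~ /a(.*)ce.*d/;

# Lazy: .*? grows one character at a time and reaches "ce" directly.
my ($lazy) = $string =~ /a(.*?)ce.*d/;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

Both forms settle on the same text here, because only one ce exists in the string; they differ only in how they get there.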

7.3.2.3 Parentheses as memory

Another grouping operator is a pair of open and close parentheses around any part of a pattern. This doesn't change whether the pattern matches, but instead causes the part of the string matched by the pattern to be remembered, so that it may be referenced later. So for example, (a) still matches an a, and ([a-z]) still matches any single lowercase letter.

To recall a memorized part of a string, you must include a backslash followed by an integer. This pattern construct represents the same sequence of characters matched earlier in the same-numbered pair of parentheses (counting from one). For example,

/fred(.)barney\1/;

matches a string consisting of fred, followed by any single non-newline character, followed by barney, followed by that same single character. So, it matches fredxbarneyx, but not fredxbarneyy. Compare that with

/fred.barney./;

in which the two unspecified characters can be the same, or different; it doesn't matter.

Where did the 1 come from? It means the first parenthesized part of the regular expression. If there's more than one, the second part (counting the left parentheses from left to right) is referenced as \2, the third as \3, and so on. For example,

/a(.)b(.)c\2d\1/;

matches an a, a character (call it #1), a b, another character (call it #2), a c, the character #2, a d, and the character #1. So it matches axbycydx, for example.

The referenced part can be more than a single character. For example,

/a(.*)b\1c/;

matches an a, followed by any number of characters (even zero) followed by b, followed by that same sequence of characters followed by c. So, it would match aFREDbFREDc, or even abc, but not aXXbXXXc.
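The examples in this section can be checked with a small table-driven sketch. The qr// operator, which stores a compiled pattern in a variable, is not covered in this chapter and is used here only as a convenience; the @cases and @results names are assumptions.

```perl
use strict;
use warnings;

# Each entry pairs a pattern with a string to try it against.
my @cases = (
    [ qr/fred(.)barney\1/, "fredxbarneyx" ],   # same character both times
    [ qr/fred(.)barney\1/, "fredxbarneyy" ],   # second character differs
    [ qr/a(.)b(.)c\2d\1/,  "axbycydx"     ],   # \2 is y, \1 is x
    [ qr/a(.*)b\1c/,       "aFREDbFREDc"  ],   # multi-character memory
    [ qr/a(.*)b\1c/,       "abc"          ],   # memory is the empty string
    [ qr/a(.*)b\1c/,       "aXXbXXXc"     ],   # XX and XXX differ
);

# 1 where the pattern matches the string, 0 where it does not.
my @results = map { $_->[1] =~ $_->[0] ? 1 : 0 } @cases;

for my $i (0 .. $#cases) {
    printf "%-14s %s\n", $cases[$i][1], $results[$i] ? "matches" : "does not match";
}
```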

7.3.2.4 Alternation

Another grouping construct is alternation, as in a|b|c. This means to match exactly one of the alternatives (a or b or c in this case). This works even if the alternatives have multiple characters, as in /song|blue/, which matches either song or blue. (For single character alternatives, you're definitely better off with a character class like /[abc]/.)

What if we wanted to match songbird or bluebird? We could write /songbird|bluebird/, but that bird part shouldn't have to be in there twice. In fact, there's a way out, but we have to talk about the precedence of grouping patterns, which is covered in Section 7.3.4, "Precedence," below.

7.3.3 Anchoring Patterns

Several special notations anchor a pattern. Normally, when a pattern is matched against the string, the beginning of the pattern is dragged through the string from left to right, matching at the first possible opportunity. Anchors allow you to ensure that parts of the pattern line up with particular parts of the string.

The first pair of anchors require that a particular part of the match be located either at a word boundary or not at a word boundary. The \b anchor requires a word boundary at the indicated point for the pattern to match. A word boundary is the place between characters that match \w and \W, or between characters matching \w and the beginning or ending of the string. Note that this has little to do with English words and a lot more to do with C symbols, but that's as close as we get. For example:

/fred\b/;     # matches fred, but not frederick
/\bmo/;       # matches moe and mole, but not Elmo
/\bFred\b/;   # matches Fred but not Frederick or alFred
/\b\+\b/;     # matches "x+y" but not "++" or " + "
/abc\bdef/;   # never matches (impossible for a boundary there)

Likewise, \B requires that there not be a word boundary at the indicated point. For example:

/\bFred\B/; # matches "Frederick" but not "Fred Flintstone"
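The boundary examples above can be verified with a sketch like this one. As before, qr// is used only to keep patterns in a list, and the @cases and @results names are assumptions.

```perl
use strict;
use warnings;

# Each entry pairs a pattern with a string to try it against.
my @cases = (
    [ qr/fred\b/,   "fred"            ],   # boundary at end of string
    [ qr/fred\b/,   "frederick"       ],   # e follows, so no boundary
    [ qr/\bmo/,     "mole"            ],   # boundary at start of string
    [ qr/\bmo/,     "Elmo"            ],   # l precedes, so no boundary
    [ qr/\b\+\b/,   "x+y"             ],   # word characters on both sides
    [ qr/\b\+\b/,   "++"              ],   # no word character nearby
    [ qr/\b\+\b/,   " + "             ],   # spaces are not word characters
    [ qr/\bFred\B/, "Frederick"       ],   # e follows, so \B succeeds
    [ qr/\bFred\B/, "Fred Flintstone" ],   # space follows, so \B fails
);

# 1 where the pattern matches the string, 0 where it does not.
my @results = map { $_->[1] =~ $_->[0] ? 1 : 0 } @cases;

for my $i (0 .. $#cases) {
    printf "%-18s %s\n", "'$cases[$i][1]'", $results[$i] ? "matches" : "does not match";
}
```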

Two more anchors require that a particular part of the pattern be next to an end of the string. The caret (^) matches the beginning of the string. For example, ^a matches an a if, and only if, the a is the first character of the string. A caret anywhere else in the pattern (outside a character class) is still an anchor in Perl, not a literal character, so a^ does not match the two characters a and ^. If you need a literal caret, at the beginning or anywhere else, put a backslash in front of it.

The $, like the ^, anchors the pattern, but to the end of the string, not the beginning. In other words, c$ matches a c only if it occurs at the end of the string.[3] A dollar sign anywhere else in the pattern is probably going to be interpreted as scalar variable interpolation, so you'll most likely need to backslash it to match a literal dollar sign in the string.

[3] Or just before the newline at the end of the string, for historical simplicity.
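Here is a brief sketch of ^ and $, including the trailing-newline case from the footnote and the exact three-digit match promised in the earlier footnote on /\d{3}/. The sample strings and the @cases and @results names are assumptions; qr// is again just a way to store patterns.

```perl
use strict;
use warnings;

# Each entry pairs a pattern with a string to try it against.
my @cases = (
    [ qr/^a/,       "apple"    ],   # a is the first character
    [ qr/^a/,       "banana"   ],   # a's exist, but not at the start
    [ qr/c$/,       "abc"      ],   # c is the last character
    [ qr/c$/,       "abc\n"    ],   # c just before the final newline
    [ qr/c$/,       "abcd"     ],   # c is not at the end
    [ qr/^\d{3}$/,  "123"      ],   # exactly three digits
    [ qr/^\d{3}$/,  "1234"     ],   # four digits no longer fit
    [ qr/\$5/,      'cost: $5' ],   # backslashed dollar sign is literal
);

# 1 where the pattern matches the string, 0 where it does not.
my @results = map { $_->[1] =~ $_->[0] ? 1 : 0 } @cases;

print join(" ", @results), "\n";
```

The last string is single-quoted so that Perl does not try to interpolate $5 into it.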

Other anchors are supported, including \A, \Z, and lookahead anchors created via (?=...) and (?!...). These are described fully in Chapter 2 of Programming Perl and the perlre (1) manpage.

7.3.4 Precedence

So what happens when we get a|b* together? Is this a or b any number of times, or is it either a single a or any number of b's?

Well, just as operators have precedence, the grouping and anchoring patterns also have precedence. The precedence of patterns from highest to lowest is given in Table 7.2.


Table 7.2: regex Grouping Precedence [4]

Name                     Representation
Parentheses              ( ) (?: )
Multipliers              ? + * {m,n} ?? +? *? {m,n}?
Sequence and anchoring   abc ^ $ \A \Z (?= ) (?! )
Alternation              |

[4] Some of these symbols are not described in this book. See Programming Perl or perlre (1) for details.

According to the table, * has a higher precedence than |. So /a|b*/ is interpreted as a single a, or any number of b's.

What if we want the other meaning, as in "any number of a's or b's"? We simply throw in a pair of parentheses. In this case, enclose the part of the expression that the * operator should apply to inside parentheses, and we've got it, as (a|b)*. If you want to clarify the first expression, you can redundantly parenthesize it with a|(b*).

When you use parentheses to affect precedence they also trigger the memory, as shown earlier in this chapter. That is, this set of parentheses counts when you are figuring out whether something is \2, \3, or whatever. If you want to use parentheses without triggering memory, use the form (?:...) instead of (...). This still allows for multipliers, but doesn't throw off your counting by using up \4 or whatever. For example, /(?:Fred|Wilma) Flintstone/ does not store anything into \1; it's just there for grouping.
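The effect on numbering can be seen in a two-line sketch. The patterns and the string acc are made up for the illustration.

```perl
use strict;
use warnings;

# With ordinary parentheses, (a|b) is memory 1, so \1 demands another a.
my $capturing = "acc" =~ /(a|b)(c)\1/ ? 1 : 0;

# With (?:...), the first group is not counted, so (c) is memory 1
# and \1 demands another c.
my $noncapturing = "acc" =~ /(?:a|b)(c)\1/ ? 1 : 0;

print "capturing: $capturing, noncapturing: $noncapturing\n";
```

Only the second pattern matches acc; the first would need a string such as aca.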

Here are some other examples of regular expressions and the effect of parentheses:

abc*             # matches ab, abc, abcc, abccc, abcccc, and so on
(abc)*           # matches "", abc, abcabc, abcabcabc, and so on
^x|y             # matches x at the beginning of line, or y anywhere
^(x|y)           # matches either x or y at the beginning of a line
a|bc|d           # a, or bc, or d
(a|b)(c|d)       # ac, ad, bc, or bd
(song|blue)bird  # songbird or bluebird
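A few of these can be confirmed with the sketch below. Anchors are added to the first three patterns so that the whole string must match, which is an assumption layered on top of the examples above; the sample strings and the @cases and @results names are likewise made up.

```perl
use strict;
use warnings;

# Each entry pairs a pattern with a string to try it against.
my @cases = (
    [ qr/^abc*$/,         "abccc"    ],   # * applies to the c only
    [ qr/^abc*$/,         "abcabc"   ],   # so abc cannot repeat
    [ qr/^(abc)*$/,       "abcabc"   ],   # parentheses make abc repeat
    [ qr/^x|y/,           "say"      ],   # y anywhere is enough
    [ qr/^(x|y)/,         "say"      ],   # x or y must come first
    [ qr/(song|blue)bird/, "bluebird" ],  # bird follows either alternative
    [ qr/song|bluebird/,  "songfest" ],   # without parentheses, song alone suffices
);

# 1 where the pattern matches the string, 0 where it does not.
my @results = map { $_->[1] =~ $_->[0] ? 1 : 0 } @cases;

print join(" ", @results), "\n";
```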

