Regular Expressions
Regular expressions are too large a topic to introduce here, but make sure that you understand these concepts. For tutorials, see perlrequick or perlretut. For the definitive documentation, see perlre.
Matches and replacements return a quantity.
The s/// operator returns the number of replacements it made.
In scalar context, m// returns true if the pattern matched and false if it did not.
You can either use the returned value directly,
or check it for truth.
if ( $str =~ /Diggle|Shelley/ ) {
    print "We found Pete or Steve!\n";
}
if ( my $n = ($str =~ s/this/that/g) ) {
    print qq{Replaced $n occurrence(s) of "this"\n};
}
Don't use capture variables without checking that the match succeeded.
The capture variables, $1, etc., are not valid unless the match succeeded, and a failed match does not clear them either: they keep whatever the last successful match stored.
# BAD: Not checked, but at least it "works".
my $str = 'Perl 101 rocks.';
$str =~ /(\d+)/;
print "Number: $1"; # Prints "Number: 101";
# WORSE: Not checked, and the result is not what you'd expect
$str =~ /(Python|Ruby)/;
print "Language: $1"; # Prints "Language: 101";
Instead, you must check the return value from the match:
# GOOD: Check the results
my $str = 'Perl 101 rocks.';
if ( $str =~ /(\d+)/ ) {
    print "Number: $1"; # Prints "Number: 101";
}
if ( $str =~ /(Python|Ruby)/ ) {
    print "Language: $1"; # Never gets here
}
In list context, m// returns a list of the captured substrings; with /g it returns every match in the string.
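As a minimal sketch of the list-context behaviour noted above, the following shows the three common forms. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = 'Perl 101 rocks since 1987.';

# With captures and no /g: one list element per capture group.
my ($word, $number) = $str =~ /(\w+)\s+(\d+)/;
print "$word $number\n";    # prints "Perl 101"

# With /g: one element for every match of the group.
my @numbers = $str =~ /(\d+)/g;
print "@numbers\n";         # prints "101 1987"

# Assigning to an empty list and reading that in scalar context
# counts the matches.
my $count = () = $str =~ /\d+/g;
print "$count\n";           # prints "2"
```

If the match fails, the list is empty, so `my ($word) = ...` leaves $word undefined rather than holding a stale value.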
Common match flags
- /i - case insensitive match
- /g - match multiple times
    $var = "match match match";
    while ($var =~ /match/g) { $a++; }
    print "$a\n"; # prints 3
    $a = 0;
    $a++ foreach ($var =~ /match/g);
    print "$a\n"; # prints 3
- /m - ^ and $ change meaning
  - Ordinarily, ^ means "start of string" and $, "end of string"
  - /m makes them mean start and end of line, respectively
      $str = "one\ntwo\nthree";
      @a = $str =~ /^\w+/g; # @a = ("one");
      @b = $str =~ /^\w+/gm; # @b = ("one","two","three")
  - Use \A and \z for start and end of string regardless of /m
  - \Z is the same as \z except it will ignore a final newline
- /s - . also matches newline
    $str = "one\ntwo\nthree\n";
    $str =~ /^(.{8})/s;
    print $1; # prints "one\ntwo\n"
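The difference between the anchors is easier to see side by side. This is a small sketch with a made-up two-line string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = "one\ntwo\n";

# \A is always start of string, even under /m.
my $with_A   = $str =~ /\Atwo/m ? 'yes' : 'no';
# ^ under /m also matches just after a newline.
my $with_hat = $str =~ /^two/m  ? 'yes' : 'no';
# \z is the very end of the string, after the final newline.
my $with_z   = $str =~ /two\z/  ? 'yes' : 'no';
# \Z also matches just before a final newline.
my $with_Z   = $str =~ /two\Z/  ? 'yes' : 'no';

print "\\A: $with_A, ^ with /m: $with_hat, \\z: $with_z, \\Z: $with_Z\n";
# prints "\A: no, ^ with /m: yes, \z: no, \Z: yes"
```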
Capture variables $1 and friends
- The text matched by each set of capturing parentheses is stored in a numbered variable
- Parentheses are numbered left to right, in the order of their opening parenthesis:
    my $str = "abc";
    $str =~ /(((a)(b))(c))/;
    print "1: $1 2: $2 3: $3 4: $4 5: $5\n";
    # prints: 1: abc 2: ab 3: a 4: b 5: c
- No upper limit on the number of capturing parentheses and variables
Avoid capture with ?:
- If a parenthesis is followed by ?:, the group will not be captured
- Useful if you don't want the matches to be saved
    my $str = "abc";
    $str =~ /(?:a(b)c)/;
    print "$1\n"; # prints "b"
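A common use of ?: is grouping an alternation without shifting the capture numbers. The sketch below assumes a made-up input string; the names are illustrative only.

```perl
use strict;
use warnings;

my $str = 'release-2.7';

# (?:...) groups the alternation but does not capture, so the
# first capturing group is still the major number.
my ($major, $minor) = $str =~ /(?:version|release)-(\d+)\.(\d+)/;

print "major $major, minor $minor\n";    # prints "major 2, minor 7"
```

Had the alternation used plain parentheses, the word "release" would have landed in $1 and the numbers in $2 and $3.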
Allow easier reading with the /x switch
- If you're doing something tricky with a regex, comment it.
- You can do this with the /x flag.
This ugly behemoth
    my ($num) = $ARGV[0] =~ m/^\+?((?:(?<!\+)-)?(?:\d*\.)?\d+)$/x;
is more readable with whitespace and comments, as allowed by the /x flag.
    my ($num) = $ARGV[0] =~ m/^
        \+?             # An optional plus sign, to be discarded
        (               # Capture...
        (?:(?<!\+)-)?   # a negative sign, if there's no plus behind it,
        (?:\d*\.)?      # an optional number, followed by a point if a decimal,
        \d+             # then one or more digits.
    )$/x;
- Under /x, whitespace and comments in the pattern are stripped unless escaped.
Automatically quote your regexes with \Q and \E
- Automatically escapes regex metacharacters
- Won't escape dollar signs (variables inside \Q...\E are still interpolated, then quoted)
    my $num = '3.1415';
    print "ok 1\n" if $num =~ /\Q3.14\E/;
    $num = '3X1415';
    print "ok 2\n" if $num =~ /\Q3.14\E/;
    print "ok 3\n" if $num =~ /3.14/;
prints
    ok 1
    ok 3
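In practice \Q is most useful around an interpolated variable whose contents should be matched literally. The sketch below assumes a hypothetical $needle holding user-supplied text; the names are illustrative only.

```perl
use strict;
use warnings;

my $needle     = '3.14';                  # hypothetical user-supplied text
my @candidates = ('3.1415', '3X1415');

# \Q...\E quotes the interpolated value, so "." matches only a dot.
my @literal = grep { /\Q$needle\E/ } @candidates;

# Without quoting, "." in $needle matches any character.
my @loose = grep { /$needle/ } @candidates;

print "literal: @literal\n";    # prints "literal: 3.1415"
print "loose: @loose\n";        # prints "loose: 3.1415 3X1415"
```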
Execute code with /e flag to s///
- The replacement side is evaluated as Perl code, and its result is used as the replacement string
    my $str = "AbCdE\n";
    $str =~ s/(\w)/lc $1/eg;
    print $str; # prints "abcde"
- Use $1 and friends if necessary
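Because the replacement is ordinary Perl code, it can compute with the captured text. Here is a small sketch with a made-up string that doubles every number it finds.

```perl
use strict;
use warnings;

my $str = '3 apples and 12 pears';

# Copy $str, then run the substitution on the copy.
# With /e, "$1 * 2" is evaluated as an expression for each match.
(my $doubled = $str) =~ s/(\d+)/$1 * 2/eg;

print "$doubled\n";    # prints "6 apples and 24 pears"
```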
Know when to use study
study is not helpful in the vast majority of cases, and as of Perl 5.16 it is a no-op. In older perls, all it does is make a table of where the first occurrence of each of 256 bytes is in the string. This means that if you have a 1,000-character string, and you search for lots of strings that begin with a constant character, then the matcher can jump right to it. For example:
"This is a very long [... 900 characters skipped...] string that I have here, ending at position 1000"
Now, if you match this against the regex /Icky/, the matcher first looks for a letter "I" where the match could start. Without study, that may mean scanning through the first 900+ characters to reach it. What study does is build a table of the 256 possible bytes and where they first appear, so that in this case the scanner can jump right to that position and start matching.
Handle multi-line regexes
Use re => debug
-Mre=debug
