Regular expression operations
This module provides regular expression matching operations similar to those found in Perl.
Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
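As a minimal sketch of the type rule above (the sample strings and variable names are made up for illustration), matching works when pattern and string share a type, and mixing them raises TypeError:

```python
import re

text_str = "spam and eggs"      # illustrative Unicode string
text_bytes = b"spam and eggs"   # illustrative 8-bit string

# Matching types work: str pattern on str, bytes pattern on bytes.
str_match = re.search(r"eggs", text_str)
bytes_match = re.search(rb"eggs", text_bytes)

# Mixing a bytes pattern with a str subject raises TypeError.
try:
    re.search(rb"eggs", text_str)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# In a substitution, the replacement shares the type of the pattern
# and the searched string (all bytes here).
replaced = re.sub(rb"eggs", b"ham", text_bytes)
```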
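Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.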
The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
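The difference between raw and ordinary string literals can be checked directly. The following is a small sketch; the sample path and variable names are illustrative only:

```python
import re

raw = r"\n"      # two characters: a backslash followed by 'n'
cooked = "\n"    # one character: a newline
lengths = (len(raw), len(cooked))

# The raw-string pattern r"\\" is the regular expression \\,
# which matches a single literal backslash.
path = "C:\\temp"                   # the seven-character string C:\temp
backslash = re.search(r"\\", path)

# Without the raw prefix, the same pattern needs four backslashes.
same_pattern = (r"\\" == "\\\\")
```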
Note: Most regular expression operations are available both as module-level functions and as methods on compiled regular expressions. The functions are shortcuts that don't require you to compile a regex object first, but they lack some fine-tuning parameters.
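A brief sketch of the two styles follows. The sample text is invented; the pos argument of the compiled pattern's search() method is one example of a parameter that the module-level function does not accept:

```python
import re

text = "order 66, item 7"   # illustrative input

# Module-level shortcut: the pattern is passed on every call.
first = re.search(r"\d+", text)

# Compiled pattern: its methods accept extra arguments such as pos,
# the index at which the search starts.
number = re.compile(r"\d+")
second = number.search(text, first.end())
```

Here the second search resumes where the first match ended, so it finds the next number in the text.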
Regular Expression Syntax
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB.
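Concatenation can be illustrated with two small patterns. The expressions A and B and the strings p and q below are hypothetical examples chosen for this sketch:

```python
import re

A = r"[a-z]+"   # hypothetical expression A: one or more lowercase letters
B = r"\d"       # hypothetical expression B: a single digit
AB = A + B      # concatenating the pattern strings concatenates the REs

p, q = "abc", "7"
matches = (
    re.fullmatch(A, p) is not None,        # p matches A
    re.fullmatch(B, q) is not None,        # q matches B
    re.fullmatch(AB, p + q) is not None,   # pq matches AB
)
```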
A brief explanation of the format of regular expressions follows.
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves.
Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
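The distinction between ordinary and special characters might be sketched as follows, using '|' as the special character and a backslash to strip its special meaning (the variable names are illustrative):

```python
import re

# An ordinary character matches itself.
ordinary = re.fullmatch(r"A", "A")

# '|' is special: it separates two alternatives.
alternation = re.fullmatch(r"a|b", "b")

# Escaped with a backslash, '|' is treated as an ordinary character.
literal_bar = re.fullmatch(r"a\|b", "a|b")

# Unescaped, the same pattern does not match the three-character string.
no_match = re.fullmatch(r"a|b", "a|b")
```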
Repetition qualifiers (*, +, ?, {m,n}, etc.) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.
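The nesting rule can be checked with a short sketch. The test lengths are arbitrary, and fullmatch() is used here so that the whole string must consist of groups of six:

```python
import re

# Parentheses allow a second repetition to be applied to a{6}.
sixes = re.compile(r"(?:a{6})*")
results = {n: sixes.fullmatch("a" * n) is not None for n in (0, 6, 7, 12)}

# Nesting the qualifiers directly is rejected when the pattern is compiled.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False
```

Lengths 0, 6, and 12 are multiples of six and match in full; length 7 does not.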
The special characters:
- '.': (Dot) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
- '^': (Caret) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
- '$': Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
- '*': Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
- '+': Causes the resulting RE to match 1 or more repetitions of the preceding RE.
- '?': Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
- '*?', '+?', '??': The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible.
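The special characters listed above can be tried out with a short sketch. The sample strings are invented for illustration, and findall() and match() are used only as convenient ways to show what each pattern matches:

```python
import re

text = "one\ntwo\n"   # illustrative two-line input

# '.' skips newlines unless DOTALL is given.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", re.DOTALL)

# '^' matches only at the start of the string unless MULTILINE is given.
starts = re.findall(r"^\w+", text)
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)

# '$' matches at the end (or just before a final newline);
# with MULTILINE it also matches before every newline.
ends = re.findall(r"\w+$", text)
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)

# '*', '+', and '?' are greedy: they take as many repetitions as they can.
star = re.match(r"ab*", "abbbc").group()   # zero or more 'b'
plus = re.match(r"ab+", "ac")              # needs at least one 'b'
opt = re.match(r"ab?", "abbbc").group()    # at most one 'b'
```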