Regular expressions are a powerful method of matching strings in your text.
This appendix describes the following:
Different characters in a regular expression match different things. A list
of all special (or "magical") characters is as follows.
In the list below, the phrase "an occurrence of a regular expression"
means an occurrence of a string that matches the regular expression.
- A backslash (\) followed by a single character other than a newline
matches that character.
- The caret (^) matches the beginning of a line.
- The dollar sign ($) matches the end of a line.
- The dot (.) matches any character.
- A single character that doesn't have any other special meaning matches
that character.
- A string enclosed in brackets [ ] matches any single character from the
string. Ranges of ASCII character codes may be abbreviated, as in
a-z0-9. A right bracket (]) may occur only as the first character of the
string. You must place a literal hyphen (-) where it can't be mistaken for a
range indicator. If a caret (^) occurs as the first character inside the
brackets, then any character NOT in the string is matched.
- A regular expression followed by an asterisk (*) matches a sequence
of zero or more occurrences of the regular expression.
- A regular expression followed by a plus sign (+) matches one or more
occurrences of the regular expression.
- A regular expression followed by a question mark (?) matches zero or
one occurrences of the regular expression.
- Two regular expressions concatenated match an occurrence of the first
followed by an occurrence of the second.
- Two regular expressions separated by an OR bar (|) match either an
occurrence of the first or an occurrence of the second.
- A regular expression enclosed in parentheses matches an occurrence of the
regular expression.
- The order of precedence of operators at the same parenthesis
level is as follows:
- []
- *+?
- concatenation
- |
- All regular expressions following an at-sign (@) are treated
as case-sensitive.
- All regular expressions following a tilde (~) are
treated as case-insensitive.
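As a rough illustration of the operators listed above, here is a minimal sketch that uses Python's `re` module as a stand-in engine. This is an assumption made purely for demonstration: Python is not the matcher this appendix describes, the sample strings are invented, and the @ and ~ prefixes have no direct Python equivalent (`re.IGNORECASE` is shown only as a loose analogue of ~).

```python
import re

# Pattern -> sample text. Both columns are made-up illustrations.
samples = {
    r"[a-z0-9]+": "ID: x25b",    # bracket string with ranges, one or more
    r"[^0-9]+": "abc123",        # caret first inside brackets negates the string
    r"colou?r": "my color",      # question mark: zero or one occurrence
    r"^(cat|dog)s*$": "dogs",    # anchors, OR bar, zero or more
    r"a\.c": "a.c",              # backslash removes the dot's special meaning
}

def first_match(pattern, text):
    """Return the text of the earliest match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

results = {pattern: first_match(pattern, text) for pattern, text in samples.items()}

# Loose analogue of the tilde (~) prefix: a case-insensitive search.
ci = re.search(r"watcom", "WATCOM Debugger", re.IGNORECASE)

for pattern, found in results.items():
    print(pattern, "->", found)
```

Only the operators that both notations share are exercised here; anything specific to the matcher documented in this appendix should be checked against that matcher directly.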
If a regular expression could match two different parts of the line,
it matches the earlier one. If both begin in the same place, but
match different lengths, or the same length in different
ways, then the rules are more complicated.
In general:
- the possibilities in a list of branches are considered
from left to right
- the possibilities for *, +, and ? are considered longest first
- nested constructs are considered from the outermost in
- concatenated constructs are considered leftmost first
The match chosen is the one that uses the earliest possibility in the
first choice that has to be made. If there's more than one choice, the
next is made in the same manner (earliest possibility), subject to the
decision on the first choice, and so on.
For example, (ab|a)b*c could match the string
abc in one of two ways. The first choice is between
ab and a.
Since ab is specified earlier in the expression and does lead to
a successful overall match, it's chosen. Since the b
is already spoken for, the b* must match its last possibility
(the empty string), because it must respect the earlier choice.
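The choice described in this example can be observed with a small sketch. It assumes Python's `re` module behaves like the matcher described here for this pattern (it also tries branches left to right); the extra parentheses around b* are added only so the sketch can inspect what that piece matched.

```python
import re

# Same expression as in the text, with b* wrapped in a group for inspection.
m = re.match(r"(ab|a)(b*)c", "abc")

branch_taken = m.group(1)   # which alternative of (ab|a) was used
star_part = m.group(2)      # what b* was left to match
whole = m.group(0)          # the overall match

print(repr(branch_taken), repr(star_part), repr(whole))
```

The first group holds ab, and the group for b* is empty, which is the "last possibility" mentioned above.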
If there are no OR bars (|) present and only one
*, +, or ?, the net effect is
that the longest possible match is chosen. So the regular expression
ab*, presented with xabbbby, matches
abbbb.
Note that if ab* is tried against xabyabbbz,
it matches ab just after x, because that
occurrence starts earlier in the string than abbb.
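A quick way to check the two ab* cases is the sketch below. It again assumes Python's `re` module as a stand-in engine; `re.search` returns the earliest match, and * takes as many characters as it can at that position.

```python
import re

# Longest match at the earliest starting position.
m1 = re.search(r"ab*", "xabbbby")
# An earlier, shorter occurrence wins over a later, longer one.
m2 = re.search(r"ab*", "xabyabbbz")

first = (m1.group(0), m1.span())
second = (m2.group(0), m2.span())

print(first)
print(second)
```

In the second string the later occurrence abbb is longer, but it is never considered because a match is already found starting at the a that follows x.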
A pseudo-BNF for regular expressions is:
- reg-exp
- {branch}|{branch}|...
- branch
- {piece}{piece}...
- piece
- {atom{* or + or ?}}{atom{* or + or ?}}...
The following table lists what each of the special characters
matches:
Character: | Matches:
---|---
* | 0 or more occurrences of atom
+ | 1 or more occurrences of atom
? | an occurrence of atom, or the null string
- atom
- (reg-exp) or range or . or ~ or @ or ^ or $ or
\char or char
- range
- [{^} char and/or char_lo-char_hi]
where ^ causes negation of range.
- . (dot)
- Match any character.
- ^
- Match the start of the line.
- $
- Match the end of the line.
- @
- Search with case sensitivity.
- ~
- Search without case sensitivity.
- !
- If an exclamation mark (!) occurs as the first character
in a regular expression, all magic characters are treated as special
characters.
An exclamation mark is treated as a regular character if it occurs
anywhere but at the very start of the regular expression.
- char
- Any character
- \char
- Forces \char to be accepted as char (that is,
ignoring any special meaning of char)
When specifying a filename in the Watcom Debugger, it's possible to use
a file-matching regular expression. This expression is similar to a
regular expression, with a few differences:
- A dot (.) specifies an actual dot in the filename.
- An asterisk (*) is the same as the regular expression .* (that is, it
matches 0 or more characters).
- A question mark (?) is the same as a regular expression dot (.);
that is, a question mark matches exactly one character.
- A caret (^) has no meaning.
- A dollar sign ($) has no meaning.
- The backslash (\) has no meaning. It and the slash (/) are used to
separate directories in a path name.
Suppose we have the following list of files:
- a.c
- abc.c
- abc
- bcd.c
- bad
- xyz.c
The following examples show how the files from the above list are
matched by various filename regular expressions.
- a*.c
- All files that start with a and end in .c.
Therefore, it matches a.c and abc.c.
- (a|b)*.c
- All files that start with an a or a b, and end in
.c. Therefore, it matches a.c,
abc.c, and bcd.c.
- *d.c
- All files that end in d.c. Therefore, it matches
bcd.c.
- *
- All files.
- *.*
- All files that have a dot in them. Therefore, it matches
a.c, abc.c, bcd.c, and
xyz.c.
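The file-matching rules can be approximated by translating each expression into an ordinary regular expression. The sketch below is not the debugger's implementation. The function names `file_pattern_to_regex` and `match_files` are hypothetical, and it assumes that a pattern must match the whole filename, that parentheses and the OR bar keep their regular-expression meaning (as the (a|b)*.c example suggests), and that every other character, including ^, $, and brackets, is taken literally.

```python
import re

def file_pattern_to_regex(pattern):
    """Translate a file-matching expression into a Python regex (hypothetical helper)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # asterisk: zero or more characters
        elif ch == "?":
            parts.append(".")            # question mark: exactly one character
        elif ch in "(|)":
            parts.append(ch)             # assumption: grouping and OR bar are kept
        else:
            parts.append(re.escape(ch))  # dot and everything else are literal
    return "".join(parts)

def match_files(pattern, names):
    """Return the names matched in full by the file-matching expression."""
    rx = re.compile(file_pattern_to_regex(pattern))
    return [name for name in names if rx.fullmatch(name)]

files = ["a.c", "abc.c", "abc", "bcd.c", "bad", "xyz.c"]

for pattern in ["a*.c", "(a|b)*.c", "*d.c", "*", "*.*"]:
    print(pattern, "->", match_files(pattern, files))
```

Run against the file list above, this sketch reproduces the matches given for each example. Directory separators are not handled, so it applies to bare filenames only.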