Appendix: Regular Expressions

Regular expressions are a powerful method of matching strings in your text. This appendix describes the following:

Special characters in regular expressions
Regular expression BNF
Regular expressions and filenames

Special characters in regular expressions

Different characters in a regular expression match different things. The special (or "magical") characters are listed below. In this list, the phrase "an occurrence of a regular expression" means an occurrence of a string that matches the regular expression.

A backslash (\) followed by a single character other than a newline matches that character.
The caret (^) matches the beginning of a line.
The dollar sign ($) matches the end of a line.
The dot (.) matches any character.
A single character that doesn't have any other special meaning matches that character.
A string enclosed in brackets [ ] matches any single character from the string. Ranges of ASCII character codes may be abbreviated, as in a-z0-9. A right bracket (]) may occur only as the first character of the string. A literal hyphen (-) must be placed where it can't be mistaken for a range indicator, typically as the first or last character of the string. If a caret (^) occurs as the first character inside the brackets, any character NOT in the string is matched.
A regular expression followed by an asterisk (*) matches a sequence of zero or more occurrences of the regular expression.
A regular expression followed by a plus sign (+) matches one or more occurrences of the regular expression.
A regular expression followed by a question mark (?) matches zero or one occurrences of the regular expression.
Two regular expressions concatenated match an occurrence of the first followed by an occurrence of the second.
Two regular expressions separated by an OR bar (|) match either an occurrence of the first or an occurrence of the second.
A regular expression enclosed in parentheses matches an occurrence of the regular expression.
The order of precedence of operators at the same parenthesis level is as follows:
1. []
2. *+?
3. concatenation
4. |
All regular expressions following an at-sign (@) are treated as case-sensitive.
All regular expressions following a tilde (~) are treated as case-insensitive.
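
As a rough illustration of the rules above, the following sketch exercises the common operators using Python's re module rather than the debugger's own matcher. That substitution is an assumption made for illustration: Python's syntax agrees with the list above for these basic constructs, but it has no @ or ~ operators, so the re.IGNORECASE flag stands in for the tilde here. The helper name whole is hypothetical.

```python
import re

def whole(pattern, text, flags=0):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text, flags) is not None

# Each entry pairs a rule from the list with a check that should hold.
checks = {
    "dot matches any character":        whole(r"a.c", "abc"),
    "backslash makes the dot literal":  whole(r"a\.c", "a.c") and not whole(r"a\.c", "abc"),
    "caret anchors at line start":      re.search(r"^bc", "abc") is None,
    "dollar anchors at line end":       re.search(r"bc$", "abc") is not None,
    "ranges inside brackets":           whole(r"[a-z0-9]+", "abc123"),
    "leading caret negates the set":    whole(r"[^a-z]", "Q") and not whole(r"[^a-z]", "q"),
    "] first and - last are literal":   whole(r"[]a-]+", "]a-"),
    "star allows zero occurrences":     whole(r"ab*", "a") and whole(r"ab*", "abbb"),
    "plus needs at least one":          whole(r"ab+", "ab") and not whole(r"ab+", "a"),
    "question mark allows at most one": whole(r"ab?", "a") and not whole(r"ab?", "abb"),
    "star binds tighter than the bar":  whole(r"ab|cd*", "cddd") and not whole(r"ab|cd*", "abd"),
    "parentheses group a branch list":  whole(r"(ab|cd)e", "cde"),
    "IGNORECASE as a stand-in for ~":   whole(r"abc", "ABC", re.IGNORECASE),
}

for rule, ok in checks.items():
    print(("ok   " if ok else "FAIL ") + rule)
```

The precedence check is the one worth reading closely: ab|cd* is grouped as (ab)|(c(d*)), so it accepts cddd but rejects abd.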

If a regular expression could match two different parts of the line, it matches the earlier one. If both begin in the same place, but match different lengths, or the same length in different ways, then the rules are more complicated.

In general:

the possibilities in a list of branches are considered from left to right
the possibilities for *, +, and ? are considered longest first
nested constructs are considered from the outermost in
concatenated constructs are considered leftmost first

The match chosen is the one that uses the earliest possibility in the first choice that has to be made. If there's more than one choice, the next is made in the same manner (earliest possibility), subject to the decision on the first choice, and so on.

For example, (ab|a)b*c could match the string abc in one of two ways. The first choice is between ab and a. Since ab is specified earlier in the expression and does lead to a successful overall match, it's chosen. Because the b is already spoken for, b* must respect that earlier choice and match its last possibility, the empty string.
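
The choice described above can be observed with a short sketch. It assumes Python's re module, whose backtracking matcher also tries branches left to right; the extra parentheses around b* are added only so that its share of the match can be inspected, and the variable names are illustrative.

```python
import re

# Capture each part separately so the choices become visible.
m = re.match(r"(ab|a)(b*)c", "abc")

first_choice = m.group(1)  # which branch of (ab|a) was taken
star_part = m.group(2)     # what was left for b* to match

print(repr(first_choice), repr(star_part))  # prints: 'ab' ''
```

The first branch, ab, wins, which leaves b* with nothing but the empty string.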

If there are no OR bars (|) present and only one *, +, or ?, the net effect is that the longest possible match is chosen. So the regular expression ab*, presented with xabbbby, matches abbbb. Note that if ab* is tried against xabyabbbz, it matches ab just after x, because that occurrence starts earlier in the string than abbb.
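
A minimal sketch of the "earliest, then longest" behaviour, again assuming Python's re module as a stand-in for the debugger's matcher. The helper first_match is hypothetical; it reports the starting offset together with the matched text.

```python
import re

def first_match(pattern, text):
    """Return (start offset, matched text) for the first match, or None."""
    m = re.search(pattern, text)
    return (m.start(), m.group()) if m else None

# The single * takes as many b's as are available.
print(first_match("ab*", "xabbbby"))    # (1, 'abbbb')

# The earlier occurrence wins, even though a longer one follows.
print(first_match("ab*", "xabyabbbz"))  # (1, 'ab')
```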

Regular expression BNF

A pseudo-BNF for regular expressions is:

reg-exp

{branch}|{branch}|...

branch

{piece}{piece}...

piece

{atom{* or + or ?}}{atom{* or + or ?}}...

The following table lists what each of the special characters matches:

Character:	Matches:
*	0 or more occurrences of atom
+	1 or more occurrences of atom
?	an occurrence of atom, or the null string.

atom

(reg-exp) or range or @ or ^ or $ or \char or char

range

[{^} char and/or char_lo-char_hi]

where ^ causes negation of range.

. (dot)

Match any character.

^

Match the start of the line.

$

Match the end of the line.

@

Search with case sensitivity.

~

Search without case sensitivity.

!

If an exclamation mark (!) occurs as the first character in a regular expression, all magic characters are treated as special characters. An exclamation mark is treated as a regular character if it occurs anywhere but at the very start of the regular expression.

char

Any character

\char

Forces \char to be accepted as char (that is, ignoring any special meaning of char)
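
One way to read the pseudo-BNF above is as a recursive-descent parser with one function per rule. The following Python sketch is illustrative only and is not the debugger's implementation: the tuple-based tree, the function names, and the error messages are all assumptions, and the leading exclamation mark is not handled.

```python
def parse(s):
    """Parse a whole regular expression into a nested tuple tree."""
    tree, i = parse_regexp(s, 0)
    if i != len(s):
        raise ValueError("unexpected %r at offset %d" % (s[i], i))
    return tree

def parse_regexp(s, i):
    # reg-exp: branch | branch | ...
    branch, i = parse_branch(s, i)
    branches = [branch]
    while i < len(s) and s[i] == "|":
        branch, i = parse_branch(s, i + 1)
        branches.append(branch)
    return ("alt", branches), i

def parse_branch(s, i):
    # branch: piece piece ... (stops at an OR bar or a closing parenthesis)
    pieces = []
    while i < len(s) and s[i] not in "|)":
        piece, i = parse_piece(s, i)
        pieces.append(piece)
    return ("cat", pieces), i

def parse_piece(s, i):
    # piece: an atom, optionally followed by *, + or ?
    atom, i = parse_atom(s, i)
    if i < len(s) and s[i] in "*+?":
        return ("repeat", s[i], atom), i + 1
    return atom, i

def parse_atom(s, i):
    ch = s[i]
    if ch == "(":                      # (reg-exp)
        inner, i = parse_regexp(s, i + 1)
        if i >= len(s) or s[i] != ")":
            raise ValueError("unbalanced parenthesis")
        return ("group", inner), i + 1
    if ch == "[":                      # range, with optional leading ^
        j = i + 1
        if j < len(s) and s[j] == "^":
            j += 1
        if j < len(s) and s[j] == "]":  # a leading ] is a literal
            j += 1
        j = s.index("]", j)            # raises ValueError if unterminated
        return ("range", s[i + 1:j]), j + 1
    if ch == "\\":                     # \char
        if i + 1 >= len(s):
            raise ValueError("trailing backslash")
        return ("char", s[i + 1]), i + 2
    if ch in ".^$@~":                  # single-character specials
        return ("special", ch), i + 1
    if ch in "*+?":
        raise ValueError("nothing to repeat")
    return ("char", ch), i + 1         # ordinary character

print(parse("(ab|a)b*c"))
```

For (ab|a)b*c the outermost node is a reg-exp with a single branch of three pieces: the parenthesized group, b repeated with *, and the character c.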

Regular expressions and filenames

When specifying a filename in the Watcom Debugger, it's possible to use a file-matching regular expression. This expression is similar to a regular expression, with a few differences:

A dot (.) specifies an actual dot in the filename.
An asterisk (*) is the same as the regular expression .* (that is, it matches 0 or more characters).
A question mark (?) is the same as a regular expression dot (.); that is, a question mark matches exactly one character.
A caret (^) has no meaning.
A dollar sign ($) has no meaning.
The backslash (\) has no meaning. It and the slash (/) are used to separate directories in a path name.

Suppose we have the following list of files:

a.c
abc.c
abc
bcd.c
bad
xyz.c

The following examples show how the files from the above list are matched by various filename regular expressions.

a*.c: All files that start with a and end in .c. Therefore, it matches a.c and abc.c
(a|b)*.c: All files that start with an a or a b, and end in .c. Therefore, it matches a.c, abc.c, and bcd.c
*d.c: All files that end in d.c. Therefore, it matches bcd.c
*: All files.
*.*: All files that have a dot in them. Therefore, it matches a.c, abc.c, bcd.c, and xyz.c
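
The differences listed above amount to a small translation from a file-matching expression to an ordinary regular expression. The sketch below applies that translation in Python and checks it against the sample file list. The function names are hypothetical, Python's re module stands in for the debugger's matcher, and treating ^, $, and \ as literal characters is an assumption about what "has no meaning" implies.

```python
import re

FILES = ["a.c", "abc.c", "abc", "bcd.c", "bad", "xyz.c"]

def file_pattern_to_regex(pattern):
    """Translate a file-matching expression into regular-expression text."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*")           # asterisk: 0 or more characters
        elif ch == "?":
            out.append(".")            # question mark: exactly one character
        elif ch in ".^$\\":
            out.append(re.escape(ch))  # assumed literal in filenames
        else:
            out.append(ch)             # everything else passes through
    return "".join(out)

def match_files(pattern, names=FILES):
    """Return the names that the file-matching expression matches in full."""
    rx = re.compile(file_pattern_to_regex(pattern))
    return [name for name in names if rx.fullmatch(name)]

for pattern in ["a*.c", "(a|b)*.c", "*d.c", "*", "*.*"]:
    print(pattern, "->", match_files(pattern))
```

Whole-name matching via fullmatch is a design choice here: it reflects the examples, in which a*.c does not match abc.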