Most of our search pages accept regular expressions. Learning how to
use them lets you take full advantage of the search engines.
Regular expressions describe patterns of characters to search for in
strings. In brief: the period ('.') is the wildcard and matches any
single character. An asterisk ('*') after a character matches zero or
more occurrences of it, a question mark ('?') zero or one occurrence,
and a plus ('+') one or more occurrences. The caret ('^') anchors an
expression at the beginning of the line, and the dollar ('$') anchors
it at the end of the line.
Examples:
Input
1GE
Matches the string 1GE anywhere in the string to be searched.
Matches: 01GE 456
Doesn't match: 01G E 456
^1GE
Matches the string 1GE at the beginning of the string to be searched.
Matches: 1GEABC
Doesn't match: 01GE 456
1GE$
Matches the string 1GE at the end of the string to be searched.
Matches: ABC1GE
Doesn't match: 1GEABC
1GE*
Matches the string 1G directly followed by 0 or more 'E'
anywhere in the string to be searched.
Matches: 01G 456
Matches: 01GEEE 456
Doesn't match: 01 GE 456
1G[A-E]
Matches the string 1G directly followed by one of the letters
from the interval A-E anywhere in the string to be searched.
Matches: 01GA
Matches: 01GC 456
Doesn't match: 01GF
1G[^ACE]
Matches the string 1G directly followed by a character that
is not A, C, or E anywhere in the string to be searched.
Matches: 01GB
Matches: 01G 456
Matches: 01G456
Doesn't match: 01GC 456
1.E
Matches the string 1 directly followed by any single character,
followed by E.
Matches: 01GE
Matches: 01 E56
Matches: 016E56
Doesn't match: 01G E456
1.*E
Matches the string 1 directly followed by zero or more characters,
followed by E.
Matches: 01GE
Matches: 01 ABC E56
Matches: 01E56
Doesn't match: 0E1456
1.+E
Matches the string 1 directly followed by one or more characters,
followed by E.
Matches: 01GE
Matches: 01 ABC E56
Doesn't match: 01E56
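The examples above can be checked mechanically. The following is a minimal sketch that uses Python's re module as a stand-in; the engine behind the search pages is not named here, so the choice of Python is an assumption. The table and the function name check_examples are made up for this illustration. These particular patterns use only syntax on which Python and the description above agree.

```python
import re

# (pattern, strings that should match, strings that should not match),
# copied from the examples above.
EXAMPLES = [
    ("1GE",      ["01GE 456"],                    ["01G E 456"]),
    ("^1GE",     ["1GEABC"],                      ["01GE 456"]),
    ("1GE$",     ["ABC1GE"],                      ["1GEABC"]),
    ("1GE*",     ["01G 456", "01GEEE 456"],       ["01 GE 456"]),
    ("1G[A-E]",  ["01GA", "01GC 456"],            ["01GF"]),
    ("1G[^ACE]", ["01GB", "01G 456", "01G456"],   ["01GC 456"]),
    ("1.E",      ["01GE", "01 E56", "016E56"],    ["01G E456"]),
    ("1.*E",     ["01GE", "01 ABC E56", "01E56"], ["0E1456"]),
    ("1.+E",     ["01GE", "01 ABC E56"],          ["01E56"]),
]


def check_examples(examples=EXAMPLES):
    """Return the (pattern, text, expected) triples that disagree with re.search."""
    failures = []
    for pattern, should_match, should_not_match in examples:
        for text in should_match:
            if re.search(pattern, text) is None:
                failures.append((pattern, text, True))
        for text in should_not_match:
            if re.search(pattern, text) is not None:
                failures.append((pattern, text, False))
    return failures


if __name__ == "__main__":
    # An empty list means every example behaves as documented.
    print(check_examples())
```

re.search looks for the pattern anywhere in the string, which corresponds to the "anywhere in the string to be searched" wording used in the examples.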
REs Matching a Single Character
The following REs match a single character or a single collating
element:
Ordinary Characters
An ordinary character is an RE that matches itself. An ordinary
character is any character in the supported character set except
newline and the regular expression special characters listed in
Special Characters below. An ordinary character preceded by a
backslash (\) is treated as the ordinary character itself, except when
the character is (, ), {, or }, or the digits 1 through 9 (see REs
Matching Multiple Characters). Matching is based on the bit pattern
used for encoding the character; not on the graphic representation of
the character.
Special Characters
A regular expression special character preceded by a backslash is a
regular expression that matches the special character itself. When
not preceded by a backslash, such characters have special meaning in
the specification of REs. Regular expression special characters and
the contexts in which they have special meaning are:
. [ \
The period, left square bracket, and backslash are
special except when used in a bracket expression
(see RE Bracket Expression).
*
The asterisk is special except when used in a
bracket expression, as the first character of a
regular expression, or as the first character
following the character pair \( (see REs Matching
Multiple Characters).
+
The plus is special except when used in a
bracket expression, as the first character of a
regular expression, or as the first character
following the character pair \( (see REs Matching
Multiple Characters).
^
The circumflex is special when used as the first
character of an entire RE (see Expression
Anchoring) or as the first character of a bracket
expression.
$
The dollar sign is special when used as the last
character of an entire RE (see Expression
Anchoring).
delimiter
Any character used to bound (i.e., delimit) an
entire RE is special for that RE.
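As an illustration of the backslash rule, here is a short sketch in Python's re module (an assumption; the engine behind these pages may differ). Escaping ., *, ^ and $ with a backslash works the same way there. re.escape is a Python convenience, not part of the syntax described here, and the variable names are made up for the example.

```python
import re

# Unescaped, the period is special and matches any single character.
assert re.search(r"1.5", "125") is not None

# Preceded by a backslash, it matches only a literal period.
assert re.search(r"1\.5", "125") is None
assert re.search(r"1\.5", "1.5") is not None

# The same holds for the other special characters.
assert re.search(r"a\*b", "a*b") is not None
assert re.search(r"\$5", "cost: $5") is not None
assert re.search(r"\^", "2^8") is not None

# re.escape (a Python helper) backslash-escapes every special character.
literal = re.escape("1.5*[x]")
assert re.fullmatch(literal, "1.5*[x]") is not None

# Difference from the rules above: Python rejects a pattern that starts
# with an unescaped asterisk instead of treating it as a literal.
try:
    re.compile("*a")
    leading_star_accepted = True
except re.error:
    leading_star_accepted = False
```

When porting an expression that relies on a leading asterisk or plus being literal, escape it explicitly so the pattern means the same thing in both dialects.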
Period
A period (.), when used outside of a bracket expression, is an RE that
matches any printable or nonprintable character except <newline>.
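A quick way to see this rule in action, again using Python's re module as an assumed stand-in for the actual engine (the variable name dot_matches is made up for the example):

```python
import re

# The period matches every character of "a\nb" except the newline.
dot_matches = re.findall(r".", "a\nb")
assert dot_matches == ["a", "b"]

# Nonprintable characters such as TAB or NUL are matched as well.
assert re.fullmatch(r"a.b", "a\tb") is not None
assert re.fullmatch(r"a.b", "a\x00b") is not None

# A newline between a and b is not matched by the period.
assert re.fullmatch(r"a.b", "a\nb") is None
```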
RE Bracket Expression
A bracket expression enclosed in square brackets ([ ]) is an RE that
matches a single collating element contained in the nonempty set of
collating elements represented by the bracket expression.
The following rules apply to bracket expressions:
bracket expression
A bracket expression is either a matching list
expression or a non-matching list expression, and
consists of one or more expressions in any order.
Expressions can be: collating elements, collating
symbols, noncollating characters, equivalence
classes, range expressions, or character classes.
The right bracket (]) loses its special meaning
and represents itself in a bracket expression if
it occurs first in the list (after an initial ^,
if any). Otherwise, it terminates the bracket
expression (unless it is the ending right bracket
for a valid collating symbol, equivalence class,
or character class, or it is the collating element
within a collating symbol or equivalence class
expression). The special characters
. * + [ \
(period, asterisk, plus, left bracket, and backslash)
lose their special meaning within a bracket
expression.
matching list
A matching list expression specifies a list that
matches any one of the characters represented in
the list. The first character in the list cannot
be the circumflex. For example, [abc] is an RE
that matches any of a, b, or c.
non-matching list
A non-matching list expression begins with a
circumflex (^), and specifies a list that matches
any character except newline and the characters
represented in the list. For example, [^abc] is
an RE that matches any character except newline
or a, b, or c. The circumflex has this special
meaning only when it occurs first in the list,
immediately following the left square bracket.
collating element
A collating element is a sequence of one or more
characters that represents a single element in the
collating sequence as identified via the most
current setting of the locale category LC_COLLATE
(see setlocale(3C)).
collating symbol
A collating symbol is a collating element enclosed
within bracket-period ([. .]) delimiters.
Multi-character collating elements must be
represented as collating symbols to distinguish
them from single-character collating elements.
For example, if the string ch is a valid collating
element, then [.ch.] is treated as an element
matching the same string of characters, while ch
is treated as a simple list of the characters c
and h. If the string within the bracket-period
delimiters is not a valid collating element in the
current collating sequence definition, the symbol
is treated as an invalid expression.
noncollating character
A noncollating character is a character that is
ignored for collating purposes. By definition,
such characters cannot participate in equivalence
classes or range expressions.
equivalence class
An equivalence class expression represents the set
of collating elements belonging to an equivalence
class. It is expressed by enclosing any one of
the collating elements in the equivalence class
within bracket-equal ([= =]) delimiters. For
example, if a and A belong to the same
equivalence class, then [[=a=]b] and [[=A=]b]
are each equivalent to [aAb].
range expression
A range expression represents the set of collating
elements that fall between two elements in the
current collation sequence as defined via the most
current setting of the locale category LC_COLLATE
(see setlocale(3C)). It is expressed as the
starting point and the ending point separated by a
hyphen (-).
The starting range point and the ending range
point must be a collating element, collating
symbol, or equivalence class expression. An
equivalence class expression used as an end point
of a range expression is interpreted such that all
collating elements within the equivalence class
are included in the range. For example, if the
collating order is A, a, B, b, C, c, ch, D, d; and
A and a constitute an equivalence class, then the
expression [[=a=]-D] is treated as
[AaBbCc[.ch.]D].
Both starting and ending range points must be
valid collating elements, collating symbols, or
equivalence class expressions, and the ending
range point must collate equal to or higher than
the starting range point; otherwise the expression
is invalid. For example, with the above collating
order and assuming that E is a noncollating
character, then both the expressions [[=A=]-E] and
[d-a] are invalid.
An ending range point can also be the starting
range point in a subsequent range expression.
Each such range expression is evaluated
separately. For example, the bracket expression
[a-m-o] is treated as [a-mm-o].
The hyphen character is treated as itself if it
occurs first (after an initial ^, if any) or last
in the list, or as the rightmost symbol in a range
expression. As examples, the expressions [-ac]
and [ac-] are equivalent and match any of the
characters a, c, or -; the expressions [^-ac] and
[^ac-] are equivalent and match any characters
except newline, a, c, or -; the expression [%--]
matches any of the characters in the defined
collating sequence between % and - inclusive; the
expression [--@] matches any of the characters in
the defined collating sequence between - and @
inclusive; and the expression [a--@] is invalid,
assuming - precedes a in the collating sequence.
character class
A character class expression represents the set of
characters belonging to a character class, as
defined via the most current setting of the locale
category LC_CTYPE. It is expressed as a character
class name enclosed within bracket-colon ([: :])
delimiters.
Valid character class expressions and the class
they represent are:
Character Classes
[:alpha:] letters
[:upper:] upper-case letters
[:lower:] lower-case letters
[:digit:] decimal digits
[:xdigit:] hexadecimal digits
[:alnum:] letters or decimal digits
[:space:] characters producing white-space in displayed text
[:print:] printing characters
[:punct:] punctuation characters
[:graph:] characters with a visible representation
[:cntrl:] control characters
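The bracket expression rules above can be tried out with a short sketch in Python's re module. This is an illustration under stated assumptions, not the engine behind these pages: Python has no collating symbols, equivalence classes, or [:name:] character classes, and its non-matching lists also match a newline. The POSIX_CLASSES table and the expand_classes helper are hypothetical, ASCII-only stand-ins that ignore the LC_CTYPE locale.

```python
import re
import string

# Matching list: any one of a, b, or c.
assert re.findall(r"[abc]", "cabbage") == ["c", "a", "b", "b", "a"]

# Non-matching list: any character not in the list.
# (Unlike the rule above, Python's [^abc] would also match a newline.)
assert re.findall(r"[^abc]", "cabbage") == ["g", "e"]

# A hyphen first or last in the list stands for itself.
assert re.findall(r"[-ac]", "a-b-c") == ["a", "-", "-", "c"]
assert re.findall(r"[ac-]", "a-b-c") == ["a", "-", "-", "c"]
assert re.findall(r"[^-ac]", "a-b-c") == ["b"]

# A right bracket placed first in the list stands for itself.
assert re.findall(r"[]x]", "[x]") == ["x", "]"]

# Hypothetical ASCII-only substitutes for some of the character classes.
POSIX_CLASSES = {
    "alpha": string.ascii_letters,
    "upper": string.ascii_uppercase,
    "lower": string.ascii_lowercase,
    "digit": string.digits,
    "xdigit": string.hexdigits,
    "alnum": string.ascii_letters + string.digits,
}


def expand_classes(pattern):
    """Replace each [:name:] in pattern with the literal characters of that class."""
    def replace(match):
        return re.escape(POSIX_CLASSES[match.group(1)])
    return re.sub(r"\[:([a-z]+):\]", replace, pattern)


# [[:digit:]]+ becomes [0123456789]+ before it is handed to Python.
digit_runs = re.findall(expand_classes(r"[[:digit:]]+"), "a12b345")
assert digit_runs == ["12", "345"]
```

Expanding a class into its literal characters keeps the rest of the bracket expression intact, so something like [[:upper:][:digit:]_] can be translated the same way.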
REs Matching Multiple Characters
The following rules may be used to construct REs matching multiple
characters from REs matching a single character:
RERE
The concatenation of REs is an RE that matches the
first encountered concatenation of the strings
matched by each component of the RE. For example,
the RE bc matches the second and third characters
of the string abcdefabcdef.
RE*
An RE matching a single character followed by an
asterisk (*) is an RE that matches zero or more
occurrences of the RE preceding the asterisk. The
first encountered string that permits a match is
chosen, and the matched string will encompass the
maximum number of characters permitted by the RE.
For example, in the string abbbcdeabbbbbbcde, both
the RE b*c and the RE bbb*c are matched by the
substring bbbc in the second through fifth
positions. An asterisk as the first character of
an RE loses this special meaning and is treated as
itself.
RE+
An RE matching a single character followed by a
plus (+) is an RE that matches one or more
occurrences of the RE preceding the plus. The
first encountered string that permits a match is
chosen, and the matched string will encompass the
maximum number of characters permitted by the RE.
For example, in the string abbbcdeabbbbbbcde, both
the RE b+c and the RE bbb+c are matched by the
substring bbbc in the second through fifth
positions. A plus as the first character of
an RE loses this special meaning and is treated as
itself.
\(RE\)
A subexpression can be defined within an RE by
enclosing it between the character pairs \( and
\). Such a subexpression matches whatever it
would have matched without the \( and \).
Subexpressions can be arbitrarily nested. An
asterisk immediately following the \( loses its
special meaning and is treated as itself. An
asterisk immediately following the \) is treated
as an invalid character.
\n
The expression \n matches the same string of
characters as was matched by a subexpression
enclosed between \( and \) preceding the \n. The
character n must be a digit from 1 through 9,
specifying the n-th subexpression (the one that
begins with the n-th \( and ends with the
corresponding paired \)). For example, the
expression ^\(.*\)\1$ matches a line consisting of
two adjacent appearances of the same string.
If the \n is followed by an asterisk, it matches
zero or more occurrences of the subexpression
referred to. For example, the expression
\(ab\(cd\)ef\)Z\2*Z\1 matches the string
abcdefZcdcdZabcdef.
RE\{m,n\}
An RE matching a single character followed by
\{m\}, \{m,\}, or \{m,n\} is an RE that matches
repeated occurrences of the RE. The values of m
and n must be decimal integers in the range 0
through 255, with m specifying the exact or
minimum number of occurrences and n specifying the
maximum number of occurrences. \{m\} matches
exactly m occurrences of the preceding RE, \{m,\}
matches at least m occurrences, and \{m,n\}
matches any number of occurrences between m and n,
inclusive.
The first encountered string that matches the
expression is chosen; it will contain as many
occurrences of the RE as possible. For example,
in the string abbbbbbbc the RE b\{3\} is matched
by characters two through four, the RE b\{3,\} is
matched by characters two through eight, and the
RE b\{3,5\}c is matched by characters four through
nine.
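The worked examples in this section can be reproduced with Python's re module, with two caveats that make this a sketch rather than a description of the actual engine. First, Python writes subexpressions and intervals without backslashes: \(RE\) becomes (RE) and \{m,n\} becomes {m,n}. Second, Python reports match positions as 0-based, half-open spans, so the hypothetical helper span_1based converts them to the 1-based character positions used in the text.

```python
import re


def span_1based(pattern, text):
    """Return (first, last) positions of the first match, counting characters from 1."""
    match = re.search(pattern, text)
    if match is None:
        return None
    start, end = match.span()  # 0-based start, exclusive end
    return (start + 1, end)


# Concatenation: bc is found at the second and third characters.
assert span_1based(r"bc", "abcdefabcdef") == (2, 3)

# * and +: the first possible match is chosen and made as long as possible.
s = "abbbcdeabbbbbbcde"
for pattern in (r"b*c", r"bbb*c", r"b+c", r"bbb+c"):
    assert span_1based(pattern, s) == (2, 5)

# Subexpressions and back-references, written (RE) and \n in Python.
assert re.search(r"^(.*)\1$", "abcabc") is not None
assert re.search(r"^(.*)\1$", "abcabd") is None
assert re.search(r"(ab(cd)ef)Z\2*Z\1", "abcdefZcdcdZabcdef") is not None

# Intervals, written {m}, {m,} and {m,n} in Python.
t = "abbbbbbbc"
assert span_1based(r"b{3}", t) == (2, 4)
assert span_1based(r"b{3,}", t) == (2, 8)
assert span_1based(r"b{3,5}c", t) == (4, 9)
```

The last assertion shows why b\{3,5\}c starts at the fourth character: at most five b characters may precede the c, so the match cannot begin any earlier.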
Expression Anchoring
An RE can be limited to matching strings that begin or end a line
(i.e., anchored) according to the following rules:
A circumflex (^) as the first character of an RE anchors the
expression to the beginning of a line; only strings starting at the first
character of a line are matched by the RE. For example, the
RE ^ab matches the string ab in the line abcdef, but not the
same string in the line cdefab.
A dollar sign ($) as the last character of an RE anchors the
expression to the end of a line; only strings ending at the
last character of a line are matched by the RE. For example,
the RE ab$ matches the string ab in the line cdefab, but not
the same string in the line abcdef.
An RE anchored by both ^ and $ matches only strings that are
lines. For example, the RE ^abcdef$ matches only lines
consisting of the string abcdef.
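The anchoring rules can be sketched in Python's re module (an assumed stand-in for the actual engine). The helper matching_lines and the sample text are hypothetical; the helper splits the text into lines first so that ^ and $ refer to the start and end of each line.

```python
import re


def matching_lines(pattern, text):
    """Return the lines of text in which pattern is found."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


sample = "abcdef\ncdefab\nab"

# ^ab matches only where ab starts the line.
starts_with_ab = matching_lines(r"^ab", sample)
assert starts_with_ab == ["abcdef", "ab"]

# ab$ matches only where ab ends the line.
ends_with_ab = matching_lines(r"ab$", sample)
assert ends_with_ab == ["cdefab", "ab"]

# With both anchors the RE must match the whole line.
whole_line = matching_lines(r"^abcdef$", sample)
assert whole_line == ["abcdef"]
```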