2 '\" Copyright (c) 1998 Sun Microsystems, Inc.
3 '\" Copyright (c) 1999 Scriptics Corporation
5 '\" See the file "license.terms" for information on usage and redistribution
6 '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
9 .ie '\w'o''\w'\C'^o''' .ds qo \C'^o'
11 .TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
14 re_syntax \- Syntax of Tcl regular expressions
18 A \fIregular expression\fR describes strings of characters.
19 It's a pattern that matches certain strings and does not match others.
20 .SH "DIFFERENT FLAVORS OF REs"
23 as defined by POSIX, come in two flavors: \fIextended\fR REs
27 EREs are roughly those of the traditional \fIegrep\fR, while BREs are
28 roughly those of the traditional \fIed\fR. This implementation adds
29 a third flavor, \fIadvanced\fR REs
31 basically EREs with some significant extensions.
33 This manual page primarily describes AREs. BREs mostly exist for
34 backward compatibility in some old programs; they will be discussed at
35 the end. POSIX EREs are almost an exact subset of AREs. Features of
36 AREs that are not present in EREs will be indicated.
37 .SH "REGULAR EXPRESSION SYNTAX"
39 Tcl regular expressions are implemented using the package written by
40 Henry Spencer, based on the 1003.2 spec and some (not quite all) of
41 the Perl5 extensions (thanks, Henry!). Much of the description of
42 regular expressions below is copied verbatim from his manual entry.
44 An ARE is one or more \fIbranches\fR,
47 matching anything that matches any of the branches.
49 A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
51 It matches a match for the first, followed by a match for the second, etc.;
52 an empty branch matches the empty string.
54 A quantified atom is an \fIatom\fR possibly followed
55 by a single \fIquantifier\fR.
56 Without a quantifier, it matches a single match for the atom.
58 and what a so-quantified atom matches, are:
63 a sequence of 0 or more matches of the atom
67 a sequence of 1 or more matches of the atom
71 a sequence of 0 or 1 matches of the atom
75 a sequence of exactly \fIm\fR matches of the atom
79 a sequence of \fIm\fR or more matches of the atom
81 \fB{\fIm\fB,\fIn\fB}\fR
83 a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
84 \fIm\fR may not exceed \fIn\fR
86 \fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR
88 \fInon-greedy\fR quantifiers, which match the same possibilities,
89 but prefer the smallest number rather than the largest number
90 of matches (see \fBMATCHING\fR)
93 The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The
94 numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
95 permissible values from 0 to 255 inclusive.
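.PP
The quantifier forms above can be tried directly with the \fBregexp\fR command.
The following is a minimal sketch; the sample string and the variable names are
illustrative assumptions, not part of the original description.

```tcl
# Hypothetical sample input: four consecutive "a" characters.
set s "aaaa"

# Greedy bound: prefers the largest permitted number of repetitions.
regexp {a{2,3}} $s greedy     ;# greedy is "aaa"

# Non-greedy bound: prefers the smallest permitted number of repetitions.
regexp {a{2,3}?} $s lazy      ;# lazy is "aa"

# Exact count: always exactly two repetitions.
regexp {a{2}} $s exact        ;# exact is "aa"
```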
99 .IP \fB(\fIre\fB)\fR 6
100 matches a match for \fIre\fR (\fIre\fR is any regular expression) with
101 the match noted for possible reporting
102 .IP \fB(?:\fIre\fB)\fR
103 as previous, but does no reporting (a
107 matches an empty string, noted for possible reporting
109 matches an empty string, without reporting
110 .IP \fB[\fIchars\fB]\fR
111 a \fIbracket expression\fR, matching any one of the \fIchars\fR (see
112 \fBBRACKET EXPRESSIONS\fR for more detail)
114 matches any single character
116 matches the non-alphanumeric character \fIk\fR
117 taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash
120 where \fIc\fR is alphanumeric (possibly followed by other characters),
121 an \fIescape\fR (AREs only), see \fBESCAPES\fR below
123 when followed by a character other than a digit, matches the
126 when followed by a digit, it is the beginning of a \fIbound\fR (see above)
128 where \fIx\fR is a single character with no other significance,
129 matches that character.
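.PP
As a rough illustration of the atoms listed above, the sketch below contrasts
capturing and non-capturing parentheses and shows a backslash making a special
character ordinary. The input strings and variable names are assumptions chosen
for the example.

```tcl
# Hypothetical input with three dash-separated numbers.
set date "10-20-30"

# The middle group uses (?:...) and is therefore not reported,
# so only two submatch variables are needed.
regexp {(\d+)-(?:\d+)-(\d+)} $date whole first last
# whole is "10-20-30", first is "10", last is "30"

# A backslash before a non-alphanumeric character makes it ordinary:
# here \. matches only a literal dot, not any character.
set dotOnly  [regexp {^a\.b$} "a.b"]   ;# 1
set dotOther [regexp {^a\.b$} "axb"]   ;# 0
```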
132 A \fIconstraint\fR matches an empty string when specific conditions
133 are met. A constraint may not be followed by a quantifier. The
134 simple constraints are as follows; some more constraints are described
135 later, under \fBESCAPES\fR.
140 matches at the beginning of the string or a line (according to whether
141 matching is newline-sensitive or not, as described in \fBMATCHING\fR,
146 matches at the end of the string or a line (according to whether
147 matching is newline-sensitive or not, as described in \fBMATCHING\fR,
151 The difference between string and line matching modes is immaterial
152 when the string does not contain a newline character. The \fB\eA\fR
153 and \fB\eZ\fR constraint escapes have a similar purpose but are
154 always constraints for the overall string.
156 The default newline-sensitivity depends on the command that uses the
157 regular expression, and can be overridden as described in
158 \fBMETASYNTAX\fR, below.
163 \fIpositive lookahead\fR (AREs only), matches at any point where a
164 substring matching \fIre\fR begins
168 \fInegative lookahead\fR (AREs only), matches at any point where no
169 substring matching \fIre\fR begins
172 The lookahead constraints may not contain back references (see later),
173 and all parentheses within them are considered non-capturing.
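.PP
The lookahead constraints can be exercised as in the following sketch. The
strings are made up for illustration; note that the lookahead text is tested
but never becomes part of the reported match.

```tcl
# Positive lookahead: "foo" matches only when "bar" follows,
# but "bar" itself is not consumed.
set pos [regexp {foo(?=bar)} "foobar" m1]    ;# 1, m1 is "foo"

# Negative lookahead: "foo" matches only when "bar" does not follow.
set neg1 [regexp {foo(?!bar)} "foobar"]      ;# 0
set neg2 [regexp {foo(?!bar)} "foobaz" m2]   ;# 1, m2 is "foo"
```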
175 An RE may not end with
177 .SH "BRACKET EXPRESSIONS"
178 A \fIbracket expression\fR is a list of characters enclosed in
180 It normally matches any single character from the list
181 (but see below). If the list begins with
183 it matches any single character (but see below) \fInot\fR from the
186 If two characters in the list are separated by
188 this is shorthand for the full \fIrange\fR of characters between those two
189 (inclusive) in the collating sequence, e.g.
191 in Unicode matches any conventional decimal digit. Two ranges may not share an
194 is illegal. Ranges in Tcl always use the
195 Unicode collating sequence, but other programs may use other collating
196 sequences and this can be a source of incompatibility between programs.
198 To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
199 method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
200 collating element (see below). Alternatively, make it the first
201 character (following a possible
203 or (AREs only) precede it with
207 make it the last character, or the second endpoint of a range. To use
208 a literal \fB\-\fR as the first endpoint of a range, make it a
209 collating element or (AREs only) precede it with
211 With the exception of
212 these, some combinations using \fB[\fR (see next paragraphs), and
213 escapes, all other special characters lose their special significance
214 within a bracket expression.
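.PP
A short sketch of the bracket-expression rules above follows. The sample
strings and variable names are assumptions made for the example.

```tcl
# A range and its complement.
set digits   [regexp {^[0-9]+$} "2024"]     ;# 1
set noDigits [regexp {^[^0-9]+$} "20x4"]    ;# 0, the string contains digits

# A literal ] placed first and a literal - placed last in the list.
set sample {a]-a}
set literal [regexp {^[]a-]+$} $sample]     ;# 1

# Other special characters are ordinary inside a bracket expression.
set dots    [regexp {^[.*]+$} ".*."]        ;# 1
set letters [regexp {^[.*]+$} "abc"]        ;# 0
```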
215 .SS "CHARACTER CLASSES"
216 Within a bracket expression, the name of a \fIcharacter class\fR
217 enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
218 characters (not all collating elements!) belonging to that class.
219 Standard character classes are:
223 An upper-case letter.
231 An alphanumeric (letter or digit).
233 A "printable" (same as graph, except also including space).
235 A space or tab character.
237 A character producing white space in displayed text.
239 A punctuation character.
241 A character with a visible representation (includes both \fBalnum\fR
246 A locale may provide others. A character class may not be used as an endpoint
250 (\fINote:\fR the current Tcl implementation has only one locale, the Unicode
251 locale, which supports exactly the above classes.)
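.PP
Character classes combine freely with other list items inside a bracket
expression. The sketch below uses an identifier-like pattern that is purely an
illustrative assumption.

```tcl
# A letter or underscore followed by letters, digits, or underscores.
set re {^[[:alpha:]_][[:alnum:]_]*$}
set good [regexp $re "abc_123"]                 ;# 1
set bad  [regexp $re "1abc"]                    ;# 0, starts with a digit

# A class mixed with an ordinary character in the same list.
set number [regexp {^[[:digit:].]+$} "3.14"]    ;# 1

# A class on its own.
set upper [regexp {^[[:upper:]]+$} "ABC"]       ;# 1
```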
253 .SS "BRACKETED CONSTRAINTS"
254 There are two special cases of bracket expressions: the bracket
259 are constraints, matching empty strings at the beginning and end of a word
261 .\" note, discussion of escapes below references this definition of word
262 A word is defined as a sequence of word characters that is neither preceded
263 nor followed by word characters. A word character is an \fIalnum\fR character
266 These special bracket expressions are deprecated; users of AREs should use
267 constraint escapes instead (see below).
268 .SS "COLLATING ELEMENTS"
269 Within a bracket expression, a collating element (a character, a
270 multi-character sequence that collates as if it were a single
271 character, or a collating-sequence name for either) enclosed in
272 \fB[.\fR and \fB.]\fR stands for the sequence of characters of that
273 collating element. The sequence is a single element of the bracket
274 expression's list. A bracket expression in a locale that has
275 multi-character collating elements can thus match more than one
276 character. So (insidiously), a bracket expression that starts with
277 \fB^\fR can match multi-character collating elements even if none of
278 them appear in the bracket expression!
281 (\fINote:\fR Tcl has no multi-character collating elements. This information
282 is only for illustration.)
285 For example, assume the collating sequence includes a \fBch\fR multi-character
286 collating element. Then the RE
292 matches the first five characters of
300 matches the multi-character
302 .SS "EQUIVALENCE CLASSES"
303 Within a bracket expression, a collating element enclosed in \fB[=\fR
304 and \fB=]\fR is an equivalence class, standing for the sequences of
305 characters of all collating elements equivalent to that one, including
306 itself. (If there are no other equivalent collating elements, the
307 treatment is as if the enclosing delimiters were
311 For example, if \fBo\fR and \fB\(^o\fR are the members of an
312 equivalence class, then
314 .QW \fB[[=\(^o=]]\fR ,
317 are all synonymous. An equivalence class may not be an endpoint of a range.
320 (\fINote:\fR Tcl implements only the Unicode locale. It does not define any
321 equivalence classes. The examples above are just illustrations.)
324 Escapes (AREs only), which begin with a \fB\e\fR followed by an
325 alphanumeric character, come in several varieties: character entry,
326 class shorthands, constraint escapes, and back references. A \fB\e\fR
327 followed by an alphanumeric character but not constituting a valid
328 escape is illegal in AREs. In EREs, there are no escapes: outside a
329 bracket expression, a \fB\e\fR followed by an alphanumeric character
330 merely stands for that character as an ordinary character, and inside
331 a bracket expression, \fB\e\fR is an ordinary character. (The latter
332 is the one actual incompatibility between EREs and AREs.)
333 .SS "CHARACTER-ENTRY ESCAPES"
334 Character-entry escapes (AREs only) exist to make it easier to specify
335 non-printing and otherwise inconvenient characters in REs:
340 alert (bell) character, as in C
348 synonym for \fB\e\fR to help reduce backslash doubling in some
349 applications where there are multiple levels of backslash processing
353 (where \fIX\fR is any character) the character whose low-order 5 bits
354 are the same as those of \fIX\fR, and whose other bits are all zero
358 the character whose collating-sequence name is
360 or failing that, the character with octal value 033
372 carriage return, as in C
376 horizontal tab, as in C
380 (where \fIwxyz\fR is one to four hexadecimal digits) the Unicode
381 character \fBU+\fIwxyz\fR in the local byte ordering
385 (where \fIstuvwxyz\fR is one to eight hexadecimal digits) reserved
386 for a Unicode extension up to 21 bits. The digits are parsed until the
387 first non-hexadecimal character is encountered, the maximum of eight
388 hexadecimal digits is reached, or a further digit would overflow the maximum
389 value of \fBU+\fI10ffff\fR.
393 vertical tab, as in C
397 (where \fIhh\fR is one or two hexadecimal digits) the character
398 whose hexadecimal value is \fB0x\fIhh\fR.
402 the character whose value is \fB0\fR
406 (where \fIxyz\fR is exactly three octal digits, and is not a \fIback
407 reference\fR (see below)) the character whose octal value is
408 \fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise
409 the two-digit form is assumed.
413 (where \fIxy\fR is exactly two octal digits, and is not a \fIback
414 reference\fR (see below)) the character whose octal value is
418 Hexadecimal digits are
419 .QR \fB0\fR \fB9\fR ,
420 .QR \fBa\fR \fBf\fR ,
422 .QR \fBA\fR \fBF\fR .
424 .QR \fB0\fR \fB7\fR .
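.PP
The following sketch shows a few character-entry escapes in use. The REs are
quoted with braces so that the regular-expression package, rather than the Tcl
parser, interprets the backslashes; the inputs are illustrative assumptions.

```tcl
# The RE sees the two characters backslash-t and treats them as a tab;
# the double-quoted subject string has its tab produced by the Tcl parser.
set tab [regexp {^a\tb$} "a\tb"]         ;# 1

# Hexadecimal and four-digit Unicode forms: 0x41 is "A", U+0042 is "B".
set hex [regexp {^\x41\u0042$} "AB"]     ;# 1

# Three octal digits: octal 103 is "C". There is no capturing
# subexpression here, so this is not taken as a back reference.
set oct [regexp {^\103$} "C"]            ;# 1
```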
426 The character-entry escapes are always taken as ordinary characters.
427 For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
428 not terminate a bracket expression. Beware, however, that some
429 applications (e.g., C compilers and the Tcl interpreter if the regular
430 expression is not quoted with braces) interpret such sequences
431 themselves before the regular-expression package gets to see them,
432 which may require doubling (quadrupling, etc.) the
434 .SS "CLASS-SHORTHAND ESCAPES"
435 Class-shorthand escapes (AREs only) provide shorthands for certain
436 commonly-used character classes:
449 \fB[[:alnum:]_\eu203F\eu2040\eu2054\euFE33\euFE34\euFE4D\euFE4E\euFE4F\euFF3F]\fR (including punctuation connector characters)
461 \fB[^[:alnum:]_\eu203F\eu2040\eu2054\euFE33\euFE34\euFE4D\euFE4E\euFE4F\euFF3F]\fR (including punctuation connector characters)
464 Within bracket expressions,
469 lose their outer brackets, and
474 are illegal. (So, for example,
477 .QW \fB[a-c[:digit:]]\fR .
480 which is equivalent to
481 .QW \fB[a-c^[:digit:]]\fR ,
483 .SS "CONSTRAINT ESCAPES"
484 A constraint escape (AREs only) is a constraint, matching the empty
485 string if specific conditions are met, written as an escape:
490 matches only at the beginning of the string (see \fBMATCHING\fR,
491 below, for how this differs from
496 matches only at the beginning of a word
500 matches only at the end of a word
504 matches only at the beginning or end of a word
508 matches only at a point that is not the beginning or end of a word
512 matches only at the end of the string (see \fBMATCHING\fR, below, for
513 how this differs from
518 (where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below
522 (where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits,
523 and the decimal value \fImnn\fR is not greater than the number of
524 closing capturing parentheses seen so far) a \fIback reference\fR, see
528 A word is defined as in the specification of
532 above. Constraint escapes are illegal within bracket expressions.
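.PP
The word-oriented constraint escapes can be compared by counting matches with
\fBregexp \-all\fR, as in this sketch. The sample text is an assumption chosen
so that each escape gives a different count.

```tcl
# Hypothetical text: "cat" appears twice as a word and once inside "concat".
set text "cat concat cat"

set whole  [regexp -all {\ycat\y} $text]   ;# 2, only the stand-alone words
set starts [regexp -all {\mc} $text]       ;# 3, a "c" at the start of each word
set ends   [regexp -all {t\M} $text]       ;# 3, a "t" at the end of each word
set inside [regexp -all {\Bcat} $text]     ;# 1, the "cat" inside "concat"
set front  [regexp {\Acat} $text]          ;# 1, "cat" at the start of the string
```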
533 .SS "BACK REFERENCES"
534 A back reference (AREs only) matches the same string matched by the
535 parenthesized subexpression specified by the number, so that (e.g.)
543 The subexpression must entirely precede the back reference in the RE.
544 Subexpressions are numbered in the order of their leading parentheses.
545 Non-capturing parentheses do not define subexpressions.
547 There is an inherent historical ambiguity between octal
548 character-entry escapes and back references, which is resolved by
549 heuristics, as hinted at above. A leading zero always indicates an
550 octal escape. A single non-zero digit, not followed by another digit,
551 is always taken as a back reference. A multi-digit sequence not
552 starting with a zero is taken as a back reference if it comes after a
553 suitable subexpression (i.e. the number is in the legal range for a
554 back reference), and otherwise is taken as octal.
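.PP
A minimal sketch of back references follows. The first RE requires the same
character twice; the second looks for a repeated word. Both the patterns and
the input strings are illustrative assumptions.

```tcl
# \1 must match exactly what the first parenthesized subexpression matched.
set re {^([bc])\1$}
set bb [regexp $re "bb"]   ;# 1
set cc [regexp $re "cc"]   ;# 1
set bc [regexp $re "bc"]   ;# 0, "c" is not what the group matched

# A word followed by white space and the same word again.
regexp {\m(\w+)\s+\1\M} "this is is a test" dup word
# dup is "is is", word is "is"
```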
556 In addition to the main syntax described above, there are some special
557 forms and miscellaneous syntactic facilities available.
559 Normally the flavor of RE being used is specified by
560 application-dependent means. However, this can be overridden by a
561 \fIdirector\fR. If an RE of any flavor begins with
563 the rest of the RE is an ARE. If an RE of any flavor begins with
565 the rest of the RE is taken to be a literal string, with
566 all characters considered ordinary characters.
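.PP
The effect of the literal-string director can be sketched as follows. The
spelling \fB***=\fR used here is the one accepted by Tcl's \fBregexp\fR
command; the sample strings are assumptions.

```tcl
# With the ***= director the rest of the RE is a literal string,
# so the dot matches only a dot.
set lit1 [regexp {***=a.b} "a.b"]   ;# 1
set lit2 [regexp {***=a.b} "axb"]   ;# 0

# Without the director the dot matches any single character.
set any [regexp {a.b} "axb"]        ;# 1
```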
568 An ARE may begin with \fIembedded options\fR: a sequence
569 \fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic
570 characters) specifies options affecting the rest of the RE. These
571 supplement, and can override, any options specified by the
572 application. The available option letters are:
581 case-sensitive matching (usual default)
589 case-insensitive matching (see \fBMATCHING\fR, below)
593 historical synonym for \fBn\fR
597 newline-sensitive matching (see \fBMATCHING\fR, below)
601 partial newline-sensitive matching (see \fBMATCHING\fR, below)
605 rest of RE is a literal
607 string, all ordinary characters
611 non-newline-sensitive matching (usual default)
615 tight syntax (usual default; see below)
619 inverse partial newline-sensitive
621 matching (see \fBMATCHING\fR, below)
625 expanded syntax (see below)
628 Embedded options take effect at the \fB)\fR terminating the sequence.
629 They are available only at the start of an ARE, and may not be used
632 In addition to the usual (\fItight\fR) RE syntax, in which all
633 characters are significant, there is an \fIexpanded\fR syntax,
634 available in all flavors of RE with the \fB\-expanded\fR switch, or in
635 AREs with the embedded x option. In the expanded syntax, white-space
636 characters are ignored and all characters between a \fB#\fR and the
637 following newline (or the end of the RE) are ignored, permitting
638 paragraphing and commenting a complex RE. There are three exceptions
641 a white-space character or
649 within a bracket expression is retained
651 white space and comments are illegal within multi-character symbols
657 Expanded-syntax white-space characters are blank, tab, newline, and
658 any character that belongs to the \fIspace\fR character class.
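.PP
Embedded options and the expanded syntax can be combined, as in the sketch
below. The option letters \fBi\fR and \fBx\fR are given together at the start
of the ARE; the pattern and input are illustrative assumptions.

```tcl
# White space and #-comments inside the RE are ignored because of (?x);
# letters match without regard to case because of (?i).
set re {(?ix)
    ([a-z]+)   # a word
    \s+        # separating white space
    (\d{4})    # a four-digit number
}
set ok [regexp $re "Release  2024" all word year]
# ok is 1, all is "Release  2024", word is "Release", year is "2024"
```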
660 Finally, in an ARE, outside bracket expressions, the sequence
661 .QW \fB(?#\fIttt\fB)\fR
662 (where \fIttt\fR is any text not containing a
664 is a comment, completely ignored. Again, this is not
665 allowed between the characters of multi-character symbols like
667 Such comments are more a historical artifact than a useful facility,
668 and their use is deprecated; use the expanded syntax instead.
670 \fINone\fR of these metasyntax extensions is available if the
671 application (or an initial
673 director) has specified that the
674 user's input be treated as a literal string rather than as an RE.
676 In the event that an RE could match more than one substring of a given
677 string, the RE matches the one starting earliest in the string. If
678 the RE could match more than one substring starting at that point, its
679 choice is determined by its \fIpreference\fR: either the longest
680 substring, or the shortest.
682 Most atoms, and all constraints, have no preference. A parenthesized
683 RE has the same preference (possibly none) as the RE. A quantified
684 atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same
685 preference (possibly none) as the atom itself. A quantified atom with
686 other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with
687 \fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom
688 with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR
689 with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has
690 the same preference as the first quantified atom in it which has a
691 preference. An RE consisting of two or more branches connected by the
692 \fB|\fR operator prefers longest match.
694 Subject to the constraints imposed by the rules for matching the whole
695 RE, subexpressions also match the longest or shortest possible
696 substrings, based on their preferences, with subexpressions starting
697 earlier in the RE taking priority over ones starting later. Note that
698 outer subexpressions thus take priority over their component
701 The quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to
702 force longest and shortest preference, respectively, on a
703 subexpression or a whole RE.
706 \fBNOTE:\fR This means that you can usually make an RE non-greedy overall by
707 putting \fB{1,1}?\fR after one of the first non-constraint atoms or
708 parenthesized sub-expressions in it. \fIIt pays to experiment\fR with the
709 placement of this non-greediness override on a suitable range of input texts
710 when you are writing an RE at this level of complexity.
712 For example, this regular expression is non-greedy, and will match the
713 shortest substring possible given that
715 will be matched as early as possible (the quantifier does not change that):
723 has no greediness preference, we explicitly give one for
725 and the remaining quantifiers are overridden to be non-greedy by the preceding
726 non-greedy quantifier.
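.PP
The override described above can be observed by running the same RE with and
without a \fB{1,1}?\fR quantifier on an early atom. This is a sketch under
assumed inputs; the subject string is invented so that two different endings
are possible.

```tcl
# Hypothetical subject with two possible "cd" endings.
set text "abXxYcdZcd"

# All quantifiers greedy: the match runs to the last "cd".
regexp {ab.*x.*cd} $text greedy         ;# greedy is "abXxYcdZcd"

# {1,1}? on "b" gives the whole RE a shortest-match preference,
# so the match stops at the first "cd" that completes it.
regexp {ab{1,1}?.*x.*cd} $text lazy     ;# lazy is "abXxYcd"
```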
729 Match lengths are measured in characters, not collating elements. An
730 empty string is considered longer than no match at all. For example,
732 matches the three middle characters of
734 .QW \fB(week|wee)(night|knights)\fR
735 matches all ten characters of
736 .QW \fBweeknights\fR ,
741 the parenthesized subexpression matches all three characters, and when
745 both the whole RE and the parenthesized subexpression match an empty string.
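.PP
The longest-match rule for alternation can be checked as in this sketch, which
reuses the \fBweeknights\fR RE from the text; the second RE and the variable
names are assumptions added for contrast.

```tcl
# The whole RE prefers the longest match, so the first group settles for
# "wee" in order that the second group can take "knights".
regexp {(week|wee)(night|knights)} "weeknights" all first second
# all is "weeknights", first is "wee", second is "knights"

# The order of the branches does not decide the result; the longer
# alternative is chosen even though "a" is listed first.
regexp {a|ab} "xab" branch     ;# branch is "ab"
```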
747 If case-independent matching is specified, the effect is much as if
748 all case distinctions had vanished from the alphabet. When an
749 alphabetic that exists in multiple cases appears as an ordinary
750 character outside a bracket expression, it is effectively transformed
751 into a bracket expression containing both cases, so that \fBx\fR
754 When it appears inside a bracket expression,
755 all case counterparts of it are added to the bracket expression, so
765 If newline-sensitive matching is specified, \fB.\fR and bracket
766 expressions using \fB^\fR will never match the newline character (so
767 that matches will never cross newlines unless the RE explicitly
768 arranges it) and \fB^\fR and \fB$\fR will match the empty string after
769 and before a newline respectively, in addition to matching at
770 beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
771 continue to match beginning or end of string \fIonly\fR.
773 If partial newline-sensitive matching is specified, this affects
774 \fB.\fR and bracket expressions as with newline-sensitive matching,
775 but not \fB^\fR and \fB$\fR.
777 If inverse partial newline-sensitive matching is specified, this
778 affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
779 not \fB.\fR and bracket expressions. This is not very useful but is
780 provided for symmetry.
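.PP
The matching modes above can be compared side by side. The sketch below uses
the \fB\-nocase\fR, \fB\-line\fR, \fB\-linestop\fR and \fB\-lineanchor\fR
switches of the \fBregexp\fR command together with the equivalent embedded
options; the two-line subject string is an assumption.

```tcl
# Hypothetical two-line subject.
set text "one\nTwo"

# Default: case-sensitive, and ^ and $ match only at the string ends.
set plain [regexp {^two$} $text]                    ;# 0

# Case-insensitive and newline-sensitive, by switches or embedded options.
set switches [regexp -nocase -line {^two$} $text]   ;# 1
set embedded [regexp {(?in)^two$} $text]            ;# 1

# By default "." matches a newline; with -line it does not.
set dotAny  [regexp {one.Two} $text]                ;# 1
set dotLine [regexp -line {one.Two} $text]          ;# 0

# Partial modes: -linestop changes "." only, -lineanchor changes ^ and $ only.
set stop   [regexp -linestop {^Two$} $text]         ;# 0
set anchor [regexp -lineanchor {^Two$} $text]       ;# 1
```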
781 .SH "LIMITS AND COMPATIBILITY"
782 No particular limit is imposed on the length of REs. Programs
783 intended to be highly portable should not employ REs longer than 256
784 bytes, as a POSIX-compliant implementation can refuse to accept such
787 The only feature of AREs that is actually incompatible with POSIX EREs
788 is that \fB\e\fR does not lose its special significance inside bracket
789 expressions. All other ARE features use syntax which is illegal or
790 has undefined or unspecified effects in POSIX EREs; the \fB***\fR
791 syntax of directors likewise is outside the POSIX syntax for both BREs
794 Many of the ARE extensions are borrowed from Perl, but some have been
795 changed to clean them up, and a few Perl extensions are not present.
796 Incompatibilities of note include
799 the lack of special treatment for a trailing newline, the addition of
800 complemented bracket expressions to the things affected by
801 newline-sensitive matching, the restrictions on parentheses and back
802 references in lookahead constraints, and the longest/shortest-match
803 (rather than first-match) matching semantics.
805 The matching rules for REs containing both normal and non-greedy
806 quantifiers have changed since early beta-test versions of this
807 package. (The new rules are much simpler and cleaner, but do not work
808 as hard at guessing the user's real intentions.)
810 Henry Spencer's original 1986 \fIregexp\fR package, still in
811 widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
812 early version of today's EREs. There are four incompatibilities
813 between \fIregexp\fR's near-EREs
814 .PQ RREs " for short"
815 and AREs. In roughly increasing order of significance:
817 In AREs, \fB\e\fR followed by an alphanumeric character is either an
818 escape or an error, while in RREs, it was just another way of writing
819 the alphanumeric. This should not be a problem because there was no
820 reason to write such a sequence in RREs.
822 \fB{\fR followed by a digit in an ARE is the beginning of a bound,
823 while in RREs, \fB{\fR was always an ordinary character. Such
824 sequences should be rare, and will often result in an error because
825 following characters will not look like a valid bound.
827 In AREs, \fB\e\fR remains a special character within
829 so a literal \fB\e\fR within \fB[\|]\fR must be written
831 \fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
832 but only truly paranoid programmers routinely doubled the backslash.
834 AREs report the longest/shortest match for the RE, rather than the
835 first found in a specified search order. This may affect some RREs
836 which were written in the expectation that the first match would be
837 reported. (The careful crafting of RREs to optimize the search order
838 for fast matching is obsolete (AREs examine all possible matches in
839 parallel, and their performance is largely insensitive to their
840 complexity) but cases where the search order was exploited to
841 deliberately find a match which was \fInot\fR the longest/shortest
842 will need rewriting.)
843 .SH "BASIC REGULAR EXPRESSIONS"
844 BREs differ from EREs in several respects.
847 and \fB?\fR are ordinary characters and there is no equivalent for their
848 functionality. The delimiters for bounds are \fB\e{\fR and
850 with \fB{\fR and \fB}\fR by themselves ordinary characters. The
851 parentheses for nested subexpressions are \fB\e(\fR and
853 with \fB(\fR and \fB)\fR by themselves ordinary
854 characters. \fB^\fR is an ordinary character except at the beginning
855 of the RE or the beginning of a parenthesized subexpression, \fB$\fR
856 is an ordinary character except at the end of the RE or the end of a
857 parenthesized subexpression, and \fB*\fR is an ordinary character if
858 it appears at the beginning of the RE or the beginning of a
859 parenthesized subexpression (after a possible leading
861 Finally, single-digit back references are available, and \fB\e<\fR and
862 \fB\e>\fR are synonyms for
866 respectively; no other escapes are available.
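.PP
The BRE differences can be tried from Tcl by selecting the BRE flavor with the
embedded option \fBb\fR, as in the sketch below. The use of \fB(?b)\fR and the
sample strings are assumptions made for illustration.

```tcl
# In a BRE, + is an ordinary character.
set plusLiteral [regexp {(?b)^a+$} "a+"]     ;# 1
set plusRepeat  [regexp {(?b)^a+$} "aaa"]    ;# 0

# Bounds are delimited by \{ and \}.
set bound [regexp {(?b)^a\{2\}$} "aa"]       ;# 1

# Subexpressions use \( and \), and single-digit back references work.
set group [regexp {(?b)^\(ab\)\1$} "abab"]   ;# 1
```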
868 RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
870 match, regular expression, string