util/src/TclTk/tcl8.6.12/doc/re_syntax.n

   1 '\"
   2 '\" Copyright (c) 1998 Sun Microsystems, Inc.
   3 '\" Copyright (c) 1999 Scriptics Corporation
   4 '\"
   5 '\" See the file "license.terms" for information on usage and redistribution
   6 '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
   7 '\"
   8 .so man.macros
   9 .ie '\w'o''\w'\C'^o''' .ds qo \C'^o'
  10 .el .ds qo u
  11 .TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
  12 .BS
  13 .SH NAME
  14 re_syntax \- Syntax of Tcl regular expressions
  15 .BE
  16 .SH DESCRIPTION
  17 .PP
  18 A \fIregular expression\fR describes strings of characters.
  19 It's a pattern that matches certain strings and does not match others.
  20 .SH "DIFFERENT FLAVORS OF REs"
  21 Regular expressions
  22 .PQ RE s ,
  23 as defined by POSIX, come in two flavors: \fIextended\fR REs
  24 .PQ ERE s
  25 and \fIbasic\fR REs
  26 .PQ BRE s .
  27 EREs are roughly those of the traditional \fIegrep\fR, while BREs are
  28 roughly those of the traditional \fIed\fR. This implementation adds
  29 a third flavor, \fIadvanced\fR REs
  30 .PQ ARE s ,
  31 basically EREs with some significant extensions.
  32 .PP
  33 This manual page primarily describes AREs. BREs mostly exist for
  34 backward compatibility in some old programs; they will be discussed at
  35 the end. POSIX EREs are almost an exact subset of AREs. Features of
  36 AREs that are not present in EREs will be indicated.
  37 .SH "REGULAR EXPRESSION SYNTAX"
  38 .PP
  39 Tcl regular expressions are implemented using the package written by
  40 Henry Spencer, based on the 1003.2 spec and some (not quite all) of
  41 the Perl5 extensions (thanks, Henry!). Much of the description of
  42 regular expressions below is copied verbatim from his manual entry.
  43 .PP
  44 An ARE is one or more \fIbranches\fR,
  45 separated by
  46 .QW \fB|\fR ,
  47 matching anything that matches any of the branches.
  48 .PP
  49 A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
  50 concatenated.
  51 It matches a match for the first, followed by a match for the second, etc.;
  52 an empty branch matches the empty string.
  53 .SS QUANTIFIERS
  54 A quantified atom is an \fIatom\fR possibly followed
  55 by a single \fIquantifier\fR.
  56 Without a quantifier, it matches a single match for the atom.
  57 The quantifiers,
  58 and what a so-quantified atom matches, are:
  59 .RS 2
  60 .TP 6
  61 \fB*\fR
  62 .
  63 a sequence of 0 or more matches of the atom
  64 .TP
  65 \fB+\fR
  66 .
  67 a sequence of 1 or more matches of the atom
  68 .TP
  69 \fB?\fR
  70 .
  71 a sequence of 0 or 1 matches of the atom
  72 .TP
  73 \fB{\fIm\fB}\fR
  74 .
  75 a sequence of exactly \fIm\fR matches of the atom
  76 .TP
  77 \fB{\fIm\fB,}\fR
  78 .
  79 a sequence of \fIm\fR or more matches of the atom
  80 .TP
  81 \fB{\fIm\fB,\fIn\fB}\fR
  82 .
  83 a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
  84 \fIm\fR may not exceed \fIn\fR
  85 .TP
  86 \fB*?  +?  ??  {\fIm\fB}?  {\fIm\fB,}?  {\fIm\fB,\fIn\fB}?\fR
  87 .
  88 \fInon-greedy\fR quantifiers, which match the same possibilities,
  89 but prefer the smallest number rather than the largest number
  90 of matches (see \fBMATCHING\fR)
  91 .RE
  92 .PP
  93 The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The
  94 numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
  95 permissible values from 0 to 255 inclusive.
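.PP
As a brief illustration of greedy quantifiers, non-greedy quantifiers
and bounds, the following sketch uses the \fBregexp\fR command with its
\fB\-inline\fR switch. The sample string is only an assumption chosen
for the example:
.PP
.CS
regexp \-inline {ca+} caaab       ;# returns caaa (greedy)
regexp \-inline {ca+?} caaab      ;# returns ca (non-greedy)
regexp \-inline {ca{2}} caaab     ;# returns caa (exactly two)
regexp \-inline {ca{1,2}} caaab   ;# returns caa (one through two)
.CE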
  96 .SS ATOMS
  97 An atom is one of:
  98 .RS 2
  99 .IP \fB(\fIre\fB)\fR 6
 100 matches a match for \fIre\fR (\fIre\fR is any regular expression) with
 101 the match noted for possible reporting
 102 .IP \fB(?:\fIre\fB)\fR
 103 as previous, but does no reporting (a
 104 .QW non-capturing
 105 set of parentheses)
 106 .IP \fB()\fR
 107 matches an empty string, noted for possible reporting
 108 .IP \fB(?:)\fR
 109 matches an empty string, without reporting
 110 .IP \fB[\fIchars\fB]\fR
 111 a \fIbracket expression\fR, matching any one of the \fIchars\fR (see
 112 \fBBRACKET EXPRESSIONS\fR for more detail)
 113 .IP \fB.\fR
 114 matches any single character
 115 .IP \fB\e\fIk\fR
 116 matches the non-alphanumeric character \fIk\fR
 117 taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash
 118 character
 119 .IP \fB\e\fIc\fR
 120 where \fIc\fR is alphanumeric (possibly followed by other characters),
 121 an \fIescape\fR (AREs only), see \fBESCAPES\fR below
 122 .IP \fB{\fR
 123 when followed by a character other than a digit, matches the
 124 left-brace character
 125 .QW \fB{\fR ;
 126 when followed by a digit, it is the beginning of a \fIbound\fR (see above)
 127 .IP \fIx\fR
 128 where \fIx\fR is a single character with no other significance,
 129 matches that character.
 130 .RE
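.PP
The difference between capturing and non-capturing parentheses can be
sketched as follows, again using \fBregexp \-inline\fR, which returns
the overall match followed by each reported subexpression. The input
strings are illustrative assumptions:
.PP
.CS
regexp \-inline {(a)(?:b)(c)} abc   ;# returns abc a c
regexp \-inline {a.c} abc           ;# returns abc
regexp \-inline {a\e.c} {abc a.c}    ;# returns a.c
.CE
.PP
Only the two capturing groups are reported in the first call; the
\fB(?:b)\fR group takes part in the match but is not reported.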
 131 .SS CONSTRAINTS
 132 A \fIconstraint\fR matches an empty string when specific conditions
 133 are met. A constraint may not be followed by a quantifier. The
 134 simple constraints are as follows; some more constraints are described
 135 later, under \fBESCAPES\fR.
 136 .RS 2
 137 .TP 8
 138 \fB^\fR
 139 .
 140 matches at the beginning of the string or a line (according to whether
 141 matching is newline-sensitive or not, as described in \fBMATCHING\fR,
 142 below).
 143 .TP
 144 \fB$\fR
 145 .
 146 matches at the end of the string or a line (according to whether
 147 matching is newline-sensitive or not, as described in \fBMATCHING\fR,
 148 below).
 149 .RS
 150 .PP
 151 The difference between string and line matching modes is immaterial
 152 when the string does not contain a newline character.  The \fB\eA\fR
 153 and \fB\eZ\fR constraint escapes have a similar purpose but are
 154 always constraints for the overall string.
 155 .PP
 156 The default newline-sensitivity depends on the command that uses the
 157 regular expression, and can be overridden as described in
 158 \fBMETASYNTAX\fR, below.
 159 .RE
 160 .TP
 161 \fB(?=\fIre\fB)\fR
 162 .
 163 \fIpositive lookahead\fR (AREs only), matches at any point where a
 164 substring matching \fIre\fR begins
 165 .TP
 166 \fB(?!\fIre\fB)\fR
 167 .
 168 \fInegative lookahead\fR (AREs only), matches at any point where no
 169 substring matching \fIre\fR begins
 170 .RE
 171 .PP
 172 The lookahead constraints may not contain back references (see later),
 173 and all parentheses within them are considered non-capturing.
 174 .PP
 175 An RE may not end with
 176 .QW \fB\e\fR .
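.PP
A minimal sketch of the lookahead constraints described above; the
words used are arbitrary sample data. Note that the text examined by
the lookahead is not part of the reported match:
.PP
.CS
regexp \-inline {foo(?=bar)} foobar   ;# returns foo
regexp {foo(?!bar)} foobar           ;# returns 0
regexp {foo(?!bar)} foobaz           ;# returns 1
.CE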
 177 .SH "BRACKET EXPRESSIONS"
 178 A \fIbracket expression\fR is a list of characters enclosed in
 179 .QW \fB[\|]\fR .
 180 It normally matches any single character from the list
 181 (but see below). If the list begins with
 182 .QW \fB^\fR ,
 183 it matches any single character (but see below) \fInot\fR from the
 184 rest of the list.
 185 .PP
 186 If two characters in the list are separated by
 187 .QW \fB\-\fR ,
 188 this is shorthand for the full \fIrange\fR of characters between those two
 189 (inclusive) in the collating sequence, e.g.
 190 .QW \fB[0\-9]\fR
 191 in Unicode matches any conventional decimal digit. Two ranges may not share an
 192 endpoint, so e.g.
 193 .QW \fBa\-c\-e\fR
 194 is illegal. Ranges in Tcl always use the
 195 Unicode collating sequence, but other programs may use other collating
 196 sequences and this can be a source of incompatibility between programs.
 197 .PP
 198 To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
 199 method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
 200 collating element (see below). Alternatively, make it the first
 201 character (following a possible
 202 .QW \fB^\fR ),
 203 or (AREs only) precede it with
 204 .QW \fB\e\fR .
 205 Alternatively, for
 206 .QW \fB\-\fR ,
 207 make it the last character, or the second endpoint of a range. To use
 208 a literal \fB\-\fR as the first endpoint of a range, make it a
 209 collating element or (AREs only) precede it with
 210 .QW \fB\e\fR .
 211 With the exception of
 212 these, some combinations using \fB[\fR (see next paragraphs), and
 213 escapes, all other special characters lose their special significance
 214 within a bracket expression.
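.PP
The following sketch illustrates ranges, complemented lists, and the
placement rules for a literal \fB]\fR or \fB\-\fR. The input strings
and the variable name \fBm\fR are assumptions made for the example:
.PP
.CS
regexp \-inline {[0\-9]+} abc123def    ;# returns 123
regexp \-inline {[^0\-9]+} abc123def   ;# returns abc
regexp {[]x]+} {a]x]b} m              ;# ] first in the list; m is ]x]
regexp \-inline {[a\-c\-]+} xxb\-ayy     ;# \- last in the list; returns b\-a
.CE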
 215 .SS "CHARACTER CLASSES"
 216 Within a bracket expression, the name of a \fIcharacter class\fR
 217 enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
 218 characters (not all collating elements!) belonging to that class.
 219 Standard character classes are:
 220 .IP \fBalpha\fR 8
 221 A letter.
 222 .IP \fBupper\fR 8
 223 An upper-case letter.
 224 .IP \fBlower\fR 8
 225 A lower-case letter.
 226 .IP \fBdigit\fR 8
 227 A decimal digit.
 228 .IP \fBxdigit\fR 8
 229 A hexadecimal digit.
 230 .IP \fBalnum\fR 8
 231 An alphanumeric (letter or digit).
 232 .IP \fBprint\fR 8
 233 A "printable" (same as graph, except also including space).
 234 .IP \fBblank\fR 8
 235 A space or tab character.
 236 .IP \fBspace\fR 8
 237 A character producing white space in displayed text.
 238 .IP \fBpunct\fR 8
 239 A punctuation character.
 240 .IP \fBgraph\fR 8
 241 A character with a visible representation (includes both \fBalnum\fR
 242 and \fBpunct\fR).
 243 .IP \fBcntrl\fR 8
 244 A control character.
 245 .PP
 246 A locale may provide others. A character class may not be used as an endpoint
 247 of a range.
 248 .RS
 249 .PP
 250 (\fINote:\fR the current Tcl implementation has only one locale, the Unicode
 251 locale, which supports exactly the above classes.)
 252 .RE
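.PP
As an illustration, character classes are written inside an enclosing
bracket expression, so the class name ends up within two pairs of
brackets. The sample inputs below are assumptions:
.PP
.CS
regexp \-inline {[[:upper:]][[:lower:]]+} {hello World}   ;# returns World
regexp \-inline \-all {[[:digit:]]+} {a1 b22}              ;# returns 1 22
regexp {[[:alpha:]]} 123                                 ;# returns 0
.CE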
 253 .SS "BRACKETED CONSTRAINTS"
 254 There are two special cases of bracket expressions: the bracket
 255 expressions
 256 .QW \fB[[:<:]]\fR
 257 and
 258 .QW \fB[[:>:]]\fR
 259 are constraints, matching empty strings at the beginning and end of a word
 260 respectively.
 261 .\" note, discussion of escapes below references this definition of word
 262 A word is defined as a sequence of word characters that is neither preceded
 263 nor followed by word characters. A word character is an \fIalnum\fR character
 264 or an underscore
 265 .PQ \fB_\fR "" .
 266 These special bracket expressions are deprecated; users of AREs should use
 267 constraint escapes instead (see below).
 268 .SS "COLLATING ELEMENTS"
 269 Within a bracket expression, a collating element (a character, a
 270 multi-character sequence that collates as if it were a single
 271 character, or a collating-sequence name for either) enclosed in
 272 \fB[.\fR and \fB.]\fR stands for the sequence of characters of that
 273 collating element. The sequence is a single element of the bracket
 274 expression's list. A bracket expression in a locale that has
 275 multi-character collating elements can thus match more than one
 276 character. So (insidiously), a bracket expression that starts with
 277 \fB^\fR can match multi-character collating elements even if none of
 278 them appear in the bracket expression!
 279 .RS
 280 .PP
 281 (\fINote:\fR Tcl has no multi-character collating elements. This information
 282 is only for illustration.)
 283 .RE
 284 .PP
 285 For example, assume the collating sequence includes a \fBch\fR multi-character
 286 collating element. Then the RE
 287 .QW \fB[[.ch.]]*c\fR
 288 (zero or more
 289 .QW \fBch\fRs
 290 followed by
 291 .QW \fBc\fR )
 292 matches the first five characters of
 293 .QW \fBchchcc\fR .
 294 Also, the RE
 295 .QW \fB[^c]b\fR
 296 matches all of
 297 .QW \fBchb\fR
 298 (because
 299 .QW \fB[^c]\fR
 300 matches the multi-character
 301 .QW \fBch\fR ).
 302 .SS "EQUIVALENCE CLASSES"
 303 Within a bracket expression, a collating element enclosed in \fB[=\fR
 304 and \fB=]\fR is an equivalence class, standing for the sequences of
 305 characters of all collating elements equivalent to that one, including
 306 itself. (If there are no other equivalent collating elements, the
 307 treatment is as if the enclosing delimiters were
 308 .QW \fB[.\fR \&
 309 and
 310 .QW \fB.]\fR .)
 311 For example, if \fBo\fR and \fB\(^o\fR are the members of an
 312 equivalence class, then
 313 .QW \fB[[=o=]]\fR ,
 314 .QW \fB[[=\(^o=]]\fR ,
 315 and
 316 .QW \fB[o\(^o]\fR \&
 317 are all synonymous. An equivalence class may not be an endpoint of a range.
 318 .RS
 319 .PP
 320 (\fINote:\fR Tcl implements only the Unicode locale. It does not define any
 321 equivalence classes. The examples above are just illustrations.)
 322 .RE
 323 .SH ESCAPES
 324 Escapes (AREs only), which begin with a \fB\e\fR followed by an
 325 alphanumeric character, come in several varieties: character entry,
 326 class shorthands, constraint escapes, and back references. A \fB\e\fR
 327 followed by an alphanumeric character but not constituting a valid
 328 escape is illegal in AREs. In EREs, there are no escapes: outside a
 329 bracket expression, a \fB\e\fR followed by an alphanumeric character
 330 merely stands for that character as an ordinary character, and inside
 331 a bracket expression, \fB\e\fR is an ordinary character. (The latter
 332 is the one actual incompatibility between EREs and AREs.)
 333 .SS "CHARACTER-ENTRY ESCAPES"
 334 Character-entry escapes (AREs only) exist to make it easier to specify
 335 non-printing and otherwise inconvenient characters in REs:
 336 .RS 2
 337 .TP 5
 338 \fB\ea\fR
 339 .
 340 alert (bell) character, as in C
 341 .TP
 342 \fB\eb\fR
 343 .
 344 backspace, as in C
 345 .TP
 346 \fB\eB\fR
 347 .
 348 synonym for \fB\e\fR to help reduce backslash doubling in some
 349 applications where there are multiple levels of backslash processing
 350 .TP
 351 \fB\ec\fIX\fR
 352 .
 353 (where \fIX\fR is any character) the character whose low-order 5 bits
 354 are the same as those of \fIX\fR, and whose other bits are all zero
 355 .TP
 356 \fB\ee\fR
 357 .
 358 the character whose collating-sequence name is
 359 .QW \fBESC\fR ,
 360 or failing that, the character with octal value 033
 361 .TP
 362 \fB\ef\fR
 363 .
 364 formfeed, as in C
 365 .TP
 366 \fB\en\fR
 367 .
 368 newline, as in C
 369 .TP
 370 \fB\er\fR
 371 .
 372 carriage return, as in C
 373 .TP
 374 \fB\et\fR
 375 .
 376 horizontal tab, as in C
 377 .TP
 378 \fB\eu\fIwxyz\fR
 379 .
 380 (where \fIwxyz\fR is one to four hexadecimal digits) the Unicode
 381 character \fBU+\fIwxyz\fR in the local byte ordering
 382 .TP
 383 \fB\eU\fIstuvwxyz\fR
 384 .
 385 (where \fIstuvwxyz\fR is one to eight hexadecimal digits) reserved
 386 for a Unicode extension up to 21 bits. The digits are parsed until the
 387 first non-hexadecimal character is encountered, the maximum of eight
 388 hexadecimal digits is reached, or a further digit would push the value
 389 past the maximum of \fBU+\fI10ffff\fR.
 390 .TP
 391 \fB\ev\fR
 392 .
 393 vertical tab, as in C
 394 .TP
 395 \fB\ex\fIhh\fR
 396 .
 397 (where \fIhh\fR is one or two hexadecimal digits) the character
 398 whose hexadecimal value is \fB0x\fIhh\fR.
 399 .TP
 400 \fB\e0\fR
 401 .
 402 the character whose value is \fB0\fR
 403 .TP
 404 \fB\e\fIxyz\fR
 405 .
 406 (where \fIxyz\fR is exactly three octal digits, and is not a \fIback
 407 reference\fR (see below)) the character whose octal value is
 408 \fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise
 409 the two-digit form is assumed.
 410 .TP
 411 \fB\e\fIxy\fR
 412 .
 413 (where \fIxy\fR is exactly two octal digits, and is not a \fIback
 414 reference\fR (see below)) the character whose octal value is
 415 \fB0\fIxy\fR
 416 .RE
 417 .PP
 418 Hexadecimal digits are
 419 .QR \fB0\fR \fB9\fR ,
 420 .QR \fBa\fR \fBf\fR ,
 421 and
 422 .QR \fBA\fR \fBF\fR .
 423 Octal digits are
 424 .QR \fB0\fR \fB7\fR .
 425 .PP
 426 The character-entry escapes are always taken as ordinary characters.
 427 For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
 428 not terminate a bracket expression. Beware, however, that some
 429 applications (e.g., C compilers and the Tcl interpreter if the regular
 430 expression is not quoted with braces) interpret such sequences
 431 themselves before the regular-expression package gets to see them,
 432 which may require doubling (quadrupling, etc.) the
 433 .QW \fB\e\fR .
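.PP
The sketch below shows a few character-entry escapes, together with the
backslash doubling needed when the RE is quoted with double quotes
rather than braces. The inputs are illustrative assumptions:
.PP
.CS
regexp {a\etb} "a\etb"       ;# \et in the RE matches a tab; returns 1
regexp {\eu0041\ex42} AB     ;# U+0041 is A, 0x42 is B; returns 1
regexp "\e\ed+" abc123       ;# same RE as {\ed+}; returns 1
.CE
.PP
In the first call the braces pass \fB\et\fR through to the
regular-expression package unchanged, while in the double-quoted string
being searched the Tcl interpreter itself converts it to a tab
character.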
 434 .SS "CLASS-SHORTHAND ESCAPES"
 435 Class-shorthand escapes (AREs only) provide shorthands for certain
 436 commonly-used character classes:
 437 .RS 2
 438 .TP 10
 439 \fB\ed\fR
 440 .
 441 \fB[[:digit:]]\fR
 442 .TP
 443 \fB\es\fR
 444 .
 445 \fB[[:space:]]\fR
 446 .TP
 447 \fB\ew\fR
 448 .
 449 \fB[[:alnum:]_\eu203F\eu2040\eu2054\euFE33\euFE34\euFE4D\euFE4E\euFE4F\euFF3F]\fR (including punctuation connector characters)
 450 .TP
 451 \fB\eD\fR
 452 .
 453 \fB[^[:digit:]]\fR
 454 .TP
 455 \fB\eS\fR
 456 .
 457 \fB[^[:space:]]\fR
 458 .TP
 459 \fB\eW\fR
 460 .
 461 \fB[^[:alnum:]_\eu203F\eu2040\eu2054\euFE33\euFE34\euFE4D\euFE4E\euFE4F\euFF3F]\fR (including punctuation connector characters)
 462 .RE
 463 .PP
 464 Within bracket expressions,
 465 .QW \fB\ed\fR ,
 466 .QW \fB\es\fR ,
 467 and
 468 .QW \fB\ew\fR \&
 469 lose their outer brackets, and
 470 .QW \fB\eD\fR ,
 471 .QW \fB\eS\fR ,
 472 and
 473 .QW \fB\eW\fR \&
 474 are illegal. (So, for example,
 475 .QW \fB[a-c\ed]\fR
 476 is equivalent to
 477 .QW \fB[a-c[:digit:]]\fR .
 478 Also,
 479 .QW \fB[a-c\eD]\fR ,
 480 which is equivalent to
 481 .QW \fB[a-c^[:digit:]]\fR ,
 482 is illegal.)
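.PP
A short sketch of the class-shorthand escapes, both on their own and
inside a bracket expression. The sample strings are assumptions:
.PP
.CS
regexp \-inline \-all {\ed+} {a1 b22 c333}   ;# returns 1 22 333
regexp \-inline {\ew+} {foo_bar baz}        ;# returns foo_bar
regexp \-inline {\eS+} {  hi there}         ;# returns hi
regexp \-inline {[a\-c\ed]+} zzb2c9x         ;# returns b2c9
.CE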
 483 .SS "CONSTRAINT ESCAPES"
 484 A constraint escape (AREs only) is a constraint, matching the empty
 485 string if specific conditions are met, written as an escape:
 486 .RS 2
 487 .TP 6
 488 \fB\eA\fR
 489 .
 490 matches only at the beginning of the string (see \fBMATCHING\fR,
 491 below, for how this differs from
 492 .QW \fB^\fR )
 493 .TP
 494 \fB\em\fR
 495 .
 496 matches only at the beginning of a word
 497 .TP
 498 \fB\eM\fR
 499 .
 500 matches only at the end of a word
 501 .TP
 502 \fB\ey\fR
 503 .
 504 matches only at the beginning or end of a word
 505 .TP
 506 \fB\eY\fR
 507 .
 508 matches only at a point that is not the beginning or end of a word
 509 .TP
 510 \fB\eZ\fR
 511 .
 512 matches only at the end of the string (see \fBMATCHING\fR, below, for
 513 how this differs from
 514 .QW \fB$\fR )
 515 .TP
 516 \fB\e\fIm\fR
 517 .
 518 (where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below
 519 .TP
 520 \fB\e\fImnn\fR
 521 .
 522 (where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits,
 523 and the decimal value \fImnn\fR is not greater than the number of
 524 closing capturing parentheses seen so far) a \fIback reference\fR, see
 525 below
 526 .RE
 527 .PP
 528 A word is defined as in the specification of
 529 .QW \fB[[:<:]]\fR
 530 and
 531 .QW \fB[[:>:]]\fR
 532 above. Constraint escapes are illegal within bracket expressions.
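.PP
For illustration, the word constraints can be used to match whole words
only, and \fB\eA\fR anchors a match to the start of the string. The
inputs here are sample data, not taken from any particular application:
.PP
.CS
regexp \-inline \-all {\emcat\eM} {cat concat cats cat}   ;# returns cat cat
regexp {\eAcat} {cat nap}                              ;# returns 1
regexp {\eAcat} {a cat}                                ;# returns 0
.CE
.PP
In the first call,
.QW \fBconcat\fR
is rejected by \fB\em\fR and
.QW \fBcats\fR
is rejected by \fB\eM\fR.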
 533 .SS "BACK REFERENCES"
 534 A back reference (AREs only) matches the same string matched by the
 535 parenthesized subexpression specified by the number, so that (e.g.)
 536 .QW \fB([bc])\e1\fR
 537 matches
 538 .QW \fBbb\fR
 539 or
 540 .QW \fBcc\fR
 541 but not
 542 .QW \fBbc\fR .
 543 The subexpression must entirely precede the back reference in the RE.
 544 Subexpressions are numbered in the order of their leading parentheses.
 545 Non-capturing parentheses do not define subexpressions.
 546 .PP
 547 There is an inherent historical ambiguity between octal
 548 character-entry escapes and back references, which is resolved by
 549 heuristics, as hinted at above. A leading zero always indicates an
 550 octal escape. A single non-zero digit, not followed by another digit,
 551 is always taken as a back reference. A multi-digit sequence not
 552 starting with a zero is taken as a back reference if it comes after a
 553 suitable subexpression (i.e. the number is in the legal range for a
 554 back reference), and otherwise is taken as octal.
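.PP
The back reference example above can be tried directly, as in this
small sketch (the word
.QW \fBhello\fR
is an arbitrary sample input):
.PP
.CS
regexp {([bc])\e1} bb             ;# returns 1
regexp {([bc])\e1} bc             ;# returns 0
regexp \-inline {(\ew)\e1} hello    ;# returns ll l
.CE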
 555 .SH "METASYNTAX"
 556 In addition to the main syntax described above, there are some special
 557 forms and miscellaneous syntactic facilities available.
 558 .PP
 559 Normally the flavor of RE being used is specified by
 560 application-dependent means. However, this can be overridden by a
 561 \fIdirector\fR. If an RE of any flavor begins with
 562 .QW \fB***:\fR ,
 563 the rest of the RE is an ARE. If an RE of any flavor begins with
 564 .QW \fB***=\fR ,
 565 the rest of the RE is taken to be a literal string, with
 566 all characters considered ordinary characters.
 567 .PP
 568 An ARE may begin with \fIembedded options\fR: a sequence
 569 \fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic
 570 characters) specifies options affecting the rest of the RE. These
 571 supplement, and can override, any options specified by the
 572 application. The available option letters are:
 573 .RS 2
 574 .TP 3
 575 \fBb\fR
 576 .
 577 rest of RE is a BRE
 578 .TP 3
 579 \fBc\fR
 580 .
 581 case-sensitive matching (usual default)
 582 .TP 3
 583 \fBe\fR
 584 .
 585 rest of RE is an ERE
 586 .TP 3
 587 \fBi\fR
 588 .
 589 case-insensitive matching (see \fBMATCHING\fR, below)
 590 .TP 3
 591 \fBm\fR
 592 .
 593 historical synonym for \fBn\fR
 594 .TP 3
 595 \fBn\fR
 596 .
 597 newline-sensitive matching (see \fBMATCHING\fR, below)
 598 .TP 3
 599 \fBp\fR
 600 .
 601 partial newline-sensitive matching (see \fBMATCHING\fR, below)
 602 .TP 3
 603 \fBq\fR
 604 .
 605 rest of RE is a literal
 606 .PQ quoted
 607 string, all ordinary characters
 608 .TP 3
 609 \fBs\fR
 610 .
 611 non-newline-sensitive matching (usual default)
 612 .TP 3
 613 \fBt\fR
 614 .
 615 tight syntax (usual default; see below)
 616 .TP 3
 617 \fBw\fR
 618 .
 619 inverse partial newline-sensitive
 620 .PQ weird
 621 matching (see \fBMATCHING\fR, below)
 622 .TP 3
 623 \fBx\fR
 624 .
 625 expanded syntax (see below)
 626 .RE
 627 .PP
 628 Embedded options take effect at the \fB)\fR terminating the sequence.
 629 They are available only at the start of an ARE, and may not be used
 630 later within it.
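.PP
As a sketch of directors and embedded options (the strings are
illustrative assumptions), the \fB***=\fR director turns the rest of
the RE into a literal string, and the embedded \fBi\fR option requests
case-insensitive matching:
.PP
.CS
regexp {***=a.c} a.c        ;# literal match; returns 1
regexp {***=a.c} abc        ;# . is an ordinary character here; returns 0
regexp {(?i)hello} HELLO    ;# returns 1
regexp {hello} HELLO        ;# case-sensitive by default; returns 0
.CE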
 631 .PP
 632 In addition to the usual (\fItight\fR) RE syntax, in which all
 633 characters are significant, there is an \fIexpanded\fR syntax,
 634 available in all flavors of RE with the \fB\-expanded\fR switch, or in
 635 AREs with the embedded x option. In the expanded syntax, white-space
 636 characters are ignored and all characters between a \fB#\fR and the
 637 following newline (or the end of the RE) are ignored, permitting
 638 paragraphing and commenting a complex RE. There are three exceptions
 639 to that basic rule:
 640 .IP \(bu 3
 641 a white-space character or
 642 .QW \fB#\fR
 643 preceded by
 644 .QW \fB\e\fR
 645 is retained
 646 .IP \(bu 3
 647 white space or
 648 .QW \fB#\fR
 649 within a bracket expression is retained
 650 .IP \(bu 3
 651 white space and comments are illegal within multi-character symbols
 652 like the ARE
 653 .QW \fB(?:\fR
 654 or the BRE
 655 .QW \fB\e(\fR
 656 .PP
 657 Expanded-syntax white-space characters are blank, tab, newline, and
 658 any character that belongs to the \fIspace\fR character class.
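.PP
A minimal sketch of the expanded syntax, selected here with the
embedded \fBx\fR option. The variable names \fBre\fR, \fBfirst\fR and
\fBsecond\fR and the input string are assumptions made for the example:
.PP
.CS
set re {(?x)
    (\ed{3})    # first group: three digits
    \-          # a literal hyphen
    (\ed{4})    # second group: four digits
}
regexp $re 555\-1234 \-> first second
# returns 1; first is 555 and second is 1234
.CE
.PP
Because the RE is quoted with braces, the comments inside it must not
contain unbalanced braces.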
 659 .PP
 660 Finally, in an ARE, outside bracket expressions, the sequence
 661 .QW \fB(?#\fIttt\fB)\fR
 662 (where \fIttt\fR is any text not containing a
 663 .QW \fB)\fR )
 664 is a comment, completely ignored. Again, this is not
 665 allowed between the characters of multi-character symbols like
 666 .QW \fB(?:\fR .
 667 Such comments are more a historical artifact than a useful facility,
 668 and their use is deprecated; use the expanded syntax instead.
 669 .PP
 670 \fINone\fR of these metasyntax extensions is available if the
 671 application (or an initial
 672 .QW \fB***=\fR
 673 director) has specified that the
 674 user's input be treated as a literal string rather than as an RE.
 675 .SH MATCHING
 676 In the event that an RE could match more than one substring of a given
 677 string, the RE matches the one starting earliest in the string. If
 678 the RE could match more than one substring starting at that point, its
 679 choice is determined by its \fIpreference\fR: either the longest
 680 substring, or the shortest.
 681 .PP
 682 Most atoms, and all constraints, have no preference. A parenthesized
 683 RE has the same preference (possibly none) as the RE. A quantified
 684 atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same
 685 preference (possibly none) as the atom itself. A quantified atom with
 686 other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with
 687 \fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom
 688 with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR
 689 with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has
 690 the same preference as the first quantified atom in it which has a
 691 preference. An RE consisting of two or more branches connected by the
 692 \fB|\fR operator prefers longest match.
 693 .PP
 694 Subject to the constraints imposed by the rules for matching the whole
 695 RE, subexpressions also match the longest or shortest possible
 696 substrings, based on their preferences, with subexpressions starting
 697 earlier in the RE taking priority over ones starting later. Note that
 698 outer subexpressions thus take priority over their component
 699 subexpressions.
 700 .PP
 701 The quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to
 702 force longest and shortest preference, respectively, on a
 703 subexpression or a whole RE.
 704 .RS
 705 .PP
 706 \fBNOTE:\fR This means that you can usually make an RE non-greedy overall by
 707 putting \fB{1,1}?\fR after one of the first non-constraint atoms or
 708 parenthesized subexpressions in it. \fIIt pays to experiment\fR with the
 709 placement of this non-greediness override on a suitable range of input texts
 710 when writing an RE of this level of complexity.
 711 .PP
 712 For example, this regular expression is non-greedy, and will match the
 713 shortest substring possible given that
 714 .QW \fBabc\fR
 715 will be matched as early as possible (the quantifier does not change that):
 716 .PP
 717 .CS
 718 ab{1,1}?c.*x.*cba
 719 .CE
 720 .PP
 721 The atom
 722 .QW \fBa\fR
 723 has no greediness preference; we explicitly give one for
 724 .QW \fBb\fR ,
 725 and the remaining quantifiers are overridden to be non-greedy by the preceding
 726 non-greedy quantifier.
 727 .RE
 728 .PP
 729 Match lengths are measured in characters, not collating elements. An
 730 empty string is considered longer than no match at all. For example,
 731 .QW \fBbb*\fR
 732 matches the three middle characters of
 733 .QW \fBabbbc\fR ,
 734 .QW \fB(week|wee)(night|knights)\fR
 735 matches all ten characters of
 736 .QW \fBweeknights\fR ,
 737 when
 738 .QW \fB(.*).*\fR
 739 is matched against
 740 .QW \fBabc\fR
 741 the parenthesized subexpression matches all three characters, and when
 742 .QW \fB(a*)*\fR
 743 is matched against
 744 .QW \fBbc\fR
 745 both the whole RE and the parenthesized subexpression match an empty string.
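.PP
The examples in the preceding paragraph can be checked with
\fBregexp \-inline\fR, which returns the overall match followed by the
match for each parenthesized subexpression:
.PP
.CS
regexp \-inline {bb*} abbbc
# returns bbb
regexp \-inline {(week|wee)(night|knights)} weeknights
# returns weeknights wee knights
regexp \-inline {(.*).*} abc
# returns abc abc
regexp \-inline {(a*)*} bc
# returns {} {}
.CE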
 746 .PP
 747 If case-independent matching is specified, the effect is much as if
 748 all case distinctions had vanished from the alphabet. When an
 749 alphabetic that exists in multiple cases appears as an ordinary
 750 character outside a bracket expression, it is effectively transformed
 751 into a bracket expression containing both cases, so that \fBx\fR
 752 becomes
 753 .QW \fB[xX]\fR .
 754 When it appears inside a bracket expression,
 755 all case counterparts of it are added to the bracket expression, so
 756 that
 757 .QW \fB[x]\fR
 758 becomes
 759 .QW \fB[xX]\fR
 760 and
 761 .QW \fB[^x]\fR
 762 becomes
 763 .QW \fB[^xX]\fR .
 764 .PP
 765 If newline-sensitive matching is specified, \fB.\fR and bracket
 766 expressions using \fB^\fR will never match the newline character (so
 767 that matches will never cross newlines unless the RE explicitly
 768 arranges it) and \fB^\fR and \fB$\fR will match the empty string after
 769 and before a newline respectively, in addition to matching at
 770 beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
 771 continue to match beginning or end of string \fIonly\fR.
 772 .PP
 773 If partial newline-sensitive matching is specified, this affects
 774 \fB.\fR and bracket expressions as with newline-sensitive matching,
 775 but not \fB^\fR and \fB$\fR.
 776 .PP
 777 If inverse partial newline-sensitive matching is specified, this
 778 affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
 779 not \fB.\fR and bracket expressions. This is not very useful but is
 780 provided for symmetry.
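.PP
The effect of newline-sensitive matching can be sketched with the
embedded \fBn\fR option. The two-line input string is an assumption
chosen for the example:
.PP
.CS
regexp \-inline \-all {^\ew+$} "one\entwo"       ;# returns an empty list
regexp \-inline \-all {(?n)^\ew+$} "one\entwo"   ;# returns one two
regexp {a.b} "a\enb"                           ;# . matches newline; returns 1
regexp {(?n)a.b} "a\enb"                       ;# returns 0
regexp {(?n)^two} "one\entwo"                  ;# returns 1
regexp {(?n)\eAtwo} "one\entwo"                 ;# returns 0
.CE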
 781 .SH "LIMITS AND COMPATIBILITY"
 782 No particular limit is imposed on the length of REs. Programs
 783 intended to be highly portable should not employ REs longer than 256
 784 bytes, as a POSIX-compliant implementation can refuse to accept such
 785 REs.
 786 .PP
 787 The only feature of AREs that is actually incompatible with POSIX EREs
 788 is that \fB\e\fR does not lose its special significance inside bracket
 789 expressions. All other ARE features use syntax which is illegal or
 790 has undefined or unspecified effects in POSIX EREs; the \fB***\fR
 791 syntax of directors likewise is outside the POSIX syntax for both BREs
 792 and EREs.
 793 .PP
 794 Many of the ARE extensions are borrowed from Perl, but some have been
 795 changed to clean them up, and a few Perl extensions are not present.
 796 Incompatibilities of note include
 797 .QW \fB\eb\fR ,
 798 .QW \fB\eB\fR ,
 799 the lack of special treatment for a trailing newline, the addition of
 800 complemented bracket expressions to the things affected by
 801 newline-sensitive matching, the restrictions on parentheses and back
 802 references in lookahead constraints, and the longest/shortest-match
 803 (rather than first-match) matching semantics.
 804 .PP
 805 The matching rules for REs containing both normal and non-greedy
 806 quantifiers have changed since early beta-test versions of this
 807 package. (The new rules are much simpler and cleaner, but do not work
 808 as hard at guessing the user's real intentions.)
 809 .PP
 810 Henry Spencer's original 1986 \fIregexp\fR package, still in
 811 widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
 812 early version of today's EREs. There are four incompatibilities
 813 between \fIregexp\fR's near-EREs
 814 .PQ RREs " for short"
 815 and AREs. In roughly increasing order of significance:
 816 .IP \(bu 3
 817 In AREs, \fB\e\fR followed by an alphanumeric character is either an
 818 escape or an error, while in RREs, it was just another way of writing
 819 the alphanumeric. This should not be a problem because there was no
 820 reason to write such a sequence in RREs.
 821 .IP \(bu 3
 822 \fB{\fR followed by a digit in an ARE is the beginning of a bound,
 823 while in RREs, \fB{\fR was always an ordinary character. Such
 824 sequences should be rare, and will often result in an error because
 825 following characters will not look like a valid bound.
 826 .IP \(bu 3
 827 In AREs, \fB\e\fR remains a special character within
 828 .QW \fB[\|]\fR ,
 829 so a literal \fB\e\fR within \fB[\|]\fR must be written
 830 .QW \fB\e\e\fR .
 831 \fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
 832 but only truly paranoid programmers routinely doubled the backslash.
 833 .IP \(bu 3
 834 AREs report the longest/shortest match for the RE, rather than the
 835 first found in a specified search order. This may affect some RREs
 836 which were written in the expectation that the first match would be
 837 reported. (The careful crafting of RREs to optimize the search order
 838 for fast matching is obsolete (AREs examine all possible matches in
 839 parallel, and their performance is largely insensitive to their
 840 complexity) but cases where the search order was exploited to
 841 deliberately find a match which was \fInot\fR the longest/shortest
 842 will need rewriting.)
 843 .SH "BASIC REGULAR EXPRESSIONS"
 844 BREs differ from EREs in several respects.
 845 .QW \fB|\fR ,
 846 .QW \fB+\fR ,
 847 and \fB?\fR are ordinary characters and there is no equivalent for their
 848 functionality. The delimiters for bounds are \fB\e{\fR and
 849 .QW \fB\e}\fR ,
 850 with \fB{\fR and \fB}\fR by themselves ordinary characters. The
 851 parentheses for nested subexpressions are \fB\e(\fR and
 852 .QW \fB\e)\fR ,
 853 with \fB(\fR and \fB)\fR by themselves ordinary
 854 characters. \fB^\fR is an ordinary character except at the beginning
 855 of the RE or the beginning of a parenthesized subexpression, \fB$\fR
 856 is an ordinary character except at the end of the RE or the end of a
 857 parenthesized subexpression, and \fB*\fR is an ordinary character if
 858 it appears at the beginning of the RE or the beginning of a
 859 parenthesized subexpression (after a possible leading
 860 .QW \fB^\fR ).
 861 Finally, single-digit back references are available, and \fB\e<\fR and
 862 \fB\e>\fR are synonyms for
 863 .QW \fB[[:<:]]\fR
 864 and
 865 .QW \fB[[:>:]]\fR
 866 respectively; no other escapes are available.
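.PP
As a rough sketch, BRE syntax can be selected from within the
\fBregexp\fR command by the embedded \fBb\fR option described in
\fBMETASYNTAX\fR. The inputs are illustrative assumptions, and the
results shown are those implied by the rules above:
.PP
.CS
regexp {(?b)a+} a+               ;# + is an ordinary character; returns 1
regexp {(?b)a+} aa               ;# returns 0
regexp {(?b)a\e{2\e}} aa           ;# \e{ and \e} delimit a bound; returns 1
regexp {(?b)\e(ab\e)\e1} abab       ;# \e( and \e) group; returns 1
.CE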
 867 .SH "SEE ALSO"
 868 RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
 869 .SH KEYWORDS
 870 match, regular expression, string
 871 .\" Local Variables:
 872 .\" mode: nroff
 873 .\" End: