From 0471cd5f62a4dc85902e4cf11c21933ac9684c52 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Sun, 9 Jan 2005 20:08:50 +0000 Subject: [PATCH] Clarify description of greedy and non-greedy POSIX regular expressions, per discussion in Nov 2004 with Ken Tanzer. --- doc/src/sgml/func.sgml | 124 +++++++++++++++++++++++++++++++++++++------------ 1 file changed, 94 insertions(+), 30 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 759c67df31..c01fad599a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1,5 +1,5 @@ @@ -3772,45 +3772,109 @@ substring('foobar' from 'o(.)b') o In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, - its choice is determined by its preference: - either the longest substring, or the shortest. + either the longest possible match or the shortest possible match will + be taken, depending on whether the RE is greedy or + non-greedy. - Most atoms, and all constraints, have no preference. - A parenthesized RE has the same preference (possibly none) as the RE. - A quantified atom with quantifier - {m} - or - {m}? - has the same preference (possibly none) as the atom itself. - A quantified atom with other normal quantifiers (including - {m,n} - with m equal to n) - prefers longest match. - A quantified atom with other non-greedy quantifiers (including - {m,n}? - with m equal to n) - prefers shortest match. - A branch has the same preference as the first quantified atom in it - which has a preference. - An RE consisting of two or more branches connected by the - | operator prefers longest match. + Whether an RE is greedy or not is determined by the following rules: + + + + Most atoms, and all constraints, have no greediness attribute (because + they cannot match variable amounts of text anyway). + + + + + Adding parentheses around an RE does not change its greediness. + + + + + A quantified atom with a fixed-repetition quantifier + ({m} + or + {m}?) + has the same greediness (possibly none) as the atom itself. + + + + + A quantified atom with other normal quantifiers (including + {m,n} + with m equal to n) + is greedy (prefers longest match). + + + + + A quantified atom with a non-greedy quantifier (including + {m,n}? + with m equal to n) + is non-greedy (prefers shortest match). + + + + + A branch — that is, an RE that has no top-level + | operator — has the same greediness as the first + quantified atom in it that has a greediness attribute. + + + + + An RE consisting of two or more branches connected by the + | operator is always greedy. + + + + + + + The above rules associate greediness attributes not only with individual + quantified atoms, but with branches and entire REs that contain quantified + atoms. What that means is that the matching is done in such a way that + the branch, or whole RE, matches the longest or shortest possible + substring as a whole. Once the length of the entire match + is determined, the part of it that matches any particular subexpression + is determined on the basis of the greediness attribute of that + subexpression, with subexpressions starting earlier in the RE taking + priority over ones starting later. + + + + An example of what this means: + +SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})'); +Result: 123 +SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})'); +Result: 1 + + In the first case, the RE as a whole is greedy because Y* + is greedy. It can match beginning at the Y, and it matches + the longest possible string starting there, i.e., Y123. + The output is the parenthesized part of that, or 123. + In the second case, the RE as a whole is non-greedy because Y*? + is non-greedy. It can match beginning at the Y, and it matches + the shortest possible string starting there, i.e., Y1. + The subexpression [0-9]{1,3} is greedy but it cannot change + the decision as to the overall match length; so it is forced to match + just 1. - Subject to the constraints imposed by the rules for matching the whole RE, - subexpressions also match the longest or shortest possible substrings, - based on their preferences, - with subexpressions starting earlier in the RE taking priority over - ones starting later. - Note that outer subexpressions thus take priority over - their component subexpressions. + In short, when an RE contains both greedy and non-greedy subexpressions, + the total match length is either as long as possible or as short as + possible, according to the attribute assigned to the whole RE. The + attributes assigned to the subexpressions only affect how much of that + match they are allowed to eat relative to each other. The quantifiers {1,1} and {1,1}? - can be used to force longest and shortest preference, respectively, + can be used to force greediness or non-greediness, respectively, on a subexpression or a whole RE. -- 2.11.0