Annotation of parser3/src/lib/pcre/Tech.Notes, revision 1.1
1.1 ! paf 1: Technical Notes about PCRE
! 2: --------------------------
! 3:
! 4: Many years ago I implemented some regular expression functions to an algorithm
! 5: suggested by Martin Richards. These were not Unix-like in form, and were quite
! 6: restricted in what they could do by comparison with Perl. The interesting part
! 7: about the algorithm was that the amount of space required to hold the compiled
! 8: form of an expression was known in advance. The code to apply an expression did
! 9: not operate by backtracking, as the Henry Spencer and Perl code does, but
! 10: instead checked all possibilities simultaneously by keeping a list of current
! 11: states and checking all of them as it advanced through the subject string. (In
! 12: the terminology of Jeffrey Friedl's book, it was a "DFA algorithm".) When the
! 13: pattern was all used up, all remaining states were possible matches, and the
! 14: one matching the longest substring of the subject string was chosen. This did not
! 15: necessarily maximize the individual wild portions of the pattern, as is
! 16: expected in Unix and Perl-style regular expressions.
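The state-list idea can be illustrated with a small Python sketch. This is not the Richards algorithm or anything in PCRE: the toy pattern language (literals, "." and a postfix "*") and the names parse, closure and longest_match are assumptions made only for this illustration. A state is just an index into the parsed pattern, and all live states are advanced together, one subject character at a time.

```python
def parse(pattern):
    """Split a toy pattern into (char, starred) items; only literals, '.' and '*' are supported."""
    items = []
    i = 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        items.append((pattern[i], starred))
        i += 2 if starred else 1
    return items

def closure(items, states):
    """Add every state reachable by skipping starred items (zero repetitions)."""
    result = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        if s < len(items) and items[s][1] and s + 1 not in result:
            result.add(s + 1)
            stack.append(s + 1)
    return result

def longest_match(pattern, subject):
    """Return the length of the longest prefix of subject that matches, or None."""
    items = parse(pattern)
    states = closure(items, {0})
    best = 0 if len(items) in states else None
    for pos, ch in enumerate(subject):
        advanced = set()
        for s in states:
            if s == len(items):
                continue  # pattern already used up in this state
            c, starred = items[s]
            if c == "." or c == ch:
                advanced.add(s if starred else s + 1)
        states = closure(items, advanced)
        if not states:
            break
        if len(items) in states:
            best = pos + 1  # a longer overall match has been found
    return best
```

For example, longest_match("ab*c", "abbbcx") reports a match of length 5. Note that nothing is ever retried: the cost is one pass over the subject, at the price of not knowing how much each individual wild portion matched.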
! 17:
! 18: By contrast, the code originally written by Henry Spencer and subsequently
! 19: heavily modified for Perl actually compiles the expression twice: once in a
! 20: dummy mode in order to find out how much store will be needed, and then for
! 21: real. The execution function operates by backtracking and maximizing (or,
! 22: optionally, minimizing in Perl) the amount of the subject that matches
! 23: individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's
! 24: terminology.
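By way of contrast, here is a minimal backtracking sketch in Python. It is not the Spencer or Perl code: the item representation (a character plus a repeat mode of None, "greedy" or "minimal") and the function name match_here are hypothetical. It shows how the same pattern gives different results depending on whether a wild portion is maximized or minimized.

```python
def match_here(items, i, subject, pos):
    """Try to match items[i:] at subject[pos:]; return the end position or None."""
    if i == len(items):
        return pos
    ch, repeat = items[i]
    if repeat is None:
        # A plain item must match exactly one character.
        if pos < len(subject) and ch in (".", subject[pos]):
            return match_here(items, i + 1, subject, pos + 1)
        return None
    # Find how far this repeated item could possibly extend.
    limit = pos
    while limit < len(subject) and ch in (".", subject[limit]):
        limit += 1
    # Maximizing tries the longest run first; minimizing tries the shortest first.
    if repeat == "greedy":
        candidates = range(limit, pos - 1, -1)
    else:
        candidates = range(pos, limit + 1)
    for end in candidates:
        result = match_here(items, i + 1, subject, end)
        if result is not None:
            return result  # first success wins; later candidates are never tried
    return None  # every candidate failed, so the caller backtracks
```

With items for "a.*b" and the subject "axbxb", the greedy mode returns 5 and the minimal mode returns 3.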
! 25:
! 26: For this set of functions that forms PCRE, I tried at first to invent an
! 27: algorithm that used an amount of store bounded by a multiple of the number of
! 28: characters in the pattern, to save on compiling time. However, because of the
! 29: greater complexity in Perl regular expressions, I couldn't do this. In any
! 30: case, a first pass through the pattern is needed, in order to find internal
! 31: flag settings like (?i) at top level. So it works by running a very degenerate
! 32: first pass to calculate a maximum store size, and then a second pass to do the
! 33: real compile - which may use a bit less than the predicted amount of store. The
! 34: idea is that this is going to turn out faster because the first pass is
! 35: degenerate and the second can just store stuff straight into the vector. It
! 36: does make the compiling functions bigger, of course, but they have got quite
! 37: big anyway to handle all the Perl stuff.
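The two-pass arrangement can be sketched as follows, in Python rather than C and for a toy pattern language (literal runs and single-character "*" only). The opcode values, the function names and the bound of three bytes per pattern character are assumptions for this sketch; the real first pass also collects flag settings, which is not shown.

```python
OP_END, OP_CHARS, OP_STAR = 0, 1, 2  # hypothetical opcode values

def estimate_size(pattern):
    """Degenerate first pass: at most 3 bytes per pattern character, plus OP_END."""
    return 3 * len(pattern) + 1

def compile_pattern(pattern):
    """Second pass: store items straight into a vector sized by the first pass."""
    code = bytearray(estimate_size(pattern))
    used = 0
    i = 0
    while i < len(pattern):
        if i + 1 < len(pattern) and pattern[i + 1] == "*":
            # Starred single character: opcode plus the character.
            code[used:used + 2] = bytes([OP_STAR, ord(pattern[i])])
            used += 2
            i += 2
            continue
        # Gather a run of literals, stopping before any character followed by '*'.
        start = i
        while (i < len(pattern) and i - start < 255
               and not (i + 1 < len(pattern) and pattern[i + 1] == "*")):
            i += 1
        run = pattern[start:i].encode("latin-1")
        code[used:used + 2 + len(run)] = bytes([OP_CHARS, len(run)]) + run
        used += 2 + len(run)
    code[used] = OP_END
    used += 1
    # Return the bytes actually used and the amount that was reserved.
    return bytes(code[:used]), len(code)
```

For "abc*d" the first pass reserves 16 bytes and the second pass uses 10, which matches the remark that the real compile may use a bit less than the predicted amount.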
! 38:
! 39: The compiled form of a pattern is a vector of bytes, containing items of
! 40: variable length. The first byte in an item is an opcode, and the length of the
! 41: item is either implicit in the opcode or contained in the data bytes which
! 42: follow it. A list of all the opcodes follows:
! 43:
! 44: Opcodes with no following data
! 45: ------------------------------
! 46:
! 47: These items are all just one byte long.
! 48:
! 49: OP_END                 end of pattern
! 50: OP_ANY                 match any character
! 51: OP_SOD                 match start of data: \A
! 52: OP_CIRC                ^ (start of data, or after \n in multiline)
! 53: OP_NOT_WORD_BOUNDARY   \B
! 54: OP_WORD_BOUNDARY       \b
! 55: OP_NOT_DIGIT           \D
! 56: OP_DIGIT               \d
! 57: OP_NOT_WHITESPACE      \S
! 58: OP_WHITESPACE          \s
! 59: OP_NOT_WORDCHAR        \W
! 60: OP_WORDCHAR            \w
! 61: OP_EODN                match end of data or \n at end: \Z
! 62: OP_EOD                 match end of data: \z
! 63: OP_DOLL                $ (end of data, or before \n in multiline)
! 64:
! 65:
! 66: Repeating single characters
! 67: ---------------------------
! 68:
! 69: The common repeats (*, +, ?) when applied to a single character appear as
! 70: two-byte items using the following opcodes:
! 71:
! 72: OP_STAR
! 73: OP_MINSTAR
! 74: OP_PLUS
! 75: OP_MINPLUS
! 76: OP_QUERY
! 77: OP_MINQUERY
! 78:
! 79: Those with "MIN" in their name are the minimizing versions. Each is followed by
! 80: the character that is to be repeated. Other repeats make use of
! 81:
! 82: OP_UPTO
! 83: OP_MINUPTO
! 84: OP_EXACT
! 85:
! 86: which are followed by a two-byte count (most significant first) and the
! 87: repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
! 88: non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
! 89: OP_UPTO (or OP_MINUPTO).
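As an illustration of this layout, the following Python sketch encodes a repeat such as x{2,5}. The opcode values and the function name are hypothetical. Storing maximum minus minimum as the OP_UPTO count is an inference from the statement that OP_UPTO matches from 0 upwards, not something spelled out above.

```python
import struct

OP_EXACT, OP_UPTO, OP_MINUPTO = 10, 11, 12  # hypothetical opcode values

def encode_char_repeat(char, minimum, maximum, minimize=False):
    """Encode char{minimum,maximum} for a single character with a fixed maximum."""
    code = bytearray()
    c = ord(char)
    if minimum > 0:
        # Opcode, two-byte count (most significant first), repeated character.
        code += struct.pack(">BHB", OP_EXACT, minimum, c)
    if maximum > minimum:
        upto = OP_MINUPTO if minimize else OP_UPTO
        # Assumed: the optional part covers the remaining maximum - minimum copies.
        code += struct.pack(">BHB", upto, maximum - minimum, c)
    return bytes(code)
```

Under these assumptions x{2,5} becomes eight bytes: an OP_EXACT item with count 2 followed by an OP_UPTO item with count 3.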
! 90:
! 91:
! 92: Repeating character types
! 93: -------------------------
! 94:
! 95: Repeats of things like \d are done exactly as for single characters, except
! 96: that instead of a character, the opcode for the type is stored in the data
! 97: byte. The opcodes are:
! 98:
! 99: OP_TYPESTAR
! 100: OP_TYPEMINSTAR
! 101: OP_TYPEPLUS
! 102: OP_TYPEMINPLUS
! 103: OP_TYPEQUERY
! 104: OP_TYPEMINQUERY
! 105: OP_TYPEUPTO
! 106: OP_TYPEMINUPTO
! 107: OP_TYPEEXACT
! 108:
! 109:
! 110: Matching a character string
! 111: ---------------------------
! 112:
! 113: The OP_CHARS opcode is followed by a one-byte count and then that number of
! 114: characters. If there are more than 255 characters in sequence, successive
! 115: instances of OP_CHARS are used.
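A short Python sketch of the splitting rule, with a hypothetical opcode value and function name:

```python
OP_CHARS = 1  # hypothetical opcode value

def encode_literal(text):
    """Encode a byte string as successive OP_CHARS items of at most 255 characters."""
    code = bytearray()
    for start in range(0, len(text), 255):
        chunk = text[start:start + 255]
        # One-byte count, so each item carries at most 255 characters.
        code += bytes([OP_CHARS, len(chunk)]) + chunk
    return bytes(code)
```

A 300-character string therefore becomes two items, one holding 255 characters and one holding 45, for 304 bytes in total.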
! 116:
! 117:
! 118: Character classes
! 119: -----------------
! 120:
! 121: OP_CLASS is used for a character class, provided there are at least two
! 122: characters in the class. If there is only one character, OP_CHARS is used for a
! 123: positive class, and OP_NOT for a negative one (that is, for something like
! 124: [^a]). Another set of repeating opcodes (OP_NOTSTAR etc.) is used for a
! 125: repeated, negated, single-character class. The normal ones (OP_STAR etc.) are
! 126: used for a repeated positive single-character class.
! 127:
! 128: OP_CLASS is followed by a 32-byte bit map containing a 1
! 129: bit for every character that is acceptable. The bits are counted from the least
! 130: significant end of each byte.
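The bit map can be built and tested as in this Python sketch. The function names are hypothetical, and it assumes that character code c lives in byte c/8 of the map, which the text above implies but does not state outright.

```python
def make_class_bitmap(chars):
    """Build a 32-byte map with a 1 bit for every acceptable character code."""
    bitmap = bytearray(32)
    for c in chars:
        # Byte c // 8, bit c % 8 counted from the least significant end.
        bitmap[c >> 3] |= 1 << (c & 7)
    return bytes(bitmap)

def class_matches(bitmap, c):
    """Test whether character code c is in the class."""
    return bool(bitmap[c >> 3] & (1 << (c & 7)))
```

For the class [abc], the codes 97 to 99 all fall in byte 12 of the map, which ends up holding the binary value 00001110.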
! 131:
! 132:
! 133: Back references
! 134: ---------------
! 135:
! 136: OP_REF is followed by a single byte containing the reference number.
! 137:
! 138:
! 139: Repeating character classes and back references
! 140: -----------------------------------------------
! 141:
! 142: Single-character classes are handled specially (see above). This section
! 143: applies to OP_CLASS and OP_REF. In both cases, the repeat information follows the base
! 144: item. The matching code looks at the following opcode to see if it is one of
! 145:
! 146: OP_CRSTAR
! 147: OP_CRMINSTAR
! 148: OP_CRPLUS
! 149: OP_CRMINPLUS
! 150: OP_CRQUERY
! 151: OP_CRMINQUERY
! 152: OP_CRRANGE
! 153: OP_CRMINRANGE
! 154:
! 155: All but the last two are just single-byte items. The others are followed by
! 156: four bytes of data, comprising the minimum and maximum repeat counts.
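The check that the matching code makes on the following opcode might look like this Python sketch. The opcode values and the function name are hypothetical, and the layout of the four data bytes (two bytes per count, most significant first, as for OP_UPTO) is an assumption.

```python
import struct

# Hypothetical opcode values; only the names come from the notes.
(OP_CRSTAR, OP_CRMINSTAR, OP_CRPLUS, OP_CRMINPLUS,
 OP_CRQUERY, OP_CRMINQUERY, OP_CRRANGE, OP_CRMINRANGE) = range(20, 28)

# (minimum, maximum, greedy) for the single-byte items; None means unbounded.
SIMPLE_REPEATS = {
    OP_CRSTAR: (0, None, True), OP_CRMINSTAR: (0, None, False),
    OP_CRPLUS: (1, None, True), OP_CRMINPLUS: (1, None, False),
    OP_CRQUERY: (0, 1, True), OP_CRMINQUERY: (0, 1, False),
}

def read_repeat(code, offset):
    """Return (minimum, maximum, greedy, item_length), or None if no repeat item is at offset."""
    if offset >= len(code):
        return None
    opcode = code[offset]
    if opcode in SIMPLE_REPEATS:
        return SIMPLE_REPEATS[opcode] + (1,)
    if opcode in (OP_CRRANGE, OP_CRMINRANGE):
        # Assumed layout: two-byte minimum, then two-byte maximum.
        minimum, maximum = struct.unpack_from(">HH", code, offset + 1)
        return (minimum, maximum, opcode == OP_CRRANGE, 5)
    return None  # the base item is not repeated
```

The returned item length tells the caller how far to skip to reach the opcode after the repeat.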
! 157:
! 158:
! 159: Brackets and alternation
! 160: ------------------------
! 161:
! 162: A pair of non-identifying (round) brackets is wrapped round each expression at
! 163: compile time, so alternation always happens in the context of brackets.
! 164: Non-identifying brackets use the opcode OP_BRA, while identifying brackets use
! 165: OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English
! 166: speakers, including myself, can be round, square, or curly. Hence this usage.]
! 167:
! 168: A bracket opcode is followed by two bytes which give the offset to the next
! 169: alternative OP_ALT or, if there aren't any branches, to the matching KET
! 170: opcode. Each OP_ALT is followed by two bytes giving the offset to the next one,
! 171: or to the KET opcode.
! 172:
! 173: OP_KET is used for subpatterns that do not repeat indefinitely, while
! 174: OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
! 175: maximally respectively. All three are followed by two bytes giving (as a
! 176: positive number) the offset back to the matching BRA opcode.
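The linking of BRA, ALT and KET can be sketched in Python as below. The opcode values and function names are hypothetical, and the sketch assumes that each offset is measured from the position of the opcode that carries it, which the description above does not pin down.

```python
import struct

OP_ALT, OP_KET, OP_BRA = 60, 61, 70  # hypothetical opcode values

def assemble_group(branches):
    """Wrap already-compiled branch bodies in BRA ... ALT ... KET with two-byte links."""
    code = bytearray()
    for index, body in enumerate(branches):
        opcode = OP_BRA if index == 0 else OP_ALT
        # Link to the next ALT or the KET: 3 header bytes plus the body.
        code += struct.pack(">BH", opcode, 3 + len(body)) + body
    # KET stores the distance back to the BRA, which sits at offset 0 here.
    code += struct.pack(">BH", OP_KET, len(code))
    return bytes(code)

def split_branches(code, start=0):
    """Follow the links from a BRA at start and return the branch bodies."""
    bodies = []
    pos = start
    while code[pos] != OP_KET:
        (link,) = struct.unpack_from(">H", code, pos + 1)
        bodies.append(code[pos + 3:pos + link])
        pos += link  # jump directly to the next ALT or to the KET
    return bodies
```

Because the walker jumps from link to link, it never has to inspect the contents of a branch, so nested groups inside a branch are skipped without being parsed.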
! 177:
! 178: If a subpattern is quantified such that it is permitted to match zero times, it
! 179: is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte
! 180: opcodes which tell the matcher that skipping this subpattern entirely is a
! 181: valid branch.
! 182:
! 183: A subpattern with an indefinite maximum repetition is replicated in the
! 184: compiled data its minimum number of times (or once with a BRAZERO if the
! 185: minimum is zero), with the final copy terminating with a KETRMIN or KETRMAX as
! 186: appropriate.
! 187:
! 188: A subpattern with a bounded maximum repetition is replicated in a nested
! 189: fashion up to the maximum number of times, with BRAZERO or BRAMINZERO before
! 190: each replication after the minimum, so that, for example, (abc){2,5} is
! 191: compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 200-bracket limit does not
! 192: apply to these internally generated brackets.
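The nested replication can be shown at the level of pattern text with a small Python sketch (the function name is hypothetical; the real compiler replicates compiled code, not text):

```python
def expand_bounded(sub, minimum, maximum):
    """Return the replicated, nested form of (sub){minimum,maximum} as pattern text."""
    group = "(" + sub + ")"
    optional = ""
    for k in range(maximum - minimum):
        if k == 0:
            optional = group + "?"  # innermost optional copy
        else:
            optional = "(" + group + optional + ")?"  # wrap one more level
    # Mandatory copies first, then the nested optional copies.
    return group * minimum + optional
```

Calling expand_bounded("abc", 2, 5) reproduces the form quoted above, (abc)(abc)((abc)((abc)(abc)?)?)?. The nesting means that once one optional copy fails to match, none of the deeper copies is attempted.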
! 193:
! 194:
! 195: Assertions
! 196: ----------
! 197:
! 198: Forward assertions are just like other subpatterns, but starting with one of
! 199: the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
! 200: OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
! 201: is OP_REVERSE, followed by a two-byte count of the number of characters to move
! 202: back the pointer in the subject string. A separate count is present in each
! 203: alternative of a lookbehind assertion, allowing them to have different fixed
! 204: lengths.
! 205:
! 206:
! 207: Once-only subpatterns
! 208: ---------------------
! 209:
! 210: These are also just like other subpatterns, but they start with the opcode
! 211: OP_ONCE.
! 212:
! 213:
! 214: Conditional subpatterns
! 215: -----------------------
! 216:
! 217: These are like other subpatterns, but they start with the opcode OP_COND. If
! 218: the condition is a back reference, this is stored at the start of the
! 219: subpattern using the opcode OP_CREF followed by one byte containing the
! 220: reference number. Otherwise, a conditional subpattern will always start with
! 221: one of the assertions.
! 222:
! 223:
! 224: Changing options
! 225: ----------------
! 226:
! 227: If any of the /i, /m, or /s options are changed within a parenthesized group,
! 228: an OP_OPT opcode is compiled, followed by one byte containing the new settings
! 229: of these flags. If there are several alternatives in a group, there is an
! 230: occurrence of OP_OPT at the start of all those following the first options
! 231: change, to set appropriate options for the start of the alternative.
! 232: Immediately after the end of the group there is another such item to reset the
! 233: flags to their previous values. Other changes of flag within the pattern can be
! 234: handled entirely at compile time, and so do not cause anything to be put into
! 235: the compiled data.
! 236:
! 237:
! 238: Philip Hazel
! 239: January 1999