Annotation of parser3/src/pcre/README, revision 1.1
1.1 ! paf 1: README file for PCRE (Perl-compatible regular expressions)
! 2: ----------------------------------------------------------
! 3:
! 4: *******************************************************************************
! 5: * IMPORTANT FOR THOSE UPGRADING FROM VERSIONS BEFORE 2.00 *
! 6: * *
! 7: * Please note that there has been a change in the API such that a larger *
! 8: * ovector is required at matching time, to provide some additional workspace. *
! 9: * The new man page has details. This change was necessary in order to support *
! 10: * some of the new functionality in Perl 5.005. *
! 11: * *
! 12: * IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.00 *
! 13: * *
! 14: * Another (I hope this is the last!) change has been made to the API for the *
! 15: * pcre_compile() function. An additional argument has been added to make it *
! 16: * possible to pass over a pointer to character tables built in the current *
! 17: * locale by pcre_maketables(). To use the default tables, this new argument *
! 18: * should be passed as NULL. *
! 19: * *
! 20: * IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.05 *
! 21: * *
! 22: * Yet another (and again I hope this really is the last) change has been made *
! 23: * to the API for the pcre_exec() function. An additional argument has been *
! 24: * added to make it possible to start the match other than at the start of the *
! 25: * subject string. This is important if there are lookbehinds. The new man *
! 26: * page has the details, but if you just want to convert existing programs, all *
! 27: * you need to do is to stick in a new fifth argument to pcre_exec(), with a *
! 28: * value of zero. For example, change *
! 29: * *
! 30: * pcre_exec(pattern, extra, subject, length, options, ovec, ovecsize) *
! 31: * to *
! 32: * pcre_exec(pattern, extra, subject, length, 0, options, ovec, ovecsize) *
! 33: *******************************************************************************
! 34:
! 35:
! 36: The distribution should contain the following files:
! 37:
! 38: ChangeLog log of changes to the code
! 39: LICENCE conditions for the use of PCRE
! 40: Makefile for building PCRE in Unix systems
! 41: README this file
! 42: RunTest a Unix shell script for running tests
! 43: Tech.Notes notes on the encoding
! 44: pcre.3 man page source for the functions
! 45: pcre.3.txt plain text version
! 46: pcre.3.html HTML version
! 47: pcreposix.3 man page source for the POSIX wrapper API
! 48: pcreposix.3.txt plain text version
! 49: pcreposix.3.HTML HTML version
! 50: dftables.c auxiliary program for building chartables.c
! 51: get.c )
! 52: maketables.c )
! 53: study.c ) source of
! 54: pcre.c ) the functions
! 55: pcreposix.c )
! 56: pcre.h header for the external API
! 57: pcreposix.h header for the external POSIX wrapper API
! 58: internal.h header for internal use
! 59: pcretest.c test program
! 60: pgrep.1 man page source for pgrep
! 61: pgrep.1.txt plain text version
! 62: pgrep.1.HTML HTML version
! 63: pgrep.c source of a grep utility that uses PCRE
! 64: perltest Perl test program
! 65: testinput1 test data, compatible with Perl 5.004 and 5.005
! 66: testinput2 test data for error messages and non-Perl things
! 67: testinput3 test data, compatible with Perl 5.005
! 68: testinput4 test data for locale-specific tests
! 69: testoutput1 test results corresponding to testinput1
! 70: testoutput2 test results corresponding to testinput2
! 71: testoutput3 test results corresponding to testinput3
! 72: testoutput4 test results corresponding to testinput4
! 73: dll.mk for Win32 DLL
! 74: pcre.def ditto
! 75:
! 76: To build PCRE on a Unix system, first edit Makefile for your system. It is a
! 77: fairly simple make file, and there are some comments near the top, after the
! 78: text "On a Unix system". Then run "make". It builds two libraries called
! 79: libpcre.a and libpcreposix.a, a test program called pcretest, and the pgrep
! 80: command. You can use "make install" to copy these, and the public header file
! 81: pcre.h, to appropriate live directories on your system. These installation
! 82: directories are defined at the top of the Makefile, and you should edit them if
! 83: necessary.
! 84:
! 85: For a non-Unix system, read the comments at the top of Makefile, which give
! 86: some hints on what needs to be done. PCRE has been compiled on Windows systems
! 87: and on Macintoshes, but I don't know the details as I don't use those systems.
! 88: It should be straightforward to build PCRE on any system that has a Standard C
! 89: compiler.
! 90:
! 91: Some help in building a Win32 DLL of PCRE in GnuWin32 environments was
! 92: contributed by Paul.Sokolovsky@technologist.com. These environments are
! 93: Mingw32 (http://www.xraylith.wisc.edu/~khan/software/gnu-win32/) and
! 94: CygWin (http://sourceware.cygnus.com/cygwin/). Paul comments:
! 95:
! 96: For CygWin, set CFLAGS=-mno-cygwin, and do 'make dll'. You'll get
! 97: pcre.dll (containing pcreposix also), libpcre.dll.a, and dynamically
! 98: linked pgrep and pcretest. If you have /bin/sh, run RunTest (three
! 99: main test go ok, locale not supported).
! 100:
! 101: To test PCRE, run the RunTest script in the pcre directory. This can also be
! 102: run by "make runtest". It runs the pcretest test program (which is documented
! 103: below) on each of the testinput files in turn, and compares the output with the
! 104: contents of the corresponding testoutput file. A file called testtry is used to
! 105: hold the output from pcretest. To run pcretest on just one of the test files,
! 106: give its number as an argument to RunTest, for example:
! 107:
! 108: RunTest 3
! 109:
! 110: The first and third test files can also be fed directly into the perltest
! 111: script to check that Perl gives the same results. The third file requires the
! 112: additional features of release 5.005, which is why it is kept separate from the
! 113: main test input, which needs only Perl 5.004. In the long run, when 5.005 is
! 114: widespread, these two test files may get amalgamated.
! 115:
! 116: The second set of tests checks pcre_info(), pcre_study(), pcre_copy_substring(),
! 117: pcre_get_substring(), pcre_get_substring_list(), error detection and run-time
! 118: flags that are specific to PCRE, as well as the POSIX wrapper API.
! 119:
! 120: The fourth set of tests checks pcre_maketables(), the facility for building a
! 121: set of character tables for a specific locale and using them instead of the
! 122: default tables. The tests make use of the "fr" (French) locale. Before running
! 123: the test, the script checks for the presence of this locale by running the
! 124: "locale" command. If that command fails, or if it doesn't include "fr" in the
! 125: list of available locales, the fourth test cannot be run, and a comment is
! 126: output to say why. If running this test produces instances of the error
! 127:
! 128: ** Failed to set locale "fr"
! 129:
! 130: in the comparison output, it means that locale is not available on your system,
! 131: despite being listed by "locale". This does not mean that PCRE is broken.
! 132:
! 133: PCRE has its own native API, but a set of "wrapper" functions that are based on
! 134: the POSIX API are also supplied in the library libpcreposix.a. Note that this
! 135: just provides a POSIX calling interface to PCRE: the regular expressions
! 136: themselves still follow Perl syntax and semantics. The header file
! 137: for the POSIX-style functions is called pcreposix.h. The official POSIX name is
! 138: regex.h, but I didn't want to risk possible problems with existing files of
! 139: that name by distributing it that way. To use it with an existing program that
! 140: uses the POSIX API, it will have to be renamed or pointed at by a link.
! 141:
! 142:
! 143: Character tables
! 144: ----------------
! 145:
! 146: PCRE uses four tables for manipulating and identifying characters. The final
! 147: argument of the pcre_compile() function is a pointer to a block of memory
! 148: containing the concatenated tables. A call to pcre_maketables() can be used to
! 149: generate a set of tables in the current locale. If the final argument for
! 150: pcre_compile() is passed as NULL, a set of default tables that is built into
! 151: the binary is used.
! 152:
! 153: The source file called chartables.c contains the default set of tables. This is
! 154: not supplied in the distribution, but is built by the program dftables
! 155: (compiled from dftables.c), which uses the ANSI C character handling functions
! 156: such as isalnum(), isalpha(), isupper(), islower(), etc. to build the table
! 157: sources. This means that the default C locale which is set for your system will
! 158: control the contents of these default tables. You can change the default tables
! 159: by editing chartables.c and then re-building PCRE. If you do this, you should
! 160: probably also edit Makefile to ensure that the file doesn't ever get
! 161: re-generated.
! 162:
! 163: The first two 256-byte tables provide lower casing and case flipping functions,
! 164: respectively. The next table consists of three 32-byte bit maps which identify
! 165: digits, "word" characters, and white space, respectively. These are used when
! 166: building 32-byte bit maps that represent character classes.
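
As an illustration of the 32-byte bit maps described above, here is a minimal
sketch in C. It is not taken from the PCRE sources: the names MAP_BYTES,
map_set, map_test and build_maps are hypothetical, and the bit order within
each byte (least significant bit first) is an assumption. The maps are filled
from the ANSI C character functions, so their contents follow the current C
locale.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAP_BYTES 32   /* 256 bits, one per character code (hypothetical name) */

/* Set the bit for character code c in a 32-byte map. */
void map_set(unsigned char *map, int c)
{
    assert(c >= 0 && c < 256);
    map[c / 8] |= (unsigned char)(1 << (c % 8));   /* assumed bit order */
}

/* Return non-zero if character code c is present in the map. */
int map_test(const unsigned char *map, int c)
{
    assert(c >= 0 && c < 256);
    return (map[c / 8] & (1 << (c % 8))) != 0;
}

/* Fill the digit, "word" and white space maps from the current C locale. */
void build_maps(unsigned char *digits, unsigned char *words,
                unsigned char *spaces)
{
    int c;
    memset(digits, 0, MAP_BYTES);
    memset(words, 0, MAP_BYTES);
    memset(spaces, 0, MAP_BYTES);
    for (c = 0; c < 256; c++) {
        if (isdigit(c)) map_set(digits, c);
        if (isalnum(c) || c == '_') map_set(words, c);
        if (isspace(c)) map_set(spaces, c);
    }
}
```

A character class such as [\d\s] could then be formed by OR-ing the digit and
white space maps together byte by byte.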
! 167:
! 168: The final 256-byte table has bits indicating various character types, as
! 169: follows:
! 170:
! 171: 1 white space character
! 172: 2 letter
! 173: 4 decimal digit
! 174: 8 hexadecimal digit
! 175: 16 alphanumeric or '_'
! 176: 128 regular expression metacharacter or binary zero
! 177:
! 178: You should not alter the set of characters that contain the 128 bit, as that
! 179: will cause PCRE to malfunction.
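
The following C sketch shows how tables of this shape could be generated from
the ANSI C character functions. It is an illustration only, not the code in
dftables.c: the CT_* names and both function names are hypothetical, and the
list of metacharacters is an assumed example, not PCRE's actual list.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Bit values from the list above; the CT_* names are hypothetical. */
enum {
    CT_SPACE = 1, CT_LETTER = 2, CT_DIGIT = 4,
    CT_XDIGIT = 8, CT_WORD = 16, CT_META = 128
};

/* Build 256-byte lower casing and case flipping tables. */
void build_case_tables(unsigned char *lower, unsigned char *flip)
{
    int c;
    for (c = 0; c < 256; c++) {
        lower[c] = (unsigned char)tolower(c);
        flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
    }
}

/* Build the 256-byte character type table. */
void build_ctypes(unsigned char *ctypes)
{
    /* Assumed metacharacter set, for illustration only. */
    static const char meta[] = "*+?{^.$|()[";
    int c;
    for (c = 0; c < 256; c++) {
        unsigned char x = 0;
        if (isspace(c)) x |= CT_SPACE;
        if (isalpha(c)) x |= CT_LETTER;
        if (isdigit(c)) x |= CT_DIGIT;
        if (isxdigit(c)) x |= CT_XDIGIT;
        if (isalnum(c) || c == '_') x |= CT_WORD;
        /* Binary zero is flagged together with the metacharacters. */
        if (c == 0 || strchr(meta, c) != NULL) x |= CT_META;
        ctypes[c] = x;
    }
}
```

In the default "C" locale this gives, for example, the value 26 (2 + 8 + 16)
for the letter 'a', which is a letter, a hexadecimal digit, and a "word"
character.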
! 180:
! 181:
! 182: The pcretest program
! 183: --------------------
! 184:
! 185: This program is intended for testing PCRE, but it can also be used for
! 186: experimenting with regular expressions.
! 187:
! 188: If it is given two filename arguments, it reads from the first and writes to
! 189: the second. If it is given only one filename argument, it reads from that file
! 190: and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and
! 191: prompts for each line of input.
! 192:
! 193: The program handles any number of sets of input on a single input file. Each
! 194: set starts with a regular expression, and continues with any number of data
! 195: lines to be matched against the pattern. An empty line signals the end of the
! 196: set. The regular expressions are given enclosed in any non-alphameric
! 197: delimiters other than backslash, for example
! 198:
! 199: /(a|bc)x+yz/
! 200:
! 201: White space before the initial delimiter is ignored. A regular expression may
! 202: be continued over several input lines, in which case the newline characters are
! 203: included within it. See the testinput files for many examples. It is possible
! 204: to include the delimiter within the pattern by escaping it, for example
! 205:
! 206: /abc\/def/
! 207:
! 208: If you do so, the escape and the delimiter form part of the pattern, but since
! 209: delimiters are always non-alphameric, this does not affect its interpretation.
! 210: If the terminating delimiter is immediately followed by a backslash, for
! 211: example,
! 212:
! 213: /abc/\
! 214:
! 215: then a backslash is added to the end of the pattern. This is done to provide a
! 216: way of testing the error condition that arises if a pattern finishes with a
! 217: backslash, because
! 218:
! 219: /abc\/
! 220:
! 221: is interpreted as the first line of a pattern that starts with "abc/", causing
! 222: pcretest to read the next line as a continuation of the regular expression.
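
The delimiter rules above can be sketched in C as follows. This is a
simplified illustration, not the code in pcretest.c: the function name
find_closing_delimiter is hypothetical, and it assumes that leading white
space has already been skipped so that the first character of the line is the
opening delimiter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the index of the terminating delimiter in line, or -1 if the
   pattern is not terminated on this line and so continues on the next.
   line[0] must be the opening delimiter. */
int find_closing_delimiter(const char *line)
{
    char delim = line[0];
    size_t i;

    /* Delimiters are non-alphameric and may not be a backslash. */
    assert(delim != '\0' && delim != '\\' && !isalnum((unsigned char)delim));

    for (i = 1; line[i] != '\0'; i++) {
        if (line[i] == '\\' && line[i + 1] != '\0') {
            i++;   /* skip the escaped character; both stay in the pattern */
        } else if (line[i] == delim) {
            return (int)i;
        }
    }
    return -1;
}
```

With this sketch, /abc\/def/ is terminated by its final slash, /abc/\ is
terminated by the slash before the backslash, and /abc\/ is reported as
unterminated.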
! 223:
! 224: The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
! 225: PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
! 226: example:
! 227:
! 228: /caseless/i
! 229:
! 230: These modifier letters have the same effect as they do in Perl. There are
! 231: others which set PCRE options that do not correspond to anything in Perl: /A,
! 232: /E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.
! 233:
! 234: Searching for all possible matches within each subject string can be requested
! 235: by the /g or /G modifier. After finding a match, PCRE is called again to search
! 236: the remainder of the subject string. The difference between /g and /G is that
! 237: the former uses the startoffset argument to pcre_exec() to start searching at
! 238: a new point within the entire string (which is in effect what Perl does),
! 239: whereas the latter passes over a shortened substring. This makes a difference
! 240: to the matching process if the pattern begins with a lookbehind assertion
! 241: (including \b or \B).
! 242:
! 243: If any call to pcre_exec() in a /g or /G sequence matches an empty string, the
! 244: next call is done with the PCRE_NOTEMPTY flag set so that it cannot match an
! 245: empty string again. This imitates the way Perl handles such cases when using
! 246: the /g modifier or the split() function.
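
To make the difference between /g and /G concrete, here is a small C sketch
that does not call PCRE at all. The function find_Bi_ww is a hypothetical
stand-in for pcre_exec(), hard-wired to the pattern \Bi(\w\w); the two
counting functions are also hypothetical. The /g loop keeps the whole subject
and moves the start offset, whereas the /G loop passes a shortened substring,
so the character before the new start is no longer visible to \B.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

static int is_word(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Stand-in matcher for \Bi(\w\w): return the offset of the first match at
   or after startoffset, or -1 if there is none. */
static int find_Bi_ww(const char *subject, int length, int startoffset)
{
    int i;
    for (i = startoffset; i + 2 < length; i++) {
        if (subject[i] != 'i') continue;
        /* 'i' is a word character, so \B needs a word character before it.
           The start of the subject counts as a boundary. */
        if (i == 0 || !is_word(subject[i - 1])) continue;
        if (is_word(subject[i + 1]) && is_word(subject[i + 2])) return i;
    }
    return -1;
}

/* /g style: same subject each time, advancing start offset. */
int count_matches_g(const char *s)
{
    int len = (int)strlen(s), start = 0, n = 0, pos;
    while ((pos = find_Bi_ww(s, len, start)) >= 0) {
        n++;
        start = pos + 3;   /* every match is 3 characters long */
    }
    return n;
}

/* /G style: pass over the remainder as a new, shorter subject. */
int count_matches_G(const char *s)
{
    int len = (int)strlen(s), n = 0, pos;
    while ((pos = find_Bi_ww(s, len, 0)) >= 0) {
        n++;
        s += pos + 3;
        len -= pos + 3;
    }
    return n;
}
```

For the subject "Mississippi", the /g style finds three matches (iss, iss,
ipp). The /G style finds only two (iss, ipp), because after the first match
the remainder "issippi" begins with 'i', and \B cannot match at the start of
a subject.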
! 247:
! 248: There are a number of other modifiers for controlling the way pcretest
! 249: operates.
! 250:
! 251: The /+ modifier requests that as well as outputting the substring that matched
! 252: the entire pattern, pcretest should in addition output the remainder of the
! 253: subject string. This is useful for tests where the subject contains multiple
! 254: copies of the same substring.
! 255:
! 256: The /L modifier must be followed directly by the name of a locale, for example,
! 257:
! 258: /pattern/Lfr
! 259:
! 260: For this reason, it must be the last modifier letter. The given locale is set,
! 261: pcre_maketables() is called to build a set of character tables for the locale,
! 262: and this is then passed to pcre_compile() when compiling the regular
! 263: expression. Without an /L modifier, NULL is passed as the tables pointer; that
! 264: is, /L applies only to the expression on which it appears.
! 265:
! 266: The /I modifier requests that pcretest output information about the compiled
! 267: expression (whether it is anchored, has a fixed first character, and so on). It
! 268: does this by calling pcre_info() after compiling an expression, and outputting
! 269: the information it gets back. If the pattern is studied, the results of that
! 270: are also output.
! 271:
! 272: The /D modifier is a PCRE debugging feature, which also assumes /I. It causes
! 273: the internal form of compiled regular expressions to be output after
! 274: compilation.
! 275:
! 276: The /S modifier causes pcre_study() to be called after the expression has been
! 277: compiled, and the results used when the expression is matched.
! 278:
! 279: The /M modifier causes the size of the memory block used to hold the compiled
! 280: pattern to be output.
! 281:
! 282: Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API
! 283: rather than its native API. When this is done, all other modifiers except /i,
! 284: /m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is
! 285: set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,
! 286: and PCRE_DOTALL unless REG_NEWLINE is set.
! 287:
! 288: Before each data line is passed to pcre_exec(), leading and trailing whitespace
! 289: is removed, and it is then scanned for \ escapes. The following are recognized:
! 290:
! 291: \a alarm (= BEL)
! 292: \b backspace
! 293: \e escape
! 294: \f formfeed
! 295: \n newline
! 296: \r carriage return
! 297: \t tab
! 298: \v vertical tab
! 299: \nnn octal character (up to 3 octal digits)
! 300: \xhh hexadecimal character (up to 2 hex digits)
! 301:
! 302: \A pass the PCRE_ANCHORED option to pcre_exec()
! 303: \B pass the PCRE_NOTBOL option to pcre_exec()
! 304: \Cdd call pcre_copy_substring() for substring dd after a successful match
! 305: (any decimal number less than 32)
! 306: \Gdd call pcre_get_substring() for substring dd after a successful match
! 307: (any decimal number less than 32)
! 308: \L call pcre_get_substring_list() after a successful match
! 309: \N pass the PCRE_NOTEMPTY option to pcre_exec()
! 310: \Odd set the size of the output vector passed to pcre_exec() to dd
! 311: (any number of decimal digits)
! 312: \Z pass the PCRE_NOTEOL option to pcre_exec()
! 313:
! 314: A backslash followed by anything else just escapes the anything else. If the
! 315: very last character is a backslash, it is ignored. This gives a way of passing
! 316: an empty line as data, since a real empty line terminates the data input.
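
The character escapes in the first group above can be decoded along the
following lines. This is a simplified sketch, not the code in pcretest.c: the
function name decode_data_line is hypothetical, white space trimming is left
out, and the option escapes (\A, \B, \C, \G, \L, \N, \O and \Z) are not
handled, so the input is assumed not to contain them.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Decode the character escapes in src into dst and return the number of
   bytes written. dst must have room for at least strlen(src) bytes. The
   result is not zero-terminated, because \0 may appear in the data. */
size_t decode_data_line(const char *src, unsigned char *dst)
{
    size_t n = 0;
    while (*src != '\0') {
        int c = (unsigned char)*src++;
        if (c == '\\') {
            int i;
            if (*src == '\0') break;      /* a final backslash is ignored */
            c = (unsigned char)*src++;
            switch (c) {
            case 'a': c = 7; break;       /* alarm (BEL) */
            case 'b': c = '\b'; break;
            case 'e': c = 27; break;      /* escape */
            case 'f': c = '\f'; break;
            case 'n': c = '\n'; break;
            case 'r': c = '\r'; break;
            case 't': c = '\t'; break;
            case 'v': c = '\v'; break;
            case 'x':                     /* up to 2 hex digits */
                c = 0;
                for (i = 0; i < 2 && isxdigit((unsigned char)*src); i++, src++) {
                    int d = tolower((unsigned char)*src);
                    c = c * 16 + (isdigit(d) ? d - '0' : d - 'a' + 10);
                }
                break;
            default:
                if (c >= '0' && c <= '7') {   /* up to 3 octal digits */
                    c -= '0';
                    for (i = 1; i < 3 && *src >= '0' && *src <= '7'; i++, src++)
                        c = c * 8 + (*src - '0');
                }
                /* Anything else stands for itself. */
                break;
            }
        }
        dst[n++] = (unsigned char)c;
    }
    return n;
}
```

A data line consisting of a single backslash decodes to zero bytes, which is
how an empty subject can be supplied.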
! 317:
! 318: If /P was present on the regex, causing the POSIX wrapper API to be used, only
! 319: \B and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
! 320: regexec() respectively.
! 321:
! 322: When a match succeeds, pcretest outputs the list of captured substrings that
! 323: pcre_exec() returns, starting with number 0 for the string that matched the
! 324: whole pattern. Here is an example of an interactive pcretest run.
! 325:
! 326: $ pcretest
! 327: PCRE version 2.06 08-Jun-1999
! 328:
! 329: re> /^abc(\d+)/
! 330: data> abc123
! 331: 0: abc123
! 332: 1: 123
! 333: data> xyz
! 334: No match
! 335:
! 336: If the strings contain any non-printing characters, they are output as \0x
! 337: escapes. If the pattern has the /+ modifier, then the output for substring 0 is
! 338: followed by the rest of the subject string, identified by "0+" like this:
! 339:
! 340: re> /cat/+
! 341: data> cataract
! 342: 0: cat
! 343: 0+ aract
! 344:
! 345: If the pattern has the /g or /G modifier, the results of successive matching
! 346: attempts are output in sequence, like this:
! 347:
! 348: re> /\Bi(\w\w)/g
! 349: data> Mississippi
! 350: 0: iss
! 351: 1: ss
! 352: 0: iss
! 353: 1: ss
! 354: 0: ipp
! 355: 1: pp
! 356:
! 357: "No match" is output only if the first match attempt fails.
! 358:
! 359: If any of \C, \G, or \L are present in a data line that is successfully
! 360: matched, the substrings extracted by the convenience functions are output with
! 361: C, G, or L after the string number instead of a colon. This is in addition to
! 362: the normal full list. The string length (that is, the return from the
! 363: extraction function) is given in parentheses after each string for \C and \G.
! 364:
! 365: Note that while patterns can be continued over several lines (a plain ">"
! 366: prompt is used for continuations), data lines may not. However, newlines can be
! 367: included in data by means of the \n escape.
! 368:
! 369: If the -p option is given to pcretest, it is equivalent to adding /P to each
! 370: regular expression: the POSIX wrapper API is used to call PCRE. None of the
! 371: following flags has any effect in this case.
! 372:
! 373: If the option -d is given to pcretest, it is equivalent to adding /D to each
! 374: regular expression: the internal form is output after compilation.
! 375:
! 376: If the option -i is given to pcretest, it is equivalent to adding /I to each
! 377: regular expression: information about the compiled pattern is given after
! 378: compilation.
! 379:
! 380: If the option -m is given to pcretest, it outputs the size of each compiled
! 381: pattern after it has been compiled. It is equivalent to adding /M to each
! 382: regular expression. For compatibility with earlier versions of pcretest, -s is
! 383: a synonym for -m.
! 384:
! 385: If the -t option is given, each compile, study, and match is run 20000 times
! 386: while being timed, and the resulting time per compile or match is output in
! 387: milliseconds. Do not set -t with -s, because you will then get the size output
! 388: 20000 times and the timing will be distorted. If you want to change the number
! 389: of repetitions used for timing, edit the definition of LOOPREPEAT at the top of
! 390: pcretest.c.
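
The kind of repeated timing described for -t can be sketched in C as below.
This is not the code in pcretest.c: the use of clock(), the function names
ms_per_call and sample_work, and the callback interface are all assumptions
made for illustration. Only the name LOOPREPEAT and the count of 20000 come
from the description above.

```c
#include <assert.h>
#include <time.h>

#define LOOPREPEAT 20000   /* number of repetitions used for timing */

/* Run fn() the given number of times and return the average processor
   time per call in milliseconds. */
double ms_per_call(void (*fn)(void), int repeats)
{
    clock_t start, stop;
    int i;

    assert(repeats > 0);
    start = clock();
    for (i = 0; i < repeats; i++) fn();
    stop = clock();
    return ((double)(stop - start) * 1000.0 / CLOCKS_PER_SEC) / repeats;
}

/* Trivial workload standing in for a compile, study, or match call. */
static volatile unsigned long sink;
void sample_work(void)
{
    sink += 1;
}
```

Anything that produces output inside the timed loop, such as the size report
from -m, is repeated on every iteration and adds to the measured time.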
! 391:
! 392:
! 393:
! 394: The perltest program
! 395: --------------------
! 396:
! 397: The perltest program tests Perl's regular expressions; it has the same
! 398: specification as pcretest, and so can be given identical input, except that
! 399: input patterns can be followed only by Perl's lower case modifiers. The
! 400: contents of testinput1 and testinput3 meet this condition.
! 401:
! 402: The data lines are processed as Perl double-quoted strings, so if they contain
! 403: " \ $ or @ characters, these have to be escaped. For this reason, all such
! 404: characters in testinput1 and testinput3 are escaped so that they can be used
! 405: for perltest as well as for pcretest, and the special upper case modifiers such
! 406: as /A that pcretest recognizes are not used in these files. The output should
! 407: be identical, apart from the initial identifying banner.
! 408:
! 409: The testinput2 and testinput4 files are not suitable for feeding to perltest,
! 410: since they do make use of the special upper case modifiers and escapes that
! 411: pcretest uses to test some features of PCRE. The first of these files also
! 412: contains malformed regular expressions, in order to check that PCRE diagnoses
! 413: them correctly.
! 414:
! 415: Philip Hazel <ph10@cam.ac.uk>
! 416: July 1999