Annotation of parser3/src/pcre/README, revision 1.1

1.1     ! paf         1: README file for PCRE (Perl-compatible regular expressions)
        !             2: ----------------------------------------------------------
        !             3: 
        !             4: *******************************************************************************
        !             5: *           IMPORTANT FOR THOSE UPGRADING FROM VERSIONS BEFORE 2.00           *
        !             6: *                                                                             *
        !             7: * Please note that there has been a change in the API such that a larger      *
        !             8: * ovector is required at matching time, to provide some additional workspace. *
        !             9: * The new man page has details. This change was necessary in order to support *
        !            10: * some of the new functionality in Perl 5.005.                                *
        !            11: *                                                                             *
        !            12: *           IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.00                   *
        !            13: *                                                                             *
        !            14: * Another (I hope this is the last!) change has been made to the API for the  *
        !            15: * pcre_compile() function. An additional argument has been added to make it   *
        !            16: * possible to pass over a pointer to character tables built in the current    *
        !            17: * locale by pcre_maketables(). To use the default tables, this new argument   *
        !            18: * should be passed as NULL.                                                   *
        !            19: *                                                                             *
        !            20: *           IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.05                   *
        !            21: *                                                                             *
        !            22: * Yet another (and again I hope this really is the last) change has been made *
        !            23: * to the API for the pcre_exec() function. An additional argument has been    *
        !            24: * added to make it possible to start the match other than at the start of the *
        !            25: * subject string. This is important if there are lookbehinds. The new man     *
        !            26: * page has the details, but if you just want to convert existing programs,    *
        !            27: * all you need to do is to stick in a new fifth argument to pcre_exec(), with *
        !            28: * a value of zero. For example, change                                        *
        !            29: *                                                                             *
        !            30: *   pcre_exec(pattern, extra, subject, length, options, ovec, ovecsize)       *
        !            31: * to                                                                          *
        !            32: *   pcre_exec(pattern, extra, subject, length, 0, options, ovec, ovecsize)    *
        !            33: *******************************************************************************
        !            34: 
        !            35: 
        !            36: The distribution should contain the following files:
        !            37: 
        !            38:   ChangeLog         log of changes to the code
        !            39:   LICENCE           conditions for the use of PCRE
        !            40:   Makefile          for building PCRE in Unix systems
        !            41:   README            this file
        !            42:   RunTest           a Unix shell script for running tests
        !            43:   Tech.Notes        notes on the encoding
        !            44:   pcre.3            man page source for the functions
        !            45:   pcre.3.txt        plain text version
        !            46:   pcre.3.html       HTML version
        !            47:   pcreposix.3       man page source for the POSIX wrapper API
        !            48:   pcreposix.3.txt   plain text version
        !            49:   pcreposix.3.HTML  HTML version
        !            50:   dftables.c        auxiliary program for building chartables.c
        !            51:   get.c             )
        !            52:   maketables.c      )
        !            53:   study.c           ) source of
        !            54:   pcre.c            )   the functions
        !            55:   pcreposix.c       )
        !            56:   pcre.h            header for the external API
        !            57:   pcreposix.h       header for the external POSIX wrapper API
        !            58:   internal.h        header for internal use
        !            59:   pcretest.c        test program
        !            60:   pgrep.1           man page source for pgrep
        !            61:   pgrep.1.txt       plain text version
        !            62:   pgrep.1.HTML      HTML version
        !            63:   pgrep.c           source of a grep utility that uses PCRE
        !            64:   perltest          Perl test program
        !            65:   testinput1        test data, compatible with Perl 5.004 and 5.005
        !            66:   testinput2        test data for error messages and non-Perl things
        !            67:   testinput3        test data, compatible with Perl 5.005
        !            68:   testinput4        test data for locale-specific tests
        !            69:   testoutput1       test results corresponding to testinput1
        !            70:   testoutput2       test results corresponding to testinput2
        !            71:   testoutput3       test results corresponding to testinput3
        !            72:   testoutput4       test results corresponding to testinput4
        !            73:   dll.mk            for Win32 DLL
        !            74:   pcre.def          ditto
        !            75: 
        !            76: To build PCRE on a Unix system, first edit Makefile for your system. It is a
        !            77: fairly simple make file, and there are some comments near the top, after the
        !            78: text "On a Unix system". Then run "make". It builds two libraries called
        !            79: libpcre.a and libpcreposix.a, a test program called pcretest, and the pgrep
        !            80: command. You can use "make install" to copy these, and the public header file
        !            81: pcre.h, to appropriate live directories on your system. These installation
        !            82: directories are defined at the top of the Makefile, and you should edit them if
        !            83: necessary.
        !            84: 
        !            85: For a non-Unix system, read the comments at the top of Makefile, which give
        !            86: some hints on what needs to be done. PCRE has been compiled on Windows systems
        !            87: and on Macintoshes, but I don't know the details as I don't use those systems.
        !            88: It should be straightforward to build PCRE on any system that has a Standard C
        !            89: compiler.
        !            90: 
        !            91: Some help in building a Win32 DLL of PCRE in GnuWin32 environments was
        !            92: contributed by Paul.Sokolovsky@technologist.com. These environments are
        !            93: Mingw32 (http://www.xraylith.wisc.edu/~khan/software/gnu-win32/) and
        !            94: CygWin  (http://sourceware.cygnus.com/cygwin/). Paul comments:
        !            95: 
        !            96:   For CygWin, set CFLAGS=-mno-cygwin, and do 'make dll'. You'll get
        !            97:   pcre.dll (containing pcreposix also), libpcre.dll.a, and dynamically
        !            98:   linked pgrep and pcretest. If you have /bin/sh, run RunTest (three
        !            99:   main test go ok, locale not supported).
        !           100: 
        !           101: To test PCRE, run the RunTest script in the pcre directory. This can also be
        !           102: run by "make runtest". It runs the pcretest test program (which is documented
        !           103: below) on each of the testinput files in turn, and compares the output with the
        !           104: contents of the corresponding testoutput file. A file called testtry is used to
        !           105: hold the output from pcretest. To run pcretest on just one of the test files,
        !           106: give its number as an argument to RunTest, for example:
        !           107: 
        !           108:   RunTest 3
        !           109: 
        !           110: The first and third test files can also be fed directly into the perltest
        !           111: script to check that Perl gives the same results. The third file requires the
        !           112: additional features of release 5.005, which is why it is kept separate from the
        !           113: main test input, which needs only Perl 5.004. In the long run, when 5.005 is
        !           114: widespread, these two test files may get amalgamated.
        !           115: 
        !           116: The second set of tests checks pcre_info(), pcre_study(), pcre_copy_substring(),
        !           117: pcre_get_substring(), pcre_get_substring_list(), error detection and run-time
        !           118: flags that are specific to PCRE, as well as the POSIX wrapper API.
        !           119: 
        !           120: The fourth set of tests checks pcre_maketables(), the facility for building a
        !           121: set of character tables for a specific locale and using them instead of the
        !           122: default tables. The tests make use of the "fr" (French) locale. Before running
        !           123: the test, the script checks for the presence of this locale by running the
        !           124: "locale" command. If that command fails, or if it doesn't include "fr" in the
        !           125: list of available locales, the fourth test cannot be run, and a comment is
        !           126: output to say why. If running this test produces instances of the error
        !           127: 
        !           128:   ** Failed to set locale "fr"
        !           129: 
        !           130: in the comparison output, it means that locale is not available on your system,
        !           131: despite being listed by "locale". This does not mean that PCRE is broken.
        !           132: 
        !           133: PCRE has its own native API, but a set of "wrapper" functions that are based on
        !           134: the POSIX API are also supplied in the library libpcreposix.a. Note that this
        !           135: just provides a POSIX calling interface to PCRE: the regular expressions
        !           136: themselves still follow Perl syntax and semantics. The header file
        !           137: for the POSIX-style functions is called pcreposix.h. The official POSIX name is
        !           138: regex.h, but I didn't want to risk possible problems with existing files of
        !           139: that name by distributing it that way. To use it with an existing program that
        !           140: uses the POSIX API, it will have to be renamed or pointed at by a link.
        !           141: 
        !           142: 
        !           143: Character tables
        !           144: ----------------
        !           145: 
        !           146: PCRE uses four tables for manipulating and identifying characters. The final
        !           147: argument of the pcre_compile() function is a pointer to a block of memory
        !           148: containing the concatenated tables. A call to pcre_maketables() can be used to
        !           149: generate a set of tables in the current locale. If the final argument for
        !           150: pcre_compile() is passed as NULL, a set of default tables that is built into
        !           151: the binary is used.
        !           152: 
        !           153: The source file called chartables.c contains the default set of tables. This is
        !           154: not supplied in the distribution, but is built by the program dftables
        !           155: (compiled from dftables.c), which uses the ANSI C character handling functions
        !           156: such as isalnum(), isalpha(), isupper(), islower(), etc. to build the table
        !           157: sources. This means that the default C locale which is set for your system will
        !           158: control the contents of these default tables. You can change the default tables
        !           159: by editing chartables.c and then re-building PCRE. If you do this, you should
        !           160: probably also edit Makefile to ensure that the file doesn't ever get
        !           161: re-generated.
        !           162: 
        !           163: The first two 256-byte tables provide lower casing and case flipping functions,
        !           164: respectively. The next table consists of three 32-byte bit maps which identify
        !           165: digits, "word" characters, and white space, respectively. These are used when
        !           166: building 32-byte bit maps that represent character classes.
        !           167: 
        !           168: The final 256-byte table has bits indicating various character types, as
        !           169: follows:
        !           170: 
        !           171:     1   white space character
        !           172:     2   letter
        !           173:     4   decimal digit
        !           174:     8   hexadecimal digit
        !           175:    16   alphanumeric or '_'
        !           176:   128   regular expression metacharacter or binary zero
        !           177: 
        !           178: You should not alter the set of characters that have the 128 bit set, as that
        !           179: will cause PCRE to malfunction.
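
The lower casing, case flipping, and character type tables described above can
be sketched with the ANSI C character handling functions. This is only an
illustration, not the dftables.c source: the function name build_tables and the
CT_* macro names are invented here, the three bit maps are left out, and the
128 bit is not set because this README does not list the metacharacters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Bit values for the final 256-byte table, as listed in the text above.
   The macro names are invented for this sketch. */
#define CT_SPACE    1   /* white space character */
#define CT_LETTER   2   /* letter */
#define CT_DIGIT    4   /* decimal digit */
#define CT_XDIGIT   8   /* hexadecimal digit */
#define CT_WORD    16   /* alphanumeric or '_' */

/* Fill three 256-byte tables from the current C locale:
   lower casing, case flipping, and character types. */
void build_tables(unsigned char lower[256], unsigned char flip[256],
                  unsigned char types[256])
{
    int c;

    assert(lower != NULL && flip != NULL && types != NULL);
    for (c = 0; c < 256; c++) {
        unsigned char t = 0;

        lower[c] = (unsigned char)tolower(c);
        flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));

        if (isspace(c))  t |= CT_SPACE;
        if (isalpha(c))  t |= CT_LETTER;
        if (isdigit(c))  t |= CT_DIGIT;
        if (isxdigit(c)) t |= CT_XDIGIT;
        if (isalnum(c) || c == '_') t |= CT_WORD;
        types[c] = t;   /* the 128 (metacharacter) bit is deliberately omitted */
    }
}
```

Because the tables come from isalpha(), tolower() and friends, calling
setlocale() before build_tables() changes their contents, which is the same
dependency on the locale that the text describes for the default tables.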
        !           180: 
        !           181: 
        !           182: The pcretest program
        !           183: --------------------
        !           184: 
        !           185: This program is intended for testing PCRE, but it can also be used for
        !           186: experimenting with regular expressions.
        !           187: 
        !           188: If it is given two filename arguments, it reads from the first and writes to
        !           189: the second. If it is given only one filename argument, it reads from that file
        !           190: and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and
        !           191: prompts for each line of input.
        !           192: 
        !           193: The program handles any number of sets of input in a single input file. Each
        !           194: set starts with a regular expression, and continues with any number of data
        !           195: lines to be matched against the pattern. An empty line signals the end of the
        !           196: set. The regular expressions are given enclosed in any non-alphameric
        !           197: delimiters other than backslash, for example
        !           198: 
        !           199:   /(a|bc)x+yz/
        !           200: 
        !           201: White space before the initial delimiter is ignored. A regular expression may
        !           202: be continued over several input lines, in which case the newline characters are
        !           203: included within it. See the testinput files for many examples. It is possible
        !           204: to include the delimiter within the pattern by escaping it, for example
        !           205: 
        !           206:   /abc\/def/
        !           207: 
        !           208: If you do so, the escape and the delimiter form part of the pattern, but since
        !           209: delimiters are always non-alphameric, this does not affect its interpretation.
        !           210: If the terminating delimiter is immediately followed by a backslash, for
        !           211: example,
        !           212: 
        !           213:   /abc/\
        !           214: 
        !           215: then a backslash is added to the end of the pattern. This is done to provide a
        !           216: way of testing the error condition that arises if a pattern finishes with a
        !           217: backslash, because
        !           218: 
        !           219:   /abc\/
        !           220: 
        !           221: is interpreted as the first line of a pattern that starts with "abc/", causing
        !           222: pcretest to read the next line as a continuation of the regular expression.
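
The delimiter rules above can be illustrated with a small scanner. This is a
sketch under assumptions, not code taken from pcretest.c: the function name
find_closing_delimiter is invented, and it looks at a single input line only.
It shows why /abc\/ has no terminating delimiter on its line, while /abc/\ does.

```c
#include <assert.h>
#include <stddef.h>

/* Given a line that starts with its opening delimiter, return a pointer to
   the closing delimiter, or NULL if the line ends first (in which case the
   pattern would continue on the next input line). */
const char *find_closing_delimiter(const char *line)
{
    char delim;
    const char *p;

    assert(line != NULL && line[0] != '\0');
    delim = line[0];
    for (p = line + 1; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0')
            p++;            /* escaped character: stays part of the pattern */
        else if (*p == delim)
            return p;       /* first unescaped delimiter ends the pattern */
    }
    return NULL;
}
```

A caller could then check whether the character after the returned pointer is
a backslash to detect the /abc/\ case, and treat anything else that follows as
modifiers.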
        !           223: 
        !           224: The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
        !           225: PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
        !           226: example:
        !           227: 
        !           228:   /caseless/i
        !           229: 
        !           230: These modifier letters have the same effect as they do in Perl. There are
        !           231: others which set PCRE options that do not correspond to anything in Perl: /A,
        !           232: /E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.
        !           233: 
        !           234: Searching for all possible matches within each subject string can be requested
        !           235: by the /g or /G modifier. After finding a match, PCRE is called again to search
        !           236: the remainder of the subject string. The difference between /g and /G is that
        !           237: the former uses the startoffset argument to pcre_exec() to start searching at
        !           238: a new point within the entire string (which is in effect what Perl does),
        !           239: whereas the latter passes over a shortened substring. This makes a difference
        !           240: to the matching process if the pattern begins with a lookbehind assertion
        !           241: (including \b or \B).
        !           242: 
        !           243: If any call to pcre_exec() in a /g or /G sequence matches an empty string, the
        !           244: next call is done with the PCRE_NOTEMPTY flag set so that it cannot match an
        !           245: empty string again. This imitates the way Perl handles such cases when using
        !           246: the /g modifier or the split() function.
        !           247: 
        !           248: There are a number of other modifiers for controlling the way pcretest
        !           249: operates.
        !           250: 
        !           251: The /+ modifier requests that as well as outputting the substring that matched
        !           252: the entire pattern, pcretest should in addition output the remainder of the
        !           253: subject string. This is useful for tests where the subject contains multiple
        !           254: copies of the same substring.
        !           255: 
        !           256: The /L modifier must be followed directly by the name of a locale, for example,
        !           257: 
        !           258:   /pattern/Lfr
        !           259: 
        !           260: For this reason, it must be the last modifier letter. The given locale is set,
        !           261: pcre_maketables() is called to build a set of character tables for the locale,
        !           262: and this is then passed to pcre_compile() when compiling the regular
        !           263: expression. Without an /L modifier, NULL is passed as the tables pointer; that
        !           264: is, /L applies only to the expression on which it appears.
        !           265: 
        !           266: The /I modifier requests that pcretest output information about the compiled
        !           267: expression (whether it is anchored, has a fixed first character, and so on). It
        !           268: does this by calling pcre_info() after compiling an expression, and outputting
        !           269: the information it gets back. If the pattern is studied, the results of that
        !           270: are also output.
        !           271: 
        !           272: The /D modifier is a PCRE debugging feature, which also assumes /I. It causes
        !           273: the internal form of compiled regular expressions to be output after
        !           274: compilation.
        !           275: 
        !           276: The /S modifier causes pcre_study() to be called after the expression has been
        !           277: compiled, and the results used when the expression is matched.
        !           278: 
        !           279: The /M modifier causes the size of memory block used to hold the compiled
        !           280: pattern to be output.
        !           281: 
        !           282: Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API
        !           283: rather than its native API. When this is done, all other modifiers except /i,
        !           284: /m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is
        !           285: set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,
        !           286: and PCRE_DOTALL unless REG_NEWLINE is set.
        !           287: 
        !           288: Before each data line is passed to pcre_exec(), leading and trailing whitespace
        !           289: is removed, and it is then scanned for \ escapes. The following are recognized:
        !           290: 
        !           291:   \a     alarm (= BEL)
        !           292:   \b     backspace
        !           293:   \e     escape
        !           294:   \f     formfeed
        !           295:   \n     newline
        !           296:   \r     carriage return
        !           297:   \t     tab
        !           298:   \v     vertical tab
        !           299:   \nnn   octal character (up to 3 octal digits)
        !           300:   \xhh   hexadecimal character (up to 2 hex digits)
        !           301: 
        !           302:   \A     pass the PCRE_ANCHORED option to pcre_exec()
        !           303:   \B     pass the PCRE_NOTBOL option to pcre_exec()
        !           304:   \Cdd   call pcre_copy_substring() for substring dd after a successful match
        !           305:            (any decimal number less than 32)
        !           306:   \Gdd   call pcre_get_substring() for substring dd after a successful match
        !           307:            (any decimal number less than 32)
        !           308:   \L     call pcre_get_substring_list() after a successful match
        !           309:   \N     pass the PCRE_NOTEMPTY option to pcre_exec()
        !           310:   \Odd   set the size of the output vector passed to pcre_exec() to dd
        !           311:            (any number of decimal digits)
        !           312:   \Z     pass the PCRE_NOTEOL option to pcre_exec()
        !           313: 
        !           314: A backslash followed by anything else just escapes the anything else. If the
        !           315: very last character is a backslash, it is ignored. This gives a way of passing
        !           316: an empty line as data, since a real empty line terminates the data input.
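
The escape processing for data lines can be sketched as follows. This is an
illustration written for this README, not the pcretest.c source: the name
decode_data_line is invented, stripping of leading and trailing white space is
left to the caller, ESC is assumed to be the ASCII value 27, and the option
escapes (\A, \B, \Cdd, \Gdd, \L, \N, \Odd, \Z) are only consumed here rather
than acted on.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Decode the backslash escapes in one data line. The decoded bytes are
   written to out, which must be at least as long as the input, and the
   number of bytes written is returned. */
size_t decode_data_line(const char *in, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        int c = (unsigned char)*in++;
        int i;

        if (c != '\\') { out[n++] = (unsigned char)c; continue; }
        if (*in == '\0') break;          /* a final backslash is ignored */
        c = (unsigned char)*in++;
        switch (c) {
        case 'a': out[n++] = '\a'; break;
        case 'b': out[n++] = '\b'; break;
        case 'e': out[n++] = 27;   break;   /* escape, assuming ASCII */
        case 'f': out[n++] = '\f'; break;
        case 'n': out[n++] = '\n'; break;
        case 'r': out[n++] = '\r'; break;
        case 't': out[n++] = '\t'; break;
        case 'v': out[n++] = '\v'; break;
        case '0': case '1': case '2': case '3':
        case '4': case '5': case '6': case '7':
            c -= '0';                    /* up to 3 octal digits in total */
            for (i = 0; i < 2 && *in >= '0' && *in <= '7'; i++)
                c = c * 8 + (*in++ - '0');
            out[n++] = (unsigned char)c;
            break;
        case 'x':
            c = 0;                       /* up to 2 hexadecimal digits */
            for (i = 0; i < 2 && isxdigit((unsigned char)*in); i++) {
                int d = tolower((unsigned char)*in++);
                c = c * 16 + (isdigit(d) ? d - '0' : d - 'a' + 10);
            }
            out[n++] = (unsigned char)c;
            break;
        case 'A': case 'B': case 'L': case 'N': case 'Z':
            break;                       /* option escapes: nothing is output */
        case 'C': case 'G': case 'O':
            while (isdigit((unsigned char)*in)) in++;   /* skip the number dd */
            break;
        default:
            out[n++] = (unsigned char)c; /* anything else escapes itself */
            break;
        }
    }
    return n;
}
```

With this sketch, a data line consisting of a single backslash decodes to zero
bytes, which matches the way an empty subject string is passed as data.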
        !           317: 
        !           318: If /P was present on the regex, causing the POSIX wrapper API to be used, only
        !           319: \B and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
        !           320: regexec() respectively.
        !           321: 
        !           322: When a match succeeds, pcretest outputs the list of captured substrings that
        !           323: pcre_exec() returns, starting with number 0 for the string that matched the
        !           324: whole pattern. Here is an example of an interactive pcretest run.
        !           325: 
        !           326:   $ pcretest
        !           327:   PCRE version 2.06 08-Jun-1999
        !           328: 
        !           329:     re> /^abc(\d+)/
        !           330:   data> abc123
        !           331:    0: abc123
        !           332:    1: 123
        !           333:   data> xyz
        !           334:   No match
        !           335: 
        !           336: If the strings contain any non-printing characters, they are output as \0x
        !           337: escapes. If the pattern has the /+ modifier, then the output for substring 0 is
        !           338: followed by the rest of the subject string, identified by "0+" like this:
        !           339: 
        !           340:     re> /cat/+
        !           341:   data> cataract
        !           342:    0: cat
        !           343:    0+ aract
        !           344: 
        !           345: If the pattern has the /g or /G modifier, the results of successive matching
        !           346: attempts are output in sequence, like this:
        !           347: 
        !           348:     re> /\Bi(\w\w)/g
        !           349:   data> Mississippi
        !           350:    0: iss
        !           351:    1: ss
        !           352:    0: iss
        !           353:    1: ss
        !           354:    0: ipp
        !           355:    1: pp
        !           356: 
        !           357: "No match" is output only if the first match attempt fails.
        !           358: 
        !           359: If any of \C, \G, or \L are present in a data line that is successfully
        !           360: matched, the substrings extracted by the convenience functions are output with
        !           361: C, G, or L after the string number instead of a colon. This is in addition to
        !           362: the normal full list. The string length (that is, the return from the
        !           363: extraction function) is given in parentheses after each string for \C and \G.
        !           364: 
        !           365: Note that while patterns can be continued over several lines (a plain ">"
        !           366: prompt is used for continuations), data lines may not. However, newlines can be
        !           367: included in data by means of the \n escape.
        !           368: 
        !           369: If the -p option is given to pcretest, it is equivalent to adding /P to each
        !           370: regular expression: the POSIX wrapper API is used to call PCRE. None of the
        !           371: following flags has any effect in this case.
        !           372: 
        !           373: If the option -d is given to pcretest, it is equivalent to adding /D to each
        !           374: regular expression: the internal form is output after compilation.
        !           375: 
        !           376: If the option -i is given to pcretest, it is equivalent to adding /I to each
        !           377: regular expression: information about the compiled pattern is given after
        !           378: compilation.
        !           379: 
        !           380: If the option -m is given to pcretest, it outputs the size of each compiled
        !           381: pattern after it has been compiled. It is equivalent to adding /M to each
        !           382: regular expression. For compatibility with earlier versions of pcretest, -s is
        !           383: a synonym for -m.
        !           384: 
        !           385: If the -t option is given, each compile, study, and match is run 20000 times
        !           386: while being timed, and the resulting time per compile or match is output in
        !           387: milliseconds. Do not set -t with -m (or its synonym -s), because you will then
        !           388: get the size output 20000 times and the timing will be distorted. If you want
        !           389: to change the number of repetitions used for timing, edit the definition of
        !           390: LOOPREPEAT at the top of pcretest.c.
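
The timing scheme that -t describes amounts to a repeat loop around the call
being measured. The sketch below is an assumption about its general shape, not
the pcretest.c code: time_per_call_ms, sample_work, and call_count are invented
names, and only LOOPREPEAT and its value of 20000 come from the text above.

```c
#include <assert.h>
#include <time.h>

#define LOOPREPEAT 20000   /* repetition count quoted in the text above */

static long call_count = 0;

/* Hypothetical stand-in for one compile, study, or match call. */
static void sample_work(void)
{
    call_count++;
}

/* Run fn LOOPREPEAT times and return the mean CPU time per call in
   milliseconds, measured with the Standard C clock() function. */
double time_per_call_ms(void (*fn)(void))
{
    clock_t start;
    int i;

    assert(fn != NULL);
    start = clock();
    for (i = 0; i < LOOPREPEAT; i++)
        fn();
    return ((double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC) / LOOPREPEAT;
}
```

Anything the measured call prints is printed on every repetition, which is why
combining the size output with timing produces 20000 copies and distorts the
result.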
        !           391: 
        !           392: 
        !           393: 
        !           394: The perltest program
        !           395: --------------------
        !           396: 
        !           397: The perltest program tests Perl's regular expressions; it has the same
        !           398: specification as pcretest, and so can be given identical input, except that
        !           399: input patterns can be followed only by Perl's lower case modifiers. The
        !           400: contents of testinput1 and testinput3 meet this condition.
        !           401: 
        !           402: The data lines are processed as Perl double-quoted strings, so if they contain
        !           403: " \ $ or @ characters, these have to be escaped. For this reason, all such
        !           404: characters in testinput1 and testinput3 are escaped so that they can be used
        !           405: for perltest as well as for pcretest, and the special upper case modifiers such
        !           406: as /A that pcretest recognizes are not used in these files. The output should
        !           407: be identical, apart from the initial identifying banner.
        !           408: 
        !           409: The testinput2 and testinput4 files are not suitable for feeding to perltest,
        !           410: since they do make use of the special upper case modifiers and escapes that
        !           411: pcretest uses to test some features of PCRE. The first of these files also
        !           412: contains malformed regular expressions, in order to check that PCRE diagnoses
        !           413: them correctly.
        !           414: 
        !           415: Philip Hazel <ph10@cam.ac.uk>
        !           416: July 1999