现在的位置: 首页 > 综合 > 正文

[技术学习]GRETA: 一个vc++实现的支持Perl正则表达式的程序库

2012年07月06日 ⁄ 综合 ⁄ 共 16303字 ⁄ 字号 评论关闭

 

 

Copyright Eric Niebler, 2002

 

 

 

The purpose of this document is to describe how to use the GRETA Regular Expression Template Archive.  It describes the objects in the library, the methods defined on the objects, and the ways to use the objects and methods to perform regular expression pattern matching on strings in C++.  It does not describe regular expression syntax.  It is enough to say that the full Perl 5 syntax is supported.  If you are not familiar with Perl’s regular expression syntax, I recommend reading Chapter 2 of Programming Perl, 2nd Ed. (a.k.a. The Camel Book), one of the many fine books put out by O’Reilly publishers. 


GRETA:  The GRETA Regular Expression Template Archive. 1

Overview.. 3

A Word about Speed. 4

Notice to Users of Version 1.x. 5

The rpattern Object 5

rpattern::string_type. 6

rpattern::rpattern. 6

rpattern::match. 7

rpattern::substitute. 8

rpattern::count 9

rpattern::split 9

rpattern::set_substitution. 9

rpattern::cgroups. 10

match_results, subst_results and split_results. 10

match_results::cbackrefs. 10

match_results::backref. 10

match_results::rstart 10

match_results::rlength. 11

match_results::all_backrefs. 11

subst_results::backref_str 11

split_results::strings. 11

The Syntax Module. 12

register_intrinsic_charset 14

Customizing Your Search. 14

NOCASE.. 15

GLOBAL.. 15

MULTILINE.. 15

SINGLELINE.. 15

EXTENDED.. 15

RIGHTMOST. 15

NOBACKREFS. 15

ALLBACKREFS. 16

FIRSTBACKREFS. 16

NORMALIZE.. 16

Matching Modes. 16

MODE_FAST. 17

MODE_SAFE.. 17

MODE_MIXED.. 17

Known Issues and Perl Incompatibilities. 17

Embedded Code in a Regular Expression. 17

Pattern Modifier Scope. 17

Comment Blocks Before Quantifiers. 18

Variable Width Look-Behind Assertions. 18

Recursive Patterns. 18

Compile-Time Switches. 19

REGEX_WIDE_AND_NARROW... 19

REGEX_POSIX.. 19

REGEX_NO_PERL.. 19

REGEX_DEBUG.. 19

REGEX_DEBUG_HEAP. 19

REGEX_STACK_ALIGNMENT. 19

REGEX_FOLD_INSTANTIATIONS. 20

REGEX_TO_INSTANTIATE.. 20

Miscellaneous. 20

Static Const Patterns. 20

Thread-safety. 21

Stack Usage. 21

DBCS. 22

STL.. 22

VC7 and Managed Code. 22

Template Instantiation. 22

Contact Information. 23

Appendix 1: History. 24

Appendix 2: Implementation Details. 27

 

The regular expression template library contains objects and functions that make it possible to perform pattern matching and substitution on strings in C++.  They are:

  • rpattern: the pattern to use during the search.
  • match_results/subst_results: container for the results of a match/substitution.

To perform a search or replace operation, you will typically first initialize an rpattern object by giving it a string representing the pattern against which to match.  You will then call a method on the rpattern object (match() or substitute(), for instance), passing it a string to match against and a match_results objects to receive the results of the match.  If the match()/substitute() fails, the method returns false.  If it succeeds, it returns true, and the match_results object stores the resulting array of backreferences internally.  (Here, the term backreference has the same meaning as it does in Perl.  Backreferences provide extra information about what parts of the pattern matched which parts of the string.)  There are methods on the match_results object to make the backreference information available.  For example:

 

#include <iostream>

#include <string>

#include “regexpr2.h”

using namespace std;

using namespace regex;

 

int main() {

    match_results results;

    string str( “The book cost $12.34” );

    rpattern pat( “//$(//d+)(//.(//d//d))?” ); 

    // Match a dollar sign followed by one or more digits,

    // optionally followed by a period and two more digits.

    // The double-escapes are necessary to satisfy the compiler.

 

    match_results::backref_type br = pat.match( str, results );

    if( br.matched ) {

        cout << “match success!” << endl;

        cout << “price: ” << br << endl;

    } else {

        cout << “match failed!” << endl;

    }

    return 0;

}

 

The above program would print out the following:

 

match success!

price: $12.34

 

The following sections discuss the rpattern object in detail and how to customize your searches to be faster and more efficient.

 

Note: all declarations in the header file (regexpr2.h) are contained in the regex namespace.  To use any of the objects, methods or enumerations described in this document, you must prepend all declarations with “regex::” or you must have the “using namespace regex;” directive somewhere within the enclosing scope of your declarations.  For simplicity, I’ve left off the “regex::” prefixes in the rest of my code snippets.

 

Different regex engines are good on different types of patterns.  That said, I have found my regex engine to be pretty quick.  For a benchmark, I matched the pattern “^([0-9]+)(/-| |$)(.*)$” against the string “100- this is a line of ftp response which contains a message string”.  GRETA is about 7 times faster than the regex library in boost (http://www.boost.org), and about 10 times faster than the regular expression classes in ATL7.  For this input, GRETA is even faster than Perl, although Perl is faster for some other patterns.  Most regex engines I have seen build up an NFA (non-deterministic finite state automaton) and execute it iteratively, often with a big, slow switch statement.  I have a different approach: patterns are compiled into a directed, possibly cyclic graph, and matching happens by traversing this graph recursively.  In addition, the code makes heavy use of templates to freeze the state of the flags into the compiled pattern so that they don’t need to be checked at match time.  The result is a pretty lean blob of code that can match your pattern quickly.

 

Even the best algorithms have their weaknesses, though.  Matching regular expressions with backreferences is an NP-complete problem.  There are patterns that will make any backtracking regex engine take exponential time to finish.  (These usually involve nested quantifiers.)  If you have a performance critical app, you would be smart to test your patterns for speed, or profile your app to make sure you are not spending too much time thrashing around in the regex code.  You’ve been warned!

 

Also, see the section VC7 and Managed Code for some advice for compiling GRETA under VC7.

 

Many things have changed since version 1.x of the Regular Expression Template Archive.  If you have code which uses version 1.x, you will not be able to use version 2 without making changes to your code.  Sorry!  There were a number of unsafe, unintuitive interface features of version 1 that I felt were worth fixing for version 2.  If you need version 1, I have a copy and I’d be happy to give it to you.

 

Most notably, the regexpr object has gone away.  It was a subclass of std::string, with match() and substitute() methods, and it stored the results of the match/substitute internally.  Subclassing std::string is dangerous because std::string doesn’t have a virtual destructor.  Also, matching is conceptually a const operation, and it seemed wrong that it should change internal state.

 

The match/count/substitute methods have moved to the rpattern object.  The state that used to be stored in the regexpr object is now put in a match_results/subst_results container, which is passed as an out parameter to the match/substitute methods.

 

Also, the CSTRINGS flag has gone away.  It is no longer necessary to optimize a pattern for use with C-style NULL-terminated strings.  When you pass a C-style string to the rpattern::match method, the same optimization is used automatically.  (In early 2.X versions of the library, there was a basic_rpattern_c object for performing this optimization, but it is no longer necessary and has been deprecated.)

 

Another minor change involves the register_intrinsic_charset() method.  It used to be a part of rpattern’s interface, but it has moved to the syntax module.

 

Despite the sweeping interface changes, the majority of the back-end code is unchanged.  You should expect patterns that worked in version 1.x to continue to work in version 2.

The rpattern object contains the regular expression pattern against which to match.  It also exposes the match(), substitute(), and count() methods you will use to perform regular expression matches.  When you instantiate an rpattern object, the pattern is “compiled” into a structure that speeds up pattern matching.  Once compiled, you may reuse the same pattern for multiple match operations.

 

Here is how rpattern is declared:

 

template< typename CI,

          typename SY = perl_syntax<std::iterator_traits<CI>::value_type> >

class basic_rpattern {

};

typedef basic_rpattern<std::basic_string<TCHAR>::const_iterator> rpattern;

typedef basic_rpattern<TCHAR const *> rpattern_c;

 

The rpattern class is a template on iterator type.  It is also a template on the syntax module.  By default, the Perl syntax module is used, but you are free to write your own syntax and specify it as a template parameter.  See the section on the Syntax Module.

 

The following sections describe the methods available on the rpattern object.

rpattern::string_type

rpattern::string_type is a typedef that is used in many of the following function prototypes.  It is defined as follows:

 

typedef CI const_iterator;

typedef std::iterator_traits<const_iterator>::value_type char_type;

typedef std::basic_string<char_type> string_type;

 

The typedef is a little complicated, but its effect is what you would expect.  If the result of dereferencing a const_iterator is a char, then string_type is the same as std::string.  If dereferencing a const_iterator results in a wchar_t, then string_type is the same as std::wstring.

There are two constructors for instantiating an rpattern object.  Here are their prototypes:

 

rpattern::rpattern(

const string_type & pat,

REGEX_FLAGS flags=NOFLAGS,

REGEX_MODE mode=MODE_DEFAULT ); // throw(bad_alloc,bad_regexpr);

 

rpattern::rpattern(

const string_type & pat,

const string_type & sub,

REGEX_FLAGS flags=NOFLAGS,

REGEX_MODE mode=MODE_DEFAULT ); // throw(bad_alloc,bad_regexpr);

 

Both methods require you to specify a string containing the pattern to match.  Both methods allow you to specify some flags that customize your search and the mode with which the pattern should be matched (see Matching Modes).  The first method does not require you to specify a substitution string; the second one does.  If you do not specify a substitution string, it is assumed to be the null string.

 

Notice the (commented out) exception specification, “throw(bad_alloc, bad_regexpr),” on the constructors.  This means that the constructor of an rpattern object can throw an exception of type bad_regexpr or bad_alloc.  This can happen when the specified pattern contains illegal regular expression syntax, such as unbalanced parentheses.  It can also occur when the substitution string refers to a backref that does not exist (for instance, if the substitution string contains “$6”, and there are not 6 groups in the pattern). The bad_regexpr exception inherits from the std::invalid_argument exception.  The bad_regexpr::what() method returns a pointer to a C-style string that contains a description of the problem.

The match method is used to find patterns in strings.  There are a couple of different versions of the match method.  The first takes a std::basic_string as a parameter.  The second version takes two iterators, the beginning and end of the string to match.  Finally, there is a version of the match() method that takes a pointer to a NULL-terminated string.

 

template< typename CH, typename TR, typename AL >

rpattern::backref_type rpattern::match(

const std::basic_string<CH,TR,AL> & str,

match_results & results,

    size_type pos = 0,            // offset to the start of the substring to match

size_type len = npos ) const; // length of the substring to match

 

rpattern::backref_type rpattern::match(

const_iterator ibegin,        // start of the string to match

const_iterator iend,          // one past end of string to match

match_results & results ) const;

 

rpattern::backref_type rpattern::match(

const char_type * sz,

match_results_c & results ) const;

 

Notice that the match() method that takes a std::basic_string is a template on character, character traits and allocator.  This is necessary given that basic_rpattern is a template only on iterator, but it gives us a little more flexibility then we really want.  If you accidentally pass in the wrong string type (say, a std::wstring when the match method is really expecting a std::string) you will get a compile-time error informing you of your mistake.

 

The match() method returns an object of type rpattern::backref_type.   If the match fails, an “empty backref” will be returned.  An empty backref contains “false” in the matched member.  If the match succeeds, the backref contains information about where the pattern matched the string, and contains “true” in the matched member.  The following code demonstrates the usage of the match() method:

 

const char * sz = “Here is a string to match.”;

rpattern_c pat( “str.ng” ); // matches “string”, “strong”, etc.

match_results_c results;

rpattern_c::backref_type br = pat.match( sz, results );

if( br.matched ) {

    printf( “Backref: %.*s”, br.second – br.first, br.first );

} else {

    printf( “No match found.” );

}

 

In the above code, br.first is a pointer to the first character in the string that matched the pattern and br.second is a pointer to the last+1 character that matched the pattern.  Therefore, the length of the matched pattern is br.second – br.first.  (You’ll notice that this value is used as the precision of the string in the printf() format statement.)  The above code will generate the following output: “Backref: string”. 

 

Notice the use of rpattern_c and match_results_c in the above example.  These are typedefs for basic_rpattern<const TCHAR *> and basic_match_results<const TCHAR *>, respectively.

The substitute() method finds patterns and replaces the found substrings with insertion strings you specify.  It takes a std::string object as a parameter, and a subst_results container.  It also takes two optional integer parameters, which you can use to specify the offset of the first character to search and the length of the search string.  Here is the prototype:

 

template< typename CH, typename TR, typename AL >

size_t rpattern::substitute( std::basic_string<CH,TR,AL> & str,

                             subst_results & results,

                             size_type pos = 0,

                             size_type len = npos );

 

The return type is an unsigned integer representing the number of substitutions made.  If the pattern is not found, no substitutions are made and the return value is 0.  The following example finds all instances of /n not preceded by /r and replaces them with /r/n:

 

size_t insert_crlf( string & str ) {

static const rpattern s_crlf( “(?<!/r)/n”, “/r/n”, GLOBAL );

subst_results results;

    return s_crlf.substitute( str, results );

}

 

Later sections will describe the significance of “static const” and “GLOBAL” in the above example.  Note the use of the look-behind assertion, (?<!/r), in the example.  It evaluates to true when the preceding character is not a return character, but it doesn’t include the preceding character in the match.

 

When doing a global substitution, the next match operation begins where the last substitution left off.  Thus, if the search string contains the word “foo”, pattern is “o”, and the substitution is “oo”, the resulting string will be, “foooo”.

Use the count() method when you want to know the number of times a given pattern matches a string, but you don’t care about where the matches occur or what they are exactly.  The count() method has three variants, which are analogous to the three match() methods.  Here’s one if its prototypes:

 

size_t rpattern::count(

const_iterator ibegin,        // start of the string to match

const_iterator iend ) const;  // one past end of string to match

 

Like the substitute() method, the next match begins where the previous one ended.  Thus, if you want to know how many times the pattern “oo” shows up in the string “fooooo”, the answer is 2, even though this pattern could conceivably match the string in 4 unique places.  Overlapping matches are not counted.

Use the split() method when you have a string that you want to split into substrings, using a regular expression as a delimiter.  Here’s one if its prototypes:

 

size_t rpattern::split(

const_iterator ibegin,    // start of the string to match

const_iterator iend,      // one past end of string to match

split_results & results,

int limit = 0 ) const;

 

The split() function takes an optional “int limit” parameter which defaults to 0.  If it is greater than 0, it is the upper limit on the number of fields to split the string into.  If it is 0, then there is no limit, except that trailing empty fields are dropped.  If it is less that 0, then there is no limit, and empty trailing fields are kept.  As in perl, an empty leading field is always dropped.

 

If your regular expression has capturing groups, these back-references become fields in the split_results.

 

The split_results struct behaves like an STL container of std:strings.

Sometimes you may want to set the substitution string after the rpattern object has been instantiated.  For this, you would use the set_substitution() method. 

 

void rpattern::set_substitution( const string_type & sub );

// throw(bad_alloc,bad_regexpr);

 

The set_substitution() method can be called as many times as you like.  Each time it is called, it replaces the old substitution string with the new one.  Notice that the set_substitution() method can also throw a bad_regexpr exception for ill-formed substitution strings.

This method returns the count of groups in the pattern string.  For a successfully parsed pattern, this number is always at least one, because the entire pattern is considered group number 0.

 

size_t rpattern::cgroups() const;

match_results, subst_results and split_results

The results of a match(), substitute(), or split() operation are stored in the match_results, subst_results, and split_results containers, respectively.  After a successful match(), substitute() or split(), they contain useful information that can be queried for using the following methods.  All the match_results methods are also available on subst_results.

This method returns the count of the internally saved backrefs.  It takes no parameters.  Here is the prototype:

 

size_t match_results::cbackrefs() const;

 

After a successful match() operation, cbackrefs() will always return at least one.  That is because the part of the string that matched the entire pattern is always saved in backref number 0 (like $& in Perl).

This method returns the requested backref.  It takes one parameter, which is the integer representing the backref to return.  The prototype for this method looks like:

 

const backref_type & match_results::backref( size_t cbackref ) const;

 

As in Perl, backreferences are created as a side effect of grouping.  The substring that matched the entire pattern is saved in backref number 0; the substring that matched the patt

抱歉!评论已关闭.