Module: Essential Tools Module Group: String Processing Classes
Does not inherit
Member index: Char, getStatus(), index(), iterator, matchAt(), match_iterator, operator, replace(), RString, RWCRegularExpression, RWTRegex(), RWTRegexStatus, search(), subCount(), ~RWTRegex()
#include <rw/tools/regex.h>
RWTRegex<char> re0(".*\\.doc");    // Matches filenames with suffix ".doc"
RWCRegularExpression re1("a+");    // Matches one or more 'a'
RWWRegularExpression re2(L"b+");   // Matches one or more wide-character 'b'
RWTRegex<T> is the primary template for the new regular expression interface. It provides most of the POSIX.2 standard for regular expression pattern matching and may be used for both narrow (8-bit) and for wide (wchar_t) character strings.
It enhances and replaces RWCRegexp and RWCRExpr, which have been deprecated in this release.
However, if your regular expression search requires backreferences, you must use RWCRegexp rather than RWTRegex<T>. Backreferences are supported only in basic regular expressions (BREs), not in extended regular expressions (EREs).
RWTRegex<T> can represent both a simple and an extended regular expression such as those found in lex and awk. The constructor "compiles" the expression into a form that can be used more efficiently. The results can then be used for string searches using class RWCString. Regular expressions (REs) can be of arbitrary size, limited by memory. The extended regular expression features found here are a subset of those found in the POSIX.2 standard (ANSI/IEEE Std. 1003.2, ISO/IEC 9945-2).
RWTRegex<T> differs from the POSIX.2 standard in the following ways:
RWTRegex<T> follows the RWCRegexp and RWCRExpr tradition by treating all RE special characters as special, unless escaped (prefixed with a \). (The POSIX standard dictates that some RE special characters are escaped when used to form a pattern.)
RWTRegex<T> does not currently support locale-based constructs, such as collating symbols, equivalence classes, or character classes.
Constructing a regular expression
To match a single character RE
Any character that is not a special character matches itself.
A backslash (\) followed by any special character matches the literal character itself; that is, its use "escapes" the special character. For example, \* matches "*" without applying the syntax of the * special character.
The "special characters" are:
+ * ? . [ ] ^ $ ( ) { } | \
The period (.) matches any character. For example, ".umpty" matches either "Humpty" or "Dumpty."
A set of characters enclosed in brackets ([ ]) is a one-character RE that matches any of the characters in that set. This means that [akm] matches either an "a", "k", or "m". A range of characters can be indicated with a dash, as in [a-z], which matches any lower-case letter. However, if the first character of the set is the caret (^), then the RE matches any character except those in the set. It does not match the empty string. For example: [^akm] matches any character except "a", "k", or "m". The caret loses its special meaning if it is not the first character of the set.
To match a multicharacter RE
Parentheses (( )) group parts of regular expressions together into subexpressions that can be treated as a single unit. For example, (ha)+ matches one or more "ha"s.
An asterisk (*) following a one-character RE or a parenthesized subexpression matches zero or more occurrences of the RE. Hence, [a-z]* matches zero or more lower-case characters, and (ha)* matches zero or more occurrences of "ha".
A plus (+) following a one-character RE or a parenthesized subexpression matches one or more occurrences of the RE. Hence, [a-z]+ matches one or more lower-case characters, and (ha)+ matches one or more occurrences of "ha".
A question mark (?) marks the preceding RE as optional: it can occur zero times or once in the string, but no more. For example, xy?z matches either xyz or xz.
The concatenation of REs is a RE that matches the corresponding concatenation of strings. For example, [A-Z][a-z]* matches any capitalized word.
The OR character ( | ) allows a choice between two regular expressions. For example, jell(y|ies) matches either "jelly" or "jellies".
Braces ({ }) following a one-character RE match the preceding element the number of times indicated. For example, a{2,3} matches either "aa" or "aaa".
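The construction rules above can be tried out with a stand-in engine. The sketch below uses the C++11 std::regex class with its POSIX extended grammar, not RWTRegex<T>, so details may differ from the Rogue Wave implementation. The helper name matches_whole is hypothetical and exists only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` (POSIX extended syntax) matches
// the whole of `text`. std::regex is a stand-in here; this is not the
// RWTRegex<T> API.
bool matches_whole(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}
```

For instance, matches_whole("(ha)+", "hahaha") and matches_whole("xy?z", "xz") both hold, while matches_whole("a{2,3}", "a") does not.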
All or part of the regular expression can be "anchored" to either the beginning or end of the string being searched.
If the caret (^) is at the beginning of the (sub)expression, then the matched string must be at the beginning of the string being searched. For example, "^hat" matches the "hat" at the start of "hatbox" but does not match the "hat" in "that".
If the dollar sign ($) is at the end of the (sub)expression, then the matched string must be at the end of the string being searched. For example, "know$" would match "I know what I know" but not "He knows what he knows."
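Anchoring can be checked the same way with a stand-in engine. This sketch again uses std::regex with the POSIX extended grammar rather than RWTRegex<T>; the helper name found_in is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
// Anchors (^ and $) restrict where that match may sit.
// std::regex is a stand-in; this is not the RWTRegex<T> API.
bool found_in(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_search(text, re);
}
```

With this helper, found_in("know$", "I know what I know") holds, while found_in("know$", "He knows what he knows") and found_in("^hat", "that") do not.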
Overriding the backslash special character
A common pitfall with regular expression classes is overriding the backslash special character (\). The C++ compiler and the regular expression constructor will both assume that any backslashes they see are intended to escape the following character. Thus, to specify a regular expression that exactly matches "a\a", you would have to create the regular expression using four backslashes as follows: the regular expression needs to see "a\\a", and for that to happen, the compiler would have to see "a\\\\a".
RWTRegex<char> reg("a\\\\a");
                     ^|^|
                      1 2
The backslashes marked with a ^ are an escape for the compiler, and the ones marked with | will thus be seen by the regular expression parser. At that point, the backslash marked 1 is an escape, and the one marked 2 will actually be put into the regular expression.
Similarly, if you really need to escape a character, such as a ".", you will have to pass two backslashes to the compiler:
RWCRExpr regDot("\\.");
                 ^|
Once again, the backslash marked ^ is an escape for the compiler, and the one marked with | will be seen by the regular expression constructor as an escape for the following ".".
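The two layers of escaping can be made visible by counting characters. The sketch below is an illustration only: it uses std::regex with the POSIX extended grammar as a stand-in for the regular expression constructor, and the names kPattern, kText, pattern_length, pattern_matches_text, and escaped_dot_matches are hypothetical.

```cpp
#include <cstring>
#include <regex>

// What the compiler sees: four backslashes between the two 'a's.
// What the regex parser receives after the compiler's pass: a \ \ a
const char* const kPattern = "a\\\\a";

// The text to be matched: 'a', one literal backslash, 'a'.
const char* const kText = "a\\a";

// Number of characters the regex parser actually receives for kPattern.
std::size_t pattern_length() { return std::strlen(kPattern); }

// Stand-in check using std::regex (not RWTRegex<T>): true if kPattern
// matches kText in full.
bool pattern_matches_text() {
    const std::regex re(kPattern, std::regex::extended);
    return std::regex_match(kText, re);
}

// Same idea for an escaped dot: "\\." reaches the parser as \. and
// matches only a literal period.
bool escaped_dot_matches(const char* text) {
    const std::regex re("\\.", std::regex::extended);
    return std::regex_match(text, re);
}
```

Here pattern_length() is 4, not 6, because the compiler has already consumed one backslash from each pair before the pattern reaches the regular expression parser.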
None
#include <rw/tools/regex.h>
#include <rw/cstring.h>
#include <iostream>

using std::cout;
using std::endl;

int main()
{
    RWCString aString("Hark! Hark! The lark");

    // This regular expression matches any lowercase word
    // or end of a word starting with "l"
    RWTRegex<char> re("l[a-z]*");
    RWTRegexResult<char> result;

    if (result = re.search(aString))
        cout << result.subString(aString) << endl;  // Prints "lark"

    return 0;
}
Program output:
lark
Related classes include:
RWTRegexMatchIterator<T> which iterates over matches of a pattern in a given string.
RWTRegexResult<T> which encapsulates the results of a pattern matching operation.
RWTRegexTraits<T> which defines the character traits for a specific type of regular expression character and includes methods for returning these values.
RWRegexErr which reports errors from within RWTRegex<T>.
typedef RW_TYPENAME RWTRegexTraits<T>::Char RChar;
typedef RWTRegex<char> RWCRegularExpression;
typedef RWTRegexMatchIterator<RChar> iterator;
A typedef based on the same character type as the instantiation of RWTRegex<T>. For example, RWTRegex<char>::iterator is a typedef for RWTRegexMatchIterator<char>.
typedef RWTRegexMatchIterator<RChar> match_iterator;
NOTE -- The program provides match_iterator and iterator. RWTRegex::iterator will be a match iterator. If you need to add new iterator types, you must give them a descriptive prefix, as in match_iterator.
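The idea of a match iterator, an object that steps through successive matches of a pattern in a string, can be sketched with the standard library's std::sregex_iterator. This is a stand-in and not RWTRegexMatchIterator<T>, whose interface is not shown here; the helper name all_matches is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every non-overlapping match of `pattern`
// in `text`, in order, by walking a std::sregex_iterator.
// std::regex is a stand-in; this is not the RWTRegex<T> API.
std::vector<std::string> all_matches(const std::string& pattern,
                                     const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        found.push_back(it->str());  // text of the overall match
    }
    return found;
}
```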
typedef std::basic_string<RChar> RString;
typedef RWTRegex<char> RWCRegularExpression;
typedef RWTRegex<wchar_t> RWWRegularExpression;
RWTRegex(const RChar* str, size_t length = size_t(-1));
Initializes a RWTRegex<T> object to represent the pattern specified in the str parameter. The length parameter specifies the length, in characters, of the pattern string.
The parameter str specifies the pattern string for the regular expression.
The parameter length specifies the length, in characters, of the pattern string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a NULL character, according to its character traits. (The traits for each type of character are defined in RWTRegexTraits<T>.)
Throws RWRegexErr if a pattern error is encountered.
RWTRegex(const std::basic_string<E>& str, size_t length = size_t(-1));
Initializes a RWTRegex<T> object to represent the pattern specified in the str parameter. The length parameter specifies the length, in characters, of the pattern string.
The parameter str specifies the pattern string for the regular expression.
The parameter length specifies the length, in characters, of the pattern string. If length is not specified, the length of the input string object str is used.
Throws RWRegexErr if a pattern error is encountered.
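Reporting a malformed pattern at construction time is a common design. As an analogy only, std::regex throws std::regex_error in the same situation; the sketch below shows that pattern of use and makes no claim about the members of RWRegexErr. The helper name compiles is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` compiles, false if the engine
// rejects it. std::regex reports pattern errors by throwing
// std::regex_error, which plays the role that RWRegexErr plays for
// RWTRegex<T>. This is a stand-in, not the Rogue Wave API.
bool compiles(const std::string& pattern) {
    try {
        const std::regex re(pattern, std::regex::extended);
        (void)re;  // only the act of compiling matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

A pattern such as "(a", with its missing closing parenthesis, is rejected, which corresponds in spirit to a status like MissingClosingParen.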
RWTRegex(const RWTRegex& source);
Copy constructor. The pattern represented by the "source" object is copied to this RWTRegex object. This copying operation is performed without recompiling the original pattern.
The parameter source is the source RWTRegex object for the copy operation.
RWTRegex();
Default constructor. Objects initialized with this constructor represent uninitialized patterns. These objects should be assigned a valid pattern before use.
virtual ~RWTRegex();
Destructor. Releases any allocated memory.
RWTRegex& operator= (const RWTRegex& rhs);
Assignment operator. Replicates the RWTRegex object specified by rhs, placing the copy in this RWTRegex object. The copy is performed without recompiling the original pattern.
The parameter rhs is the "right hand side" of the assignment expression, and is the source for the copy operation.
Returns a reference to this newly assigned RWTRegex object.
bool operator< (const RWTRegex& rhs) const;
Compares this RWTRegex object to the right hand side RWTRegex object by performing an element-by-element comparison of the characters in each object's pattern string. Character comparisons are performed as defined by the lt method on the RWTRegexTraits<T> class implemented for the type of character in use.
This object is considered less than the right hand side pattern if it contains the lesser of the first two unequal characters, from left to right, or if there are no unequal characters, but this pattern string is shorter than the right hand side pattern string.
The parameter rhs is the "right hand side" RWTRegex object in the comparison expression.
Returns true if this RWTRegex is less than the right hand side RWTRegex, as defined above.
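The ordering described above is a lexicographic comparison of the two pattern strings. A minimal sketch on plain std::string patterns follows, with std::char_traits<char>::lt standing in for the lt method of RWTRegexTraits<T>; the function name pattern_less is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Sketch of the described ordering, applied to raw pattern strings:
// the first unequal character decides, and if one pattern is a prefix
// of the other, the shorter one is the lesser.
// std::char_traits<char>::lt stands in for RWTRegexTraits<T>::lt.
bool pattern_less(const std::string& lhs, const std::string& rhs) {
    return std::lexicographical_compare(
        lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
        [](char a, char b) { return std::char_traits<char>::lt(a, b); });
}
```

Under this rule "a+" orders before "b+", and "ab" orders before "abc".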
bool operator==(const RWTRegex& rhs) const;
Compares this RWTRegex object to the right hand side RWTRegex object by performing an element-by-element comparison of the characters in each object's pattern string. Character comparisons are performed as defined by the eq method on the RWTRegexTraits<T> class implemented for the type of character in use. This object is considered equal to the right hand side pattern if it contains the same number of characters, and each corresponding pair of characters in the patterns are equal to one another.
The parameter rhs is the "right hand side" RWTRegex object in the comparison expression.
Returns true if this RWTRegex is equal to the right hand side RWTRegex, as defined above.
enum RWTRegexStatus { Ok, MissingEscapeSequence, InvalidHexNibble, InsufficientHex8Data, InsufficientHex16Data, MissingClosingBracket, MissingClosingCurlyBrace, MissingClosingParen, UnmatchedClosingParen, InvalidSubexpression, InvalidDataAfterOr, InvalidDataBeforeOr, ConsecutiveCardinalities, InvalidCardinalityRange, LeadingCardinality, InvalidDecimalDigit, UnmatchedClosingCurly, NumberOfStatusCodes };
Defines allowable status codes. These codes are accessed by RWRegexErr.
const RWRegexErr& getStatus() const;
Used to query the last-pattern compilation status. This method is useful primarily in exception-disabled environments in which the default error handler for the Essential Tools Module error framework has been replaced with a function that does not abort. Otherwise, the regular expression object will not be available for this query.
Returns the regular expression status for the last compilation.
size_t index(const RChar* str, size_t* mLen = 0, size_t start = size_t(0), size_t length = size_t(-1));
Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string. It then continues, one character at a time, until either a match is found, or the end of the string is reached.
The length of the input string can be specified with the length argument.
If a match is found, the method returns the index into the string at which the first match was found, starting from the beginning of the string. The length of the match is returned in the mLen argument.
If no match is found, the method returns RW_NPOS.
The parameter str is the string to be searched for a match.
The parameter mLen is a return parameter, and returns the length of any match found during this operation. If not supplied (NULL), the length is not returned, but is available through getLength().
The parameter start is the character position where the search for a match will start.
The parameter length specifies the length, in characters, of the entire input string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a NULL character, as defined by the traits specific to this type of character.
Returns the starting character position, from the beginning of the string, of a match. If no match is found, RW_NPOS is returned.
size_t index(const RString& str, size_t* mLen = 0, size_t start = size_t(0), size_t length = size_t(-1));
Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or the end of the string is reached.
The length of the input string can be specified with the length argument.
If a match is found, the method returns the index into the string at which the first match was found, starting from the beginning of the string. The length of the match is returned in the mLen argument.
If no match is found, the method returns RW_NPOS.
The parameter str is the string to be searched for a match.
The parameter mLen is a return parameter, returning the length of any match found during this operation. If not supplied (NULL), the length is not returned, but is available through getLength().
The parameter start is the character position at which to begin searching for a match.
The parameter length specifies the length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object.
Returns the starting character position, from the beginning of the string, of a match. If no match is found, RW_NPOS is returned.
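The calling convention of index(), a returned offset plus an optional out-parameter for the match length, can be sketched as a free function over std::regex. Everything here is an assumption made for illustration: index_of is a hypothetical name, std::string::npos stands in for RW_NPOS, and anchors are interpreted relative to the start position, which may not match RWTRegex<T>.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical free function mirroring the described behaviour of index():
// returns the offset (from the beginning of `text`) of the first match at
// or after `start`, or std::string::npos if there is none. If `mLen` is
// non-null, it receives the length of the match.
std::size_t index_of(const std::regex& re, const std::string& text,
                     std::size_t* mLen = nullptr, std::size_t start = 0) {
    assert(start <= text.size());
    const auto first =
        text.begin() + static_cast<std::string::difference_type>(start);
    std::smatch m;
    if (!std::regex_search(first, text.end(), m, re)) {
        return std::string::npos;
    }
    if (mLen != nullptr) {
        *mLen = static_cast<std::size_t>(m.length(0));
    }
    // m.position(0) is relative to the search start, so add `start` back.
    return start + static_cast<std::size_t>(m.position(0));
}
```

Called with the pattern "l[a-z]*" on "Hark! Hark! The lark", this sketch returns offset 16 and a match length of 4.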
RWTRegexResult<T> matchAt(const RChar* str, size_t start = size_t(0), size_t length = size_t(-1));
Searches an input string for a match against the pattern string represented by this RWTRegex object. The match must start at the specified character in the input string. (This is similar to anchoring the pattern at the beginning of the string using the circumflex character ^.)
If a match is found, the method returns true, and the match information returned through getStart() and getLength() will represent the longest match starting from the first character in the string.
If no match is found, the method returns false.
The parameter length supplies the number of input string characters to be considered.
The parameter str is the string to be searched for a match.
The parameter start is the character position where the search for a match will start.
The parameter length specifies the length, in characters, of the entire input string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a NULL character, as defined by the traits specific to this type of character.
Returns true if a match is found starting with the first character in the input string.
RWTRegexResult<T> matchAt(const RString& str, size_t start = size_t(0), size_t length = size_t(-1));
An overload of the above matchAt method in which the string is given as a std::basic_string.
Searches an input string for a match against the pattern string represented by this RWTRegex object. The match must start at the specified character in the input string. (This is similar to anchoring the pattern at the beginning of the string using the circumflex character ^.)
If a match is found, the method returns true, and the match information returned through getStart() and getLength() will represent the longest match starting from the first character in the string.
If no match is found, the method returns false.
The parameter length supplies the number of input string characters to be considered.
The parameter str is the string to be searched for a match.
The parameter start is the character position where the search for a match will start.
The parameter length specifies the length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object.
Returns true if a match is found starting with the first character in the input string.
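The difference between matchAt() and search() is that the match may not slide forward from the start position. With std::regex as a stand-in, the same constraint is expressed by the match_continuous flag. The helper name match_at is hypothetical, and this sketch returns a plain bool rather than a result object.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper mirroring the described behaviour of matchAt():
// succeeds only if a match begins exactly at offset `start`.
// match_continuous tells std::regex_search not to try later positions.
// std::regex is a stand-in; this is not the RWTRegex<T> API.
bool match_at(const std::regex& re, const std::string& text,
              std::size_t start = 0) {
    assert(start <= text.size());
    const auto first =
        text.begin() + static_cast<std::string::difference_type>(start);
    return std::regex_search(first, text.end(), re,
                             std::regex_constants::match_continuous);
}
```

For the pattern "l[a-z]*" and the text "Hark! The lark", match_at succeeds at offset 10, where "lark" begins, and fails at offsets 0 and 9.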
size_t replace(RString& str, const RString& replacement, size_t count = 1, size_t matchID = 0, size_t start = size_t(0), size_t length = size_t(-1), bool replaceEmptyMatches = true);
Replaces occurrences of the regular expression pattern in str with a replacement string, replacement. The number of replacements is identified by count. The default value for count is 1, which replaces only the first occurrence of the pattern.
Zero-length matches are replaced only if replaceEmptyMatches is true. The search begins at the start character position. The length, in characters, of the original string is identified by length. If no length is given, the length of the input str object is used. The input str is updated as part of this operation.
Returns the total number of occurrences replaced.
The parameter str is the string to be searched for a match.
The parameter replacement is the string that replaces each matched occurrence of the pattern in str, up to the number given by count.
The parameter count is the number of matches to replace. If 0 is specified, all matches are replaced.
The parameter matchID specifies the match identifier of the sub-expression to be replaced. The default value of 0 replaces the overall match with specified replacement text.
The parameter start is the character position where the search for a match will start.
The parameter length specifies the length, in characters, of the entire input string. If the length is not specified, then it is assigned the length of the input string object.
If the boolean replaceEmptyMatches is true, zero-length matches are replaced, as well as all other matches. Otherwise only matches with length greater than zero are replaced.
Returns the total number of occurrences replaced.
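The count convention, where 1 replaces only the first match and 0 replaces all of them, can be sketched over std::regex. This is an approximation under stated assumptions: replace_n is a hypothetical name, the replacement text is inserted literally, and the matchID, start, length, and replaceEmptyMatches parameters are left out.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper approximating the described replace(): substitutes
// up to `count` non-overlapping matches of `re` in `text` (0 means "all"),
// updates `text` in place, and returns how many substitutions were made.
// std::regex is a stand-in; this is not the RWTRegex<T> API.
std::size_t replace_n(const std::regex& re, std::string& text,
                      const std::string& replacement, std::size_t count = 1) {
    std::string out;
    std::size_t replaced = 0;
    std::string::const_iterator pos = text.cbegin();
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.cbegin(), text.cend(), re);
         it != end; ++it) {
        if (count != 0 && replaced == count) break;  // quota reached
        out.append(pos, (*it)[0].first);   // text before this match
        out += replacement;                // the substitution itself
        pos = (*it)[0].second;             // resume after the match
        ++replaced;
    }
    out.append(pos, text.cend());          // untouched tail
    text = out;
    return replaced;
}
```

With the pattern "a+" and the string "caaat baat", the default count of 1 yields "cot baat", and a count of 0 yields "cot bot".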
RWTRegexResult<T> search(const RChar* str, size_t start = size_t(0), size_t length = size_t(-1));
Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or until the end of the string is reached.
If a match is found, the method returns true, and the match information returned through getStart() and getLength() will represent the longest match starting from the first position at which a match is found.
If no match is found, the method returns false. The length parameter defines the number of input string characters to be considered.
The parameter str is the string to be searched for a match.
The parameter start is the character position where the search for a match will start.
The parameter length specifies the length, in characters, of the entire input string. If the length is not specified, it is calculated as the number of characters preceding the first occurrence of a NULL character, as defined by this character's traits.
Returns true if a match is found at some point within the input string.
RWTRegexResult<T> search(const RString& str, size_t start = size_t(0), size_t length = size_t(-1));
An overload of the above search method. Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or until the end of the string is reached.
If a match is found, the method returns true, and the match information returned through getStart() and getLength() will represent the longest match starting from the first position at which a match is found.
If no match is found the method returns false. The parameter length defines the number of input string characters to be considered.
The parameter str is the string to be searched for a match.
The parameter start is the character position where the search for a match will start.
The parameter length specifies the length, in characters, of the entire input string. If the length is not specified, then it is assigned the length of the input string object.
Returns true if a match is found at some point within the input string.
size_t subCount() const;
Used to query the number of parenthesized subexpressions in a regular expression object.
Returns the number of parenthesized subexpressions in this regular expression.
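The standard library offers a comparable query, std::regex::mark_count(), which reports the number of marked (parenthesized) subexpressions. The sketch below uses it as a stand-in to show what such a count looks like for a few patterns; the helper name subexpression_count is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: number of parenthesized subexpressions in
// `pattern`, as reported by std::regex::mark_count().
// std::regex is a stand-in; this is not the RWTRegex<T> API.
unsigned subexpression_count(const std::string& pattern) {
    const std::regex re(pattern, std::regex::extended);
    return re.mark_count();
}
```

For example, "jell(y|ies)" has one subexpression, "(ha)+(ho)*" has two, and "a+" has none.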
© Copyright Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.