Regular Expression String Searching

Internationalization Module User's Guide

8.4 Regular Expression String Searching

A regular expression is a string pattern composed of normal characters and special characters. Special characters are used to denote an arrangement of the other characters in the regular expression pattern. A regular expression can be used to search for, and perhaps replace, occurrences of the regular expression pattern in strings.

Regular expression syntax describes how to arrange normal characters and special characters to form a valid regular expression pattern. The regular expression syntax for RWURegularExpression is similar to that of the POSIX 2 extended regular expression (ERE) specification, with the addition of Unicode extensions. For more information on the POSIX ERE standard, see Section 8.4.2.

The Internationalization Module extends the POSIX 2 ERE syntax to provide support for Unicode basic and tailored regular expressions through the class RWURegularExpression.

Basic Unicode regular expression support corresponds to Level 1 support, as described in the Unicode Regular Expression Guidelines (Unicode Technical Report #18 (UTR-18) Version 5.1 at http://www.unicode.org/reports/tr18/tr18-5.1.html). Basic Unicode regular expressions are useful for the majority of Unicode strings. They add the following Unicode extensions to the POSIX ERE standard:

Hexadecimal notation

Character categories

Subtraction

Simple word boundaries

Simple loose matches

Line breaks

For more information on basic regular expressions, see Section 8.4.3.1.

Tailored regular expressions extend the basic regular expression functionality, corresponding to Level 2 and Level 3 support, also described in UTR-18 Version 5.1. In addition to some minor extensions, the tailored extensions include support for:

Treating surrogate pairs as single characters

Using the script property

Matching canonically equivalent character representations

Specifying grapheme clusters

Added power comes at a cost in processing time and space, so RWURegularExpression uses the basic regular expression engine by default. Request tailored regular expressions only when you need their additional features.

For more information on tailored regular expressions, see Section 8.4.3.2 and Section 8.4.3.3.

8.4.1 A Note on Support by UTR Version Number

The Internationalization Module's support for regular expressions is based primarily on UTR-18 Version 5.1. However, the module also provides support for much of UTR-18 Version 6 (http://www.unicode.org/reports/tr18/).

Support for Version 6 includes all Level 1, 2, and 3 features, except for other properties, intersection, tailored properties, element-level loose matches, and fine-grained Level 3 support. While Version 6 places support for surrogates at Level 1, the Internationalization Module provides that support at Level 2, in keeping with the guidelines from UTR-18 Version 5.1.

8.4.2 POSIX Extended Regular Expression Syntax

Although UTR-18 Version 6 suggests use of a Perl-like pattern syntax, the regular expression support in the Internationalization Module uses the POSIX 2 extended regular expression (ERE) pattern syntax, with Unicode extensions, suggested by UTR-18 Version 5.1. That syntax is described in Table 2.

The special characters used by RWURegularExpression are as follows:

Table 2: RWURegularExpression special characters based on POSIX 2 syntax

Character	Meaning
+	Matches one or more occurrences of the preceding item, except in a bracket expression. For example, `a+` matches `a`, `aa`, `aaa`, and so on.
*	Matches zero or more occurrences of the preceding item, except in a bracket expression. For example, `a*` matches the empty string, `a`, `aa`, and so on.
?	Matches zero or one occurrence(s) of the preceding item, except in a bracket expression. For example, `a?` matches the empty string and `a`.
{ and }	Specify a cardinality range, formed as follows: `{m,n}`. This construct matches between `m` and `n` occurrences of the preceding item. For example, `a{2,3}` matches `aa` and `aaa`. This construct can also be formed using `{m,}` and `{m}`. The first matches `m` or more occurrences of the preceding item. For example, `a{2,}` matches `aa`, `aaa`, `aaaa`, and so on. The second matches exactly `m` occurrences of the preceding item. For example, `a{2}` matches `aa`. Note: `{` is treated differently in a bracket expression. In this context, `{` denotes the beginning of a Unicode character category, as described in Section 8.4.3.
[ and ]	Create a bracket expression. Bracket expressions create a set of items, any of which may be matched. For example, `[abc]` matches `a`, or `b`, or `c`. Within a bracket expression all regular expression special characters are treated as normal, non-special characters, with the following exceptions. `-` specifies a range of character values, based on their bit pattern. For example, `[A-Za-z]` matches all uppercase and lowercase English characters. To indicate `-` as a character in the bracket expression, it must be the first or last character in the set; for example, `[-a-z]` or `[A-Z-]`. `^` is special only when placed in the first character position within the bracket set. Using `^` in the first position complements the set of items to be matched. For example, `[^a-z]` matches all characters except for lowercase English letters. `{` denotes the beginning of a Unicode character category (see Section 8.4.3). To use `{` in a bracket expression, escape it by preceding it with the `\` character as follows: `[\{]`. Finally, in order to include a `]` as a character in the bracket set, you must include it as the first character in the set, as in `[]abc]` or `[^]abc]`.
( and )	Group regular expression items into subexpressions, which are treated as a single unit. For example, whereas `ab*` matches `a`, `ab`, `abb`, and so on, `(ab)*` matches the empty string, `ab`, `abab`, and so on. `(` and `)` are not treated as special characters inside a bracket expression.
\	Escapes a regular expression character, causing it to be treated as a regular character. For example, whereas `(ab)` indicates a subexpression consisting of `ab`, `\(ab\)` denotes the sequence of characters `(`, `a`, `b`, and `)`. Note: To specify the `\` character in C++ source code, you must specify `\\`, as the C++ compiler treats the `\` character as special, denoting the beginning of an escape sequence embedded in the C++ source code. In data files, or text controls in dialog boxes, however, the double backslash is not necessary.
^	Indicates that a regular expression or subexpression is anchored at the beginning of the input string. For example, `^ab` matches `ab` and `abc`, but not `cab`. Recall that `^` is treated differently in bracket expressions.
$	Indicates that a regular expression or subexpression is anchored at the end of the input string. For example, `ab$` matches `ab` and `cab`, but not `abc`.
|	Denotes alternation, or the creation of a set of equally valid, alternate expressions or subexpressions, each of which can be matched. For example, `ab|cd` matches `ab` or `cd`.
.	Matches any code unit, except for those which indicate the logical end of a line, as outlined in Unicode Technical Report #18: `\u2028`, `\u2029`, `\u000A`, `\u000B`, `\u000C`, `\u000D`, `\u0085`.

All of the above regular expression special characters are treated as special unless escaped. This differs slightly from the POSIX Extended Regular Expression standard, in which some characters are treated as special when escaped, while others are treated as special unless escaped.
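
The behavior of the POSIX-derived operators in Table 2 can be tried out without the Internationalization Module. The following is a minimal sketch that uses the C++ standard library's `std::regex` in its POSIX extended mode as a stand-in; it is not RWURegularExpression, and the helper names `matchesWhole`, `matchesSomewhere`, and `checkTable2Examples` are hypothetical. The Unicode extensions and the `.` line-terminator rules described above are not covered by this stand-in.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for illustration only: std::regex in POSIX extended mode,
// not RWURegularExpression. True if the whole text matches the pattern.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}

// True if the pattern matches somewhere in the text.
bool matchesSomewhere(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_search(text, re);
}

// Expectations taken from the examples in Table 2.
void checkTable2Examples() {
    assert(matchesWhole("a+", "aaa") && !matchesWhole("a+", ""));
    assert(matchesWhole("a*", ""));
    assert(matchesWhole("a?", "a") && !matchesWhole("a?", "aa"));
    assert(matchesWhole("a{2,3}", "aaa") && !matchesWhole("a{2,3}", "a"));
    assert(matchesWhole("[abc]", "b") && !matchesWhole("[^a-z]", "q"));
    assert(matchesWhole("(ab)*", "abab") && !matchesWhole("ab*", "abab"));
    assert(matchesWhole("\\(ab\\)", "(ab)"));  // escaped parentheses are literal
    assert(matchesSomewhere("^ab", "abc") && !matchesSomewhere("^ab", "cab"));
    assert(matchesSomewhere("ab$", "cab") && !matchesSomewhere("ab$", "abc"));
    assert(matchesWhole("ab|cd", "cd"));
}
```

Note the doubled backslashes in `"\\(ab\\)"`: the C++ compiler consumes one level of escaping before the pattern compiler sees the string.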

8.4.3 Unicode Regular Expressions

This section describes the extensions to the POSIX ERE standard that are part of the RWURegularExpression syntax allowing for basic and tailored regular expressions.

8.4.3.1 Basic Unicode Regular Expression Extensions

This section details the extensions to the POSIX ERE standard that support basic Unicode regular expressions in RWURegularExpression. Basic Unicode regular expression support corresponds to Level 1 Unicode regular expression support as described in Version 5.1 of UTR-18 (http://www.unicode.org/reports/tr18/tr18-5.1.html).

All regular expression pattern strings and search strings are treated as UTF-16 character sequences. UTF-16 is the only encoding supported through the pattern matching interface to RWURegularExpression. All pattern strings are accepted as RWUString objects, or are converted from a specified encoding to RWUString objects internally before being compiled. All search strings are taken as RWUString objects. Subexpression match strings are returned as RWUString objects.

Basic Unicode regular expressions do not recognize UTF-16 surrogate pairs (Unicode code points, or characters, represented as a sequence of two 16-bit code units). Each 16-bit code unit is treated as an individual character. Character properties are obtained from the Unicode character database. Characters are compared based on their bit patterns; no collation is performed. As such, basic Unicode regular expressions are useful for the majority of Unicode strings, and are more efficient than they would be if support for surrogates and collation were required. However, if support for surrogates or collation is required, then basic regular expression support may not meet these needs.

If support for canonical equivalence is required, normalize all strings before passing them to RWURegularExpression. For more information on normalization, see RWUNormalizer.

Basic Unicode regular expression syntax extensions

Hexadecimal notation

The \u syntax allows for the specification of 16-bit Unicode code units. For example, the range expression [\u0020-\u007f] matches any UTF-16 code unit with a numeric value from hexadecimal 20 through hexadecimal 7f.
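
To make the code-unit comparison concrete, here is a small sketch in plain C++ that applies the same numeric test as `[\u0020-\u007f]` to a UTF-16 string. It does not use RWURegularExpression; the function names are hypothetical and chosen for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the UTF-16 code unit has a value from hex 20 through hex 7f,
// the same test that [\u0020-\u007f] expresses.
bool inRange0020To007F(char16_t unit) {
    return unit >= 0x0020 && unit <= 0x007F;
}

// Length of the longest prefix made only of code units in that range,
// which is what [\u0020-\u007f]* would consume at the start of the text.
std::size_t prefixInRange(const std::u16string& text) {
    std::size_t n = 0;
    while (n < text.size() && inRange0020To007F(text[n])) {
        ++n;
    }
    return n;
}

void checkHexRange() {
    assert(inRange0020To007F(u' '));
    assert(!inRange0020To007F(static_cast<char16_t>(0x001F)));
    assert(prefixInRange(u"ab\u00E9z") == 2);  // stops at U+00E9
}
```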

Character categories

Character categories offer a compact way to express sets of characters drawn from a range as wide as the Unicode character set. RWURegularExpression supports both abbreviated and long character category names, and the names are case-sensitive.

Character categories must appear within a bracket set, and are denoted by the text {Category}, where Category is the name of a category to be matched. For example, [{L}{Zs}]* matches zero or more occurrences of any character that is either a letter (L) or a space separator (Zs).
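
As a rough analogue only, POSIX character classes in `std::regex` follow the same idea of naming a class inside a bracket set. The sketch below uses `[[:alpha:][:blank:]]*` where RWURegularExpression would use `[{L}{Zs}]*`. The two are not equivalent: the POSIX classes depend on the C++ locale and do not consult the Unicode character database. The function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in only: [[:alpha:][:blank:]]* plays the role of [{L}{Zs}]* here,
// limited to what the C++ locale classifies as letters and blanks.
bool isLettersAndBlanks(const std::string& text) {
    static const std::regex re("[[:alpha:][:blank:]]*", std::regex::extended);
    return std::regex_match(text, re);
}

void checkCategoryAnalogue() {
    assert(isLettersAndBlanks("hello world"));
    assert(isLettersAndBlanks(""));             // * also accepts zero occurrences
    assert(!isLettersAndBlanks("hello, world")); // the comma is neither class
}
```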

The following two tables list all of the character category names supported by RWURegularExpression. Table 3 includes character categories based on UTR-18. Table 4 includes Rogue Wave-specific character category extensions.

An exception is thrown if any other text appears as a category name.

Table 3: RWURegularExpression character categories based on UTR-18

Category	Description	Category	Description
`L`	All Letters	`Pf`	Final Quote Punctuation
`Lu`	Uppercase Letters	`Po`	Other Punctuation
`Ll`	Lowercase Letters	`S`	All Symbols
`Lt`	Titlecase Letters	`Sm`	Math Symbols
`Lm`	Modifier Letters	`Sc`	Currency Symbols
`Lo`	Other Letters	`Sk`	Modifier Symbols
`M`	All Marks	`So`	Other Symbols
`Mn`	Non-Spacing Marks	`Z`	All Separators
`Mc`	Spacing Combining Marks	`Zs`	Space Separators
`Me`	Enclosing Marks	`Zl`	Line Separator
`N`	All Numbers	`Zp`	Paragraph Separator
`Nd`	Number, Decimal Digit	`C`	"Other" Characters. Same as the union of `Cc`, `Cf`, `Cs`, `Co`, and `Cn`.
`Nl`	Number, Letter	`Cc`	Other, Control
`No`	Number, Other	`Cf`	Other, Format
`P`	All Punctuation Characters	`Cs`	Other, Surrogate
`Pc`	Connector Punctuation	`Co`	Other, Private Use
`Pd`	Dash Punctuation	`ALL`	Matches All Code Units
`Ps`	Open Punctuation	`ASSIGNED`¹	Matches All Assigned Code Units
`Pe`	Close Punctuation	`UNASSIGNED`	Matches All Unassigned Code Units (the opposite of `ASSIGNED`)
`Pi`	Initial Quote Punctuation

¹ A code point is "assigned" if it has a category other than RWUCharTraits::Unassigned. All code points assigned a category, as well as the blocks of code points allocated for private use, are "assigned."

The following table contains Rogue Wave-specific extensions to the set of character categories outlined in UTR-18.

Table 4: Rogue Wave-specific extensions to character categories

Category	Description
`WB` ¹	Matches Word Breaks. Matches a word boundary, much like the `\b` construct in Perl.
`CB`	Matches Character Breaks
`LB`	Matches Line Breaks
`SB`	Matches Sentence Breaks
`BOL` ¹	Matches at the beginning of a line. Matches at the beginning of a string, or any of the following: `\u2028`, `\u2029`, `\u000D\u000A`, `\u000A`, `\u000B`, `\u000C`, `\u000D`, or `\u0085`.
`EOL` ¹	Matches at the end of a line. This matches at the end of a string, or any of the following: `\u2028`, `\u2029`, `\u000D\u000A`, `\u000A`, `\u000B`, `\u000C`, `\u000D`, or `\u0085`.

¹ If this category appears in a bracket set, then that bracket set, or any enclosing subexpression without additional data, must not have + or * cardinality, or the pattern is flagged as an invalid pattern, and an exception of type InfiniteEmptyMatch is thrown.

Subtraction

Subtraction allows a regular expression pattern to express the removal of a set of items from an existing bracket set. The syntax for such a construct is: [OriginalSet-[SubtractedSet]], where OriginalSet is a bracket set, and SubtractedSet is a bracket set of items to remove from the OriginalSet. For example, [{L}-[{Lu}]] matches all letters except for uppercase letters. Similarly, [{ASSIGNED}-[{C}]] matches all assigned Unicode characters, except for any characters that fall into the "Other" category.
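
Subtraction is a set difference: an item matches if it is in the original set and not in the subtracted set. A minimal sketch of that logic follows, using ASCII-only `<cctype>` tests as stand-ins for the `{L}` and `{Lu}` categories. The function names are hypothetical, and the real categories cover the full Unicode character database.

```cpp
#include <cassert>
#include <cctype>

// ASCII-only stand-ins for the {L} and {Lu} categories.
bool isLetter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
bool isUppercase(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// [{L}-[{Lu}]]: in the original set, and not in the subtracted set.
bool isLetterButNotUppercase(char c) {
    return isLetter(c) && !isUppercase(c);
}

void checkSubtraction() {
    assert(isLetterButNotUppercase('a'));
    assert(!isLetterButNotUppercase('A'));  // removed by the subtraction
    assert(!isLetterButNotUppercase('1'));  // never in the original set
}
```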

Simple word boundaries

This feature of basic (Level 1) Unicode regular expressions is available through the use of the WB category, described in Table 4.

Simple loose matches

The only type of loose match described in UTR-18 for basic Unicode regular expressions is the caseless match. Caseless matching is available in RWURegularExpression through the use of the IgnoreCase option to the constructor.
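
For comparison, the standard library exposes caseless matching in a similar way, as a flag supplied when the pattern is compiled. The sketch below uses `std::regex::icase` as a stand-in for the IgnoreCase option; it is not RWURegularExpression, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in: std::regex::icase plays the role of the IgnoreCase option.
bool containsCaseless(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended | std::regex::icase);
    return std::regex_search(text, re);
}

// Same search without the caseless flag.
bool containsExact(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_search(text, re);
}

void checkCaseless() {
    assert(containsCaseless("abc", "xxABCxx"));
    assert(!containsExact("abc", "xxABCxx"));
}
```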

Line breaks

Line breaks can be matched using RWURegularExpression through the use of the {BOL} and {EOL} extended categories. ^ and $ are not used to denote the beginning and ending of lines, as this conflicts with the POSIX requirements for these characters. POSIX requires that these characters anchor only at the beginning and ending of an entire string.
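
The line terminators listed in Table 4 can be located with a simple scan. The following sketch, in plain C++, records the offset of each terminator and of the end of the string, treating `\u000D\u000A` as a single terminator. It reflects one reading of Table 4 and makes no claim about how RWURegularExpression reports these positions; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True for the single code units listed in Table 4 as line terminators.
bool isLineTerminator(char16_t u) {
    return u == 0x000A || u == 0x000B || u == 0x000C || u == 0x000D ||
           u == 0x0085 || u == 0x2028 || u == 0x2029;
}

// Offsets at which a line ends: the position of each terminator, plus the
// end of the string. \u000D\u000A counts as one terminator.
std::vector<std::size_t> lineEndOffsets(const std::u16string& text) {
    std::vector<std::size_t> ends;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (!isLineTerminator(text[i])) continue;
        ends.push_back(i);
        if (text[i] == 0x000D && i + 1 < text.size() && text[i + 1] == 0x000A) {
            ++i;  // skip the LF of a CR LF pair
        }
    }
    ends.push_back(text.size());
    return ends;
}

void checkLineEnds() {
    // "ab" CR LF "cd" LF "ef": lines end at offsets 2, 6, and 9.
    assert((lineEndOffsets(u"ab\r\ncd\nef") ==
            std::vector<std::size_t>{2, 6, 9}));
}
```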

8.4.3.2 Tailored Unicode Regular Expression Extensions

Tailored regular expression support extends basic regular expressions. Tailored regular expression support adds Level 2 and Level 3 regular expression support as described in UTR-18 Version 5.1 (http://www.unicode.org/reports/tr18/tr18-5.1.html).

Tailored regular expression support extends basic regular expression support in the following ways.

Tailored Unicode regular expression syntax extensions

Treating surrogate pairs as characters

Tailored support recognizes surrogate pairs during pattern compilation and during pattern matching. For example, consider the pattern, \uD800\uDC00*. With basic regular expressions, the pattern compiler does not recognize \uD800\uDC00 as a surrogate pair, and interprets the pattern as \uD800 followed by zero or more occurrences of \uDC00. However, with tailored support, \uD800\uDC00 is recognized as a single code point, and the pattern is interpreted as zero or more occurrences of the code point, \uD800\uDC00. During matching, full code points are extracted for testing against ".", categories, bracket sets, and all other constructs. Further, during search operations, only code point boundaries are considered as potential match starting positions.
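
The difference between code units and code points can be illustrated with a short scan over a UTF-16 string. This sketch, in plain C++, lists the offsets at which code points begin; with surrogate-aware matching, only these offsets would be candidate match start positions. It is independent of RWURegularExpression, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Offsets, in code units, at which a code point starts. A high surrogate
// followed by a low surrogate is treated as one code point.
std::vector<std::size_t> codePointStarts(const std::u16string& text) {
    std::vector<std::size_t> starts;
    std::size_t i = 0;
    while (i < text.size()) {
        starts.push_back(i);
        const bool pair = isHighSurrogate(text[i]) && i + 1 < text.size() &&
                          isLowSurrogate(text[i + 1]);
        i += pair ? 2 : 1;
    }
    return starts;
}

void checkCodePointStarts() {
    // 'a', the surrogate pair D800 DC00, then 'b': four code units,
    // three code points.
    const std::u16string text{u'a', static_cast<char16_t>(0xD800),
                              static_cast<char16_t>(0xDC00), u'b'};
    assert(text.size() == 4);
    assert((codePointStarts(text) == std::vector<std::size_t>{0, 1, 3}));
}
```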

The use of the script property

Tailored regular expressions allow for testing a code point for a script property. The script property uses a syntax similar to that of general categories. The syntax is as follows:

[{Script}]

As with categories, a script specification must appear in a bracket set, and must be surrounded by curly braces. Within the curly braces is the name of a script, which is case-sensitive. The following table lists scripts that are supported by tailored regular expressions.

Table 5: Script properties supported by tailored regular expressions

Property	Property
Common	Inherited
Arabic	Armenian
Bengali	Bopomofo
Cherokee	Coptic
Cyrillic	Deseret
Devanagari	Ethiopic
Georgian	Gothic
Greek	Gujarati
Gurmukhi	Han
Hangul	Hebrew
Hiragana	Kannada
Katakana	Khmer
Lao	Latin
Malayalam	Mongolian
Myanmar	Ogham
OldItalic	Oriya
Runic	Sinhala
Syriac	Tamil
Telugu	Thaana
Thai	Tibetan
Ucas	Yi

For example, the following pattern matches one or more occurrences of a character in the Thai script: [{Thai}]+

The ability to specify code points using \v syntax

The \v syntax is given as \vXXXXXX, where each X is a valid hexadecimal digit. The \v must be followed by exactly six valid hexadecimal digits. For example, the surrogate pair, \uD800\uDC00 could be specified as \v010000. \v escape sequences can appear anywhere in a pattern, including bracket expressions. Recall that, as with any escape sequence, the \v must be double-escaped when specified in C++ source code, \\v010000. The first escape is for the C++ compiler.
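
The relationship between a `\v` escape and its surrogate pair is standard UTF-16 arithmetic. The sketch below, in plain C++, formats a code point as `\vXXXXXX` and splits a supplementary code point into its surrogate pair. The helper names `toVEscape` and `toSurrogatePair` are hypothetical and not part of the Internationalization Module. The checks also show why the double backslash matters: in C++ source, a single `\v` is the vertical tab escape.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Formats a code point as the pattern text \vXXXXXX (six hex digits).
std::string toVEscape(char32_t codePoint) {
    char buffer[16];
    std::snprintf(buffer, sizeof buffer, "\\v%06lX",
                  static_cast<unsigned long>(codePoint));
    return buffer;
}

// Splits a supplementary code point (U+10000 through U+10FFFF) into its
// UTF-16 surrogate pair.
std::pair<char16_t, char16_t> toSurrogatePair(char32_t codePoint) {
    const char32_t offset = codePoint - 0x10000;
    return std::make_pair(static_cast<char16_t>(0xD800 + (offset >> 10)),
                          static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
}

void checkVEscape() {
    assert(toVEscape(0x10000) == "\\v010000");
    assert(std::string("\\v010000").size() == 8);  // backslash, v, six digits
    assert(std::string("\v010000").size() == 7);   // vertical tab, six digits
    assert(toSurrogatePair(0x10000).first == 0xD800);
    assert(toSurrogatePair(0x10000).second == 0xDC00);
}
```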

Matching canonical equivalents

Tailored regular expressions match canonical equivalents. For example, the pattern a\u0308 matches both a\u0308 and ä.

Specifying grapheme clusters

Tailored regular expressions allow for the specification of grapheme clusters using the \g syntax. The syntax for grapheme clusters is \g{grapheme}, where \g starts the grapheme cluster specification. The { and } must surround the grapheme cluster. Within the curly braces, the grapheme is specified. Grapheme clusters can appear anywhere in the pattern, including bracket sets. For example, the pattern ab\g{ch}d matches the string abchd. With the traditional Spanish locale, the pattern [\g{ch}-d] matches ch and d, but does not match c or e. Recall that, as with any escape sequence, the \g must be double-escaped when specified in C++ source code, as in \\g{ch}. The first escape is for the C++ compiler.

Performing all comparisons using collation

With tailored regular expressions, all comparisons are performed using Unicode collation. The type of collation can be specified using the setCollationStrength() method, and queried using the getCollationStrength() method. These methods may be used only with tailored regular expressions, and throw an unsupported error exception with basic regular expressions.

The collation support in RWURegularExpression is coarse-grained, meaning that it applies to the entire pattern. At this time, no fine-grained collation is supported.

If no collation strength is specified, then the default collation strength for the specified locale is used. For many locales, the default strength is Tertiary. For example, in the en locale, the pattern résumé uses tertiary collation strength by default. At this default level, the string résumé matches, but resume and Résumé do not. On the other hand, if the collation strength for the pattern is changed to Primary, then all of the following match: resume, résumé, and Résumé.
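
The effect of collation strength on the résumé example can be imitated with a toy comparison. The sketch below is deliberately simplistic: it folds only ASCII case and the letters é and É, whereas real Unicode collation is driven by locale data and distinguishes more levels. The `Strength` enum and the function names are hypothetical and are not the RWURegularExpression API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Strength { Primary, Tertiary };

// Toy primary key: ignores accents (only for e-acute) and ASCII case.
char32_t primaryKey(char32_t c) {
    if (c == U'\u00E9' || c == U'\u00C9') c = U'e';   // e-acute, E-acute
    if (c >= U'A' && c <= U'Z') c = c - U'A' + U'a';  // ASCII lowercase
    return c;
}

// At Primary strength only base letters are compared; at Tertiary strength
// accent and case differences count as well.
bool equalAt(Strength s, const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        char32_t x = a[i], y = b[i];
        if (s == Strength::Primary) { x = primaryKey(x); y = primaryKey(y); }
        if (x != y) return false;
    }
    return true;
}

void checkCollationStrength() {
    const std::u32string pattern = U"r\u00E9sum\u00E9";
    assert(equalAt(Strength::Tertiary, pattern, U"r\u00E9sum\u00E9"));
    assert(!equalAt(Strength::Tertiary, pattern, U"resume"));
    assert(!equalAt(Strength::Tertiary, pattern, U"R\u00E9sum\u00E9"));
    assert(equalAt(Strength::Primary, pattern, U"resume"));
    assert(equalAt(Strength::Primary, pattern, U"R\u00E9sum\u00E9"));
}
```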

Tailored regular expressions, by default, do not recognize graphemes (other than those specified with \g) during pattern compilation, or when matching "." (or any other element). As such, the pattern a\u0308+ matches an a followed by one or more occurrences of \u0308. Similarly, "." matches only the a in a\u0308.

As an alternative, the InterpretGraphemes option can be used with tailored regular expressions. If this option is given as a constructor argument for a tailored regular expression, then the pattern a\u0308+ is interpreted as one or more occurrences of a\u0308, or ä, or any other equivalent, and "." matches all of a\u0308.

The InterpretGraphemes option is ignored for basic regular expressions.

8.4.3.3 How to Use Tailored Regular Expressions

To allow RWURegularExpression to use the tailored regular expression features, you may pass RWURegularExpression::Tailored as the second argument of the constructor as follows:

RWURegularExpression re(SomeRWUString, RWURegularExpression::Tailored);

or you may construct first, then set the level:

re.setLevel(RWURegularExpression::Tailored);

For more information on creating a regular expression, see Section 8.4.4.

8.4.4 How to Create an RWURegularExpression

RWURegularExpression objects are constructed from pattern strings. The pattern string can be a string literal, an RWCString, or an RWUString. For example, this code creates an RWURegularExpression that could be used to search for a bold item encoded in the ASCII range of characters in an HTML document:

RWUConversionContext context("ascii");

RWUString pattern("<b>([\\u0020-\\u007f]*)</b>");  
RWURegularExpression r(pattern);

If an RWURegularExpression is constructed from a string literal or RWCString, the pattern data is expected to be NULL-terminated, and is converted to Unicode using the given converter. (See Chapter 4 for more information on converting between encodings.) If no converter is supplied, the converter managed by the current to-Unicode conversion context is used. Any escape sequences are unescaped.

Other optional arguments to the constructors include:

Options for pattern matching; currently only caseless matches are supported

The level of Unicode regular expression conformance, either basic or tailored (See Section 8.4.3 for more information on supported levels.)

The converter to use for character conversion

The locale to use (See Chapter 10 for more information on locales.)

The regular expression instance uses the locale to determine locale-specific behavior in a tailored regular expression; locales have little effect on basic regular expressions. Grapheme clusters, character sets, and the break locations for words, sentences, and lines may change depending on locale. For example, the Spanish character "ch" is found in the character set "[b-d]" in Spanish locales, but not in English.

You may also set the locale using the setLocale() method.

For example, the following code creates an RWURegularExpression that could be used to search for the characters abc at the end of the input string, without regard to case:

RWUConversionContext context("ascii");

RWUString pattern("abc$");
RWURegularExpression r(pattern, 
                       RWURegularExpression::Basic,
                       RWURegularExpression::IgnoreCase);

Similarly, this pattern uses character categories to search for line breaks in accordance with the conventions of the zh_TW locale:

RWUString  
   pattern("^[{L}{Zs}]+[{BOL}][{L}{Zs}]+[{EOL}][{L}{Zs}]+$");
RWURegularExpression r(pattern, 
                       RWURegularExpression::Basic,
                       RWURegularExpression::Normal,
                       RWULocale("zh_TW"));

8.4.5 Searching for Pattern Matches

RWURegularExpression provides two interfaces for searching strings for occurrences of regular expression pattern matches: matchAt() and search().

The overloaded matchAt() methods test whether a match starts at a specified position in the input string. Positions are specified using RWUConstStringIterator instances. For example, assuming pattern is an RWUString representing a regular expression pattern and str is an RWUString representing the input string, this code tests for a match of pattern at position 3 in str:

RWURegularExpression r(pattern); 
RWUConstStringIterator pos = 
   str.beginCodePointIterator().advanceCodePoints(3);

RWURegexResult result = r.matchAt(str, pos);

Matches that may begin before or after position 3 are not reported. Similarly, this code tests for a match of pattern at position 3 in str, and not extending beyond position 8:

RWURegularExpression r(pattern); 
RWUConstStringIterator pos = 
   str.beginCodePointIterator().advanceCodePoints(3);
RWUConstStringIterator end = 
   str.beginCodePointIterator().advanceCodePoints(8);

RWURegexResult result = r.matchAt(str, pos, end);

Similarly, the overloaded search() methods search an input string for an occurrence of a regular expression pattern. For instance:

RWURegularExpression r(pattern); 
RWURegexResult result = r.search(str);

By default, the search begins at the beginning of the string, and continues until either the end of the string is reached, or a match is found. Optional arguments allow you to specify other start and end positions for the search. For example, this code begins searching at position 5, and continues until either position 21 is reached, or a match is found:

RWURegularExpression r(pattern); 
RWUConstStringIterator start = 
   str.beginCodePointIterator().advanceCodePoints(5);
RWUConstStringIterator end = 
   str.beginCodePointIterator().advanceCodePoints(21);

RWURegexResult result = r.search(str, start, end);
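
The distinction between the two interfaces has a close analogue in the C++ standard library, which may help when reasoning about it. In the sketch below, `std::regex_search` with the `match_continuous` flag behaves like matchAt(), and a plain `std::regex_search` over a sub-range behaves like a bounded search(). This is a stand-in for illustration: the helper names are hypothetical, offsets are `char` positions rather than RWUConstStringIterator instances, and the callers are assumed to pass offsets within the string.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Analogue of matchAt(): succeeds only if a match starts exactly at pos
// and does not extend beyond end. Assumes pos <= end <= text.size().
bool matchesAt(const std::regex& re, const std::string& text,
               std::size_t pos, std::size_t end) {
    return std::regex_search(text.begin() + pos, text.begin() + end, re,
                             std::regex_constants::match_continuous);
}

// Analogue of search(): succeeds if a match lies anywhere in [start, end).
bool foundWithin(const std::regex& re, const std::string& text,
                 std::size_t start, std::size_t end) {
    return std::regex_search(text.begin() + start, text.begin() + end, re);
}

void checkMatchAtAndSearch() {
    const std::regex re("cd", std::regex::extended);
    const std::string text = "abcdef";
    assert(matchesAt(re, text, 2, 6));     // "cd" starts at offset 2
    assert(!matchesAt(re, text, 1, 6));    // a later match is not reported
    assert(!matchesAt(re, text, 2, 3));    // match would extend past end
    assert(foundWithin(re, text, 0, 6));
    assert(!foundWithin(re, text, 3, 6));  // search starts after the match
}
```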

8.4.6 Manipulating Match Results

Match results from the RWURegularExpression methods search() and matchAt() are returned as RWURegexResult objects. These instances can be used later to obtain details concerning the regular expression match.

For example, this class contains a conversion to bool, which indicates whether the search() or matchAt() operation found a match. Thus:

RWURegularExpression r(pattern); 
RWURegexResult result = r.search(str);

if (result) {
   // Do something here
}

You can obtain standard iterators to the beginning and ending of the overall match, or of a subexpression match, from an RWURegexResult using the begin() and end() methods, respectively. You can also use the provided getStart() and getLength() methods to find the extent of a match. For example:

RWURegularExpression r(pattern); 
RWURegexResult result = r.search(str);

if (result) {
    std::cout << "Match at offset: " << result.getStart() << "\n"
              << "Match length: " << result.getLength()
              << std::endl;
}

The provided subString() method returns the substring for a match as an immutable RWUConstSubString.
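
The same three pieces of information (start, length, and matched text) are available from `std::smatch` in the standard library, for the overall match and for each subexpression. The sketch below gathers them into a small struct. It is a stand-in for RWURegexResult, not its API; `MatchInfo` and `firstMatch` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Extent and text of one match, or of one subexpression within it.
struct MatchInfo {
    bool found;
    std::size_t start;
    std::size_t length;
    std::string text;
};

// Details of the first match of re in text. group 0 is the overall match;
// group 1, 2, ... are the parenthesized subexpressions.
MatchInfo firstMatch(const std::regex& re, const std::string& text,
                     std::size_t group = 0) {
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return MatchInfo{false, 0, 0, std::string()};
    }
    return MatchInfo{true,
                     static_cast<std::size_t>(m.position(group)),
                     static_cast<std::size_t>(m.length(group)),
                     m.str(group)};
}

void checkFirstMatch() {
    const std::regex re("b(c+)d", std::regex::extended);
    const std::string text = "abcccde";
    const MatchInfo whole = firstMatch(re, text);
    assert(whole.found && whole.start == 1 && whole.length == 5);
    assert(whole.text == "bcccd");
    const MatchInfo sub = firstMatch(re, text, 1);
    assert(sub.found && sub.start == 2 && sub.length == 3);
    assert(sub.text == "ccc");
}
```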

8.4.7 Replacing Pattern Matches

The overloaded replace() methods replace occurrences of a regular expression pattern in an input string with a given replacement string. A count argument allows you to specify how many matches are replaced. By default, only the first match is replaced; specifying a count of 0 replaces all occurrences of the pattern. A matchID argument names the subexpression match that is replaced; the default value is 0, which replaces the overall match. For instance, this code replaces the first five occurrences of pattern in str with replacement:

RWURegularExpression r(pattern);
size_t num = r.replace(str, replacement, 5, 0);

The function replace() returns the number of occurrences of the pattern that are actually replaced. For example, if str contains only three occurrences of pattern, then num equals 3 in the code above.

Overloads of replace() enable you to specify the start and end positions in the input string for the replace operation. Positions are specified using RWUConstStringIterator instances. Thus, this code replaces all occurrences of pattern in str from the beginning of the string to position 25:

RWURegularExpression r(pattern);
RWUConstStringIterator start = str.beginCodePointIterator();
RWUConstStringIterator end = 
   str.beginCodePointIterator().advanceCodePoints(25);

size_t num = r.replace(str, replacement, 0, 0, start, end);

Finally, the Boolean replaceEmptyMatches argument allows you to specify whether or not empty (zero-length) matches should be replaced. The default is true. For example, this code sets replaceEmptyMatches to false:

size_t num = r.replace(str,
                       replacement,
                       0,
                       0,
                       str.beginCodePointIterator(),
                       str.endCodePointIterator(),
                       false);
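
The count and replaceEmptyMatches semantics described above can be sketched with the standard library. The function below is a hypothetical stand-in written for this example, not the RWURegularExpression implementation: it replaces up to `count` matches (0 meaning all), optionally skips zero-length matches, and returns the number of replacements made. The replacement text is inserted literally, and matchID and the start and end positions are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Replaces up to count matches of re in text (0 means all) and returns how
// many were replaced. Zero-length matches are skipped when
// replaceEmptyMatches is false.
std::size_t replaceMatches(std::string& text, const std::regex& re,
                           const std::string& replacement,
                           std::size_t count = 1,
                           bool replaceEmptyMatches = true) {
    const std::string source = text;  // iterate over a stable copy
    std::string out;
    std::size_t last = 0, replaced = 0;
    for (std::sregex_iterator it(source.begin(), source.end(), re), end;
         it != end && (count == 0 || replaced < count); ++it) {
        if (it->length(0) == 0 && !replaceEmptyMatches) continue;
        const std::size_t at = static_cast<std::size_t>(it->position(0));
        out.append(source, last, at - last);  // text before this match
        out += replacement;
        last = at + static_cast<std::size_t>(it->length(0));
        ++replaced;
    }
    out.append(source, last, std::string::npos);  // text after last match
    text = out;
    return replaced;
}

void checkReplaceMatches() {
    const std::regex re("a", std::regex::extended);
    std::string s = "a-a-a-a";
    assert(replaceMatches(s, re, "X", 2) == 2 && s == "X-X-a-a");
    std::string t = "a-a-a";
    assert(replaceMatches(t, re, "X", 5) == 3 && t == "X-X-X");
}
```

As in the text above, asking for five replacements when only three matches exist yields a return value of 3.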

8.4.8 Iterating Over Pattern Matches

RWURegexMatchIterator provides a convenient interface for finding all successive matches of a particular regular expression pattern in a string. RWURegexMatchIterator is a forward iterator, allowing forward searches over the specified string using pre-increment and post-increment operators. For example:

RWURegularExpression r(pattern);
 
for (RWURegexMatchIterator iter(r, str);
     iter != RWURegexMatchIterator(); ++iter) {
   std::cout << "Match at offset: " << iter->getStart()
             << std::endl;
}

Note the use of the default RWURegexMatchIterator constructor, which creates an invalid iterator that can be used to test for the end-of-iteration condition.

As with many iterators, changing the item(s) being iterated over invalidates the match iterator. If the regular expression pattern or search string used by an RWURegexMatchIterator is changed, then the match iterator is invalidated.
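
The standard library's `std::sregex_iterator` follows the same pattern, including a default-constructed iterator that serves as the end-of-iteration marker. The sketch below collects the start offset of every match. It is a stand-in for RWURegexMatchIterator rather than its API, and `matchOffsets` is a hypothetical name. The same caution applies here: the string and the regular expression must remain unchanged while the iterator is in use.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Start offsets of all successive matches of re in text.
std::vector<std::size_t> matchOffsets(const std::regex& re,
                                      const std::string& text) {
    std::vector<std::size_t> offsets;
    // The default-constructed iterator marks the end of the sequence.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        offsets.push_back(static_cast<std::size_t>(it->position(0)));
    }
    return offsets;
}

void checkMatchOffsets() {
    const std::regex re("ab", std::regex::extended);
    assert((matchOffsets(re, "ab-ab--ab") ==
            std::vector<std::size_t>{0, 3, 7}));
    assert(matchOffsets(re, "xyz").empty());
}
```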
