RWURegularExpression

Internationalization Module Reference Guide

Module: Internationalization Module
Group: Unicode String Processing


Does Not Inherit

Header File

#include <rw/i18n/RWURegularExpression.h>

Description

RWURegularExpression supports regular expressions with Unicode extensions.

A regular expression is a string pattern composed of normal characters and special characters. Special characters are used to denote an arrangement of the other characters in the regular expression pattern. A regular expression can be used to search for, and perhaps replace, occurrences of the regular expression pattern in strings.

Regular expression syntax describes how to arrange normal characters and special characters to form a valid regular expression pattern. The regular expression syntax for RWURegularExpression is similar to that of the POSIX 2 extended regular expression (ERE) specification. For more information see Section 8.4.2, "POSIX Extended Regular Expression Syntax," in the Internationalization Module User's Guide.

RWURegularExpression extends the POSIX 2 ERE syntax to provide support for Unicode basic and tailored regular expressions.

Basic Unicode regular expression support corresponds to Level 1 support, as described in the Unicode Regular Expression Guidelines (Unicode Technical Report #18 (UTR-18) Version 5.1 at http://www.unicode.org/reports/tr18/tr18-5.1.html). Basic Unicode regular expressions are useful for the majority of Unicode strings, and extend the POSIX ERE standard with the following Unicode extensions:

Hexadecimal notation
Character categories
Subtraction
Simple word boundaries
Simple loose matches
Line breaks

Tailored Unicode regular expressions extend the basic regular expression functionality, corresponding to Level 2 and Level 3 support, also described in UTR-18 Version 5.1. In addition to some minor additions, tailored extensions include support for:

Treating surrogate pairs as single characters
Using the script property
Matching canonically equivalent character representations
Specifying grapheme clusters

For more information on basic and tailored regular expression support in the Internationalization Module, see Section 8.4.3, "Unicode Regular Expressions," in the Internationalization Module User's Guide.
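To make the first tailored extension concrete, the following is a minimal sketch of what "treating surrogate pairs as single characters" means for a UTF-16 string. It uses only the C++ standard library; the helper names (isHighSurrogate, isLowSurrogate, countCodePoints, demoSurrogates) are hypothetical and are not part of the Internationalization Module.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if cu is a high (lead) surrogate, U+D800..U+DBFF.
inline bool isHighSurrogate(char16_t cu) { return cu >= 0xD800 && cu <= 0xDBFF; }

// True if cu is a low (trail) surrogate, U+DC00..U+DFFF.
inline bool isLowSurrogate(char16_t cu) { return cu >= 0xDC00 && cu <= 0xDFFF; }

// Counts code points in a UTF-16 string, treating each well-formed
// surrogate pair as a single character.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            ++i;  // skip the trail surrogate of the pair
        }
        ++count;
    }
    return count;
}

// Hypothetical demonstration: U+1D11E needs two UTF-16 code units.
void demoSurrogates()
{
    const std::u16string s = u"a\U0001D11Eb";
    assert(s.size() == 4);            // four code units
    assert(countCodePoints(s) == 3);  // three characters
}
```

A matcher that works on code units sees four positions in this string, while one that treats the pair as a unit sees three.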

The Role of the Locale in a Regular Expression

RWURegularExpression accepts an RWULocale argument in its constructor, or via the setLocale() method. The regular expression instance uses the locale to determine locale-specific behavior in a tailored regular expression (locales have little effect on basic regular expressions). Grapheme clusters, character sets, and the break locations for words, sentences, and lines may change depending on locale. For example, the Spanish character "ch" is found in the character set "[b-d]" in Spanish locales, but not in English locales.

For more information on creating regular expressions, see Section 8.4.4, "How to Create an RWURegularExpression," in the Internationalization Module User's Guide.

Example

#include <rw/i18n/RWURegularExpression.h>
#include <rw/i18n/RWUConversionContext.h>
#include <rw/i18n/RWUString.h>
#include <iostream> 

using std::cout;
using std::endl;

int
main()
{
  // Indicate string literals are encoded as US-ASCII strings.
  RWUConversionContext context("US-ASCII");

  // Create a string in which to search.
  RWUString text("The quick brown fox.");

  // Create a regular expression to search for "own" as a
  // distinct word.  The character category [{WB}] will be
  // interpreted in terms of the default locale.  Use
  // RWURegularExpression::setLocale() to interpret breaks
  // in terms of a different locale.
  RWURegularExpression regexp("[{WB}]own[{WB}]");

  // This search should fail because "own" appears only
  // within the word "brown" and not as a distinct word.
  RWURegexResult result = regexp.search(text);
  if (result) {
    cout << "Overall match at offset " << int32_t(result.begin(text))
         << " with length " << result.getLength() << "." << endl;
  } else {
    cout << "No match" << endl;
  } // else
  
  // Create a regular expression to search for "quick" as
  // a distinct word.  
  regexp = RWURegularExpression("[{WB}]quick[{WB}]");

  // This search should succeed.
  result = regexp.search(text);
  if (result) {
    cout << "Overall match at offset " << int32_t(result.begin(text))
         << " with length " << result.getLength() << "." << endl;
  } else {
    cout << "No match" << endl;
  } // else
 
  return 0;
} // main

Results:
========

No match
Overall match at offset 4 with length 5.

Public Enums

enum Options { Normal,
               IgnoreCase,
               InterpretGraphemes
};

Lists options for changing the behavior of RWURegularExpression pattern matching. The Normal value specifies normal pattern matching operations, with no special options enabled. The IgnoreCase value indicates that characters in the pattern string and search string should be compared without regard to case.

The InterpretGraphemes option is valid only with tailored regular expressions. It causes the pattern compiler to recognize graphemes such as "a\u0308" as a single unit. This changes, for instance, how cardinalities are applied. For example, with this setting, "a\u0308*" matches zero or more occurrences of anything equivalent to "a\u0308", whereas without this option the pattern matches an "a" followed by zero or more occurrences of "\u0308".

Further, InterpretGraphemes changes the behavior of ".". With this option, "." matches any logical character, including graphemes, except for the end of a logical line. Without this option, "." matches any code point except for one that indicates the end of a logical line. (For a list of the specific characters excepted, see Section 8.4.2, "POSIX Extended Regular Expression Syntax," in the Internationalization Module User's Guide.)
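The two readings of "a\u0308*" described above can be sketched with hand-written matchers. This is only an illustration in standard C++, not the library's matching engine: the function names are hypothetical, and the sketch compares literal code units, so it does not model canonical equivalence (for example, it would not treat U+00E4 as equivalent to "a\u0308").

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Without grapheme interpretation: 'a' followed by zero or more U+0308.
// Returns the match length in code units at the start of text, or npos.
std::size_t matchCodeUnitStar(const std::u16string& text)
{
    if (text.empty() || text[0] != u'a') return std::u16string::npos;
    std::size_t len = 1;
    while (len < text.size() && text[len] == u'\u0308') ++len;
    return len;
}

// With grapheme interpretation: zero or more repetitions of the whole
// unit "a\u0308". A zero-length match is always possible.
std::size_t matchGraphemeStar(const std::u16string& text)
{
    const std::u16string unit = u"a\u0308";
    std::size_t len = 0;
    while (text.compare(len, unit.size(), unit) == 0) len += unit.size();
    return len;
}

// Hypothetical demonstration on the text "a\u0308" repeated twice, then "b".
void demoGraphemeCardinality()
{
    const std::u16string text = u"a\u0308a\u0308b";
    assert(matchGraphemeStar(text) == 4);  // both graphemes consumed
    assert(matchCodeUnitStar(text) == 2);  // stops at the second 'a'
}
```

The same pattern text therefore yields matches of different lengths depending on which unit the cardinality applies to.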

enum Status { Ok,
              MissingEscapeSequence,
              InvalidHexNibble,
              InsufficientHex8Data,
              InsufficientHex16Data,
              MissingClosingBracket,
              MissingClosingCurlyBrace,
              MissingClosingParen,
              UnmatchedClosingParen,
              InvalidSubexpression,
              InvalidDataAfterOr,
              InvalidDataBeforeOr,
              ConsecutiveCardinalities,
              InvalidCardinalityRange,
              LeadingCardinality,
              InvalidDecimalDigit,
              UnmatchedClosingCurly,
              NeverEndingCategoryName,
              InvalidCategoryName,
              InfiniteEmptyMatch,
              ASCIIConversionError,
              InvalidGraphemeCluster,
              NumberOfStatusCodes
};

Lists regular expression pattern error codes that could be reported during regular expression pattern compilation. These error codes are reported through an exception of type RWRegexErr. The values have the following meanings:

Ok indicates that the pattern has been successfully compiled.
MissingEscapeSequence indicates a missing escape sequence, as in "ab\".
InvalidHexNibble indicates an invalid hexadecimal escape sequence, as in "ab\u00fg".
InsufficientHex8Data indicates an insufficient number of hex nibbles in an 8-bit hexadecimal escape sequence, as in "ab\x0".
InsufficientHex16Data indicates an insufficient number of hex nibbles in a 16-bit hexadecimal escape sequence, as in "ab\u00f".
MissingClosingBracket indicates a missing closing bracket on a bracket expression, as in "ab[cd".
MissingClosingCurlyBrace indicates a missing closing curly brace in a cardinality specification, as in "(abc){2,3".
MissingClosingParen indicates a missing closing parenthesis in a subexpression definition, as in "ab(c(d)ef".
UnmatchedClosingParen indicates that a closing parenthesis was found for which there is no opening parenthesis, as in "ab(cd)e)f".
InvalidSubexpression indicates that an invalid subexpression specification has been encountered, as in "ab(*cd)".
InvalidDataAfterOr indicates that the character following an alternation symbol | was considered invalid, as in "ab|*cd" or "ab||cd".
InvalidDataBeforeOr indicates that the data preceding an alternation symbol | was considered invalid, as in "|", "|bc", or "ab(|cd)".
ConsecutiveCardinalities indicates that consecutive cardinality specifiers were found in the pattern, as in "a*+" or "ab{2,3}*".
InvalidCardinalityRange indicates that an invalid cardinality range was specified, as in "ab{,}" or "a{}".
LeadingCardinality indicates that a leading cardinality specifier was encountered, as in "*a".
InvalidDecimalDigit indicates that an invalid decimal digit was encountered in a pattern string, as in "ab{3,a}".
UnmatchedClosingCurly indicates that a closing curly brace was encountered for which there was no matching opening curly brace, as in "ab2,3}".
NeverEndingCategoryName indicates that a category name was started, but that no closing curly brace was found to end the category name, as in "[{L]+123".
InvalidCategoryName indicates that an unrecognized category name was specified in a bracket expression, as in "[{Smile}]".
InfiniteEmptyMatch indicates that a category that could produce a zero-length match was found with infinite cardinality. Such categories include: Word Break (WB), Character Break (CB), Line Break (LB), Sentence Break (SB), Beginning of Line (BOL), and End of Line (EOL). For example, the following are invalid: "[{WB}]*" and "ab([{WB}])*cd".
ASCIIConversionError indicates that a problem was encountered while converting an ASCII pattern string to UTF16. This can occur only when using the RWCString conversion constructor.
InvalidGraphemeCluster indicates that an invalid grapheme cluster specification was found. This implies that the grapheme cluster did not follow the syntax "\g{...}", where ... is any sequence of code units. For example, "\gab}" is invalid because of a missing opening curly brace.
NumberOfStatusCodes indicates the number of status codes potentially reported during the compilation of regular expression patterns. For internal use only.
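The difference between the escape-related codes is easiest to see by checking the sample patterns listed above. The sketch below is a hypothetical, stand-alone checker in standard C++; the enum EscapeStatus and the function checkEscapes are illustrative names, and the logic is an assumption about how such a check could work, not the library's pattern compiler.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Stand-in status codes named after the corresponding Status values.
enum EscapeStatus {
    EscOk,
    EscMissingEscapeSequence,
    EscInvalidHexNibble,
    EscInsufficientHex16Data
};

// Scans a pattern and reports the first problem found in a backslash
// escape. Only "\u" followed by four hex digits is validated in detail.
EscapeStatus checkEscapes(const std::string& pattern)
{
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] != '\\') continue;
        if (i + 1 == pattern.size()) return EscMissingEscapeSequence;
        if (pattern[i + 1] != 'u') { ++i; continue; }  // skip escaped char
        for (std::size_t k = 0; k < 4; ++k) {
            const std::size_t pos = i + 2 + k;
            if (pos >= pattern.size()) return EscInsufficientHex16Data;
            if (!std::isxdigit(static_cast<unsigned char>(pattern[pos])))
                return EscInvalidHexNibble;
        }
        i += 5;  // skip 'u' and the four hex digits
    }
    return EscOk;
}

// Hypothetical demonstration using the sample patterns from the list.
void demoEscapeStatus()
{
    assert(checkEscapes("ab\\") == EscMissingEscapeSequence);
    assert(checkEscapes("ab\\u00fg") == EscInvalidHexNibble);
    assert(checkEscapes("ab\\u00f") == EscInsufficientHex16Data);
    assert(checkEscapes("ab\\u00e4") == EscOk);
}
```

In the library itself these conditions are reported through RWRegexErr during pattern compilation, as stated above; the sketch only mirrors the classification.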

enum UnicodeConformanceLevel { Basic, Tailored };

Specifies the level of Unicode regular expression support available through RWURegularExpression. Two levels are available: basic (Level 1), and tailored (Levels 2 and 3), both described in Version 5.1 of Unicode Technical Report #18 (UTR-18), available at: http://www.unicode.org/reports/tr18/tr18-5.1.html. See also Section 8.4.3, "Unicode Regular Expressions," in the Internationalization Module User's Guide.

Basic: Specifies basic Unicode regular expression support.
Tailored: Specifies tailored Unicode regular expression support, which adds full support for surrogates, and locale-based handling of graphemes and string collation.

Public Constructors

RWURegularExpression();

Default constructor. Creates an empty regular expression pattern object that does not match any input string.

explicit RWURegularExpression(const char* pattern,
           UnicodeConformanceLevel level = Basic,
           int32_t options = int32_t(Normal),
           const RWULocale& locale = RWULocale::getDefault(),
           RWUToUnicodeConverter& converter = 
             RWUToUnicodeConversionContext::
              getContext().getConverter());

Constructs an RWURegularExpression from the null-terminated char* pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any \u escape sequences are handled as for RWUString::unescape().

The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Throws std::bad_alloc if memory resources are exhausted during pattern compilation. Throws RWRegexErr to report pattern compilation errors.
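The bit-mask form of the options argument can be sketched as follows. The enum here is a stand-in that repeats the declaration shown under Public Enums, so the enumerators take the values 0, 1, and 2 by the usual C++ rules; whether the library header assigns the same numeric values is an assumption. The helpers combine and hasOption are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for RWURegularExpression::Options (values assumed: 0, 1, 2).
enum Options { Normal, IgnoreCase, InterpretGraphemes };

// Combines two options into an int32_t bit-mask.
inline std::int32_t combine(Options a, Options b)
{
    return std::int32_t(a) | std::int32_t(b);
}

// Tests whether a given option bit is set in the mask. Normal is zero,
// so it never tests as "set"; it denotes the absence of special options.
inline bool hasOption(std::int32_t mask, Options opt)
{
    return (mask & std::int32_t(opt)) != 0;
}

// Hypothetical demonstration of building and querying a mask.
void demoOptionsMask()
{
    const std::int32_t mask = combine(IgnoreCase, InterpretGraphemes);
    assert(hasOption(mask, IgnoreCase));
    assert(hasOption(mask, InterpretGraphemes));
    assert(!hasOption(std::int32_t(Normal), IgnoreCase));
}
```

A mask built this way is the kind of value the options parameter expects, for example std::int32_t(IgnoreCase) for case-insensitive matching alone.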

explicit RWURegularExpression(const RWCString& pattern,
           UnicodeConformanceLevel level = Basic,
           int32_t options = int32_t(Normal),
           const RWULocale& locale = RWULocale::getDefault(),
           RWUToUnicodeConverter& converter =
             RWUToUnicodeConversionContext::
              getContext().getConverter());

Constructs an RWURegularExpression from the RWCString pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any \u escape sequences are handled as for RWUString::unescape().

The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Throws std::bad_alloc if memory resources are exhausted during pattern compilation. Throws RWRegexErr to report pattern compilation errors.

explicit RWURegularExpression(const RWUString& pattern,
           UnicodeConformanceLevel level = Basic,
           int32_t options = int32_t(Normal),
           const RWULocale& locale = RWULocale::getDefault());

Constructs an RWURegularExpression from the RWUString pattern.

The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Throws std::bad_alloc if memory resources are exhausted during pattern compilation. Throws RWRegexErr to report pattern compilation errors.

RWURegularExpression(const RWURegularExpression& source);

Copy constructor. Creates a copy of the source RWURegularExpression object. Throws std::bad_alloc if memory resources are exhausted during pattern compilation.

Public Destructor

~RWURegularExpression();

Destructor.

Public Member Operators

RWURegularExpression&
operator=(const RWURegularExpression& rhs);

Assigns the rhs regular expression object to self.

bool
operator<(const RWURegularExpression& rhs);

Compares two regular expression objects. The comparison is performed using RWUString::operator< to compare the pattern strings stored in each regular expression. Returns true if self's pattern is less than the rhs pattern; otherwise, false.

bool
operator==(const RWURegularExpression& rhs);

Compares two regular expression objects. The comparison is performed using RWUString::operator== to compare the pattern strings stored in each regular expression. Returns true if self's pattern is equal to the rhs pattern; otherwise, false.
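Because both operators are described as comparing only the stored pattern strings, two objects with the same pattern would compare equal even if other settings differ. The sketch below illustrates that consequence with a hypothetical stand-in type (RegexKey) in standard C++; it uses std::u16string comparison in place of RWUString, which is an assumption about ordering details.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Stand-in for a regular expression object: a pattern plus one setting.
struct RegexKey {
    std::u16string pattern;
    bool ignoreCase;  // deliberately ignored by the comparisons below
};

// Ordering and equality use the pattern string only, as described above.
inline bool operator<(const RegexKey& lhs, const RegexKey& rhs)
{
    return lhs.pattern < rhs.pattern;
}
inline bool operator==(const RegexKey& lhs, const RegexKey& rhs)
{
    return lhs.pattern == rhs.pattern;
}

// Inserts three keys, two of which share a pattern, into an ordered set
// and returns how many distinct entries remain.
std::size_t countDistinctPatterns()
{
    std::set<RegexKey> keys;
    keys.insert(RegexKey{u"[{WB}]quick[{WB}]", false});
    keys.insert(RegexKey{u"[{WB}]quick[{WB}]", true});
    keys.insert(RegexKey{u"[{WB}]own[{WB}]", false});
    return keys.size();
}
```

In this sketch the set ends up with two entries, since the second insertion is treated as a duplicate of the first.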

Public Member Functions

RWUCollator::CollationStrength
getCollationStrength() const;

Returns the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions. Throws RWUException if invoked on a basic regular expression.

UnicodeConformanceLevel
getLevel() const;

Returns the current level of Unicode regular expression support associated with self.

RWULocale
getLocale() const;

Returns a copy of the locale used by self.

int32_t
getOptions() const;

Returns the pattern matching Options associated with self as an int32_t bit-mask.

RWUString
getPattern() const;

Returns the RWUString pattern string currently associated with self.

RWURegexResult
matchAt(const RWUString& str) const;

Tests for a match for this regular expression at the first character position in input string str. Does not find matches that begin after this position.

RWURegexResult
matchAt(const RWUString& str,
        const RWUConstStringIterator& start) const;

Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches that begin at any other position.

RWURegexResult
matchAt(const RWUString& str,
        const RWUConstStringIterator& start,
        const RWUConstStringIterator& end) const;

Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches that begin anywhere other than the start position, or that end after the end position.
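The anchored behavior of matchAt(), as opposed to a scan that moves forward through the string, can be illustrated with std::regex from the C++ standard library. This is an analogy under stated assumptions, not the RWURegularExpression API: searchFrom and matchesAt are hypothetical helpers, offsets are byte offsets into a std::string, and std::regex_constants::match_continuous plays the role of anchoring at the start position.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Scanning search: returns the offset of the first match that begins at
// or after start, or -1 if there is none.
std::ptrdiff_t searchFrom(const std::string& text, const std::regex& re,
                          std::size_t start)
{
    if (start > text.size()) return -1;
    std::smatch m;
    const auto first = text.begin() + static_cast<std::ptrdiff_t>(start);
    if (!std::regex_search(first, text.end(), m, re)) return -1;
    return static_cast<std::ptrdiff_t>(start) + m.position(0);
}

// Anchored test: succeeds only if a match begins exactly at start.
bool matchesAt(const std::string& text, const std::regex& re,
               std::size_t start)
{
    if (start > text.size()) return false;
    std::smatch m;
    const auto first = text.begin() + static_cast<std::ptrdiff_t>(start);
    return std::regex_search(first, text.end(), m, re,
                             std::regex_constants::match_continuous);
}

// Hypothetical demonstration on the text used in the Example section.
void demoMatchAt()
{
    const std::string text = "The quick brown fox.";
    const std::regex re("quick");
    assert(searchFrom(text, re, 0) == 4);  // found by scanning forward
    assert(!matchesAt(text, re, 0));       // nothing begins at offset 0
    assert(matchesAt(text, re, 4));        // match begins exactly here
}
```

The anchored form is useful when the caller already knows where a token should begin and wants a yes-or-no answer without scanning the rest of the string.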

size_t
replace(RWUString& str,
        const RWUString& replacement,
        size_t count = size_t(1),
        int32_t matchID = 0) const;

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. The default count is 1. Specifying a count of 0 replaces all occurrences of the pattern. Returns the number of replacements. Empty (zero-length) matches are replaced.

size_t
replace(RWUString& str, 
        const RWUString& replacement, size_t count,
        int32_t matchID,
        const RWUConstStringIterator& start) const;

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at the specified start position. Returns the number of replacements. Empty (zero-length) matches are replaced.

size_t
replace(RWUString& str, 
        const RWUString& replacement, size_t count,
        int32_t matchID, const RWUConstStringIterator& start,
        const RWUConstStringIterator& end,
        bool replaceEmptyMatches = true) const;

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at a specified start position. No match that extends beyond the specified end position is replaced. The method also allows you to specify whether or not empty (zero-length) matches should be replaced; the default is true.
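The count convention shared by the replace() overloads (a positive count limits the number of replacements, and 0 means "replace all") can be sketched with std::regex. The helper replaceUpTo is a hypothetical stand-in written against std::string; it treats the replacement as literal text and does not model the matchID, start, end, or replaceEmptyMatches parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Replaces up to count matches of re in str with replacement (count == 0
// means no limit) and returns the number of replacements performed.
std::size_t replaceUpTo(std::string& str, const std::regex& re,
                        const std::string& replacement,
                        std::size_t count = 1)
{
    std::string out;
    std::size_t done = 0;
    auto last = str.cbegin();  // end of the previous match
    for (std::sregex_iterator it(str.cbegin(), str.cend(), re), end;
         it != end; ++it) {
        if (count != 0 && done == count) break;
        out.append(last, (*it)[0].first);  // text before this match
        out += replacement;
        last = (*it)[0].second;
        ++done;
    }
    out.append(last, str.cend());  // remaining unmatched tail
    str = out;
    return done;
}

// Hypothetical demonstration of the default count and of count == 0.
void demoReplaceCount()
{
    const std::regex dash("-");
    std::string one = "a-b-c-d";
    assert(replaceUpTo(one, dash, "+") == 1);
    assert(one == "a+b-c-d");
    std::string all = "a-b-c-d";
    assert(replaceUpTo(all, dash, "+", 0) == 3);
    assert(all == "a+b+c+d");
}
```

Returning the number of replacements lets the caller distinguish "nothing matched" from "matched and replaced" without comparing strings.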

RWURegexResult 
search(const RWUString& str) const;

Searches input string str for substrings that match this regular expression. The search begins at the beginning of the string, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.

RWURegexResult
search(const RWUString& str,
       const RWUConstStringIterator& start) const;

Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.

RWURegexResult
search(const RWUString& str, 
       const RWUConstStringIterator& start,
       const RWUConstStringIterator& end) const;

Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the specified end position is reached, or a match is found. No match that extends beyond the specified end position is found. Returns an instance of RWURegexResult to report the result of the operation.
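The effect of the end position, where a match that would extend past it is not reported, can be illustrated by restricting the searched range. As before, this is a standard C++ analogy with a hypothetical helper (searchWithin) and byte offsets, not the library's iterator-based interface.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns the offset of the first match lying entirely within
// [start, end), or -1 if there is none or the range is invalid.
std::ptrdiff_t searchWithin(const std::string& text, const std::regex& re,
                            std::size_t start, std::size_t end)
{
    if (start > end || end > text.size()) return -1;
    const auto first = text.begin() + static_cast<std::ptrdiff_t>(start);
    const auto last = text.begin() + static_cast<std::ptrdiff_t>(end);
    std::smatch m;
    if (!std::regex_search(first, last, m, re)) return -1;
    return static_cast<std::ptrdiff_t>(start) + m.position(0);
}

// Hypothetical demonstration: "brown" occupies offsets 10 through 14.
void demoBoundedSearch()
{
    const std::string text = "The quick brown fox.";
    const std::regex re("brown");
    assert(searchWithin(text, re, 0, 15) == 10);  // fits before the end
    assert(searchWithin(text, re, 0, 14) == -1);  // would extend past it
}
```

Bounding the range this way is a simple means of limiting a search to one field or one line of a larger buffer.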

void
setCollationStrength(RWUCollator::CollationStrength);

Sets the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions. Throws RWUException if this method is invoked on a basic regular expression.

void
setLevel(UnicodeConformanceLevel level = Basic);

Sets the Unicode conformance level for self to the specified level. The default is Basic.

NOTE -- When setLevel() is called, the regular expression pattern is recompiled into a form that more efficiently allows for the specified level of Unicode support.

void
setLocale(const RWULocale& loc);

Imbues the regular expression object with the locale loc. The locale is used internally in the detection of breaks in the text.

size_t
subCount() const;

Returns the count of parenthesized subexpressions contained in the regular expression pattern associated with self. For example, in the pattern: a(b(c)d)e, there are two parenthesized subexpressions.
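The counting rule in this example can be sketched as a scan for opening parentheses. The function countSubexpressions is a hypothetical, simplified stand-in in standard C++: it assumes that a backslash escapes the next character and that parentheses inside a bracket expression are literal, and it makes no claim about how subCount() is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts unescaped '(' characters that appear outside bracket
// expressions, one per parenthesized subexpression.
std::size_t countSubexpressions(const std::string& pattern)
{
    std::size_t count = 0;
    bool inBracket = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') { ++i; continue; }  // skip the escaped character
        if (inBracket) {
            if (c == ']') inBracket = false;
            continue;
        }
        if (c == '[') inBracket = true;
        else if (c == '(') ++count;
    }
    return count;
}

// Hypothetical demonstration using the pattern from the description.
void demoSubCount()
{
    assert(countSubexpressions("a(b(c)d)e") == 2);
    assert(countSubexpressions("abc") == 0);
}
```

Nested subexpressions each count once, which is why the pattern above yields two rather than one.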

© Copyright Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.