Regular Expression String Searching
A regular expression is a string pattern composed of normal characters and special characters. Special characters are used to denote an arrangement of the other characters in the regular expression pattern. A regular expression can be used to search for, and perhaps replace, occurrences of the regular expression pattern in strings.
Regular expression syntax describes how to arrange normal characters and special characters to form a valid regular expression pattern. The regular expression syntax for RWURegularExpression is similar to that of the POSIX 2 extended regular expression (ERE) specification, with the addition of Unicode extensions. For more information on the POSIX ERE standard, see “POSIX Extended Regular Expression Syntax.”
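To make the search-and-replace idea concrete, the following is a minimal sketch of POSIX ERE-style matching. It deliberately uses std::regex from the C++ standard library with the extended grammar as a stand-in; it does not use RWURegularExpression, and it does not show that class's interface. The function names maskDigits and mentionsColor are hypothetical, chosen only for this illustration.

```cpp
#include <regex>
#include <string>

// Illustration only: std::regex with the POSIX extended (ERE) grammar,
// used as a stand-in. This is not the RWURegularExpression API.

// Replace every run of one or more digits with '#'.
// '[' ']' and '+' are special characters; the digits 0 and 9 inside the
// bracket expression are normal characters that define a range.
std::string maskDigits(const std::string& text) {
    static const std::regex digits("[0-9]+", std::regex::extended);
    return std::regex_replace(text, digits, "#");
}

// Return true if the text contains "color" or "colour".
// The special character '?' makes the preceding 'u' optional.
bool mentionsColor(const std::string& text) {
    static const std::regex pattern("colou?r", std::regex::extended);
    return std::regex_search(text, pattern);
}
```

For example, maskDigits("order 66, item 1138") yields "order #, item #", and mentionsColor("the colour red") returns true. The same patterns would need the Unicode extensions described in this section before they could match, say, non-ASCII digits.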
The Internationalization Module extends the POSIX 2 ERE syntax to provide support for Unicode basic and tailored regular expressions through the class RWURegularExpression.
Basic Unicode regular expression support corresponds to Level 1 support, as described in the Unicode Regular Expression Guidelines, Unicode Technical Report #18 (UTR-18), Version 5.1, at http://www.unicode.org/reports/tr18/tr18-5.1.html. Basic Unicode regular expressions are sufficient for the majority of Unicode strings. They add the following Unicode extensions to the POSIX ERE standard:
Hexadecimal notation
Character categories
Subtraction
Simple word boundaries
Simple loose matches
Line breaks
For more information on basic regular expressions, see “Basic Unicode Regular Expression Extensions.”
Tailored regular expressions extend the basic regular expression functionality, corresponding to Level 2 and Level 3 support, also described in UTR-18 Version 5.1. In addition to some minor extensions, the tailored extensions include support for:
Treating surrogate pairs as single characters
Using the script property
Matching canonically equivalent character representations
Specifying grapheme clusters
As always, added power comes at a cost in processing time and space. For this reason, RWURegularExpression uses the basic regular expression engine by default; enable tailored regular expressions only when you need their additional capabilities.
For more information on tailored regular expressions, see “Tailored Unicode Regular Expression Extensions” and “How to Use Tailored Regular Expressions.”
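Two of the tailored features listed above, surrogate pairs and canonical equivalence, are easier to picture with concrete UTF-16 data. The sketch below uses only the C++ standard library and is independent of RWURegularExpression; the names countCodePoints, kPrecomposed, and kDecomposed are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Illustration only: plain standard C++, not the Internationalization
// Module API.

// Count the characters in a UTF-16 string, treating each surrogate pair
// (a high unit in 0xD800-0xDBFF followed by a low unit in 0xDC00-0xDFFF)
// as a single character. Unpaired surrogates are counted individually.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool lowFollows = i + 1 < s.size() &&
                                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && lowFollows) {
            ++i;  // skip the low half of the pair
        }
        ++count;
    }
    return count;
}

// Two canonically equivalent spellings of the same accented letter.
const std::u16string kPrecomposed = u"\u00E9";   // U+00E9, precomposed
const std::u16string kDecomposed  = u"e\u0301";  // U+0065 followed by U+0301
```

A string such as u"a\U0001D11Eb" occupies four UTF-16 code units but contains three characters, which is the distinction that surrogate-pair support addresses. Likewise, kPrecomposed and kDecomposed compare unequal unit by unit even though they are canonically equivalent, which is why matching them requires the tailored engine rather than a simple code unit comparison.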
A Note on Support by UTR Version Number
The Internationalization Module’s support for regular expressions is based primarily on UTR-18 Version 5.1. However, the module also provides support for much of UTR-18 Version 6 (http://www.unicode.org/reports/tr18/).
Support for Version 6 includes all Level 1, 2, and 3 features, except for other properties, intersection, tailored properties, element-level loose matches, and fine-grained Level 3 support. While Version 6 places support for surrogates at Level 1, the Internationalization Module provides that support at Level 2, in keeping with the guidelines from UTR-18 Version 5.1.