Codereview request for 7014640: To add a metachar \R for line ending and character classes for vertical/horizontal ws \v \V \h \H

Xueming Shen xueming.shen at
Tue May 1 18:06:55 UTC 2012


Just noticed that the webrev URL was pointing to the blenderrev. The webrev
is at

Btw, this one has been approved by CCC.


On 04/21/2012 12:56 AM, Xueming Shen wrote:
> Hi
> Here are the webrev and blenderrev for the proposed change to add 5 
> new regex constructs \R \v \V \h \H.
> \R:  recommended by Unicode Regex TR#18 for matching all line ending
>     characters and sequences, is equivalent to
>     ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
> \h, \v, \H and \V:
>      match any character that is (or, for \H and \V, is not)
>      considered horizontal/vertical whitespace.
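To make the quoted equivalence concrete, here is a minimal Java sketch, assuming a JDK in which \R is available. The class and method names are made up for illustration; the expanded pattern is the one given in the message above.

```java
import java.util.regex.Pattern;

class LineEndingDemo {
    // The expansion quoted in the message, written as a Java regex.
    // Doubled backslashes leave the Unicode escapes to the regex engine.
    static final Pattern EXPANDED = Pattern.compile(
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");

    static final Pattern LINE_ENDING = Pattern.compile("\\R");

    // True if \R and the expansion agree on whether the whole
    // input is exactly one line ending.
    static boolean agree(String s) {
        return LINE_ENDING.matcher(s).matches() == EXPANDED.matcher(s).matches();
    }

    // Split on any line ending; CR LF counts as a single break.
    static String[] lines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String[] samples = {
            "\r\n", "\n", "\r", "\u000B", "\f", "\u0085", "\u2028", "\u2029",
            "a", "\n\r"
        };
        for (String s : samples) {
            System.out.println(agree(s)); // prints true for every sample
        }
        // Three different line endings, four lines.
        System.out.println(lines("one\r\ntwo\nthree\u2028four").length); // prints 4
    }
}
```

Note that "\n\r" is two line endings, so neither pattern matches it as a whole.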
> Webrev:
> Blenderrev:
> new Pattern api
> Here are a couple of notes regarding the spec/implementation.
> (1) \v was implemented as \u000B ('\013'), but never documented (it did
> not appear in our API doc as a supported construct, the way \t \r \n
> do). Defining \v as a "general" construct for all vertical whitespace
> characters might raise some compatibility concerns (more characters
> are now matched by \v). But given that this is an undocumented
> implementation detail and \u000B is still matched by \v, I would
> consider this an acceptable behavior change.
> (2) A predefined character class can appear inside another character
> class, for example [...\v...]. However, since it represents a "class"
> of characters, it can't be the start or end code point of a range
> inside a class: you can have [a-b], but you can't have [\h-...] or
> [...-\h] (an exception will be thrown). \v is the exception: since it
> was implemented as \u000B (VT), you were able to use it as the start
> or end value of a range. I feel it'd be better to keep it working the
> way it did before, so [\v-\v] works and will match VT in this
> implementation.
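A small sketch of note (2), assuming a JDK that includes this change. The class name and the compiles helper are hypothetical. The two range cases with \h are only reported, not asserted, since what a given JDK does with them is up to its implementation.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RangeInClassDemo {
    // Reports whether the running JDK accepts the given regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A predefined class used as a member of another class.
        System.out.println(Pattern.matches("[x\\h]", "\t")); // prints true
        System.out.println(Pattern.matches("[x\\h]", "x"));  // prints true

        // \v at both ends of a range is kept as VT for compatibility.
        System.out.println(Pattern.matches("[\\v-\\v]", "\u000B")); // prints true
        System.out.println(Pattern.matches("[\\v-\\v]", "\n"));     // prints false

        // \h as a range endpoint; the message expects these to be rejected.
        System.out.println(compiles("[\\h-z]"));
        System.out.println(compiles("[a-\\h]"));
    }
}
```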
> (3) The newly added \h\v\H\V constructs are all "Unicode versions" of
> character classes, while the rest of the "predefined character class"
> family (\d\D\s\S\w\W) is ASCII only; you have to turn on the
> UNICODE_CHARACTER_CLASS flag to get the Unicode version of those
> constructs. This is kind of inconsistent, but Perl's corresponding
> constructs work in a similar way: all of Perl's \d\D\s\S\w\W\v\V\h\H
> work in Unicode mode, and the /a modifier turns \d\D\s\S\w\W back to
> ASCII mode but does not affect \h\v\H\V. We had the discussion back in
> JDK7 regarding ASCII vs Unicode for these constructs; the decision
> then was to keep the predefined character classes (and POSIX character
> classes) ASCII by default and to have a flag to turn them into the
> Unicode version. Given that there is NOT an ASCII version of \h\v\H\V
> in Perl, and that Java regex never had an ASCII version of them that
> could raise compatibility concerns, I feel it might be better to just
> have a simple Unicode version of \h\v\H\V.
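The difference described in note (3) can be seen with a single non-ASCII space. This is a minimal sketch, assuming a JDK that includes the change; the class and method names are made up, and U+2003 (EM SPACE) is just one convenient test character.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // U+2003 EM SPACE: horizontal whitespace outside the ASCII range.
    static final String EM_SPACE = "\u2003";

    // \h is Unicode-aware with no flags at all.
    static boolean hMatches(String s) {
        return Pattern.compile("\\h").matcher(s).matches();
    }

    // \s is ASCII-only unless UNICODE_CHARACTER_CLASS is passed.
    static boolean sMatches(String s, int flags) {
        return Pattern.compile("\\s", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hMatches(EM_SPACE));                                  // prints true
        System.out.println(sMatches(EM_SPACE, 0));                               // prints false
        System.out.println(sMatches(EM_SPACE, Pattern.UNICODE_CHARACTER_CLASS)); // prints true
    }
}
```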
> (4) \R is not a character class, since it can match the two-character
> sequence \r\n.
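To illustrate why \R can't be a character class, the sketch below contrasts it with \v on a CR LF pair. It assumes a JDK that includes the change; the class name and the compiles helper are hypothetical. Whether \R is accepted inside brackets is only reported, not asserted.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LineEndingNotAClassDemo {
    // Reports whether the running JDK accepts the given regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String crlf = "\r\n";

        // \R consumes the two-character sequence as one line ending.
        System.out.println(Pattern.matches("\\R", crlf));    // prints true

        // \v is a character class, so it matches exactly one character.
        System.out.println(Pattern.matches("\\v", crlf));    // prints false
        System.out.println(Pattern.matches("\\v\\v", crlf)); // prints true

        // Not being a class, \R is not expected to work inside brackets.
        System.out.println(compiles("[\\R]"));
    }
}
```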
> This one will need to go through the CCC process.
> Thanks,
> -Sherman

More information about the core-libs-dev mailing list