RFR JDK-8013254: Constructor \w need update to add the support of \p{Join_Control}

Xueming Shen xueming.shen at oracle.com
Tue Apr 30 17:01:44 UTC 2013


It appears we dropped the ball on U+200C and U+200D when we updated
the "simple word boundaries" back in JDK 7 [1]. You can find most of the
related discussion here [2]. These two code points were listed among the
issues we were trying to fix, but the final doc and implementation
don't address them, mainly because \p{Join_Control} was not explicitly
listed in the "compatibility" section of the earlier version of TR#18 [3],
even though these two code points are explicitly mentioned in section RL1.4,
Simple Word Boundaries [4]. \p{Join_Control} (U+200C and U+200D) has since
been added to the "compatibility" section in the latest version of TR#18 [5].

The proposed change here is to
(1) add these two code points back to the set of characters matched by \w
(2) list them explicitly in the \w definition as \p{Join_Control}
(3) list Join_Control as one of the supported binary properties.
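
As a rough illustration of item (3), here is a minimal sketch that queries
the binary property directly. It assumes a JDK that already includes this
change and that the property is spelled \p{IsJoin_Control}, following the
"Is" prefix Pattern uses for other binary properties. The class and method
names are made up for the example. On a JDK without the change the pattern
is expected to be rejected with a PatternSyntaxException.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the webrev.
class JoinControlProperty {

    // Returns true if the single code point cp matches \p{IsJoin_Control}.
    // Assumption: the property name is accepted by Pattern (i.e. the fix is in).
    static boolean isJoinControl(int cp) {
        String s = new String(Character.toChars(cp));
        return Pattern.matches("\\p{IsJoin_Control}", s);
    }

    public static void main(String[] args) {
        // U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER
        System.out.println("U+200C: " + isJoinControl(0x200C));
        System.out.println("U+200D: " + isJoinControl(0x200D));
        // An ordinary letter, for contrast
        System.out.println("'a':    " + isJoinControl('a'));
    }
}
```

With the change applied, the two join controls should report true and the
plain letter false.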


The webrev for RegExTest.java above includes the change for 8013252,
which is also under review; I'm not separating them out, just for
convenience. The regression/unit tests may not be that direct, so here is
a more direct way to verify the fix.

         Matcher wordU = Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
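
A self-contained version of that check might look like the sketch below.
It reuses the wordU matcher from the line above; the class name, the helper
method and the choice of sample inputs are assumptions made for the example,
not the test in the webrev. It also assumes a JDK with the fix applied.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone check, not the RegExTest.java from the webrev.
class WordJoinControlCheck {

    // Returns true if the whole input matches \w under UNICODE_CHARACTER_CLASS.
    static boolean matchesUnicodeWord(String s) {
        Matcher wordU = Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
        return wordU.reset(s).matches();
    }

    // Returns true if the whole input matches the default (ASCII-only) \w.
    static boolean matchesAsciiWord(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    public static void main(String[] args) {
        String zwnj = "\u200c"; // ZERO WIDTH NON-JOINER
        String zwj  = "\u200d"; // ZERO WIDTH JOINER

        // With the fix, both should match the Unicode version of \w.
        System.out.println("\\w (unicode) U+200C: " + matchesUnicodeWord(zwnj));
        System.out.println("\\w (unicode) U+200D: " + matchesUnicodeWord(zwj));

        // The default \w is [a-zA-Z_0-9] and is not affected by this change.
        System.out.println("\\w (ascii)   U+200C: " + matchesAsciiWord(zwnj));
    }
}
```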


[1] http://ccc.us.oracle.com/7039066
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000381.html
[3] http://www.unicode.org/reports/tr18/tr18-13.html#Compatibility_Properties
[4] http://www.unicode.org/reports/tr18/tr18-13.html#Simple_Word_Boundaries
[5] http://www.unicode.org/reports/tr18/#Compatibility_Properties

More information about the core-libs-dev mailing list