<i18n dev> Fwd: RFR JDK-8013254: Constructor \w need update to add the support of \p{Join_Control}

Xueming Shen xueming.shen at oracle.com
Tue Apr 30 10:03:10 PDT 2013

-------- Original Message --------
Message-ID: 	<517FF8F8.3080208 at oracle.com>
Date: 	Tue, 30 Apr 2013 10:01:44 -0700
From: 	Xueming Shen <xueming.shen at oracle.com>
User-Agent: 	Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20110414 Thunderbird/3.1.10
MIME-Version: 	1.0
To: 	core-libs-dev core-libs-dev <core-libs-dev at openjdk.java.net>
Subject: 	RFR JDK-8013254: Constructor \w need update to add the support of \p{Join_Control}
Content-Type: 	text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 	7bit


It appears we dropped the ball on U+200C and U+200D when we updated
the "simple word boundaries" back in JDK 7 [1]. You can find most of the
related discussion here [2]. These two code points were listed as one of
the issues we were trying to fix, but the final doc and implementation
do not address them, mainly because \p{Join_Control} was not explicitly
listed in the "compatibility" section of the earlier version of TR#18 [3],
even though the two code points are explicitly mentioned in section RL1.4,
Simple Word Boundaries [4]. \p{Join_Control} (U+200C and U+200D) has since
been added to the "compatibility" section in the latest version of TR#18 [5].

The proposed change here is to
(1) add these two code points back to the collection of \w
(2) list them explicitly in the \w definition as \p{Join_Control}
(3) list Join_Control as one of the supported binary properties.


The webrev for RegExTest.java above includes the change for 8013252,
which is being reviewed as well; I'm not separating them out, just for
convenience. The regression/unit tests may not be that "direct", so here
is a direct version to verify the fix.

         Matcher wordU = Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
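
For anyone who wants to try this locally, here is a slightly fuller sketch
built around the one-line snippet above. It is not the test from the webrev;
the class name JoinControlCheck and both helper methods are made up for
illustration. It assumes a JDK that already contains this fix, where both
code points should be reported as matching \w under UNICODE_CHARACTER_CLASS
and as matching \p{IsJoin_Control}.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the JDK or of the webrev.
public class JoinControlCheck {

    // \w with Unicode semantics, as in the one-line snippet above.
    private static final Pattern WORD_U =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    // The binary property this change proposes to list as supported.
    // Assumes a JDK that includes the fix; older ones may reject the name.
    private static final Pattern JOIN_CONTROL =
            Pattern.compile("\\p{IsJoin_Control}");

    // True if the single code point matches \w in Unicode mode.
    static boolean isUnicodeWord(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return WORD_U.matcher(s).matches();
    }

    // True if the single code point has the Join_Control property.
    static boolean isJoinControl(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return JOIN_CONTROL.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER
        for (int cp : new int[] {0x200C, 0x200D}) {
            System.out.printf("U+%04X  \\w=%b  Join_Control=%b%n",
                    cp, isUnicodeWord(cp), isJoinControl(cp));
        }
    }
}
```

Without Pattern.UNICODE_CHARACTER_CLASS, \w stays [a-zA-Z_0-9], so neither
code point is expected to match in the default mode.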


[1] http://ccc.us.oracle.com/7039066
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000381.html
[3] http://www.unicode.org/reports/tr18/tr18-13.html#Compatibility_Properties
[4] http://www.unicode.org/reports/tr18/tr18-13.html#Simple_Word_Boundaries
[5] http://www.unicode.org/reports/tr18/#Compatibility_Properties
