Unicode script support in Regex and Character class

Martin Buchholz martinrb at google.com
Sat Apr 24 18:21:20 UTC 2010

Providing script support is obvious and non-controversial,
because other regex programming environments already provide it.
Check that the behavior and syntax of the extension are
consistent with e.g. ICU, Python, and especially Perl
(5.12 just released!)
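
For comparing against those other environments, here is a minimal sketch
that exercises the property forms listed in the quoted proposal below.
It assumes the keyword forms (script=, block=, general_category=) are
accepted by java.util.regex.Pattern as proposed; the class and helper
names are made up for illustration.

```java
import java.util.regex.Pattern;

public class ScriptSyntaxDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Script by keyword: U+03B1 GREEK SMALL LETTER ALPHA is Greek.
        System.out.println(matches("\\p{script=Greek}", "\u03b1"));
        // Negated form: 'a' is Latin, hence "not Greek".
        System.out.println(matches("\\P{script=Greek}", "a"));
        // Block by keyword and by the traditional In prefix
        // (U+1820 MONGOLIAN LETTER A lies in the Mongolian block).
        System.out.println(matches("\\p{block=Mongolian}", "\u1820"));
        System.out.println(matches("\\p{InMongolian}", "\u1820"));
        // General category by keyword: 'A' is an uppercase letter (Lu).
        System.out.println(matches("\\p{general_category=Lu}", "A"));
    }
}
```

The same inputs can be run through the equivalent Perl or ICU patterns
to spot any difference in spelling or matching rules.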


I would add some documentation for the three special script values;
their meaning is not obvious.
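
To illustrate the kind of thing the documentation could spell out, here
is a small sketch.  It assumes the three special values are COMMON,
INHERITED and UNKNOWN, and that the enum exposes a static of(int)
lookup; both are assumptions about the final API, and the class and
helper names are made up.

```java
public class SpecialScriptsDemo {
    // Hypothetical helper wrapping the assumed lookup method.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // An ordinary letter belongs to one specific script.
        System.out.println(scriptOf('A'));
        // Digits and punctuation are shared by many scripts (Common).
        System.out.println(scriptOf('1'));
        // U+0301 COMBINING ACUTE ACCENT takes the script of its base
        // character (Inherited).
        System.out.println(scriptOf(0x0301));
        // U+0378 is assumed here to be an unassigned code point, which
        // is what the Unknown value is meant for.
        System.out.println(scriptOf(0x0378));
    }
}
```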

For implementation, the character matching problem is in general
equivalent to the problem of compiling a switch statement, which is
known to be non-trivial.  Guava contains a CharMatcher class that
tries to solve related problems.


I'm thinking scripts and blocks should know which ranges they contain.
In particular, \p{BlockName} should not need binary search at
regex compile time or runtime.
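
A rough sketch of the difference, not taken from the webrev: all names
below are hypothetical, and the table holds just four blocks for
illustration.  A general "which block is this code point in?" lookup
needs a search over a sorted table, whereas a block that carries its
own range can answer "is this code point in me?" with two comparisons.

```java
import java.util.Arrays;

public class RangeLookupDemo {
    // Hypothetical table: sorted block starts, with matching ends and names.
    static final int[] STARTS = {0x0000, 0x0080, 0x0100, 0x0180};
    static final int[] ENDS   = {0x007F, 0x00FF, 0x017F, 0x024F};
    static final String[] NAMES = {
        "BasicLatin", "Latin-1Supplement", "LatinExtended-A", "LatinExtended-B"
    };

    // General lookup: binary search for the last start <= cp.
    static String blockOf(int cp) {
        int i = Arrays.binarySearch(STARTS, cp);
        if (i < 0) {
            i = -i - 2;  // (insertion point) - 1
        }
        return (i >= 0 && cp <= ENDS[i]) ? NAMES[i] : null;
    }

    // A block that knows its own range: no search needed for membership.
    static final class Block {
        final String name;
        final int start;
        final int end;

        Block(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }

        boolean contains(int cp) {
            return start <= cp && cp <= end;
        }
    }

    public static void main(String[] args) {
        System.out.println(blockOf('A'));
        Block mongolian = new Block("Mongolian", 0x1800, 0x18AF);
        System.out.println(mongolian.contains(0x1820));
    }
}
```

For a script, which generally spans several ranges, contains() would
still need a small per-script search, but only over that script's own
ranges rather than the whole table.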

There is one place you need to change:
  key word => keyword
  InMongolian => {@code InMongolian}

I notice that the current Unicode block support in the JDK has not been
updated to the latest standard.  E.g. Samaritan is missing.


On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.shen at oracle.com> wrote:
> Hi,
> Here is the webrev of the proposal to add Unicode script support in regex
> and j.l.Character.
> http://cr.openjdk.java.net/~sherman/script/webrev
> and the corresponding blenderrev
> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
> Please comment on the APIs before I submit the CCC, especially
> (1) to use enum for the j.l.Character.UnicodeScript (compared to the
> traditional j.l.c.Subset)
> (2) the piggyback method j.l.c.getName() :-)
> (3) the syntax for script constructs. In addition to the "normal"
>    \p{InScriptName} and \P{InScriptName} for the script support
>    I'm also adding
>   \p{script=ScriptName} \P{script=ScriptName}  for the new script support
>   \p{block=BlockName} \P{block=BlockName}  for the "existing" block support
>   \p{general_category=CategoryName} \P{general_category=CategoryName} for
> the "existing" gc
>   Perl recently also started to accept this  \p{propName=propValue} Unicode
> style.
>   It opens the door for future "expanding", for example \p{name=XYZ} :-)
> (4) and of course, the wording.
> Thanks,
> Sherman
