Unicode script support in Regex and Character class

Xueming Shen xueming.shen at oracle.com
Thu Apr 22 22:44:57 UTC 2010

Yuri Gaevsky wrote:
> Hi Sherman,
> A couple of minor comments:
>   - There is a typo (Uniocde) in Character.UnicodeScript.forName(java.lang.String):
>         "Returns the UnicodeScript with the given Uniocde script name or the script
>          name alias. "
>   - Shouldn't the method be more specific in respect of inner spaces, underscores
>     and so on (as [1] does)?
> Regards,
> -Yuri
> [1] http://java.sun.com/javase/6/docs/api/java/lang/Character.UnicodeBlock.html#forName(java.lang.String)

Thanks Yuri.

Typo has been fixed and webrev has been updated.

The difference between block names and script names is that the block
names defined by Unicode in Blocks.txt[1] use the space character and
the hyphen as separators (instead of the underscore), for example
"Latin-1 Supplement", which makes it impossible to use the name directly
as an identifier in Java. UnicodeBlock.forName() therefore has to accept
both the original/canonical block name and the "text representation" of
the UnicodeBlock identifier.
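
To make the block-name case concrete, here is a minimal sketch of the
spellings UnicodeBlock.forName() accepts for one block. The class name
BlockNameDemo and the helper lookup() are made up for illustration; the
three forms are the ones listed in the forName() javadoc (canonical name,
canonical name with spaces removed, and the constant's identifier).

```java
public class BlockNameDemo {
    // Hypothetical helper: thin wrapper around the real JDK lookup.
    static Character.UnicodeBlock lookup(String name) {
        return Character.UnicodeBlock.forName(name);
    }

    public static void main(String[] args) {
        String[] spellings = {
            "Latin-1 Supplement",  // canonical name as written in Blocks.txt
            "Latin-1Supplement",   // canonical name with the space removed
            "LATIN_1_SUPPLEMENT"   // text form of the Java constant identifier
        };
        for (String s : spellings) {
            // All three spellings should resolve to the same constant.
            System.out.println(s + " -> " + lookup(s));
        }
    }
}
```

Only the last spelling is a legal Java identifier, which is why the
constant cannot simply reuse the Blocks.txt name.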

For the script name, while tr24[2] states that "the presence of hyphen
or underscore is optional", Scripts.txt[3] strictly uses only the
underscore in script names. I considered also allowing a "loose match"
for script names, to accept names that use a space or hyphen in place
of "_", but decided to stick with the canonical names (only a few names
actually need this). I'm still open on this one if people think the
"loose match" is important.

I added "The en_US locale's case mapping rules are used to provide 
case-insensitive string comparisons for script
name validation", as suggested.
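
A small sketch of what "canonical name only, case-insensitive" means for
callers. It assumes the behavior of UnicodeScript.forName() as shipped in
JDK 7 and later, which may not match every revision of the webrev; the
class ScriptNameDemo and the helper lookupOrNull() are made up for
illustration.

```java
public class ScriptNameDemo {
    // Hypothetical helper: returns the script constant, or null when
    // forName() rejects the name with IllegalArgumentException.
    static Character.UnicodeScript lookupOrNull(String name) {
        try {
            return Character.UnicodeScript.forName(name);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "Old_Italic",  // canonical name from Scripts.txt
            "old_italic",  // same name, different case
            "Ital",        // script name alias
            "Old Italic",  // space instead of "_": no loose match
            "Old-Italic"   // hyphen instead of "_": no loose match
        };
        for (String n : names) {
            System.out.println("\"" + n + "\" -> " + lookupOrNull(n));
        }
    }
}
```

The first three names should resolve to OLD_ITALIC, while the last two
are rejected, which is exactly the case a "loose match" would cover.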

I also replaced "Character.UnicodeScript object/instance" with
"constant" in several places, to be consistent with the inherited
methods valueOf() and values().

The RFE IDs are:
4860714: Make Unicode scripts available for use in regular expressions
6945564: Unicode script support in Character class
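
For anyone who wants to try the two RFEs together, a minimal sketch
follows. It assumes the regex syntax documented for
java.util.regex.Pattern in JDK 7 and later (the "Is" prefix and the
script=/sc= keywords); the class ScriptRegexDemo and the helper
allInScript() are made up for illustration.

```java
import java.util.regex.Pattern;

public class ScriptRegexDemo {
    // Hypothetical helper: true if every character of text matches the
    // given \p{...} property, e.g. "IsGreek" or "script=Greek".
    static boolean allInScript(String property, String text) {
        return Pattern.matches("\\p{" + property + "}+", text);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // Greek small alpha, beta, gamma

        // Three spellings of the same script property in a regex.
        System.out.println(allInScript("IsGreek", greek));
        System.out.println(allInScript("script=Greek", greek));
        System.out.println(allInScript("sc=Greek", greek));

        // Latin letters are not in the Greek script.
        System.out.println(allInScript("IsGreek", "abc"));

        // The Character-class side of the same data.
        System.out.println(Character.UnicodeScript.of('a'));
    }
}
```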


[1] http://www.unicode.org/Public/5.2.0/ucd/Blocks.txt
[2] http://www.unicode.org/reports/tr24/
[3] http://www.unicode.org/Public/UNIDATA/Scripts.txt
