Unicode script support in Regex and Character class
xueming.shen at oracle.com
Thu Apr 22 15:44:57 PDT 2010
Yuri Gaevsky wrote:
> Hi Sherman,
> A couple of minor comments:
> - There is a typo (Uniocde) in Character.UnicodeScript.forName(java.lang.String):
> "Returns the UnicodeScript with the given Uniocde script name or the script
> name alias. "
> - Shouldn't the method be more specific in respect of inner spaces, underscores
> and so on (as Character.UnicodeBlock.forName(java.lang.String) does)?
>  http://java.sun.com/javase/6/docs/api/java/lang/Character.UnicodeBlock.html#forName(java.lang.String)
Typo has been fixed and webrev has been updated.
The difference between block names and script names is that the block names
defined by Unicode in Blocks.txt use the space
character and the hyphen as separators (instead of the underscore), for
example "Latin-1 Supplement", which
makes it impossible to use the name as an identifier in Java directly.
UnicodeBlock.forName() therefore has to accept
both the original/canonical block name and the "text representation" of
the UnicodeBlock identifier.
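
To make the block-name case concrete, here is a small sketch using the existing
Character.UnicodeBlock.forName(). The class and method names (BlockNameForms,
lookup) are made up for illustration only; the two name forms are the ones
described above.

```java
class BlockNameForms {
    // Illustrative wrapper around the public lookup method.
    static Character.UnicodeBlock lookup(String name) {
        return Character.UnicodeBlock.forName(name);
    }

    public static void main(String[] args) {
        // Canonical name from Blocks.txt, with a hyphen and a space.
        Character.UnicodeBlock canonical = lookup("Latin-1 Supplement");
        // Text representation of the Java identifier.
        Character.UnicodeBlock identifier = lookup("LATIN_1_SUPPLEMENT");
        // Both forms are expected to resolve to the same constant.
        System.out.println(canonical == identifier);
    }
}
```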
For the script name, while TR24 states that "the presence of
hyphen or underscore is optional", Scripts.txt
strictly uses only the underscore in script names. I was considering whether
I should also allow "loose-match" for the script
name, to accept names that use a space or hyphen in place of "_",
but decided to stick with the canonical name
(actually there are only a few names that need this). Well, I'm
still open on this one, if people think
"loose-match" is important.
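
For anyone who does need it, a caller-side "loose-match" could look roughly
like the sketch below. looseForName is a hypothetical helper, not part of the
proposed API, and it only covers the space/hyphen-for-underscore substitution
mentioned above, not the full TR24 matching rules.

```java
class LooseScriptLookup {
    // Hypothetical helper (not in the JDK): map spaces and hyphens to
    // underscores, then defer to the strict lookup.
    static Character.UnicodeScript looseForName(String name) {
        String canonical = name.trim().replace(' ', '_').replace('-', '_');
        return Character.UnicodeScript.forName(canonical);
    }

    public static void main(String[] args) {
        // "Old_Italic" is one of the few script names containing a separator.
        System.out.println(looseForName("Old Italic"));
        System.out.println(looseForName("old-italic"));
    }
}
```

Keeping this outside forName() leaves the method itself tied to the names as
they appear in Scripts.txt.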
I added "The en_US locale's case mapping rules are used to provide
case-insensitive string comparisons for script
name validation", as suggested.
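
A quick sketch of why the locale is pinned down. The Turkish contrast is my
own illustration, not something from the spec text, and the class and method
names are made up for the example.

```java
import java.util.Locale;

class ScriptNameCase {
    // Illustrative wrapper around the script-name lookup.
    static Character.UnicodeScript lookup(String name) {
        return Character.UnicodeScript.forName(name);
    }

    // Upper-case under Turkish rules, where 'i' maps to U+0130 rather than 'I'.
    static String turkishUpper(String s) {
        return s.toUpperCase(Locale.forLanguageTag("tr-TR"));
    }

    public static void main(String[] args) {
        // Lookup ignores case, independent of the default locale.
        System.out.println(lookup("latin") == lookup("LATIN"));
        // A locale-sensitive mapping would not produce the plain name "LATIN".
        System.out.println(turkishUpper("latin").equals("LATIN"));
    }
}
```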
I also replaced the "Character.UnicodeScript object/instance" with
"constant" in several places to be consistent with
the inherited methods valueOf() and values().
The RFE ids are
4860714: Make Unicode scripts available for use in regular expressions
6945564: Unicode script support in Character class
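
As a rough illustration of the regex side, the sketch below uses the property
forms documented for java.util.regex.Pattern in JDK 7 (script=, sc= and the Is
prefix). That syntax is taken from the released Pattern documentation rather
than from this message, and matchesAll is a made-up helper name.

```java
import java.util.regex.Pattern;

class ScriptRegexDemo {
    // True if every character of text matches the given \p{...} property.
    static boolean matchesAll(String property, String text) {
        return Pattern.matches("\\p{" + property + "}+", text);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma
        System.out.println(matchesAll("script=Greek", greek));
        System.out.println(matchesAll("sc=Greek", greek));
        System.out.println(matchesAll("IsGreek", greek));
        System.out.println(matchesAll("IsLatin", greek));
    }
}
```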