Unicode script support in Regex and Character class

Xueming Shen xueming.shen at oracle.com
Tue May 11 00:05:41 UTC 2010


My apologies for distracting you with that "smaller size alternative". As I 
said in my previous email,
please only "review" the bits at

It's fine if you are interested in the stuff I experimented with at
but please keep it separate from the code I'm proposing to putback.


Ulf Zibis wrote:
> Some additional thoughts:
> - out.writeShort((short)(num & 0xffff)); ---short form--->  
> out.writeShort((short)num);
> - use Arrays.binarySearch() in Character.UnicodeBlock.of().
> - "if (notFirst)" could be dropped if you appended the first 
> word to sb outside the while loop.
> - StringBuilder sb could be initialized by the maximum name length 
> (=83) to avoid resizing;
> - we could reuse the same StringBuilder for multiple invocations of 
> Character.getName(cp)?
> -- make CharacterName.get(cp) instance method and save CharacterName 
> object as ThreadLocal from Character.getName(cp).
> -- synchronize Character.getName(cp).
> - Instead of using StringBuilder we could use ByteBuffer, omit the char[] 
> and build the final String by new String(bb.array(), "ASCII").
> -- saves the char[], which is twice as big, for the pool.
> -- I imagine ByteBuffer would perform better than StringBuilder.
> - save UnicodeBlocks, BlockStarts and scriptStarts in a file instead 
> of statically in the class file.
> -- e.g. the init of scriptStarts is a big waste of byte code (7/11 bytes 
> per short/integer entry).
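
The first two suggestions in the list above can be sketched as follows. This is only an illustration under stated assumptions: the class name SuggestionSketch, the tiny STARTS/NAMES table and both method names are hypothetical and are not taken from the webrev. It shows that DataOutputStream.writeShort(int) emits the same two bytes with or without the "& 0xffff" mask, and how Arrays.binarySearch() can map a code point to the range whose start precedes it.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class SuggestionSketch {

    // Hypothetical range table: sorted start code points plus a parallel
    // name array. The last entry is a sentinel marking "no range from here on".
    static final int[] STARTS = { 0x0000, 0x0080, 0x0100, 0x0180, 0x0250 };
    static final String[] NAMES = {
        "BASIC_LATIN", "LATIN_1_SUPPLEMENT", "LATIN_EXTENDED_A",
        "LATIN_EXTENDED_B", null
    };

    // Returns the name of the range containing cp, or null if there is none.
    static String rangeOf(int cp) {
        int i = Arrays.binarySearch(STARTS, cp);
        if (i < 0) {
            // Not an exact start: binarySearch returned -(insertionPoint) - 1,
            // and the containing range sits at insertionPoint - 1.
            i = -i - 2;
        }
        return i < 0 ? null : NAMES[i];
    }

    // Writes num twice, once with the mask and once with the plain cast,
    // and returns the four bytes produced.
    static byte[] writeTwice(int num) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeShort((short) (num & 0xffff)); // long form
            out.writeShort((short) num);            // short form
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rangeOf('A'));
        System.out.println(Arrays.toString(writeTwice(0x12345)));
    }
}
```

Since writeShort(int) only writes the low-order 16 bits of its argument, even the (short) cast is not strictly needed; it is kept here to mirror the two forms being compared.
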
> On 08.05.2010 23:49, Xueming Shen wrote:
>> Hi,
>> The API proposals for Unicode script support below have been approved.
>> 6945564: Unicode script support in Character class
>> 6948903: Make Unicode scripts available for use in regular expressions
>> (2) Test results suggest there is not much runtime benefit in 
>> keeping a huge string
>> data pool + an access hashmap for the getName() implementation. The 
>> latest implementation now
>> takes Ulf's suggestion to keep a relatively small byte[] pool and 
>> generate the names at runtime.
>> (there is an "even smaller" implementation, which consumes about 300K 
>> of memory at runtime
>> http://cr.openjdk.java.net/~sherman/script/webrev.00/
>> but it has a "scalability" problem that needs to be addressed when the string pool 
>> grows beyond 64k, and it
>> is a little slow)
> I'm looking into that.
> For a start, my string pool has a size of only 35243.
> -Ulf
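
For readers who want a concrete picture of "a small byte[] pool, names generated at runtime", here is a minimal sketch. It does not reproduce the layout used in the webrev: the class NamePoolSketch, the POOL contents, the WORD_OFF/WORD_LEN index arrays and the word-id encoding are all assumptions made for illustration. It also applies two of the suggestions quoted above: the first word is appended before the loop, so no "notFirst" flag is needed, and the StringBuilder is presized to the maximum name length of 83 mentioned in the thread.

```java
import java.nio.charset.StandardCharsets;

final class NamePoolSketch {

    // Hypothetical pool: ASCII words stored back to back in one byte[].
    static final byte[] POOL =
        "LATINSMALLCAPITALLETTERA".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical index: word id -> offset and length inside POOL.
    // ids: 0=LATIN, 1=SMALL, 2=CAPITAL, 3=LETTER, 4=A
    static final int[] WORD_OFF = { 0, 5, 10, 17, 23 };
    static final int[] WORD_LEN = { 5, 5, 7, 6, 1 };

    // Maximum name length quoted in the thread; used only to presize the builder.
    static final int MAX_NAME_LENGTH = 83;

    // Builds a name from a sequence of word ids, separated by single spaces.
    static String buildName(int[] wordIds) {
        if (wordIds.length == 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder(MAX_NAME_LENGTH);
        appendWord(sb, wordIds[0]);          // first word, outside the loop
        for (int i = 1; i < wordIds.length; i++) {
            sb.append(' ');                  // separator needs no flag check
            appendWord(sb, wordIds[i]);
        }
        return sb.toString();
    }

    // Copies one pooled ASCII word into the builder.
    private static void appendWord(StringBuilder sb, int id) {
        int off = WORD_OFF[id];
        int end = off + WORD_LEN[id];
        for (int i = off; i < end; i++) {
            sb.append((char) (POOL[i] & 0x7f));
        }
    }

    public static void main(String[] args) {
        System.out.println(buildName(new int[] { 0, 2, 3, 4 }));
    }
}
```

With 16-bit offsets instead of the int[] used here, a pool like this could not address more than 64k bytes, which may be the kind of limit the "scalability" remark refers to; that reading is a guess, not something the thread states.
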
