Codereview request: CR 7040220 java/char_encodin Optimize UTF-8 charset for String.getBytes()/toCharArray()

Xueming Shen xueming.shen at
Mon May 2 17:15:45 UTC 2011

  On 5/2/2011 7:31 AM, Alan Bateman wrote:
> Xueming Shen wrote:
>>  Hi
>> This is motivated by Neil's request to optimize common-case UTF8 path 
>> for native ZipFile.getEntry calls [1].
>> As I said in my replying email [2] I believe a better approach might 
>> be to "patch" UTF8 charset directly to
>> implement sun.nio.cs.ArrayDecoder/Encoder interface to speed up the 
>> coding operation for array based
>> encoding/decoding under certain circumstance, as we did for all 
>> single byte charsets in #6636323 [3]. I
>> have a old blog [4] that has some data for this optimization.
>> The original plan was to do the same thing for our new UTF8 [5] as 
>> well in JDK7, but then (excuse, excuse)
>> I was just too busy to come back to this topic till 2 days ago. After 
>> two days of small tweaking here and there
>> and testing those possible corner cases I can think of, I'm happy 
>> with the result and think it might be
>> worth sending it out for a codereview for JDK7, knowing we only have 
>> couple days left.
>> The webrev is at
> I went through the changes and the approach looks good to me - thanks 
> for jumping on this. Thanks Ulf for helping review. Also thanks Neil 
> for reporting this and testing with Sherman's change to verify that it 
> addresses the performance regression.
> Sherman - just a couple of minor comments:

Thanks Alan!

Webrev has been updated accordingly.

I renamed the getBBuffer to getByteBuffer, now it "looks" better:-)


> It would be good to put a blank line after the ASCII-only loops so 
> that future maintainers can easily distinguish the loops (this would 
> make it consistent with decodeArrayLoop for example).
> In several places the test is "if (CodingErrorAction.REPLACE != 
> malformedInputAction())". Personally I would swap this to "if 
> (malformedInputAction() != CodingErrorAction.REPLACE)". This reminds, 
> if a couple of these slipped through into the zip code recently, eg: 
> "if (false == inf.ended())",  "if (false == streams.isEmpty())".
> The caching of the ByteBuffer and getBBuffer for the malformed case 
> isn't as nice as the original code. Minimally I would move the 
> declaration of bb so that it's not in the middle of dl and dp, and 
> rename getBBuffer. Alternatively I would just get rid of it as it 
> shouldn't be performance critical.
> In TestStringCodeUTF8 it might be cleaner to put the body of main into 
> its own method and then call it once, set the security manager, and 
> then call it again.
> That's all I have,
> -Alan.

More information about the core-libs-dev mailing list