Codereview request: CR 7040220 java/char_encodin Optimize UTF-8 charset for String.getBytes()/toCharArray()
Ulf.Zibis at gmx.de
Thu Apr 28 13:12:16 UTC 2011
According to comments in 6795537
I additionally assume
else if (b1< (byte)0xc2)
should be a little faster than
else if ((b1>> 5) == -2)
if (isMalformed2(b1, b2))
could be replaced by
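
To make the comparison of the two lead-byte tests concrete: taken on their own they do not select the same bytes, so the second is not a drop-in replacement for the first, and how the surrounding branches were meant to be rearranged is not shown here. The following is only a minimal sketch, not code from the webrev; the class name LeadByteTests and its method names are made up for illustration. It enumerates the byte values on which the two expressions disagree.

```java
// Illustrative only: compares the two expressions discussed above.
final class LeadByteTests {
    // Test quoted from the decoder: true when the top three bits are 110.
    static boolean shiftTest(byte b1) {
        return (b1 >> 5) == -2;
    }

    // Comparison proposed in this mail (signed byte comparison).
    static boolean compareTest(byte b1) {
        return b1 < (byte) 0xc2;
    }

    // Prints every byte value on which the two tests give different answers.
    public static void main(String[] args) {
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            if (shiftTest(b) != compareTest(b)) {
                System.out.printf("0x%02x shift=%b compare=%b%n",
                        i, shiftTest(b), compareTest(b));
            }
        }
    }
}
```

The shift test is true for 0xC0..0xDF. The comparison is true for 0x80..0xC1, which, once the b1 >= 0 (ASCII) case has been handled, are exactly the bytes that can never start a well-formed UTF-8 sequence.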
On 28.04.2011 14:44, Ulf Zibis wrote:
> Interesting results!
> Some days ago we had the discussion about constants for standard Charsets.
> Looking at your results, I see that when using *charset name constants*, the conversion mostly
> performs a little better (up to 25 %) than when using *charset constants*.
> So again my question: Why do we need those charset constants?
> IMO, we have more need of de/encoder constants and an array-based API for the Charset class.
> In malformed(byte[] src, int sp, int nb) I think you could cache the ByteBuffer bb, instead of
> instantiating a new one every time. For this the method should not be static to ensure
> While you are at it, did you refer to:
> 6795537 - UTF_8$Decoder returns wrong results
> 6798514 - Charset UTF-8 accepts CESU-8 codings
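
The ByteBuffer caching suggested in the quoted mail could look roughly like the sketch below. It is an assumption about the shape of such a change, not the code from the webrev: the class CachedWrapSketch, its fields and the method view are hypothetical names. Holding the buffer in an instance field is what forces the method to be non-static.

```java
import java.nio.ByteBuffer;

// Hypothetical holder; the real decoder class and its fields may look different.
final class CachedWrapSketch {
    private byte[] wrapped;   // array the cached buffer currently wraps
    private ByteBuffer bb;    // reused across calls instead of re-allocated

    // Returns a view of src[sp, sp + nb) and allocates only when src changes.
    ByteBuffer view(byte[] src, int sp, int nb) {
        if (bb == null || wrapped != src) {
            bb = ByteBuffer.wrap(src);
            wrapped = src;
        }
        bb.clear();           // position = 0, limit = capacity
        bb.position(sp);
        bb.limit(sp + nb);
        return bb;
    }
}
```

Such a cached buffer is mutable state, so an object like this must not be shared between threads without synchronization.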
> On 28.04.2011 08:34, Xueming Shen wrote:
>> This is motivated by Neil's request to optimize the common-case UTF8 path for native
>> ZipFile.getEntry calls.
>> As I said in my reply email, I believe a better approach might be to "patch" the UTF8 charset
>> directly to implement the sun.nio.cs.ArrayDecoder/Encoder interface to speed up the coding
>> operation for array-based encoding/decoding under certain circumstances, as we did for all
>> single-byte charsets in #6636323. I have an old blog that has some data for this optimization.
>> The original plan was to do the same thing for our new UTF8 as well in JDK7, but then
>> (excuse, excuse) I was just too busy to come back to this topic till 2 days ago. After two days
>> of small tweaking here and there and testing the possible corner cases I can think of, I'm
>> happy with the result and think it might be worth sending it out for a codereview for JDK7,
>> knowing we only have a couple of days left.
>> The webrev is at
>> Those tests are supposed to make sure the coding result from the new paths for String.getBytes()/
>> toCharArray() matches the result from the existing implementation.
>> The performance results of running StrCodingBenchmarkUTF8 (included in webrev) on my linux
>> box in -client and -server mode respectively are included at
>> The microbenchmark measures 1-byte, 2-byte, 3-byte and 4-byte UTF-8 sequences separately, with
>> different lengths of data (from 12 bytes to thousands).
>>  http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-April/006710.html
>>  http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-April/006726.html
>>  http://cr.openjdk.java.net/~sherman/6636323_6636319/webrev
>>  http://blogs.sun.com/xuemingshen/entry/faster_new_string_bytes_cs
>>  http://blogs.sun.com/xuemingshen/entry/the_big_overhaul_of_java
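
The consistency check described in the quoted mail (the String-level paths must give the same result as the existing coder) could be sketched roughly as below. This is not the test from the webrev; Utf8PathCheck and its methods are illustrative names, and comparing String.getBytes(Charset) and the String(byte[], Charset) constructor against a fresh CharsetEncoder/CharsetDecoder is an assumption about what such a check would compare. The sketch only covers well-formed input, because the String methods replace malformed input while a fresh encoder or decoder reports it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative check: String-level UTF-8 paths versus the explicit coder objects.
final class Utf8PathCheck {
    // Encodes through a fresh CharsetEncoder (the buffer-based path).
    static byte[] encodeViaEncoder(String s) {
        try {
            ByteBuffer bb = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("input was not well-formed", e);
        }
    }

    // Decodes through a fresh CharsetDecoder (the buffer-based path).
    static String decodeViaDecoder(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("input was not well-formed", e);
        }
    }

    // True when both encoding paths and both decoding paths agree for s.
    static boolean pathsAgree(String s) {
        byte[] viaString = s.getBytes(StandardCharsets.UTF_8);
        String back = new String(viaString, StandardCharsets.UTF_8);
        return Arrays.equals(viaString, encodeViaEncoder(s))
                && back.equals(decodeViaDecoder(viaString));
    }
}
```

Calling pathsAgree on strings whose characters need one, two, three and four bytes in UTF-8 exercises the same four cases the microbenchmark measures.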