Codereview request: CR 7040220 java/char_encodin Optimize UTF-8 charset for String.getBytes()/toCharArray()

Ulf Zibis Ulf.Zibis at
Thu Apr 28 12:44:06 UTC 2011

Interesting results!

A few days ago we had a discussion about constants for standard Charsets.

Looking at your results, I see that with *charset name constants* the conversion mostly performs
a little better (up to 25 %) than with *charset constants*.
So again my question: why do we need those charset constants?
IMO, we have more need for de/encoder constants and an array-based API for the Charset class.
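
To make the two call shapes being compared concrete, here is a minimal sketch. The class and constant names are hypothetical and chosen only for illustration; the sketch shows that both variants yield the same bytes and makes no claim about their relative speed.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

class CharsetLookupDemo {
    // Hypothetical constants, named here only for illustration.
    static final String UTF_8_NAME = "UTF-8";                  // "charset name constant"
    static final Charset UTF_8 = Charset.forName("UTF-8");     // "charset constant"

    // Encode by passing the charset name to String.getBytes(String).
    static byte[] encodeByName(String s) {
        try {
            return s.getBytes(UTF_8_NAME);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Encode by passing the Charset object to String.getBytes(Charset).
    static byte[] encodeByCharset(String s) {
        return s.getBytes(UTF_8);
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \u20ac";
        System.out.println(Arrays.equals(encodeByName(s), encodeByCharset(s)));
    }
}
```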

In malformed(byte[] src, int sp, int nb) I think you could cache the ByteBuffer bb instead of
instantiating a new one every time. For this the method should not be static, to ensure thread-safety.
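
A minimal sketch of the caching idea, assuming a per-instance helper. MalformedHelper and bufferFor are hypothetical names and this is not the code from the webrev; it only illustrates re-using one wrapper per instance instead of wrapping the array on every call.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not the webrev's code.
class MalformedHelper {
    // One cached wrapper per instance. Because the field is not static,
    // two decoder instances (and hence two threads) never share it.
    private ByteBuffer cached;

    // Returns a buffer over src with position sp and limit sp + nb.
    // The array is wrapped again only when a different array is passed in.
    ByteBuffer bufferFor(byte[] src, int sp, int nb) {
        if (cached == null || cached.array() != src) {
            cached = ByteBuffer.wrap(src);
        }
        cached.clear();          // reset position and limit before narrowing
        cached.limit(sp + nb);
        cached.position(sp);
        return cached;
    }
}
```

The returned buffer is only valid until the next call on the same instance, which is the price of avoiding the allocation.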

While you are at it, did you look at:
6795537 - UTF_8$Decoder returns wrong results
6798514 - Charset UTF-8 accepts CESU-8 codings
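
For readers unfamiliar with the second issue, the following sketch shows what a CESU-8 coding looks like next to the regular UTF-8 form of the same character. StrictUtf8Check is a hypothetical name; the expectation that a strict decoder accepts the first byte sequence and rejects the second is stated as an assumption about a conforming UTF-8 decoder, not as a description of the JDK build under review.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictUtf8Check {
    // Returns true if the decoder, configured to report errors instead of
    // replacing them, accepts the whole byte sequence.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            Charset.forName("UTF-8").newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+10000 as one 4-byte UTF-8 sequence.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80};
        // The same character in CESU-8: surrogates D800 and DC00,
        // each encoded as its own 3-byte sequence.
        byte[] cesu8 = {(byte) 0xED, (byte) 0xA0, (byte) 0x80,
                        (byte) 0xED, (byte) 0xB0, (byte) 0x80};
        System.out.println("UTF-8 form accepted:  " + isWellFormedUtf8(utf8));
        System.out.println("CESU-8 form accepted: " + isWellFormedUtf8(cesu8));
    }
}
```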


On 28.04.2011 08:34, Xueming Shen wrote:
> Hi,
> This is motivated by Neil's request to optimize the common-case UTF8 path for native
> ZipFile.getEntry calls [1].
> As I said in my reply [2], I believe a better approach might be to "patch" the UTF8 charset
> directly to implement the sun.nio.cs.ArrayDecoder/Encoder interface, to speed up the coding
> operation for array-based encoding/decoding under certain circumstances, as we did for all
> single-byte charsets in #6636323 [3]. I have an old blog [4] that has some data for this
> optimization.
> The original plan was to do the same thing for our new UTF8 [5] as well in JDK7, but then
> (excuse, excuse) I was just too busy to come back to this topic until 2 days ago. After two
> days of small tweaks here and there and testing the possible corner cases I can think of,
> I'm happy with the result and think it might be worth sending it out for a codereview for
> JDK7, knowing we only have a couple of days left.
> The webrev is at
> Those tests are supposed to make sure the coding result from the new paths for
> String.getBytes()/toCharArray() matches the result from the existing implementation.
> The performance results of running StrCodingBenchmarkUTF8 (included in webrev) on my linux
> box in -client and -server mode respectively are included at
> The microbenchmark measures 1-byte, 2-byte, 3-byte and 4-byte UTF8 sequences separately,
> with different lengths of data (from 12 bytes to thousands).
> Thanks!
> -Sherman
> [1]
> [2]
> [3]
> [4]
> [5]
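
The kind of consistency check described in the quoted message can be sketched as follows. This is not the test from the webrev; Utf8ConsistencyCheck and its methods are hypothetical names. It compares the String-based path with the general CharsetEncoder path for one sample of each UTF-8 sequence length.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

class Utf8ConsistencyCheck {
    static final Charset UTF_8 = Charset.forName("UTF-8");

    // Encode through the general CharsetEncoder path.
    static byte[] encodeViaEncoder(String s) {
        try {
            ByteBuffer bb = UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if String.getBytes agrees with the encoder path and the
    // bytes decode back to the original string.
    static boolean consistent(String s) {
        byte[] viaString = s.getBytes(UTF_8);
        return Arrays.equals(viaString, encodeViaEncoder(s))
                && new String(viaString, UTF_8).equals(s);
    }

    public static void main(String[] args) {
        // One sample per sequence length: 1, 2, 3 and 4 bytes.
        String[] samples = {"A", "\u00e9", "\u20ac", "\uD800\uDC00"};
        for (String s : samples) {
            System.out.println(s.getBytes(UTF_8).length + " byte(s): " + consistent(s));
        }
    }
}
```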

More information about the core-libs-dev mailing list