JDK 9 RFR of 8039474: sun.misc.CharacterDecoder.decodeBuffer should use getBytes(iso8859-1)

Xueming Shen xueming.shen at oracle.com
Thu Apr 10 19:17:22 UTC 2014

On 04/10/2014 12:03 PM, Chris Hegarty wrote:
> On 10 Apr 2014, at 19:50, Xueming Shen<xueming.shen at oracle.com>  wrote:
>> On 04/10/2014 11:38 AM, Mike Duigou wrote:
>>> On Apr 10 2014, at 11:08 , Chris Hegarty<chris.hegarty at oracle.com>   wrote:
>>>>> On 10 Apr 2014, at 18:40, Mike Duigou<mike.duigou at oracle.com>   wrote:
>>>>>> On Apr 10 2014, at 03:21 , Chris Hegarty<chris.hegarty at oracle.com>   wrote:
>>>>>>> On 10 Apr 2014, at 11:03, Ulf Zibis<Ulf.Zibis at CoSoCo.de>   wrote:
>>>>>>> Hi Chris,
>>>>>>> Am 10.04.2014 11:04, schrieb Chris Hegarty:
>>>>>>>> Trivially, you could (but do not have to) use java.nio.charset.StandardCharsets.ISO_8859_1 to avoid the cost of the String-to-Charset lookup.
>>>>>>> In earlier tests, Sherman and I found that the cost of initializing a new charset object is higher than the lookup of an existing object in the cache.
>>>>>>> And it's even better to use the same String instance for the lookup as the one that was used to cache the charset.
>>>>>> Interesting… thanks for letting me know.  Presumably, the assumption is that StandardCharsets is not initialized elsewhere, by another dependency.
>>>>> Generally it's safe to assume that StandardCharsets will already be initialized. If it isn't initialized we should consider it an amortized cost.
>>>> In which case, why would the string version be more performant than the version that already takes the Charset? Doesn't the string version need to do a lookup?
>>> There is a cache in StringCoding that is used only in the byte[] getBytes(String charsetName) case, not in the byte[] getBytes(Charset charset) case. The rationale in StringCoding::decode(Charset cs, byte[] ba, int off, int len) may need to be revisited, as it is certainly surprising that using the string constant charset name is faster than using the Charset constant.
>> It is surprising :-) In theory you can't cache the de/encoder of a charset from
>> the external world, as the same charset might return a different de/encoder next
>> time. So it was decided back then not to cache the de/encoder for an incoming
>> charset. It might be reasonable to cache those from the StandardCharsets though.
> I would say that it is more than reasonable. ;-) And it is surprising to me too that this usage is not as fast as a constant string.
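
For illustration, the two call forms compared above can be put side by side.
This is only a minimal sketch, not code from the change under review; the
class name GetBytesForms and its helper methods are hypothetical. Both forms
produce the same bytes; the point made in the thread is only that the lookup
by name goes through a coder cache while the Charset overload does not.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class GetBytesForms {

    // String-name form: the charset is looked up by name at call time.
    static byte[] byName(String s) {
        try {
            return s.getBytes("ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Charset-constant form: no name lookup and no checked exception.
    static byte[] byCharset(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9";
        // Both forms encode to the same four bytes.
        System.out.println(Arrays.equals(byName(s), byCharset(s)));
    }
}
```

Which form is faster depends on the JDK version in use, so any choice
between them should be backed by a measurement on the target runtime.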

Charset.equals() does explicitly mention "same canonical name", as below:

      * Tells whether or not this object is equal to another.
      * <p> Two charsets are equal if, and only if, they have the same canonical
      * names.  A charset is never equal to any other type of object. </p>
      * @return <tt>true</tt> if, and only if, this charset is equal to the
      *          given object

But it is very reasonable :-) to assume someone might pass in a home-made
charset implementation with the same canonical name as the one in our/jdk
charset repository. Then we have another debate on which one should be
used in this case.
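
A minimal sketch of that scenario, assuming a hypothetical home-made class
(HomeMadeLatin1) that reuses the canonical name "ISO-8859-1" but hands out
US-ASCII coders, purely so that the difference in behavior is visible. None
of these names come from the JDK sources or from the change under review.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, for illustration only.
class SameNameCharset {

    // Home-made charset that claims the canonical name of the JDK's ISO-8859-1.
    static final class HomeMadeLatin1 extends Charset {
        HomeMadeLatin1() {
            super("ISO-8859-1", null); // same canonical name, no aliases
        }

        @Override
        public boolean contains(Charset cs) {
            return cs.name().equals(name());
        }

        // Deliberately different coders: anything outside US-ASCII is unmappable.
        @Override
        public CharsetDecoder newDecoder() {
            return StandardCharsets.US_ASCII.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            return StandardCharsets.US_ASCII.newEncoder();
        }
    }

    static final Charset HOME_MADE = new HomeMadeLatin1();

    // Charset.equals() compares canonical names only.
    static boolean equalByName() {
        return HOME_MADE.equals(StandardCharsets.ISO_8859_1);
    }

    static byte[] encodeWith(Charset cs, String s) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String s = "\u00e9";
        System.out.println(equalByName());
        // The two "equal" charsets encode the same character differently.
        System.out.println(encodeWith(StandardCharsets.ISO_8859_1, s)[0] & 0xFF);
        System.out.println(encodeWith(HOME_MADE, s)[0] & 0xFF);
    }
}
```

Because the two objects compare equal while encoding differently, a cache
keyed on the Charset object (or on its name) would have to decide which
implementation's coder to hand back, which is the debate mentioned above.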

