RFR : CharsetEncoder.maxBytesPerChar() should return 4 for UTF-8
martinrb at google.com
Tue Sep 23 15:26:22 UTC 2014
Again, it's maxBytes per Java "char", not maxBytes per Unicode character.
Allocating a big enough buffer is pretty much the only reason
for maxBytesPerChar's existence.
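
To make the buffer-sizing point concrete, here is a minimal sketch (not taken from the JDK sources; the class and method names are made up for illustration). It sizes a ByteBuffer from maxBytesPerChar() multiplied by the number of Java chars, then encodes into it. A supplementary character occupies two chars, so its 4-byte UTF-8 form still fits within 2 * maxBytesPerChar(), assuming the encoder reports the 3.0 discussed in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class MaxBytesPerCharDemo {

    // Hypothetical helper: encodes s as UTF-8 into a buffer whose capacity
    // is derived from maxBytesPerChar(), and returns the bytes produced.
    public static int encodedLength(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();

        // s.length() counts 16-bit chars (UTF-16 code units), not code points.
        int capacity = (int) Math.ceil(s.length() * (double) enc.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(capacity);

        CoderResult r = enc.encode(CharBuffer.wrap(s), out, true);
        if (!r.isUnderflow()) {
            // Overflow would mean the capacity estimate was too small;
            // an error result would mean malformed input (e.g. a lone surrogate).
            throw new IllegalStateException("encode failed: " + r);
        }
        r = enc.flush(out);
        if (!r.isUnderflow()) {
            throw new IllegalStateException("flush failed: " + r);
        }
        return out.position();
    }

    public static void main(String[] args) {
        String bmp = "\u20AC";           // U+20AC, one char
        String supp = "\uD83D\uDE00";    // U+1F600, two chars (a surrogate pair)

        System.out.println("maxBytesPerChar = "
                + StandardCharsets.UTF_8.newEncoder().maxBytesPerChar());
        System.out.println(bmp.length() + " char(s) -> " + encodedLength(bmp) + " bytes");
        System.out.println(supp.length() + " char(s) -> " + encodedLength(supp) + " bytes");
    }
}
```

The surrogate pair comes out as one 4-byte sequence, i.e. 2 bytes per char, which is why the per-char bound does not need to be 4.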
On Tue, Sep 23, 2014 at 7:58 AM, Salter, Thomas A <Thomas.Salter at unisys.com> wrote:
> This response confuses me. Are you saying that the UTF8 encoder is not
> really producing UTF8? RFC 2279 and 3629 both clearly state that
> surrogates must be combined to form a 32-bit value which is then encoded as
> a 4-byte sequence. In fact, the RFCs refer to the alternate encoding,
> CESU-8, which encodes each half of the surrogate pair as a 3-byte
> UTF-8 sequence.
> I guess returning 3.0 for maxBytesPerChar works for the purpose of
> allocating a big enough byte buffer, but the explanation in this thread is
> Tom Salter
> Date: Tue, 23 Sep 2014 11:37:07 +0400
> From: Ivan Gerasimov <ivan.gerasimov at oracle.com>
> To: Xueming Shen <xueming.shen at oracle.com>, Martin Buchholz
> <martinrb at google.com>
> Cc: nio-dev at openjdk.java.net, core-libs-dev
> <core-libs-dev at openjdk.java.net>
> Subject: Re: RFR : CharsetEncoder.maxBytesPerChar() should
> return 4 for UTF-8
> Message-ID: <54212323.5080907 at oracle.com>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> Martin, Sherman thanks for clarification!
> Closing the bug as not a bug.
> > The "character" in the nio Charset and CharDe/Encoder is specified as
> > "sixteen-bit Unicode
> > code unit", so it is reasonable to interpret the "character" in the
> > "maximum number of bytes
> > that will be produced for each character of input" to be the Java
> > "char" as well. In case of
> > UTF8, each 4-byte form supplementary character is always coded into 2
> > surrogate chars,
> > so it's "2 bytes per char".
> > Do we have a real escalation that complains about this?
> Yes, the link is on the bug page:
> I'm going to try to explain what I've just realized about this function :-)
> Sincerely yours,