RFR : CharsetEncoder.maxBytesPerChar() should return 4 for UTF-8
martinrb at google.com
Mon Sep 22 21:44:18 UTC 2014
Much of the documentation (especially the early stuff when supplementary
characters were rarer/nonexistent) doesn't distinguish between "character
(codepoint)" and "char" clearly enough. Fixing that in all the docs would
be a fine thing to do.
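
To make the char-vs-codepoint distinction concrete, here is a small sketch (class name and the chosen codepoints are just illustrative). A BMP character such as U+20AC is one Java char encoding to 3 UTF-8 bytes, which is the per-char worst case; a supplementary character is a surrogate pair, so its 4 UTF-8 bytes are spread over two chars:

```java
import java.nio.charset.StandardCharsets;

public class MaxBytesDemo {
    public static void main(String[] args) {
        // Euro sign: one char, three UTF-8 bytes -- the worst case per char.
        String euro = "\u20AC";
        System.out.println(euro.length());                                // 1
        System.out.println(euro.getBytes(StandardCharsets.UTF_8).length); // 3

        // U+1F600 (supplementary): a surrogate pair -- two chars, four
        // UTF-8 bytes, i.e. only 2 bytes per char.
        String supp = new String(Character.toChars(0x1F600));
        System.out.println(supp.length());                                // 2
        System.out.println(supp.getBytes(StandardCharsets.UTF_8).length); // 4

        // On JDKs incorporating the JDK-6957230 change this prints 3.0.
        System.out.println(
            StandardCharsets.UTF_8.newEncoder().maxBytesPerChar());
    }
}
```

So "4 bytes per codepoint" and "3 bytes per char" are both true for UTF-8, which is exactly where the two readings of the javadoc diverge.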
On Mon, Sep 22, 2014 at 2:34 PM, Mark Thomas <markt at apache.org> wrote:
> On 22/09/2014 22:23, Martin Buchholz wrote:
> > I think you are mistaken. It's maxBytesPerChar, not maxBytesPerCodepoint!
> You are going to have to explain that some more. The Javadoc for
> CharsetEncoder.maxBytesPerChar() is explicit:
> Returns the maximum number of bytes that will be produced for each
> character of input.
> For UTF-8 that number is 4, not 3. A quick look at the source for the
> default UTF-8 encoder confirms that there are cases where it will output
> 4 bytes for a single input character.
> > changeset: 3116:b44704ce8a08
> > user: sherman
> > date: 2010-11-19 12:58 -0800
> > 6957230: CharsetEncoder.maxBytesPerChar() reports 4 for UTF-8; should be
> > Summary: changed utf-8's CharsetEncoder.maxBytesPerChar to 3
> > Reviewed-by: alanb
> > On Mon, Sep 22, 2014 at 1:14 PM, Ivan Gerasimov <ivan.gerasimov at oracle.com> wrote:
> >> Hello!
> >> The UTF-8 encoding allows characters that are 4 bytes long.
> >> However, CharsetEncoder.maxBytesPerChar() currently returns 3.0, which is
> >> not always enough.
> >> Would you please review the simple fix for this issue?
> >> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8058875
> >> WEBREV: http://cr.openjdk.java.net/~igerasim/8058875/0/webrev/
> >> Sincerely yours,
> >> Ivan