RFR: CharsetEncoder.maxBytesPerChar() should return 4 for UTF-8
markt at apache.org
Mon Sep 22 21:34:53 UTC 2014
On 22/09/2014 22:23, Martin Buchholz wrote:
> I think you are mistaken. It's maxBytesPerChar, not maxBytesPerCodepoint!
You are going to have to explain that some more. The Javadoc for
CharsetEncoder.maxBytesPerChar() is explicit:

    Returns the maximum number of bytes that will be produced for each
    character of input.

For UTF-8 that number is 4, not 3. A quick look at the source for the
default UTF-8 encoder confirms that there are cases where it will output
4 bytes for a single input character.
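
For anyone who wants to check the numbers being argued about, here is a minimal sketch, not taken from the JDK sources. It assumes Java 7 or later (for StandardCharsets); the class name Utf8Widths and the helper utf8Length are made up for illustration. It prints what the UTF-8 encoder reports for maxBytesPerChar() and compares the encoded length of a code point that fits in one Java char with one that needs a surrogate pair, which is the char versus code point distinction raised above.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Widths {

    // Number of bytes produced when the whole string is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+20AC (euro sign) lies in the BMP and is a single Java char.
        String euro = "\u20AC";
        // U+1F600 is a supplementary code point, stored as two Java chars
        // (a surrogate pair).
        String emoji = new String(Character.toChars(0x1F600));

        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // The value printed here depends on the JDK in use.
        System.out.println("maxBytesPerChar = " + enc.maxBytesPerChar());

        System.out.println("U+20AC : chars=" + euro.length()
                + " bytes=" + utf8Length(euro));
        System.out.println("U+1F600: chars=" + emoji.length()
                + " bytes=" + utf8Length(emoji));
    }
}
```

The 4-byte sequence is produced for one code point, but that code point is two chars as far as String.length() is concerned. Whether "character" in the Javadoc means char or code point is the question here.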
> changeset: 3116:b44704ce8a08
> user: sherman
> date: 2010-11-19 12:58 -0800
> 6957230: CharsetEncoder.maxBytesPerChar() reports 4 for UTF-8; should be 3
> Summary: changed utf-8's CharsetEncoder.maxBytesPerChar to 3
> Reviewed-by: alanb
> On Mon, Sep 22, 2014 at 1:14 PM, Ivan Gerasimov <ivan.gerasimov at oracle.com> wrote:
>> The UTF-8 encoding allows characters that are 4 bytes long.
>> However, CharsetEncoder.maxBytesPerChar() currently returns 3.0, which is
>> not always enough.
>> Would you please review the simple fix for this issue?
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8058875
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8058875/0/webrev/
>> Sincerely yours,