Codereview request for 7096080: UTF8 update and new CESU-8 charset

Xueming Shen xueming.shen at
Thu Sep 29 22:27:46 UTC 2011

On 09/29/2011 02:16 PM, Ulf Zibis wrote:
> Please use spaces with ternary operators: Lines 155, 216
> For brevity you could use sr instead of srcRemaining, consistent with sa, sp, sl.
>  420         // returns -1 if there is malformed byte(s) and the
> better:
>  420         // returns -1 if there is/are malformed byte(s) and the
>  466                             sp -=3;
> There should be a space:  sp -= 3;

Webrev has been updated accordingly.

>  280                     if (Character.isSurrogate(c))
>  281                         return malformedForLength(src, sp, dst, dp, 3);
> Shouldn't we return cr.length() = 1 to allow the remaining 2 bytes to be
> interpreted again?

Actually I don't know the answer. My reading of D93a/D93b suggests that
we might interpret it as a whole, given that the bytes are actually in
the well-formed byte pattern range listed in Table 3.7, but are
"ill-formed" simply because they encode a surrogate value rather than a
scalar value, so I would interpret the whole 3 bytes as a maximal
subpart. Given that D93a/b is "best practices for using U+FFFD", either
way should be fine. We do have Unicode experts on the list, so maybe
they can share their opinion on what the behavior should be in this
case, from the Standard's point of view?
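
To make the two options concrete, here is a minimal sketch of what each
one produces under REPLACE. It is not the code from the webrev: the
class name, the decode() helper and its surrogateLength parameter are
made up for illustration, and the decoder is deliberately simplified
(ASCII and 3-byte sequences only, no overlong checks).

```java
public class SurrogateMalformedLength {

    private static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Simplified decoder (hypothetical, for illustration only).
     * A 3-byte sequence that decodes to a surrogate is replaced by one
     * U+FFFD and then surrogateLength bytes (1 or 3) are skipped.
     * Any other non-ASCII byte is replaced by U+FFFD one byte at a time.
     */
    static String decode(byte[] src, int surrogateLength) {
        StringBuilder dst = new StringBuilder();
        int sp = 0;
        while (sp < src.length) {
            int b1 = src[sp] & 0xFF;
            if (b1 < 0x80) {
                // ASCII
                dst.append((char) b1);
                sp++;
            } else if ((b1 >> 4) == 0xE && sp + 2 < src.length
                    && isContinuation(src[sp + 1])
                    && isContinuation(src[sp + 2])) {
                // 3-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
                char c = (char) (((b1 & 0x0F) << 12)
                        | ((src[sp + 1] & 0x3F) << 6)
                        | (src[sp + 2] & 0x3F));
                if (Character.isSurrogate(c)) {
                    dst.append('\uFFFD');
                    sp += surrogateLength;   // the choice under discussion
                } else {
                    dst.append(c);
                    sp += 3;
                }
            } else {
                // stray continuation byte or unsupported lead byte
                dst.append('\uFFFD');
                sp++;
            }
        }
        return dst.toString();
    }

    public static void main(String[] args) {
        // 'a', then U+D800 encoded as ED A0 80, then 'b'
        byte[] in = { 'a', (byte) 0xED, (byte) 0xA0, (byte) 0x80, 'b' };
        System.out.println(decode(in, 3).length());  // 3: a, U+FFFD, b
        System.out.println(decode(in, 1).length());  // 5: a, 3 x U+FFFD, b
    }
}
```

With a malformed length of 3 the whole sequence collapses into a single
U+FFFD; with a length of 1 the two trailing bytes are looked at again
and, being bare continuation bytes, each yields its own U+FFFD.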

> Am 29.09.2011 05:27, schrieb Xueming Shen:
>> Hi,
>> On 9/28/2011 3:44 PM, Ulf Zibis wrote:
>>> 5. IMHO charset CESU-8 should be hosted in extended-charsets, 
>>> otherwise it should be added to java.nio.StandardCharsets
>> We have lots of charsets provided via the "standard charset provider" 
>> (in rt.jar) but not listed in StandardCharsets.
> Yes, but the reason to add CESU-8 to StandardCharsets was the
> supposed demand to treat all Unicode charsets equivalently.
> Otherwise there is no obstacle to host CESU-8 in extended-charsets.
> IMHO, CESU-8 addresses corner case compatibility issues, but not 
> "standard" requirements.

Putting CESU-8 into the "standard charset provider" (which is only an
implementation detail) does not mean it is a "standard" requirement; it
just means it is bundled into rt.jar. The reason I put it there is to
keep it together with UTF-8, on the assumption that you might need it
around when using the updated UTF-8, which no longer handles those
3/6-byte

