Which CoderResult for malformed surrogate pairs?

Martin Buchholz martinrb at google.com
Wed Sep 10 19:55:37 UTC 2008

There is another reason, aside from our Beloved Compatibility,
to prefer returning length == 1.  It is likely that the calling code will
delete the malformed chars and present the rest to a human.
The second char *might* be valid, so why hide it?
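
To make that concrete, here is a minimal sketch of the kind of calling code I have in mind. The class and method names are hypothetical (they are not JDK API); only CharsetEncoder, CoderResult, CharBuffer, ByteBuffer and StandardCharsets are real. It assumes the encoder's default REPORT action and skips exactly CoderResult.length() chars on every error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK: encodes to UTF-8 and drops
// whatever span the encoder reports as malformed or unmappable.
final class SkipMalformed {

    static String encodeSkippingErrors(String s) {
        // A fresh encoder reports errors instead of replacing or ignoring them.
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(s);
        // UTF-8 needs at most 3 bytes per char, so this buffer cannot overflow.
        ByteBuffer out = ByteBuffer.allocate(s.length() * 3 + 8);

        while (true) {
            CoderResult cr = enc.encode(in, out, true);
            if (cr.isUnderflow()) {
                break; // all input consumed
            }
            if (cr.isError()) {
                // Skip exactly as many chars as the encoder blames.
                in.position(in.position() + cr.length());
                continue;
            }
            throw new IllegalStateException("unexpected result: " + cr);
        }
        enc.flush(out);
        out.flip();
        // Decode again so the caller sees what a human would be shown.
        return StandardCharsets.UTF_8.decode(out).toString();
    }

    public static void main(String[] args) {
        char loneHigh = (char) 0xD800; // high surrogate with no low surrogate after it
        System.out.println(encodeSkippingErrors("A" + loneHigh + "B"));
    }
}
```

With a reported length of 1, only the lone high surrogate is dropped and the B after it should survive; with a reported length of 2, the same loop would swallow the B as well.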


On Wed, Sep 10, 2008 at 08:22, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
> Hi Martin,
> thanks for the quick first answer.
> You are right, both chars could be corrupt.
> IMO, returning CoderResult.malformedForLength(2) would be more
> informative, and the developer could decide for himself whether to
> take CoderResult.length() into account.
> Why differentiate by length at all if nobody makes use of it?
> There is no other cause that would entail a length other than 1 from
> CharsetEncoder.
> So do you think it would be against the spec to return
> CoderResult.malformedForLength(2) in such cases, even though
> CoderResult.malformedForLength(1) isn't a bug?
> BTW:
> The chance of erroneously receiving a high surrogate in the range
> \uD800..\uDBFF is 1.56 %.
> The chance of erroneously receiving a char outside the range
> \uDC00..\uDFFF after a correct high surrogate is 98.44 %.
> -Ulf
> On 09.09.2008 23:58, Martin Buchholz wrote:
>> I think when encountering a single high surrogate,
>> it is correct to return a length of either 1 or 2.
>> A thought experiment: a cosmic ray that mangled exactly one char
>> could have caused this situation if the original sequence was
>> of length either 1 or 2, depending on which char was mangled.
>> Not a Defect.
>> Martin
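
For reference, the percentages quoted in this thread can be recomputed with a few lines of Java. This is only a sketch under the assumption that every 16-bit char value is equally likely, which a real corrupted stream need not satisfy; the class and method names are hypothetical.

```java
// Hypothetical helper: odds of hitting surrogate ranges, assuming that all
// 65536 char values are equally likely (an assumption, not a measurement).
final class SurrogateOdds {
    static final int ALL_CHARS = 0x10000;
    static final int HIGH_SURROGATES = 0xDBFF - 0xD800 + 1; // 1024 values
    static final int LOW_SURROGATES = 0xDFFF - 0xDC00 + 1;  // 1024 values

    // Chance, in percent, that a random char is a high surrogate.
    static double percentHighSurrogate() {
        return 100.0 * HIGH_SURROGATES / ALL_CHARS;
    }

    // Chance, in percent, that a random char is NOT a low surrogate,
    // i.e. that it cannot complete a pair started by a high surrogate.
    static double percentNotLowSurrogate() {
        return 100.0 * (ALL_CHARS - LOW_SURROGATES) / ALL_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(percentHighSurrogate() + " %");
        System.out.println(percentNotLowSurrogate() + " %");
    }
}
```

Under that assumption the two values work out to 1.5625 % and 98.4375 %.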

More information about the core-libs-dev mailing list