naoto.sato
Fri Sep 20 20:25:38 UTC 2019


I am looking at the following bug:

and hoping someone who is familiar with the encoder will clear things 
up. As stated in the bug report, the description of 
CharsetEncoder.maxBytesPerChar() reads:

Returns the maximum number of bytes that will be produced for each 
character of input. This value may be used to compute the worst-case 
size of the output buffer required for a given input sequence.
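
For context, this is how a caller would typically use that value to 
size the output buffer, as the spec suggests (a minimal sketch; the 
class name and sample input are mine):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.StandardCharsets;

    public class WorstCaseSize {
        public static void main(String[] args) {
            CharsetEncoder encoder = StandardCharsets.UTF_16.newEncoder();
            String input = "hello";
            // Worst-case output size, per the spec: input length
            // times maxBytesPerChar().
            int capacity = (int) Math.ceil(
                input.length() * encoder.maxBytesPerChar());
            ByteBuffer out = ByteBuffer.allocate(capacity);
            encoder.encode(CharBuffer.wrap(input), out, true);
            encoder.flush(out);
            System.out.println("allocated=" + capacity
                + ", used=" + out.position());
        }
    }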

Initially I thought it would return the maximum number of encoded bytes 
for an arbitrary input "char" value, i.e., a code unit of the UTF-16 
encoding. For example, any of the UTF-16 charsets (UTF-16, UTF-16BE, 
and UTF-16LE) would return 2 from the method, as a code unit is a 
16-bit value. In reality, the encoder of the UTF-16 charset returns 4, 
which accounts for the initial byte-order mark (2 bytes for a code unit 
plus 2 bytes for the BOM). This is justifiable, though, since the value 
is meant to cover the worst-case scenario. I believe this 
implementation has been there since the inception of java.nio, i.e., 
JDK 1.4.
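
A quick way to observe this (a minimal sketch; per the behavior 
described above, UTF-16 reports 4.0, while the BOM-less variants 
should report 2.0 if only the BOM accounts for the difference):

    import java.nio.charset.Charset;

    public class MaxBytesPerCharDemo {
        public static void main(String[] args) {
            for (String name : new String[] {"UTF-16", "UTF-16BE", "UTF-16LE"}) {
                // Ask each encoder what it reports as the per-char maximum.
                System.out.println(name + ": "
                    + Charset.forName(name).newEncoder().maxBytesPerChar());
            }
        }
    }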

Obviously I can clarify the spec of maxBytesPerChar() to account for 
conversion-independent prefix (or suffix) bytes such as the BOM, but I 
am not sure of the original intent of the method. If it is meant to 
return the pure maximum number of bytes for a single input char, UTF-16 
should also have been returning 2. But in that case, callers would not 
be able to calculate the worst-case byte buffer size, as in the bug 
report.
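
To make the arithmetic concrete: even a one-char input encoded with 
UTF-16 produces 4 bytes (2 for the BOM, 2 for the char), so a buffer 
sized at 2 bytes per char would be too small. A minimal sketch:

    import java.nio.charset.StandardCharsets;

    public class BomOverhead {
        public static void main(String[] args) {
            byte[] encoded = "A".getBytes(StandardCharsets.UTF_16);
            // 2 bytes of BOM + 2 bytes for 'A' = 4 bytes for a
            // single-char string, so sizing the buffer as
            // length * 2 would under-allocate here.
            System.out.println(encoded.length); // prints 4
        }
    }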

