RFR: 8043592: The basic XML parser based on UKit fails to read XML files encoded in UTF-16BE or LE
huizhe.wang at oracle.com
Fri May 23 05:15:02 UTC 2014
Hi Sherman, Lance,
Thanks for reviews.
It appears that resetting the InputStream is not reliable, since not
every InputStream supports reset. I've modified the code accordingly.
For the other changes, please see my inline comments.
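To illustrate the reset problem, here is a minimal sketch of one way to look at the first few bytes of a stream without depending on mark/reset. It is not the code in the webrev; the class name PeekBytes and the method peek are hypothetical, and PushbackInputStream is just one possible approach.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical helper, not part of the patch under review.
class PeekBytes {
    // Reads up to n bytes, then pushes them back so that the next read
    // on the stream starts from the first byte again. The pushback
    // buffer of 'in' must be at least n bytes large.
    static byte[] peek(PushbackInputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break; // end of stream before n bytes were available
            }
            total += r;
        }
        if (total > 0) {
            in.unread(buf, 0, total);
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        // "<?x" encoded as UTF-16BE without a BOM
        InputStream raw = new ByteArrayInputStream(
                new byte[] {0x00, 0x3C, 0x00, 0x3F, 0x00, 0x78});
        PushbackInputStream in = new PushbackInputStream(raw, 4);
        byte[] head = peek(in, 4);
        System.out.println("peeked " + head.length + " bytes");
        System.out.println("next byte read: " + in.read());
    }
}
```

This works for any InputStream, including those for which markSupported() returns false, because the bytes are held in the wrapper's own buffer rather than in the underlying stream.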
On 5/22/2014 10:25 AM, Xueming Shen wrote:
> (1) Do we really need those shifts at line ln#2989/90 and 2994/95? It
> appears to me
> those bytes have been decided to be ZERO already, we are talking
> mChar = '<' and mChar = '?' here, right?
Fixed. No need indeed.
> (2) for test, maybe we should just do p.loadFromXML(in) ? that path
> should verify the
> fix as well (the real use scenario), right?
I've removed the test and updated LoadAndStoreXML instead, as Alan
suggested, to cover UTF-16BE/LE.
> (3) do we have tests for utf16 bom? if not, I would suggest to throw
> in UTF-16BE/LE-BOM
> into the charset, just in case.
The java.nio.charset.Charset documentation states that the UTF-16
charset writes a BOM when encoding, but UTF-16BE and UTF-16LE do not.
That is why the tests behaved differently: the BOM was detected in the
case of UTF-16, but there was none to detect for UTF-16BE/LE.
I added tests that manually prepend a BOM in the case of UTF-16BE/LE to
verify that the code can handle these cases (although normally such
input won't come with a BOM).
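The difference can be seen with a small sketch like the one below. It is not the updated test itself; BomDemo, withBom and hex are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not the LoadAndStoreXML test.
class BomDemo {
    // Encodes text as UTF-16BE or UTF-16LE and prepends the matching BOM
    // by hand, since these two charsets do not write one when encoding.
    static byte[] withBom(String text, boolean bigEndian) {
        byte[] bom = bigEndian
                ? new byte[] {(byte) 0xFE, (byte) 0xFF}
                : new byte[] {(byte) 0xFF, (byte) 0xFE};
        Charset cs = bigEndian ? StandardCharsets.UTF_16BE
                               : StandardCharsets.UTF_16LE;
        byte[] body = text.getBytes(cs);
        byte[] out = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, out, 0, bom.length);
        System.arraycopy(body, 0, out, bom.length, body.length);
        return out;
    }

    // Formats bytes as upper-case hex for easy comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "<";
        System.out.println("UTF-16       : " + hex(s.getBytes(StandardCharsets.UTF_16)));
        System.out.println("UTF-16BE     : " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE     : " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE+BOM : " + hex(withBom(s, true)));
        System.out.println("UTF-16LE+BOM : " + hex(withBom(s, false)));
    }
}
```

Encoding with UTF-16 yields a leading FE FF followed by big-endian data, while UTF-16BE and UTF-16LE yield only the two data bytes unless the BOM is added manually.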
> On 05/22/2014 09:30 AM, huizhe wang wrote:
>> Refer to 8042889, while verifying/testing 8042889, we noticed that
>> the tiny XML parser failed on UTF-16BE or LE. The cause of the
>> failure was that the parser was actually implemented to abide by the
>> XML specification that required entities encoded in UTF-16 to begin
>> with BOM. The test we used sent a byte array to the parser without
>> BOM, thus failed.
>> Since it's not uncommon for an XML document to lack a BOM, I
>> borrowed the technique used in Xerces to add an additional check for
>> UTF-16 encoding. Please review.
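For readers unfamiliar with this kind of check, the sketch below shows BOM-less UTF-16 detection based on the first four bytes of the input, following the byte patterns for "<?" listed in Appendix F of the XML specification. It is neither the UKit parser code nor the Xerces code; Utf16Sniffer and detect are hypothetical names, and UTF-32/UCS-4 and EBCDIC patterns are deliberately ignored.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the parser code in the webrev.
class Utf16Sniffer {
    // Guesses the encoding from the first bytes of an XML entity.
    // Falls back to UTF-8 when no UTF-16 pattern is recognized.
    static String detect(byte[] b) {
        if (b.length >= 2) {
            int b0 = b[0] & 0xFF;
            int b1 = b[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE"; // big-endian BOM present
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE"; // little-endian BOM present
            }
        }
        if (b.length >= 4) {
            // No BOM: look for "<?" with a zero byte beside each character.
            if (b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F) {
                return "UTF-16BE";
            }
            if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00) {
                return "UTF-16LE";
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        String decl = "<?xml version=\"1.0\"?>";
        System.out.println(detect(decl.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(detect(decl.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(detect(decl.getBytes(StandardCharsets.UTF_16)));
        System.out.println(detect(decl.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Since '<' and '?' are both below 0x100, the high byte of each UTF-16 code unit is zero, which is what makes the 00 3C 00 3F and 3C 00 3F 00 patterns distinguishable from UTF-8 input.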