Wrong encoding after XML identity transformation
n-roeser at gmx.net
Sat Mar 29 16:01:12 UTC 2014
I’m using IcedTea 2.4.5 for OpenJDK 7 on Gentoo. This includes JAXP
revision 8fe156ad49e2 (in the IcedTea repos), which in turn seems to
contain jdk7u51-b31, which is revision 626e76f127a4 in the OpenJDK
repo jdk7u, as far as I can see. You’ll probably know better about
all these version numbers than I do.
Anyway, after a painful debugging session I found that the default XML
transformer implementation (via XSLTC) handles encodings improperly when
writing in-memory DOM Documents (which had an encoding other than UTF-8
specified when being parsed) to a stream.
I’m attaching my test code, which I hope is correct and readable. What
it does:
• Read a document with encoding="ISO-8859-1" from an input stream into a
DOM Document. The input document itself does not contain any characters
outside US-ASCII, which is a subset of ISO-8859-1.
• Add a text node with the text “schön” (German for “nice”) to the
document. The “ö” in “schön” is LATIN SMALL LETTER O WITH DIAERESIS
(U+00F6). This can, of course, be stored in the in-memory document tree,
but may need character conversion when the document is serialized later.
• Use a Transformer with output properties set to XML in UTF-8 for
writing the document into a stream using an identity transformation.
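The three steps above can be sketched roughly as follows. This is not the attached test code, only a minimal reconstruction from the description; the class name, the helper names and the `<root/>` input document are made up for illustration. It prints the XML declaration that was written and reports whether the “ö” ended up as UTF-8 bytes, so the two can be compared. No expected output is given because it depends on which transformer implementation is picked up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical class name; a stand-in for the attached test, not a copy of it.
class IdentityTransformSketch {

    // Parses an ISO-8859-1 document, adds a text node, serializes with UTF-8 requested.
    static byte[] roundTrip() throws Exception {
        // The input declares ISO-8859-1 but contains only US-ASCII characters.
        byte[] input = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(input));

        // U+00F6 is LATIN SMALL LETTER O WITH DIAERESIS.
        doc.getDocumentElement().appendChild(doc.createTextNode("sch\u00f6n"));

        // newTransformer() without a stylesheet yields an identity transformer.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // True if the bytes contain 0xC3 0xB6, the UTF-8 encoding of U+00F6.
    static boolean hasUtf8OUmlaut(byte[] bytes) {
        for (int i = 0; i + 1 < bytes.length; i++) {
            if (bytes[i] == (byte) 0xC3 && bytes[i + 1] == (byte) 0xB6) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = roundTrip();
        // Decoding as ISO-8859-1 never fails and leaves the ASCII declaration readable.
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println("declaration: " + text.substring(0, text.indexOf("?>") + 2));
        System.out.println("o-umlaut written as UTF-8 bytes: " + hasUtf8OUmlaut(bytes));
    }
}
```

If the declaration says ISO-8859-1 while the second line prints `true`, the output shows the mismatch described in this report.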
I compared Xalan-J 2.7.1 and the internal implementation (older Xalan?)
in my JRE installed with my version of OpenJDK (see above). External
Xalan produces documents with XML encoding="UTF-8", while the
JRE-internal Xalan keeps encoding="ISO-8859-1", *but writes the “ö”
encoded in UTF-8*! This produces wrong content in the document when
processing it with an XML parser later.
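To illustrate why this mismatch corrupts the content, here is a small sketch that does not involve any XML code at all: it encodes “schön” as UTF-8 and decodes the bytes as ISO-8859-1, which is what a parser trusting the wrong declaration would effectively do. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, independent of any XML library.
class MisdeclaredEncodingDemo {

    // Encodes text with one charset and decodes the bytes with another.
    static String writeThenRead(String text, Charset written, Charset declared) {
        return new String(text.getBytes(written), declared);
    }

    public static void main(String[] args) {
        String garbled = writeThenRead("sch\u00f6n",
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // The single o-umlaut (bytes C3 B6) is read back as two characters.
        System.out.println(garbled.length());                   // 6
        System.out.println(garbled.equals("sch\u00c3\u00b6n")); // true
    }
}
```

The result is the familiar “schÃ¶n”: U+00C3 followed by U+00B6 in place of the original U+00F6.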
The transformer should use UTF-8, as I requested in the code. If I had
not specifically requested anything, it could also have used ISO-8859-1,
provided it transcoded all characters into that encoding.
In order to use the attached test program, put xalan.jar and xsltc.jar
from Xalan-J into your classpath. Even XSLTC from Xalan-J 2.7.1 works,
just not the JRE-internal one.
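When switching the classpath back and forth, it may help to confirm which implementation JAXP actually picked. A minimal check, assuming the standard JAXP lookup is used (the class name below is made up):

```java
import javax.xml.transform.TransformerFactory;

// Hypothetical helper class for checking the active implementation.
class WhichTransformer {

    // Returns the class name of the factory that the JAXP lookup selected.
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(factoryClassName());
    }
}
```

A name starting with `com.sun.org.apache.xalan.internal` would point to the JRE-internal implementation, while a name starting with `org.apache.xalan` would point to the external Xalan-J jars.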
My default locale has UTF-8 encoding, in case that matters.