possible problem with JNI GetStringUTFChars
stuart.marks at oracle.com
Tue Jan 29 23:24:06 UTC 2019
> In case you missed my previous message, there is a use case for file paths using macOS APIs.
Hm, Martin had mentioned that macOS uses something more restrictive than UTF-8.
It seems to me that a filesystem-specific encoding is called for here.
> If you search the JDK repo for GetStringUTFChars, you will find several uses
> that do not appear to involve serialization or data input/output.
To clarify, I was talking about uses of modified UTF-8 from *Java* code. The
only places modified UTF-8 should appear in Java code are (I think) in
serialization and in Data*Stream.
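
For concreteness, here is a small sketch of where the two encodings diverge when both are produced from Java code. The class and method names (ModifiedUtf8Demo, modifiedUtf8, realUtf8) are made up for illustration. It relies only on DataOutputStream.writeUTF, which writes a two-byte length followed by modified UTF-8, and on String.getBytes with StandardCharsets.UTF_8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK.
final class ModifiedUtf8Demo {

    // Modified UTF-8 as written by DataOutputStream.writeUTF,
    // with the leading two-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bos.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8 via the Charset API.
    static byte[] realUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc\u00e9",      // non-null BMP characters: both encodings agree
            "\0",             // U+0000: one byte in UTF-8, two in modified UTF-8
            "\uD83D\uDE00"    // U+1F600: four bytes in UTF-8, six in modified UTF-8
        };
        for (String s : samples) {
            System.out.println("real:     " + hex(realUtf8(s)));
            System.out.println("modified: " + hex(modifiedUtf8(s)));
        }
    }
}
```

The two differences are the NUL character and supplementary characters, which modified UTF-8 encodes as a pair of three-byte surrogates.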
Native code needs to use modified UTF-8 because it's required for various JVM
> It is not obvious whether these uses are correct or not.
> Consider test/jdk/java/nio/channels/FileChannel/directio/libDirectIO.c
> where GetStringUTFChars is used to convert a file path to pass to open().
> At the very least, anyone using GetStringUTFChars as a short cut for true UTF-8
> conversion should document why this short cut is correct, as is done in
> awt_InputMethod, for example.
Correct. If there are places where GetStringUTFChars is used but real UTF-8
is required, then that's quite possibly a bug.
The use in libDirectIO.c is certainly suspicious. Note that this is test code,
and the only strings that are passed to it are temp file names from
Files.createTempFile(). It seems likely that such strings contain only non-null
BMP characters, for which modified UTF-8 and real UTF-8 are the same, so this is
unlikely to be a problem in practice.
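
A caller that wants to verify this condition rather than assume it could use a check along these lines. This is only a sketch, and the class and method names (Utf8Compat, sameInBothEncodings) are hypothetical. It treats any surrogate char as a mismatch, since a surrogate pair denotes a supplementary character and an unpaired surrogate has no valid real UTF-8 encoding at all.

```java
// Hypothetical helper, not part of the JDK.
final class Utf8Compat {

    // Returns true if every char is a non-null, non-surrogate BMP character,
    // in which case modified UTF-8 and real UTF-8 produce identical bytes.
    static boolean sameInBothEncodings(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\0' || Character.isSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameInBothEncodings("/tmp/test1234.tmp"));
        System.out.println(sameInBothEncodings("a\0b"));
        System.out.println(sameInBothEncodings("\uD83D\uDE00"));
    }
}
```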
Still, you're right, if there are places where the JDK uses GetStringUTFChars
where real UTF-8 is required, those would be bugs.
Anyway, I think it's unfortunate, but in the JNI world we're saddled with
modified UTF-8. If you need real UTF-8, I recommend you do the conversion in
Java before you get down to native. The reason is that there are some edge cases
with codeset conversion (e.g., malformed sequences such as unpaired surrogates)
that would require a bunch of additional facilities that aren't readily
available from native code, as far as I know.
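
A minimal sketch of doing the conversion on the Java side, assuming the native method is changed to accept a byte[] instead of a String. The names StrictUtf8, toStrictUtf8 and open0 are hypothetical. The encoder is configured with CodingErrorAction.REPORT so that an unpaired surrogate raises an error instead of being silently replaced, which is what String.getBytes would do.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class StrictUtf8 {

    // Hypothetical native entry point that takes real UTF-8 bytes.
    // It is declared for illustration only and never called here.
    private static native int open0(byte[] utf8Path);

    // Encode to real UTF-8, rejecting malformed input such as
    // unpaired surrogates rather than substituting '?'.
    static byte[] toStrictUtf8(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not encodable as UTF-8", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toStrictUtf8("\uD83D\uDE00").length);
        try {
            toStrictUtf8("\uD83D"); // unpaired high surrogate
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

On the native side the array would be read with GetByteArrayElements or GetByteArrayRegion. The bytes are not NUL-terminated, so the native code has to append a terminator itself before handing them to something like open().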
>> On Jan 28, 2019, at 2:10 PM, Stuart Marks <stuart.marks at oracle.com
>> <mailto:stuart.marks at oracle.com>> wrote:
>> (From Java code, the Charset encoders/decoders handle real UTF-8, which seems
>> to cover most cases. Modified UTF-8 occurs only within serialization and
More information about the core-libs-dev