<i18n dev> Reading Linux filenames in a way that will map back the same on open?

Dan Stromberg dstromberglists at gmail.com
Tue Sep 9 17:39:14 PDT 2008

Sorry if this is the wrong list for this question.  I tried asking it
on comp.lang.java, but didn't get very far there.

I've been wanting to expand my horizons a bit by taking one of my
programs and rewriting it into a number of other languages.  It
started life in python, and I've recoded it into perl.
Next on my list is java.  After that I'll probably do Haskell and

So the python and perl versions were pretty easy, but I'm finding that
the java version has a somewhat solution-resistant problem with
non-ASCII filenames.

The program just reads filenames from stdin (usually generated with
the *ix find command), and then compares those files, dividing them up
into equal groups.

The problem with the java version, which manifests both with OpenJDK
and gcj, is that the filenames being read from disk are 8 bit, and the
filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
but as far as the java language is concerned, those filenames are made
up of 16 bit characters.  That's fine, but going from 8 to 16 bit and
back to 8 bit seems to be non-information-preserving in this case,
which isn't so fine - I can clearly see the program, in an strace,
reading with one sequence of bytes, but then trying to open
another-though-related sequence of bytes.  To be perfectly clear: It's
getting file not found errors.
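
For anyone who wants to see the effect outside my program, here is a
minimal sketch of the kind of lossy round trip I believe is happening.
It is not taken from the real program. The class name, the helper
roundTripUtf8 and the sample filename bytes are made up for
illustration, and it assumes a Java version that has
java.nio.charset.StandardCharsets (Java 7 or later). It decodes a
Latin-1 style byte sequence as UTF-8 and then encodes the resulting
String back to bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Decode raw bytes into a String as UTF-8, then encode that String
    // back to bytes, mimicking an 8 bit -> 16 bit -> 8 bit conversion.
    static byte[] roundTripUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical filename "caf" followed by byte 0xE9
        // (e-acute in ISO-8859-1); this is not valid UTF-8.
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 };
        byte[] back = roundTripUtf8(raw);

        System.out.println("read from stdin: " + Arrays.toString(raw));
        System.out.println("passed to open:  " + Arrays.toString(back));
        System.out.println("identical: " + Arrays.equals(raw, back));
    }
}
```

The invalid byte is replaced during decoding, so the bytes that come
back out differ from the bytes that went in, which would match the
"another-though-related sequence of bytes" I see in the strace.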

By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
program to handle files with one encoding, but not another.  I've
tried a bunch of values in these variables, including ISO-8859-1, C,
POSIX, UTF-8, and so on.

Is there such a thing as a filename encoding that will map 8 bit
filenames to 16 bit characters, but only using the low 8 bits of those
16, and then map back to 8 bit filenames only using those low 8 bits?
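
To make the property I'm after concrete, here is a small sketch that
checks it at the String level for ISO-8859-1, which as far as I
understand maps each byte value to the character with the same number.
The class and method names are made up for illustration, and it again
assumes java.nio.charset.StandardCharsets is available. It only tests
the in-memory conversion; whether the JVM applies the same mapping
when it actually opens a file is an assumption that depends on the
locale settings mentioned above, and this snippet does not verify it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Each byte becomes the char with the same numeric value (0..255).
    static String bytesToString(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Each char in the range 0..255 becomes the byte with that value.
    static byte[] stringToBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True if every char in s fits in the low 8 bits.
    static boolean usesOnlyLow8Bits(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Build a buffer holding every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }

        String s = bytesToString(all);
        byte[] back = stringToBytes(s);

        System.out.println("low 8 bits only: " + usesOnlyLow8Bits(s));
        System.out.println("round trip identical: " + Arrays.equals(all, back));
    }
}
```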

Is there some other way of making a Java program on Linux able to read
filenames from stdin and later open those filenames?


