<i18n dev> Java and Unicode
martijnverburg at gmail.com
Sat Dec 11 09:58:50 PST 2010
Just to add a little background here :). I'm Martijn (I help run the London
JUG FWIW) and I ran across an answer from Tom on StackOverflow about Unicode
issues in Java - I quickly deleted my answer!
It was one of _those_ answers which really impressed all of us Java
developers on that thread, especially those who knew a little about Unicode
(I don't really count myself as one of them!). So I asked Tom if he'd mind
volunteering some of his time here as I knew there was some Unicode 6.0 work
going on and as he has a Perl and Unicode background I thought he would be
able to contribute to the discussions and work here (unlike someone like me
whose eyes glaze over if I have to do anything more complicated than setting
a character encoding).
I met a few of the OpenJDK advocates at Devoxx and that's inspired me so I'm
happy to try and help out Tom on the Java side where I can (or more
importantly try to get enthusiastic volunteers from my JUG to help out ;p).
twitter - @karianna & @java7developer
On Sat, Dec 11, 2010 at 5:38 PM, Tom Christiansen <tchrist at perl.com> wrote:
> Good morning,
> I'm Tom Christiansen; some of you may know me from my work in the Perl
> Community. I'm here at the urging of Martijn Verburg, who thought that my
> recent discoveries should be heard by your group.
> I've been professionally programming for more than 25 years now, mostly in
> C and Perl. I recently joined the biomedical text-mining group at the
> University of Colorado, where the bulk of our code base is in Java.
> I've been responsible for working with large text corpora entirely in
> Unicode. For example, one corpus comprises almost 200,000 papers and 11
> gigabytes, while another is a single file of 6 gigabytes. I'm not new to
> Unicode, having worked with it a great deal over the last decade.
> Although most of our code base is in Java, we also have a considerable
> portion of Perl code and some Python code, too. This code often first
> tokenizes the input stream before moving on to more sophisticated semantic
> processing. I was quite surprised to learn how differently Java treated
> Unicode text than how the same text is treated by Perl and Python, even
> using identical regular expressions. This has proved to be a significant
> barrier to fully adopting Java for our Unicode work.
> This prompted me to make a comprehensive study of Unicode issues in Java,
> focusing on regular expressions but also exploring other areas. I've
> identified about two dozen individual areas that I feel deserve to be
> looked at. These range from mismatches between documentation and behavior,
> to unfortunate or inconvenient defaults (e.g. "documented not to work"), to
> genuine bugs and international standards violations.
> Taken as a whole, these problem areas make Java a very difficult choice for
> the sort of text processing my group needs to use it for. Surely many
> others all around the world are in a similar position.
> I've searched the archives for this mailing list, and have found no mention
> of these troubles either there, or indeed anywhere at all on the web. For
> I have working code that fixes what for us is the most egregious of these
> problems: that regexes were unusable on Unicode. One fundamental bug is
> that Java has misunderstood the connection between \b and \w regexes, so
> that now a string like "élève" is not matched by the pattern "\b\w+\b" at
> any point in the string.
> Other very serious problems include Java's unjustifiable demotion of legal
> Unicode whitespace characters from the set of whitespace characters
> (breaking tokenization), using Unicode property names in ways contrary to
> what the spec says they do, and in general supporting no Unicode properties
> any later than 3.0: even the critical Unicode 3.1 properties are ignored by
> Java. These are very serious problems. Java almost cannot be said to
> support Unicode--at least any Unicode release from the last ten
> years--until these critical deficiencies are fixed.
> You can find a brief synopsis of these specific troubles as well as a link
> to the Java code that fixes them here:
> I don't by any means think this is the best way to go about this. It's
> just a band-aid we needed quickly to allow us to move on with our work.
> I'd like to offer it as a starting point for discussion of the issues that
> prompted its creation.
> As I mentioned, I have a couple dozen different Java Unicode issues, and
> this addresses just one or two of them. When I get time, I'll try to bring
> up the others here in separate threads.
> If you could advise me how best to contribute to helping out here, I would
> be grateful.
> Thank you,
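
For anyone who wants to poke at the \b / \w mismatch and the whitespace
issue described in the quoted message, here is a minimal sketch. It is not
Tom's fix. It assumes a Java release that provides
Pattern.UNICODE_CHARACTER_CLASS (Java 7 or later, as far as I know), and the
class and method names are made up for illustration. What the default
(flag-free) pattern does with "élève" depends on the Java release, so the
sketch prints that result rather than claiming one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeRegexSketch {

    // Hypothetical helper: first match of \b\w+\b in the input, or null if none.
    public static String firstWord(String input, int flags) {
        Matcher m = Pattern.compile("\\b\\w+\\b", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Hypothetical helper: does the whole string match a single \s?
    public static boolean isRegexWhitespace(String s, int flags) {
        return Pattern.compile("\\s", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        // "élève" written with escapes so the source file encoding does not matter.
        String word = "\u00e9l\u00e8ve";

        // Default semantics: the result varies between Java releases.
        System.out.println("default \\b\\w+\\b : " + firstWord(word, 0));

        // Unicode semantics for \w and \b: the whole word is matched.
        System.out.println("unicode \\b\\w+\\b : "
                + firstWord(word, Pattern.UNICODE_CHARACTER_CLASS));

        // U+2003 EM SPACE has the Unicode White_Space property, but the
        // default \s only covers the ASCII whitespace characters.
        String emSpace = "\u2003";
        System.out.println("default \\s on U+2003: "
                + isRegexWhitespace(emSpace, 0));
        System.out.println("unicode \\s on U+2003: "
                + isRegexWhitespace(emSpace, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

The flag is passed per pattern, so existing code that relies on the
ASCII-only defaults keeps working; only patterns compiled with it get the
Unicode-aware \w, \b and \s.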