<i18n dev> Java encoder errors

Mark Davis ☕ mark at macchiato.com
Mon Sep 19 17:46:50 PDT 2011

They are really "super private use" characters, available for definition
within a given implementation or domain.

For example, in CLDR collation tables:

The code point U+FFFF is tailored to have a weight higher than all other
characters. This allows reliable specification of a range, such as “Sch” ≤ X
≤ “Sch\uFFFF”, to include all strings starting with "Sch" or equivalent.
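In plain Java this range trick even works with untailored code-unit order, since U+FFFF is the highest UTF-16 code unit. A minimal sketch using `TreeSet.subSet` (the class and names are invented for illustration):

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixRange {
    // Returns every string in the set that starts with the given prefix,
    // via a range query bounded above by prefix + U+FFFF.
    static SortedSet<String> withPrefix(TreeSet<String> set, String prefix) {
        // U+FFFF sorts above every other character in code-unit order, so
        // the closed range [prefix, prefix + '\uFFFF'] covers exactly the
        // strings beginning with prefix.
        return set.subSet(prefix, true, prefix + '\uFFFF', true);
    }

    public static void main(String[] args) {
        TreeSet<String> names = new TreeSet<>();
        names.add("Sachs");
        names.add("Schiller");
        names.add("Schmidt");
        names.add("Schubert");
        names.add("Smith");
        System.out.println(withPrefix(names, "Sch")); // [Schiller, Schmidt, Schubert]
    }
}
```

The same bounds work as `BETWEEN` endpoints in a database query, which is the point of tailoring U+FFFF to the top in collation as well.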

The code point U+FFFE is tailored to have a weight lower than all other
characters. This allows for merge sorting within code point space.

So you can sort the following and have it work nicely:
sortKey = LastNameField + '\uFFFE' + FirstNameField
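Here is a rough sketch of the ordering that tailoring buys you — not the CLDR algorithm itself, just a comparator that weighs the separator below everything else, with invented class and field names. Note that untailored code-unit order would put U+FFFE *above* all letters and break the merged-key ordering:

```java
import java.util.Comparator;

public class MergedSortKeys {
    // Sketch: a comparator under which U+FFFE compares lower than every
    // other character, mimicking the CLDR tailoring described above.
    static final Comparator<String> SEPARATOR_LOWEST = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = a.charAt(i), y = b.charAt(i);
            if (x == y) continue;
            if (x == '\uFFFE') return -1;  // separator sorts before anything
            if (y == '\uFFFE') return 1;
            return Character.compare(x, y);
        }
        return Integer.compare(a.length(), b.length());
    };

    static String sortKey(String last, String first) {
        return last + '\uFFFE' + first;
    }

    public static void main(String[] args) {
        String a = sortKey("Davis", "Mark");    // "Davis" is a prefix of "Davison"
        String b = sortKey("Davison", "Ann");
        // With the separator weighted lowest, every Davis precedes every Davison:
        System.out.println(SEPARATOR_LOWEST.compare(a, b) < 0);  // true
        // Plain code-unit order gets this wrong, since U+FFFE > 'o':
        System.out.println(a.compareTo(b) > 0);  // true
    }
}
```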

If someone happens to include an FFFE in one of these fields to be collated,
you'll get an odd ordering, but not a disaster. If you really care about
that, you can ensure that you don't allow FFFE in those database fields,
just as, for example, you might prevent U+0001 from being in the field.
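If you do choose to police that at the application layer, the check is trivial; `requireNoSeparator` is a hypothetical helper name:

```java
public class FieldValidation {
    // Hypothetical helper: reject field values that would collide with the
    // reserved U+FFFE separator before they reach the database.
    static String requireNoSeparator(String field) {
        if (field.indexOf('\uFFFE') >= 0) {
            throw new IllegalArgumentException("field contains reserved U+FFFE");
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(requireNoSeparator("Davis"));  // Davis
    }
}
```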

But you really can't block them at a low level; otherwise I couldn't serialize
out the sortKey above into UTF-8, a perfectly legitimate thing to do.

*— Il meglio è l’inimico del bene (“The best is the enemy of the good”) —*

On Mon, Sep 19, 2011 at 15:26, Tom Christiansen <tchrist at perl.com> wrote:

> Mark Davis ☕ <mark at macchiato.com> wrote
>   on Mon, 19 Sep 2011 14:41:49 PDT:
> > I agree with the first part, disallowing the irregular code sequences.
>
> Finding that Java allowed surrogates to sneak through in their UTF-8
> streams like that was quite odd.
>
> > As to the noncharacters, it would be a horrible mistake to disallow them.
> > Tom, a Java code converter is far too low a level for C9; if the
> > converter can't handle them, it screws up all perfectly legitimate
> > *internal* interchange. C9 is really for a very high level, e.g. don't
> > put them into interchanged plain text, like a web page. I agree that
> > it needs more clarification.
> Mark, thanks for taking the time to unravel that.  It wasn't clear from
> the specs where, or perhaps even whether, you should or should not disallow
> the 66 noncharacter code points.  A bit more clarity there would help.
>
> You bring up an interesting point.  If you read a web page and want to use
> some of the noncharacter code points as sentinels per their suggested use
> during your internal processing, you have to be able to know that they
> weren't there to start with.  Yes, you can check, one at a time, till you
> (hopefully!) find enough that aren't there that you can use them.  But if
> that were what you had to do, then you could do that with any set of code
> points, not just noncharacter ones.  So that doesn't seem to make sense.
>
> People using UTF-8 or UTF-32 implementations can always steal non-Unicode
> code points from above 0x10FFFF for their own internal use *provided* they
> never try to pass those along, but that won't work for UTF-16 even
> internally.
> Is there anything that they can dependably use?  It appears there is not.
>
> It's an interesting problem, and I see that it isn't as easily solved as
> I had hoped it might be.  If you can't guarantee that even the 66
> noncharacter code points won't be in your data stream, I'm thinking this
> isn't going to be solvable at this level.  It does make me wonder what
> those 66 noncharacter code points really are for, then, so it's back to
> rereading the specs again for me.
>
> thanks very much,
>
> --tom