<i18n dev> Java encoder errors
Mark Davis ☕
mark at macchiato.com
Mon Sep 19 17:46:50 PDT 2011
They are really "super private use" characters, available for definition
within a given implementation or domain.
For example, in CLDR collation tables:
The code point U+FFFF is tailored to have a weight higher than all other
characters. This allows reliable specification of a range, such as “Sch” ≤ X
≤ “Sch\uFFFF” to include all strings starting with "sch" or equivalent.
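In plain Java string order (untailored UTF-16 code-unit comparison, so case-sensitive, unlike a real primary-strength collation) U+FFFF already compares above every other character, so the range trick can be sketched with the stdlib alone; a production implementation would use a CLDR-tailored collator such as ICU's, which this sketch deliberately avoids. The method name is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeQuery {
    // All strings w with "Sch" <= w <= "Sch\uFFFF" in code-unit order,
    // i.e. every string that starts with "Sch".
    static List<String> inSchRange(List<String> words) {
        String lo = "Sch";
        String hi = "Sch\uFFFF";  // U+FFFF compares above any real character
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (w.compareTo(lo) >= 0 && w.compareTo(hi) <= 0) {
                hits.add(w);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Saal", "Schiff", "Schule", "Sznur");
        System.out.println(inSchRange(words));  // [Schiff, Schule]
    }
}
```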
The code point U+FFFE is tailored to have a weight lower than all other
characters. This allows it to be used as a field separator that sorts
below everything else in the code point space.
So you can sort the following, and have it work nicely.
sortKey = LastNameField + '\uFFFE' + FirstNameField
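Under a CLDR-tailored collator, U+FFFE sorts below everything; plain java.lang.String comparison has no such tailoring (there U+FFFE sits near the top of the code-unit range), so this stdlib-only sketch substitutes U+0001, which already sorts below all letters, to show why a lowest-sorting separator keeps merged keys ordered correctly. Lowercasing stands in for case-insensitive primary strength; names and methods are illustrative:

```java
public class MergedKeys {
    // U+0001 as a stand-in for the tailored-low U+FFFE separator.
    static String sortKey(String lastName, String firstName) {
        return (lastName + '\u0001' + firstName).toLowerCase();
    }

    // Naive concatenation with no separator, for contrast.
    static String naiveKey(String lastName, String firstName) {
        return (lastName + firstName).toLowerCase();
    }

    public static void main(String[] args) {
        // "Li" < "Lind" as last names, so the Li record should sort first.
        boolean good = sortKey("Li", "Zoe").compareTo(sortKey("Lind", "Ann")) < 0;
        boolean bad  = naiveKey("Li", "Zoe").compareTo(naiveKey("Lind", "Ann")) < 0;
        System.out.println(good);  // true:  the separator sorts below 'n'
        System.out.println(bad);   // false: 'z' > 'n' misorders the records
    }
}
```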
If someone happens to include an FFFE in one of these fields to be collated,
you'll get an odd ordering, but not a disaster. If you really care about
that, you can ensure that you don't allow FFFE in those database fields,
just as, for example, you might prevent U+0001 from being in the field.
But you really can't block it at a low level; otherwise I couldn't serialize
the sortKey above into UTF-8, a perfectly legitimate thing to do.
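Serializing such a key is indeed unproblematic in Java: the stdlib UTF-8 encoder treats noncharacters as ordinary scalar values (U+FFFE becomes EF BF BE), unlike unpaired surrogates, which are malformed. A small sketch with illustrative names:

```java
import java.nio.charset.StandardCharsets;

public class SerializeKey {
    // Noncharacters like U+FFFE are valid Unicode scalar values, so the
    // stdlib UTF-8 encoder converts them like any other character.
    static byte[] toUtf8(String sortKey) {
        return sortKey.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "Davis" + '\uFFFE' + "Mark";
        byte[] bytes = toUtf8(key);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(key.equals(back));  // true: lossless round trip
        // Bytes 5..7 are the UTF-8 form of U+FFFE: EF BF BE.
        System.out.printf("%02x %02x %02x%n",
                bytes[5] & 0xFF, bytes[6] & 0xFF, bytes[7] & 0xFF);
    }
}
```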
*— Il meglio è l’inimico del bene (“The best is the enemy of the good”) —*
On Mon, Sep 19, 2011 at 15:26, Tom Christiansen <tchrist at perl.com> wrote:
> Mark Davis ☕ <mark at macchiato.com> wrote
> on Mon, 19 Sep 2011 14:41:49 PDT:
> > I agree with the first part, disallowing the irregular code sequences.
> Finding that Java allowed surrogates to sneak through in their UTF-8
> streams like that was quite odd.
> > As to the noncharacters, it would be a horrible mistake to disallow them.
> > Tom, a Java code converter is far too low a level for C9; if the
> > converter can't handle them, it screws up all perfectly legitimate
> > *internal*interchange. C9 is really for a very high level, eg don't
> > put them into interchanged plain text, like a web page. I agree that
> > it needs more clarification.
> Mark, thanks for taking the time to unravel that. It wasn't clear from
> the specs where or perhaps even whether you should or should not disallow
> the 66 noncharacter code points. A bit more clarity there would help.
> You bring up an interesting point. If you read a web page and want to use
> some of the noncharacter code points as sentinels per their suggested use
> during your internal processing, you have to be able to know that they
> weren't there to start with. Yes, you can check, one at a time, till you
> (hopefully!) find enough that aren't there that you can use them. But if
> that were what you had to do, then you could do that with any set of code
> points not just noncharacter ones. So that doesn't seem to make sense.
> People using UTF-8 or UTF-32 implementations can always steal non-Unicode
> code points from above 0x10FFFF for their own internal use *provided* they
> never try to pass those along, but that won't work for UTF-16, which
> cannot represent such code points at all.
> Is there anything that they can dependably use? It appears there is not.
> It's an interesting problem, and I see that it isn't as easily solved as
> I had hoped it might be. If you can't guarantee that even the 66
> noncharacter code points won't be in your data stream, I'm thinking this
> isn't going to be solvable at this level. It does make me wonder what
> those 66 noncharacter code points really are for, then, so it's back to
> rereading the specs again for me.
> thanks very much,
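The availability check discussed in the quoted message can be sketched by enumerating all 66 noncharacters (U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes) and subtracting those that occur in the data. Method names here are illustrative, not from any existing API:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SentinelScan {
    // The 66 noncharacters: U+FDD0..U+FDEF (32), plus U+nFFFE and U+nFFFF
    // in each of the 17 planes (34).
    static Set<Integer> noncharacters() {
        Set<Integer> set = new LinkedHashSet<>();
        for (int cp = 0xFDD0; cp <= 0xFDEF; cp++) {
            set.add(cp);
        }
        for (int plane = 0; plane <= 0x10; plane++) {
            set.add(plane * 0x10000 + 0xFFFE);
            set.add(plane * 0x10000 + 0xFFFF);
        }
        return set;
    }

    // Noncharacters NOT present in text, i.e. candidates for sentinel use.
    static Set<Integer> unusedSentinels(String text) {
        Set<Integer> free = noncharacters();
        text.codePoints().forEach(free::remove);
        return free;
    }

    public static void main(String[] args) {
        String text = "plain text with one sentinel already used: \uFFFE";
        System.out.println(noncharacters().size());        // 66
        System.out.println(unusedSentinels(text).size());  // 65
    }
}
```

Of course, as the quoted message points out, this only tells you which sentinels are free in data you have already seen, which is exactly the guarantee problem under discussion.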