<i18n dev> RL1.7 Code Points
tchrist at perl.com
Sun Jan 23 12:22:23 PST 2011
I am somewhat uncertain, but I believe that Java
*almost* meets this requirement.
1.7 Code Points
A fundamental requirement is that Unicode text be interpreted
semantically by code point, not code units.
RL1.7 Supplementary Code Points
To meet this requirement, an implementation shall handle the full
range of Unicode code points, including values from U+FFFF to
U+10FFFF. In particular, where UTF-16 is used, a sequence
consisting of a leading surrogate followed by a trailing surrogate
shall be handled as a single code point in matching.
Java tries to make things work this way, and always does so on well-formed
input. I say "almost" because the regex engine will sometimes match
individual code units in ill-formed UTF-16 sequences (that is, on unpaired
surrogates). I believe this behaviour is contrary to the fundamental
requirement for Level 1 compliance that Unicode text never be interpreted
as code units. Fortunately, this does not seem too difficult to fix.
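
For anyone who wants to poke at this, here is a minimal sketch of the kind
of check I have in mind. The class name CodePointDemo and the helper
countMatches are made up for illustration, and U+1F600 and the lone
surrogate 0xD83D are just sample values I picked. It only shows how to
observe the behaviour: the well-formed case should yield one match for
".", while the ill-formed case is printed without a claimed result, since
that is exactly the behaviour in question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodePointDemo {
    // U+1F600 encoded as a well-formed surrogate pair (two UTF-16 code units).
    static final String WELL_FORMED = new String(Character.toChars(0x1F600));

    // A lone leading surrogate: ill-formed UTF-16 (sample value, an assumption).
    static final String LONE_LEAD = String.valueOf((char) 0xD83D);

    // Hypothetical helper: counts how many times regex matches within input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Two code units, but a single code point.
        System.out.println("code units:  " + WELL_FORMED.length());
        System.out.println("code points: "
            + WELL_FORMED.codePointCount(0, WELL_FORMED.length()));

        // On well-formed input, "." consumes the whole surrogate pair.
        System.out.println("'.' on pair: " + countMatches(".", WELL_FORMED));

        // On ill-formed input, inspect what the engine reports.
        System.out.println("'.' on lone surrogate: " + countMatches(".", LONE_LEAD));
    }
}
```

If the last line reports a match, the engine has matched a bare code unit,
which is the case I am worried about above.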