<i18n dev> RL1.2 Properties (part 1 of 2)

Xueming Shen xueming.shen at oracle.com
Sun Jan 23 00:07:53 PST 2011


The Unicode/Java versions of the lowercase, uppercase, whitespace and letter
character classes are provided via \p{javaXYZ}, while \p{Lower}, \p{Upper},
\p{Alpha} and \p{Space} are the POSIX versions, as is clearly documented in
the API documentation. I would not use "worst" for this. I don't think
"conformance" requires an implementation to use exactly the names specified
in the standard.
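
To make the distinction concrete, here is a minimal sketch, assuming the
default matching mode (no flags passed to Pattern.compile). The class name
PosixVsJavaClasses and the helper matches() are made up for illustration;
only the \p{...} names come from the Pattern documentation. It compares the
POSIX classes with their \p{javaXYZ} counterparts on one non-ASCII lowercase
letter and one non-ASCII whitespace character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
class PosixVsJavaClasses {
    // POSIX classes: US-ASCII only in the default matching mode.
    static final Pattern POSIX_LOWER = Pattern.compile("\\p{Lower}");
    static final Pattern POSIX_SPACE = Pattern.compile("\\p{Space}");
    // java* classes: follow Character.isLowerCase / Character.isWhitespace.
    static final Pattern JAVA_LOWER = Pattern.compile("\\p{javaLowerCase}");
    static final Pattern JAVA_SPACE = Pattern.compile("\\p{javaWhitespace}");

    // True if the whole string matches the given single-character class.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";   // LATIN SMALL LETTER E WITH ACUTE
        String emSpace = "\u2003";  // EM SPACE

        System.out.println(matches(POSIX_LOWER, "a"));      // true
        System.out.println(matches(POSIX_LOWER, eAcute));   // false
        System.out.println(matches(JAVA_LOWER, eAcute));    // true
        System.out.println(matches(POSIX_SPACE, emSpace));  // false
        System.out.println(matches(JAVA_SPACE, emSpace));   // true
    }
}
```

The two families agree on ASCII input and differ only once the input leaves
ASCII, which is the behaviour the API documentation describes.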

The following classes/properties are actually supported/implemented. Only
\p{javaLowerCase}, \p{javaUpperCase}, \p{javaWhitespace} and \p{javaMirrored}
are explicitly documented in the Pattern API; the rest are covered by the
note "Categories that behave like the java.lang.Character boolean
ismethodname methods are available through the same \p{prop} syntax..."
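
As a rough sketch of what that note implies, the snippet below checks that
a \p{javaXYZ} class gives the same answer as the matching Character.isXYZ
method. The class JavaPropAgreement, its helpers and the sample code points
are my own illustration. The three property names used (javaLetter,
javaDigit, javaTitleCase) are simply ones derived from Character.isLetter,
Character.isDigit and Character.isTitleCase under the naming rule in the
note, not a reproduction of the full supported list.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
class JavaPropAgreement {
    // Ask the regex engine whether a single code point has the named property.
    static boolean regexSays(String prop, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{" + prop + "}", s);
    }

    // True if the regex classes agree with the Character methods for this code point.
    static boolean agrees(int cp) {
        return regexSays("javaLetter", cp) == Character.isLetter(cp)
            && regexSays("javaDigit", cp) == Character.isDigit(cp)
            && regexSays("javaTitleCase", cp) == Character.isTitleCase(cp);
    }

    public static void main(String[] args) {
        // 'A', '7', GREEK CAPITAL OMEGA, ARABIC-INDIC DIGIT THREE,
        // LATIN CAPITAL D WITH SMALL Z WITH CARON (titlecase), EM SPACE
        int[] samples = { 'A', '7', 0x03A9, 0x0663, 0x01C5, 0x2003 };
        for (int cp : samples) {
            System.out.printf("U+%04X agrees=%b%n", cp, agrees(cp));
        }
    }
}
```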


It appears that "noncharacter_cp" and "default_ignorable_cp" are missing
from the list. I will take a look later, but I guess these two are really
not that "significant".


On 1.22.2011 10:22, Tom Christiansen wrote:
> Java does not meet the requirement of RL1.2.  It provides only 3 of the 11
> required properties; 4 it omits altogether, while 4 others it implements in
> a fashion contrary to the standard.  Java also neglects the strongly
> recommended aspects of this section, which is quite a pity.
>  From tr18:
>      RL1.2       Properties
>      To meet this requirement, an implementation shall provide at
>      least a minimal list of properties, consisting of the following:
>          General_Category
>          Script
>          Alphabetic
>          Uppercase
>          Lowercase
>          White_Space
>          Noncharacter_Code_Point
>          Default_Ignorable_Code_Point
>          ANY
>          ASCII
>          ASSIGNED
> Of those listed above as *shall provide*, Java indeed provides
> these three required properties from that minimum set:
>      + The ASCII property.
>      + The General_Categories like \p{Lu}, although only in their
>        short forms; it does not provide the long forms.
>      + The Script categories like \p{Greek}, a very *VERY*
>        welcome addition for Unicode 6.0.
> Java does not provide these four required properties:
>      - Noncharacter_Code_Point
>      - Default_Ignorable_Code_Point
>      - ANY
>      - ASSIGNED
> The worst part is that Java gives non-Unicode meanings to
> these four Unicode properties (I'll give details on these
> lapses in a separate message):
>      * Alphabetic
>      * Uppercase
>      * Lowercase
>      * White_Space
> I would like to see all of what is given above addressed,
> and I do not understand how you can claim Level 1 conformance
> without doing so.
> There are also "strongly recommended" things that you do not
> implement, like loose matching of property names.  That would
> not cost you much, I feel.
> tr18's section 1.2 also lists several "recommended" properties,
> not all of which are binary.
> Properties that are not absolutely required for compliance of
> RL1.2, but which I find especially useful, include these binary
> properties:
>      \p{Dash}
>      \p{Quotation_Mark}
>      \p{Diacritic}
>      \p{Math}
> If you are going to do \X for extended grapheme clusters instead
> of legacy grapheme clusters, then you will need access to Hangul
> Syllable Types, which is not a binary property.
> The best place to read up on the full set of UCD properties is at
>      http://www.unicode.org/reports/tr44/tr44-4.html#Properties
> There are several tables of properties there; at the top of the
> file, though, it says:
>      1 Introduction
>      The Unicode Standard is far more than a simple encoding of characters.
>      The standard also associates a rich set of semantics with each encoded
>      character--properties that are required for interoperability and
>      correct behavior in implementations, as well as for Unicode
>      conformance. These semantics are cataloged in the Unicode Character
>      Database (UCD), a collection of data files which contain the Unicode
>      character code points and character names. The data files define the
>      Unicode character properties and mappings between Unicode characters
>      (such as case mappings).
> That shows how important properties are. The conformance document
> also includes this statement:
>      http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
>      Interpretation of characters is a more complex issue for the Unicode
>      Standard. It includes the core issue of interpreting code points
>      used as characters according to the names and representative glyphs
>      shown in the code charts, of course. However, the Unicode Standard
>      also specifies character properties, behavior, and interactions
>      between characters. Such information about characters is considered
>      an integral part of the "character semantics established by this
>      standard."
>      Information about the properties, behavior, and interactions between
>      Unicode characters is provided in the Unicode Character Database and
>      in the Unicode Standard Annexes.
> That again stresses the importance of properties and interactions between
> characters.  Java giving properties the same names that Unicode does but
> giving them behaviours that are something else entirely is particularly
> vexing.  I cannot see how that is conformant, either.  You have to do
> what they say you have to do with the property names they give you.  If
> you want your own behaviours, you can choose different property names.
> But theirs are reserved to behave as they define them to behave.
> I will therefore address the errors I believe Java makes in the
> Alphabetic, Uppercase, Lowercase, and White_Space properties in
> my next message, part 2 of RL1.2 Properties.
> --tom
