<i18n dev> Proposed update to UTS#18
tchrist at perl.com
Thu Apr 14 23:50:10 PDT 2011
I've been trying to think about what to say to it.
I'd like to know more about what is planned in the "canonical matching" area.
I do understand why reordering makes exact matching impossible. However,
I should think one of several sorts of loose matching might still be done.
Maybe that requires level 3, though.
Mostly, though, I've been thinking about case insensitivity. I feel that
the current Unicode case mapping strategy is much weaker than what the
spirit of the thing really calls for. It's weak because it doesn't do as
much as it could.
I have played around with one approach that gives user-desirable results,
and also addresses the canonical issue. The synopsis is that I think RL3.4
would cut the Gordian Knot of combining marks (at level 1 they're ignored)
and do something genuinely useful by delivering, at a level 1 comparison,
far more of the sort of case insensitivity people actually want than
anything currently available.
That's what RL3.4 Tailored Loose Match is about:
    To meet this requirement, an implementation shall provide for loose
    matches based on a locale's collation order, with at least 3 levels.
And tr10's section 8 on Searching and Matching and 8.1 Collation Folding
also talk about these things:
    Matching can be done by using the collation elements, directly, as
    discussed above. However, because matching does not use any of the
    ordering information, the same result can be achieved by a folding.
    That is, two strings would fold to the same string if and only if they
    would match according to the (tailored) collation. For example, a
    folding for a Danish collation would map both "Gård" and "gaard" to
    the same value. A folding for a primary-strength folding would map
    "Resume" and "résumé" to the same value. That folded value is
    typically a lowercase string, such as "resume".
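To make the folding idea concrete, here is a minimal sketch in Python using
only the standard library. It is an approximation, not a UCA implementation:
it assumes that stripping combining marks after NFD decomposition and then
case-folding is close enough to a primary-strength fold for simple Latin
text. The function name primary_fold is made up for illustration.

```python
import unicodedata

def primary_fold(text):
    """Rough stand-in for a primary-strength fold (assumption: not real UCA).

    Decomposes to NFD, drops characters with a nonzero canonical combining
    class (the accents), then case-folds what is left.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()

# Both spellings fold to the same lowercase key.
print(primary_fold("Resume"))                            # resume
print(primary_fold("Resume") == primary_fold("résumé"))  # True
```

A locale-tailored fold such as the Danish one would need real collation
data, which this sketch does not attempt.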
I actually had to do this because I have a dataset that has things like
"undeaðlich" and "smørrebrød", and I wanted to allow the user to
head-match with "undead" and "smor", respectively. There is no
decomposition of "ð" that includes "d", nor any of "ø" that includes "o".
But the UCA primary strengths are the same. It worked very well.
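The head-matching described above can be sketched roughly as follows, again
in Python with only the standard library. Since decomposition alone cannot
reach "ð" or "ø", this sketch substitutes a tiny hand-written table for the
primary weights that a real collator would supply. The table, the word list,
and the names primary_fold and head_match are all hypothetical.

```python
import unicodedata

# Hypothetical stand-in for real UCA primary weights: only the two letters
# from the example, which have no canonical decomposition.
NO_DECOMPOSITION = {"ð": "d", "ø": "o"}

def primary_fold(text):
    """Approximate a primary-strength key (assumption: not real UCA)."""
    folded = []
    for ch in unicodedata.normalize("NFD", text.casefold()):
        if unicodedata.combining(ch):
            continue  # drop accents, which level 1 ignores
        folded.append(NO_DECOMPOSITION.get(ch, ch))
    return "".join(folded)

def head_match(query, candidates):
    """Return the candidates whose folded form starts with the folded query."""
    key = primary_fold(query)
    return [word for word in candidates if primary_fold(word).startswith(key)]

words = ["undeaðlich", "smørrebrød", "résumé"]
print(head_match("undead", words))  # ['undeaðlich']
print(head_match("smor", words))    # ['smørrebrød']
```

Folding both sides to a key and comparing prefixes keeps the matching itself
trivial; all of the linguistic knowledge lives in the fold.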
It's a very useful feature, and I'm glad that tr18 includes mention of it.
I just wish we could get it into our regex engines so I didn't have to
do it all by hand. :)