<i18n dev> Proposed update to UTS#18

Andy Heninger aheninger at google.com
Fri Apr 15 10:27:31 PDT 2011

On Fri, Apr 15, 2011 at 8:01 AM, Mark Davis ☕ <mark at macchiato.com> wrote:

> The biggest issue is that any transformation that changes the number of
> characters, or rearranges them, is problematic, for the reasons outlined in
> the PRI.
> An example might be /(a|b|c*(?=...)|...)(d|...|a)/, which for Danish (under
> a collation transform, strength 2) should match any of {aa, aA,...å, Å,
> Å,...}, as should  /(a|b|c*(?=...)|...)(d|...|\x{308})/
> What *is* relatively straightforward is to construct a regex
> targeted at a known transformation (like NFC), and then transform the input
> text. There will be some difficulties in mapping between indexes for
> grouping, however. Most regex engines can't handle discontiguous groups
> in their API.
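
The "normalize the pattern, normalize the input, then map indexes back"
approach described above can be sketched roughly as follows. This is only an
illustration, not anyone's actual implementation: the helper names
nfc_with_spans and search_nfc are hypothetical, and the chunking assumes that
a starter followed by its nonzero-combining-class marks is a safe unit to
normalize on its own, which ignores Hangul jamo and the few other starters
that compose with a preceding character. Normalizing the pattern as a plain
string also leaves escapes such as \x{308} untouched.

```python
import re
import unicodedata

def nfc_with_spans(text):
    """Return (nfc_text, spans); spans[k] is the (start, end) range in the
    original text that produced character k of nfc_text."""
    out, spans = [], []
    i = 0
    while i < len(text):
        # Assumption: one chunk = one character plus the combining marks
        # (canonical combining class != 0) that directly follow it.
        j = i + 1
        while j < len(text) and unicodedata.combining(text[j]) != 0:
            j += 1
        chunk = unicodedata.normalize("NFC", text[i:j])
        out.append(chunk)
        spans.extend([(i, j)] * len(chunk))
        i = j
    return "".join(out), spans

def search_nfc(pattern, text):
    """Search NFC(text) with NFC(pattern); report offsets in the original text."""
    norm_text, spans = nfc_with_spans(text)
    m = re.search(unicodedata.normalize("NFC", pattern), norm_text)
    if m is None:
        return None
    if m.start() == m.end():
        # Empty match: map the single position back.
        pos = spans[m.start()][0] if m.start() < len(spans) else len(text)
        return pos, pos
    return spans[m.start()][0], spans[m.end() - 1][1]

# Decomposed input, precomposed pattern: the reported range covers both
# code points of the decomposed "e" + U+0301.
print(search_nfc("caf\u00e9", "cafe\u0301 au lait"))
```

Only the overall match is mapped back here; mapping each capture group the
same way is where the grouping difficulties mentioned above would show up.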

I suspect a match where the fundamental atomic unit of matching was grapheme
clusters, or combining sequences, would produce useful results.
No discontiguous results.  Results independent of normalization form, or
lack of normalization, of the input.  No ability of the match to look inside
of, or partially match, combining sequences.
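
A minimal sketch of that idea, assuming "combining sequence" is approximated
as a character plus any following characters with a nonzero canonical
combining class (full grapheme cluster segmentation would need more than the
Python standard library provides). The function names split_units and
find_by_units are made up for illustration, and this does literal matching
only, not regular expressions.

```python
import unicodedata

def split_units(text):
    """Split text into combining sequences: a character plus the combining
    marks (canonical combining class != 0) that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch
        else:
            units.append(ch)
    return units

def find_by_units(needle, haystack):
    """Find needle in haystack, comparing whole combining sequences.
    Returns (start, end) offsets into the original haystack, or None."""
    def key(unit):
        # Normalizing each unit makes the comparison independent of the
        # normalization form of either input.
        return unicodedata.normalize("NFC", unit)

    n_units = [key(u) for u in split_units(needle)]
    h_raw = split_units(haystack)
    h_units = [key(u) for u in h_raw]
    for i in range(len(h_units) - len(n_units) + 1):
        if h_units[i:i + len(n_units)] == n_units:
            start = sum(len(u) for u in h_raw[:i])
            end = start + sum(len(u) for u in h_raw[i:i + len(n_units)])
            return start, end
    return None

print(find_by_units("\u00e9", "cafe\u0301"))  # precomposed needle, decomposed text
print(find_by_units("e", "cafe\u0301"))       # bare "e" cannot match inside a unit
```

Because the comparison is always unit against unit, a match is always one
contiguous range, and a bare "e" never matches the "e" inside "e" + U+0301.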

I also think that we should avoid making recommendations that haven't been
implemented and proved to be useful and practical.

 - Andy

> Mark
> *— Il meglio è l’inimico del bene —*
> On Thu, Apr 14, 2011 at 23:50, Tom Christiansen <tchrist at perl.com> wrote:
>> Thanks, Mark.
>> I've been trying to think about what to say to it.
>> I'd like to know more about what is planned in the "canonical matching" area.
>> I do understand why reordering makes exact matching impossible.  However,
>> I should think one of several sorts of loose matching might still be done.
>> Maybe that requires level 3, though.
>> Mostly though I've been thinking about case insensitivity.  I feel that
>> the current Unicode case mapping strategy is much weaker than what the
>> spirit of the thing really calls for.  It's weak because it doesn't do as
>> much as it could.
>> I have played around with one approach that gives user-desirable results,
>> and also addresses the canonical issue.  The synopsis is that I think
>> RL3.4 would cut the Gordian Knot of combining marks (at level 1 they're
>> ignored) and do something genuinely useful by creating much more of the
>> sort of case insensitivity at a level 1 comparison than anything
>> currently available.
>> That's what RL3.4 Tailored Loose Match is about:
>>    To meet this requirement, an implementation shall provide for loose
>>    matches based on a locale's collation order, with at least 3 levels.
>> And tr10's section 8 on Searching and Matching and 8.1 Collation Folding
>> also talk about these things.
>>    Matching can be done by using the collation elements, directly, as
>>    discussed above. However, because matching does not use any of the
>>    ordering information, the same result can be achieved by a folding.
>>    That is, two strings would fold to the same string if and only if they
>>    would match according to the (tailored) collation. For example, a
>>    folding for a Danish collation would map both "Gård" and "gaard" to
>>    the same value. A folding for a primary-strength folding would map
>>    "Resume" and "résumé" to the same value. That folded value is
>>    typically a lowercase string, such as "resume".
>> I actually had to do this because I have a dataset that has things like
>> "undeaðlich" and "smørrebrød", and I wanted to allow the user to
>> head-match with "undead" and "smor", respectively.  There is no
>> decomposition of "ð" that includes "d", nor any of "ø" that includes "o".
>> But the UCA primary strengths are the same.  It worked very well.
>> It's a very useful feature, and I'm glad that tr18 includes mention of it.
>> I just wish we could get it into our regex engines so I didn't have to
>> do it all by hand. :)
>> -tom
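
The head-matching Tom describes can be approximated by folding both strings
to a primary-strength key and comparing prefixes. The sketch below is not a
real UCA implementation: the Python standard library has no collation tables,
so PRIMARY_EQUIVALENTS is a tiny hand-written, hypothetical stand-in that
covers only the two letters from his examples, following his statement that
their primary strengths match "d" and "o". The names primary_fold and
head_match are likewise invented here.

```python
import unicodedata

# Hypothetical stand-in for UCA primary weights; a real implementation
# would take these from the (tailored) collation data instead.
PRIMARY_EQUIVALENTS = {"\u00f0": "d",   # ð
                       "\u00f8": "o"}   # ø

def primary_fold(text):
    """Fold text to a rough primary-strength key: case-folded, combining
    marks dropped, and the listed letters mapped to their base letters."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch) != 0:
            continue  # accents are ignored at level 1
        out.append(PRIMARY_EQUIVALENTS.get(ch, ch))
    return "".join(out)

def head_match(prefix, candidate):
    """True if candidate starts with prefix once both are folded."""
    return primary_fold(candidate).startswith(primary_fold(prefix))

print(head_match("undead", "undea\u00f0lich"))
print(head_match("smor", "sm\u00f8rrebr\u00f8d"))
print(primary_fold("r\u00e9sum\u00e9") == primary_fold("Resume"))
```

Locale tailorings, such as the Danish rule that lets "Gård" and "gaard" fold
to the same value, are not covered by this table and would need real
collation data.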