<i18n dev> Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
xueming.shen at oracle.com
Mon Apr 25 05:38:48 UTC 2011
Let's go with UNICODE_PROPERTY, if there is no objection.
On 4/24/2011 9:00 PM, Mark Davis ☕ wrote:
> There are pluses and minuses to any of them: UNICODE_SPEC,
> UNICODE_PROPERTY, UNICODE_CLASS, UNICODE_PROPERTIES,
> or UNICODE_CLASSES, although any would work in a pinch.
> I'd favor a bit the singular over the plural, given the usage.
> The term 'class' is not used much in Unicode, just for two properties
> (see below). So someone could possibly think it just meant those two
> properties, and it could cause a bit of confusion with 'class' meaning
> OO. So for that reason I don't think CLASS(ES) would be optimal.
> bc ; Bidi_Class
> ccc ; Canonical_Combining_Class
> /— Il meglio è l’inimico del bene —/
> On Sun, Apr 24, 2011 at 11:22, Xueming Shen <xueming.shen at oracle.com> wrote:
> Two more names, UNICODE_PROPERTIES and UNICODE_CLASSES, are suggested.
> Any opinions?
> On 4/23/2011 6:50 PM, Xueming Shen wrote:
>> Forwarding...forgot to include the list.
>> -------- Original Message --------
>> Subject: Re: Codereview Request: 7039066 j.u.rgex does not match
>> TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
>> Date: Sat, 23 Apr 2011 17:53:42 -0700
>> From: Xueming Shen <xueming.shen at oracle.com>
>> To: Tom Christiansen <tchrist at perl.com>
>> Mark, Tom,
>> I agree with Mark that UNICODE_SPEC is a better name than
>> UNICODE_CHARSET. We will have to deal with
>> the "compatibility" issue Tom mentioned anyway, should Java move to a
>> higher level of Unicode regex support
>> someday. A new option/flag will have to be introduced to give the developer
>> the choice, just like what we
>> are trying to do with the ASCII-only or Unicode version of those classes.
>> I also agree we should have an embedded flag. I was thinking we could add it
>> later, for example in JDK 8, if we
>> can get this one into JDK 7, but the Pattern usage in String class is
>> The webrev, specdiff and Pattern doc have been updated to use
>> UNICODE_SPEC as the flag and (?U) as the
>> embedded flag. It might be a little confusing, given that we use (?u)
>> for UNICODE_CASE, but I feel it might
>> be "natural" to have uppercase "U" for broader Unicode support.
>> The webrev is at
>> http://cr.openjdk.java.net/~sherman/7039066/webrev/
>> j.u.regex.Pattern API:
>> http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>> http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>> Tom, it would be appreciated if you could at least give the doc update a
>> quick scan to see if I missed anything.
>> And thanks for the suggestions for the Perl-related doc update. I will
>> need to go through it a little later and address
>> it in a separate CR.
>> On 4/23/2011 10:48 AM, Tom Christiansen wrote:
>> > Mark Davis ☕ <mark at macchiato.com> wrote
>> > on Sat, 23 Apr 2011 09:09:55 PDT:
>> >> The changes sound good.
>> > They sure do, don't they? I'm quite happy about this. I think it is more
>> > important to get this in the queue than that it (necessarily) be done for
>> > JDK7. That said, having a good tr18 RL1 story for JDK7's Unicode 6.0 debut
>> > makes it attractive now. But if not now, then soon is good enough.
>> >> The flag UNICODE_CHARSET will be misleading, since
>> >> all of Java uses the Unicode Charset (= encoding). How about:
>> >> UNICODE_SPEC
>> >> or something that gives that flavor.
>> > I hadn't thought of that, but I do see what you mean. The idea is
>> > that the semantics of \w etc change to match the Unicode spec in tr18.
>> > I worry that UNICODE_SPEC, or even UNICODE_SEMANTICS, might be too
>> > broad a brush. What then happens when, as I imagine it someday shall,
>> > Java gets full support for RL2.3 boundaries, the way one uses (?w)
>> > or UREGEX_UWORD for in ICU?
>> > Wouldn't calling something UNICODE_SPEC be too broad? Or should
>> > UNICODE_SPEC automatically include not just existing Unicode flags
>> > like UNICODE_CASE, but also any UREGEX_UWORD that comes along?
>> > If it does, you have back-compat issue, and if it doesn't, you
>> > have a misnaming issue. Seems like a bit of a Catch22.
>> > The reason I'd suggested UNICODE_CHARSET was because of my own background
>> > with the names we use for this within the Perl regex source code (which is
>> > itself written in C). I believe that Java doesn't have the same situation
>> > as gave rise to it in Perl, and perhaps something else would be clearer.
>> > Here's some background for why we felt we had to go that way. To control
>> > the behavior of \w and such, when a regex is compiled, a compiled Perl
>> > regex gets exactly one of these states:
>> > REGEX_UNICODE_CHARSET
>> > REGEX_LOCALE_CHARSET
>> > REGEX_ASCII_RESTRICTED_CHARSET
>> > REGEX_DEPENDS_CHARSET
>> > That state it normally inherits from the surrounding lexical scope,
>> > although this can be overridden with /u and /a, or (?u) and (?a),
>> > either within the pattern or as a separate pattern-compilation flag.
>> > REGEX_UNICODE_CHARSET corresponds to our (?u), so \w and such all get the
>> > full RL1.2a definitions. Because Perl always does Unicode casemapping --
>> > and full casemapping, too, not just simple -- we didn't need (?u) for what
>> > Java uses it for, which is just as an extra flavor of (?i); it doesn't
>> > do all that much.
>> > (BTW, the old default is *not* some sort of non-Unicode charset
>> > semantics, it's the ugly REGEX_DEPENDS_CHARSET, which is Unicode for
>> > code points > 255 and "maybe" so in the 128-255 range.)
>> > What we did certainly isn't perfect, but it allows for both backwards
>> > compat and future growth. This was because people want(ed) to be able to
>> > use regexes on both byte arrays yet also on character strings. Me, I think
>> > it's nuts to support this at all, that if you want an input stream in (say)
>> > CP1251 or ISO 8859-2, that you simply set that stream's encoding and be
>> > done with it: everything turns into characters internally. But there's old
>> > byte and locale code out there whose semantics we are loth to change out
>> > from under people. Java has the same kind of issue.
>> > The reason we ever support anything else is because we got (IMHO nasty)
>> > POSIX locales before we got Unicode support, which didn't happen till
>> > toward the end of the last millennium. So we're stuck supporting code
>> > well more than a decade old, perhaps indefinitely. It's messy, but it
>> > is very hard to do anything about that. I think Java shares in that
>> > perspective.
>> > This corresponds, I think, to Java needing to support pre-Unicode
>> > regex semantics on \w and related escapes. If they had started out
>> > with it always meaning the real thing, the way ICU did, they wouldn't
>> > need both.
>> > I wish there were a pragma to control this on a per-lexical-scope basis,
>> > but I don't know enough about the Java compiler's internals to begin to know
>> > how to go about implementing something like that, even as a
>> > -XX:+UseUnicodeSemantics CLI switch for that compilation unit.
>> > One reason you want this is because the Java String class has these
>> > "convenience" methods like matches, replaceAll, etc, that take regexes
>> > but do not provide an API that admits Pattern compile flags. Without
>> > a way to embed a (?U) directive or some such, those methods have no way
>> > to request Pattern.UNICODE_something semantics at all. The Java String
>> > API could also be broadened through method signature overloading, but
>> > for now, you can't do that.
>> > No matter what the UNICODE_something gets called, I think there needs to be
>> > a corresponding embeddable (?X)-style flag as well. Even if String were
>> > broadened, you'd want people to be able to specify *within the regex* that
>> > that regex should have full Unicode semantics. After all, they might read
>> > the pattern in from a file. That's why (most) Pattern.compile flags need
>> > to be embeddable, too. But you knew that already. :)
>> > --tom
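
A minimal sketch of the behaviour discussed in this thread, for readers who want to try it. It is written against the API as it eventually shipped in JDK 7, where the compile flag is named Pattern.UNICODE_CHARACTER_CLASS (none of the names proposed above) and the embedded form is (?U). The class and method names (UnicodeFlagDemo, asciiWord, unicodeWord, embeddedWord) are hypothetical and exist only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev under review.
class UnicodeFlagDemo {

    // Default semantics: \w is the ASCII-only class [a-zA-Z_0-9].
    static boolean asciiWord(String s) {
        return Pattern.compile("\\w+").matcher(s).matches();
    }

    // Flag passed to Pattern.compile: \w follows the Unicode definition.
    static boolean unicodeWord(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    // String.matches takes no flags argument, so the embedded (?U)
    // directive is the only way to request the same semantics there.
    static boolean embeddedWord(String s) {
        return s.matches("(?U)\\w+");
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "café", ending in U+00E9
        System.out.println(asciiWord(word));    // false
        System.out.println(unicodeWord(word));  // true
        System.out.println(embeddedWord(word)); // true
    }
}
```

The third method shows the case raised above: the convenience methods on String accept only the pattern text, so a flag that cannot be embedded would be unreachable from them.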