[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Jason Mehrens jason_mehrens at hotmail.com
Fri Sep 29 14:42:01 UTC 2017

Makes sense on the sub-sequences having to be string.  Looks good.


From: Ivan Gerasimov <ivan.gerasimov at oracle.com>
Sent: Thursday, September 28, 2017 3:19 PM
To: Jason Mehrens; core-libs-dev
Subject: Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Thank you Jason for valuable input!

On 9/28/17 7:22 AM, Jason Mehrens wrote:
> Ivan,
> Am I correct that subtraction of code points is safe from underflow due to the range of Character.MIN_CODE_POINT and Character.MAX_CODE_POINT?  That would explain why you are not using Integer.compare.
Right.  Since codepoints are all positive, it should be safe to subtract

> One alternative to calling CharSequence.subSequence is to use CharBuffer.wrap(CharSequence csq, int start, int end).  The sub-sequences are short lived so that may be faster in the String case.
Yes, it's a good point, given how expensive String.subSequence() is.
The disadvantage of using CharBuffer.wrap() is that it wouldn't be
possible to use the standard String comparators as the custom
With the current implementation, it is possible (with additional casts,
of course) because we know that String.subSequence returns a String object.

To be honest, I see no easy way to optimize the String case with the use
of a custom comparator because we need to pass to it portions of the
input as String objects.
One possibility might be to recognize common cases, like using the
standard comparators (for example, String.CASE_INSENSITIVE_ORDER) in
place of the custom alphaComparator, and then execute special variant of
the routine, which doesn't create explicit subsequences, but works on
the entire sequence instead.

At this point, however, it's probably too early to do such optimizations.

> Admittedly this is more of a Claes Redestad faster boot up type change but, the use of the AlphaDecimalComparator constructor in Comparator might force the bytecode verifier to force load AlphaDecimalComparator just to check the return type.
Hm, interesting.
We can introduce another factory method in the Comparators class, so
that AlphaDecimalComparator won't be mentioned in the Comparator at all.

Please see the updated webrev:

With kind regards,

> If you make factory methods in AlphaDecimalComparator you can usually craft the return type to match and the verify of Comparator will not force load the AlphaDecimalComparator.
> Otherwise, the more ugly solution is to wrap 'new' with Comparator.class.cast.
> Jason
> ________________________________________
> From: core-libs-dev <core-libs-dev-bounces at openjdk.java.net> on behalf of Ivan Gerasimov <ivan.gerasimov at oracle.com>
> Sent: Monday, September 25, 2017 12:49 PM
> To: core-libs-dev
> Subject: Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator
> Hello!
> Could you please review at your convenience?
> In the latest webrev I took all suggestions into account (unless I
> missed something.)
> http://cr.openjdk.java.net/~igerasim/8134512/04/webrev/
> I think, if the suggested comparator is found useful by the users, then
> it may make sense to create the String-oriented variant, which can be
> implemented through the CharSequence-oriented one as:
> class String {
>       ...
>       @SuppressWarnings("unchecked")
>       public static <T extends String> Comparator<T>
>       comparingAlphaDecimal(Comparator<? super String> alphaComparator) {
>           return (Comparator<T>) (Comparator)
>                   new
> Comparators.AlphaDecimalComparator<>(Objects.requireNonNull(
>                           (Comparator<CharSequence>) alphaComparator),
> false);
>       }
> }
> This will be safe, since the specification guarantees that
> String.subSequence() returns a String.
> Then in the application code it would be possible to instantiate the
> comparators as
>           String.comparingAlphaDecimal(String::compareTo);
>           String.comparingAlphaDecimal(String::compareToIgnoreCase);
> or, alternatively,
>           String.comparingAlphaDecimal(Comparator.naturalOrder());
> String.comparingAlphaDecimal(String.CASE_INSENSITIVE_ORDER);
> But this could be deferred for later, of course.
> With kind regards,
> Ivan
> On 8/27/17 1:38 PM, Ivan Gerasimov wrote:
>> Hello everyone!
>> Here's another iteration of the comparator with suggested improvements.
>> Now, there is the only input argument -- the alpha-comparator for
>> comparing the non-decimal-digit sub-sequences.
>> For the javadoc I used the text suggested by Peter with some
>> modifications, additional example and API/implementation notes.
>> Overall, the javadoc looks heavier than need to me, so I'd love to
>> hear comments about how to make it shorter and cleaner.
>> Also, I adopted the name AlphaDecimal, suggested by Peter.  This name
>> is one of popular in the list of variants found in the wild. So, there
>> are higher chances the users can find the routine by its name.
>> For testing if a code point is a decimal digit, I used
>> (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER), which seem
>> to be more appropriate than Character.isDigit().  (The later is true
>> for things like a digit in a circle, superscript, etc., which do not
>> seem to be a part of a decimal number composed of several digits.)
>> The updated webrev:
>> http://cr.openjdk.java.net/~igerasim/8134512/04/webrev/
>> Please review at your convenience.
>> With kind regards,
>> Ivan
>> On 8/9/17 4:59 PM, Stuart Marks wrote:
>>> On 8/1/17 11:56 PM, Ivan Gerasimov wrote:
>>>> I've tried to go one step further and created even more abstract
>>>> comparator:  It
>>>> uses a supplied predicate to decompose the input sequences into
>>>> odd/even
>>>> subsequences (e.g. alpha/numeric) and then uses two separate
>>>> comparator to
>>>> compare them. Additionally, a comparator for comparing sequences,
>>>> consisting
>>>> only of digits is provided. For example, to build a case-insensitive
>>>> AlphaDecimal comparator one could use: 1) Character::isDigit -- as
>>>> the predicate
>>>> for decomposing, 2) String::compareToIgnoreCase -- to compare alpha
>>>> (i.e. odd
>>>> parts); to work with CharSequences one would need to make it
>>>> Comparator.comparing(CharSequence::toString,
>>>> String::compareToIgnoreCase), 3)
>>>> The special decimal-only comparator, which compares the decimal
>>>> representation
>>>> of the sequences. Here's the file with all the comparators and a
>>>> simple test:
>>>> http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java
>>> Hi, a couple follow-up thoughts on this.
>>> 1) Supplementary characters
>>> The current code uses Character.isDigit(char), which works only for
>>> char values in the BMP (basic multilingual plane, values <= U+FFFF).
>>> It won't work for supplementary characters. There are several blocks
>>> of digits in the BMP, but there are several more in the supplementary
>>> character range.
>>> I don't see any reason not to handle the supplementary characters as
>>> well, except that it spoils the nice char-by-char technique of
>>> processing the string. Instead, it'd have to pull in code point
>>> values, which might be comprised of two surrogate chars. There are a
>>> variety of methods on Character that help with this. Note that there
>>> is an overload Character.isDigit(int) which takes any code point
>>> value, including supplementary characters.
>>> 2) Too much generality?
>>> This version includes Predicate<Character> for determining whether a
>>> character is part of the alphabetic or decimal portion of the string.
>>> I'm thinking this might be overkill. It might be sufficient to
>>> "hardwire" the partitioning predicate to be Character::isDigit and
>>> the value mapping function to use Character::digit.
>>> The problem is that adding a predicate opens the door to a lot more
>>> complexity, while providing dimishing value. First, the predicate
>>> would have to handle code points (per the above) so it'd need to be
>>> an IntPredicate. Second, there would also need to be a mapping
>>> function from the code point value to a numeric value. This might be
>>> an IntUnaryOperator. This would allow someone to sort based on Roman
>>> numerals, using Character::getNumericValue. (Yes, Roman numerals are
>>> in Unicode.) Or maybe the mapping function should return any
>>> Comparable value, not an int. ... See where I'm going here?
>>> Since this kind of sorting is intended to be viewed by people, it's
>>> probably worth providing full internationalization support
>>> (supplementary characters, and delegation to sub-comparators, to
>>> allow locale-specific collating sequences). But I start to question
>>> any complexity beyond that.
>>> s'marks
> --
> With kind regards,
> Ivan Gerasimov

With kind regards,
Ivan Gerasimov

More information about the core-libs-dev mailing list