RFR (M): JDK-6394757: rev1: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Jens Lideström jens at lidestrom.se
Fri Jul 26 12:26:40 UTC 2019

Hello core-lib-dev,

I think it is a mistake to make `AbstractSet#removeAll` use the
membership semantics of the argument collection instead of those of the
receiver set, and to iterate over the receiver instead of the argument

I think that there are much stronger arguments for doing it the other
way around: to use the membership semantics of the receiver and to
iterate over the argument. This is what Stuart suggest in his first
prototype [1], but unlike the how it looks in the current webrev [2].

After having read through the original thread (most importantly this
message: [3]) and the review threads, it is my impression that the
disadvantages of this solution have been over-emphasised and the
advantages under-emphasised.

This is my impression of the advantages and disadvantages of making
`removeAll` use the membership semantics of the receiver set:

## Advantages

1. It leads to the expected behaviour when custom membership semantics
are used.
2. It preserves the behaviour in the common case when custom membership
semantics are used.
3. It has better performance for hashed sets in the common case.

## Disadvantages

4. It is different from `AbstractCollection`.
5. It is not possible to do in `retainAll`, and thus make the two
methods inconsistent. (See [3].)

I don't think 1-2 has been given enough weight in the discussion, and
that 4-5 have been given too much.

## Details

1. I think that often `removeAll` is called without much tough about the
properties of the argument collection, except for its members. It will
be confusing, annoying and error prone if code like this doesn't work as

	var s = new TreeSet<>(String::compareToIgnoreCase);
	s.addAll(List.of("a", "b", "c");
	// Should remove "a" and "b"
	s.removeAll(List.of("A", "B"));

Non-set collections on the other hand are probably much less likely to
have custom membership semantics. Because of this they can get away with
using the membership semantics of the argument, without exhibiting
unexpected behaviour for code that is similar to the above.

2. The consequences are even more sever if code like the above stops
working as expected. This will probably cause some code that has been
working for years (decades?) to suddenly fails.

    * I think that it is much more common to use `removeAll` to remove a
small number of elements from a bigger set. If the membership semantics
of the argument are to be used in that case, then the behaviour of
existing code will change. If on the other hand the membership semantics
of the receiver are to be used, then existing code will continue to
behave in the same way.
    * I think that existing code that works in the opposite way, that
depends on the custom membership semantics of the argument of
`removeAll`, is much more uncommon, and thus that it is more acceptable
that such code stops working.

3. Alan Snyder wrote about performance at length in his mails. This is
an important point, even if the arguments above are more important. For
example, some applications with large HashSet-based caches might stop
working in a satisfactory way.

4. I think that consistency with `AbstractCollection#removeAll` has been
given too much importance in the discussion. Sets are different from
lists, and it seem justified that the implementation of `removeAll` in
AbstractSet works in a different way from `AbstractCollection`.

    * Collections that allow duplicates are forced to iterate `this` out
of necessity.
    * But sets have no duplicates. It is natural take advantage of this
fact and iterate the argument. It seems unlikely that this would be
unexpected for anyone or lead to problems.
    * Non-set collections are probably much less likely to have custom
equality semantics, which makes iterating over the argument less

5. The difference in behaviour between `retainAll` and `removeAll` that
is the result of using the membership semantics of the receiver is, I
think, the most serious disadvantage of this solution. This disadvantage
is not as strong as the advantages in 1-2 above, however.

Also, I think that there is some justification to it. I think
`retainAll` is perceived much more as a set theoretic operation than
`removeAll`. This makes it more intuitive for a user that you should
ensure that the argument to that method is a collection with the right
membership semantics.

### Conclusion

I think the following original suggestion [1] is a much better choice
for an implementation of `AbstractSet#removeAll` than the existing
implementation in `AbstractCollection`:

     public boolean removeAll(Collection<?> c) {
         boolean modified = false;
         for (Object e : c)
             modified |= remove(e);
         return modified;

[2]: http://cr.openjdk.java.net/~smarks/reviews/6394757/webrev.1/

Jens Lideström
Proud new Eclipse committer

More information about the core-libs-dev mailing list