RFR (M): JDK-6394757: rev1: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes
javalists at cbfiddle.com
Tue May 14 15:09:16 UTC 2019
I strongly disagree with this change. (Fortunately, I am not a Reviewer, just Opinionated.)
1. The change to removeAll() is a significant incompatible change because of its performance implications.
2. The benefit of the change to removeAll() is restricted to a few edge cases which arguably are not bugs (with respect to the current documentation).
3. The benefit does not outweigh the cost.
In addition, the new concept of "membership semantics" is inadequately defined.
I would describe the state of Collections before this change as:
An object supporting the Collection interface either:
1. has a state that identifies a set of member objects
where members can be identified by the results of iteration
where object identification refers to the usual class-based equals/hashCode concept
where set is the mathematical notion
2. is an ersatz Collection that adopts the Collection interface because the object supports many useful Collection operations, but not all operations obey #1
The existence of ersatz Collections in the JDK is not ideal, but with the understanding that the effects of Collection operations on ersatz Collections is undefined (implementation dependent), it can be made to work. With this understanding, the "semantic shift" of removeAll() is simply a reflection of the undefined (implementation dependent) behavior of ersatz Collections and is not a bug.
The revised design tries to improve this situation by defining a new category of Collection, namely, a Collection whose membership semantics are defined in a non-standard way (not compatible with equals and hashCode). My understanding is that the ersatz Collections defined by the JDK are all Sets that have either stronger or weaker notions of member identity. Hence, the proposal has the potential of replacing ersatz Collections with non-standard, but well behaved Collections. Indeed, a claim is made that Collections with non-standard membership semantics are well defined and an implication is made that the only possible issue is that the results may be "counterintuitive" when Collections with different membership semantics are mixed.
This sounds like an improvement, except for two things:
1. I don't think removeAll() is well defined.
The definition says: "Removes all of this collection's elements that are also contained in the specified collection."
What are the elements of a Collection with a non-standard membership semantics? Are they all of the objects for which contains() returns true? Are they only the objects returned by the iterator? Is the Collection required to define a canonical set of objects that the iterator must return? Can an iterator return two non-equal objects that are equivalent according to its membership semantics? What happens if the specified collection contains one of these objects but not the other?
The definition says: "After this call returns, this collection will contain no elements in common with the specified collection."
That sounds like Collections.disjoint(), which is undefined in the mixed membership semantics case.
If the statement is meaningful, why isn't Collections.disjoint() defined in this case?
Alternatively, why is it OK for Collections.disjoint() to be undefined, but not removeAll()?
2. There will be serious performance hits.
To me, the troubling consequence is that using removeAll() to remove a small number of elements from a large set implemented using hashing goes from fast to very slow. This consequence does not appear to be covered by the current implementation note. It has nothing to do with the performance of contains() on the parameter, but on the need to test every member of the target.
I will need to review every use of removeAll() in my code and convert to iteration where appropriate. This is a significant incompatibility!
Other comments (if the above issues are discounted):
The concept of non-standard membership semantics is introduced (in Collection) without any previous definition of standard membership semantics or any specification of the requirements for a valid membership semantics. A linkage to the contains() method is implied, but not explained.
Classes such as SortedSet and TreeSet that currently can be used to create ersatz Collections should have their documentation changed to say that they may have non-standard membership semantics.
There should be some guidance to developers on how to write code that operates in a well defined manner on both standard and non-standard Collections, including multiple Collections with mixed membership semantics.
The definition of List.contains() rules out a List class with non-standard membership semantics. Is this necessary?
Given that this is a seriously incompatible change (in performance), why not add a method to reveal whether a Collection is using non-standard membership semantics? If this method returns false, then the Collection is either standard or ersatz, and in the latter case, generic behavior remains undefined, so the old removeAll() optimization can be used.
Alternatively, if the behavior of removeAll with a non-standard parameter is important, it can be supported by a filter(predicate) operation.
> On May 13, 2019, at 5:28 PM, Stuart Marks <stuart.marks at oracle.com> wrote:
> Hi all,
> Here's an updated webrev for this change. The general direction of the change is the same as before. Differences are primarily additional wording in a few relevant places, plus editorial and markup tweaks. Highlights of changes from the previous webrev include:
> - addition of FIXME comment and reference to javadoc bug report, where doc comment from interface cannot be inherited
> - tuned up wording around operations that depend on membership semantics of more than one collection
> - updated wording about membership semantics in Collections.disjoint()
> - added performance-related @implNote to List.removeAll/retainAll
> - various markup fixes
> Previous review thread:
More information about the core-libs-dev