RFR: 8132306: java/lang/ref/ReferenceEnqueue.java fails with "RuntimeException: Error: poll() returned null; expected ref object"

David Holmes david.holmes at oracle.com
Thu Jul 30 02:40:28 UTC 2015

Hi Kim,

tl;dr: Ship it! :)

On 30/07/2015 9:46 AM, Kim Barrett wrote:
> On Jul 29, 2015, at 4:32 AM, David Holmes <david.holmes at oracle.com> wrote:
>> On 29/07/2015 5:57 PM, Kim Barrett wrote:
>>> ...  The race is being fixed by reordering a pair
>>> of volatile assignments.
>> While this seems logical for the failure at hand it isn't immediately obvious to me that setting next before setting the ENQUEUED state won't cause a problem for some other piece of code. Really these things need to be tightly synchronized - I think the real bug is the unsynchronized fast-path for the empty-queue case in poll(). While that change was deliberate in 6666739 this side-effect was not realized and would have affected the change. I hope Martin and/or Tom see this and chime in.
> I thought the poll() fast-path from 6666739 was ok at the time it was
> made, and that it was the later removal of synchronization on the
> reference by 8014890 that lead to the race under discussion, and
> reinstating that synchronization on the reference was another way to
> fix this race.  Turns out I was wrong about that.

Seems there are multiple races present. :(

> The poll() fast-path introduced the possibility of the following
> unexpected behavior when another thread is in r.enqueue() and is in
> the problematic window:
>      !r.enqueue() && q.poll() == null => true
> This can happen because the synchronization on the references is only
> in ReferenceQueue.enqueue().  If it was instead in
> Reference.enqueue(), or if ReferenceQueue.Null.enqueue() also
> synchronized on the reference, this race would be prevented.

Right - the switching of the queues changes the lock in use and so we no 
longer serialize access to the reference state.

> The removal of reference synchronization opened the door wider, also
> allowing this unexpected behavior:
>      r.isEnqueued() && q.poll() == null => true
> The combination of those changes is needed for the ReferenceEnqueue
> regression test to fail, since it requires
>      r.isEnqueued() && !r.enqueue() && q.poll() == null => true

Yes - all of which can unfortunately run concurrently with the still 
in-progress r.enqueue().

> I wouldn't want to revert the poll() fast-path, since that was added
> for specific performance reasons.

But that was done under an assumption of correctness. If it isn't 
correct then its okay to regress performance. But of course if we can 
fix it some other way ...

> I don't think I'd want to add back synchronization on the reference,
> but might be persuaded otherwise.  8029205 should be looked at in that
> case.

It is a tricky one - as I wrote in the review thread for 8014890. Basic 
synchronization rules say you should lock the Reference whilst mutating 
its state - but that leads to nested locking issues when also trying to 
update the queue's state. So we add a volatile, drop a lock and create 
one or more races - and hope the we have correctly determined that the 
races are benign (which we didn't in this case).

> I've looked carefully at uses of r.next and I don't think there is a
> problem with reordering the r.next and r.queue assignments.  Of
> course, the existing code was looked at by a number of people without
> spotting the race under discussion.

Between your analysis and Daniel's I'm more confident about the change. 
The Java side of things seems fine, but trying to enumerate all the 
possibles races is difficult to say the least.

> A way to think about this that helped me (I think) understand the
> locking structure used here is that q.head and r.next are part of the
> queue associated with the reference.  We can manipulate those using
> the queue's lock as the basis for protection, so long as the the
> reference's queue isn't changed.  But when we change the reference's
> queue, the queue (including r.next) must be in a consistent state.

I'm not 100% sure it sufficiently covers the cases where we switch 
queues, but yes logically the Reference's queue field is guarded by the 
queue's lock.

> [This suggests the r.queue = NULL assignment in reallyPoll() should be
> moved later, though I think the assignment order in reallyPoll()
> doesn't matter.]

I don't see that it matters as the current queue is locked and the only 
interesting actions can happen with the current queue. For the NULL and 
ENQUEUED queues enqueuing is a no-op, and there can only be one "real" 
queue involved.

>> That aside the commentary is rather verbose, a simple:
>> // Only set ENQUEUED state after the reference is enqueued
>> would suffice (and add "volatiles ensure ordering" if you want that to be clearer).
> I do get a bit wordy sometimes.  How about this:
>              // Update r.queue *after* adding to queue's list, to avoid
>              // race between enqueued checks and poll().  Volatiles
>              // ensure ordering.

Sounds fine.


>>> CR:
>>> https://bugs.openjdk.java.net/browse/JDK-8132306
>>> Webrev:
>>> http://cr.openjdk.java.net/~kbarrett/8132306/webrev.00/
>>> Testing:
>>> jprt with default and hotspot testsets
>>> Locally tested race in original code by insertion of sleep in enqueue and
>>> verifying ReferenceEnqueue regression test fails consistently as
>>> described in the bug report.
>>> Locally tested fixed code by insertion of sleeps at various points and
>>> verifying ReferenceEnqueue regression test still passes.

More information about the hotspot-gc-dev mailing list