RFR: 8132306: java/lang/ref/ReferenceEnqueue.java fails with "RuntimeException: Error: poll() returned null; expected ref object"
kim.barrett at oracle.com
Wed Jul 29 23:46:39 UTC 2015
On Jul 29, 2015, at 4:32 AM, David Holmes <david.holmes at oracle.com> wrote:
> On 29/07/2015 5:57 PM, Kim Barrett wrote:
>> ... The race is being fixed by reordering a pair
>> of volatile assignments.
> While this seems logical for the failure at hand it isn't immediately obvious to me that setting next before setting the ENQUEUED state won't cause a problem for some other piece of code. Really these things need to be tightly synchronized - I think the real bug is the unsynchronized fast-path for the empty-queue case in poll(). While that change was deliberate in 6666739 this side-effect was not realized and would have affected the change. I hope Martin and/or Tom see this and chime in.
I thought the poll() fast-path from 6666739 was ok at the time it was
made, and that it was the later removal of synchronization on the
reference by 8014890 that lead to the race under discussion, and
reinstating that synchronization on the reference was another way to
fix this race. Turns out I was wrong about that.
The poll() fast-path introduced the possibility of the following
unexpected behavior when another thread is in r.enqueue() and is in
the problematic window:
!r.enqueue() && q.poll() == null => true
This can happen because the synchronization on the references is only
in ReferenceQueue.enqueue(). If it was instead in
Reference.enqueue(), or if ReferenceQueue.Null.enqueue() also
synchronized on the reference, this race would be prevented.
The removal of reference synchronization opened the door wider, also
allowing this unexpected behavior:
r.isEnqueued() && q.poll() == null => true
The combination of those changes is needed for the ReferenceEnqueue
regression test to fail, since it requires
r.isEnqueued() && !r.enqueue() && q.poll() == null => true
I wouldn't want to revert the poll() fast-path, since that was added
for specific performance reasons.
I don't think I'd want to add back synchronization on the reference,
but might be persuaded otherwise. 8029205 should be looked at in that
I've looked carefully at uses of r.next and I don't think there is a
problem with reordering the r.next and r.queue assignments. Of
course, the existing code was looked at by a number of people without
spotting the race under discussion.
A way to think about this that helped me (I think) understand the
locking structure used here is that q.head and r.next are part of the
queue associated with the reference. We can manipulate those using
the queue's lock as the basis for protection, so long as the the
reference's queue isn't changed. But when we change the reference's
queue, the queue (including r.next) must be in a consistent state.
[This suggests the r.queue = NULL assignment in reallyPoll() should be
moved later, though I think the assignment order in reallyPoll()
> That aside the commentary is rather verbose, a simple:
> // Only set ENQUEUED state after the reference is enqueued
> would suffice (and add "volatiles ensure ordering" if you want that to be clearer).
I do get a bit wordy sometimes. How about this:
// Update r.queue *after* adding to queue's list, to avoid
// race between enqueued checks and poll(). Volatiles
// ensure ordering.
>> jprt with default and hotspot testsets
>> Locally tested race in original code by insertion of sleep in enqueue and
>> verifying ReferenceEnqueue regression test fails consistently as
>> described in the bug report.
>> Locally tested fixed code by insertion of sleeps at various points and
>> verifying ReferenceEnqueue regression test still passes.
More information about the hotspot-gc-dev