RFR: 8132306: java/lang/ref/ReferenceEnqueue.java fails with "RuntimeException: Error: poll() returned null; expected ref object"
peter.levart at gmail.com
Thu Jul 30 08:38:43 UTC 2015
Let me chime in and add some comments...
On 07/30/2015 01:46 AM, Kim Barrett wrote:
> On Jul 29, 2015, at 4:32 AM, David Holmes <david.holmes at oracle.com> wrote:
>> On 29/07/2015 5:57 PM, Kim Barrett wrote:
>>> ... The race is being fixed by reordering a pair
>>> of volatile assignments.
>> While this seems logical for the failure at hand it isn't immediately obvious to me that setting next before setting the ENQUEUED state won't cause a problem for some other piece of code. Really these things need to be tightly synchronized - I think the real bug is the unsynchronized fast-path for the empty-queue case in poll(). While that change was deliberate in 6666739 this side-effect was not realized and would have affected the change. I hope Martin and/or Tom see this and chime in.
> I thought the poll() fast-path from 6666739 was ok at the time it was
> made, and that it was the later removal of synchronization on the
> reference by 8014890 that lead to the race under discussion, and
> reinstating that synchronization on the reference was another way to
> fix this race. Turns out I was wrong about that.
> The poll() fast-path introduced the possibility of the following
> unexpected behavior when another thread is in r.enqueue() and is in
> the problematic window:
> !r.enqueue() && q.poll() == null => true
> This can happen because the synchronization on the references is only
> in ReferenceQueue.enqueue(). If it was instead in
> Reference.enqueue(), or if ReferenceQueue.Null.enqueue() also
> synchronized on the reference, this race would be prevented.
Isn't above condition perfectly legal for references that have already
been de-queued (and when the 'q' is otherwise empty)? But I see what you
mean. If r.enqueue() grabs the ENQUEUED queue, the method returns false,
but that does not mean that the reference has already been hooked on the
'head' chain ('head' can still be null and q.poll() would return null
because of fast-path).
reversing assignments of 'head' and 'queue' fix this situation too.
> The removal of reference synchronization opened the door wider, also
> allowing this unexpected behavior:
> r.isEnqueued() && q.poll() == null => true
This condition is more representative, yes...
> The combination of those changes is needed for the ReferenceEnqueue
> regression test to fail, since it requires
> r.isEnqueued() && !r.enqueue() && q.poll() == null => true
> I wouldn't want to revert the poll() fast-path, since that was added
> for specific performance reasons.
> I don't think I'd want to add back synchronization on the reference,
> but might be persuaded otherwise. 8029205 should be looked at in that
> I've looked carefully at uses of r.next and I don't think there is a
> problem with reordering the r.next and r.queue assignments. Of
> course, the existing code was looked at by a number of people without
> spotting the race under discussion.
> A way to think about this that helped me (I think) understand the
> locking structure used here is that q.head and r.next are part of the
> queue associated with the reference. We can manipulate those using
> the queue's lock as the basis for protection, so long as the the
> reference's queue isn't changed. But when we change the reference's
> queue, the queue (including r.next) must be in a consistent state.
> [This suggests the r.queue = NULL assignment in reallyPoll() should be
> moved later, though I think the assignment order in reallyPoll()
> doesn't matter.]
I think the assignment to r.queue = NULL in realyPoll() should be moved
*before* the assignment to 'head' (which might assign null if 'r' was
the last element). Here's why:
Suppose we have a Reference 'r' and it's associated ReferenceQueue 'q'.
Currently it can happen that the following evaluates to true, which is
q.poll() == null && r.isEnqueued()
>> That aside the commentary is rather verbose, a simple:
>> // Only set ENQUEUED state after the reference is enqueued
>> would suffice (and add "volatiles ensure ordering" if you want that to be clearer).
> I do get a bit wordy sometimes. How about this:
> // Update r.queue *after* adding to queue's list, to avoid
> // race between enqueued checks and poll(). Volatiles
> // ensure ordering.
>>> jprt with default and hotspot testsets
>>> Locally tested race in original code by insertion of sleep in enqueue and
>>> verifying ReferenceEnqueue regression test fails consistently as
>>> described in the bug report.
>>> Locally tested fixed code by insertion of sleeps at various points and
>>> verifying ReferenceEnqueue regression test still passes.
More information about the hotspot-gc-dev