RFR(S): 8009536: G1: Apache Lucene hang during reference processing
bengt.rutisson at oracle.com
Tue Mar 12 08:05:04 PDT 2013
Thanks for fixing this so quickly!
I have only had a quick look at the change. I'll make sure to look
closer tomorrow. However, I have two questions. If you have time to
answer them it would be good. If you don't have time I hope they become
clear when I look more in detail at the change tomorrow...
First, in the constructor for G1CMDrainMarkingStackClosure() we do:
2285 _do_stealing = !_is_serial;
2286 _do_termination = true;
Just from looking at this it seems like we could get away with only
having one instance variable instead of three. Is that the case or can
any of _do_stealing, _is_serial or _do_termination change during the
life time of a G1CMDrainMarkingStackClosure instance?
Second, as you describe in the text below you are actually fixing two
issues with this patch. Would it make sense to split the changes up into
On 3/11/13 10:35 PM, John Cuthbertson wrote:
> Hi Everyone,
> Can I have a couple of volunteers review these changes? The webrev can
> be found at: http://cr.openjdk.java.net/~johnc/8009536/webrev.0/.
> First of all - many thanks to Uwe Schindler for discovering an
> reporting the problem and providing very clear instructions on how to
> reproduce the issue. Many thanks also Dawid Weiss for also stepping in
> with a self-contained reproducer.
> I also wish to thank Bengt for his help. It was Bengt who gave me the
> magic proxy formula that allowed Ivy to satisfy and download all the
> dependencies for the test case. Bengt also diagnosed the problem and
> gave an initial fix (which the changes in the webrev are based upon).
> During the remark pause, the execution of the parallel RemarkTask set
> the number of workers thread in the ParallelTaskTerminator and the
> first and second barrier syncs. During serial reference processing,
> the marking stack overflowed causing the single (VMThread) thread to
> enter the overflow handling code in CMTask::do_marking_step(). This
> overflow code using two WorkBarrierSyncs to synchronize the threads
> before resetting the marking state for restarting marking. The barrier
> syncs were waiting for the number of threads that participated in the
> RemarkTask) but, since only the VM thread was executing, only a single
> thread entered the barrier - resulting in the barrier indefinitely
> waiting for the other (non existent) threads.
> A proposed solution was to call set_phase to reset the number of
> threads in the parallel task terminator and the barriers to the number
> of active threads for the reference processing. This solution ran into
> a similar hang while processing the JNI references with parallel
> reference processing enabled. (In parallel reference processing, the
> JNI references are processed serially by the calling thread).
> Resetting the phase to single-threaded before processing the JNI refs
> solved the second hang but resulted in an assertion failure: only a
> concurrentGC thread can enter a barrier sync and the calling thread
> was the VM thread.
> Furthermore another problem was discovered. If the marking state is
> reset, a subsequent call to set_phase() will assert as the global
> finger has been set to start of the heap. This was a discovered by the
> marking stack overflowing during the RemarkTask and parallel reference
> processing calling set_phase() to reinitialize number of workers in
> the parallel task terminator. It was also discovered when trying out
> another proposed solution: adding a start_gc closure to reference
> processing which would call set_phase() before each processing phase.
> As a result the marking state is only reset by worker 0 if an overflow
> occurs during the concurrent phase of marking; if an overflow occurs
> during remark, reference processing is skipped, and the marking state
> is reset by the VM thread. Resetting the marking state before
> reference processing was a benign error (objects would be marked but
> not pushed on to the stack as they were no longer below the finger;
> the objects would then be traced, in the normal fashion, when marking
> restarted) but it's better to safe than sorry. The other part of the
> fix for this secondary problem is that the parallel reference
> processing task executor now calls the terminator's reset_for_reuse()
> routine instead of set_phase().
> The resulting solution for the hang is based upon the patch sent out
> by Bengt - namely we do not enter the sync barriers when
> CMTask::do_marking_step() is being called serially. The difference is
> that I added an extra parameter to CMTask::do_marking_step() instead
> of piggy-backing on the existing parameter list. Additionally, if this
> new parameter indicates serial operation, the current thread will skip
> offering termination. This allows the serial drain closure to enter
> the termination protocol and execute the guarantees contained therein.
> The other changes are for the secondary issues, described above, that
> were discovered while out trying other possible solutions.
> The lucene test case with serial reference processing (with and
> without verification); the lucene test case with parallel reference
> processing (with and without verification).
> GC test suite with a mark stack size of 1K and 4K, with both serial
> and parallel reference processing (with and without verification).
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the hotspot-gc-dev