RFR(S): 8009536: G1: Apache Lucene hang during reference processing

John Cuthbertson john.cuthbertson at oracle.com
Mon Mar 11 14:35:58 PDT 2013

Hi Everyone,

Can I have a couple of volunteers review these changes? The webrev can 
be found at: http://cr.openjdk.java.net/~johnc/8009536/webrev.0/.

First of all - many thanks to Uwe Schindler for discovering and reporting
the problem and providing very clear instructions on how to reproduce
the issue. Many thanks also to Dawid Weiss for stepping in with a
self-contained reproducer.

I also wish to thank Bengt for his help. It was Bengt who gave me the 
magic proxy formula that allowed Ivy to satisfy and download all the 
dependencies for the test case. Bengt also diagnosed the problem and 
gave an initial fix (which the changes in the webrev are based upon).

During the remark pause, the execution of the parallel RemarkTask set
the number of worker threads in the ParallelTaskTerminator and the first
and second barrier syncs. During serial reference processing, the
marking stack overflowed, causing the single thread (the VM thread) to
enter the overflow handling code in CMTask::do_marking_step(). This
overflow code uses two WorkBarrierSyncs to synchronize the threads
before resetting the marking state for restarting marking. The barrier
syncs were waiting for the number of threads that participated in the
RemarkTask but, since only the VM thread was executing, only a single
thread entered the barrier - resulting in the barrier waiting
indefinitely for the other (non-existent) threads.
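
To make the hang concrete, here is a minimal sketch of the counting
logic only. ToyBarrier and lone_thread_passes are hypothetical names
for illustration; this is not the HotSpot barrier class, and it models
the release condition rather than actually blocking.

```cpp
#include <cassert>

// Toy stand-in for a barrier sync (hypothetical, not HotSpot code).
class ToyBarrier {
 public:
  // Record how many threads must arrive before anyone is released.
  void set_n_workers(int n) { expected_ = n; arrived_ = 0; }

  // One thread arrives at the barrier.
  void enter() { ++arrived_; }

  // True once every expected thread has arrived; a real barrier would
  // block each caller until this condition holds.
  bool released() const { return arrived_ >= expected_; }

 private:
  int expected_ = 0;
  int arrived_ = 0;
};

// Returns true if a single thread entering the barrier would be released,
// given the worker count left behind by an earlier parallel task.
inline bool lone_thread_passes(int configured_workers) {
  ToyBarrier barrier;
  barrier.set_n_workers(configured_workers);
  barrier.enter();  // only one thread (the VM thread in the email) arrives
  return barrier.released();
}
```

With a worker count of 4 left over from the parallel task,
lone_thread_passes(4) is false, which corresponds to the indefinite
wait; with a count of 1 it is true.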

A proposed solution was to call set_phase to reset the number of threads 
in the parallel task terminator and the barriers to the number of active 
threads for the reference processing. This solution ran into a similar 
hang while processing the JNI references with parallel reference 
processing enabled. (In parallel reference processing, the JNI 
references are processed serially by the calling thread). Resetting the 
phase to single-threaded before processing the JNI refs solved the 
second hang but resulted in an assertion failure: only a concurrentGC 
thread can enter a barrier sync and the calling thread was the VM thread.

Furthermore, another problem was discovered. If the marking state is
reset, a subsequent call to set_phase() will assert as the global finger
has been set to the start of the heap. This was discovered when the
marking stack overflowed during the RemarkTask and parallel reference
processing called set_phase() to reinitialize the number of workers in
the parallel task terminator. It was also discovered when trying out
another proposed solution: adding a start_gc closure to reference
processing which would call set_phase() before each processing phase.
As a result, the marking state is only reset by worker 0 if an overflow
occurs during the concurrent phase of marking; if an overflow occurs
during remark, reference processing is skipped, and the marking state
is reset by the VM thread. Resetting the marking state before reference
processing was a benign error (objects would be marked but not pushed
on to the stack as they were no longer below the finger; the objects
would then be traced, in the normal fashion, when marking restarted)
but it's better to be safe than sorry. The other part of the fix for
this secondary problem is that
the parallel reference processing task executor now calls the 
terminator's reset_for_reuse() routine instead of set_phase().
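
The "marked but not pushed" behaviour described above can be sketched
as follows. This is a simplified model under stated assumptions:
ToyMarker, its fields and the integer addresses are all hypothetical,
and the only rule modelled is "push an object only if it lies below
the finger".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of marking state; addresses are plain integers.
struct ToyMarker {
  std::uintptr_t heap_start;
  std::uintptr_t finger;  // marking has already scanned below this address
  int marked;
  int pushed;

  ToyMarker(std::uintptr_t start, std::uintptr_t initial_finger)
      : heap_start(start), finger(initial_finger), marked(0), pushed(0) {}

  // Resetting the marking state moves the finger back to the heap start.
  void reset_marking_state() { finger = heap_start; }

  // Every object is marked, but only objects below the finger are pushed
  // for immediate tracing; the rest are left for the restarted scan.
  void mark_and_maybe_push(std::uintptr_t obj) {
    ++marked;
    if (obj < finger) {
      ++pushed;
    }
  }
};
```

In this model, once reset_marking_state() has run, no object address is
below the finger, so later objects are counted as marked but never
pushed.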

The resulting solution for the hang is based upon the patch sent out by 
Bengt - namely we do not enter the sync barriers when 
CMTask::do_marking_step() is being called serially. The difference is 
that I added an extra parameter to CMTask::do_marking_step() instead of 
piggy-backing on the existing parameter list. Additionally, if this new 
parameter indicates serial operation, the current thread will skip
offering termination. This allows the serial drain closure to enter the 
termination protocol and execute the guarantees contained therein.
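
A rough sketch of the control flow this describes, assuming a boolean
flag named is_serial. ToyMarkingStep and its counters are hypothetical
stand-ins for illustration and are not the code in the webrev.

```cpp
#include <cassert>

// Hypothetical model of the tail end of a marking step.
struct ToyMarkingStep {
  int barrier_entries = 0;
  int termination_offers = 0;
  bool guarantees_checked = false;

  // overflowed: the marking stack overflowed during this step.
  // is_serial:  the step is being run by a single thread.
  void run(bool overflowed, bool is_serial) {
    if (!overflowed) {
      // A serial caller does not offer termination; it is treated as
      // finished, so the end-of-marking checks still run.
      bool finished = is_serial || offer_termination();
      if (finished) {
        guarantees_checked = true;
      }
      return;
    }
    // Overflow handling: only parallel callers synchronize on the two
    // barriers around the state reset.
    if (!is_serial) enter_barrier();
    // ... marking state would be reset here ...
    if (!is_serial) enter_barrier();
  }

 private:
  bool offer_termination() { ++termination_offers; return true; }
  void enter_barrier() { ++barrier_entries; }
};
```

In this model a serial step that overflows enters no barriers, while a
parallel one enters two; a serial step that finishes normally runs the
checks without offering termination.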

The other changes are for the secondary issues, described above, that
were discovered while trying out other possible solutions.

Testing:
The Lucene test case with serial reference processing (with and without
verification); the Lucene test case with parallel reference processing
(with and without verification).
GC test suite with a mark stack size of 1K and 4K, with both serial and
parallel reference processing (with and without verification).

More information about the hotspot-gc-dev mailing list