RFR(S): 8009536: G1: Apache Lucene hang during reference processing

John Cuthbertson john.cuthbertson at oracle.com
Tue Mar 12 15:00:27 PDT 2013


Hi Everyone,

Here's a new webrev based upon comments from Bengt and Thomas.

http://cr.openjdk.java.net/~johnc/8009536/webrev.1/

This webrev includes just the changes to resolve the hang seen when the 
marking stack overflows during *serial* reference processing. As I 
said in my response to Bengt, this revision will produce the following 
assertion failure:

> [junit4:junit4] 80.722: [GC remark 80.723: [GC ref-proc80.785: [GC 
> concurrent-mark-reset-for-overflow]
> [junit4:junit4] # To suppress the following error report, specify this 
> argument
> [junit4:junit4] # after -XX: or in .hotspotrc: 
> SuppressErrorAt=/concurrentMark.cpp:809
> [junit4:junit4] #
> [junit4:junit4] # A fatal error has been detected by the Java Runtime 
> Environment:
> [junit4:junit4] #
> [junit4:junit4] #  Internal Error 
> (/export/workspaces/8009536_3/src/share/vm/gc_implementation/g1/concurrentMark.cpp:809), 
> pid=16314, tid=14
> [junit4:junit4] #  assert(_finger == _heap_end) failed: only way to 
> get here
> [junit4:junit4] #
> [junit4:junit4] # JRE version: Java(TM) SE Runtime Environment 
> (8.0-b79) (build 1.8.0-ea-fastdebug-b79)
> [junit4:junit4] # Java VM: Java HotSpot(TM) Server VM 
> (25.0-b23-internal-jvmg mixed mode solaris-x86 )
> [junit4:junit4] # Core dump written. Default location: 
> /export/bugs/8009536/lucene-5.0-2013-03-05_15-37-06/build/analysis/uima/test/J0/core 
> or core.16314
> [junit4:junit4] #
> [junit4:junit4] # An error report file with more information is saved as:
> [junit4:junit4] # 
> /export/bugs/8009536/lucene-5.0-2013-03-05_15-37-06/build/analysis/uima/test/J0/hs_err_pid16314.log
> [junit4:junit4] #
> [junit4:junit4] # If you would like to submit a bug report, please visit:
> [junit4:junit4] #   http://bugreport.sun.com/bugreport/crash.jsp
> [junit4:junit4] #
> [junit4:junit4] Current thread is 14
> [junit4:junit4] Dumping core ...

when run with parallel reference processing enabled. That fix will be 
sent out shortly.

JohnC

On 3/11/2013 2:35 PM, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers review these changes? The webrev can 
> be found at: http://cr.openjdk.java.net/~johnc/8009536/webrev.0/.
>
> First of all - many thanks to Uwe Schindler for discovering and 
> reporting the problem and providing very clear instructions on how to 
> reproduce the issue. Many thanks also to Dawid Weiss for stepping in 
> with a self-contained reproducer.
>
> I also wish to thank Bengt for his help. It was Bengt who gave me the 
> magic proxy formula that allowed Ivy to satisfy and download all the 
> dependencies for the test case. Bengt also diagnosed the problem and 
> gave an initial fix (which the changes in the webrev are based upon).
>
> Summary:
> During the remark pause, the execution of the parallel RemarkTask set 
> the number of worker threads in the ParallelTaskTerminator and the 
> first and second barrier syncs. During serial reference processing, 
> the marking stack overflowed, causing the single thread (the VM thread) 
> to enter the overflow handling code in CMTask::do_marking_step(). This 
> overflow code uses two WorkBarrierSyncs to synchronize the threads 
> before resetting the marking state for restarting marking. The barrier 
> syncs were waiting for the number of threads that participated in the 
> RemarkTask but, since only the VM thread was executing, only a single 
> thread entered the barrier, resulting in the barrier waiting 
> indefinitely for the other (nonexistent) threads.
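
To illustrate the mismatch described in the summary above, here is a toy, non-blocking model of a counting barrier. It is only a sketch: ToyBarrier and missing_after() are hypothetical names made up for this example and are not HotSpot code. The model counts arrivals instead of blocking, so it can show how many threads a real barrier would still be waiting for.

```cpp
#include <cassert>

// Toy, non-blocking model of a counting barrier (hypothetical; not HotSpot code).
class ToyBarrier {
 public:
  explicit ToyBarrier(unsigned expected) : _expected(expected), _arrived(0) {}

  // Record one arrival. Returns true if the barrier would now release,
  // false if a real barrier would still be blocking the caller.
  bool arrive() {
    ++_arrived;
    return _arrived >= _expected;
  }

  // Number of arrivals the barrier is still waiting for.
  unsigned missing() const {
    return _arrived >= _expected ? 0 : _expected - _arrived;
  }

 private:
  unsigned _expected;  // thread count the barrier was sized for
  unsigned _arrived;   // threads that have entered so far
};

// Size a barrier for `expected` threads, let `entering` threads arrive,
// and report how many arrivals are still outstanding.
unsigned missing_after(unsigned expected, unsigned entering) {
  ToyBarrier barrier(expected);
  for (unsigned i = 0; i < entering; ++i) {
    barrier.arrive();
  }
  return barrier.missing();
}
```

With a barrier sized for, say, 4 remark workers and only the VM thread entering, missing_after(4, 1) stays above zero, which corresponds to the indefinite wait; a barrier sized for 1 thread would release immediately.
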
>
> A proposed solution was to call set_phase to reset the number of 
> threads in the parallel task terminator and the barriers to the number 
> of active threads for the reference processing. This solution ran into 
> a similar hang while processing the JNI references with parallel 
> reference processing enabled. (In parallel reference processing, the 
> JNI references are processed serially by the calling thread). 
> Resetting the phase to single-threaded before processing the JNI refs 
> solved the second hang but resulted in an assertion failure: only a 
> concurrentGC thread can enter a barrier sync and the calling thread 
> was the VM thread.
>
> Furthermore another problem was discovered. If the marking state is 
> reset, a subsequent call to set_phase() will assert as the global 
> finger has been set to the start of the heap. This was discovered when 
> the marking stack overflowed during the RemarkTask and parallel 
> reference processing then called set_phase() to reinitialize the number 
> of workers in the parallel task terminator. It was also discovered when 
> trying out another proposed solution: adding a start_gc closure to 
> reference processing, which would call set_phase() before each 
> processing phase. 
> As a result the marking state is only reset by worker 0 if an overflow 
> occurs during the concurrent phase of marking; if an overflow occurs 
> during remark, reference processing is skipped, and the marking state 
> is reset by the VM thread. Resetting the marking state before 
> reference processing was a benign error (objects would be marked but 
> not pushed on to the stack as they were no longer below the finger; 
> the objects would then be traced, in the normal fashion, when marking 
> restarted) but it's better to be safe than sorry. The other part of the 
> fix for this secondary problem is that the parallel reference 
> processing task executor now calls the terminator's reset_for_reuse() 
> routine instead of set_phase().
>
> The resulting solution for the hang is based upon the patch sent out 
> by Bengt - namely we do not enter the sync barriers when 
> CMTask::do_marking_step() is being called serially. The difference is 
> that I added an extra parameter to CMTask::do_marking_step() instead 
> of piggy-backing on the existing parameter list. Additionally, if this 
> new parameter indicates serial operation, the current thread will skip 
> offering termination. This allows the serial drain closure to enter 
> the termination protocol and execute the guarantees contained therein.
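
The shape of the fix described in the paragraph above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in the webrev: StepTrace and toy_marking_step() are hypothetical names, and the real CMTask::do_marking_step() does far more work. The sketch only records which synchronization points a call would reach.

```cpp
#include <cassert>

// Records which synchronization points one marking step would touch.
// Hypothetical type for illustration only.
struct StepTrace {
  int termination_offers = 0;  // times the step offered termination
  int barrier_entries = 0;     // times the step entered a barrier sync
};

// Toy model of a marking step with an extra "is_serial" parameter.
// When is_serial is true, the step neither offers termination nor
// enters the two overflow barriers, so a lone caller cannot block
// waiting for threads that are not running.
StepTrace toy_marking_step(bool overflowed, bool is_serial) {
  StepTrace trace;

  if (!is_serial) {
    ++trace.termination_offers;  // parallel case: coordinate with peers
  }

  if (overflowed) {
    if (!is_serial) {
      ++trace.barrier_entries;   // first sync: all workers stop marking
    }
    // (task-local marking state would be cleared here)
    if (!is_serial) {
      ++trace.barrier_entries;   // second sync: all workers ready to restart
    }
  }
  return trace;
}
```

In this model a serial call that overflows reaches no barrier at all, while a parallel call that overflows enters both.
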
>
> The other changes are for the secondary issues, described above, that 
> were discovered while trying out other possible solutions.
>
> Testing:
> The Lucene test case with serial reference processing (with and 
> without verification); the Lucene test case with parallel reference 
> processing (with and without verification).
> GC test suite with a mark stack size of 1K and 4K, with both serial 
> and parallel reference processing (with and without verification).


