RFR(S): 8009536: G1: Apache Lucene hang during reference processing

John Cuthbertson john.cuthbertson at oracle.com
Tue Mar 12 12:25:51 PDT 2013

Hi Thomas,

Thanks for looking over the changes. Responses inline....

On 3/12/2013 3:13 AM, Thomas Schatzl wrote:
> Hi,
> On Mon, 2013-03-11 at 14:35 -0700, John Cuthbertson wrote:
>> Hi Everyone,
>> Can I have a couple of volunteers review these changes? The webrev can
>> be found at: http://cr.openjdk.java.net/~johnc/8009536/webrev.0/.
> - maybe in concurrentMark.cpp you can now replace
> 992:   reset_marking_state(concurrent() /* clear_overflow */);
> by
> reset_marking_state(true /* clear_overflow */);
> As the whole block is guarded by that. Not sure if it is better, but
> when reading the code with that change you don't need to think what the
> value of concurrent() is.
> Keeping it is as fine.

That's fine. I'm OK making this change. Since Bengt has asked to make 
the change associated with the set_phase() assertion as a separate fix - 
this comment no longer applies - but I'll make the change in the second fix.

BTW the following replies were done before I got what you were looking 
for. I think we're in agreement.

> - I'm slightly confused by using the "serial"
> G1CMKeepAliveAndDrainClosure and G1CMDrainMarkingStackClosure closures
> (that don't synchronize with other threads) in weakRefsWork() at line
> 2465 and 2466. Or at least the comment above it.

Moving the comment about active workers to after instantiation of these 
closure would be a better place. Would it make more sense if the code 
was structured:


if (!processing_is_mt) {
   // Serial reference processing is executed by the VMThread.
   rp->process_discovered_references(&is_alive, &serial_keep_alive, 
&serial_drain, NULL);
} else {
   // Parallel reference processing is executed by the work gang.

I had the code structured like this before and it was changed at the 
request of some reviewers' comments. Personally I preferred the above 
but not strongly enough to discount the comment.

> So, this "is_serial" flag indicates to the do_marking_step() that it
> should not try to synchronize with concurrent mark threads when
> terminating, as there are none.

The is_serial flag to do_marking_step() is to indicate whether the 
executing thread is the VM thread.

> There are still the parallel work gang threads using these "is_serial"
> closures for processing (by the partaskexecutor used in
> process_discovered_references()).

No. In the JNI refs case it is still the VM thread (actually the caller 
of process_discovered_references()) that executes the "serial" closures 
but I think see what you mean.

> Following the code, in the parallel case, if a thread spawned by
> weakRefsWork() fails "fails" due to mark stack overflow, it does not try
> to synchronize with others. Mainly because the work gang it is running
> in (in Workgang::run_task()) that waits for all threads anyway.

No. The "serial" closures are executed by the VM thread. The parallel 
closures (where we want the worker to offer termination and enter the 
sync barriers) are executed by the work gang. I think we just need to 
call out that processing of JNI refs is performed serially.

> (Couldn't this synchronization potentially take a long time as e.g. one
> thread is able to completely finish work using local buffers as others
> are waiting on it?)

Any wait for an event could potentially take a while. That's why there's 
a timer for the termination. We haven't seen it in the marking code though.

> Then, regardless of parallelism, two (or three) caller levels above in
> ConcurrentMarkThread::run(), after the vm thread finishes the safepoint,
> the current code will notice that there has been an overflow during
> remark and do another safepoint.

No. Marking restarts from the "concurrent" phase:

>       int iter = 0;
>       do {
>         iter++;
>         if (!cm()->has_aborted()) {
>           _cm->markFromRoots();
>         }

> (Is there a reason to schedule another safepoint to recover from a
> marking overflow? This involves a back-to-back safepoint. Is this
> cheaper than just retrying in the current safepoint?)

There's no safepoint in the above code. The next safepoint is the next 

> What do you think of renaming that flag? Maybe something like
> "synchronize_with_cm_threads". Unfortunately I cannot come up with a
> better, more generic name, as do_marking_step() does exactly that, and
> that's what the change wants to avoid (well, it has the flag the other
> way around).

It's not synchronize_with_cm_threads - more like is_VM_thread and I 
think that is a bit too specific. Also the work gang in question is the 
CM work gang during the concurrent phase of marking. During remark we 
use the work gang from the G1 heap. The reason for this is that, by 
default, the number of marking threads is a fraction of the parallel GC 
threads (for example on my workstation it's 1 for 4 GC threads).

> The problem seems to be that do_marking_step() as is now is too much
> geared towards use in the concurrent marking threads, it uses a lot of
> details from it. And the current code reuses do_marking_step() although
> it's not running in the marking threads.

I agree - though I would say it's more geared for use by a work gang. 
These closures were originally parallel reference processing only and 
the original serial closures operated directly on the marking stack. We 
wanted to use them in the serial code because we were starting to see 
more overflows from the serial reference processing code. Using them in 
the serial case should reduce the frequency of overflows as 
do_marking_step() primarily uses the CMTask's local task queue and the 
global marking stack is used as backing store when the local queue 

The problem with do_marking_step() being parallel-centric has always 
been there - we just never noticed it. Consider the remark case when 
parallel GC threads = 0. Even though the remark code sets the number of 
workers in the terminator and the barrier syncs (using set_phase) - and 
we don't hang - an overflow in do_marking_step() would assert when the 
VM thread entered the first barrier sync.

> So another option could be to extract the concurrent marking thread
> specific stuff (e.g. termination) from do_marking_step() into a helper
> class/interface and pass a suitable implementation for each of its uses.
> Instead of the many _cm-> references in do_marking_step, call
> appropriate methods of that. But that would be quite some additional
> work... (and I didn't think it through, just a thought)

Definitely not for this change. :)

> Anyway, could the comment above these lines changed as the current
> comment indicates (to me) that the closures are run by the vmthread.
> E.g. instead of
> 2463:// Serial (i.e executed by VMThread) instances of the 'Keep Alive'
> 2464:// and 'Complete GC' closures.
> use something like:
> // These closures do not need to synchronize with concurrent marking
> // threads as this code is either run serially by this thread (e.g.
> // processing_is_mt is false which uses the VM thread), or the worker
> // gang tasks which have their own synchronization.

The current thread _is_ the VM thread. I guess a better comment would be:

// These closures do not need to offer termination or enter the 
synchronization barriers in do_marking_step().

Actually an amalgamation of both comments would be best. And I still 
think some restructuring in weakRefsWork() would make it more obvious.

> - one other minor thing in concurrentMark.cpp:
> 4317:    bool finished = (is_serial ? true
>                                 :_cm->terminator()->offer_termination(this));
> Maybe change this to
> bool finished = (is_serial ||
> _cm->terminator()->offer_termination(this));
> While this is no different, I actually had to pause and think what the
> intent of the use of the "?" operator was here :) It seems easier to
> read and more common to to not use the "?" operator in such cases.

Sure. Done.



More information about the hotspot-gc-dev mailing list