RFR (M): 6672778: G1 should trim task queues more aggressively during evacuation pauses
thomas.schatzl at oracle.com
Wed Apr 11 11:46:37 UTC 2018
I updated and (hopefully) improved the change a bit after some more
Particularly I thought that tracking the partial trim time in the
closures would be confusing and too intrusive, so I moved it out from
them to the G1ParScanThreadState.
This also removed quite a few otherwise necessary includes to
Also the time accounting has been a bit messy as you needed to
add/subtract the trim time in various places that were non-obvious. I
tried to improve that by avoiding (existing) nested timing (remembered
set cards/remembered set code roots and updateRS/scanHCC), which made
the code imho much easier to follow.
These two changes also make the follow-up "JDK-8201313: Sub-phases of
ext root scan may be larger than the sum of individual timings"
Note that to gather the timings the code uses Tickspan for holding
intermediate values, i.e. basically nanoseconds. Unfortunately G1
logging uses seconds encoded as doubles everywhere for historical
reasons; there is a really huge change fixing this coming, but for now
more and more places are going to use Tickspan.
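
To illustrate the accounting described above, here is a minimal sketch of
keeping intermediate timings as nanosecond spans and converting to seconds
as doubles only when logging. It uses std::chrono as a stand-in for
Tickspan; PhaseTimes, without_trim and to_seconds are hypothetical names
for illustration, not identifiers from the change.

```cpp
#include <cassert>
#include <chrono>

// Stand-in for Tickspan: an integral nanosecond span (assumption, not HotSpot code).
using Nanos = std::chrono::nanoseconds;

// Hypothetical per-phase record: total phase time plus the part spent trimming.
struct PhaseTimes {
  Nanos total{0};
  Nanos trim{0};

  // Phase time with the partial trim time taken out, computed in one place
  // instead of adding/subtracting it at several call sites.
  Nanos without_trim() const {
    assert(trim <= total);
    return total - trim;
  }
};

// Convert to seconds as a double only at the logging boundary.
inline double to_seconds(Nanos span) {
  return std::chrono::duration<double>(span).count();
}
```

Keeping the subtraction on the integral type and converting once at the
end avoids accumulating floating-point rounding across many small spans.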
On Mon, 2018-04-09 at 13:20 +0200, Thomas Schatzl wrote:
> Hi all,
> I am happy to finally bring this, one of the oldest G1 issues we
> have, to a happy ending :)
> So until now G1 buffered all oop locations it encountered during root
> scanning (including from remembered sets and refinement queues) in
> per-thread work queues, and only drained them at the very end of the
> evacuation pause.
> I am not completely sure why this has been implemented this way, but
> it has serious drawbacks:
> - the work queues and overflow stacks may use a lot of memory, and I
> mean *a lot*
> - since we buffer all oop references, the prefetching G1 does goes to
> waste as G1 always prefetches (during root scan) without following up
> on it, wasting memory bandwidth.
> Other GCs already employ this technique, so my best guess why G1 did
> not so far is that G1 needs sub-timings for the various phases to get
> prediction right, and even if doing timing is cheap, doing it too
> often just adds up.
> Anyway, this problem has been solved by implementing a hysteresis:
> start trimming the work queues at a threshold higher than the one at
> which trimming stops, and time the trimming in between. So the timing
> measurement overhead gets distributed across many work queue
> Note that I did not do much testing about the optimal hysteresis
> the suggested guess of 2xGCDrainStackTargetSize seems to be a pretty
> good value (i.e. reduces overhead well enough).
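
The hysteresis described above can be sketched as follows. This is a
simplified illustration under stated assumptions, not the G1 code:
TrimmingQueue and its members are hypothetical names, tasks are plain
ints, "processing" a task just removes it, and drain_target plays the
role of GCDrainStackTargetSize with the start threshold set to twice
that value.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-thread work queue with hysteresis trimming.
class TrimmingQueue {
  std::size_t _stop_threshold;    // trimming ends at this size
  std::size_t _start_threshold;   // trimming starts above this size (2x stop)
  std::vector<int> _tasks;
  std::chrono::nanoseconds _trim_time{0};
  std::size_t _timed_trims = 0;   // number of timer start/stop pairs taken
  std::size_t _processed = 0;

  void trim_partially() {
    // One timing measurement covers the whole batch of processed tasks.
    auto start = std::chrono::steady_clock::now();
    while (_tasks.size() > _stop_threshold) {
      _tasks.pop_back();          // placeholder for evacuating one reference
      ++_processed;
    }
    _trim_time += std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - start);
    ++_timed_trims;
  }

public:
  explicit TrimmingQueue(std::size_t drain_target)
    : _stop_threshold(drain_target), _start_threshold(2 * drain_target) {
    assert(_start_threshold >= _stop_threshold);
  }

  void push(int task) {
    _tasks.push_back(task);
    if (_tasks.size() > _start_threshold) {
      trim_partially();
    }
  }

  std::size_t size() const { return _tasks.size(); }
  std::size_t timed_trims() const { return _timed_trims; }
  std::size_t processed() const { return _processed; }
  std::chrono::nanoseconds trim_time() const { return _trim_time; }
};
```

With a drain target of 4, the ninth push crosses the start threshold of 8
and a single timed trim processes five tasks, so the timer is read once
per batch rather than once per task.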
> Results are pretty good: I have seen reductions of the maximum task
> queue size by multiple orders of magnitude (i.e. less memory usage
> during GC), and reduction of total pause time by up to 50%,
> particularly on larger applications in the few GB heap range where
> quite a bit of data is copied around every gc.
> But also smaller applications and applications doing less copying
> benefit quite a bit, reducing pause times significantly.
> Note that there is a known, actually pre-existing bug with buffering
> references (by the now obsolete and removed BufferingOopClosure): the
> sum of timings for the sub-phases of ext root scan may be larger than
> the printed total. This will be addressed with the follow-up JDK-
> 8201313 to keep this change small, and it's a pre-existing issue
> hs-tier 1-3, perf tests