RFR (M): 8077144: Concurrent mark initialization takes too long
mikael.gerdin at oracle.com
Mon Mar 14 15:37:35 UTC 2016
I had an IM discussion with Thomas around some issues with the design of
the aggregation and verification code.
G1LiveDataClosureBase should become a utility class instead of a base
class. It's weird that the G1VerifyLiveDataHRClosure embeds another
HRClosure and calls doHeapRegion on it; it would be better if the
mark_marked_during_marking and mark_allocated_since_marking methods
could be called right away.
mark_marked_during_marking should return the marked bytes instead and
let the caller take care of mutating the heap region and doing the yield
check. There is currently a bug where the verification code calls
add_to_marked_bytes on the HeapRegion. There is also an issue with how
add_to_marked_bytes is called on HumongousContinues regions since
marked_bytes is aggregated in each iteration and then the aggregate is
added to the current hr.
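
To make the suggested split concrete, here is a minimal sketch of the idea,
not the actual HotSpot code. The Region struct and the aggregate() and
verify() callers are hypothetical stand-ins; only the names
mark_marked_during_marking and add_to_marked_bytes come from the code under
review. The helper only computes and returns the marked bytes, so the
verification path cannot mutate the region by accident.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a heap region; not the HotSpot HeapRegion class.
struct Region {
  std::vector<std::size_t> live_object_sizes;  // sizes of marked objects, in bytes
  std::size_t marked_bytes = 0;

  void add_to_marked_bytes(std::size_t bytes) { marked_bytes += bytes; }
};

// Utility helper: computes the marked bytes of one region and returns them.
// The region is taken by const reference, so the helper cannot mutate it.
std::size_t mark_marked_during_marking(const Region& r) {
  std::size_t bytes = 0;
  for (std::size_t size : r.live_object_sizes) {
    bytes += size;
  }
  return bytes;
}

// Aggregation caller (hypothetical): the only place that mutates the region.
void aggregate(Region& r) {
  r.add_to_marked_bytes(mark_marked_during_marking(r));
}

// Verification caller (hypothetical): compares, never mutates.
bool verify(const Region& r) {
  return r.marked_bytes == mark_marked_during_marking(r);
}
```

With this shape, the yield check would also live in the aggregation caller,
and the value added to a region is the per-region result rather than a
running aggregate.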
It might be a good idea to hold off on reviews until the updated webrev
On 2016-03-14 14:15, Thomas Schatzl wrote:
> Hi all,
> could I have reviews for this from-scratch solution for the problem
> that G1 startup takes too long?
> Current G1 uses per-mark thread liveness mark bitmaps that span the
> entire heap to be ultimately able to create information about areas in
> the heap where there are any live objects on a card basis.
> This information is needed for scrubbing remembered sets later.
> Basically, in addition to updating the previous bitmap required for
> SATB, the marking threads also, for every live object, mark all bits
> corresponding to the area the object covers on a per thread basis on
> these per-thread liveness mark bitmaps.
> During the remark pause, this information is aggregated into (two)
> global bitmaps ("Liveness Count Data"), then in the cleanup pause
> augmented with some more liveness information, and then used for
> scrubbing the remembered sets.
> The main problems with that solution:
> - the per-mark thread data structures take up a lot of space. E.g. with
> 64 mark threads, this data structure has the same size as the Java
> heap. Now, when you need to use 60 mark threads, the heap is big. And
> at those heap sizes, needing that much more memory hurts a lot.
> - management of these additional data structures is costly: it takes a
> long time to initialize and regularly clear them. The increased
> startup time has actually been the cause for this issue.
> - it takes a significant amount of time to aggregate this data in the
> remark pause.
> - it slows down marking: the combined bitmap update (the prev bitmap
> and these per-thread bitmaps) is slower than doing these phases
> separately.
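
The footprint claim in the first bullet can be checked with a little
arithmetic. The sketch below assumes one liveness bit per 8 bytes of heap
(an assumption on my part; it is the granularity that makes 64 bitmaps add
up to the heap size). The function names are made up for illustration.

```cpp
#include <cstdint>

// Assumption: one mark bit covers 8 bytes of Java heap.
constexpr std::uint64_t kHeapBytesPerBit = 8;
constexpr std::uint64_t kBitsPerByte = 8;

// Size in bytes of one bitmap that spans the entire heap.
constexpr std::uint64_t bitmap_bytes(std::uint64_t heap_bytes) {
  return heap_bytes / kHeapBytesPerBit / kBitsPerByte;
}

// Total size in bytes when every mark thread owns one such bitmap.
constexpr std::uint64_t per_thread_bitmaps_bytes(std::uint64_t heap_bytes,
                                                 unsigned mark_threads) {
  return bitmap_bytes(heap_bytes) * mark_threads;
}
```

Under that assumption each bitmap is 1/64 of the heap, so a 64 GiB heap with
64 mark threads needs another 64 GiB just for the per-thread bitmaps.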
> This proposed solution removes the per-thread additional mark bitmaps,
> and recreates this information from the (complete) prev bitmap in an
> extra concurrent phase after the Remark pause.
> This can be done since the prev bitmap does not change after Remark any
> more.
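
As a rough illustration of what such a rebuild phase computes, the sketch
below derives card-level liveness from a list of marked objects. It is not
the code in the webrev: LiveObject, build_card_liveness and the flat
std::vector<bool> are hypothetical, and the 512-byte card size is an
assumption. A real implementation would walk the prev bitmap and ask each
marked object for its size instead of taking a ready-made list.

```cpp
#include <cstddef>
#include <vector>

// Assumption: 512-byte cards.
constexpr std::size_t kCardSize = 512;

// Hypothetical description of one marked object, as byte offsets from the
// heap base.
struct LiveObject {
  std::size_t start;
  std::size_t size;
};

// Marks every card that is at least partially covered by a live object.
std::vector<bool> build_card_liveness(const std::vector<LiveObject>& marked,
                                      std::size_t heap_bytes) {
  std::vector<bool> card_live((heap_bytes + kCardSize - 1) / kCardSize, false);
  for (const LiveObject& obj : marked) {
    if (obj.size == 0) {
      continue;  // nothing to cover
    }
    const std::size_t first = obj.start / kCardSize;
    const std::size_t last = (obj.start + obj.size - 1) / kCardSize;
    for (std::size_t c = first; c <= last && c < card_live.size(); ++c) {
      card_live[c] = true;
    }
  }
  return card_live;
}
```

Because the input no longer changes after Remark, a pass like this can run
concurrently and needs no per-thread copies of the liveness data.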
> In total, this separation of the tasks is faster (lowers concurrent
> cycle time) than doing this work at once for the following reasons:
> - I did not observe any throughput regressions with this change:
> actually, throughput of some large applications even increases with
> that change (not taking into account that you could increase heap size
> now since not so much is taken up by these additional bitmaps).
> - the concurrent phase to prepare for the next marking is much
> shorter now, since we do not need to clear lots of memory any more.
> - the remark pause can be much faster (I have measurements of a
> decrease of an order of magnitude on large applications, where this
> aggregation phase dominates the remark pause).
> - startup time and footprint naturally decrease significantly,
> particularly on large systems.
> As a nice side-effect, the change in effect removes a significant
> amount of LOC.
> There is a follow-up change to move (and later clean up) the still
> remaining data structures required for scrubbing into extra classes,
> since they will be used more cleverly in the future (JDK-8151386).
> There will be another follow-up change without CR yet to fix the use of
> an excessive amount of parallel gc threads for clearing the liveness
> count data.
> The change is based on JDK-8151614, JDK-8151126 (I do not think it
> conflicts with that actually), and JDK-8151534 (array allocator
> jprt, vm.gc, kitchensink, some perf benchmarks