RFR (M): 8077144: Concurrent mark initialization takes too long

Mikael Gerdin mikael.gerdin at oracle.com
Mon Mar 14 15:37:35 UTC 2016


Hi,

I had an IM discussion with Thomas around some issues with the design of 
the aggregation and verification code.

G1LiveDataClosureBase should become a utility class instead of a base 
class. It's awkward that G1VerifyLiveDataHRClosure embeds another 
HRClosure and calls doHeapRegion on it; it would be better if the 
mark_marked_during_marking and mark_allocated_since_marking methods 
could be called directly.
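Roughly what I have in mind (stand-in types only; the names and fields 
here are illustrative sketches, not the actual webrev):

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-ins for the HotSpot types under review; fields and
// behavior are illustrative only.
struct HeapRegion {
  size_t _marked_bytes = 0;
  size_t _live_bytes   = 0;  // what scanning the prev bitmap would find
  void add_to_marked_bytes(size_t bytes) { _marked_bytes += bytes; }
};

// Suggested shape: a plain utility class whose methods return the marked
// bytes, leaving mutation of the HeapRegion (and any yield checks) to
// the caller.
class G1LiveDataHelper {
public:
  size_t mark_marked_during_marking(HeapRegion* hr) {
    return hr->_live_bytes;  // the real code would walk the mark bitmap
  }
};

struct G1VerifyLiveDataHRClosure {
  G1LiveDataHelper _helper;
  bool do_heap_region(HeapRegion* hr) {
    // Verification calls the helper directly instead of embedding
    // another HRClosure and forwarding doHeapRegion() to it; it only
    // inspects the result and never calls add_to_marked_bytes().
    size_t marked = _helper.mark_marked_during_marking(hr);
    (void)marked;  // a real verifier would compare this to expected data
    return false;  // false = keep iterating, as in HotSpot closures
  }
};
```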

mark_marked_during_marking should instead return the marked bytes and 
let the caller take care of mutating the heap region and doing the yield 
check. There is currently a bug where the verification code calls 
add_to_marked_bytes on the HeapRegion. There is also an issue with how 
add_to_marked_bytes is called on ContinuesHumongous regions: 
marked_bytes is aggregated across the iterations, and the running 
aggregate is then added to the current hr, so the earlier regions' bytes 
get counted again.
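To illustrate the ContinuesHumongous issue (again with a stand-in 
HeapRegion type, not the actual patch), compare the buggy aggregation 
with the suggested return-the-bytes shape:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for the HotSpot type.
struct HeapRegion {
  size_t _marked_bytes = 0;
  size_t _live_bytes   = 0;
  void add_to_marked_bytes(size_t bytes) { _marked_bytes += bytes; }
};

// Buggy shape: marked_bytes is aggregated over all iterations and the
// running total is added to the current region, double-counting the
// bytes of every earlier region.
size_t total_marked_buggy(std::vector<HeapRegion>& regions) {
  size_t marked_bytes = 0;
  for (HeapRegion& hr : regions) {
    marked_bytes += hr._live_bytes;        // aggregate keeps growing
    hr.add_to_marked_bytes(marked_bytes);  // BUG: running total, not delta
  }
  size_t sum = 0;
  for (HeapRegion& hr : regions) sum += hr._marked_bytes;
  return sum;
}

// Suggested shape: the helper returns the bytes for one region and the
// caller adds exactly that amount.
size_t total_marked_fixed(std::vector<HeapRegion>& regions) {
  size_t total = 0;
  for (HeapRegion& hr : regions) {
    size_t bytes = hr._live_bytes;  // per-region result
    hr.add_to_marked_bytes(bytes);
    total += bytes;
  }
  return total;
}
```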

It might be a good idea to hold off on reviews until the updated webrev 
is out.

/Mikael

On 2016-03-14 14:15, Thomas Schatzl wrote:
> Hi all,
>
>    could I have reviews for this from-scratch solution for the problem
> that G1 startup takes too long?
>
> Current G1 uses per-mark thread liveness mark bitmaps that span the
> entire heap to be ultimately able to create information about areas in
> the heap where there are any live objects on a card basis.
> This information is needed for scrubbing remembered sets later.
>
> Basically, in addition to updating the previous bitmap required for
> SATB, the marking threads also, for every live object, mark all bits
> corresponding to the area the object covers on a per thread basis on
> these per-thread liveness mark bitmaps.
>
> During the remark pause, this information is aggregated into (two)
> global bitmaps ("Liveness Count Data"), then in the cleanup pause
> augmented with some more liveness information, and then used for
> scrubbing the remembered sets.
>
> The main problems with that solution:
>
> - the per-mark thread data structures take up a lot of space. E.g. with
> 64 mark threads, these data structures have the same size as the Java
> heap. And when you need that many mark threads, the heap is big, so at
> exactly those heap sizes the additional memory hurts a lot.
>
> - management of these additional data structures is costly: they take a
> long time to initialize and to clear regularly. The increased startup
> time is what actually prompted this issue.
>
> - it takes a significant amount of time to aggregate this data in the
> remark pause.
>
> - it slows down marking: the combined bitmap update (the prev bitmap
> plus these per-thread bitmaps) is slower than doing these phases
> separately.
>
> This proposed solution removes the per-thread additional mark bitmaps,
> and recreates this information from the (complete) prev bitmap in an
> extra concurrent phase after the Remark pause.
>
> This can be done since the Prev bitmap does not change after Remark any
> more.
>
> In total, this separation of the tasks is faster (lowers concurrent
> cycle time) than doing this work at once for the following reasons:
>
>    - I did not observe any throughput regressions with this change:
> actually, throughput of some large applications even increases with it
> (not taking into account that you could now increase the heap size,
> since much less memory is taken up by these additional bitmaps).
>
>    - the concurrent phase to prepare for the next marking is much
> shorter now, since we do not need to clear lots of memory any more.
>
>    - the remark pause can be much faster (I have measurements showing a
> decrease of an order of magnitude on large applications where this
> aggregation phase dominates the remark pause).
>
>    - startup time and footprint naturally decrease significantly,
> particularly on large systems.
>
> As a nice side effect, the change removes a significant amount of LOC.
>
> There is a follow-up change to move (and later clean up) the still
> remaining data structures required for scrubbing into extra classes,
> since they will be used more cleverly in the future (JDK-8151386).
>
> There will be another follow-up change without CR yet to fix the use of
> an excessive amount of parallel gc threads for clearing the liveness
> count data.
>
> The change is based on JDK-8151614, JDK-8151126 (I do not think it
> conflicts with that actually), and JDK-8151534 (array allocator
> refactoring).
>
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8077144
> Webrev:
> http://cr.openjdk.java.net/~tschatzl/8077144/webrev.2/
> Testing:
> jprt, vm.gc, kitchensink, some perf benchmarks
>
> Thanks,
>    Thomas
>


More information about the hotspot-gc-dev mailing list