[G1GC] Evacuation failures with bursts of humongous object allocations
thomas.schatzl at oracle.com
Mon Nov 9 08:34:22 UTC 2020
On 05.11.20 23:49, Charlie Gracie wrote:
> We have been investigating an issue with G1GC and bursts of short-lived
> humongous object allocations. Normally, the application's humongous
> object allocation rate is about 1 humongous region between GCs. Occasionally,
> the humongous allocation rate climbs to 600 or more regions between two GC cycles
> and consumes 100% of the free regions. The subsequent GC has no free regions for
> to-space and not even a single object can be evacuated. Since to-space is exhausted
> immediately, the GC is extremely long due to dealing with evacuation failures. The
> workload is running on JDK 11 but we have been able to reproduce it on JDK 16 builds.
> About 1/40 GCs are impacted by these bursts of humongous allocations.
> Below is an example of a GC running on JDK 11 when the burst of humongous
> allocations happens, along with an example of the rest of the GCs.
> It seems like -XX:G1ReservePercent is the recommended way to tune for humongous
> object allocations. Is this correct?
> We could tune around this behaviour by increasing
> G1ReservePercent and the heap size, but since this happens rarely the JVM would be over-
> provisioned most of the time. This is an ok work-around, but I am hoping we can make
> G1GC more resilient to bursts of humongous object allocations.
You are probably missing "... that are short-lived" in your
description. Otherwise the suggested workaround does not... work.
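(For readers following along: G1 classifies an allocation as humongous when it is at least half a heap region, and such objects go directly into contiguous free regions instead of eden. A minimal sketch of that size arithmetic; the 4 MiB region size is an assumption for illustration, the real size is picked by ergonomics or -XX:G1HeapRegionSize.)

```java
// Illustrative only: the size arithmetic G1 uses to classify an
// allocation as "humongous" (at least half a heap region).
public class HumongousDemo {
    static boolean isHumongous(long objectBytes, long regionBytes) {
        // G1 treats an object of >= half a region as humongous.
        return objectBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long regionSize = 4L * 1024 * 1024; // assume 4 MiB regions
        // A 2 MiB array already qualifies here; 1 MiB does not.
        System.out.println(isHumongous(2L * 1024 * 1024, regionSize)); // true
        System.out.println(isHumongous(1L * 1024 * 1024, regionSize)); // false
    }
}
```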
> What we are experiencing seems related to JDK-8248783  and I have been
> prototyping changes that may resolve one of their issues as well. My approach is to
> force a GC during the slow allocation path if the number of free regions is about to
> drop below the threshold needed to complete the next GC cycle. The check is inserted
> into the slow path for regular objects
Why regular objects too? Maybe to completely obsolete G1ReservePercent
for this purpose?
> and humongous objects. In my current prototype 
> the G1 slow allocation path will only allow a free region to be consumed:
> if (((ERC / SR) + ((SRC * TSR) / 100)) <= (FRC - CR))
This looks like: "do a gc if the amount of free space after allocation
is less than the currently expected amount of survivors".
A few notes/problems on that suggested formula, which seems fairly ad hoc:
- For the first term (eden regions), G1 already provides a more accurate(?)
survivor prediction using survivor rate predictors (see
G1SurvRateGroup); one use is in G1Policy::predict_eden_copy_time_ms(),
another in G1Policy::predict_bytes_to_copy.
Note that this value is not adjusted for allocation bursts of the kind
you describe.
I think for eden regions that prediction is fairly okay, and better than
some random value like SR.
- There is currently no good prediction for survivor regions; the use of
TSR for survivors seems kind of... random.
TSR is the target occupancy of survivor space that the collector aims
for when computing the tenuring threshold, in order to limit copy costs. It
has not much to do with actual survival rates. Objects live and die
regardless of that value - they may simply take space in old instead of survivor.
If you have ever looked at the expected survivor size output, in my
experience this value is typically way too high. That is probably why it
happens to work well with your application.
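For clarity, TargetSurvivorRatio feeds the tenuring-threshold calculation roughly like this (a simplified sketch of the age-table logic, not the actual HotSpot code):

```java
// Simplified sketch: walk the age table from the youngest age up and
// stop once the cumulative survivor bytes would exceed
// TargetSurvivorRatio percent of survivor capacity. Objects older than
// the resulting threshold get tenured into old regions - which is why
// TSR bounds survivor-space usage but says nothing about how many
// objects actually die.
public class TenuringSketch {
    static int computeTenuringThreshold(long[] bytesPerAge,
                                        long survivorCapacity,
                                        int targetSurvivorRatio,
                                        int maxThreshold) {
        long target = survivorCapacity * targetSurvivorRatio / 100;
        long total = 0;
        int age = 0;
        while (age < bytesPerAge.length && total + bytesPerAge[age] <= target) {
            total += bytesPerAge[age];
            age++;
        }
        return Math.min(age + 1, maxThreshold);
    }

    public static void main(String[] args) {
        long[] ageTable = {30 << 20, 20 << 20, 10 << 20}; // bytes per age
        // Survivor capacity 64 MiB, TSR=50 -> 32 MiB budget.
        System.out.println(computeTenuringThreshold(ageTable, 64L << 20, 50, 15));
    }
}
```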
Note that G1 does track survivor rates for survivor regions too, but in
my experience it is not good: survivor rate tracking assumes that objects
within a region are of approximately the same age, which does not hold
for survivor regions. Survivor region objects are typically jumbled
together from many different ages, which completely violates that assumption.
(Assuming that object age is a good indicator for death rate).
There were some attempts in the past by me to improve that, but they were
not completed, mostly due to testing time: more tracking tends to add
more code to the inner copy loop (which typically requires lots of
performance testing).
- Potential survivors from old regions are missing completely. I presume
in your case these were not interesting because (likely) this
application mostly does short-lived humongous allocations?
> ERC - eden region count
> SR - SurvivorRatio
> SRC - survivor region count
> TSR - TargetSurvivorRatio
> FRC - free region count
> CR - number of free regions required for allocation
> Using this algorithm significantly improves G1GCs handling of bursts of humongous
> object allocations. I have not measured any degradations to "normal" workloads we
> run, but that may not be a representative set. In theory, this should only impact workloads
> that consume more humongous regions than G1ReservePercent between GC cycles.
Basically the intent is to replace G1ReservePercent for this purpose,
making it automatic, which is not a bad idea at all.
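Spelled out with the legend from the mail, the proposed gate looks roughly like this (an illustrative transliteration of the prototype's condition, not the actual patch):

```java
// Illustrative transliteration of the proposed check: only allow free
// regions to be consumed if, after the allocation, enough free regions
// remain to absorb the predicted survivors of the next GC.
public class AllocationGate {
    static boolean allowAllocation(long edenRegionCount,      // ERC
                                   long survivorRatio,        // SR
                                   long survivorRegionCount,  // SRC
                                   long targetSurvivorRatio,  // TSR
                                   long freeRegionCount,      // FRC
                                   long regionsRequested) {   // CR
        long predictedSurvivorRegions =
            edenRegionCount / survivorRatio
            + survivorRegionCount * targetSurvivorRatio / 100;
        return predictedSurvivorRegions <= freeRegionCount - regionsRequested;
    }

    public static void main(String[] args) {
        // 400 eden regions, SR=8, 8 survivor regions, TSR=50,
        // 80 free regions, a 10-region humongous request:
        // predicted survivors = 400/8 + 8*50/100 = 54 <= 80-10 = 70.
        System.out.println(allowAllocation(400, 8, 8, 50, 80, 10)); // true
    }
}
```

If the condition is false, the prototype would trigger a GC from the slow path instead of handing out the regions.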
One problem I can see in this situation: what if that GC does not
free the humongous objects' memory? Is the resulting behavior better than
before, and in which situations is(n't) it?
And, is there anything that can be done to speed up evacuation failure?
:) Answering my rhetorical question: very likely - see the issues with
evacuation failure collected under the gc-g1-pinned-regions label
lately, in particular JDK-8254739.
So it would be interesting to see the time distribution for evacuation
failure (gc+phases=trace) and occupancy distribution of these failures.
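That phase detail is available via unified logging; an illustrative -Xlog selection (tags, decorators, and file name are just examples):

```shell
# Log general GC info plus per-phase trace detail to a file.
java -Xlog:gc*,gc+phases=trace:file=gc-trace.log:time,level,tags MyApp
```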
> I am curious about what other people think of the behaviour we are seeing and the
> solution I am experimenting with. Any feedback would be greatly appreciated.
Hth a bit,
>  - https://bugs.openjdk.java.net/browse/JDK-8248783
>  - https://github.com/charliegracie/jdk/tree/humongous_regions
>  - Example of a bad GC during the burst humongous object allocations
> GC(468) Pause Young (Prepare Mixed) (G1 Humongous Allocation)
> GC(468) Age table with threshold 15 (max threshold 15)
> GC(468) To-space exhausted
> GC(468) Pre Evacuate Collection Set: 0.2ms
> GC(468) Prepare TLABs: 0.2ms
> GC(468) Choose Collection Set: 0.0ms
> GC(468) Humongous Register: 0.2ms
> GC(468) Evacuate Collection Set: 30.1ms
> GC(468) Post Evacuate Collection Set: 253.3ms
> GC(468) Evacuation Failure: 249.1ms
> GC(468) Eden regions: 404->0(64)
> GC(468) Survivor regions: 8->0(69)
> GC(468) Old regions: 182->594
> GC(468) Humongous regions: 686->2
> GC(468) Pause Young (Prepare Mixed) (G1 Humongous Allocation) 10225M->4755M(10240M) 285.057ms
>  Regular GC from the same log for comparison.
> GC(465) Pause Young (Normal) (G1 Evacuation Pause)
> GC(465) Age table with threshold 15 (max threshold 15)
> GC(465) - age 1: 21586848 bytes, 21586848 total
> GC(465) - age 2: 7962712 bytes, 29549560 total
> GC(465) - age 3: 1033216 bytes, 30582776 total
> GC(465) - age 4: 4710920 bytes, 35293696 total
> GC(465) - age 5: 716064 bytes, 36009760 total
> GC(465) - age 6: 2387064 bytes, 38396824 total
> GC(465) - age 7: 2331208 bytes, 40728032 total
> GC(465) - age 8: 321680 bytes, 41049712 total
> GC(465) - age 9: 4974056 bytes, 46023768 total
> GC(465) - age 10: 106488 bytes, 46130256 total
> GC(465) Pre Evacuate Collection Set: 0.0ms
> GC(465) Evacuate Collection Set: 16.0ms
> GC(465) Post Evacuate Collection Set: 1.2ms
> GC(465) Other: 1.3ms
> GC(465) Eden regions: 494->0(537)
> GC(465) Survivor regions: 5->7(63)
> GC(465) Old regions: 182->182
> GC(465) Humongous regions: 1->1
> GC(465) Pause Young (Normal) (G1 Evacuation Pause) 5454M->1512M(10240M) 18.704ms
More information about the hotspot-gc-dev mailing list