RFC: Adaptively resize heap at any GC/SoftMaxHeapSize for G1

Thomas Schatzl thomas.schatzl at oracle.com
Wed Jun 10 09:31:00 UTC 2020


Hi all, Liang,

   after a few months of busy working in the area of G1 heap resizing 
and ultimately SoftMaxHeapSize support, I am fairly okay with a first 
preview of these associated changes. So I would like to ask for feedback 
on the current changes for what I intend to complete in the (early) 
jdk16 timeframe.

This is not a request for review of the changes for pushing, although 
feedback on the code is also appreciated.

From my point of view only tuning a few heuristics and code polishing 
is left to do as the change seems to do what it is intended to do.

In particular it would be nice if Liang Mao, the original requestor of 
all this functionality, could help with feedback on his loads. :)

Just to recap: Sometime around the end of last year, Liang posted 
review(s) with functionality for:
- concurrent uncommit of memory
- implementing SoftMaxHeapSize by uncommitting free memory

That did not work well in some cases, so we agreed on us at Oracle 
taking over. Today I would like to talk about the progress on the second 
part :)

The original proposal did not work well because it did not really change 
how G1 resized the heap - i.e. SoftMaxHeapSize related changes to the 
heap were quickly undone by regular heap expansion because it was too 
aggressive for several reasons (e.g. bugs like JDK-8243672, 
JDK-8244603), uncooperative (JDK-8238686) and never actually helped 
shrinking or keeping a particular heap size.

This resulted in lots of unnecessary heap changes even on known constant 
load.

After some analysis after fixing these issues (at least internally ;)) I 
thought that for G1 to keep a particular heap size G1 needs to have an 
element in its heap sizing control loop that pushes back on (excessive) 
heap expansion.

The best approach I thought of has been to introduce a *lower* 
GCTimeRatio that G1 tries to stay *above* by resizing the heap. 
Effectively, G1 then tries to stay within ["LowerGCTimeRatio", 
GCTimeRatio] for its actual gc time ratio.

That works out fairly well actually, and today I thought that the code, 
while still heavily in development (and it does still look like that :)), 
is in a state where it could be provided for gathering feedback on more 
loads from you.

First, here is how to try and use the changes, before going into the 
details and questions I have:

This is a series of patches, which I put up on cr.openjdk.java.net, that 
need to be applied on recent trunk:

These are the ones already out for review:

1) JDK-8243672: http://cr.openjdk.java.net/~tschatzl/8243672/webrev.1/
2) JDK-8244603: http://cr.openjdk.java.net/~tschatzl/8244603/webrev/

These are in the pipeline and not "fully complete" yet:

3) JDK-8238163: http://cr.openjdk.java.net/~tschatzl/8238163/webrev/ 
(optional)
4) JDK-8238686: http://cr.openjdk.java.net/~tschatzl/8238686/webrev/
5) JDK-8238687: http://cr.openjdk.java.net/~tschatzl/8238687/webrev/
6) JDK-8236073: http://cr.openjdk.java.net/~tschatzl/8236073/webrev/

All of the above: 
http://cr.openjdk.java.net/~tschatzl/8236073/webrev.preview/

What these do:

(1) and (2) make the input variables to the control loop more 
consistent. Since they are out for review, I would defer to the review 
threads for them.

(3) stabilizes IHOP calculation a bit, trying to improve uncommon 
situations. This change is optional.

(4) fixes the issue with resizing at Remark being totally disconnected 
from actual load, causing some erratic expansions and shrinks.
After some time tinkering with that I decided to remove resizing at 
Remark - since we check heap size at every gc anyway, this is not 
required any more (although it also delays uncommit to the next gc).

(5) is the main change that implements what has been mentioned above: 
G1 tries to keep actual GC time ratio within the range of 
LowerGCTimeRatio and GCTimeRatio. As long as actual GC time ratio is 
within this range, no action occurs. As soon as it finds that there is a 
trend of being outside, it tries to correct for that, internally trying 
to reach an actual gc time ratio in the middle of that range.
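
A minimal sketch of the range check described in (5), not the code from 
the patch: the class, enum and method names are hypothetical, and the two 
bounds are simply passed in as numbers. It only shows the idea of "no 
action within the range, otherwise aim for the middle of the range".

```java
// Hypothetical illustration; names are not taken from the G1 sources.
final class GcTimeRatioRange {
    enum Position { BELOW, WITHIN, ABOVE }

    final double lower; // the "LowerGCTimeRatio" bound
    final double upper; // GCTimeRatio

    GcTimeRatioRange(double lower, double upper) {
        if (lower > upper) {
            throw new IllegalArgumentException("lower bound exceeds upper bound");
        }
        this.lower = lower;
        this.upper = upper;
    }

    // Where the actual gc time ratio lies relative to the range.
    // WITHIN means no resizing action is needed.
    Position classify(double actualGcTimeRatio) {
        if (actualGcTimeRatio < lower) return Position.BELOW;
        if (actualGcTimeRatio > upper) return Position.ABOVE;
        return Position.WITHIN;
    }

    // The value a correction aims for: the middle of the range.
    double target() {
        return (lower + upper) / 2.0;
    }
}
```

For example, with bounds 6 and 12 an actual ratio of 9 is WITHIN, and a 
correction from outside the range would aim for 9.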

(6) implements SoftMaxHeapSize on top of that, trying to steer IHOP so 
that G1 does not use more than that. (I.e. a complete mess of 
potentially conflicting goals ;)

What I would like to ask you is to try out these changes on your load(s), 
and potentially report back with at least

gc*,gc=debug,gc+ergo+heap=trace

logging.

Of course more feedback about how it works for you is even better, and 
if you are adventurous, maybe try tuning (internal) knobs a bit, which 
I'll describe in a minute :)

As mentioned, the changes are not complete; here's what I think should 
still be tuned a bit, and what I expect helps. The interesting method is 
G1HeapSizingPolicy::resize_amount_after_young_gc().

- determining the desired gc time ratio range: there is a new (likely 
temporary) option G1MinimumPercentOfGCTimeRatio that determines the 
lower gc time ratio described above as percentage of the original 
GCTimeRatio. Currently set at 50%, which seems a good value as a too 
tight range will cause lots of resizing (which might be good), and a too 
large range will effectively disable shrinking (which also might be 
desired).
Either way, this value works fairly well so far in my tests. Suggestions 
very appreciated.
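
As a worked example of the arithmetic only (a sketch, not the patch 
code): the helper below is hypothetical, its second parameter stands in 
for G1MinimumPercentOfGCTimeRatio, and the GCTimeRatio value of 12 in the 
note after it is just an example input.

```java
// Hypothetical helper illustrating how the lower bound follows from a percentage.
final class LowerGcTimeRatio {
    // percentOfGcTimeRatio plays the role of G1MinimumPercentOfGCTimeRatio.
    static double lowerBound(double gcTimeRatio, int percentOfGcTimeRatio) {
        if (percentOfGcTimeRatio < 0 || percentOfGcTimeRatio > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        return gcTimeRatio * percentOfGcTimeRatio / 100.0;
    }
}
```

With GCTimeRatio = 12 and 50%, lowerBound(12, 50) gives 6, i.e. a range 
of [6, 12]. A percentage closer to 100 narrows the range (more resizing), 
one closer to 0 widens it.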

- detection of being outside of the expected gc time ratio range: this 
somewhat works as before, separating short term and long term behavior.

Long term behavior: every X gcs without a heap resize, g1 checks whether 
the long term gc time ratio is outside of the bounds and, if so, reacts. 
I think this is fairly straightforward.

Short term behavior: tracks the number of times short term gc time ratio 
exceeds the bounds in a single variable, incrementing or decrementing it 
depending on whether current gc time ratio is above or below the gc time 
ratio bounds. If that value exceeds certain thresholds, do something.

There is a new bias towards expansion at startup to make g1 react faster 
at that time, and some decay towards "no action to be taken" if for a 
"long" time nothing happens.

I reused the same values for "short" time (+/-4) and "long" (10) as 
before, they seem to be okay.
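
The short term tracking can be sketched roughly as follows. This is an 
illustration under assumptions, not the patch code: the class is 
hypothetical, the counter is reset once a trend is reported, reaching 
+/-4 (rather than strictly exceeding it) counts as a trend, and the 
startup bias and the decay mentioned above are left out.

```java
// Hypothetical sketch of a single-variable short term trend counter.
final class ShortTermTrend {
    static final int THRESHOLD = 4; // the "+/-4" short term value

    private int counter = 0;

    // position: negative if the current gc time ratio is below the bounds,
    // positive if above, zero if within.
    // Returns -1 or +1 once a trend in that direction is detected, else 0.
    int record(int position) {
        counter += Integer.signum(position);
        if (counter >= THRESHOLD) {
            counter = 0; // assumption: start over after reacting
            return 1;
        }
        if (counter <= -THRESHOLD) {
            counter = 0;
            return -1;
        }
        return 0;
    }
}
```

Because a single variable is used, observations on opposite sides of the 
range cancel each other out, so only a sustained deviation in one 
direction triggers a reaction.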

- actual resizing: expansion is supposed to be the same as before, 
relatively aggressive, which I intend to keep.

Shrinking is based on the number of free regions at the moment. This is 
not optimal because e.g. you do not want to shrink below what is needed 
for current eden (and the survivors of the next gc).

Other than that it is bounded by a percentage of the number of free 
regions (G1ShrinkByPercentOfAvailable). That results in some heap size 
undershoot in some cases (i.e. temporarily uncommitting a bit too much), 
but in my tests it hasn't been too bad.

Still rather (too) simple, expect some tunings and changes particularly 
here, deviating a bit more from the expansion code.
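
A minimal sketch of the shrink bound described above, assuming the limit 
is simply a rounded-down percentage of the currently free regions. The 
class and method names are hypothetical; only the role of 
G1ShrinkByPercentOfAvailable comes from the description.

```java
// Hypothetical sketch: cap a shrink request by a percentage of free regions.
final class ShrinkBound {
    // shrinkByPercentOfAvailable plays the role of G1ShrinkByPercentOfAvailable.
    static int maxRegionsToShrink(int freeRegions, int shrinkByPercentOfAvailable) {
        if (freeRegions < 0 || shrinkByPercentOfAvailable < 0
                || shrinkByPercentOfAvailable > 100) {
            throw new IllegalArgumentException("invalid arguments");
        }
        // Assumption: round down to whole regions.
        return (int) ((long) freeRegions * shrinkByPercentOfAvailable / 100);
    }

    // Never shrink by more than the bound, and never by a negative amount.
    static int regionsToShrink(int wantedRegions, int freeRegions,
                               int shrinkByPercentOfAvailable) {
        int bound = maxRegionsToShrink(freeRegions, shrinkByPercentOfAvailable);
        return Math.max(0, Math.min(wantedRegions, bound));
    }
}
```

Note that this bound knows nothing about the regions needed for the 
current eden and the survivors of the next gc, which is exactly the 
weakness pointed out above.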

Comments and ideas in this area, especially ones based on your 
workloads, are particularly appreciated.

Another big area not yet really tested is interaction with JEP 346: 
Promptly Return Unused Committed Memory from G1, but I am certain that 
with it you can reduce heap usage a lot (too much?).

My performance (throughput) tests so far look almost always encouraging: 
20-30% less heap size with statistically insignificant throughput 
changes. There are some exceptions; in these cases you lose 10% of 
throughput for about 90% less heap usage.

The only really bad results come from tests that try to find the maximum 
throughput of g1 by incrementally increasing the load until they find 
that it does not work, then slightly backing off and increasing the load 
again to find an "equilibrium". From what I can tell it looks like the 
heap sizing follows the application (i.e. what it's supposed to do), 
making the application think it's already done while there is still more 
heap available to potentially increase performance (looking at you 
specjbb2015 out-of-box performance!).

Not yet sure how to counter that, but some decrease in default 
GCTimeRatio to decrease the shrinking aggressiveness (and keeping more 
heap longer) might fix this.

Of course, if you disable this adaptive heap sizing by fixing the heap 
min/max in your benchmarks, there are no differences to before.

One interesting upcoming change is to make MinHeapSize manageable 
(JDK-8224879) to help the algorithm a bit.

As closing words, given that the email is quite long already, thanks for 
your attention and looking forward to feedback :)
If you have questions, please chime in too, I am happy to answer them.

Thanks,
   Thomas

