RFR (M): 8136681: Factor out IHOP calculation from G1CollectorPolicy
jon.masamitsu at oracle.com
Tue Nov 17 19:14:52 UTC 2015
On 11/17/2015 12:14 AM, Thomas Schatzl wrote:
> On Mon, 2015-11-16 at 11:02 -0800, Jon Masamitsu wrote:
>> On 11/16/2015 05:31 AM, Thomas Schatzl wrote:
>>> Hi Jon,
>>> thanks a lot for all these reminders for better documentation. I have
>>> been working on this functionality for so long that "everything is
>>> clear" to me :)
>>> New webrevs with hopefully more complete explanations at:
>>> http://cr.openjdk.java.net/~tschatzl/8136681/webrev.2/ (changes)
>>> On Fri, 2015-11-13 at 07:58 -0800, Jon Masamitsu wrote:
>>>> This is partial. If you send out a second webrev based on Mikael's
>>>> review, I'll finish with that.
>>>>> 1370 // Returns the number of regions the humongous object of the given word size
>>>>> 1371 // covers.
>>>> "Covers" is not quite right, since to me it says that the humongous
>>>> object completely uses the region. I'd use "requires".
>>>> 49 // Update information about recent time during which allocations happened,
>>>> 50 // how many allocations happened and an additional safety buffer.
>>>> // Update information about
>>>> I DON'T KNOW WHICH OF THESE IS MORE PRECISE.
>>>> // Time during which allocations occurred (sum of mutator execution time + GC pause times)
>>>> // Concurrent marking time (concurrent mark end - concurrent mark start)
>>>> // Allocations in bytes during that time
>>>> // Safety buffer ???
>>>> I couldn't figure out what the safety buffer is supposed to be. It seems to
>>>> be the young gen size, but I don't know why.
>>> I tried to explain it in the text. In short, in G1 the IHOP value is
>>> based on old gen occupancy only. The problem is that the young gen
>>> also needs to be allocated somewhere.
>>> Now you could just say: use the maximum young gen size. However, that
>>> is 60% of the heap by default, so the adaptive IHOP algorithm uses a
>>> measure of the young gen size that is not bounded by G1ReservePercent.
>>> The reason for using the unbounded value is that if the code used the
>>> bounded one, it would cancel out with G1ReservePercent: the closer we
>>> get to G1ReservePercent, the smaller that bounded value gets, which
>>> makes the current IHOP value rise, and so on, which delays the
>>> initiation of the marking.
>>> That would end up losing throughput: as the young gen gets smaller
>>> and smaller (and GC frequency increases), it can take a long time until
>>> G1 gets close enough to G1ReservePercent for the other factors
>>> (allocation rate, marking time) to be used.
>>> Basically, initial mark would be delayed until the young gen reaches its
>>> minimum size, at which point G1 would continue to use that young gen
>>> size until marking occurs. This means that G1 would typically eat into
>>> G1ReservePercent, which we also do not want.
>>> Additionally, it would get G1 into more trouble with regard to pause
>>> time, giving it less room during that time.
>> I think I understand the issue of using an unbounded young gen, but
>> what precisely is meant by "measure of the young gen"? By measure,
>> do you mean you used the size of the young gen from recent young-only
>> collections?
> As in "measurement". Yes, this uses the size of the young gen from a
> recent young-only collection, taken when the decision whether to start
> marking or not is made.
> That size is expected to be on the large side compared to the young gen
> sizes during the marking that follows.
>>> Unfortunately G1 is in two minds about this, i.e. used() for humongous
>>> objects does not include the "waste" at the end of the last region, but
>>> used() for regular regions does.
>> So occupancy is not useful and you use free regions + free in the current
>> old allocation regions?
> I was referring to the fact that G1's used() for humongous regions seems
> to return an incorrect value.
> That's why, when adding the allocation information for humongous regions
> (around line 1018 in g1CollectedHeap.cpp), the change first calculates
> the size in regions and multiplies it by the full region size, instead of
> using something like used().
>>>> Have you given much thought to what effect a to-space exhaustion should
>>>> have on IHOP? I understand that it is not the design center for this, but
>>>> I think that to-space exhaustion can confuse the statistics. Maybe reset
>>>> the statistics and drop the IHOP to a small value (current heap occupancy
>>>> or even 0) until you get 3 successful marking cycles. Or think about it
>>>> later.
>>> I already thought a little about how this should interact with the
>>> calculation: the idea is basically that the algorithm will notice
>>> that there is a significant amount of additional allocation, and will
>>> lower the IHOP threshold automatically. (Looking through the code I
>>> think I saw some problems here; I will see about fixing those.)
>> The increased allocation would be the allocation from compacting
>> all the live data into regions during the full GC. That certainly should
> No, from converting survivor/eden regions into old regions. That will
> result in a huge bump in the allocation rate (if the evac failure has been
> serious, i.e. a lot of failed evacuations), which means that the IHOP
> will be lower in the future.
Ah, yes. Evacuation failure does not necessarily mean a full GC.
> If the evacuation failure has not been so serious, the impact is
> certainly smaller.
> The impact of evacuation failures is already much smaller than before,
> and there are a few more things that could be done to make it even
> smaller.
> The only drawback is that this decrease in IHOP is potentially too small
> to avoid the next evac failure/full GC. None of our predictions can
> handle long-term cyclic occurrences (like a significant, brief 30s spike
> in promotion rate once a day), so I do not see that as a particular
> issue.
I've mentioned before that I would consider a policy that started
the next marking cycle immediately. It's a simple policy (no
confusion about whether the decrease in IHOP was enough) and
provides the best effort to avoid an undesirable situation. I won't
bug you with that again. :-)
> That's something that the user needs to tune out at this time (CMS would
> not be able to handle this either).
Patch looks good. Reviewed.
>> make the IHOP drop, but that seems rather weakly related to the
>> actual IHOP needed to avoid promotion failure. It's hard for me
>> to see how that is going to scale (i.e., it seems complicated to
>> use something like the live data size as input to the modeling
>> of IHOP). I'd start with something really simple, but if you're
>> comfortable with it, that's fine.
>>> If an evacuation failure happens in a GC during marking, there are a few
>>> options; I have not decided on what's best:
>>> - do nothing, because the user asked to run an impossible-to-manage
>>> workload (combination of live set, allocation rate, and other options).
>>> - there is already some log output from which that information can be
>>> derived.
>>> - allow the user to set a maximum IHOP threshold. He could base this
>>> value on the log messages he gets.
>>> - the user can already do that by increasing G1ReservePercent btw
>>> - make sure that marking completes in time
>>> - let the mutator threads (or the young GCs that run while marking is
>>> in progress) do some marking if we notice that we do not have enough
>>> time. Not sure if it is worth the effort.
>>> - fix some bugs in marking :) that prevent that in extraordinary
> - another option would be to just start doing mixed GCs, as G1 is able to
> do that without any completed marking (even during marking), hoping that
> this will yield enough space.
>>> - make sure that we always start marking early enough by making sure
>>> that mixed GCs reclaim enough memory. Planning some work here as part of
>>> the general work on improving GC policy.
>> I like this one above.
>> Thanks for the extra explanations.
> Thanks for the discussion.