8214155: GCHeapLog inconsistents with CollectorCounters

Thomas Schatzl thomas.schatzl at oracle.com
Thu Nov 22 09:59:18 UTC 2018


Hi Yasumasa,

On Wed, 2018-11-21 at 21:58 +0900, Yasumasa Suenaga wrote:
> Hi all,
> 
> We can see number of GC events via jcmd VM.info and
> PerfCounter.print. However they hold different values.
> (See JDK-8214155 on JBS [1] for more details)

  during bug triage (making sure bugs are understandable and in general
ready to be worked on) we had several questions with the description of
the problem - you might have noticed slight editing already.

In particular it is unclear to us what the problem and scope of this
issue is.

So from what I understand (and there has already been some discussion
even about this :)), the issue is that the number of GC events
("invocations") in the hs_err log does not match the number of jstat
events.

It seems that there is something wrong here, but maybe it is not
(only?) an issue of wrong functionality, but missing expected
functionality.

Note that since this functionality is ages-old, none of us in the
Oracle gc team can give you answers why the output is the way it is -
probably just a mixture of "just forgot that functionality" (for this
one), "nobody cared until now" and similar reasons.

So if we are interested in making the output a bit more useful we might
want to take a step back and look at a slightly larger context.

So the first observations just when looking at the output given in the
description are:

- the sun.gc.collector.2.name label "G1 stop-the-world phases" seems to
be just wrong. Looking at
src/jdk.jcmd/share/classes/sun/tools/jstat/resources/jstat_options, the
comment for sun.gc.collector.2.* and the actual code indicates that
this should actually be called something like "G1 concurrent (stop-the-
world) phases".

- the description in the CR does not mention why you expect that the
event counts in the hs_err file should match the jshdb output, and what
it is used for.

To my understanding, the "GC events" in the hs_err log print *heap
sizes* before and after actual collections at the moment. I.e. this is
log tracing heap changes, not a GC pause log.

Some digging showed that the given count ("36") is derived from the
number of total collections that modify the heap; in G1 that is
incremented during incremental collections, full gc *and* the remark
pause (because during cleanup the heap does not change, but I think the
intent is to capture the concurrent cycle as a single "concurrent
collection").

This is the reason why the count is 36 and not 52. (i.e. 18 + 2 + 32 /
2)

I do not mind changing all this, but all this needs to be considered
when reporting issues to make it clear what the problem is and give
everyone a heads up on what the problem to be solved here. Particularly
the CR misses reasoning why the "invocations" count should be 52 here
(remember, this is counting heap changes, not GC events).

[...]
> I believe this issue should be fixed for troubleshooters. Of course,
> they can use `jhsdb jsnap` to check the counters. But I guess most of
> them believe contents in hs_err log.

Looking at the problem it might be useful to understand what these
troubleshooters need.

My opinion as basis for further discussion: the current "last gc
events" is almost useless for me at the moment, as it misses some
crucial information:

- what kind of pause this event is looking at

- not sure it is useful to notify about remark and cleanup pause
separately in the sun.gc.collector.2 statistics - isn't it sufficient
to have a concurrent cycles count there? (idk about any dependencies).

- maybe split up the "invocations" (which should be renamed as it is
confusing) into counts for the respective pauses (i.e. incremental=XX
full=YY remark=ZZ cleanup=AA). Something like "pauses" would be
appropriate.

- (given above investigation it would really be nice to have the
Cleanup pause included in the count if we do not think just "concurrent
cycles" is good enough)

- maybe do not make the text so focused on heap sizes (but still
include it). I can imagine the text can be made a lot more useful than
it is now with some input of the community.

- some recent change might have introduced an issue: looking at
g1ConcurrentMark.cpp:1359 where the "total_collection" count is
increased there is some comment about races which I do not understand.
However that count might be more appropriately increased in the Cleanup
pause. Or not at all there, but rather at the end of the concurrent
cycle (and isn't this a kind of "full" collection?).

What would be the compatibility implications of all that?

> 
> Do you have any idea for this issue?

Before proceeding I would like to understand the issue and in
particular the proposed solution, and if you have any further goals in
this area or if this is everything you want to do, a bit better.

As mentioned, only fixing the count in the gc event won't help me much
as "troubleshooter" though.

Thanks,
  Thomas




More information about the hotspot-gc-dev mailing list