Long "stop-the-world" pauses in CMS GC mode
Y. S. Ramakrishna
y.s.ramakrishna at oracle.com
Tue Aug 30 09:51:08 PDT 2011
On 08/30/11 02:52, David Tavoularis wrote:
> Hi Ramki, Zden�k,
> Thank you for your valuable answers.
> /> So the pause is associated not with //GC work itself (which is
> correctly reported as 202 ms), but rather with a /
> /> preamble to the GC, perhaps //with bringing threads to a safepoint, I
> am guessing./
> I will ask to add -XX:+PrintSafepointStatistics. What are the expected
> outputs ? Will it be in GC logs or in stdout ?
To stdout i believe. But with a latest JVM these data (which are
batched into a record of several entries written out together) should have
a timestamp column associated with each safepoint operation
which will allow alignment of the data wrt the GC log events in the
GC logs even, though the two split off into different i/o streams.
> /> you have a 48 core machine but you have handicapped yr JVM by forcing
> > which forces single--threaded remarks on such a huge heap./
> I will ask to remove it and let you know.
Thanks. I am guessing it must be "legacy" from an (much) earlier time when
there were bugs in the cms parallel remark.
> /> If "ref proc" turns out to be still a problem, you can then enable
> +ParallelRefProcEnabled to parallelize that sub-phase as well./
> I will not activate -XX:+ParallelRefProcEnabled, because according to
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7028845, it is broken
> in Java6u25 and fixed in Java6u27.
Ah yes, good catch; sorry to've not remembered as i had fixed
that problem a while back during JDK 7 development.
> /> I have heard apocryphal stories of -XX:+UseMembar having worked to
> get rid of //overly long safepointing pauses,/
> /> and I have heard -XX:-UseBiasedLocking for pauses associated //with
> bulk bias revocations./
> Good to know, but I won't use them until I get more info from
> -XX:+PrintSafepointStatistics and a new analysis after removing
> />> The only suspicious thing is "[Times: user=5.67 sys=0.02, real=59.52
> secs]", which means that the "real" duration is a lot higher than "user"
> CPU time.
> >> Because "sys" duration is low, it also means that the server is not
> /> this can happen when the machine is overloaded. And as for swapping,
> I think it is not involved in the sys time because these times are times
> of the application thread./
> In my experience, when server is swapping, the "sys" time duration is
> increasing a lot.
> I can confirm that there is no high CPU load on the server (max CPU
> usage is 30% in the last 7 days) and no disk swapping (according to
> vmstat "sr"="scan rate" metrics).
> According to Ramki, I need to understand the reason of slow safepoint
Right; sounds like a plan.
More information about the hotspot-gc-use