RFR: 8167190: Remove confusing timestamps from the gc log

Erik Helin erik.helin at oracle.com
Fri Oct 7 13:11:25 UTC 2016

On 2016-10-07, kirk at kodewerk.com wrote:
> Hi Erik,
> IMHO, UL over-reached when it decided to decorate log entries with timing information. My first reaction to this is that I generally would never rely on the timings provided by a logging framework if they were important, for a number of reasons, of which I would include skidding. We actually make attempts to measure skidding, though it can only be an estimate in the current JVM implementations, as this is a good indicator of what should be tuned: the JVM, the application, or the operating environment (OS, etc.). Our current technique for making this estimate necessarily includes safepoint timings. Although the calculations are somewhat insensitive to noise, it would not take a lot of skidding to add enough noise to these calculations to diminish their value. In my experience, you are unlikely to see these effects in test environments as they are simply too clean to expose these types of problems. Also, I see the highly fragmented nature of the GC log as a bigger issue than the call to gather the timing information. This is an issue I cited when the format was first being worked on. IMHO Azul has a better logging format, and we offered an example of a high-performance journaling (logging) implementation (Peter Lawrey’s Chronicle) that would have supported all the JVM’s logging needs at very high volumes of writes in low latency environments.
> As for the start of the GC, there are two important events. The first is when the GC was called for, the second is when the work actually starts. If the thread is scheduled out between the time of the “start” and the time that UL sets the time stamp, then this can have some impact on how particular problems are viewed. As an example, I just helped tune clusters that each contained several hundred JVMs. Using this type of analysis helped me guide the client away from tuning the collectors in some cases to tuning the operating system. With the logs coming from UL I’m not sure I could have come to that conclusion quite so easily.

It is this part that confuses me. Looking at the code, in
GCTraceTimeImpl::GCTraceTimeImpl, you see:

  if (_enabled) {
    if (log_heap_usage) {
      _heap_usage_before = Universe::heap()->used();
    }
    time_stamp(_start_ticks);
    log_start(_start_ticks.value());
  }

and the code for log_start is:

inline void GCTraceTimeImpl::log_start(jlong start_counter) {
  if (_out_start.is_enabled()) {
    LogStream out(_out_start);

    out.print("%s", _title);
    if (_gc_cause != GCCause::_no_gc) {
      out.print(" (%s)", GCCause::to_string(_gc_cause));
    }
    out.print_cr(" (%.3fs)", TimeHelper::counter_to_seconds(start_counter));
  }
}

So, UL will take the timestamp as part of the call to print_cr on this line:

    out.print_cr(" (%.3fs)", TimeHelper::counter_to_seconds(start_counter));

There aren't going to be a whole lot of instructions between the call to
time_stamp(_start_ticks) in the GCTraceTimeImpl constructor and the call
to out.print_cr() in log_start() (called by the very same
GCTraceTimeImpl). I understand that the two timestamps,
if they are different, will highlight a skid/skew here. However, if one
is concerned about these kinds of skids/skews, then why this particular
time window? There are numerous windows of an equal number of
instructions that we could put timers around to try to detect whether a
thread has been stalled for some reason. I don't understand why this
particular window is so interesting. Detecting any kind of skew/skid
with these two timestamps requires that it happens _exactly_ during one
of the instructions between them, and that doesn't seem very likely
(since the number of instructions should be fairly small).

A much better approach seems to be to take the largest such window, for
example by using the safepoint logging. Using the safepoint and gc log
tags together widens the window to span from the time the threads
initiated the safepoint until the GC starts, which is the largest
possible window for detecting whether a thread got stalled.

Or am I missing something here?


> As for scheduling, any way you slice it, there is always a risk that you’re going to get scheduled out. I know that you should minimize this by avoiding unnecessary calls to functions that might need to travel off chip to retrieve data, but then this is one of the reasons why I feel UL over-reached.
> As for confusion, in my experience teaching people how to read GC logs as part of my performance tuning workshop, I find that the confusion is very short lived. What is confusing in this case is that there are two time stamps, one provided by UL and another included in the log line. Understanding why is an opportunity to further expose how “dirty” most production environments really are.
> Anyways, these are my final arguments for retaining the timings; I’ll let you get to it.
> Kind regards,
> Kirk Pepperdine
> > On Oct 7, 2016, at 1:54 PM, Erik Helin <erik.helin at oracle.com> wrote:
> > 
> > On 2016-10-07, kirk at kodewerk.com wrote:
> >> Hi Erik,
> >> 
> >> These time stamps were put in specifically to guard against skidding between the time of the event and the time that UL will tag the event. This skidding is unlikely to be seen in test environments but we do run into it in production environments. So the comment that it adds no additional information is likely only true in testing.
> > 
> > Hey Kirk, thanks for having a look!
> > 
> > I know about this, that is why I wrote "Almost this exact timestamp is
> > present in the UL decorator". The problem with taking a timestamp for an
> > event (e.g. "start of GC") and the event actually happening exists
> > for the GC internal timestamps as well. The call to Ticks::stamp() that
> > "sets" the _start_counter in GCTraceTimeImpl will results in a call to
> > gettimeofday(2) which very well could cause the process to be scheduled
> > out. Whether that timestamp now is right or wrong depends on when you
> > consider the GC to "actually start".
> > 
> > If you look at the patch, you will see:
> > -    out.print_cr(" (%.3fs)", TimeHelper::counter_to_seconds(start_counter));
> > +    out.cr();
> > 
> > and UL will timestamp the decorator in print_cr (or the call to cr()
> > after my patch). In almost all cases that will result in the exact same
> > value as for start_counter. And in the rare case when those two
> > timestamps differ, then it is my opinion that no timestamp is more
> > "correct" than the other for when the GC actually started (it depends on
> > your definition of "start of GC").
> > 
> > On the other hand, if you want to know for how long the Java application
> > was paused, then you probably want to look at the safepoint logging
> > (available via the safepoint tag).
> > 
> > Given how confusing these timestamps are for everyone reading the log, I
> > don't see any value in keeping them.
> > 
> > Thanks,
> > Erik
> > 
> >> Kind regards,
> >> Kirk
> >> 
> >>> On Oct 7, 2016, at 10:58 AM, Erik Helin <erik.helin at oracle.com> wrote:
> >>> 
> >>> Hi all,
> >>> 
> >>> this patch removes the confusing GC internal timestamps from the GC
> >>> logs. Right now, the log looks like the following (with -Xlog:gc*=info):
> >>> 
> >>> [19.534s][info][gc,start     ] GC(4) Pause Young (G1 Evacuation Pause) (19.534s)
> >>> [19.534s][info][gc,task      ] GC(4) Using 18 workers of 18 for evacuation
> >>> [20.133s][info][gc,mmu       ] GC(4) MMU target violated: 201.0ms (200.0ms/201.0ms)
> >>> [20.133s][info][gc,phases    ] GC(4)   Evacuate Collection Set: 234.2ms
> >>> [20.133s][info][gc,phases    ] GC(4)   Clear Card Table: 0.5ms
> >>> [20.133s][info][gc,phases    ] GC(4)   Free Collection Set: 252.6ms
> >>> [20.133s][info][gc,phases    ] GC(4)   Expand Heap After Collection: 1.7ms
> >>> [20.133s][info][gc,phases    ] GC(4)   Other: 113.6ms
> >>> [20.133s][info][gc,heap      ] GC(4) Eden regions: 1463->0(1476)
> >>> [20.133s][info][gc,heap      ] GC(4) Survivor regions: 75->62(193)
> >>> [20.133s][info][gc,heap      ] GC(4) Old regions: 0->0
> >>> [20.133s][info][gc,heap      ] GC(4) Humongous regions: 16->1
> >>> [20.133s][info][gc,metaspace ] GC(4) Metaspace: 8328K->8328K(1056768K)
> >>> [20.133s][info][gc           ] GC(4) Pause Young (G1 Evacuation Pause) 1554M->62M(1738M) (19.534s, 20.133s) 599.088ms
> >>> [20.133s][info][gc,cpu       ] GC(4) User=4.24s Sys=0.08s Real=0.60s
> >>> 
> >>> This patch focuses on the timestamps on the lines with "Pause Young":
> >>> 
> >>> [19.534s][info][gc,start     ] GC(4) Pause Young (G1 Evacuation Pause) (19.534s)
> >>> [20.133s][info][gc           ] GC(4) Pause Young (G1 Evacuation Pause) 1554M->62M(1738M) (19.534s, 20.133s) 599.088ms
> >>> 
> >>> The timestamp at the first line, (19.534s), is one (of several) internal
> >>> timestamps from the GC code. Almost this exact timestamp is present in
> >>> the UL decorator (the timestamp with [19.534s]). Furthermore, the GC
> >>> timestamp is often confused for being the pause time. The GC internal
> >>> timestamps are also present on the second "Pause Young" line, now as a
> >>> (begin, end) tuple. Again, this information is already available via the
> >>> UL decorators.
> >>> 
> >>> Since these GC internal timestamps add no additional information (but
> >>> quite a lot of confusion), this patch removes them.
> >>> 
> >>> Webrev:
> >>> http://cr.openjdk.java.net/~ehelin/8167190/00/
> >>> 
> >>> Bug:
> >>> https://bugs.openjdk.java.net/browse/JDK-8167190
> >>> 
> >>> Testing:
> >>> - JPRT (A few tests that parsed the GC logs had to be updated)
> >>> 
> >>> Thanks,
> >>> Erik
> >> 

More information about the hotspot-gc-dev mailing list