A hotspot patch for stack profiling (frame pointer)

Volker Simonis volker.simonis at gmail.com
Fri Dec 5 22:50:52 UTC 2014

Yes, that's clear. I didn't want to propose using "jstack -F"
directly. I just wanted to say that it's possible for an external tool
to get a "reasonably good" stack trace out of a JVM process at any
time, and "jstack -F" can be taken as a template for how to do that.

That said, I still don't know how perf creates stack traces. Does it
attach to the process with ptrace or how else does it inspect the
stacks after a performance counter event?

On Fri, Dec 5, 2014 at 8:34 PM, Staffan Larsen
<staffan.larsen at oracle.com> wrote:
> Just to note that the implementation of “jstack -F” is not at all suitable for profiling since it has a very high overhead (it attaches a debugger to the process).
> /Staffan
>> On 5 dec 2014, at 20:22, Volker Simonis <volker.simonis at gmail.com> wrote:
>> Hi Brendan,
>> I still don't understand who is taking the actual stack traces (let
>> alone the symbols) in your examples. Is this done by 'perf' itself
>> based only on the frame pointer?
>> As I wrote before, this is pretty hard to get right for a JVM, but
>> there are good approximations. Have you looked at the 'jstack' tool
>> which is part of the JDK? If you run it on a Java process, it will
>> give you exact stack traces with full inlining information. However
>> this only works at safepoints so it is probably not suitable for
>> profiling with performance counters. But you can also use 'jstack -F
>> -m' which gives you a 'best effort' mixed Java/C++ stack trace (most
>> of the time even with inlined Java frames). This is probably the best
>> you can get when interrupting a running JVM at an arbitrary point in
>> time. As you mentioned in one of your blogs, the VM can be in the
>> C library or even in the kernel at that time, neither of which
>> preserves the frame pointer. So it will already be hard to even walk
>> up to the first Java frame.
>> But nevertheless, if the output of 'jstack -F -m' is "good enough" for
>> your purpose, you can implement something similar in 'perf' or a
>> helper library of 'perf' and be happy (I don't actually know how perf
>> takes stack traces but I suppose there may be some kind of callback
>> mechanism for walking unknown frames). This is actually not so hard.
>> I've recently implemented a "print_native_stack()" function within
>> hotspot itself (you can call it for example from gdb during debugging
>> - see http://hg.openjdk.java.net/jdk9/dev/hotspot/rev/86183a940db4).
>> Maybe you could call this function directly from 'perf' if perf
>> attaches with ptrace to the process (I assume it does or how else
>> could it walk the stack)?
>> These were just some random thoughts with the hope that they may be helpful.
>> Regards,
>> Volker
>> PS: by the way - the flame graphs look really impressive and it would
>> be really nice to have something like this for Java.
>> On Thu, Dec 4, 2014 at 11:55 PM, Brendan Gregg
>> <brendan.d.gregg at gmail.com> wrote:
>>> G'Day,
>>> I've hacked hotspot to return the frame pointer, in part to see what this
>>> involves, and also to have a working prototype for analysis. Along with an
>>> agent to resolve symbols, this has allowed full stack profiling using Linux
>>> perf_events. The following flame graphs show the resulting profiles.
>>> A mixed mode CPU flame graph of a vert.x benchmark (click to zoom):
>>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-vertx.svg
>>> Same thing, but this time disabling inlining, to show more frames:
>>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-flamegraph.svg
>>> As expected, performance is worse without inlining. You can compare the
>>> flame graphs side by side to see why. Less time spent doing work / I/O!
>>> https://github.com/brendangregg/Misc/blob/master/java/openjdk8_b132-fp.diff
>>> is my patch, and currently only works for x86-64. It removes RBP from the
>>> register pools, and inserts "mov(rbp, rsp)" into two function prologues. It
>>> is also unsupported: use at your own risk. I'm not a veteran hotspot
>>> engineer, so chances I messed something up are high.
>>> I'd love to be able to enable frame pointers in Oracle JDK, eg, with an
>>> -XX:+NoOmitFramePointer option. It could be put under
>>> -XX:+UnlockDiagnosticVMOptions or -XX:+UnlockExperimentalVMOptions. So long
>>> as we had some way to turn it on. If someone wants to include (improve,
>>> rewrite) my patch, please do.
>>> I don't have much perf data yet, but on the vert.x microbenchmark it looked
>>> like returning the frame pointer cost 2.6% performance. I hope that's
>>> somewhat worst-case for production workloads. (I was also able to recover
>>> the 2.6% by fine tuning other options, so were this a production change, I'd
>>> be hoping not to regress performance at all.)
>>> We've discussed this before
>>> (http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-October/thread.html#15939).
>>> The Solaris-assisted approach that Serguei Spitsyn described (JDK-6617153)
>>> should work very well. The JVM can run as-is, full stacks can be generated
>>> on-demand, and symbols should always be correct.
>>> The frame pointer approach costs a little performance, and only shows
>>> partial stacks after inlining (unless you disable inlining, but that can
>>> cost >40% performance). There is the other issue Volker Simonis mentioned as
>>> well, where some stacks may not be profiled correctly. And, if you are
>>> unlucky, symbols can move during the profile, so any static perf-map-agent
>>> map will translate some incorrectly (I've considered developing a way to
>>> detect this, and highlight such frames as dubious).
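To make the static-map concern concrete, here is a minimal sketch of an address lookup, assuming the plain-text /tmp/perf-<pid>.map format that perf reads (one "START SIZE symbol" entry per line, with start and size in hex). The function name lookup_symbol and the pid and address in the usage comment are made up for illustration, and this is not how perf itself is implemented. Because the file is only a snapshot, an address that the JIT has since reused for different code would still resolve to the old name.

```sh
#!/bin/sh
# Hypothetical helper: resolve a hex address against a perf map file.
# Assumed line format: "<hex start> <hex size> <symbol name>".
# The body runs in a subshell, so "exit" only leaves the function.
lookup_symbol() (
  map_file=$1
  addr=$((0x${2#0x}))              # accept the address with or without "0x"
  while read -r start size name; do
    [ -n "$size" ] || continue     # skip blank or malformed lines
    begin=$((0x$start))
    end=$((begin + 0x$size))
    if [ "$addr" -ge "$begin" ] && [ "$addr" -lt "$end" ]; then
      printf '%s\n' "$name"
      exit 0
    fi
  done < "$map_file"
  printf '[unknown]\n'
  exit 1
)

# Example use (pid and address are placeholders):
#   lookup_symbol /tmp/perf-12345.map 7f3b2c001a40
```

A linear scan is enough for a sketch; a real tool would sort the entries and search them more efficiently.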
>>> At Netflix we are mostly Java on Linux. Switching to Oracle Solaris for this
>>> feature is going to be a tough sell, especially when the value of full stack
>>> profiling isn't widely understood. I personally think it might be a bit
>>> easier if a -XX:+NoOmitFramePointer option existed, so Linux users can try
>>> the feature, then consider the better Solaris version after gaining solid
>>> experience on why it is so important.
>>> We recently blogged about the value of stack profiling and flame graphs,
>>> http://techblog.netflix.com/2014/11/nodejs-in-flames.html, although this was
>>> for Node.js, which already has frame pointer support.
>>> If anyone wants to try generating these mixed mode CPU flame graphs
>>> themselves (in a test environment!), the first step is to compile OpenJDK 8
>>> b132 with the previous patch, and get that running. Also install the
>>> packages for the "perf" command. The remaining steps would be something
>>> like:
>>> # git clone --depth=1 https://github.com/brendangregg/FlameGraph
>>> # git clone --depth=1 https://github.com/jrudolph/perf-map-agent
>>> # cd perf-map-agent
>>> # export JAVA_HOME=/...
>>> # cmake .
>>> # make
>>> # perf record -F 99 -p `pgrep -n java` -g -- sleep 30
>>> # java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \
>>>     net.virtualvoid.perf.AttachOnce `pgrep -n java`
>>> # perf script > ../FlameGraph/out.stacks
>>> # cd ../FlameGraph
>>> # ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=java > out.svg
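For anyone wondering what the collapse step in the pipeline above hands to flamegraph.pl: it is "folded" stacks, one line per unique stack, with frames joined by ";" from root to leaf and followed by a sample count. Below is a minimal sketch of that folding. The input format is an assumption made for illustration (one frame per line, leaf first, samples separated by blank lines) and is simpler than real "perf script" output, so this is not a replacement for stackcollapse-perf.pl. The function name fold_stacks is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: fold simplified stack samples read from stdin.
# Assumed input: one frame per line, leaf first, blank line between samples.
# Output: "root;...;leaf count" lines, sorted for stable ordering.
fold_stacks() {
  awk '
    # A non-blank line is one frame; prepend it so the root ends up first.
    NF { stack = (stack == "" ? $0 : $0 ";" stack); next }
    # A blank line ends the current sample.
    stack != "" { count[stack]++; stack = "" }
    END {
      if (stack != "") count[stack]++   # last sample may lack a blank line
      for (s in count) print s, count[s]
    }
  ' | sort
}

# Example use (file names are placeholders):
#   fold_stacks < samples.txt > out.folded
```

Identical stacks collapse into a single line with a higher count, which is what determines the width of each box in the resulting flame graph.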
>>> Finally, if you are new to CPU flame graphs, see
>>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html .
>>> Brendan
