A hotspot patch for stack profiling (frame pointer)
volker.simonis at gmail.com
Fri Dec 5 19:22:22 UTC 2014
I still don't understand who is taking the actual stack traces (let
alone the symbols) in your examples. Is this done by 'perf' itself
based only on the frame pointer?
As I wrote before, this is pretty hard to get right for a JVM, but
there are good approximations. Have you looked at the 'jstack' tool
which is part of the JDK? If you run it on a Java process, it will
give you exact stack traces with full inlining information. However
this only works at safepoints so it is probably not suitable for
profiling with performance counters. But you can also use 'jstack -F
-m' which gives you a 'best effort' mixed Java/C++ stack trace (most
of the time even with inlined Java frames). This is probably the best
you can get when interrupting a running JVM at an arbitrary point in
time. As you mentioned in one of your blogs, the VM can be in the
C library or even in the kernel at that time, neither of which preserves
the frame pointer. So it will already be hard to even walk up to
the first Java frame.
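To make the frame pointer problem concrete, here is a minimal Python sketch of a frame-pointer walk over simulated memory. It is not how perf or jstack is implemented; the addresses, the `memory` dict and the function name are made up for illustration. It assumes the usual x86-64 layout in which [rbp] holds the caller's saved RBP and [rbp + 8] holds the return address.

```python
WORD = 8  # bytes per stack slot on x86-64

def walk_frame_pointers(memory, rbp, max_frames=64):
    """Follow the saved-RBP chain and collect return addresses.

    memory is a dict of address -> 64-bit word (a stand-in for the
    sampled process's stack); rbp is the register value at sample time.
    """
    return_addresses = []
    for _ in range(max_frames):
        saved_rbp = memory.get(rbp)
        ret = memory.get(rbp + WORD)
        if saved_rbp is None or ret is None:
            break  # RBP does not point at readable stack memory
        return_addresses.append(ret)
        # The stack grows down, so each caller's frame must sit at a
        # higher address. Anything else (including 0) ends the walk.
        if saved_rbp <= rbp:
            break
        rbp = saved_rbp
    return return_addresses

# Three hypothetical frames; the outermost one stores 0 as its saved RBP.
memory = {
    0x7000: 0x7040, 0x7008: 0x401111,  # innermost frame
    0x7040: 0x7080, 0x7048: 0x402222,
    0x7080: 0x0,    0x7088: 0x403333,  # outermost frame
}

print([hex(a) for a in walk_frame_pointers(memory, 0x7000)])
# ['0x401111', '0x402222', '0x403333']

# If the code being sampled uses RBP as a general-purpose register,
# the register holds an unrelated value and the walk finds nothing.
print(walk_frame_pointers(memory, 0xdeadbeef))
# []
```

The second call is the situation described above: once any frame on the way up does not maintain RBP, the chain is broken and every frame beyond it is lost.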
But nevertheless, if the output of 'jstack -F -m' is "good enough" for
your purpose, you can implement something similar in 'perf' or a
helper library of 'perf' and be happy (I don't actually know how perf
takes stack traces but I suppose there may be some kind of callback
mechanism for walking unknown frames). This is actually not so hard.
I've recently implemented a "print_native_stack()" function within
hotspot itself (you can call it for example from gdb during debugging
- see http://hg.openjdk.java.net/jdk9/dev/hotspot/rev/86183a940db4).
Maybe you could call this function directly from 'perf' if perf
attaches with ptrace to the process (I assume it does or how else
could it walk the stack)?
These were just some random thoughts with the hope that they may be helpful.
PS: by the way - the flame graphs look really impressive and it would
be really nice to have something like this for Java.
On Thu, Dec 4, 2014 at 11:55 PM, Brendan Gregg
<brendan.d.gregg at gmail.com> wrote:
> I've hacked hotspot to return the frame pointer, in part to see what this
> involves, and also to have a working prototype for analysis. Along with an
> agent to resolve symbols, this has allowed full stack profiling using Linux
> perf_events. The following flame graphs show the resulting profiles.
> A mixed mode CPU flame graph of a vert.x benchmark (click to zoom):
> Same thing, but this time disabling inlining, to show more frames:
> As expected, performance is worse without inlining. You can compare the
> flame graphs side by side to see why. Less time spent doing work / I/O!
> is my patch, and currently only works for x86-64. It removes RBP from the
> register pools, and inserts "mov(rbp, rsp)" into two function prologues. It
> is also unsupported: use at your own risk. I'm not a veteran hotspot
> engineer, so chances I messed something up are high.
> I'd love to be able to enable frame pointers in Oracle JDK, eg, with an
> -XX:+NoOmitFramePointer option. It could be put under
> -XX:+UnlockDiagnosticVMOptions or -XX:+UnlockExperimentalVMOptions. So long
> as we had some way to turn it on. If someone wants to include (improve,
> rewrite) my patch, please do.
> I don't have much perf data yet, but on the vert.x microbenchmark it looked
> like returning the frame pointer cost 2.6% performance. I hope that's
> somewhat worst-case for production workloads. (I was also able to recover
> the 2.6% by fine tuning other options, so were this a production change, I'd
> be hoping not to regress performance at all.)
> We've discussed this before
> The Solaris-assisted approach that Serguei Spitsyn described (JDK-6617153)
> should work very well. The JVM can run as-is, full stacks can be generated
> on-demand, and symbols should always be correct.
> The frame pointer approach costs a little performance, and only shows
> partial stacks after inlining (unless you disable inlining, but that can
> cost >40% performance). There is the other issue Volker Simonis mentioned as
> well, where some stacks may not be profiled correctly. And, if you are
> unlucky, symbols can move during the profile, so any static perf-map-agent
> map will translate some incorrectly (I've considered developing a way to
> detect this, and highlight such frames as dubious.)
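As a rough illustration of the symbol translation step mentioned here, the sketch below resolves an address against a perf map in the "START SIZE symbolname" text format (hexadecimal start and size) that perf reads from /tmp/perf-<pid>.map. The map contents and symbol names are invented for the example, and the helper names are hypothetical.

```python
import bisect

def parse_perf_map(text):
    """Parse 'START SIZE symbolname' lines into sorted (start, size, name)."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # the symbol name may contain spaces
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    entries.sort()
    return entries

def resolve(entries, addr):
    """Return the symbol whose [start, start + size) range covers addr."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if addr < start + size:
            return name
    return None  # address is not covered by this snapshot of the map

# Hypothetical snapshot of a map file for JIT-compiled methods.
SAMPLE_MAP = """\
7f0000001000 200 Ljava/lang/String;::hashCode
7f0000002000 80 Lio/example/Handler;::handle
"""

entries = parse_perf_map(SAMPLE_MAP)
print(resolve(entries, 0x7F0000001010))  # Ljava/lang/String;::hashCode
print(resolve(entries, 0x7F0000001200))  # None
```

Because the map is a static snapshot, a method that is recompiled or moved after the snapshot was written will resolve to None or to whatever symbol occupied that range earlier, which is the mistranslation risk described above.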
> At Netflix we are mostly Java on Linux. Switching to Oracle Solaris for this
> feature is going to be a tough sell, especially when the value of full stack
> profiling isn't widely understood. I personally think it might be a bit
> easier if a -XX:+NoOmitFramePointer option existed, so Linux users can try
> the feature, then consider the better Solaris version after gaining solid
> experience on why it is so important.
> We recently blogged about the value of stack profiling and flame graphs,
> http://techblog.netflix.com/2014/11/nodejs-in-flames.html, although this was
> for Node.js, which already has frame pointer support.
> If anyone wants to try generating these mixed mode CPU flame graphs
> themselves (in a test environment!), the first step is to compile OpenJDK 8
> b132 with the previous patch, and get that running. Also install the
> packages for the "perf" command. The remaining steps would be something like:
> # git clone --depth=1 https://github.com/brendangregg/FlameGraph
> # git clone --depth=1 https://github.com/jrudolph/perf-map-agent
> # cd perf-map-agent
> # export JAVA_HOME=/...
> # cmake .
> # make
> # perf record -F 99 -p `pgrep -n java` -g -- sleep 30
> # java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce `pgrep -n java`
> # perf script > ../FlameGraph/out.stacks
> # cd ../FlameGraph
> # ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=java >
> Finally, if you are new to CPU flame graphs, see
> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html .
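For readers wondering what the stack-collapsing step in the recipe above does conceptually, here is a minimal Python sketch. It is not a replacement for stackcollapse-perf.pl: it assumes the samples have already been parsed into lists of frame names (leaf first, as perf prints them), and the frame names are made up.

```python
from collections import Counter

def fold_stacks(samples):
    """Fold sampled stacks into 'root;...;leaf count' lines.

    samples is a list of stacks, each a list of frame names ordered
    leaf first. Identical stacks are merged and counted.
    """
    counts = Counter(";".join(reversed(stack)) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Three hypothetical samples; two of them share the same stack.
samples = [
    ["read", "Handler.handle", "Thread.run"],
    ["read", "Handler.handle", "Thread.run"],
    ["Handler.handle", "Thread.run"],
]

for line in fold_stacks(samples):
    print(line)
# Thread.run;Handler.handle 1
# Thread.run;Handler.handle;read 2
```

Each output line is one distinct stack with its sample count; a flame graph draws the width of each frame in proportion to those counts.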