[RFR]: Per thread IO statistics in JFR
gunter.haug at sap.com
Wed Jan 23 15:59:45 UTC 2019
Hi Alan, David
Here are the results of the measurements. As mentioned before, I had measured some of the standard Java benchmarks and there was no impact. This is probably not surprising to anybody as the overhead up-calls is minuscule in comparison to the calculations that take place and is far below the noise level.
A benchmark that is most prone to show the impact the up-calls would do fast (short) IO operations. The fastest operations I could think of were writing to /dev/null and reading from /dev/zero on Linux or macOS. I wrote benchmarks which write/read just one byte one million times in a loop using the jmh framework (https://openjdk.java.net/projects/code-tools/jmh/).
The results measured on my mac were all over the place and I moved to a otherwise empty Linux host. Results were more stable there. jmh runs the benchmark multiple times for each invocation and calculates an error within one invocation. It turned out that invocations which had a large error showed poorer performance, obviously because something happened on the host in parallel what degraded performance. I invoked the benchmarks multiple times and decided to discard measurements with an error of more than 0.01. This value is rather random but it was achieved by about one out of four invocations and I think we can get rid of the outliers.
It turned out that even with this setup the impact of the up-call is below noise level. In fact, the write benchmark was faster with the up-calls what is not plausible, of course. The read benchmark showed the best and the poorest performance without the up-calls while the results for a JDK with the up-calls (see below for the actual results).
As Volker writes, we still think that the overhead of the up-calls is acceptable but it would be possible to introduce a switch which is based on a system property.
@Thomas, thanks for your input! I think all your suggestions are well funded. I'll reply in more detail if/when we can alleviate the more general concerns.
Write without up-call
Mode Cnt Score Error Units
thrpt 25 1.867 0.005 ops/s
thrpt 25 1.866 0.006 ops/s
thrpt 25 1.865 0.004 ops/s
thrpt 25 1.860 0.010 ops/s
Write with up-call
Mode Cnt Score Error Units
thrpt 25 1.880 0.010 ops/s
thrpt 25 1.884 0.007 ops/s
thrpt 25 1.887 0.004 ops/s
thrpt 25 1.879 0.010 ops/s
Read without up-calls
Mode Cnt Score Error Units
thrpt 25 2.145 0.008 ops/s
thrpt 25 2.145 0.010 ops/s
thrpt 25 2.129 0.010 ops/s
thrpt 25 2.122 0.009 ops/s
Read with up-calls
Mode Cnt Score Error Units
thrpt 25 2.136 0.007 ops/s
thrpt 25 2.132 0.003 ops/s
thrpt 25 2.130 0.006 ops/s
thrpt 25 2.129 0.009 ops/s
On 17.01.19, 18:31, "Volker Simonis" <volker.simonis at gmail.com> wrote:
On Thu, Jan 17, 2019 at 1:31 PM Alan Bateman <Alan.Bateman at oracle.com> wrote:
> On 17/01/2019 07:23, Thomas Stüfe wrote:
> > :
> > Do you object against keeping these counters (which basically boils
> > down to Thread::current->stat_structure->counter++)? Or do you even
> > object against making upcalls into the jvm? Note that, if deemed
> > necessary, we could omit updating the counters unless JFR or our
> > extended thread dumps are activated (which are the consumers of the
> > counters).
> > In any case, I would have assumed the costs for upcall + counter
> > update to be insignificant compared to the IO calls. We should of
> > course measure that.
> > If you generally object upcalls into the libjvm for
> > statistical/monitoring reasons, this would make matters on a number of
> > fronts more complicated. For instance, it was discussed extending NMT
> > coverage to the JDK - which is already in part reality at
> > Unsafe.AllocateMemory - and this would have to be done with upcalls too.
> There are many issues here that will need write-up and discussion, maybe
> a JEP if discussions converge on a proposal to bring into the main line
> as this is a significant change with implications for many areas of the
We could certainly create a new JEP for this feature, but in the end
it is rather trivial. In fact it can be expressed in a single
sentence: instrument all native IO functions to collect the number of
read and written bytes. As stated before, the overhead is minimal
(Gunter will provide some concrete benchmarks soon) and could even be
dropped to ~zero if you really insist on this (e.g. by wrapping all
calls by a switch which is based on a system property). But honestly
speaking we don't think this is really necessary.
I want to stress the fact that this is not a wacky idea Gunter came up
with but a feature which we're using for almost 10 years in our SAP
JVM. It is running in all our enterprise scenarios without ever having
caused any problems.
> It also potentially conflicts in direction with some of the
> other projects in progress (particularly with Loom trying to re-imagine
> threads, do you really want to collect I/O stats on a per thread basis
> in the future???).
Who knows about the future, our trace is quite useful today :) You're
right that this may change when Loom becomes available and I think
we've faced similar problems when people started using ThreadPools
more heavily. We've mitigated this with another SAP JVM specific
feature (which we havn't contributed yet) that we've called "Thread
Annotations". They could be set from user code and our profiler could
then split the collected statistics based on these annotations even if
they were all collected for the same native thread. I image we can
come up with similar solutions for project Loom in the future, once it
> As regards the points to instrument then I think we have to assume that
> much of the native code that is targeted by the current webrev will go
> away or change significantly in the future.
But that's actually nice! That will eventually shrink the code
proposed by this change to a few places and make the costs of
supporting it even smaller. And needless to say that SAP is committed
to support and maintain this feature once it gets integrated.
> We've been on that path for
> some time, e.g. the zip area or the prototype to replace the SocketImpl
> used for classic networking that eliminates a lot of the native code
> touched in that area by the webrev. Once Panama is further along then I
> assume we will want to make use of it in the core libraries and at least
> initially replace the JNI methods that just wrap syscalls today, and
> longer term more significant refactoring.
Again, I think this will make our implementation simpler because we
would have to instrument fewer places.
> My point is that instrumenting
> native methods may not be the right approach, instead maybe we should be
> look at instrumenting the I/O paths at the java level as that will
> likely play better with the VM.
We gathered some experience with Java-level instrumentation during the
last years (e.g. Dynatrace, Wiley, etc) and it has not always been
positive :) The overhead is certainly higher compared to the native
instrumentation at C-level. And it can influence especially JIT
optimizations in an unpredictable way. Finally the Java
instrumentation is version dependent and puts the burden on
instrumenting the "right" places (end especially catching all the
right places) on the developer / support engineer. You could of course
upload this overhead into another, external project like JFR, but
supporting that doesn't come for free either.
> There is some support for collecting I/O
> stats in JFR today and maybe someone working in that area can explain
> that a bit more and what the issues are.
As far as I understand, the corresponding JFR events only sample the
time spent in the instrumented calls while we do an exact IO
statistics with regard to the bytes read and written.
> It's impossible to tell from the mail with the webrev what has been
> explored and not explored. It feels like early stages in a much large
> project that will need a write up of prototypes before a direction can
> be proposed.
As I wrote before, this is a solution we're using successfully since
many years. We're convinced that is is superior to Java-level
instrumentation with regards to both, overhead as well as
More information about the core-libs-dev