Improvement for lowered throughput when allocation tracking is enabled
gabriel.reid at gmail.com
Thu Jul 30 10:47:49 UTC 2020
We recently ran into an issue of significantly reduced throughput in a
workload when running with JFR enabled, specifically related to the
enabling of allocation tracking (i.e. ObjectAllocationInNewTLAB and
ObjectAllocationOutsideTLAB). The workload in question is CPU-bound,
parallel (running on 15 cores), with a high allocation rate, and has
many distinct call stack states. This situation has been encountered
both in OpenJDK 13 and OpenJDK 14 (14.0.1+7), and can also be
reproduced current HEAD of the OpenJDK source tree.
The symptom that we’ve encountered is that throughput is lowered, and
also CPU utilization is lowered when allocation tracking is enabled.
Profiling with java-async-profiler it appears that a significant
amount of time is spent in JfrStackTrace::equals , and the calls to
this method are performed within the context of a mutex, which
explains the lowered CPU utilization. Profiling in wall clock mode
confirms that a significant amount of time is spent waiting on the
mutex in JfrStackTraceRepository::add_trace .
I dug into this a bit deeper, and it appears that this is largely
related to the hashcode calculation performed for JfrStackTrace ,
as it appears to be very susceptible to collisions. Based on testing
with a reduced test workload on 6 cores, I’ve seen that replacing this
hash calculation with a more thorough implementation can nearly
eliminate the impact of allocation tracking (which otherwise results
in a lowering of 15% in throughput for this test workload).
It’s worth noting that the issue doesn’t seem to be with the
distribution of the hash codes over buckets, but more with the reduced
cardinality of the actual hash values. Because the hash code is used
as a short-circuit for equality comparison, having hash collisions
on stack traces with equal height results in a potentially costly
comparison of all frames until a non-matching frame is found.
I’d be very happy to submit a patch for this (as well as provide more
background on my benchmarking if desired), but I wanted to first check
if I was missing something obvious here (i.e. if there’s a specific
reason for the hash calculation to be done as it currently is). Could
you let me know if I’m missing some important details here, or should
I go ahead with submitting a patch?
More information about the hotspot-jfr-dev