RFR (S) 8181143: Introduce diagnostic flag to abort VM on too long VM operations
david.holmes at oracle.com
Mon Nov 19 06:13:53 UTC 2018
First, the synopsis is not accurate:
"Introduce diagnostic flag to abort VM on too long VM operations"
You're not just introducing one diagnostic flag, you're introducing the
entire VM operation timeout mechanism, including two product flags and
one diagnostic. So the CR needs to reflect that clearly and you will
need a CSR request to add the two product flags. And they will need to
Three flags just for this makes me cringe. (Yes it mirrors the safepoint
timeout flags but if that were proposed today I'd have the same reaction.)
On 17/11/2018 2:30 am, Aleksey Shipilev wrote:
> SafepointTimeout is nice to discover long/stuck safepoint syncs. But it is as important to discover
> long/stuck VM operations. This patch complements the timeout machinery with tracking VM operations
> themselves. Among other things, this allows terminating the VM when a very long VM operation is
> blocking progress. High-availability users would enjoy fail-fast JVM -- in fact, the original
> prototype was done at the request of Apache Ignite developers.
> Example with -XX:+VMOperationTimeout -XX:VMOperationTimeoutDelay=100 -XX:+AbortVMOnVMOperationTimeout:
> [3.117s][info][gc,start] GC(2) Pause Young (Normal) (G1 Evacuation Pause)
> [3.224s][warning][vmthread] VM Operation G1CollectForAllocation took longer than 100 ms
> # A fatal error has been detected by the Java Runtime Environment:
> # Internal Error (/home/sh/jdk-jdk/src/hotspot/share/runtime/vmThread.cpp:218), pid=2536, tid=2554
> # fatal error: VM Operation G1CollectForAllocation took longer than 100 ms
It's not safe to access vm_safepoint_description() outside the VMThread
as the _cur_vm_operation could be deleted while you're trying to access
it.
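
To illustrate the hazard, here is a minimal standalone sketch (not HotSpot
code) of one way a watcher thread could read the current operation's name
without dereferencing an object that another thread may delete. Operation,
OperationSlot, begin(), end() and current_name() are hypothetical names
made up for this example; the idea is only that the name is copied out
while a lock is held, so the watcher never keeps a pointer into the
operation.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a VM operation; not HotSpot's VM_Operation.
struct Operation {
    std::string name;
};

// Hypothetical holder for "the operation currently being executed".
class OperationSlot {
    std::mutex lock_;
    std::unique_ptr<Operation> current_;

public:
    // Called by the executing thread when an operation starts.
    void begin(std::string name) {
        std::lock_guard<std::mutex> guard(lock_);
        current_.reset(new Operation{std::move(name)});
    }

    // Called by the executing thread when the operation is done;
    // the operation object is destroyed here.
    void end() {
        std::lock_guard<std::mutex> guard(lock_);
        current_.reset();
    }

    // Safe for any thread: returns a copy of the name taken under the
    // lock, never a pointer or reference into the operation itself.
    std::string current_name() {
        std::lock_guard<std::mutex> guard(lock_);
        return current_ ? current_->name : std::string("(none)");
    }
};
```

A watcher would call current_name() and report the returned copy. Whether
a lock is acceptable on this path in the real VM is a separate question;
the sketch only shows the shape of the problem.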
Initially I thought this might be useful for tracking down excessively
long VM ops, but with a global timeout it can't do that. And a per-op
timeout would be rather tricky to pass through from the command-line
(but easy enough to use once you had it).
And as we don't have a general timer mechanism this has to use polling
so you pay for a 10ms task wakeup regardless of how long the timeout is.
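
The cost can be seen in a small deterministic model of a polling watcher.
This is a sketch under stated assumptions, not the code from the patch:
simulate_polling() and PollResult are hypothetical names, time is
simulated rather than measured, and the watcher is assumed to wake at
fixed intervals and compare elapsed time against the timeout.

```cpp
#include <cassert>
#include <cstdint>

// Outcome of watching one operation with a fixed-interval poller.
struct PollResult {
    bool timed_out;          // did some poll see elapsed time > timeout?
    int64_t wakeups;         // polls performed while the operation ran
    int64_t detected_at_ms;  // time of the detecting poll, or -1 if none
};

// Hypothetical model: the watcher wakes every poll_interval_ms while an
// operation of length op_duration_ms is running, and stops at the first
// poll where the elapsed time exceeds timeout_ms.
PollResult simulate_polling(int64_t op_duration_ms,
                            int64_t timeout_ms,
                            int64_t poll_interval_ms) {
    assert(poll_interval_ms > 0);
    PollResult result{false, 0, -1};
    for (int64_t now = poll_interval_ms; now <= op_duration_ms;
         now += poll_interval_ms) {
        ++result.wakeups;
        if (now > timeout_ms) {
            result.timed_out = true;
            result.detected_at_ms = now;
            break;
        }
    }
    return result;
}
```

In this model, a 50 ms operation watched with a 100000 ms timeout and a
10 ms interval still costs 5 wakeups, and a 100 ms timeout on a long
operation is first noticed at 110 ms, one interval late. The number of
wakeups depends on the polling interval, not on the timeout value.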
Given the limitations of the global timeout I'm not sure I see a use for
the non-aborting form. This could just reduce down to:
otherwise I don't really think this carries its weight. Of course that's
just my opinion. Interested to hear others.
> Testing: hotspot/tier1, ad-hoc tests, jdk-submit (pending)