RFR (S) 8181143: Introduce diagnostic flag to abort VM on too long VM operations
shade at redhat.com
Tue Dec 11 12:05:33 UTC 2018
Getting back to this. Replying to all three reviews in the same mail, because they share the
answers! New patch:
Testing: x86_64 build, hotspot tier1, new test
On 11/20/18 8:07 AM, Thomas Stüfe wrote:
> I don't know whether I need 10ms resolution though - if we limit the
> goal to catching just hanging VMOps - which would still pretty useful
> - 1s would even be enough.
Right. New patch implements variable delay, depending on what actual delay had been requested. New
test makes sure it works reliably.
> I think we can do with just two flags, since VMOperationTimeout and
> VMOperationTimeoutDelay are redundant. Just go with the delay, make -1
> the default that does nothing. Same could go for SafepointTimeout vs
Redid like this:
diagnostic(bool, AbortVMOnVMOperationTimeout, false, \
"Abort upon failure to complete VM operation promptly") \
diagnostic(intx, AbortVMOnVMOperationTimeoutDelay, 1000, \
"Delay in milliseconds for option AbortVMOnVMOperationTimeout") \
range(0, max_intx) \
> I also agree with others that it would make more sense were we to kill
> the VMThread instead (e.g. just with pthread_kill(SIGILL) or similar)
> - we do something like that in error handling (see ErrorLogTimeout).
I looked into that, and it seems too much hassle without significant benefit. For example, it would
require drilling new platform-specific holes to support it, and it also likely to be supported on
Linux (pthreads) only anyway? Seems to me piggybacking on periodic tasks is much simpler. In
development use, you'd have a core file to see what is going on in VM. In "production" use you'd
care that circuit breaker that did its job of terminating the VM.
On 11/19/18 9:35 AM, Robbin Ehn wrote:
> You patch seems not to be against jdk/jdk (jdk12).
Really? The new patch is definitely against jdk/jdk.
> Without the actual core file, I don't see the hs file being very useful
> containing the watcherthread stacktrace. It would be good if the 'killer'
> signaled the VM thread and we do the error reporting from the signal handler
> instead in VM thread context.
Right. Looked into that too, decided it is too complicated without significant benefit. See above.
On 11/19/18 7:13 AM, David Holmes wrote:
> You're not just introducing one diagnostic flag, your introducing the entire VM operation
> timeout mechanism, including two product flags and one diagnostic. So the CR needs to reflect
> that clearly and you will need a CSR request to add the two product flags. And they will need to
> be documented.
This was "solved" by doing two diagnostic flags, see above. I think the symmetry against other
AbortVM* flags is important, and having the boolean flag for AbortVM* is good.
> It's not safe to access vm_safepoint_description() outside the VMThread as the _cur_vm_operation
> could be deleted while you're trying to access its name.
Noted, removed that part.
> And as we don't have a general timer mechanism this has to use polling so you pay for a 10ms
> task wakeup regardless of how long the timeout is.
Redone that part, see above.
> Given the limitations of the global timeout I'm not sure I see a use for the non-aborting form.
> This could just reduce down to:
> otherwise I don't really think this carries its weight. Of course that's just my opinion.
> Interested to hear others.
I prefer not to introduce another shape for AbortVM* options. Two diagnostic options seems to have
much less weight on the whole thing.
More information about the hotspot-dev