RFR (S) 8181143: Introduce diagnostic flag to abort VM on too long VM operations

David Holmes david.holmes at oracle.com
Wed Dec 12 12:28:18 UTC 2018

Hi Aleksey,

On 11/12/2018 10:05 pm, Aleksey Shipilev wrote:
> Hi,
> Getting back to this. Replying to all three reviews in the same mail, because they share the
> answers! New patch:
>    http://cr.openjdk.java.net/~shade/8181143/webrev.04/

Okay I can live with that. :)

Three comments:

1. We don't use bug id's to name test directories any more (I see a 
couple  have snuck in this past year! so please place the test in a 
suitable directory ... Safepoints?

2. -Xmx1g might be an issue on some test systems. Does it have to be 
that large?

3. Can you disable core dumping as part of the test so we don't get 
large core files generated.

No need to see updated webrev.


> Testing: x86_64 build, hotspot tier1, new test
> On 11/20/18 8:07 AM, Thomas Stüfe wrote:
>> I don't know whether I need 10ms resolution though - if we limit the
>> goal to catching just hanging VMOps - which would still pretty useful
>> - 1s would even be enough.
> Right. New patch implements variable delay, depending on what actual delay had been requested. New
> test makes sure it works reliably.
>> I think we can do with just two flags, since VMOperationTimeout and
>> VMOperationTimeoutDelay are redundant. Just go with the delay, make -1
>> the default that does nothing. Same could go for SafepointTimeout vs
>> SafepointTimeoutDelay.
> Redid like this:
>    diagnostic(bool, AbortVMOnVMOperationTimeout, false,                      \
>            "Abort upon failure to complete VM operation promptly")           \
>                                                                              \
>    diagnostic(intx, AbortVMOnVMOperationTimeoutDelay, 1000,                  \
>            "Delay in milliseconds for option AbortVMOnVMOperationTimeout")   \
>            range(0, max_intx)                                                \
>                                                                              \
>> I also agree with others that it would make more sense were we to kill
>> the VMThread instead (e.g. just with pthread_kill(SIGILL) or similar)
>> - we do something like that in error handling (see ErrorLogTimeout).
> I looked into that, and it seems too much hassle without significant benefit. For example, it would
> require drilling new platform-specific holes to support it, and it also likely to be supported on
> Linux (pthreads) only anyway? Seems to me piggybacking on periodic tasks is much simpler. In
> development use, you'd have a core file to see what is going on in VM. In "production" use you'd
> care that circuit breaker that did its job of terminating the VM.
> On 11/19/18 9:35 AM, Robbin Ehn wrote:
>>   You patch seems not to be against jdk/jdk (jdk12).
> Really? The new patch is definitely against jdk/jdk.
>>   Without the actual core file, I don't see the hs file being very useful
>>   containing the watcherthread stacktrace. It would be good if the 'killer'
>>   signaled the VM thread and we do the error reporting from the signal handler
>>   instead in VM thread context.
> Right. Looked into that too, decided it is too complicated without significant benefit. See above.
> On 11/19/18 7:13 AM, David Holmes wrote:
>> You're not just introducing one diagnostic flag, your introducing the entire VM operation
>> timeout mechanism, including two product flags and one diagnostic. So the CR needs to reflect
>> that clearly and you will need a CSR request to add the two product flags. And they will need to
>> be documented.
> This was "solved" by doing two diagnostic flags, see above. I think the symmetry against other
> AbortVM* flags is important, and having the boolean flag for AbortVM* is good.
>> It's not safe to access vm_safepoint_description() outside the VMThread as the _cur_vm_operation
>> could be deleted while you're trying to access its name.
> Noted, removed that part.
>> And as we don't have a general timer mechanism this has to use polling so you pay for a 10ms
>> task wakeup regardless of how long the timeout is.
> Redone that part, see above.
>> Given the limitations of the global timeout I'm not sure I see a use for the non-aborting form.
>> This could just reduce down to:
>> -XX:AbortVMOpsAfter:NN
>> otherwise I don't really think this carries its weight. Of course that's just my opinion.
>> Interested to hear others.
> I prefer not to introduce another shape for AbortVM* options. Two diagnostic options seems to have
> much less weight on the whole thing.
> -Aleksey

More information about the hotspot-dev mailing list