RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
thomas.stuefe at gmail.com
Wed Feb 4 10:15:53 UTC 2015
Great, thank you!
@David: Do you still need something from me to sponsor the change?
Kind Regards, Thaoms
On Tue, Feb 3, 2015 at 10:15 PM, Gerard Ziemski <gerard.ziemski at oracle.com>
> Looks good! Thank you for a great fix!
> On 2/3/2015 10:40 AM, Thomas Stüfe wrote:
> Hi Gerard,
> thank you for your feedback! See here my new change:
> On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski <gerard.ziemski at oracle.com
> > wrote:
>> hi Thomas,
>> On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>> On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski <
>> gerard.ziemski at oracle.com> wrote:
>>> hi Thomas,
>>> #1 The following comment in vmError_bsd.cpp, vmError_linux.cpp and
>>> vmError_solaris.cpp is now outdated and should be updated from:
>>> // handle all synchronous program error signals which may happen during
>>> // reporting. They must be unblocked, caught, handled.
>>> // Note that in hotspot signal handlers are installed with a fully filled
>>> // signal mask, i.e. upon delivery all signals - also synchronous error
>>> // - are blocked. For synchronous error signals this is deadly - program
>>> // hang or be terminated immediately if secondary errors happen during
>>> // reporting.
>>> to the comment from vmError_aix.cpp:
>>> // Handle all synchronous signals which may happen during signal
>>> // not just SIGSEGV and SIGBUS.
>> I am not sure I understand: the problem is that signal handlers are
>> installed in the hotspot with sa_mask completely set, so all signals are
>> blocked. See e.g. os_linux.cpp, line 4304++
>> (os::Linux::set_signal_handler()). So the comment appears to be correct.
>> We continue to block the signals initially, but the comment, or at least
>> its last sentence, does not apply anymore because of the crux of your fix -
>> ie. the program will not hang because we immediately unblock the
>> synchronous traps in the crash handler. So maybe we should make that clear
>> in that comment?
> Okay, I see that now. I modified the comments as you suggested earlier.
>> POSIX itself mentioned only the four, see documentation on sigprocmask:
>> "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated
>> while they are blocked, the result is undefined, unless the signal was
>> generated by the *kill*()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html> function,
>> the *sigqueue*()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html> function,
>> or the *raise*()
>> I added SIGTRAP on AIX because it is synchronously delivered too, but
>> do not know any other signals. I am happy to add any other signals you know
>> Nope, I do not know any more synchronous signals - just wanted to make
>> sure we take a second look and collectively agree that this list is
>> I would like to test it at least twice, to see that signal handling
>> continues to work after secondary handler ran once. It needs to correctly
>> unset the signal mask for the synchronous signals.
>> Executing it four times may be a bit excessive; if you prefer, I can
>> remove it for the out_done case.
>> Note that there is one benefit of repeatedly crashing in the error
>> handler, it is to stress the stack usage during error reporting. But I
>> admit that this test could be done much better.
>> It's a bit confusing at first when someone debugs this code and sees
>> four of those secondary signals coming through, so maybe just a small
>> comment here would suffice?
> I changed the coding to only fire for the log case. That should be less
>> vmerrors.sh is very minimal. The bug symptom was incomplete or empty
>> hs-err files, and you would not catch this with the vmerrors.sh test, would
>> you not? You also can reproduce this bug only by using both
>> -XX:ErrorHandlerTest and -XX:TestCrashInErrorHandler together, so it does
>> not quite fit into what vmerrors.sh does.
>> At SAP we have a nice test suite which crashes the VM in a number of
>> ways and then checks whether the hs-err files are complete (enough). I was
>> planning to add a Jtreg test which does something similar, but did not add
>> it to this change because I did not want to make the change larger than
>> I would think that the added complexity of the test case, would in fact
>> make it easier to review the test with the fix at the same time. Also,
>> since the test in this case is feasible, isn't it recommended to have the
>> fix committed along with its test?
> I added a small jtreg test. Note that the test cannot really demonstrate
> the bug without my fix, because it needs the new
> -XX:TestCrashInErrorHandler options which is part of the fix. But it will
> show regressions.
> Best Regards, Thomas
More information about the hotspot-runtime-dev