RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
david.holmes at oracle.com
Wed Feb 4 10:18:58 UTC 2015
On 4/02/2015 8:15 PM, Thomas Stüfe wrote:
> Great, thank you!
> @David: Do you still need something from me to sponsor the change?
No I think I have everything I need. Will push it in the morning.
You'll be listed as Contributor as you don't have Author status for jdk9
> Kind Regards, Thaoms
> On Tue, Feb 3, 2015 at 10:15 PM, Gerard Ziemski
> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
> Looks good! Thank you for a great fix!
> On 2/3/2015 10:40 AM, Thomas Stüfe wrote:
>> Hi Gerard,
>> thank you for your feedback! See here my new change:
>> On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski
>> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
>> hi Thomas,
>> On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>>> On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski
>>> <gerard.ziemski at oracle.com
>>> <mailto:gerard.ziemski at oracle.com>> wrote:
>>> hi Thomas,
>>> #1 The following comment in vmError_bsd.cpp,
>>> vmError_linux.cpp and vmError_solaris.cpp is now outdated
>>> and should be updated from:
>>> // handle all synchronous program error signals which may
>>> happen during error
>>> // reporting. They must be unblocked, caught, handled.
>>> // Note that in hotspot signal handlers are installed
>>> with a fully filled
>>> // signal mask, i.e. upon delivery all signals - also
>>> synchronous error signals
>>> // - are blocked. For synchronous error signals this is
>>> deadly - program will
>>> // hang or be terminated immediately if secondary errors
>>> happen during error
>>> // reporting.
>>> to the comment from vmError_aix.cpp:
>>> // Handle all synchronous signals which may happen during
>>> signal handling,
>>> // not just SIGSEGV and SIGBUS.
>>> I am not sure I understand: the problem is that signal
>>> handlers are installed in the hotspot with sa_mask completely
>>> set, so all signals are blocked. See e.g. os_linux.cpp, line
>>> 4304++ (os::Linux::set_signal_handler()). So the comment
>>> appears to be correct.
>> We continue to block the signals initially, but the comment,
>> or at least its last sentence, does not apply anymore because
>> of the crux of your fix - ie. the program will not hang
>> because we immediately unblock the synchronous traps in the
>> crash handler. So maybe we should make that clear in that comment?
>> Okay, I see that now. I modified the comments as you suggested
>>> POSIX itself mentioned only the four, see documentation on
>>> "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are
>>> generated while they are blocked, the result is undefined,
>>> unless the signal was generated by the /kill/()
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html> function,
>>> the /sigqueue/()
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html> function,
>>> or the /raise/()
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html> function."
>>> I added SIGTRAP on AIX because it is synchronously delivered
>>> too, but do not know any other signals. I am happy to add any
>>> other signals you know of.
>> Nope, I do not know any more synchronous signals - just wanted
>> to make sure we take a second look and collectively agree that
>> this list is exhaustive.
>>> I would like to test it at least twice, to see that signal
>>> handling continues to work after secondary handler ran once.
>>> It needs to correctly unset the signal mask for the
>>> synchronous signals.
>>> Executing it four times may be a bit excessive; if you
>>> prefer, I can remove it for the out_done case.
>>> Note that there is one benefit of repeatedly crashing in the
>>> error handler, it is to stress the stack usage during error
>>> reporting. But I admit that this test could be done much better.
>> It's a bit confusing at first when someone debugs this code
>> and sees four of those secondary signals coming through, so
>> maybe just a small comment here would suffice?
>> I changed the coding to only fire for the log case. That should be
>> less confusing.
>>> vmerrors.sh is very minimal. The bug symptom was incomplete
>>> or empty hs-err files, and you would not catch this with the
>>> vmerrors.sh test, would you not? You also can reproduce this
>>> bug only by using both -XX:ErrorHandlerTest and
>>> -XX:TestCrashInErrorHandler together, so it does not quite
>>> fit into what vmerrors.sh does.
>>> At SAP we have a nice test suite which crashes the VM in a
>>> number of ways and then checks whether the hs-err files are
>>> complete (enough). I was planning to add a Jtreg test which
>>> does something similar, but did not add it to this change
>>> because I did not want to make the change larger than necessary.
>> I would think that the added complexity of the test case,
>> would in fact make it easier to review the test with the fix
>> at the same time. Also, since the test in this case is
>> feasible, isn't it recommended to have the fix committed along
>> with its test?
>> I added a small jtreg test. Note that the test cannot really
>> demonstrate the bug without my fix, because it needs the new
>> -XX:TestCrashInErrorHandler options which is part of the fix. But
>> it will show regressions.
>> Best Regards, Thomas
More information about the hotspot-runtime-dev