RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit

David Holmes david.holmes at oracle.com
Tue Aug 30 03:10:01 UTC 2016

Hi Fred,

On 30/08/2016 12:37 AM, Frederic Parain wrote:
> Hi David,
> Thank you for the review.
> A few comments in-lined below.
> On 08/28/2016 09:36 PM, David Holmes wrote:
>> Hi Fred,
>> On 27/08/2016 6:00 AM, Frederic Parain wrote:
>>> Hi,
>>> Please review this fix for bug JDK-8137035
>>> The bug is confidential but it is related to several VM crashes
>>> that occurred on the Windows 64-bit platform in stack overflow
>>> conditions. I've copied/pasted the analysis of the bug and the
>>> description of the fix below.
>> The analysis and solution all seem reasonable. Though I do have to
>> wonder how the failure to reenable the yellow zone when returning to
>> Java would not cause far more problems, on all platforms.
> Running with Yellow Pages disabled clearly opens the door to random
> crashes. Making the mechanism simpler and more robust would benefit
> all platforms.
>>> Webrev:
>>> http://cr.openjdk.java.net/~fparain/8137035/webrev.00/
>> src/os/windows/vm/os_windows.cpp
>> While examining the thread state logic in the exception handler I
>> noticed some pre-existing bugs:
>> 2506   if (exception_code == EXCEPTION_ACCESS_VIOLATION) {
>> 2507     JavaThread* thread = (JavaThread*) t;
>> there is no check that t is in fact a JavaThread, or even that t is
>> non-NULL. Such checks occur slightly later:
> I've investigated this issue, and it is currently harmless.
> The cast pointer is only used to call a method requiring
> a JavaThread* pointer, and the only use of its argument is
> a NULL check. Unfortunately, fixing this issue would require
> modifying the prototype of os::is_memory_serialize_page()
> and propagating the change across all platforms using it.
> It's a wider scope fix than JDK-8137035.

I was only expecting you to move (if the scoping allows it, else copy) 
the later:

if (t != NULL && t->is_Java_thread()) {

check. ;)
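
To make the suggestion concrete, here is a minimal sketch of the guarded cast. The Thread and JavaThread classes below are hypothetical stand-ins, not the HotSpot definitions, and as_java_thread_or_null() is a name invented for illustration only; the point is just that the NULL and is_Java_thread() checks come before the cast is used.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for HotSpot's Thread and JavaThread classes.
class Thread {
 public:
  virtual ~Thread() {}
  virtual bool is_Java_thread() const { return false; }
};

class JavaThread : public Thread {
 public:
  bool is_Java_thread() const override { return true; }
};

// Hypothetical helper: returns t as a JavaThread* only when the cast is
// known to be safe, and NULL otherwise.
JavaThread* as_java_thread_or_null(Thread* t) {
  if (t != NULL && t->is_Java_thread()) {
    return static_cast<JavaThread*>(t);
  }
  return NULL;
}
```

With a check of this shape in front of the EXCEPTION_ACCESS_VIOLATION block, a NULL or non-Java thread would never reach code that expects a JavaThread*.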


> I've added a comment on the unsafe cast in the os_windows.cpp file,
> highlighting the fact it was unsafe, and explaining why it
> is currently harmless.
>> 2523   if (t != NULL && t->is_Java_thread()) {
>> 2524     JavaThread* thread = (JavaThread*) t;
>> This bug seems significant:
>> 2566       if (thread->stack_guards_enabled()) {
>> 2567         if (_thread_in_Java) {
>> _thread_in_Java is an enum value not a variable so we will always
>> execute this block! This code should be testing the local in_java
>> variable.
> Good catch! Fixed.
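
For readers following along, a reduced sketch of why the original test was always true. The enum below is a hypothetical, cut-down stand-in for the thread state enum, with illustrative values; the only property that matters is that _thread_in_Java is a non-zero constant. The two function names are invented for this example.

```cpp
#include <cassert>

// Hypothetical, reduced stand-in for the thread state enum.
// The values are illustrative assumptions.
enum JavaThreadState {
  _thread_in_native = 4,
  _thread_in_vm     = 6,
  _thread_in_Java   = 8
};

// Buggy form: the condition is an enum constant, so the branch is taken
// regardless of the thread's actual state.
bool takes_java_path_buggy(JavaThreadState /*state*/) {
  if (_thread_in_Java) {
    return true;
  }
  return false;
}

// Fixed form: the condition is a local computed from the actual state.
bool takes_java_path_fixed(JavaThreadState state) {
  bool in_java = (state == _thread_in_Java);
  return in_java;
}
```

The buggy form reports the Java path even for a thread in _thread_in_vm, which is exactly the case the handler needs to distinguish.
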
> Updated webrev:
> http://cr.openjdk.java.net/~fparain/8137035/webrev.01/index.html
> Thank you,
> Fred
>> Your changes seem fine in themselves.
>> Thanks,
>> David
>>> Testing: JPRT (testset hotspot) and nsk.stress
>>> Thanks,
>>> Fred
>>> ---------
>>> All these crashes related to stack overflows on Windows have presumably
>>> the same causes:
>>>     - an undersized StackShadowPages parameter
>>>     - the behavior of guard pages on Windows
>>>     - a flaw in Yellow Pages management
>>> These three factors combined together can lead to sporadic crashes of
>>> the JVM when stack overflow conditions are encountered.
>>> All the crashes listed in this CR and in the related CR are almost
>>> impossible to reproduce, which indicates that the issue only shows up in
>>> some extreme or uncommon conditions. By design, the JVM crashes on stack
>>> overflow only if the Red Zone (the last one in the execution stack) is
>>> hit. Before the Red Zone, there's the Yellow Zone which is here to
>>> detect and handle stack overflows in a nicer way (throwing a
>>> StackOverflowError instead of crashing the process). If the Red zone is
>>> hit, it means that the Yellow Zone was
>>> disabled, and there's only two cases where the Yellow Zone is disabled:
>>>   1 - when a potential stack overflow is detected in Java code, in this
>>> case the Yellow Zone is disabled during the generation of the
>>> StackOverflowError and restored during the propagation of the
>>> StackOverflowError
>>>   2 - when a stack overflow occurs either in native code or in JVM code,
>>> because there's nothing else the JVM can do.
>>> In several crashes, the call stack doesn't show any special recursive
>>> Java calls that could suggest the JVM is in case 1. But they show
>>> relatively complex code paths inside JVM code (de-optimization or
>>> class/symbol resolution), which suggests that case 2 occurred.
>>> The case of stack overflow in native code is straightforward: if the
>>> Yellow Zone is hit, it is disabled, but when a JavaThread returns from
>>> native code to Java code, the Yellow Zone is systematically re-enabled
>>> (this is part of the native call wrapper
>>> generated by the JVM).
>>> The case of stack overflow in JVM code is more problematic. The JVM
>>> tries to avoid the case of stack overflow in VM code with the Shadow
>>> Pages mechanism. Whenever a Java method is invoked, the JVM tries to
>>> ensure that there's enough free stack space to execute the Java method
>>> and *any call to the JVM code (or JDK native code) that could occur
>>> during the execution of this method*. This check is performed by banging
>>> (touching) n pages ahead on the execution stack, and n is set to
>>> StackShadowPages. If the Yellow Zone is hit during the stack banging, a
>>> StackOverflowError is thrown before the execution of the first bytecode
>>> of the Java method. But this mechanism assumes that StackShadowPages
>>> pages is big enough to cover *any call to the JVM*. If this assumption
>>> is wrong, bad things happen.
>>> I ran experiments with tests for which stack overflow related crashes
>>> were reported. I ran them with a JVM where the StackShadowPages value
>>> was decreased by only 1 compared to the usual default value. It was very
>>> easy to reproduce stack overflow crashes. By instrumenting the JVM, it
>>> appeared that some threads hit the Yellow Zone while having thread state
>>> _thread_in_vm. Which means that in many cases, the margin between the
>>> stack space provided by StackShadowPages and the real stack usage while
>>> executing VM code is less than one page. And because knowing the biggest
>>> stack requirement to execute any JVM code is an undecidable problem,
>>> there's a high probability that some paths require more stack space than
>>> StackShadowPages ensures. It is important to notice
>>> that Windows is the platform with the smallest default value for
>>> StackShadowPages.
>>> So, an undersized StackShadowPages could cause the Yellow Zone to be hit
>>> while executing JVM code. On Unices (Solaris, Linux, MacOSX), the
>>> sanction is immediate: a SIGSEGV signal is sent, but because there's no
>>> more free space on the execution stack, the signal handler cannot be
>>> executed and the JVM process is killed. It's a crash without hs_error
>>> file generation.
>>> On Windows, the story is different. Yellow Pages are marked with the
>>> "Guard" bit. When a page with a Guard bit set is touched, the current
>>> thread receives an exception, but before the exception handler is
>>> executed, the OS removes the Guard bit from the page, so the page that
>>> triggered the fault can be used to execute the signal handler. So on
>>> Windows, when the Yellow Zone is hit while executing JVM code, the JVM
>>> doesn't die like on Unix systems, but the signal handler is executed.
>>> The logic in the signal handler looks like this (simplified version):
>>>    if thread touches the yellow zone:
>>>       if thread_in_java:
>>>           disable yellow pages
>>>           jump to code throwing StackOverflowError
>>>           // note: yellow pages will be re-enabled
>>>           // while unwinding the stack
>>>       else:
>>>           // thread_in_vm or thread_in_native
>>>           disable yellow pages
>>>           resume execution
>>>    else:
>>>        // Fatal red zone violation.
>>>        disable red pages
>>>        generate VM crash
>>> So, the signal handler disables the protection of the Yellow Pages and
>>> resumes JVM code execution.
>>> Eventually, the thread will return from the VM and will continue
>>> executing Java code.  But at this point, the yellow pages are still
>>> disabled and there's no systematic check to ensure that Yellow Pages are
>>> re-enabled when returning to Java. The only places where the JVM  checks
>>> if Yellow Pages need to be re-activated is when returning from native
>>> code or in the exception propagation code (but not all paths reactivate
>>> the Yellow Zone).
>>> Once the execution of Java code has resumed with the yellow zone
>>> disabled, the thread is not protected any more against stack overflows.
>>> The only remaining protection is the red zone, and if it is hit, the VM
>>> will generate a crash report and die. Note that having Yellow Zone
>>> de-activated makes the stack banging of StackShadowPages ineffective.
>>> Stack banging relies on the Yellow Pages to be activated, so touching
>>> them triggers a signal. If Yellow Pages are de-activated (unprotected)
>>> no signal is sent, unless the stack banging hits the Red Page, which
>>> triggers a VM crash with hs_error file generation.
>>> To summarize: an undersized StackShadowPages on Windows can lead to a
>>> JavaThread executing Java code with Yellow Pages disabled, which means
>>> without any stack overflow protection except the Red Zone which is the
>>> one triggering VM crashes with hs_error file generation.
>>> Note that the Yellow Pages can be "incidentally" re-activated by a call
>>> to native code or by throwing an exception. This could explain why
>>> stack overflow crashes are not so frequent: the time window during which
>>> Java code is executed without stack overflow protection might be small
>>> for some applications.
>>> Proposed fixes for this issue:
>>>   - increase StackShadowPages for the Windows platform
>>>   - add an assertion in the signal handler to detect threads hitting the Yellow
>>> Zone while executing JVM code (to detect undersized StackShadowPages
>>> during our testing)
>>>   - ensure Yellow Pages are activated when transitioning from
>>> _thread_in_vm to _thread_in_java
