RFR(M): 8244383: jhsdb/HeapDumpTestWithActiveProcess.java fails with "AssertionFailure: illegal bci"

Daniel D. Daugherty daniel.daugherty at oracle.com
Wed Jun 24 02:07:03 UTC 2020

Just one more comment on this part:

> >    L220: System.out.println("CurrentFrameGuess: choosing interpreter 
> frame: sp = " +
> >    L221:                                  spFound + ", fpFound = " + 
> fp + ", pcFound = " + pc);
> >        This debug output doesn't make sense to me:
> >
> >            "sp = " label and 'spFound' value
> >            "fpFound = " label and 'fp' value
> >            "pcFound = " label and 'pc' value
>         but I may not have enough context...
> From the point of view of the person reading the output, they want to 
> know the values for sp, fp, and pc. But within the code these values 
> are stored in the "found" variables. 

In that case, the code is wrong for the 'fp' and 'pc' outputs
since you changed the labels and not the variables.


On 6/23/20 7:28 PM, Chris Plummer wrote:
> On 6/23/20 2:44 PM, Daniel D. Daugherty wrote:
>> On 6/18/20 8:54 PM, Chris Plummer wrote:
>>> [I've added runtime-dev to this SA review since understanding 
>>> interpreter invokes (code generated by 
>>> TemplateInterpreterGenerator::generate_normal_entry()) and stack 
>>> walking is probably more important than understanding SA.]
>>> Hello,
>>> Please help review the following:
>>> https://bugs.openjdk.java.net/browse/JDK-8244383
>>> http://cr.openjdk.java.net/~cjplummer/8244383/webrev.00/index.html
> Thanks for helping!
>> Sorry for the delay in reviewing this one. I've come back to it a couple
>> of times because code like this is very hard to review.
>> General comment:
>>     This fix reminds of the crazy things that AsyncGetCallTrace has to
>>     do in order to gather call trace data. I'm guessing that SA is
>>     attaching to the VM in an asynchronous manner and that's why it
>>     can observe things like partially constructed frames. If that's a
>>     correct guess, then how is SA stopping/suspending the threads?
>>     I'm just curious here.
> On linux SA uses ptrace. I'm not familiar with the details of how it 
> works. I'm not sure where ptrace allows suspends to happen, but 
> certainly it has no knowledge of JVM safepoints or other 
> synchronization that the JVM does. So from the JVM and SA point of 
> view the suspend can happen at any arbitrary JVM instruction.
> From what I can gather, PTRACE_ATTACH suspends the entire process, so 
> that means all threads are suspended once you attach. However, 
> PTRACE_GETREGS can be called on individual threads (LWPs), but I don't 
> see any indication in the SA code that you need to attach to each LWP 
> first.
>>     Or this might be a case where SA is examining a core file in
>>     which case the various threads stacks are not necessarily at
>>     good/safepoint-safe pause points.
> For this bug and test it's a live process, but I think the bug being 
> addressed here can happen just as well with a core file. Unfortunately 
> we have very little core file testing support. I'm actually in the 
> middle of addressing that right now.
>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java 
>>     No comments.
>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java 
>>     No comments.
>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java 
>>     L104:     // two locations, then we canot determine the frame.
>>         typo: s/canot/cannot/
> ok
>>     L127:     // it's validity will help us determine the state of 
>> the new frame push.
>>         typo: s/it's/its/
> ok
>>     L148:         System.out.println("CurrentFrameGuess: frame pushed 
>> but not initaliazed.");
>>         typo: s/initaliazed/initialized/
> ok
>>     L220:               System.out.println("CurrentFrameGuess: 
>> choosing interpreter frame: sp = " +
>>     L221:                                  spFound + ", fpFound = " + 
>> fp + ", pcFound = " + pc);
>>         This debug output doesn't make sense to me:
>>             "sp = " label and 'spFound' value
>>             "fpFound = " label and 'fp' value
>>             "pcFound = " label and 'pc' value
>>         but I may not have enough context...
> From the point of view of the person reading the output, they want to 
> know the values for sp, fp, and pc. But within the code these values 
> are stored in the "found" variables.
>> With code like this, it's really hard to figure out if you've covered
>> all the cases unless you've been in the observer seat yourself and
>> even then your test runs may not hit all the possible cases. All you
>> can really do is start with a set of adaptive changes, run with those
>> for a while and tweak them as you gather more observations.
> Yes, and I know there is still a very tiny gap or two in coverage that 
> are maybe one or two instructions long, but they aren't worth dealing 
> with. This bug was already very rare, and with the fixes I've done I 
> don't see any issues now. SA is a debugger, so perfection in this 
> regard is not expected.
>> Chris, nice job with this bit of insanity!
> Thanks! I mostly stuck with this one to help with my SA expertise. 
> Otherwise it wouldn't have been worth the time.
> Chris
>> Thumbs up!
>> Dan
>>> The crux of the bug is when doing stack walking the topmost frame is 
>>> in an inconsistent state because we are in the middle of pushing a 
>>> new interpreter frame. Basically we are executing code generated by 
>>> TemplateInterpreterGenerator::generate_normal_entry(). Since the PC 
>>> register is in this code, SA assumes the topmost frame is an 
>>> interpreter frame.
>>> The first issue with this interpreter frame assumption is if we 
>>> haven't actually pushed the frame yet, then the current frame is the 
>>> caller's frame, and could be compiled. But since SA thinks it's 
>>> interpreted, later on it tries to convert the frame->bcp to a BCI, 
>>> but frame->bcp is only valid for interpreter frames. Thus the 
>>> "illegal BCI" failures. If the previous frame happened to be 
>>> interpreted, then the existing SA code works fine.
>>> The other state of frame pushing that was problematic was when the 
>>> new frame had been pushed, but frame->method and frame->bcp were not 
>>> setup yet. This also would lead to "illegal BCI" later on because 
>>> garbage would be stored in these locations.
>>> Fixing the above problems requires trying to determine the state of 
>>> the frame push through a series of checks, and then adapting what is 
>>> considered to be the current frame based on the outcome of the 
>>> checks. The first things checked is that frame->method is valid (we 
>>> can successfully instantiate a wrapper for the Method* without 
>>> failure) and that frame->bcp is within the method. If both these 
>>> pass then we can use the frame as-is.
>>> If the above checks fail, then we try to determine whether the issue 
>>> is that the frame is not yet pushed and the current frame is 
>>> actually compiled, or the frame has been pushed but not yet 
>>> initialized. This is done by first getting the return address from 
>>> the stack or RAX (it's location depends on how far along we are in 
>>> the entry code) and comparing this to what is stored in 
>>> frame->return_addr. If they are the same, then we have pushed the 
>>> frame but not yet initialized it. In this case we use the previous 
>>> frame (senderSP() and senderFP()) as the current frame since the 
>>> current frame is not yet initialized. If the return address check 
>>> fails, then we assume the new frame is not yet pushed, and and treat 
>>> the current frame as compiled, even though PC points into the 
>>> interpreter (we replace PC with RAX in this case).
>>> Comments in the code pretty well explain all the above, so it is 
>>> probably easier to follow the logic in the code along with the 
>>> comments rather than apply my above description to the code.
>>> I should add that it's very rare that we ever get into this special 
>>> error handling code. This bug was very hard to reproduce initially. 
>>> I was only able to make progress with reproducing and debugging by 
>>> inserting delay loops in various spots in the code generated by 
>>> TemplateInterpreterGenerator::generate_normal_entry(). By doing this 
>>> I was able to reproduce the issue quite easily and hit all the logic 
>>> in the new code I've added.
>>> The fix is basically entirely contained within 
>>> AMD64CurrentFrameGuess.java. The rest of the changes are minor:
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java 
>>> -Main fix for CR
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java 
>>> -Added getInterpreterFrameBCP(), which is now needed by 
>>> AMD64CurrentFrameGuess.java
>>> -I also simplified some code by using the existing 
>>> getInterpreterFrameMethod()
>>>  rather than replicating inline what it does.
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java 
>>> -I noticed the windows version of this code had some extra checks 
>>> that were missing
>>>  from the bsd version. I then looked at the linux version, but it 
>>> had been heavily modified
>>>  a short while back to leverage DWARF info to determine frames. So I 
>>> looked at the previous
>>>  rev and it too had these extra checks. I decided to add them to the 
>>> BSD port. I'm not sure
>>>  if it helps at all, but it certainly doesn't seem to do any harm.
>>> thanks,
>>> Chris

More information about the hotspot-runtime-dev mailing list