Need help to fix a potential G1 crash in jdk11

Erik Österlund erik.osterlund at oracle.com
Fri Jun 5 08:55:14 UTC 2020


Hi Volker,

On 2020-06-03 20:18, Volker Simonis wrote:
> Unfortunately, "-XX:-ClassUnloading" doesn't help :(

I am actually happy that did not help. I suspect a bug in that code 
would be harder to track down; it is rather complicated.

> I already saw two new crashes. The first one has 6 distinct Root 
> locations pointing to one dead object:
>
> [863.222s][info ][gc,verify,start   ] Verifying During GC (Remark after)
> [863.222s][debug][gc,verify         ] Threads
> [863.224s][debug][gc,verify         ] Heap
> [863.224s][debug][gc,verify         ] Roots
> [863.229s][error][gc,verify         ] Root location 0x00007f11719174e7 
> points to dead obj 0x00000000f956dbd8
> [863.229s][error][gc,verify         ] 
> org.antlr.v4.runtime.atn.PredictionContextCache
> [863.229s][error][gc,verify         ] {0x00000000f956dbd8} - klass: 
> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> ...
> [863.229s][error][gc,verify         ] Root location 0x00007f1171921978 
> points to dead obj 0x00000000f956dbd8
> [863.229s][error][gc,verify         ] 
> org.antlr.v4.runtime.atn.PredictionContextCache
> [863.229s][error][gc,verify         ] {0x00000000f956dbd8} - klass: 
> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> [863.231s][debug][gc,verify         ] HeapRegionSets
> [863.231s][debug][gc,verify         ] HeapRegions
> [863.349s][error][gc,verify         ] Heap after failed verification 
> (kind 0):
>
> The second crash has only two Root locations pointing to the same dead 
> object but more than 40_000 fields in distinct objects pointing to 
> more than 3_500 dead objects:
>
> [854.473s][info ][gc,verify,start   ] Verifying During GC (Remark after)
> [854.473s][debug][gc,verify         ] Threads
> [854.475s][debug][gc,verify         ] Heap
> [854.475s][debug][gc,verify         ] Roots
> [854.479s][error][gc,verify         ] Root location 0x00007f6e60461d5f 
> points to dead obj 0x00000000fa874528
> [854.479s][error][gc,verify         ] 
> org.antlr.v4.runtime.atn.PredictionContextCache
> [854.479s][error][gc,verify         ] {0x00000000fa874528} - klass: 
> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> [854.479s][error][gc,verify         ] Root location 0x00007f6e60461d6d 
> points to dead obj 0x00000000fa874528
> [854.479s][error][gc,verify         ] 
> org.antlr.v4.runtime.atn.PredictionContextCache
> [854.479s][error][gc,verify         ] {0x00000000fa874528} - klass: 
> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> [854.479s][error][gc,verify         ] Root location 0x00007f6e60462138 
> points to dead obj 0x00000000fa874528
> [854.479s][error][gc,verify         ] 
> org.antlr.v4.runtime.atn.PredictionContextCache
> [854.479s][error][gc,verify         ] {0x00000000fa874528} - klass: 
> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> [854.482s][debug][gc,verify         ] HeapRegionSets
> [854.482s][debug][gc,verify         ] HeapRegions
> [854.484s][error][gc,verify         ] ----------
> [854.484s][error][gc,verify         ] Field 0x00000000fd363c70 of live 
> obj 0x00000000fd363c58 in region [0x00000000fd300000, 0x00000000fd400000)
> [854.484s][error][gc,verify         ] class name 
> org.antlr.v4.runtime.atn.ATNConfig
> [854.484s][error][gc,verify         ] points to dead obj 
> 0x00000000fa88a540 in region [0x00000000fa800000, 0x00000000fa900000)
> [854.484s][error][gc,verify         ] class name 
> org.antlr.v4.runtime.atn.ArrayPredictionContext
> [854.484s][error][gc,verify         ] ----------
> ...
> more than 40_000 fields in distinct objects pointing to more than 
> 3_500 dead objects.
>
> So how can this happen. Is "-XX:+VerifyAfterGC" really reliable here?

Naturally, it's hard to tell for definite what the issue is with only 
these printouts.
However, we can make some observations from the printouts:

Based on the address values of the "Root location" of the printouts, 
each dead object
reported is pointed at from at least one misaligned oop. The only 
misaligned oops in
HotSpot are nmethod oops embedded into the instruction stream as immediates.
So this smells like some kind of nmethod oop processing bug in G1 to me.

The Abortable Mixed GCs (https://openjdk.java.net/jeps/344) that went 
into 12 changed
quite a bit of the nmethod oop scanning code. Perhaps the reason why 
this stopped
reproducing in 12 is related to that. The nmethod oop processing code 
introduced with
AMGC actually had a word tearing problem for nmethod oops, which was 
fixed later with
https://bugs.openjdk.java.net/browse/JDK-8235305

Hope these pointers help.

/Erik

> Thank you and best regards,
> Volker
>
>
> On Wed, Jun 3, 2020 at 7:14 PM Volker Simonis 
> <volker.simonis at gmail.com <mailto:volker.simonis at gmail.com>> wrote:
>
>     Hi Erik,
>
>     thanks a lot for the quick response and the hint with
>     ClassUnloading. I've just started several runs of the test program
>     with "-XX:-ClassUnloading". I'll report back instantly once I have
>     some results.
>
>     Best regards,
>     Volker
>
>     On Wed, Jun 3, 2020 at 5:26 PM Erik Österlund
>     <erik.osterlund at oracle.com <mailto:erik.osterlund at oracle.com>> wrote:
>
>         Hi Volker,
>
>         In JDK 12, I changed quite a bit how G1 performs class
>         unloading, to a
>         new model.
>         Since the verification runs just after class unloading, I
>         guess it could
>         be interesting
>         to check if the error happens with -XX:-ClassUnloading as
>         well. If not,
>         then perhaps
>         some of my class unloading changes for G1 in JDK 12 fixed the
>         problem.
>
>         Just a gut feeling...
>
>         Thanks,
>         /Erik
>
>         On 2020-06-03 17:02, Volker Simonis wrote:
>         > Hi,
>         >
>         > I would appreciate some help/advice for debugging a
>         potential G1 crash in
>         > jdk 11. The crash usually occurs when running a proprietary
>         jar file for
>         > about 20-30 minutes and it happens in various parts of the
>         VM (C1- or
>         > C2-compiled code, interpreter, GC). Because the crash
>         locations are so
>         > different and because the customer which reported the issue
>         claimed that it
>         > doesn't happen with Parallel GC, I thought it might be a G1
>         issue. I
>         > couldn't reproduce the crash with jdk 12 and 14 (but with
>         jdk 11 and
>         > 11.0.7, OpenJDK and Oracle JDK). When looking at the G1
>         changes in jdk 12 I
>         > couldn't find any apparent bug fix which potentially solves
>         this problem
>         > but it may have been solved by one of the many G1 changes
>         which happened in
>         > jdk 12.
>         >
>         > I did run the reproducer with "-XX:+UnlockDiagnosticVMOptions
>         > -XX:+VerifyBeforeGC -XX:+VerifyAfterGC -XX:+VerifyDuringGC
>         > -XX:+CheckJNICalls -XX:+G1VerifyRSetsDuringFullGC
>         > -XX:+G1VerifyHeapRegionCodeRoots" and I indeed got
>         verification errors (see
>         > [1] for a complete hs_err file). Sometimes it's just a few
>         fields pointing
>         > to dead objects:
>         >
>         > [1035.782s][error][gc,verify         ] ----------
>         > [1035.782s][error][gc,verify         ] Field
>         0x00000000fb509148 of live obj
>         > 0x00000000fb509130 in region [0x00000000fb500000,
>         0x00000000fb600000)
>         > [1035.782s][error][gc,verify         ] class name
>         > org.antlr.v4.runtime.atn.ATNConfig
>         > [1035.782s][error][gc,verify         ] points to dead obj
>         > 0x00000000f9ba39b0 in region [0x00000000f9b00000,
>         0x00000000f9c00000)
>         > [1035.782s][error][gc,verify         ] class name
>         > org.antlr.v4.runtime.atn.SingletonPredictionContext
>         > [1035.782s][error][gc,verify         ] ----------
>         > [1035.783s][error][gc,verify         ] Field
>         0x00000000fb509168 of live obj
>         > 0x00000000fb509150 in region [0x00000000fb500000,
>         0x00000000fb600000)
>         > [1035.783s][error][gc,verify         ] class name
>         > org.antlr.v4.runtime.atn.ATNConfig
>         > [1035.783s][error][gc,verify         ] points to dead obj
>         > 0x00000000f9ba39b0 in region [0x00000000f9b00000,
>         0x00000000f9c00000)
>         > [1035.783s][error][gc,verify         ] class name
>         > org.antlr.v4.runtime.atn.SingletonPredictionContext
>         > [1035.783s][error][gc,verify         ] ----------
>         > ...
>         > [1043.928s][error][gc,verify         ] Heap Regions:
>         E=young(eden),
>         > S=young(survivor), O=old, HS=humongous(starts),
>         HC=humongous(continues),
>         > CS=collection set, F=free, A=archive, TAMS=top-at-mark-start
>         (previous,
>         > next)
>         > ...
>         > [1043.929s][error][gc,verify         ] | 79|0x00000000f9b00000,
>         > 0x00000000f9bfffe8, 0x00000000f9c00000| 99%| O| |TAMS
>         0x00000000f9bfffe8,
>         > 0x00000000f9b00000| Updating
>         > ...
>         > [1043.971s][error][gc,verify         ] | 105|0x00000000fb500000,
>         > 0x00000000fb54fc08, 0x00000000fb600000| 31%| S|CS|TAMS
>         0x00000000fb500000,
>         > 0x00000000fb500000| Complete
>         >
>         > but I also got verification errors with more than 30000
>         fields of distinct
>         > objects pointing to more than 1000 dead objects. How can
>         that happen? Is
>         > the verification always accurate or can this also be a
>         problem with the
>         > verification itself and I'm hunting the wrong problem?
>         >
>         > Sometimes I also saw verification errors where fields point
>         to objects in
>         > regions with "Untracked remset":
>         >
>         > [673.762s][error][gc,verify] ----------
>         > [673.762s][error][gc,verify] Field 0x00000000fca49298 of
>         live obj
>         > 0x00000000fca49280 in region [0x00000000fca0000
>         > 0, 0x00000000fcb00000)
>         > [673.762s][error][gc,verify] class name
>         org.antlr.v4.runtime.atn.ATNConfig
>         > [673.762s][error][gc,verify] points to obj
>         0x00000000f9d5a9a0 in region
>         >
>         81:(F)[0x00000000f9d00000,0x00000000f9d00000,0x00000000f9e00000]
>         remset
>         > Untracked
>         > [673.762s][error][gc,verify] ----------
>         >
>         > But they are by far not that common like the pointers to
>         dead objects. Once
>         > I even saw a "Root location" pointing to a dead object:
>         >
>         > [369.808s][error][gc,verify] Root location
>         0x00007f35bb33f1f8 points to
>         > dead obj 0x00000000f87fa200
>         > [369.808s][error][gc,verify]
>         org.antlr.v4.runtime.atn.PredictionContextCache
>         > [369.808s][error][gc,verify] {0x00000000f87fa200} - klass:
>         > 'org/antlr/v4/runtime/atn/PredictionContextCache'
>         > [369.850s][error][gc,verify] ----------
>         > [369.850s][error][gc,verify] Field 0x00000000fbc60900 of
>         live obj
>         > 0x00000000fbc608f0 in region [0x00000000fbc00000,
>         0x00000000fbd00000)
>         > [369.850s][error][gc,verify] class name
>         > org.antlr.v4.runtime.atn.ParserATNSimulator
>         > [369.850s][error][gc,verify] points to dead obj
>         0x00000000f87fa200 in
>         > region [0x00000000f8700000, 0x00000000f8800000)
>         > [369.850s][error][gc,verify] class name
>         > org.antlr.v4.runtime.atn.PredictionContextCache
>         > [369.850s][error][gc,verify] ----------
>         >
>         > All these verification errors occur after the Remark phase in
>         > G1ConcurrentMark::remark() at:
>         >
>         > verify_during_pause(G1HeapVerifier::G1VerifyRemark,
>         > VerifyOption_G1UsePrevMarking, "Remark after");
>         >
>         > V  [libjvm.so+0x6ca186]  report_vm_error(char const*, int,
>         char const*,
>         > char const*, ...)+0x106
>         > V  [libjvm.so+0x7d4a99]
>         G1HeapVerifier::verify(VerifyOption)+0x399
>         > V  [libjvm.so+0xe128bb] Universe::verify(VerifyOption, char
>         const*)+0x16b
>         > V  [libjvm.so+0x7d44ee]
>         >  G1HeapVerifier::verify(G1HeapVerifier::G1VerifyType,
>         VerifyOption, char
>         > const*)+0x9e
>         > V  [libjvm.so+0x7addcf]
>         >
>          G1ConcurrentMark::verify_during_pause(G1HeapVerifier::G1VerifyType,
>         > VerifyOption, char const*)+0x9f
>         > V  [libjvm.so+0x7b172e] G1ConcurrentMark::remark()+0x3be
>         > V  [libjvm.so+0xe6a5e1] VM_CGC_Operation::doit()+0x211
>         > V  [libjvm.so+0xe69908] VM_Operation::evaluate()+0xd8
>         > V  [libjvm.so+0xe6713f]
>         VMThread::evaluate_operation(VM_Operation*) [clone
>         > .constprop.54]+0xff
>         > V  [libjvm.so+0xe6764e]  VMThread::loop()+0x3be
>         > V  [libjvm.so+0xe67a7b]  VMThread::run()+0x7b
>         >
>         > The GC log output looks as follows:
>         > ...
>         > [1035.775s][info ][gc,verify,start   ] Verifying During GC
>         (Remark after)
>         > [1035.775s][debug][gc,verify         ] Threads
>         > [1035.776s][debug][gc,verify         ] Heap
>         > [1035.776s][debug][gc,verify         ] Roots
>         > [1035.782s][debug][gc,verify         ] HeapRegionSets
>         > [1035.782s][debug][gc,verify         ] HeapRegions
>         > [1035.782s][error][gc,verify         ] ----------
>         > ...
>         > A more complete GC log can be found here [2].
>         >
>         > For the field 0x00000000fb509148 of live obj
>         0x00000000fb509130 which
>         > points to the dead object 0x00000000f9ba39b0 I get the following
>         > information if I inspect them with clhsdb:
>         >
>         > hsdb> inspect 0x00000000fb509130
>         > instance of Oop for org/antlr/v4/runtime/atn/ATNConfig @
>         0x00000000fb509130
>         > @ 0x00000000fb509130 (size = 32)
>         > _mark: 13
>         > _metadata._compressed_klass: InstanceKlass for
>         > org/antlr/v4/runtime/atn/ATNConfig
>         > state: Oop for org/antlr/v4/runtime/atn/BasicState @
>         0x00000000f83ecfa8 Oop
>         > for org/antlr/v4/runtime/atn/BasicState @ 0x00000000f83ecfa8
>         > alt: 1
>         > context: Oop for
>         org/antlr/v4/runtime/atn/SingletonPredictionContext @
>         > 0x00000000f9ba39b0 Oop for
>         > org/antlr/v4/runtime/atn/SingletonPredictionContext @
>         0x00000000f9ba39b0
>         > reachesIntoOuterContext: 8
>         > semanticContext: Oop for
>         org/antlr/v4/runtime/atn/SemanticContext$Predicate
>         > @ 0x00000000f83d57c0 Oop for
>         > org/antlr/v4/runtime/atn/SemanticContext$Predicate @
>         0x00000000f83d57c0
>         >
>         > hsdb> inspect 0x00000000f9ba39b0
>         > instance of Oop for
>         org/antlr/v4/runtime/atn/SingletonPredictionContext @
>         > 0x00000000f9ba39b0 @ 0x00000000f9ba39b0 (size = 32)
>         > _mark: 41551306041
>         > _metadata._compressed_klass: InstanceKlass for
>         > org/antlr/v4/runtime/atn/SingletonPredictionContext
>         > id: 100635259
>         > cachedHashCode: 2005943142
>         > parent: Oop for
>         org/antlr/v4/runtime/atn/SingletonPredictionContext @
>         > 0x00000000f9ba01b0 Oop for
>         > org/antlr/v4/runtime/atn/SingletonPredictionContext @
>         0x00000000f9ba01b0
>         > returnState: 18228
>         >
>         > I could also reproduce the verification errors with a fast
>         debug build of
>         > 11.0.7 which I did run with "-XX:+CheckCompressedOops
>         -XX:+VerifyOops
>         > -XX:+G1VerifyCTCleanup -XX:+G1VerifyBitmaps" in addition to
>         the options
>         > mentioned before, but unfortunaltey the run didn't trigger
>         neither an
>         > assertion nor a different verification error.
>         >
>         > So to summarize, my basic questions are:
>         >   - has somebody else encountered similar crashes?
>         >   - is someone aware of specific changes in jdk12 which
>         might solve this
>         > problem?
>         >   - are the verification errors I'm seeing accurate or is it
>         possible to get
>         > false positives when running with
>         -XX:Verify{Before,During,After}GC ?
>         >
>         > Thanks for your patience,
>         > Volker
>         >
>         > [1]
>         >
>         http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/hs_err_pid28294.log
>         > [2]
>         >
>         http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/verify-error.log
>



More information about the hotspot-gc-dev mailing list