RFR(M): 8210754: print_location is not reliable enough (printing register info)
martin.doerr at sap.com
Mon Sep 17 07:56:45 UTC 2018
The errors I'd like to address basically occur while dumping Java objects.
The current "is_oop" check assumes almost every Java heap address to be a valid oop. That check is not sufficient to decide whether the object can safely be dumped.
Please note that the existing dumping code already accesses metadata which may get modified concurrently, so my changes don't really make that worse.
I'm assuming that Java heap modifications happen far more often than modifications of the small piece of the metaspace which needs to get accessed in order to check a few Java objects.
A crash in the iteration due to concurrent structural modification sounds like a very unlikely situation, in which we might still get "error occurred during error reporting". IMHO that is no worse than what we currently have.
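To illustrate the difference between the weak and the stricter kind of check, here is a minimal standalone sketch. It is not HotSpot code: AddressRanges, weak_check and strict_check are hypothetical names, and the assumption that the first word of an object holds its class pointer is made only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical address ranges the dumper would trust (not a HotSpot type).
struct AddressRanges {
  std::uintptr_t heap_start;
  std::uintptr_t heap_end;   // exclusive
  std::uintptr_t meta_start;
  std::uintptr_t meta_end;   // exclusive
};

inline bool in_range(std::uintptr_t p, std::uintptr_t lo, std::uintptr_t hi) {
  return p >= lo && p < hi;
}

inline bool is_aligned(std::uintptr_t p, std::size_t alignment) {
  return (p % alignment) == 0;
}

// Weak check: any address inside the heap is accepted as an object.
inline bool weak_check(const AddressRanges& r, std::uintptr_t addr) {
  return in_range(addr, r.heap_start, r.heap_end);
}

// Stricter check: the address must be word aligned, the header word must
// lie inside the heap, and the word stored there (assumed here to be the
// class pointer) must be an aligned address inside the metadata range.
inline bool strict_check(const AddressRanges& r, std::uintptr_t addr) {
  const std::size_t word = sizeof(std::uintptr_t);
  if (!in_range(addr, r.heap_start, r.heap_end)) return false;
  if (!is_aligned(addr, word)) return false;
  if (r.heap_end - addr < word) return false;  // header must fit in the heap
  std::uintptr_t klass = 0;
  std::memcpy(&klass, reinterpret_cast<const void*>(addr), word);
  return is_aligned(klass, word) && in_range(klass, r.meta_start, r.meta_end);
}
```

With such a check, a register value that merely points somewhere into the heap is printed as a plain address instead of being dumped as an object. The header read itself can still race with concurrent modification, which is the remaining risk discussed in this thread.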
From: David Holmes <david.holmes at oracle.com>
Sent: Montag, 17. September 2018 08:54
To: Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net
Subject: Re: RFR(M): 8210754: print_location is not reliable enough (printing register info)
On 17/09/2018 3:41 PM, Doerr, Martin wrote:
> Hi David,
> thanks for looking at my proposal.
> I'm aware that the new code accesses memory which may be mutated concurrently.
> But I'm convinced that this is far better than what we currently have. Analyzing the state of a crashed VM can never be 100% safe.
Can you summarise what the causes of the secondary errors were and how
this additional set of checks tries to deal with that please. This looks
like it's trying to do more than just improve reliability - and some
parts seem potentially just as unreliable (not that it may not be useful
when it does work - though how could you tell if you walk a bad pointer
when examining the CLDGraph?).
> I could use try_lock to improve this situation. When I get the lock, fine.
> But what should we do when the lock is held by the code which has crashed?
> I think we shouldn't wait for any lock. It's better to risk errors due to concurrent mutation, which seems unlikely.
Definitely do not want to take locks. :)
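For reference, the non-blocking alternative raised in the quoted text can be sketched as follows. This is a standalone illustration, not HotSpot code: TryGuard, WalkMode and walk_best_effort are hypothetical names, and an atomic flag is used instead of a real mutex because the flag may already be held by the crashed code itself.

```cpp
#include <atomic>

// Hypothetical non-blocking guard (not a HotSpot class).
class TryGuard {
 public:
  // Returns true if the flag was free and is now owned by the caller.
  // Never blocks, even if the current owner is the crashed code.
  bool try_acquire() {
    return !busy_.exchange(true, std::memory_order_acquire);
  }
  void release() { busy_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> busy_{false};
};

enum class WalkMode { Guarded, Unguarded };

// Runs `walk` in both cases and only reports whether it was protected.
// Sketch only: `walk` is assumed not to throw.
template <typename Walk>
WalkMode walk_best_effort(TryGuard& guard, Walk walk) {
  if (guard.try_acquire()) {
    walk();
    guard.release();
    return WalkMode::Guarded;
  }
  walk();  // the owner may never release the flag, so do not wait
  return WalkMode::Unguarded;
}
```

In the unguarded case the walk is exactly as exposed to concurrent mutation as a walk without any lock, so the pattern only helps when the flag happens to be free.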
My continual concern with the ever expanding error reporting code is
that every change, whilst improving one scenario, potentially degrades
> Best regards,
> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Montag, 17. September 2018 07:07
> To: Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net
> Subject: Re: RFR(M): 8210754: print_location is not reliable enough (printing register info)
> Hi Martin,
> On 15/09/2018 12:03 AM, Doerr, Martin wrote:
>> I'd like to make os::print_location, which is used in the error reporting step "printing register info", more reliable. Oops and Klasses should get inspected more carefully.
> But some of what you are doing is accessing shared state that could be
> mutated concurrently with the error reporting thread that is trying to
> read it e.g. walking the ClassLoaderDataGraph!
>> I have seen errors like "[error occurred during error reporting (printing register info), id 0xe0000000, Internal Error (/usr/work/d056149/openjdk/jdk/src/hotspot/share/oops/klass.inline.hpp:63)]" in many hs_err files.
>> Sometimes, I get such errors when using -XX:+CrashGCForDumpingJavaThread, sometimes when injecting crashing code into compiled methods which I did by the following code:
>> I can also contribute this if it's desired. Automatic tests would certainly be nice to have. Maybe I can find some time for that.
>> Please review.
>> Best regards,