RFR: JDK-8220671: Initialization race for non-JavaThread PtrQueues
kim.barrett at oracle.com
Wed Mar 20 21:41:30 UTC 2019
> On Mar 20, 2019, at 4:06 PM, Roman Kennke <rkennke at redhat.com> wrote:
>> I have a different approach here:
>> If you agree, reassign 8221102 to me and I'll finish this up.
> This seems a workable solution too.
> I am not a fan on mechanisms that depend on stuff not overflowing, though. So, at the very least, we should protect against that.
It’s not just overflow / wraparound of the counter that is needed to get into trouble. It also
requires behavior that seems unlikely, e.g. iterate over superset of threads, then iterate *many*
times over subset, then iterate over superset again.
But as I said, it’s easy to deal with, and the code is there to do so.
> When I see multithreaded iterators with claiming mechanisms like that (we have a few of such all over Hotspot, e.g. CLD claiming works similar), it makes me cringe a little. It means that all worker threads are really looking at all items, battling over claim tokens. It would be so much easier and more efficient to just have the array of items, and update the iteration pointer using atomic-add. Right? In case of linked list, it would have to be updated using CAS. I guess with list sizes of few dozen to few 1000 (all threads) it doesn't matter much though.
But then one gets into array management, which is it’s own hassle. And sequence management
may become more expensive (e.g. remove some random entry from the array). I ended up going
down that route for OopStorage (for example) because otherwise the scaling of parallel iteration
was poor, but it’s not without its own downsides.
> Also, another simple way to fix it is to always scan all threads. That would at least be consistent with what other iterators do (e.g. Threads::threads_do() ). I don't think that performance is a very big issue there.
That was one of my earlier suggestions. I think expanding the size of the claim counter is better though.
>> Note that I haven't tried this against your crashing test yet. I'm
>> just getting set up to try to reproduce that failure locally. Based
>> on your reports, it sounds like there might be yet another issue
> Yeah, something's still up, and this makes this particular change hard to verify ;-)
Well, I’ve reproduced the failure. And it failed 2 for 2 (with the baddaddrtest patch applied). And things
were looking good with this change for a bit; it didn’t fail until the 8th iteration (around the time I was
reading the above paragraph from your email!), and then hasn’t passed several attempts since.
I think my patch to expand the claim counter is a good solution to 8221102 though, so unless you object
I’m going to claim that bug and send out that patch.
Still trying to debug the crash though. I haven’t been delving into the string dedup code much before, and
Shenandoah’s looks pretty different from G1’s. I doubt the race we’ve been discussing for 8220671 is the
source of the problem though. I will continue investigating.
More information about the hotspot-gc-dev