RFR: JDK-8220671: Initialization race for non-JavaThread PtrQueues
kim.barrett at oracle.com
Sat Mar 16 20:08:25 UTC 2019
> On Mar 15, 2019, at 9:39 AM, Roman Kennke <rkennke at redhat.com> wrote:
> In Shenandoah testing we discovered an initialization race: A non-Java
> GC thread (we have observed it on the StringDedupThread) may be
> initialized concurrently while Java and GC are already up an running,
> but not (yet) participate in safepointing.
> BS::on_thread_attach() usually does propagate global GC state to
> thread-local GC state, in this case the SATB active flag.
> When doing this concurrently, while not participating in safepointing,
> this may propagate the wrong state, and subsequently lead to heap
> corruption (e.g. because we missed some SATB updates).
> This is related to JDK-8219613 because before that change,
> non-Java-threads would simply use a shared SATB queue instead.
Drat! I was even looking for that kind of race, but missed it.
> The bug appeared in Shenandoah testing, but I don't see why it wouldn't
> affect G1 too. It's probably not run with aggressive enough tests to
> make it happen. (We run Shenandoah in aggressive mode, which starts
> continuous GCing right at the start.)
Agree that this affects G1 too.
> The problem seems specific to the StringDedupThread (for now), and so is
> the solution: in the StringDedupThread's pre_run() and post_run(), take
> the STSJoiner. This ensures that the thread doesn't accidentally crosses
> safepoints while initializing or exiting, and thus loosing SATB updates.
> I tried a couple of other approaches like:
> But we also need to protect the addition/removal of the thread to the
> global NonJavaThread list.
> Testing: Running the offending test (TestStringDedupStress.java) 20x in
> a row. It used to fail ~1 of 5 runs before. Now it all passes. Also,
> hotspot_gc_shenandoah passes. tier1 is fine too. Will push it through
> jdk-submit next.
> Can I please get reviews?
I think the proposed fix for _just_ StringDedupThread is insufficient.
I think there are other threads that could have similar races between
on_thread_attach and being added to the NJT thread list. And it
doesn't matter whether those threads touch oops or not. Even if they
don't, the state verification done when the SATB transition is made
can fail. (See SATBMarkQueueSet::verify_active_states.)
It might be sufficient to add a STS joiner around the on_thread_attach
/ add_to_the_list sequence in NJT::pre_run, with the joiner only
active if invoked when not at a safepoint. That way we can still
create threads during a safepoint that are needed in that safepoint
(e.g. lazy allocation of worker threads). There are some possibly
complicated cases to think about though, or maybe disallow somehow.
More information about the hotspot-gc-dev