RFR (XS): 8164948: Initializing stores of HeapRegions are not ordered with regards to its use in G1ConcurrentMark
thomas.schatzl at oracle.com
Fri Sep 9 10:44:41 UTC 2016
can I have reviews for this change that adds required barriers before
publishing HeapRegion instances to the rest of the VM to avoid random
crashes seen on ARM* machines with G1?
The problem (in this instance - it might have caused other issues in
other code before) is that in G1ConcurrentMark when moving the finger
before marking we may load HeapRegion instances containing invalid
data. This is because before publishing the HeapRegion* instance to the
global HeapRegionManager::_regions map, we do not make sure that the
initializating stores retire. So the subsequent loads of values from
that data structure may not have been visible for the other threads
(the marking thread in particular).
About the patch: I think the loadload in G1ConcurrentMark is not
strictly necessary, as the following loads are always dependent on the
HeapRegion* just loaded. Even ARM (but not supported Alpha) processors
have an implicit loadload barrier here.
(I could remove it if not wanted).
Actually a more comprehensive change would remove the need to actually
have these loads from the HeapRegion* like envisioned in the comments
to this bug (somewhat like JDK-7127707). That has been deemed to be too
complex at this time, and may in fact open up more of these issues
without very good thought.
There are lots of other places that might or might not be susceptible
to this problems, alone G1CollectedHeap::heap_region_containing() is
used in 61 places, and in very frequently used ones (e.g.
G1CollectedHeap::is_in()). I did not conduct a full-scale review of all
these places here, just briefly looked at the locations. They all seem
to be covered by the implicit loadload barrier as far as I can tell
quickly. That would realistically not complete soon. I will file a CR
for that though.
jprt, running crashing test >3000 times without issue (crash frequency
typically 50-1000 iterations, but will keep on running it over the
weekend) on arm64 hardware - the symptoms shown in the CR very much
point to this exact problem.
More information about the hotspot-gc-dev