RFR: 8267073: Race between Card Redirtying and Freeing Collection Set regions results in missing remembered set entries with G1

Kim Barrett kbarrett at openjdk.java.net
Thu Jun 10 13:12:16 UTC 2021

On Thu, 10 Jun 2021 10:31:51 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

>> Hi all,
>>   can I have reviews for this change that fixes a race between card redirtying and collection set freeing?
>> So card redirtying tries to be clever about which cards to redirty by not keeping (redirtying) cards in regions that have been fully evacuated as they can't have any references. (We did collect them exactly in the case of evac failure of an old region X: if that region X fails evacuation it is kept and may still contain references to other old regions Y; if it is successfully evacuated, then obviously we keep these cards for nothing):
>>       bool RedirtyLoggedCardTableEntryClosure::will_become_free(HeapRegion* hr) const {
>>         // A region will be freed by during the FreeCollectionSet phase if the region is in the
>>         // collection set and has not had an evacuation failure.
>>         return _g1h->is_in_cset(hr) && !hr->evacuation_failed();
>>       }
>> At the same time, since [JDK-8214237](https://bugs.openjdk.java.net/browse/JDK-8214237), G1 frees the collection set which modifies both the is-in-cset field in the attribute table in `is_in_cset()` and the region's `_evacuation_failed` field which yields to returning the wrong value and so dropping cards from the dirty card queue (and in extension from the remembered sets).
>> The suggested fix consists of two parts:
>>   1) do not have the free collection set phase change the is-in-cset field in the attribute table
>>   2) do not have the free collection set phase change the `_evacuation_failed` field in the `HeapRegion`
>> The first item is fairly trivial, there is actually no need to clear that field at that point. Almost right after this phase the entire attribute table is reset again (in `start_new_collection_set()`).
>> The second is a bit trickier, as if we did not clear then, we would need to do it at a later point. So while this could be done, adding the clearing somewhere in the `pre_evacuate_collection_set_phase()`, the easier (and imo better solution) is to move that `_evacuation_failed` field for regions that is only necessary during young gc, out of `HeapRegion` and collecting it in a side table.
>> There are two reasons for this: first I want to in the future split out all transient, young-gc related data and code from `G1CollectedHeap` into a separate class like `G1FullCollector`, and we need that information available in such a side table for future improved pinned region support [JDK-8254167](https://bugs.openjdk.java.net/browse/JDK-8254167).
>> After this change a test where I could reproduce this failure fairly consistently with `-XX:+VerifyAfterGC` (rate ~1/100), did not fail in thousands of iterations. 
>> Testing: tier1-5
>> Thanks,
>>   Thomas
> src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 871:
>> 869:   volatile uint _num_regions_failed_evacuation;
>> 870:   // Records for every region on the heap whether evacuation failed for it.
>> 871:   volatile bool* _regions_failed_evacuation;
> Consider using BitMap and par_xxx operations instead of an array of bool and explicit atomic operations.

I would also be fine with the change as is, if that gets this fix for a fairly noisy bug pushed sooner.  So also consider a followup RFE to switch to using BitMap.


PR: https://git.openjdk.java.net/jdk/pull/4429

More information about the hotspot-gc-dev mailing list