RFR: 8267073: Race between Card Redirtying and Freeing Collection Set regions results in missing remembered set entries with G1

Thomas Schatzl tschatzl at openjdk.java.net
Thu Jun 10 06:59:25 UTC 2021

On Wed, 9 Jun 2021 11:16:13 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

> Hi all,
>   can I have reviews for this change that fixes a race between card redirtying and collection set freeing?
> So card redirtying tries to be clever about which cards to redirty by not keeping (redirtying) cards in regions that have been fully evacuated as they can't have any references. (We did collect them exactly in the case of evac failure of an old region X: if that region X fails evacuation it is kept and may still contain references to other old regions Y; if it is successfully evacuated, then obviously we keep these cards for nothing):
>       bool RedirtyLoggedCardTableEntryClosure::will_become_free(HeapRegion* hr) const {
>         // A region will be freed by during the FreeCollectionSet phase if the region is in the
>         // collection set and has not had an evacuation failure.
>         return _g1h->is_in_cset(hr) && !hr->evacuation_failed();
>       }
> At the same time, since [JDK-8214237](https://bugs.openjdk.java.net/browse/JDK-8214237), G1 frees the collection set which modifies both the is-in-cset field in the attribute table in `is_in_cset()` and the region's `_evacuation_failed` field which yields to returning the wrong value and so dropping cards from the dirty card queue (and in extension from the remembered sets).
> The suggested fix consists of two parts:
>   1) do not have the free collection set phase change the is-in-cset field in the attribute table
>   2) do not have the free collection set phase change the `_evacuation_failed` field in the `HeapRegion`
> The first item is fairly trivial, there is actually no need to clear that field at that point. Almost right after this phase the entire attribute table is reset again (in `start_new_collection_set()`).
> The second is a bit trickier, as if we did not clear then, we would need to do it at a later point. So while this could be done, adding the clearing somewhere in the `pre_evacuate_collection_set_phase()`, the easier (and imo better solution) is to move that `_evacuation_failed` field for regions that is only necessary during young gc, out of `HeapRegion` and collecting it in a side table.
> There are two reasons for this: first I want to in the future split out all transient, young-gc related data and code from `G1CollectedHeap` into a separate class like `G1FullCollector`, and we need that information available in such a side table for future improved pinned region support [JDK-8254167](https://bugs.openjdk.java.net/browse/JDK-8254167).
> After this change a test where I could reproduce this failure fairly consistently with `-XX:+VerifyAfterGC` (rate ~1/100), did not fail in thousands of iterations. 
> Testing: tier1-5
> Thanks,
>   Thomas

There are a few concerns in addition to those that you mentioned that does not make it that proposal clear cut:
  * The difference between the attribute table and this status is the lifetime: we need to attribute table always as we update it as we allocate more regions at the moment - this is something that may be reconsidered. Especially if we want to improve region allocation during mutator time (i.e. making it lock-free or something) where it might be more convenient to not try to make this update atomic too, and make the attribute table only an per-collection data structure.
  * evacuation failure is at this time (not considering region pinning yet) is something exceptional, and may warrant extra side data structures for cachability. In most cases we do not need it anyway.
  * we need to atomically update this flag (compared to the other stuff in the attribute table), and there is no infrastructure for that.
  * the attribute table is supposed to be read-only during actual gc to avoid reloads during gc (supposedly) for performance. For the same reason the `_humongous_reclaim_candidate` state is also separate. We may certainly want to combine this one with the evacuation failure table. There is a few things in `G1CollectedHeap` I would like to move to a `G1YoungGC(Context)` similar to the full gc. That will hopefully make reasoning about the code easier if less is crammed into `G1CollectedHeap`.
  * for JDK-8254167 we need another side table actually collecting the region ids that are affected (we have this idiom elsewhere too, so it might be useful to actually factor out an appropriate data structure). This might be another candidate to join with above mentioned two tables.
  * I would like to at least attempt to keep space in the attribute table free for some kind of remembered set identifier :), for fast access (as I hope there will not be an 1:1 mapping soon'ish anymore) but that is more an idea floating at the back of my mind and may not be necessary.


Tier1-5 testing completed with no particular issues so I opened it as ready for review.


PR: https://git.openjdk.java.net/jdk/pull/4429

More information about the hotspot-gc-dev mailing list