Removing G1 Reference Post Write Barrier StoreLoad Barrier
jon.masamitsu at oracle.com
Mon Dec 22 17:09:02 UTC 2014
One concern regarding the use of asymmetric Dekker synchronization
(ADS) is how well this technique scales to 1000's of threads. Do you
have an implementation where you can measure the scalability?
On 12/19/2014 10:06 AM, Erik Österlund wrote:
> Hi all,
> The StoreLoad barrier in G1’s reference post write barrier is stinging in my eyes, so I sat down today and tried to figure out how to remove it, and I think I found a solution. Thought I’d share my thoughts here and see if I have understood the problem or if I’m missing something.
> The wanted invariant is that the post-barrier happens, well yeah, after the reference write (duhh… deep insight I know). However, even TSO architectures let stores linger in store buffers, allowing a later load to effectively pass an earlier store, and hence enforcing this ordering requires a painful StoreLoad fence.
> So abstracting away some unimportant things in the barrier (like SATB, young gen check, cross region check etc), the wanted control flow for the write barrier is:
> T1: write reference
> T1: load card, check if dirty, if so skip ahead
> So my understanding of the problem is that these two operations are allowed to reorder, because the reference write can be stuck for a while in the store buffer. The current solution is to emit an expensive StoreLoad fence before loading the card to check if it is dirty.
> Without that fence and other precautions, this possible ordering could happen:
> __ card is visibly dirty __
> Step 1. T1 (mutator): load card -> is dirty, skip rest of barrier
> Step 2. T2 (conc refinement): clean and process card with the old reference globally visible at this point
> Step 3. T1 (mutator): write ref (lazy serialization due to store buffer) <— reference write is forgotten
> As we can see, in step 3 the mutator reference write is unfortunately forgotten due to the reference write staying in the store buffer until after the card has been processed by the concurrent refinement thread.
> The good news is that this race can be solved with asymmetric dekker synchronization, which comes in different flavours. The idea is to allow the reference write and card load to be reordered, as long as they both happen before references are processed by the concurrent refinement thread. We achieve this by moving the synchronization overheads to the concurrent refinement thread instead of the mutator post write barrier.
> To do this we introduce a new operation, here called rendezvous, which will make sure that remote CPU store buffers have been drained. Given the existence of this operation, card refinement could be rewritten into this form:
> 1. *card = clean
> 2. storeload <— guarantee local store buffer is drained so mutators see the card is clean
> 3. rendezvous <— guarantee remote store buffer is drained so their reference writes are visible, even in case of the unfortunate reordering
> 4. process card
> Or in an amortised form:
> 1. for all cards in buffer, clean
> 2. storeload
> 3. rendezvous
> 4. for all cards in buffer, process card
> The rendezvous will prevent the unfortunate race in the previous example by making sure both step 1 and 3 happen before step 2 in total order.
> Now of course, this brings us to the question of how to implement the rendezvous operation. Windows has a neat system call, FlushProcessWriteBuffers, which does exactly what we need using IPIs. AFAIK calls to mprotect are also required to flush the store buffers of remote CPUs in a similar fashion, which could be used on UNIX systems. Or one could simply wait long enough that the ~32ish store buffer entries have for sure been serialized. Or put one thread on each CPU (using affinity) and make a handshake with them using a condition variable - the context switch should already serialize the store buffer.
> If I have not misunderstood the problem, it would be nice to get rid of the evil StoreLoad fence, and I could volunteer to provide a fix, but I would need some help to do performance regression and testing. The idea relies on moving the synchronization costs to the concurrent refinement threads and amortise those synchronization costs away by batching card refinement.