Removing G1 Reference Post Write Barrier StoreLoad Barrier
erik.osterlund at lnu.se
Mon Dec 22 19:51:45 UTC 2014
So I hacked together an implementation using mprotect, which seems to be the most scalable way of issuing a global store buffer serialization using IPI.
The local running time of the flush for the CPU running the fence is ~640 nanoseconds, so a bit costly. However, the good news is that it should scale well with the number of threads; the IPI message to flush store buffers is sent to each CPU core the process has threads scheduled on, so it does not matter how many threads there are.
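For concreteness, here is a hedged sketch of what such an mprotect-based flush could look like on Linux (the identifiers are mine, not from the attached patch): revoking and restoring write permission on a dummy page forces the kernel to send a TLB-shootdown IPI to every CPU running a thread of the process, and servicing that interrupt drains each remote store buffer as a side effect.

```cpp
// Hypothetical sketch of a global store-buffer "rendezvous" via mprotect.
// The protection change triggers a cross-CPU TLB-shootdown IPI, which
// serializes remote store buffers as a side effect.
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

static void* dummy_page = nullptr;
static const size_t kPageSize = 4096;

void rendezvous_init() {
  dummy_page = mmap(nullptr, kPageSize, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  assert(dummy_page != MAP_FAILED);
  *(volatile char*)dummy_page = 1;  // touch the page so it is mapped in
}

// Global store-buffer serialization (~640 ns locally, per the measurement above).
void rendezvous() {
  int rc = mprotect(dummy_page, kPageSize, PROT_READ);
  assert(rc == 0);
  rc = mprotect(dummy_page, kPageSize, PROT_READ | PROT_WRITE);
  assert(rc == 0);
}
```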
In order to deal with the local synchronization costs I made card processing buffered, as I mentioned in the previous post: it batches 128 cards and processes them using only a single expensive global fence every 128 cards (amortized 5 nanoseconds per card). Of course the size of this buffer can be tweaked to meet a satisfactory tradeoff between pause times and throughput.
Mutators and concurrent refinement threads use this batched card processing with asymmetric Dekker synchronization. Card processing in safepoints, however, reverts to the old variant. This is safe because once we enter a safepoint, we already know that pending stores from mutators have been serialized.
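The mutator side then needs no fence at all. A sketch of the resulting fence-free post-write barrier (pseudo-HotSpot; names and the card-index arithmetic are illustrative, and the SATB/young-gen/cross-region filters are assumed to have already passed):

```cpp
// Fence-free mutator post-write barrier: note there is NO StoreLoad fence
// between the reference store and the card load -- that is the whole point.
// The refinement-side rendezvous makes the possible reordering safe.
#include <cstddef>

typedef unsigned char CardValue;
const CardValue kDirtyCard = 0;   // G1 uses 0 for dirty
const CardValue kCleanCard = 1;   // illustrative clean value
const int kCardShift = 9;         // 512-byte cards

static CardValue card_table[1024];

static void post_write_barrier(void** field, void* new_val) {
  *field = new_val;                              // 1. reference store
  // (no StoreLoad fence here)
  CardValue* card = &card_table[((size_t)field >> kCardShift) % 1024];
  if (*card == kDirtyCard) return;               // 2. load card; dirty => done
  *card = kDirtyCard;                            // 3. dirty it for refinement
}
```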
I have attached a patch file with the code for the curious reader, if anyone wants to play around with performance and try it out. I ran a few DaCapo benchmarks, and benchmarks like h2 that have quite a few inter-regional pointers seem to benefit from it (~3600 msec vs ~3800 msec on average after 10 warm-up iterations). But I bet you guys have your own favourite benchmarks that you want to run.
If you think the idea seems good, I could prepare a more tidied up change set.
> On 22 Dec 2014, at 17:09, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
> One concern regarding the use of asymmetric Dekker synchronization
> (ADS) is how well this technique scales to 1000's of threads. Do you have an
> implementation where you can measure the scalability?
> On 12/19/2014 10:06 AM, Erik Österlund wrote:
>> Hi all,
>> The StoreLoad barrier in G1’s reference post write barrier is stinging in my eyes, so I sat down today and tried to figure out how to remove it, and I think I found a solution. Thought I’d share my thoughts here and see if I have understood the problem or if I’m missing something.
>> The wanted invariant is that the post-barrier happens, well yeah after the reference write (duhh… deep insight I know). However, even TSO architectures allow the use of store buffers and hence enforce the use of painful StoreLoad fences.
>> So abstracting away some unimportant things in the barrier (like SATB, young gen check, cross region check etc), the wanted control flow for the write barrier is:
>> T1: write reference
>> T1: load card, check if dirty, if so skip ahead
>> So my understanding of the problem is that, due to the aforementioned reasons, these two operations are allowed to reorder because the reference write can be stuck for a while in the store buffer. The current solution is to emit an expensive StoreLoad fence before loading the card to check if it is dirty.
>> Without that fence and other precautions, this possible ordering could happen:
>> __ card is visibly dirty __
>> Step 1. T1 (mutator): load card -> is dirty, skip rest of barrier
>> Step 2. T2 (conc refinement): clean and process card with the old reference globally visible at this point
>> Step 3. T1 (mutator): write ref (lazy serialization due to store buffer) <— reference write is forgotten
>> As we can see, in step 3 the mutator reference write is unfortunately forgotten due to the reference write staying in the store buffer until after the card has been processed by the concurrent refinement thread.
>> The good news is that this race can be solved with asymmetric Dekker synchronization, which comes in different flavours. The idea is to allow the reference write and card load to be reordered, as long as they both happen before references are processed by the concurrent refinement thread. We achieve this by moving the synchronization overheads to the concurrent refinement thread instead of the mutator post write barrier.
>> To do this we introduce a new operation, here called rendezvous, which will make sure that remote CPU store buffers have been drained. Given the existence of this operation, card refinement could be rewritten into this form:
>> 1. *card = clean
>> 2. storeload <— guarantee local store buffer is drained so mutators see the card is clean
>> 3. rendezvous <— guarantee remote store buffer is drained so their reference writes are visible, even in case of the unfortunate reordering
>> 4. process card
>> Or in an amortised form:
>> 1. for all cards in buffer, clean
>> 2. storeload
>> 3. rendezvous
>> 4. for all cards in buffer, process card
>> The rendezvous will prevent the unfortunate race in the previous example by making sure both step 1 and 3 happen before step 2 in total order.
>> Now of course, this brings us to the question of how to implement the rendezvous operation. Windows has a neat system call, FlushProcessWriteBuffers, which will do exactly what we need using IPI. AFAIK calls to mprotect are also required to flush the store buffers of remote CPUs in a similar fashion, which could be used on UNIX systems. Or one could simply wait long enough that the ~32ish store buffer entries have for sure been serialized. Or put one thread on each CPU (using affinity) and make a handshake with them using a condition variable - the context switch should already serialize the store buffer.
>> If I have not misunderstood the problem, it would be nice to get rid of the evil StoreLoad fence, and I could volunteer to provide a fix, but I would need some help to do performance regression and testing. The idea relies on moving the synchronization costs to the concurrent refinement threads and amortise those synchronization costs away by batching card refinement.