Removing G1 Reference Post Write Barrier StoreLoad Barrier
erik.osterlund at lnu.se
Tue Dec 23 13:23:53 UTC 2014
Thanks for the comments.
> On 22 Dec 2014, at 22:19, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> On Mon, 2014-12-22 at 20:30 +0000, Erik Österlund wrote:
>> Hi Thomas,
>> My assumption is more about fast/slow code paths than it is about
>> fast/slow threads.
> Fast/slow threads was what I have been thinking of. If mutators are
> picking up work, and are probably going to do most of the work, there
> are no distinct slow/fast threads.
>> And reference writes are something I consider a fast path. Although
>> the frequency of inter regional pointer writes is different in
>> different applications, I think that by having a storeload fence
>> in this G1 barrier, it gives rise to some awkward cases like sorting
>> large linked lists where performance becomes suboptimal, so it
>> would be neat to get rid of it and get more consistent and
>> resilient performance numbers.
> Sorting linked lists is suboptimal with current G1 with or without the
> change as every reference write potentially creates a card entry. I
> guess most time will be spent in actual refinement in this case anyway.
Yeah, true, but it would get better anyway: the concurrent refinement threads may be able to take much of the hit if resources are available for that, and even if they are not, many fences may be triggered on the same card, because the barrier currently fences whether the card was dirty or not. Of course there’s no point speculating too much; I think just benchmarking it would be better, so that it can be quantified.
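To illustrate the point about repeated fences on the same card, here is a toy model (not HotSpot code). The names `BarrierModel`, `post_write` and `count_fences` are made up for this sketch, and the real barrier applies further filters that are left out here. It only counts how many fences the barrier issues when many cross-region writes hit one card.

```python
from dataclasses import dataclass, field

CLEAN, DIRTY = 0, 1


@dataclass
class BarrierModel:
    """Toy model of a card-marking post-write barrier (hypothetical names)."""
    fence_in_barrier: bool
    cards: dict = field(default_factory=dict)  # card index -> CLEAN or DIRTY
    queue: list = field(default_factory=list)  # dirty cards awaiting refinement
    fences: int = 0

    def post_write(self, card: int, cross_region: bool) -> None:
        if not cross_region:
            return                   # same-region store: filtered out early
        if self.fence_in_barrier:
            self.fences += 1         # StoreLoad issued before the card is read
        if self.cards.get(card, CLEAN) == DIRTY:
            return                   # already dirty: nothing to enqueue
        self.cards[card] = DIRTY
        self.queue.append(card)


def count_fences(writes, fence_in_barrier):
    """Return (fences issued in the barrier, cards enqueued) for a write trace."""
    model = BarrierModel(fence_in_barrier)
    for card, cross_region in writes:
        model.post_write(card, cross_region)
    return model.fences, len(model.queue)
```

With 1000 cross-region writes to a single card, the fencing variant issues 1000 fences to enqueue one card, while the variant without the fence enqueues the same card and leaves the synchronisation to whoever later processes the batch.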
>> With that being said, the local cost for issuing this global fence
>> (~640 nano seconds on my machine and my implementation based on
>> mprotect which seems the most portable) is amortised away for both
>> concurrent refinement threads and mutators alike since they both
>> buffer cards to be processed and can batch them and amortise the cost.
>> I currently batch 128 cards at a time and the cost of the global
>> fence seems to have vanished.
> Some Google search indicates that e.g. sparc m7 systems may have up to
> 1024 cores with 4096 threads (a very extreme example). Larger Intel
> systems may also have 100+ threads. Current two socket Intel systems
> reach 32+ threads.
Wow. That’s a lot of cores. Santa, is it too late to change my list? ;)
> Mprotect will flush all store buffers of all processors every time. So
> you charge everyone (also not only the VM; consider running multiple VMs
> on a system). This is what Jon has been concerned about, how scalable is
Are you sure about this? Since it uses IPIs to flush the store buffers, I was quite certain they are sent only to the CPUs the process is scheduled to run on. I would be very surprised if that was not the case, as it would be a huge blunder in the implementation of mprotect, but I haven’t checked the source code on every platform out there.
> There is chance that a (potentially much more local) StoreLoad is much
> less expensive than mprotect with a modestly large system overall.
>> If I understand you correctly, the frequency of invoking card
>> refinement from mutators might have to increase by giving them
>> smaller dirty card buffers because we can’t have too many dirty
>> cards hanging around per mutator thread if we want to have good
>> latency and lots of threads?
> This is one problem, the other that mutator threads themselves need to
> do more refinement than they do with current settings. Which means more
> full buffers with more frequent mprotect calls. There may be some
> possibility to increase the buffer size, but 128 cards seems already
> somewhat large (I have not measured it, do not even know the current
> buffer size).
I measured it in the benchmarks I ran, and it was 2048 cards per buffer processed. So if the batch size is the same as the buffer size, 2048, then even on a system with 1024 cores it would be half a memory fence per card, which is arguably still going to be strictly less than it is now.
If somebody had even more cores, like a billion or something, it would be possible to make a more advanced variant where not every refinement thread issues the global fence. Instead, the global fences are timestamped, and if one refinement thread issues a global fence, the others won’t have to for cards cleaned before that timestamp. That way, the punishment on the other cores in such a massive system could be controlled without causing too much trouble.
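A minimal sketch of the timestamping idea, using a logical clock. This is only a single-process model: `TimestampedGlobalFence`, `now`, `ensure_fenced_since` and the `issue_fence` callback (standing in for the mprotect-based global fence) are hypothetical names, not anything in HotSpot.

```python
import threading


class TimestampedGlobalFence:
    """Lets refinement threads skip a global fence that another one already paid for."""

    def __init__(self, issue_fence):
        self._issue_fence = issue_fence  # hypothetical stand-in for the global fence
        self._lock = threading.Lock()
        self._clock = 0                  # logical time, bumped on every event
        self._last_fence_start = -1      # start time of the latest completed fence
        self.issued = 0                  # number of fences actually issued

    def now(self):
        """Take a timestamp, e.g. right after a batch of cards has been cleaned."""
        with self._lock:
            self._clock += 1
            return self._clock

    def ensure_fenced_since(self, cleaned_at):
        """Ensure a global fence started after `cleaned_at`; return True if one was issued."""
        with self._lock:
            if self._last_fence_start > cleaned_at:
                return False             # a later fence already covers these cards
            self._clock += 1
            start = self._clock
            # Holding the lock while fencing keeps the model simple; a real
            # implementation would likely want something less serialising.
            self._issue_fence()
            self._last_fence_start = start
            self.issued += 1
            return True
```

If two threads clean their batches at times 1 and 2 and the first one then fences, the second finds a fence that started after its cards were cleaned and skips its own. A batch cleaned after that fence started has to pay for a new one.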
> DaCapo is unfortunately not a particular good benchmark on large
> systems. Even h2 runs very well with a few 100MBs of heap and is very
>> In that case, the minimum size of
>> mutator dirty card buffers could impact the batch size so the
>> constants matter here. But 128 seems like a rather small constant,
>> do you think we would run into a situation where that matters?
> No, not particularly in that situation, see above. 128 cards may already
> be quite a lot of work to do though.
If the buffer size is 2048 (don’t know if that changes dynamically?) then it shouldn’t be too much.
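To put rough numbers on the batching argument, here is the arithmetic as a small sketch. The function names are made up, and it assumes the ~640 ns I measured on my machine stays constant, which it may well not on a much larger system.

```python
def amortised_ns_per_card(fence_ns: float, batch: int) -> float:
    """Local cost of one global fence spread over a batch of cards."""
    return fence_ns / batch


def remote_fences_per_card(cores: int, batch: int) -> float:
    """Fences forced on other cores per card, if one global fence hits every core."""
    return cores / batch


print(amortised_ns_per_card(640, 128))     # batch of 128 cards
print(amortised_ns_per_card(640, 2048))    # batch of 2048 cards
print(remote_fences_per_card(1024, 2048))  # 1024 cores, batch of 2048 cards
```

The last line is the "half a memory fence per card" figure for the 1024-core case.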
But I do not quite understand the whole argument that we need smaller buffers if we have more cores and threads. I understand that many threads means many buffers hanging around, but if we have many cores we also have more threads processing them in the safepoints to reduce the pause time, right? So it seems that it’s when the ratio of threads to available cores gets nasty that the buffer size should be reduced to retain the same pause times.
And even then, since the number of cores is constant, the rate at which card entries are produced is the same regardless of the number of threads over-saturating the system (roughly speaking). So even if the private card buffers per mutator get smaller to reduce pause times, they could be processed at the same speed as before and hence with the same batch size. By having smaller private buffers sticking to the threads and joining them together into larger public batches to be processed, I believe this could be dealt with in a scalable fashion.
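A sketch of what I mean by joining small private buffers into larger public batches. `PublicBatcher`, `hand_over` and the `process_batch` callback (imagine one global fence followed by refining the cards) are hypothetical names, and the model is single-threaded, so a real version would need synchronisation around the public queue.

```python
from collections import deque


class PublicBatcher:
    """Collects small per-thread card buffers and processes them in fixed-size batches."""

    def __init__(self, batch_size, process_batch):
        self.batch_size = batch_size
        self.process_batch = process_batch  # hypothetical: fence once, then refine
        self.pending = deque()              # cards handed over but not yet batched
        self.batches = 0                    # number of batches processed so far

    def hand_over(self, private_buffer):
        """Called when a thread's private buffer fills up (or at a safepoint)."""
        self.pending.extend(private_buffer)
        while len(self.pending) >= self.batch_size:
            batch = [self.pending.popleft() for _ in range(self.batch_size)]
            self.process_batch(batch)
            self.batches += 1
```

For example, 64 threads each handing over a 256-card private buffer still yield batches of 2048 cards, so the cost per global fence is amortised over the same number of cards as with large private buffers.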
>> Personally I think that if somebody has a billion threads and doesn’t
>> do anything other than inter-regional pointer writes and at the same
>> time expects flawless latency, then perhaps they should rethink what
>> they are doing haha!
>> Hmm or maybe a VM flag could let the user choose
>> if they have weird specific requirements? UseMembar seems to already
>> be used for situations like these.
> I think we will give your changes a few tries, at least run it through a
> few tests.
Thank you. It will be interesting to see. :)
Oh, and happy Christmas!