RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
erik.osterlund at lnu.se
Mon May 11 15:59:37 UTC 2015
On 11 May 2015, at 14:41, Andrew Haley <aph at redhat.com> wrote:
On 05/11/2015 12:33 PM, Erik Österlund wrote:
On 11 May 2015, at 11:58, Andrew Haley <aph at redhat.com> wrote:
On 05/11/2015 11:40 AM, Erik Österlund wrote:
I have heard statements like this, that such a mechanism would not work
on RMO, but never got an explanation why it would work only on
TSO. Could you please elaborate? I studied some kernel sources for
a bunch of architectures and kernels, and it seems as far as I can
see all good for RMO too.
Dave Dice himself told me that the algorithm is not in general safe
for non-TSO. Perhaps, though, it is safe in this particular case. Of
course, I may be misunderstanding him. I'm not sure of his reasoning
but perhaps we should include him in this discussion.
I see. It would be interesting to hear his reasoning, because it is
not clear to me.
From my point of view, I can't see a strong argument for doing this on
AArch64. StoreLoad barriers are not fantastically expensive there so
it may not be worth going to such extremes. The cost of a StoreLoad
barrier doesn't seem to be so much more than the StoreStore that we
have to have anyway.
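
For readers following the barrier discussion, here is a minimal C++ sketch of the two mutator-side write barriers being compared. It is my reading of the thread, not HotSpot code: the function names, the card values and the use of std::atomic are all assumptions made for illustration. The conditional variant needs a StoreLoad between the reference store and the card load; the unconditional variant only needs the reference store ordered before the card store.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative card values (assumption, not taken from the thread).
constexpr std::uint8_t kClean = 0xff;
constexpr std::uint8_t kDirty = 0x00;

// Hypothetical conditional card mark: store the reference, then dirty the
// card only if it is not already dirty. Returns true if the card was written.
inline bool store_ref_cond_card_mark(std::atomic<void*>& field,
                                     void* new_ref,
                                     std::atomic<std::uint8_t>& card) {
    field.store(new_ref, std::memory_order_relaxed);
    // Full fence: keeps the card load below from being satisfied before
    // the reference store above is visible (the StoreLoad case).
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (card.load(std::memory_order_relaxed) == kDirty) {
        return false;  // already dirty, skip the write
    }
    card.store(kDirty, std::memory_order_relaxed);
    return true;
}

// Hypothetical unconditional card mark: no card load, so a release fence
// (which includes StoreStore ordering) between the two stores is enough.
inline void store_ref_uncond_card_mark(std::atomic<void*>& field,
                                       void* new_ref,
                                       std::atomic<std::uint8_t>& card) {
    field.store(new_ref, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    card.store(kDirty, std::memory_order_relaxed);
}
```

How much the seq_cst fence costs relative to the release fence is exactly the hardware-dependent question raised above; the sketch only shows where each fence sits.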
Yeah, about performance, I’m not sure when it’s worth removing these
fences, and on what hardware.
Your algorithm (as I understand it) trades a moderately expensive (but
purely local) operation for a very expensive global operation, albeit
with much lower frequency. It's not clear to me how much we value
continuous operation versus faster operation with occasional global
stalls. I suppose it must be application-dependent.
From my perspective the idea is to move the synchronization overhead from a place where it cannot be amortized away (memory access) to a code path where it can be pretty much arbitrarily amortized away (batched cleaning). We couldn’t fence every n memory accesses, but we certainly can global fence every n cards (batched), where we can pick a suitable n where the related synchronization overheads seem to vanish.
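
The amortization argument above can be sketched in C++ roughly as follows. This is a toy model under stated assumptions, not the actual patch: global_fence() and scan_card() are hypothetical stand-ins (the former only counts calls instead of stalling other CPUs), the card values are illustrative, and the "clean, then fence, then scan" ordering is one plausible reading rather than something spelled out in this thread. The point is only that a batch size of n turns one fence per card into roughly one fence per n cards.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative card values (assumption, not taken from the thread).
constexpr std::uint8_t kClean = 0xff;
constexpr std::uint8_t kDirty = 0x00;

struct PrecleanStats {
    std::size_t cards_cleaned = 0;
    std::size_t global_fences = 0;
};

// Hypothetical stand-in for a process-wide fence that forces every
// mutator's pending stores to become visible. Here it only counts calls.
inline void global_fence(PrecleanStats& stats) { ++stats.global_fences; }

// Hypothetical placeholder for scanning the objects covered by one card.
inline void scan_card(std::size_t /*card_index*/) {}

// Clean dirty cards in batches of batch_size (expected to be >= 1):
// mark each card clean, issue ONE global fence per batch, then scan.
PrecleanStats preclean_batched(std::vector<std::uint8_t>& cards,
                               std::size_t batch_size) {
    PrecleanStats stats;
    std::vector<std::size_t> batch;
    batch.reserve(batch_size);

    auto flush = [&] {
        if (batch.empty()) return;
        global_fence(stats);  // cost shared by every card in the batch
        for (std::size_t idx : batch) scan_card(idx);
        batch.clear();
    };

    for (std::size_t i = 0; i < cards.size(); ++i) {
        if (cards[i] != kDirty) continue;
        cards[i] = kClean;
        ++stats.cards_cleaned;
        batch.push_back(i);
        if (batch.size() == batch_size) flush();
    }
    flush();  // last, possibly partial, batch
    return stats;
}
```

With d dirty cards the sketch issues ceil(d / n) global fences, so the per-card share of the fence cost shrinks as n grows; what a real global fence costs on the other CPUs is not modeled here.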
Also, the global operation is not purely but “mostly” locally expensive, for the thread performing the global fence. The cost on the other CPUs is roughly that of a normal fence. Of course there is always gonna be that one guy with 4000 CPUs, which might be a bit awkward. But even then, with a high enough n, shared, timestamped global fences, etc., even such ridiculous scalability should be within reach.
BTW do we normally have some kind of reasonable scalability window we optimize for, and don’t care as much about optimizing for that potential one guy? ;)
In this case though, if it makes us any happier, I think we could
probably get rid of the storestore barrier too:
The latent reference store is forced to serialize anyway after the
dirty card value write is observable and about to be cleaned. So the
potential consistency violation, where the card looks dirty and then the
cleaning thread reads a stale reference value, could not happen with
my approach even without storestore hardware protection. I didn’t
give it too much thought but off the top of my head I can’t see any
problems. If we want to get rid of storestore too I can give it some
That is very interesting.
But you know much better than me if these fences are problematic or
Not really. AArch64 is an architecture not an implementation, and is
designed to be implemented using a wide range of techniques. Instead
of having very complex cores, some designers seem to have decided it
makes sense to have many of them on a die. It may well be, though,
that some implementers will adopt an x86-like highly-superscalar
architecture with a great deal of speculative execution. I can only
predict the past... My approach with this project has been to do
things in the most straightforward way rather than trying to optimize
for whatever implementations I happen to have available.
I see your point of view: you don’t want to be that dependent on the hardware and elected to go with a straightforward synchronization solution for this reason. This makes sense. But since we are dealing with an optimization feature here (UseCondCardMark), I believe a less straightforward solution makes us less dependent on such hardware details. Because it is an optimization, the highest possible performance is probably expected and even important, and that performance becomes very tightly dependent on the cost of fencing, which probably varies a lot between hardware vendors.
Conversely, the possibly less straightforward synchronization solution dodges this bullet by simply not fencing and arbitrarily amortizing away the related synchronization costs until they vanish. :)