RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
aleksey.shipilev at oracle.com
Tue May 12 22:49:47 UTC 2015
On 12.05.2015 23:44, Erik Österlund wrote:
> It is of course true that a hypothetical OS + hardware combination could
> /in theory/ be smart enough to not send the TLB purge message to certain
> CPUs and hence not flush latent stores there. But in practice none does
> that which I know of and certainly none that we support.
Famous last words :)
> As I said in
> the original email where I proposed the solution, I already had a look
> at our architectures (and a few more) in the linux kernel and XNU/BSD -
> and it’s safe everywhere. And as I said earlier, the closest match I
> found to out-smart the barrier is itanium that broadcasts the TLB purge
> with a special instruction rather than IPI: ptc.ga. It takes an address
> range and purges the corresponding TLB entries. However, according to
> Intel’s own documentation, even such a fancy solution still flushes all
> latent memory accesses on remote CPUs regardless.
Ah, apologies, I must have missed that note. It's here:
> I don’t know what windows does because it’s not open source but we only have
> x86 there and its hardware has no support for doing it any other way
> than with IPI messages which is all we need. And if we feel that scared,
> windows has a system call that does exactly what we want and with the
> architecture I propose it’s trivial to specialize the barrier for
> windows to use this instead.
I think I get what you are saying, but I am not convinced. The point of
reading something in the mutator is to align the collector's actions with
the mutator's actions. So what if you push the IPI to all processors?
Some lucky processor will get that interrupt *after* (i.e. too late!)
both the reference store and the (reordered/stale) card mark read => same
problem, right? In other words, what does asking a mutator to do a
fence-like op after an already missed card mark update solve?
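
To make the ordering concern concrete, here is a minimal sketch of the
Dekker-style pairing under discussion, written as portable C++ rather than
HotSpot code. Everything in it is an assumption made for illustration: the
names kClean, kDirty, store_ref_cond_mark and preclean_and_scan are
hypothetical, a single card guards a single field, and the mutator-side
full fence shown is the conventional symmetric remedy, not the asymmetric
mprotect/IPI scheme proposed in this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical card values, chosen for this sketch only.
constexpr std::uint8_t kClean = 0xff;
constexpr std::uint8_t kDirty = 0x00;

// Mutator side: reference store followed by a conditional card mark.
// Without the fence, the card load may be satisfied before the field store
// becomes visible, so the mutator can see a stale "dirty" value and skip
// the mark while the collector concurrently cleans the card.
inline void store_ref_cond_mark(std::atomic<void*>& field, void* value,
                                std::atomic<std::uint8_t>& card) {
    field.store(value, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad
    if (card.load(std::memory_order_relaxed) != kDirty) {
        card.store(kDirty, std::memory_order_relaxed);
    }
}

// Collector side (precleaning): clean the card first, then scan the field.
// Returns the reference it observed, or nullptr if the card was not dirty.
inline void* preclean_and_scan(std::atomic<std::uint8_t>& card,
                               const std::atomic<void*>& field) {
    if (card.load(std::memory_order_relaxed) != kDirty) {
        return nullptr;
    }
    card.store(kClean, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad
    return field.load(std::memory_order_acquire);
}

// Single-threaded walk through the protocol; it checks the bookkeeping
// only and does not exercise the race itself.
inline bool single_threaded_demo() {
    std::atomic<std::uint8_t> card{kClean};
    std::atomic<void*> field{nullptr};
    int obj = 0;
    store_ref_cond_mark(field, &obj, card);          // clean card gets marked
    bool ok = (card.load() == kDirty);
    ok = ok && (preclean_and_scan(card, field) == &obj);
    ok = ok && (card.load() == kClean);              // collector cleaned it
    return ok;
}
```

With a full fence on both sides, at least one thread must observe the
other's store: either the mutator sees the cleaned card and dirties it
again, or the collector sees the new reference when it scans. The question
raised above is what replaces the mutator-side guarantee once that fence is
removed and the mutator never reads anything the collector has touched.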
Even Dice's article on asymmetric Dekker idioms, which is very brave in
suggesting arcane tricks, AFAIU doesn't cover the case of a "blind"
mprotect in the "slow thread" without reading the protected page in the
"fast thread". The point of Dice's mprotect construction, AFAIU, is to
resolve the ordering conundrum by reading the mprotected page in the "fast
thread", so as to coordinate the "fast thread" with the "slow thread".
> If there was to suddenly pop up a magical fancy OS + hardware solution
> that is too clever for this optimization (seems unlikely to me) then
> there are other ways of issuing such a global fence. But I don’t see the
> point in doing that now when there is no such problem in sight.
When you are dealing with a platform that has a billion
installations, millions of developers, countless different hardware and
OS flavors, it does not seem very sane to lock in the correctness
guarantees on an undocumented implementation detail and/or guesses.
(Aside: doing that for performance is totally fine, we do that all the time)