RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
vitalyd at gmail.com
Tue May 12 23:43:47 UTC 2015
So let's see: the mutator stores a reference, reads a stale card value, and
skips the card dirtying - the reference has been stored, though. The way
this could happen is if the GC thread flipped the card value to precleaned
but the mutator hadn't seen it yet. However, the card must have been dirty
right before (hence the mutator skipped dirtying it again), which means the
GC thread is going to scan it for references and expects to find the right
ref value. The global fence is issued after the GC flips the card values
but before it processes the cards. So any pending stores in remote write
buffers will be flushed and made globally visible, and the GC will see them
by the time it goes to process the cards.
sent from my phone
On May 12, 2015 6:50 PM, "Aleksey Shipilev" <aleksey.shipilev at oracle.com> wrote:
> On 12.05.2015 23:44, Erik Österlund wrote:
> > It is of course true that a hypothetical OS + hardware combination could
> > /in theory/ be smart enough not to send the TLB purge message to certain
> > CPUs and hence not flush latent stores there. But in practice none that
> > I know of does that, and certainly none that we support.
> Famous last words :)
> > As I said in
> > the original email where I proposed the solution, I already had a look
> > at our architectures (and a few more) in the Linux kernel and XNU/BSD -
> > and it’s safe everywhere. And as I said earlier, the closest match I
> > found that could out-smart the barrier is Itanium, which broadcasts the
> > TLB purge with a special instruction, ptc.ga, rather than an IPI. It
> > takes an address range and purges the corresponding TLB entries.
> > However, according to Intel’s own documentation, even such a fancy
> > solution still flushes all latent memory accesses on remote CPUs
> > regardless.
> Ah, apologies, I must have missed that note. It's here:
> > I don’t know what windows does because it’s not open source, but we only
> > have x86 there, and its hardware has no support for doing it any other
> > way than with IPI messages, which is all we need. And if we feel that
> > scared, windows has a system call that does exactly what we want, and
> > with the architecture I propose it’s trivial to specialize the barrier
> > for windows to use this instead.
> I think I get what you're saying, but I am not convinced. The thing about
> reading stuff in the mutator is to align the actions in the collector with
> the actions in the mutator. So what if you push the IPI to all processors?
> Some unlucky processor will get that interrupt *after* (i.e. too late!)
> both the reference store and the (reordered/stale) card mark read => same
> problem, right? In other words, asking a mutator to do a fence-like op
> after an already missed card mark update solves what?
> Even Dice's article on asymmetric Dekker idioms that is very brave in
> suggesting arcane tricks, AFAIU, doesn't cover the case of "blind"
> mprotect in "slow thread" without reading the protected page in the
> "fast thread". The point of Dice's mprotect construction, AFAIU, is to
> resolve the ordering conundrum by reading the mprotected page in "fast
> thread", so to coordinate "fast thread" with "slow thread".
> > If there was to suddenly pop up a magical fancy OS + hardware solution
> > that is too clever for this optimization (seems unlikely to me) then
> > there are other ways of issuing such a global fence. But I don’t see the
> > point in doing that now when there is no such problem in sight.
> When you are dealing with a platform that has a billion installations,
> millions of developers, and countless different hardware and OS flavors,
> it does not seem very sane to lock in the correctness guarantees on an
> undocumented implementation detail and/or guesses. (Aside: doing that
> for performance is totally fine, we do that all the time.)