RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
erik.osterlund at lnu.se
Thu May 21 12:54:13 UTC 2015
Sorry I thought I sent a reply earlier but it looks like I didn't. :/
On 12/05/15 23:49, Aleksey Shipilev <aleksey.shipilev at oracle.com> wrote:
>On 12.05.2015 23:44, Erik Österlund wrote:
>> I don't know what Windows does because it isn't open source, but we only
>> have x86 there, and its hardware has no support for doing it any other
>> way than with IPI messages, which is all we need. And if we feel that
>> scared, Windows has a system call that does exactly what we want, and
>> with the architecture I propose it's trivial to specialize the barrier
>> for Windows to use this instead.
>I think I get what you tell, but I am not convinced. The thing about
>reading stuff in the mutator is to align the actions in collector with
>the actions in mutator. So what if you push the IPI to all processors.
>Some lucky processor will get that interrupt *after* (e.g. too late!)
>both the reference store and (reordered/stale) card mark read => same
>problem, right? In other words, asking a mutator to do a fence-like op
>after an already missed card mark update solves what?
The IPI is guaranteed to be received on all processors after mprotect begins
and before it returns; otherwise the IPIs would serve no purpose. The purpose
of the cross call is to shoot down TLB entries and make the new permissions
visible. If the IPIs could be delayed until after mprotect returns, the
mechanism simply would not work. And this is all we need.
>> If there was to suddenly pop up a magical fancy OS + hardware solution
>> that is too clever for this optimization (seems unlikely to me) then
>> there are other ways of issuing such a global fence. But I don't see the
>> point in doing that now when there is no such problem in sight.
>When you are dealing with a platform that has a billion of
>installations, millions of developers, countless different hardware and
>OS flavors, it does not seem very sane to lock in the correctness
>guarantees on an undocumented implementation detail and/or guesses.
>(Aside: doing that for performance is totally fine, we do that all the
I understand it might feel a bit icky to rely on OS implementations (even
though unchanged for a long time) rather than an official contract. This
solution was just a suggestion.
Interestingly enough, they had pretty much the same discussion on the urcu
library mailing list. Somebody came up with such a fancy mprotect scheme,
and they discussed whether it was safe and, if so, whether it was wise.
The conclusion was that the kernel people did not want to paint Linux into
a corner (you probably know how much they care about not breaking
userspace). They have therefore been pushing for a sys_membarrier system
call for Linux since 2010. It didn't make it back then because it was
believed that urcu was the only library needing such a mechanism. But now
it seems to be making it into Linux 4.1 anyway, if I understood the
discussions right. Maybe we should push for it a bit too, so they know
more people are interested.
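For reference, the proposed sys_membarrier interface would be used roughly
like this. The syscall number and command values here are assumptions based
on the patches under discussion and may differ from what finally lands, so
a real probe must handle the unsupported case:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Command values from the proposed membarrier patches; defined here in
 * case the libc headers do not have them yet. */
#ifndef MEMBARRIER_CMD_QUERY
#define MEMBARRIER_CMD_QUERY  0
#define MEMBARRIER_CMD_SHARED (1 << 0)
#endif

#ifndef __NR_membarrier
#define __NR_membarrier 324   /* x86_64; assumption, check your kernel */
#endif

/* Returns 0 after a memory barrier has executed on all running threads
 * of the process, or -1 if the kernel does not support the call (the
 * caller then falls back to a fence or the mprotect trick). */
int global_membarrier(void) {
    long cmds = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);
    if (cmds < 0 || !(cmds & MEMBARRIER_CMD_SHARED))
        return -1;
    return (int)syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);
}
```

The QUERY command doubles as the feature probe, which is exactly what we
would want for runtime detection in the VM.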
Anyway, if we want stronger contracts and do not want to rely on the state
of the OS/hardware implementation, the following options seem possible:
1) Fence on all platforms: Obvious and easy. But... :( I really don't like it.
2) Use the already implemented ADS mechanism, which has a pretty solid OS
contract: easy, but the global fence takes around ~15 microseconds on my
machine, and during that time mutators may be locked (globally) and unable
to make reference writes. Meh.
3) Try to do better on platforms offering better contracts:
Detect the Windows and Linux system calls for explicit global fencing and
use them if available, otherwise fall back to naked mprotect. If a new OS
version wants to support fancy TLB-sniping hardware, then that future OS
will certainly have the system call, and the naked mprotect covers
backward compatibility with current and older OS versions where such
fancy TLB sniping did not exist.
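A sketch of option 3's dispatch, with stub fences standing in for the real
mechanisms. On Linux the probe would issue the membarrier query; on Windows
one would use FlushProcessWriteBuffers. All names here are illustrative,
not HotSpot code:

```c
#include <stddef.h>
#include <stdio.h>

typedef void (*global_fence_fn)(void);

/* Stubs; a real VM would wire these to sys_membarrier /
 * FlushProcessWriteBuffers and to the naked-mprotect scheme. */
static void fence_via_syscall(void)  { puts("OS-provided global fence"); }
static void fence_via_mprotect(void) { puts("naked mprotect fallback"); }

/* Runtime probe, hypothetical: returning 0 models an older OS where
 * the explicit global-fence system call does not exist. */
static int os_has_fence_syscall(void) { return 0; }

static global_fence_fn global_fence = NULL;

/* Pick the strongest mechanism the running OS actually provides. */
void global_fence_select(void) {
    global_fence = os_has_fence_syscall() ? fence_via_syscall
                                          : fence_via_mprotect;
}
```

The selection happens once at VM startup, so the fast path is just an
indirect call through the chosen function pointer.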
BSD: Fence, or some other fancy scheme such as sending fencing handshakes
via signals to JavaThreads. Use the cpuid instruction to find out which
physical processors (rather than threads) have seen the handshake, so that
an oversaturated system with too many threads for its own good can exit
early without handshaking all threads, just enough to cover all cores.
Darwin can do even better: the Mach microkernel allows introspection of
thread states using thread_info, so "offline" threads not currently on a
CPU handshake automatically (they don't need flushing/interrupts). I tried
this, and the global fence is slower unless the yieldpoint cooperates in
the handshaking (which requires switching from a global polling page to a
thread-local flag like in Jikes). Then it becomes faster than the mprotect
schemes. But then again, does that guy with 4000 cores have Darwin? Hmm!
Solaris: Tell the guys to add a system call like Linux did/is doing :p
AIX: No idea I'm afraid, but maybe similar to BSD, or just fence.
Now two questions remain:
1) Will it blend? Maybe.
2) Is it worth it? Maybe not if G1 is gonna replace CMS anyway.
Personally, I think a global fence with no requirements on mutator barrier
instructions or mutator-side global locking is a *VERY* useful tool in
hotspot, one we could use more often and not only for thread transitions.
Who knows, maybe it even becomes interesting for G1. The problem right now
is that exactly when the fence is a problem for G1, other stuff is so much
more of a problem that the gain in removing it seems not worth it. But
maybe if we can make that other stuff slim and smooth, it starts becoming
worth looking over that fence too. What do you think?