RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
aleksey.shipilev at oracle.com
Tue May 12 13:05:27 UTC 2015
On 11.05.2015 16:41, Andrew Haley wrote:
> On 05/11/2015 12:33 PM, Erik Österlund wrote:
>> Hi Andrew,
>>> On 11 May 2015, at 11:58, Andrew Haley <aph at redhat.com> wrote:
>>> On 05/11/2015 11:40 AM, Erik Österlund wrote:
>>>> I have heard statements like this that such mechanism would not work
>>>> on RMO, but never got an explanation why it would work only on
>>>> TSO. Could you please elaborate? I studied some kernel sources for
>>>> a bunch of architectures and kernels, and it seems as far as I can
>>>> see all good for RMO too.
>>> Dave Dice himself told me that the algorithm is not in general safe
>>> for non-TSO. Perhaps, though, it is safe in this particular case. Of
>>> course, I may be misunderstanding him. I'm not sure of his reasoning
>>> but perhaps we should include him in this discussion.
>> I see. It would be interesting to hear his reasoning, because it is
>> not clear to me.
>>> From my point of view, I can't see a strong argument for doing this on
>>> AArch64. StoreLoad barriers are not fantastically expensive there so
>>> it may not be worth going to such extremes. The cost of a StoreLoad
>>> barrier doesn't seem to be so much more than the StoreStore that we
>>> have to have anyway.
>> Yeah about performance I’m not sure when it’s worth removing these
>> fences and on what hardware.
> Your algorithm (as I understand it) trades a moderately expensive (but
> purely local) operation for a very expensive global operation, albeit
> with much lower frequency. It's not clear to me how much we value
> continuous operation versus faster operation with occasional global
> stalls. I suppose it must be application-dependent.
Okay, Dice's asymmetric trick is nice. In fact, that is arguably what
Parallel is using already: it serializes the mutator stores by stopping
the mutator at a safepoint. Using mprotect and TLB tricks as the
serialization actions is cute and dandy.
However, I doubt that employing a system-wide synchronization
mechanism for a concurrent collector is a good thing when we can't
predict or control its long-term performance. For example, we would
basically be at the mercy of the underlying OS's performance for mprotect
calls. There are industrial GCs that rely on OS performance (*cough*
*cough*); you can see what those require to guarantee performance.
Also, given that the problem is specific to CMS, which is arguably going
away in favor of G1, I would think introducing special-case-for-CMS
barriers in mutator code is a sane interim solution.
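To make the mutator-side option concrete, here is a minimal sketch of a conditional card mark with a StoreLoad fence between the reference store and the card check. It is an illustration only, not HotSpot's actual barrier: ToyHeap, store_ref_cond_card_mark, preclean_card, the card values and the 512-byte card size are all assumptions made for the example.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// All names and constants here are hypothetical, chosen for illustration.
constexpr std::uint8_t kCleanCard = 0xff;
constexpr std::uint8_t kDirtyCard = 0x00;
constexpr std::size_t kCardShift = 9;                 // assumed 512-byte cards
constexpr std::size_t kHeapBytes = 4096;
constexpr std::size_t kNumSlots = kHeapBytes / sizeof(std::uintptr_t);
constexpr std::size_t kNumCards = kHeapBytes >> kCardShift;

struct ToyHeap {
  std::atomic<std::uintptr_t> slots[kNumSlots];       // reference fields
  std::atomic<std::uint8_t> cards[kNumCards];         // one byte per card
  ToyHeap() {
    for (auto& s : slots) s.store(0, std::memory_order_relaxed);
    for (auto& c : cards) c.store(kCleanCard, std::memory_order_relaxed);
  }
};

// Mutator side: store the reference, then dirty the card only if needed.
// Returns true if this call dirtied the card, false if it was filtered.
bool store_ref_cond_card_mark(ToyHeap& heap, std::size_t slot,
                              std::uintptr_t value) {
  heap.slots[slot].store(value, std::memory_order_relaxed);
  // StoreLoad: the card load below must not be ordered before the
  // reference store above, or a concurrent cleaner could miss the update.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  std::atomic<std::uint8_t>& card =
      heap.cards[(slot * sizeof(std::uintptr_t)) >> kCardShift];
  if (card.load(std::memory_order_relaxed) == kDirtyCard) {
    return false;                                     // already dirty, skip
  }
  card.store(kDirtyCard, std::memory_order_relaxed);
  return true;
}

// Collector side (assumed shape of a precleaning step): clean the card,
// fence, and report whether the caller should now rescan its slots.
bool preclean_card(ToyHeap& heap, std::size_t card_index) {
  std::uint8_t old =
      heap.cards[card_index].exchange(kCleanCard, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return old == kDirtyCard;
}
```

Without the fence, the mutator's card load may be satisfied before its reference store becomes visible: it sees a stale dirty card and skips the mark, while the cleaner has already cleaned the card and scanned the old field value. Any filter that runs before the fence reduces how often the fence is paid for.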
Especially if we can backport the G1-like barrier "filtering" in the CMS
case? If I read this thread right, Erik and Thomas concluded there is no
clear benefit of introducing the mprotect-like mechanics with G1, which
probably means the overheads are bearable with appropriate mutator-side