RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
erik.osterlund at lnu.se
Tue May 12 17:23:26 UTC 2015
Hi Mikael and Andrew,
Unless I missed something, I don’t think we introduce that much code complexity.
Of course I agree that G1 will make fixes in CMS a bit wasted in the long run.
However, until then it would be good if CMS still works. And a few lines of shared code (a handful for the actual GC) seem, to me, both less painful from an engineering point of view and better performing than going through all mutator code paths that need changing (interpreter, c1, c2, for potentially many architectures).
Out of curiosity I patched the thing and my fix can be found here: http://cr.openjdk.java.net/~eosterlund/8079315/webrev.v1/
Fortunately it looks like CMS is already batching cards pretty well for me so the change turned out to be very small. I logged to see how often this global fence is triggered and it’s very rare so I feel quite convinced it won’t impact performance negatively even on “that guy’s” machine and with a terrible OS implementation.
I benchmarked it using DaCapo benchmarks locally on my computer (macbook x86_64 BSD) and there were no traces of any performance artefacts/regression.
If anyone happens to have a larger machine than my macbook, it would be interesting to take it for a spin. ;)
Disclaimer: I haven’t poked around a lot in CMS in the past, so I hope I didn’t miss any important card value transitions!
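
For readers following along, the collector-side idea described above (clean a batch of cards, then issue one expensive global fence before scanning) might look roughly like the sketch below. It is not the code from the webrev. The card values, `global_fence`, `preclean_batch`, and the call counter are hypothetical names, and the mprotect round trip is only assumed to make other cores serialize; POSIX does not guarantee that, which is the OS-dependence concern raised further down the thread.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical card values, chosen for illustration only.
constexpr uint8_t kClean = 0xff;
constexpr uint8_t kDirty = 0x00;
constexpr uint8_t kPrecleaned = 0x01;

// Counts how often the global fence is requested (useful for logging).
static int g_global_fence_calls = 0;

// Hypothetical global fence: revoke and restore write access to a private
// page. ASSUMPTION: the OS reacts with a cross-core TLB shootdown that
// serializes the mutator threads. This is OS-specific behaviour.
bool global_fence() {
  ++g_global_fence_calls;
  static const long page_size = sysconf(_SC_PAGESIZE);
  static void* const page =
      mmap(nullptr, static_cast<size_t>(page_size), PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (page == MAP_FAILED) return false;
  *static_cast<volatile char*>(page) = 1;  // touch so the page is mapped
  if (mprotect(page, static_cast<size_t>(page_size), PROT_READ) != 0)
    return false;
  return mprotect(page, static_cast<size_t>(page_size),
                  PROT_READ | PROT_WRITE) == 0;
}

// Cleans every dirty card in [begin, end), then issues ONE global fence for
// the whole batch. Returns the card indices the collector should scan next.
std::vector<size_t> preclean_batch(std::atomic<uint8_t>* cards,
                                   size_t begin, size_t end) {
  assert(begin <= end);
  std::vector<size_t> to_scan;
  for (size_t i = begin; i < end; ++i) {
    if (cards[i].load(std::memory_order_relaxed) == kDirty) {
      cards[i].store(kPrecleaned, std::memory_order_relaxed);
      to_scan.push_back(i);
    }
  }
  // The fence is paid once per batch and only if something was cleaned.
  if (!to_scan.empty()) global_fence();
  return to_scan;
}
```

Because the fence cost is amortized over the batch, its frequency depends on how many batches contain dirty cards, not on how many reference stores the mutators perform.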
On 12 May 2015, at 14:17, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
On 2015-05-12 15:05, Aleksey Shipilev wrote:
On 11.05.2015 16:41, Andrew Haley wrote:
On 05/11/2015 12:33 PM, Erik Österlund wrote:
On 11 May 2015, at 11:58, Andrew Haley <aph at redhat.com> wrote:
On 05/11/2015 11:40 AM, Erik Österlund wrote:
I have heard statements like this that such mechanism would not work
on RMO, but never got an explanation why it would work only on
TSO. Could you please elaborate? I studied some kernel sources for
a bunch of architectures and kernels, and it seems as far as I can
see all good for RMO too.
Dave Dice himself told me that the algorithm is not in general safe
for non-TSO. Perhaps, though, it is safe in this particular case. Of
course, I may be misunderstanding him. I'm not sure of his reasoning
but perhaps we should include him in this discussion.
I see. It would be interesting to hear his reasoning, because it is
not clear to me.
From my point of view, I can't see a strong argument for doing this on
AArch64. StoreLoad barriers are not fantastically expensive there so
it may not be worth going to such extremes. The cost of a StoreLoad
barrier doesn't seem to be so much more than the StoreStore that we
have to have anyway.
Yeah about performance I’m not sure when it’s worth removing these
fences and on what hardware.
Your algorithm (as I understand it) trades a moderately expensive (but
purely local) operation for a very expensive global operation, albeit
with much lower frequency. It's not clear to me how much we value
continuous operation versus faster operation with occasional global
stalls. I suppose it must be application-dependent.
Okay, Dice's asymmetric trick is nice. In fact, that is arguably what
Parallel is using already: it serializes the mutator stores by stopping
the mutator at safepoint. Using mprotect and TLB tricks as the
serialization actions is cute and dandy.
However, I have doubts that employing the system-wide synchronization
mechanism for concurrent collector is a good thing, when we can't
predict and control the long-term performance of it. For example, we are
basically coming at the mercy of underlying OS performance with mprotect
calls. There are industrial GCs that rely on OS performance (*cough*
*cough*), you can see what those require to guarantee performance.
Just to be clear, this type of synchronization is in fact already implemented in the JVM to synchronize thread states for the safepoint protocol, so it's not exactly new and unexplained territory.
However it's not clear to me that the code complexity involved with using that type of synchronization for conditional card marking in CMS is worth it.
Also, given the problem is specific to CMS that arguably goes away in
favor of G1, I would think introducing special-case-for-CMS barriers in
mutator code is a sane interim solution.
Especially if we can backport the G1-like barrier "filtering" in CMS
case? If I read this thread right, Erik and Thomas concluded there is no
clear benefit of introducing the mprotect-like mechanics with G1, which
probably means the overheads are bearable with appropriate mutator-side
I don't think it would be easy to implement barrier "filtering" in CMS.
Keep in mind that even before the storeload was added to G1's barriers they were fairly heavy-weight. CMS' barriers are not; if we start to add conditionals and storeload barriers to them, the runtime overhead may increase more than it did when we added the storeload to G1.
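
To make the cost comparison concrete, here is a minimal sketch of the two mutator-side barrier shapes being weighed: a plain card mark that needs only StoreStore ordering, and a conditional card mark with a StoreLoad fence before the card is read. This is an illustration under stated assumptions, not HotSpot's generated code: `CardTable`, both function names, the card values, and the 512-byte card size are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical card values and card size (2^9 = 512 bytes per card).
constexpr uint8_t kClean = 0xff;
constexpr uint8_t kDirty = 0x00;
constexpr unsigned kCardShift = 9;

// Hypothetical card table covering a heap that starts at heap_base.
struct CardTable {
  std::atomic<uint8_t>* cards;
  uintptr_t heap_base;
  size_t index_for(const void* addr) const {
    assert(reinterpret_cast<uintptr_t>(addr) >= heap_base);
    return (reinterpret_cast<uintptr_t>(addr) - heap_base) >> kCardShift;
  }
};

// Light-weight shape: store the reference, then always dirty the card.
// The release store keeps the card write ordered after the reference write
// (the StoreStore ordering mentioned in the thread).
void store_ref_uncond_card_mark(CardTable& ct, std::atomic<void*>* field,
                                void* value) {
  field->store(value, std::memory_order_relaxed);
  ct.cards[ct.index_for(field)].store(kDirty, std::memory_order_release);
}

// Conditional shape: skip the card write if the card already looks dirty.
// The seq_cst fence acts as the StoreLoad barrier, so the reference store
// is ordered before the card load and a concurrent card cleaner cannot be
// missed. Returns true if the card was written.
bool store_ref_cond_card_mark(CardTable& ct, std::atomic<void*>* field,
                              void* value) {
  field->store(value, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  std::atomic<uint8_t>& card = ct.cards[ct.index_for(field)];
  if (card.load(std::memory_order_relaxed) == kDirty) return false;
  card.store(kDirty, std::memory_order_relaxed);
  return true;
}
```

The conditional version adds a fence, a load, and a branch to every reference store, which is the extra runtime overhead in question.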