RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning

Vitaly Davidovich vitalyd at gmail.com
Tue May 12 00:54:12 UTC 2015


Thanks for the explanation - this is a clever trick! :)

Out of curiosity, was there an explanation/theory why this didn't matter
for G1? Are most write barriers there eliminated via some other means?

On May 11, 2015 12:51 PM, "Erik Österlund" <erik.osterlund at lnu.se> wrote:

> Hi Andrew,
> > On 11 May 2015, at 17:21, Andrew Haley <aph at redhat.com> wrote:
> >
> > On 05/11/2015 05:06 PM, Vitaly Davidovich wrote:
> >
> >>> Also the global operation is not purely, but "mostly", locally expensive
> >>> for the thread performing the global fence. The cost on remote CPUs is
> >>> pretty much simply a normal fence (roughly). Of course there is always
> >>> gonna be that one guy with 4000 CPUs which might be a bit awkward.
> >
> > Well yes, but that guy with 4000 CPUs is precisely the target for
> > UseCondCardMark.
> Okay. That should still be fine, as I described, but it would be a bit
> expensive to benchmark and fine-tune, I guess. I don’t have access to any
> such machines. :( If somebody does, we could find out.
> >
> >>> But even then, with high enough n, shared, timestamped global
> >>> fences etc, even such ridiculous scalability should be within
> >>> reach.
> >>
> >> Is it roughly like a normal fence for remote CPUs?
> >
> > I would not think so.  Surely you'd have to interrupt every core in
> > the process and do a bunch of flushes.  A TLB flush is expensive, as
> > is interrupting the core itself.  I'm fairly sure there's no way to
> > flush a remote core's TLB without interrupting it.
> >
> Yes, but in a round-robin fashion using e.g. the APIC on x86, not necessarily
> all globally at the same time. It’s like message passing. And the TLBs will
> only be purged for the range of the memory protection; this is a single
> page that those remote CPUs don’t even have in their TLB caches, and
> therefore no remote TLB caches will be changed.
> For e.g. x86_64, the APIC message itself will fence, and then the remote CPU
> will run the code to find out that no TLB entries need changing, and that’s
> pretty much it.
> This is not a scalability bottleneck at all, and I already know that the
> constant costs are not problematic: I use this technique quite a lot
> myself, and Thomas Schatzl was kind enough to thoroughly benchmark such a
> card cleaning solution for me on G1 around New Year on a number of
> benchmarks and machines. The conclusion for G1 was that it didn’t matter
> performance-wise. Also, that constant cost is amortized away arbitrarily by
> regulating its frequency.
> Thanks,
> /Erik
> > Andrew.
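
For readers following the thread, the memory-protection-based global fence
discussed above can be sketched roughly as follows. This is only an
illustration in C for a POSIX system, not the code from the patch under
review. The names global_fence_init and global_fence are hypothetical.
Whether the permission change really acts as a fence on every remote CPU
depends on the OS and hardware (the kernel may skip CPUs that are not running
the process, for example), so treat that property as an assumption.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dedicated page used only to provoke a TLB shootdown (hypothetical name). */
static void  *fence_page = NULL;
static size_t fence_page_size = 0;

/* Map one private anonymous page. Returns 0 on success, -1 on failure. */
int global_fence_init(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    if (sz <= 0)
        return -1;
    fence_page_size = (size_t)sz;
    fence_page = mmap(NULL, fence_page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (fence_page == MAP_FAILED) {
        fence_page = NULL;
        return -1;
    }
    return 0;
}

/*
 * Attempt a process-wide fence. Assumption: revoking write permission on a
 * page that is present and dirty makes the kernel send invalidation
 * requests (IPIs on x86) to other CPUs running this process, and handling
 * such a request serializes the receiving CPU.
 * Returns 0 on success, -1 on failure.
 */
int global_fence(void)
{
    if (fence_page == NULL)
        return -1;

    /* Touch the page so a writable mapping actually exists. */
    *(volatile char *)fence_page = 1;

    /* Downgrade the protection; this is the step that triggers the flush. */
    if (mprotect(fence_page, fence_page_size, PROT_READ) != 0)
        return -1;

    /* Restore write access so the next call can dirty the page again. */
    if (mprotect(fence_page, fence_page_size, PROT_READ | PROT_WRITE) != 0)
        return -1;

    return 0;
}
```

Only the calling thread touches the page, so remote CPUs should have nothing
cached for it in their TLBs. That is the point made above about the remote
cost being small. Calling global_fence() less often is the "regulating its
frequency" knob.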
