From dl at cs.oswego.edu Thu Jul 17 19:00:02 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Thu, 17 Jul 2014 15:00:02 -0400 Subject: [jmm-dev] Jmm revision status Message-ID: <53C81D32.6040807@cs.oswego.edu> Things were moving along rather nicely. And then ... nothing. My sense is that people following things closely suddenly became less optimistic that we will arrive at something simple and beautiful and readily understandable after seeing Peter Sewell's proposed amendments to C++/C11 (http://www.cl.cam.ac.uk/~pes20/cpp/notes44.html) and Alan Jeffrey's unsimple follow-ups to his simple fresh start. See March list archives at http://mail.openjdk.java.net/pipermail/jmm-dev/2014-March/thread.html This seems to happen all the time with memory models. Practical necessities surrounding processors and optimizers lead to messiness. In particular, most people wanted to get rid of JSR133 user-hostile "justification sequences" and the like, but this is now far from a sure thing. (Peter's approach amounts to a special form of them.) All ideas would be welcome on how we can recover forward progress on the core model. Especially since we do have some other updates more-or-less conceptually ready to adapt to them, including actual specs for "enhanced-volatile" acquire/release and other intrinsics, and replacing the "final fields" specs with simpler constructor release guarantees. -Doug From Peter.Sewell at cl.cam.ac.uk Thu Jul 17 20:11:25 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Thu, 17 Jul 2014 22:11:25 +0200 Subject: [jmm-dev] Jmm revision status In-Reply-To: <53C81D32.6040807@cs.oswego.edu> References: <53C81D32.6040807@cs.oswego.edu> Message-ID: On 17 July 2014 21:00, Doug Lea
wrote: > > Things were moving along rather nicely. And then ... nothing. We have been working in the meantime, but (unsurprisingly) it's proved tricky... we have a draft paper at http://www.cl.cam.ac.uk/~pes20/cpp/c_concurrency_challenges.pdf that has some good news and bad news, mostly in the C context (work by Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon, and myself). The former consist of a machine-checked proof of DRF-SC for the C/C++11 model and a machine-checked-equivalent operational C/C++11 model. For the latter, we describe our observations that there's no fix to the thin-air problems (preserving existing optimisations) in a per-candidate-execution style, and that this is not just a problem for relaxed atomics - the mix of nonatomic and (eg) SC atomic accesses that is allowed in C also leads to thin-air difficulties (this latter isn't necessarily relevant to Java, of course). Then we explore further the operational construction we sent a mail around earlier - it's interesting in that it copes with the thin-air examples (and in general with reordering) very nicely, but difficulties with other optimisations suggest it's not really a solution for C. I guess also not for Java, but I don't have a good intuition for how much optimisation you might be prepared to give up. Meanwhile (we've just been talking at a couple of meetings), Viktor and Francesco et al. have identified yet other problems with C/C++11, especially related to whether source-to-source (or IR-to-IR) optimisations are sound w.r.t. it. Some of these are fixable, but others involve the same basic thin-air problem. Right now, I'm not aware of any fleshed out credible proposal for a decent model that allows something like relaxed atomics (implementable with just plain accesses) and current optimisations... perhaps Alan's event-structure semantics will give us a way forward. 
> My sense is that people following things closely suddenly became less > optimistic that we will arrive at something simple and beautiful and > readily understandable after seeing Peter Sewell's proposed amendments > to C++/C11 (http://www.cl.cam.ac.uk/~pes20/cpp/notes44.html) and Alan > Jeffrey's unsimple follow-ups to his simple fresh start. See March > list archives at > http://mail.openjdk.java.net/pipermail/jmm-dev/2014-March/thread.html > > This seems to happen all the time with memory models. Practical > necessities surrounding processors and optimizers lead to messiness. > In particular, most people wanted to get rid of JSR133 user-hostile > "justification sequences" and the like, but this is now far from a > sure thing. (Peter's approach amounts to a special form of them.) Not really the same kind of thing, I think - that operational construction involves more than just a single candidate execution, but not the bait-and-switch nature of JSR133. best, Peter > All ideas would be welcome on how we can recover forward progress on > the core model. Especially since we do have some other updates > more-or-less conceptually ready to adapt to them, including actual > specs for "enhanced-volatile" acquire/release and other intrinsics, > and replacing the "final fields" specs with simpler constructor > release guarantees. > > -Doug From boehm at acm.org Thu Jul 17 22:57:20 2014 From: boehm at acm.org (Hans Boehm) Date: Thu, 17 Jul 2014 15:57:20 -0700 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: A few other updates: Brian and I had a paper in MSPC 14 ( http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the out-of-thin-air issues and solutions based on prohibiting store->load reordering. I would argue that those are still the most practical solutions we currently have. 
One of my colleagues at Google points out that my earlier fear that bogus branches needed to enforce load->store ordering would tie up branch prediction resources should be unfounded. It should be easy to arrange for these branches to be statically predicted correctly, in which case it appears that no prediction resources are used. I think we still need real measurements of the cost, which, at least for Java, I would expect to greatly depend on the cleverness of the compiler in delaying branches and avoiding unnecessary ones. Torvald Riegel and Paul McKenney are trying to turn C++11/C11 memory_order_consume into something useful, and have been running into some of the same problems with definition of dependencies as we have here. Although at most marginally relevant for Java, we also became aware of an ARM erratum ( http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf, perhaps discovered by some of the other participants here?), that seems to effectively reduce the cost of prohibiting load->store reordering on ARMv7 for C++ memory_order_relaxed to zero. Apparently a substantial fraction of ARMv7 cores have a hardware erratum that requires a fence for memory_order_relaxed loads anyway. Otherwise loads from the same location may be reordered, which is disallowed for C++ memory_order_relaxed, but allowed for Java. Thus any object code that is intended to correctly support memory_order_relaxed on these processors should already prohibit load->store reordering as a side-effect. For C and C++, I expect that realistically applies to all 32-bit ARM code. Unfortunately, the required workaround seems appreciably more expensive than what we would need to just enforce load->store ordering, since it needs an actual fence. As mentioned, this does not directly change the Java situation. It also does not affect 64-bit executables intended to run on ARMv8. Hans On Thu, Jul 17, 2014 at 1:11 PM, Peter Sewell wrote: > On 17 July 2014 21:00, Doug Lea
wrote: > > > > Things were moving along rather nicely. And then ... nothing. > > We have been working in the meantime, but (unsurprisingly) it's proved > tricky... we have a draft paper at > > http://www.cl.cam.ac.uk/~pes20/cpp/c_concurrency_challenges.pdf > > that has some good news and bad news, mostly in the C context (work by > Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon, and > myself). The former consist of a machine-checked proof of DRF-SC for > the C/C++11 model and a machine-checked-equivalent operational C/C++11 > model. For the latter, we describe our observations that there's no > fix to the thin-air problems (preserving existing optimisations) in a > per-candidate-execution style, and that this is not just a problem for > relaxed atomics - the mix of nonatomic and (eg) SC atomic accesses > that is allowed in C also leads to thin-air difficulties (this latter > isn't necessarily relevant to Java, of course). Then we explore > further the operational construction we sent a mail around earlier - > it's interesting in that it copes with the thin-air examples (and in > general with reordering) very nicely, but difficulties with other > optimisations suggest it's not really a solution for C. I guess also > not for Java, but I don't have a good intuition for how much > optimisation you might be prepared to give up. > > Meanwhile (we've just been talking at a couple of meetings), Viktor > and Francesco et al. have identified yet other problems with C/C++11, > especially related to whether source-to-source (or IR-to-IR) > optimisations are sound w.r.t. it. Some of these are fixable, but > others involve the same basic thin-air problem. > > Right now, I'm not aware of any fleshed out credible proposal for a > decent model that allows something like relaxed atomics (implementable > with just plain accesses) and current optimisations... perhaps > Alan's event-structure semantics will give us a way forward. 
> > > My sense is that people following things closely suddenly became less > > optimistic that we will arrive at something simple and beautiful and > > readily understandable after seeing Peter Sewell's proposed amendments > > to C++/C11 (http://www.cl.cam.ac.uk/~pes20/cpp/notes44.html) and Alan > > Jeffrey's unsimple follow-ups to his simple fresh start. See March > > list archives at > > http://mail.openjdk.java.net/pipermail/jmm-dev/2014-March/thread.html > > > > This seems to happen all the time with memory models. Practical > > necessities surrounding processors and optimizers lead to messiness. > > In particular, most people wanted to get rid of JSR133 user-hostile > > "justification sequences" and the like, but this is now far from a > > sure thing. (Peter's approach amounts to a special form of them.) > > Not really the same kind of thing, I think - that operational > construction involves more than just a single candidate execution, but > not the bait-and-switch nature of JSR133. > > best, > Peter > > > > All ideas would be welcome on how we can recover forward progress on > > the core model. Especially since we do have some other updates > > more-or-less conceptually ready to adapt to them, including actual > > specs for "enhanced-volatile" acquire/release and other intrinsics, > > and replacing the "final fields" specs with simpler constructor > > release guarantees. > > > > -Doug > From Peter.Sewell at cl.cam.ac.uk Fri Jul 18 05:43:06 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Fri, 18 Jul 2014 07:43:06 +0200 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: On 18 July 2014 00:57, Hans Boehm wrote: > A few other updates: Brian and I had a paper in MSPC 14 > (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the > out-of-thin-air issues and solutions based on prohibiting store->load > reordering. 
I would argue that those are still the most practical solutions > we currently have. > > One of my colleagues at Google points out that my earlier fear that bogus > branches needed to enforce load->store ordering would tie up branch > prediction resources should be unfounded. It should be easy to arrange for > these branches to be statically predicted correctly, in which case it > appears that no prediction resources are used. > I think we still need real measurements of the cost Agreed for the last point. For C I'm a bit skeptical; for Java I wouldn't like to even guess. > , which, at least for > Java, I would expect to greatly depend on the cleverness of the compiler in > delaying branches and avoiding unnecessary ones. > > Torvald Riegel and Paul McKenney are trying to turn C++11/C11 > memory_order_consume into something useful, and have been running into some > of the same problems with definition of dependencies as we have here. There's also a bit of a question right now about "fake" data and control dependency preservation on ARM; hopefully that will become clear soon. > Although at most marginally relevant for Java, we also became aware of an > ARM erratum > (http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf, > perhaps discovered by some of the other participants here?) (y) >, that seems to > effectively reduce the cost of prohibiting load->store reordering on ARMv7 > for C++ memory_order_relaxed to zero. Apparently a substantial fraction of > ARMv7 cores have a hardware erratum that requires a fence for > memory_order_relaxed loads anyway. Otherwise loads from the same location > may be reordered, which is disallowed for C++ memory_order_relaxed, but > allowed for Java. Thus any object code that is intended to correctly > support memory_order_relaxed on these processors should already prohibit > load->store reordering as a side-effect. For C and C++, I expect that > realistically applies to all 32-bit ARM code. 
Unfortunately, the required > workaround seems appreciably more expensive than what we would need to just > enforce load->store ordering, since it needs an actual fence. I do wonder how widely that workaround is actually deployed - any data? > As mentioned, this does not directly change the Java situation. It also > does not affect 64-bit executables intended to run on ARMv8. Indeed best, Peter > Hans > > > > On Thu, Jul 17, 2014 at 1:11 PM, Peter Sewell > wrote: >> >> On 17 July 2014 21:00, Doug Lea
wrote: >> > >> > Things were moving along rather nicely. And then ... nothing. >> >> We have been working in the meantime, but (unsurprisingly) it's proved >> tricky... we have a draft paper at >> >> http://www.cl.cam.ac.uk/~pes20/cpp/c_concurrency_challenges.pdf >> >> that has some good news and bad news, mostly in the C context (work by >> Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon, and >> myself). The former consist of a machine-checked proof of DRF-SC for >> the C/C++11 model and a machine-checked-equivalent operational C/C++11 >> model. For the latter, we describe our observations that there's no >> fix to the thin-air problems (preserving existing optimisations) in a >> per-candidate-execution style, and that this is not just a problem for >> relaxed atomics - the mix of nonatomic and (eg) SC atomic accesses >> that is allowed in C also leads to thin-air difficulties (this latter >> isn't necessarily relevant to Java, of course). Then we explore >> further the operational construction we sent a mail around earlier - >> it's interesting in that it copes with the thin-air examples (and in >> general with reordering) very nicely, but difficulties with other >> optimisations suggest it's not really a solution for C. I guess also >> not for Java, but I don't have a good intuition for how much >> optimisation you might be prepared to give up. >> >> Meanwhile (we've just been talking at a couple of meetings), Viktor >> and Francesco et al. have identified yet other problems with C/C++11, >> especially related to whether source-to-source (or IR-to-IR) >> optimisations are sound w.r.t. it. Some of these are fixable, but >> others involve the same basic thin-air problem. >> >> Right now, I'm not aware of any fleshed out credible proposal for a >> decent model that allows something like relaxed atomics (implementable >> with just plain accesses) and current optimisations... perhaps >> Alan's event-structure semantics will give us a way forward. 
>> >> > My sense is that people following things closely suddenly became less >> > optimistic that we will arrive at something simple and beautiful and >> > readily understandable after seeing Peter Sewell's proposed amendments >> > to C++/C11 (http://www.cl.cam.ac.uk/~pes20/cpp/notes44.html) and Alan >> > Jeffrey's unsimple follow-ups to his simple fresh start. See March >> > list archives at >> > http://mail.openjdk.java.net/pipermail/jmm-dev/2014-March/thread.html >> > >> > This seems to happen all the time with memory models. Practical >> > necessities surrounding processors and optimizers lead to messiness. >> > In particular, most people wanted to get rid of JSR133 user-hostile >> > "justification sequences" and the like, but this is now far from a >> > sure thing. (Peter's approach amounts to a special form of them.) >> >> Not really the same kind of thing, I think - that operational >> construction involves more than just a single candidate execution, but >> not the bait-and-switch nature of JSR133. >> >> best, >> Peter >> >> >> > All ideas would be welcome on how we can recover forward progress on >> > the core model. Especially since we do have some other updates >> > more-or-less conceptually ready to adapt to them, including actual >> > specs for "enhanced-volatile" acquire/release and other intrinsics, >> > and replacing the "final fields" specs with simpler constructor >> > release guarantees. 
>> > >> > -Doug > > From boehm at acm.org Fri Jul 18 06:19:28 2014 From: boehm at acm.org (Hans Boehm) Date: Thu, 17 Jul 2014 23:19:28 -0700 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: > On Thu, Jul 17, 2014 at 10:43 PM, Peter Sewell > wrote: > > > > On 18 July 2014 00:57, Hans Boehm wrote: > > > A few other updates: Brian and I had a paper in MSPC 14 > > > (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the > > > out-of-thin-air issues and solutions based on prohibiting store->load > > > reordering. I would argue that those are still the most practical > solutions > > > we currently have. > > > > > > One of my colleagues at Google points out that my earlier fear that > bogus > > > branches needed to enforce load->store ordering would tie up branch > > > prediction resources should be unfounded. It should be easy to > arrange for > > > these branches to be statically predicted correctly, in which case it > > > appears that no prediction resources are used. > > > > > I think we still need real measurements of the cost > > > > Agreed for the last point. For C I'm a bit skeptical; for Java I > > wouldn't like to even guess. > For C, if we look only at existing implementations, it seems that the only cost is prohibiting some compiler transformations on relaxed operations, and the cost on 64-bit ARMv8. The former seems trivial; I suspect most compilers don't reorder atomic accesses anyway. For Java, I agree. > > > > , which, at least for > > > Java, I would expect to greatly depend on the cleverness of the > compiler in > > > delaying branches and avoiding unnecessary ones. > > > > > > Torvald Riegel and Paul McKenney are trying to turn C++11/C11 > > > memory_order_consume into something useful, and have been running into > some > > > of the same problems with definition of dependencies as we have here. 
> > > > There's also a bit of a question right now about "fake" data and > > control dependency preservation on ARM; hopefully that will become > > clear soon. > > > Although at most marginally relevant for Java, we also became aware of an > > ARM erratum > > ( > http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf > , > > perhaps discovered by some of the other participants here?) > > (y) > > >, that seems to > > effectively reduce the cost of prohibiting load->store reordering on > ARMv7 > > for C++ memory_order_relaxed to zero. Apparently a substantial fraction > of > > ARMv7 cores have a hardware erratum that requires a fence for > > memory_order_relaxed loads anyway. Otherwise loads from the same > location > > may be reordered, which is disallowed for C++ memory_order_relaxed, but > > allowed for Java. Thus any object code that is intended to correctly > > support memory_order_relaxed on these processors should already prohibit > > load->store reordering as a side-effect. For C and C++, I expect that > > realistically applies to all 32-bit ARM code. Unfortunately, the > required > > workaround seems appreciably more expensive than what we would need to > just > > enforce load->store ordering, since it needs an actual fence. > > I do wonder how widely that workaround is actually deployed - any data? > I suspect it's not. But I think our task is to look at performance in a currently hypothetical world where implementations are actually correct in this respect, and where we no longer see random memory-model induced failures and attribute them to alpha particles, or whatever. I think we're gradually moving towards that hypothetical world, but we're not that close, yet. (I would be surprised if there were any real large systems for which this ARM bug is the most common cause of memory-model-related failures.) Hans > > As mentioned, this does not directly change the Java situation. 
It also > > does not affect 64-bit executables intended to run on ARMv8. > > Indeed > best, > Peter > > > > Hans > > > > From viktor at mpi-sws.org Fri Jul 18 10:19:31 2014 From: viktor at mpi-sws.org (Viktor Vafeiadis) Date: Fri, 18 Jul 2014 12:19:31 +0200 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: An update from our side: I'm attaching a draft version of the paper that Peter mentioned. In this paper, we show that the combination of dependency cycles with the C11 treatment of non-atomic accesses is even more broken than we thought, in that it makes many standard source-to-source transformations unsound (e.g., strengthening, expression evaluation linearisation, roach motel reorderings). We also found a few other technical problems with the C11 model regarding the treatment of release sequences and SC atomics, also preventing expected source-to-source program transformations. We then survey a couple of fixes to the model, such as ruling out (hb U rf) cycles, and prove (in Coq) that a class of transformations is sound under these models. Best, Viktor -------------- next part -------------- From francesco.zappa_nardelli at inria.fr Fri Jul 18 11:26:04 2014 From: francesco.zappa_nardelli at inria.fr (Francesco Zappa Nardelli) Date: Fri, 18 Jul 2014 13:26:04 +0200 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: <865778CE-2659-41C2-AF6F-A87285A305F4@inria.fr> Dear all > An update from our side: I'm attaching a draft version of the paper that Peter mentioned. It seems that the attachment didn't make it to the list.
Anyway, the draft is available from: http://www.di.ens.fr/~zappa/readings/c11comp.pdf Best -francesco From alan.jeffrey at alcatel-lucent.com Fri Jul 18 12:21:33 2014 From: alan.jeffrey at alcatel-lucent.com (Jeffrey, Alan S A (Alan)) Date: Fri, 18 Jul 2014 12:21:33 +0000 Subject: [jmm-dev] Jmm revision status In-Reply-To: <865778CE-2659-41C2-AF6F-A87285A305F4@inria.fr> References: <53C81D32.6040807@cs.oswego.edu> , <865778CE-2659-41C2-AF6F-A87285A305F4@inria.fr> Message-ID:

From our side...

Sorry about the radio silence. It's been caused by the unexpected difficulty of getting the DRF proof to go through for event structures. The solution (as with all these things) is to make some "small" changes to the definitions until the theorem is true, but I didn't want to pester the list with small deltas.

The good news is that I have a simple(ish) definition that appears to validate DRF, and also a candidate definition of refinement that a) is compositional, and b) validates a bunch of examples (some variable reorderings, thread inlining and roach motel).

More details when everything gets nailed down, but the basic idea is still the same...

A prime event structure over Sigma is a tuple (E,<=,#,l) where:

* (E,<=) is a partial order (think program order)
* (E,#) is an irreflexive, down-closed relation (think conflicts caused by reads on variables, the event R(x,0) is in conflict with the event R(x,1))
* l: E -> Sigma (the labelling function on events)

A memory model alphabet (Sigma,RWJ,RWC,Sync) is a set Sigma with three binary relations:

* RWC (read-write conflict, e.g. W(x,1) in RWC(R(x,0)))
* RWJ (read-write justification, e.g. W(x,1) in RWJ(R(x,1)))
* Sync (synchronization, e.g. W(x,1) in Sync(R(x,0)) when x is volatile)

A memory model event structure is a prime event structure over a memory model alphabet such that (er, precise conditions will go here, hopefully nothing too surprising).

A (totally ordered) configuration of an event structure is a sequence of events which is <=-down-closed, conflict-free, and repeat-free. A pre-configuration drops the conflict-free requirement.

In a sequence s:

* "trace order" (s |= d <=to e) is whenever d occurs before e in s
* "program order" (s |= d <=po e) is whenever (s |= d <=to e) and d <= e
* "synchronization order" (s |= d <=so e) is whenever (s |= d <=to e) and d in Sync(e)
* "happens before order" (s |= d <=hb e) is the transitive closure of <=po and <=so
* "separated by synchronization order" (s |= d <=sso e) is <=hb <=so <=hb

A configuration s is sequentially consistent whenever there is a function j:E->E (think "justifier") such that for any e in s other than the initial action:

* j(e) in RWJ(e)
* s |= j(e) <=to e
* there is no d in WWC(j(e)) and in RWC(e) such that s |= j(e) <= d <= e

A configuration s is relaxed consistent whenever it is included in a pre-configuration t and there is a function j:E->E such that for any e in s other than the initial action:

* j(e) in RWJ(e)
* s |= j(e) <=to e
* there is no d in WWC(j(e)) and in RWC(e) such that s |= j(e) <=hb d <=hb e
* if (t |= j(e) <=sso e) and (e in s) then (j(e) in s)
* if (t |= d <=sso j(e)) then (d not in #(e))

I'm in the middle of proving the DRF theorem (that if all SC configurations are DRF then all RC configurations are DRF, and hence SC). The model is more complex than before (due to <=po not being the same as <=sso) and results in some hairy conditions (the last two requirements of being RC are just to get DRF to fly) but not (I hope) horribly so.

Lots of work remaining, notably finishing DRF, refinement, using refinement to validate optimization, check on examples, etc. etc.

Anyway this is what we've been up to in Chicago!

A.
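
As an illustration only, the derived orders defined above (<=to, <=po, <=so, <=hb, <=sso) can be sketched in Python for one small message-passing trace. Everything concrete here is an assumption of the sketch rather than part of the definitions in the message: the event names (Wx, Wf, Rf, Rx), the encoding of relations as sets of pairs, lifting Sync from labels to events, and reading <=hb as reflexive when it is composed into <=sso.

```python
from itertools import product


def trace_order(seq):
    """Pairs (d, e) such that d occurs strictly before e in the sequence."""
    return {(d, e) for i, d in enumerate(seq) for e in seq[i + 1:]}


def transitive_closure(rel):
    """Smallest transitive relation containing rel (naive fixpoint)."""
    closure = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def compose(r1, r2):
    """Relational composition: a r1 b and b r2 d gives (a, d)."""
    return {(a, d) for (a, b) in r1 for (c, d) in r2 if b == c}


def derived_orders(seq, leq, sync):
    """Compute to, po, so, hb, sso for a sequence.

    leq  : pairs (d, e) with d <= e in the event structure (assumed encoding)
    sync : pairs (d, e) with d in Sync(e), lifted to events (assumed encoding)
    """
    to = trace_order(seq)
    po = {p for p in to if p in leq}
    so = {p for p in to if p in sync}
    hb = transitive_closure(po | so)
    # Assumption: hb is read as reflexive inside the composition hb;so;hb.
    hb_refl = hb | {(e, e) for e in seq}
    sso = compose(compose(hb_refl, so), hb_refl)
    return to, po, so, hb, sso


# Hypothetical trace: thread 1 writes x then volatile f; thread 2 reads f then x.
seq = ["Wx", "Wf", "Rf", "Rx"]
leq = {("Wx", "Wf"), ("Rf", "Rx")}
sync = {("Wf", "Rf")}
to, po, so, hb, sso = derived_orders(seq, leq, sync)
```

In this trace the write to x happens before the read of x, and the two are also separated by synchronization, while Wx and Wf are related by program order only.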
________________________________________ From: jmm-dev [jmm-dev-bounces at openjdk.java.net] on behalf of Francesco Zappa Nardelli [francesco.zappa_nardelli at inria.fr] Sent: Friday, July 18, 2014 6:26 AM To: Viktor Vafeiadis Cc: jmm-dev at openjdk.java.net Subject: Re: [jmm-dev] Jmm revision status Dear all > An update from our side: I'm attaching a draft version of the paper that Peter mentioned. It seems that the attachment didn't make it to the list. Anyway, the draft is available from: http://www.di.ens.fr/~zappa/readings/c11comp.pdf Best -francesco From dl at cs.oswego.edu Fri Jul 18 12:57:48 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Fri, 18 Jul 2014 08:57:48 -0400 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: <53C919CC.6030400@cs.oswego.edu> Thanks for all the updates! Viktor et al's draft paper seems to have the fullest discussion of possible C++/C11 fixes that entail disclaimers about cycles. In Java, even these programs require some kind of semantics. But they can be extra-extra weak. As in, the only OOTA-ish reads are those that could have occurred in certain executions that are not even legal. I wonder if there is a path to success along these lines available. -Doug From dl at cs.oswego.edu Fri Jul 18 15:38:57 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Fri, 18 Jul 2014 11:38:57 -0400 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: <53C93F91.3070605@cs.oswego.edu> On 07/17/2014 06:57 PM, Hans Boehm wrote: > A few other updates: Brian and I had a paper in MSPC 14 > (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the > out-of-thin-air issues and solutions based on prohibiting store->load > reordering. I would argue that those are still the most practical solutions we > currently have.
> Has anyone thought through a rule that amounts to requiring compilers to preserve them (as in your section 6), but treating those relaxed/non-atomic cases where Arm/Power don't honor them as just the usual benign weirdness? Are there any cases where the consequences are any worse than other cases that we claim are benign? I think that some of Viktor et al's variants come close to this. -Doug From boehm at acm.org Fri Jul 18 22:20:46 2014 From: boehm at acm.org (Hans Boehm) Date: Fri, 18 Jul 2014 15:20:46 -0700 Subject: [jmm-dev] Jmm revision status In-Reply-To: <53C93F91.3070605@cs.oswego.edu> References: <53C81D32.6040807@cs.oswego.edu> <53C93F91.3070605@cs.oswego.edu> Message-ID:

I'm not quite sure what you mean by "the usual benign weirdness"?

Preserving load->store ordering (or equivalently requiring rf U hb to be acyclic) leads to an observably different memory model from what we have now. If I have

  Thread1: r1 = x; y = r1;
  Thread2: r2 = y; x = 42;

r1 = r2 = 42 would no longer be allowed. (In the section 6.2 version of the spec, we'd have to write something like

  Thread 2: r2 = y; x = 0 * y + 42;

to exhibit the difference.) I don't see how we could both prohibit this in the specification, but then not actually enforce it at the hardware level.

Note that having the hardware enforce it often has a minimal or zero impact, even on ARMv8, depending on how the initial load is used. If we actually had, on ARMv8:

  Thread2: r2 = y; x = 42; if (r2 > 0) z = 17;

I believe we can just transform the code to

  Thread2: r2 = y; if (r2 > 0) z = 17; x = 42;

to preserve the ordering. We add no instructions, but may stall earlier for r2 to become available.

This makes the cost estimates subtle and nontrivial. A really dumb implementation that just adds a conditional branch to each load (carefully, so that no branch prediction slots are consumed) would be an interesting data point.
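
The two-thread example above can be checked mechanically. The following Python sketch is illustrative only: it enumerates the reads-from choices for the two loads and optionally rejects any candidate in which program order together with reads-from forms a cycle. The event names (Ix, Iy, Rx, Wy, Ry, Wx) are hypothetical, hb is taken to be plain program order because the example has no synchronization, and every value-consistent reads-from choice is treated as allowed when the acyclicity rule is off, which is a simplification of any real model.

```python
from itertools import product

# Program order within each thread (assumed event names):
#   Thread1: Rx (r1 = x) then Wy (y = r1)
#   Thread2: Ry (r2 = y) then Wx (x = 42)
PO = {("Rx", "Wy"), ("Ry", "Wx")}


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: found a cycle
        visiting.add(node)
        if any(visit(n) for n in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in list(graph))


def outcomes(require_acyclic):
    """Set of (r1, r2) results over all reads-from choices.

    Ix and Iy stand for the initial writes of 0 to x and y.
    """
    results = set()
    for src_x, src_y in product(["Ix", "Wx"], ["Iy", "Wy"]):
        r1 = 42 if src_x == "Wx" else 0          # value loaded from x
        r2 = r1 if src_y == "Wy" else 0          # Wy stores r1 into y
        rf = {(src_x, "Rx"), (src_y, "Ry")}
        if require_acyclic and has_cycle(PO | rf):
            continue                             # rejected by the acyclicity rule
        results.add((r1, r2))
    return results


without_rule = outcomes(require_acyclic=False)
with_rule = outcomes(require_acyclic=True)
```

Under these assumptions the only candidate removed by the rule is the one where both loads read from the other thread's store, i.e. r1 = r2 = 42.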
Delaying the bogus branch until the next non-data-dependent store, and omitting it if there already is an adequate branch in the interval, would be better. Also delaying stores past existing conditionals would presumably be best. I have no idea how many of these bogus branches actually remain after these transformations. The down side is that this will have a major impact on optimization of Java code. We'd be dramatically changing the ground rules, again. Hans On Fri, Jul 18, 2014 at 8:38 AM, Doug Lea
wrote: > On 07/17/2014 06:57 PM, Hans Boehm wrote: > >> A few other updates: Brian and I had a paper in MSPC 14 >> (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the >> out-of-thin-air issues and solutions based on prohibiting store->load >> reordering. I would argue that those are still the most practical >> solutions we >> currently have. >> >> > Has anyone thought through a rule that amounts to requiring > compilers preserve them (as in your section 6), but treating > those relaxed/non-atomic cases where Arm/Power don't honor > them as just the usual benign weirdness? Are there any > cases where the consequences are any worse than other cases > that we claim are benign? I think that some of Viktor et al's > variants come close to this. > > -Doug > > From dl at cs.oswego.edu Fri Jul 18 23:15:24 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Fri, 18 Jul 2014 19:15:24 -0400 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> <53C93F91.3070605@cs.oswego.edu> Message-ID: <53C9AA8C.6010904@cs.oswego.edu> On 07/18/2014 06:20 PM, Hans Boehm wrote: > I'm not quite sure what you mean by "the usual benign weirdness"? I meant (in a thinking-out-loud way): Suppose compilers for ARM followed your sec 6 rules, but omitted any fence/fence-like instructions (as in your sec 6.1). Would this lead to different but solvable problems? But I now see that it does not. > > Preserving load->store ordering (or equivalently requiring rf U hb to be > acyclic) leads to an observably different memory model from what we have now. (You might recall that this was the heart of the "causally consistent, cache coherent" model some of us once explored, but gave up on mainly because mapping to Power/ARM was a mystery. Due to Peter et al's work, it is no longer a mystery, but instead a known practical impossibility.) 
-Doug From Peter.Sewell at cl.cam.ac.uk Sat Jul 19 07:30:31 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Sat, 19 Jul 2014 08:30:31 +0100 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: On 18 July 2014 07:19, Hans Boehm wrote: > >> On Thu, Jul 17, 2014 at 10:43 PM, Peter Sewell >> wrote: >> > >> > On 18 July 2014 00:57, Hans Boehm wrote: >> > > A few other updates: Brian and I had a paper in MSPC 14 >> > > (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the >> > > out-of-thin-air issues and solutions based on prohibiting store->load >> > > reordering. I would argue that those are still the most practical >> > > solutions >> > > we currently have. >> > > >> > > One of my colleagues at Google points out that my earlier fear that >> > > bogus >> > > branches needed to enforce load->store ordering would tie up branch >> > > prediction resources should be unfounded. It should be easy to >> > > arrange for >> > > these branches to be statically predicted correctly, in which case it >> > > appears that no prediction resources are used. >> > >> > > I think we still need real measurements of the cost >> > >> > Agreed for the last point. For C I'm a bit skeptical; for Java I >> > wouldn't like to even guess. > > For C, if we look only at existing implementations, it seems that the only > cost is prohibiting some compiler transformations on relaxed operations, and > the cost on 64-bit ARMv8. The former seems trivial; I suspect most > compilers don't reorder atomic accesses anyway. For the optimisation cost, I think Francesco et al. are starting to see optimisations involving atomics, but I agree that for relaxed there shouldn't (in principle) be much cost from forbidding them. > For Java, I agree. > >> > >> > > , which, at least for >> > > Java, I would expect to greatly depend on the cleverness of the >> > > compiler in >> > > delaying branches and avoiding unnecessary ones. 
>> > > >> > > Torvald Riegel and Paul McKenney are trying to turn C++11/C11 >> > > memory_order_consume into something useful, and have been running into >> > > some >> > > of the same problems with definition of dependencies as we have here. >> > >> > There's also a bit of a question right now about "fake" data and >> > control dependency preservation on ARM; hopefully that will become >> > clear soon. >> >> > Although at most marginally relevant for Java, we also became aware of >> > an >> > ARM erratum >> > >> > (http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf, >> > perhaps discovered by some of the other participants here?) >> >> (y) >> >> >, that seems to >> > effectively reduce the cost of prohibiting load->store reordering on >> > ARMv7 >> > for C++ memory_order_relaxed to zero. Apparently a substantial fraction >> > of >> > ARMv7 cores have a hardware erratum that requires a fence for >> > memory_order_relaxed loads anyway. Otherwise loads from the same >> > location >> > may be reordered, which is disallowed for C++ memory_order_relaxed, but >> > allowed for Java. Thus any object code that is intended to correctly >> > support memory_order_relaxed on these processors should already prohibit >> > load->store reordering as a side-effect. For C and C++, I expect that >> > realistically applies to all 32-bit ARM code. Unfortunately, the >> > required >> > workaround seems appreciably more expensive than what we would need to >> > just >> > enforce load->store ordering, since it needs an actual fence. >> >> I do wonder how widely that workaround is actually deployed - any data? > > > I suspect it's not. But I think our task is to look at performance in a > currently hypothetical world where implementations are actually correct in > this respect, and where we no longer see random memory-model induced > failures and attribute them to alpha particles, or whatever. yes, but we also need a reasonable path towards that world. 
If ARM compiler implementations are not going to take the cost of that workaround (eg because the coherence problem is sufficiently rare in practice - btw, amusingly, it seemed as if compiler optimisations like CSE might actually ameliorate the problem), then it still becomes difficult to argue that they should add load->store fencing for C-relaxed or Java-nonvolatile reads. > I think we're > gradually moving towards that hypothetical world, but we're not that close, > yet. (I would be surprised if there were any real large systems for which > this ARM bug is the most common cause of memory-model-related failures.) > > Hans > >> >> > As mentioned, this does not directly change the Java situation. It also >> > does not affect 64-bit executables intended to run on ARMv8. >> >> Indeed >> best, >> Peter >> >> >> > Hans >> > >> > From francesco.zappa_nardelli at inria.fr Sat Jul 19 08:52:05 2014 From: francesco.zappa_nardelli at inria.fr (Francesco Zappa Nardelli) Date: Sat, 19 Jul 2014 10:52:05 +0200 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: >> For C, if we look only at existing implementations, it seems that the only >> cost is prohibiting some compiler transformations on relaxed operations, and >> the cost on 64-bit ARMv8. The former seems trivial; I suspect most >> compilers don't reorder atomic accesses anyway. > > For the optimisation cost, I think Francesco et al. are starting to > see optimisations involving atomics, but I agree that for relaxed > there shouldn't (in principle) be much cost from forbidding them. For what it is worth, we have observed: - reorderings of non-atomic accesses with relaxed atomic accesses on clang - reorderings of non-atomic accesses with atomic accesses on gcc - eliminations of non-atomic accesses across atomic accesses on gcc (should double check) We have not observed optimisations involving only atomic accesses (eg reordering two atomic accesses). 
The situation evolved a bit wrt last year, when we could not observe any of these. -francesco From dl at cs.oswego.edu Sat Jul 19 11:52:54 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sat, 19 Jul 2014 07:52:54 -0400 Subject: [jmm-dev] ECOOP, JVMLS Message-ID: <53CA5C16.4070205@cs.oswego.edu> Some of us on this list will be at ECOOP (http://ecoop14.it.uu.se/) the week of July 28 and others will be at JVMLS (http://openjdk.java.net/projects/mlvm/jvmlangsummit/). For those at ECOOP interested in informally meeting about memory model issues, how about trying to get together after the "Concurrency" session Wednesday afternoon? For those at JVMLS, I'm volunteering Aleksey Shipilev to try to coordinate something informal among people gathering after his JMM talk on Tuesday. -Doug From aleksey.shipilev at oracle.com Sat Jul 19 12:18:14 2014 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Sat, 19 Jul 2014 16:18:14 +0400 Subject: [jmm-dev] ECOOP, JVMLS In-Reply-To: <53CA5C16.4070205@cs.oswego.edu> References: <53CA5C16.4070205@cs.oswego.edu> Message-ID: <53CA6206.8020905@oracle.com> On 07/19/2014 03:52 PM, Doug Lea wrote: > For those at JVMLS, I'm volunteering Aleksey Shipilev to > try to coordinate something informal among people gathering > after his JMM talk on Tuesday. Yes, there would be the JMM Workshop at JVMLS this year -- workshops are slated as "informal meetings" at JVMLS. My gut feeling is that we would rehash the pain points in current memory model, without touching SC-DRF/OoTA parts of JMM 9. Final field guarantees and access atomicity seem to be easy to tackle. JMM Workshop requires the background reading for most folks at JVMLS, so I've transcribed my long talk about JMM [1]. If you have any comments on that transcript, please do send me a note. Reading what's happening on this list this week, I'd think OoTA/JMM9 section needs expanding a bit :) Thanks, -Aleksey. 
[1] http://shipilev.net/blog/2014/jmm-pragmatics/ From dl at cs.oswego.edu Sat Jul 19 16:27:49 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sat, 19 Jul 2014 12:27:49 -0400 Subject: [jmm-dev] ECOOP, JVMLS In-Reply-To: <53CA6206.8020905@oracle.com> References: <53CA5C16.4070205@cs.oswego.edu> <53CA6206.8020905@oracle.com> Message-ID: <53CA9C85.20609@cs.oswego.edu> On 07/19/2014 08:18 AM, Aleksey Shipilev wrote: > I'd think OoTA/JMM9 section needs expanding a bit :) > > [1] http://shipilev.net/blog/2014/jmm-pragmatics/ > I tried to come up with a short MM-consumer (vs MM developer) accessible summary of OOTA and related issues. It's not all that short, in part because I tried to relate to more familiar problems. But even at that, it is possibly misleading/wrong in effect. I'm sure Aleksey would appreciate improvements. ... OOTA and related problems You'd think it would be easy to eliminate the possibility of "thin-air" reads by stating that, even in crazily racy programs, every read reads from some write that does not happen after it. But this alone does not suffice in the presence of access cycles in a program, as in: Read A can read from write B a value derived from a read from A, and so on. Such circularities might seem impossible or uninteresting. But because accesses may be reordered, a program that doesn't have any surface-level cycles may be equivalent to one that does. Further, a read that appears to be dependent on some other write in a way that avoids circularities might be optimized into one without the dependence (for example "0 * x" is not dependent on x). Thin-air and circularity issues turn out to be the tip of the iceberg in efforts to ensure that guarantees hold uniformly across all reorderings and optimizations. Dealing with the latter invites enumeration of all possible optimizations, nearly guaranteeing that some will be missed and/or incorrectly declared illegal (as is the case for the current JSR133 JMM).
Acceptable solutions must also respect the limitations of processors and compilers making "local" decisions about orderings and values that don't always make sense when dealing with this non-local constraint. Investigations have led to discovery of some related anomalies. This general form of problem also appears similar to that of static initialization circularities in Java that are addressed by rules allowing circular accesses to uninitialized fields to see their default zero (0/0.0/false/null) values. That, however, is arranged in part via (reentrant) locking, which need not be present for arbitrary accesses. (Although it seems completely defensible for any solution here to also allow/admit default-zero as a not-very-thin-air value.) Why not just get it over with by adding to the rest of the spec a no-thin-air disclaimer? This might be tolerable for (some) readers, but much less so for development and analysis tools that otherwise get stuck because of non-uniform treatment of potential circularities. The approach to avoiding this in the current (JSR133) JMM has itself been an impediment to tool development. From OGiroux at nvidia.com Sat Jul 19 16:39:10 2014 From: OGiroux at nvidia.com (Olivier Giroux) Date: Sat, 19 Jul 2014 09:39:10 -0700 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> <53C93F91.3070605@cs.oswego.edu> Message-ID: I would like to make a minor plea to not assume that ARM is an exotic processor example. Sometimes it sounds like you're saying "if it's ok for ARM then it's going to be ok for everyone". From my perspective that apple didn't fall far from the tree, so it's not much of a fitness proof if it works for ARM. That said, my constructive feedback is that the relative costs of overriding the native RMO order vary a lot among RMO processors, because the space comprises many more radical designs. So forcing LD->ST is not something that's cheap for everyone to do.
I don't see that changing very much in the near future. Cheers, Olivier Sent from my iPhone > On Jul 18, 2014, at 3:21 PM, "Hans Boehm" wrote: > > I'm not quite sure what you mean by "the usual benign weirdness"? > > Preserving load->store ordering (or equivalently requiring rf U hb to be > acyclic) leads to an observably different memory model from what we have > now. If I have > > Thread1: r1 = x; y = r1; > > Thread2: r2 = y; x = 42; > > r1 = r2 = 42 would no longer be allowed. > > (In the section 6.2 version of the spec, we'd have to write something like > > Thread 2: r2 = y; x = 0 * y + 42; > > to exhibit the difference.) > > I don't see how we could both prohibit this in the specification, but then > not actually enforce it at the hardware level? > > Note that having the hardware enforce it often has a minimal or zero > impact, even on ARMv8, depending on how the initial load is used. If we > actually had, on ARMv8: > > Thread2: r2 = y; x = 42; if (r2 > 0) z = 17; > > I believe we can just transform the code to > > Thread2: r2 = y; if (r2 > 0) z = 17; x = 42; > > to preserve the ordering. We add no instructions, but may stall earlier > for r2 to become available. > > This makes the cost estimates subtle and nontrivial. A really dumb > implementations that just adds a conditional branch to each load > (carefully, so that no branch prediction slots are consumed) would be an > interesting data point. Delaying the bogus branch until the next > non-data-dependent store, and omitting it if there already is an adequate > branch in the interval, would be better. Also delaying stores past > existing conditionals would presumably be best. > > I have no idea how many of these bogus branches actually remain after these > transformations. The down side is that this will have a major impact on > optimization of Java code. We'd be dramatically changing the ground rules, > again. > > Hans > > >> On Fri, Jul 18, 2014 at 8:38 AM, Doug Lea
wrote: >> >>> On 07/17/2014 06:57 PM, Hans Boehm wrote: >>> >>> A few other updates: Brian and I had a paper in MSPC 14 >>> (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the >>> out-of-thin-air issues and solutions based on prohibiting store->load >>> reordering. I would argue that those are still the most practical >>> solutions we >>> currently have. >> Has anyone thought through a rule that amounts to requiring >> compilers preserve them (as in your section 6), but treating >> those relaxed/non-atomic cases where Arm/Power don't honor >> them as just the usual benign weirdness? Are there any >> cases where the consequences are any worse than other cases >> that we claim are benign? I think that some of Viktor et al's >> variants come close to this. >> >> -Doug >> >> From Peter.Sewell at cl.cam.ac.uk Sat Jul 19 16:45:44 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Sat, 19 Jul 2014 17:45:44 +0100 Subject: [jmm-dev] ECOOP, JVMLS In-Reply-To: <53CA9C85.20609@cs.oswego.edu> References: <53CA5C16.4070205@cs.oswego.edu> <53CA6206.8020905@oracle.com> <53CA9C85.20609@cs.oswego.edu> Message-ID: On 19 July 2014 17:27, Doug Lea
wrote: > On 07/19/2014 08:18 AM, Aleksey Shipilev wrote: >> >> I'd think OoTA/JMM9 section needs expanding a bit :) >> >> [1] http://shipilev.net/blog/2014/jmm-pragmatics/ >> > > I tried to come up with a short MM-consumer (vs MM developer) > accessible summary of OOTA and related issues. The best description we've come up with of OOTA issues is Section 4 of the draft I sent around: http://www.cl.cam.ac.uk/~pes20/cpp/c_concurrency_challenges.pdf > It's not all that short, in part because I tried to relate to > more familiar problems. But even at that, possibly > misleading/wrong in effect. > I'm sure Aleksey would appreciate improvements. > > ... > > > OOTA and related problems > > You'd think it would be easy to eliminate the possibility of > "thin-air" reads by stating that, even in crazily racy programs, every > read reads from some write that does not happen after it. But this > does not alone suffice in the presence of access cycles in a program, > as in: Read A can read from write B a value derived from a read from > A, and so on. > > Such circularities might seem impossible or uninteresting. But > because accesses may be reordered, a program that doesn't have any > surface-level cycles may be equivalent to one that does. Further, a > read that appears to be dependent on some other write in a way that > avoids circularities might be optimized into one without the > dependence (for example "0 * x" is not dependent on x). Thin-air and > circularity issue turn out to be the tip of the iceberg in efforts to > ensure that guarantees hold uniformly across all reorderings and > optimizations. Dealing with the latter invites enumeration of all > possible optimizations, nearly guaranteeing that some will be missed > and/or incorrectly declared illegal (as is the case for the current > JSR133 JMM). 
Acceptable solutions must also respect the limitations > of processors and compilers making "local" decisions about orderings > and values that don't always make sense when dealing with this > non-local constraint. Investigations have led to discovery of some > related anomalies. > > This general form of problem also appears similar to that of static > initialization circularities in Java that are addressed by rules > allowing circular accesses to uninitialized fields to see their > default zero (0/0.0/false/null) values. But this is arranged in part > via (reentrant) locking, but which need not be present for arbitrary > accesses. (Although it seems completely defensible for any solution > here to also allow/admit default-zero as a not-very-thin-air value.) > > Why not just get it over with by adding to the rest of the spec a > no-thin-air disclaimer? the above sentence may be misleading - if we knew how to *state* such a disclaimer (in a way compatible with enough optimisation and h/w behaviour), we'd be done already. best, Peter > This might be tolerable for (some) readers, but > much less so for development and analysis tools that otherwise get > stuck because of non-uniform treatment of potential circularities. The > approach to avoiding this in the current (JSR133) JMM has itself been > an impediment to tool development. > > > > From boehm at acm.org Sat Jul 19 18:25:09 2014 From: boehm at acm.org (Hans Boehm) Date: Sat, 19 Jul 2014 11:25:09 -0700 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: In my mind, the main issues are the fact that the ARM fence workaround isn't needed for Java, or for ARMv8, or for the other architectures that Olivier worries about. Although I'm sure there are implementations that will decide that performance is more important than correctness in this area, I'd be inclined to ignore those for this discussion. 
Otherwise we get into messy issues of occurrence frequency that I think it's really hard to get a handle on. As a wild guess, and to illustrate the problem, I conjecture that the ARM erratum is actually more of a practical issue than if we promised load->store ordering, and just declared violations to be processor bugs. I'm having a hard time coming up with programming idioms that rely on non-dependent load->store ordering. On the other hand, as Paul pointed out in the original C++ discussions, unexpected load x -> load x reordering causes all sorts of problems. (I agree that CSE statistically helps, especially in the egregious cases. But I think it often won't suffice, due to compilation unit boundaries, and because it's dangerous and generally discouraged to merge atomic accesses across loop iterations.) Hans
From dl at cs.oswego.edu Sat Jul 19 19:58:56 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sat, 19 Jul 2014 15:58:56 -0400 Subject: [jmm-dev] ECOOP, JVMLS In-Reply-To: References: <53CA5C16.4070205@cs.oswego.edu> <53CA6206.8020905@oracle.com> <53CA9C85.20609@cs.oswego.edu> Message-ID: <53CACE00.6080103@cs.oswego.edu> On 07/19/2014 12:45 PM, Peter Sewell wrote: >> Why not just get it over with by adding to the rest of the spec a >> no-thin-air disclaimer? > > the above sentence may be misleading - if we knew how to *state* such > a disclaimer (in a way compatible with enough optimisation and h/w > behaviour), we'd be done already. > Yes, thanks. I included this only because some people know that this tactic was tried with C++/C11 :-) But it needs a follow-on sentence reminding people of definitional problems mentioned in previous paragraphs. Also, in think-out-loud-mode: The most general form of OOTA Query seems to be: Can a given value be returned by a given read when the given program is run in any execution under an arbitrarily unconstrained (weak) memory model? This might be easier to characterize than some alternatives. -Doug From Peter.Sewell at cl.cam.ac.uk Sun Jul 20 00:24:23 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Sun, 20 Jul 2014 01:24:23 +0100 Subject: [jmm-dev] ECOOP, JVMLS In-Reply-To: <53CACE00.6080103@cs.oswego.edu> References: <53CA5C16.4070205@cs.oswego.edu> <53CA6206.8020905@oracle.com> <53CA9C85.20609@cs.oswego.edu> <53CACE00.6080103@cs.oswego.edu> Message-ID: On Jul 19, 2014 8:59 PM, "Doug Lea"
wrote: > > On 07/19/2014 12:45 PM, Peter Sewell wrote: >>> >>> Why not just get it over with by adding to the rest of the spec a >>> no-thin-air disclaimer? >> >> >> the above sentence may be misleading - if we knew how to *state* such >> a disclaimer (in a way compatible with enough optimisation and h/w >> behaviour), we'd be done already. >> > > Yes, thanks. I included this only because some people know that > this tactic was tried with C++/C11 :-) But it needs a follow-on > sentence reminding people of definitional problems mentioned in > previous paragraphs. > > Also, in think-out-loud-mode: The most general form of OOTA > Query seems to be: Can a given value be returned by a given > read when the given program is run in any execution under an > arbitrarily unconstrained (weak) memory model? This might > be easier to characterize than some alternatives. Easy to characterise, but it'd include executions with totally crazy control-flow-path choices and utterly broken internal invariants - so not much help for reasoning... peter > -Doug > > > > > > From dl at cs.oswego.edu Sun Jul 20 11:21:14 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sun, 20 Jul 2014 07:21:14 -0400 Subject: [jmm-dev] ECOOP, JVMLS In-Reply-To: References: <53CA5C16.4070205@cs.oswego.edu> <53CA6206.8020905@oracle.com> <53CA9C85.20609@cs.oswego.edu> <53CACE00.6080103@cs.oswego.edu> Message-ID: <53CBA62A.8090606@cs.oswego.edu> On 07/19/2014 08:24 PM, Peter Sewell wrote: > > On Jul 19, 2014 8:59 PM, "Doug Lea"
> > wrote: > > Also, in think-out-loud-mode: The most general form of OOTA > > Query seems to be: Can a given value be returned by a given > > read when the given program is run in any execution under an > > arbitrarily unconstrained (weak) memory model? This might > > be easier to characterize than some alternatives. > > Easy to characterise, but it'd include executions with totally crazy > control-flow-path choices and utterly broken internal invariants - so not much > help for reasoning... > Perhaps not *much* help, but this extremely coarse approximation would sometimes rule out some OOTA-ish phenomena. As in: Q: Can my racy code print 42? A1: Yes, here's why.... A2: No, because even under the most unconstrained model, it cannot. A3: Sorry, dunno. You probably want to change your code. Development and analysis tools hit cases like A3 all the time, although not usually because of spec limitations. -Doug From Stephan.Diestelhorst at arm.com Thu Jul 24 10:30:07 2014 From: Stephan.Diestelhorst at arm.com (Stephan Diestelhorst) Date: Thu, 24 Jul 2014 11:30:07 +0100 Subject: [jmm-dev] Jmm revision status In-Reply-To: References: <53C81D32.6040807@cs.oswego.edu> Message-ID: <2075660.tnU7XVJeyC@mymac-ubuntu> Peter, On Friday 18 July 2014 06:43:06 Peter Sewell wrote: > On 18 July 2014 00:57, Hans Boehm wrote: > > Torvald Riegel and Paul McKenney are trying to turn C++11/C11 > > memory_order_consume into something useful, and have been running > > into some of the same problems with definition of dependencies as we > > have here. > > There's also a bit of a question right now about "fake" data and > control dependency preservation on ARM; hopefully that will become > clear soon. what is the question, here? Is that a semantical question, or are you wondering about the performance? 
-- Sincerely, Stephan Stephan Diestelhorst Staff Engineer, ARM R&D Systems +44 (0)1223 405662 -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590 ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782 From Peter.Sewell at cl.cam.ac.uk Thu Jul 24 10:50:26 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Thu, 24 Jul 2014 11:50:26 +0100 Subject: [jmm-dev] Jmm revision status In-Reply-To: <2075660.tnU7XVJeyC@mymac-ubuntu> References: <53C81D32.6040807@cs.oswego.edu> <2075660.tnU7XVJeyC@mymac-ubuntu> Message-ID: On 24 July 2014 11:30, Stephan Diestelhorst wrote: > Peter, > > On Friday 18 July 2014 06:43:06 Peter Sewell wrote: > > On 18 July 2014 00:57, Hans Boehm wrote: > > > Torvald Riegel and Paul McKenney are trying to turn C++11/C11 > > > memory_order_consume into something useful, and have been running > > > into some of the same problems with definition of dependencies as we > > > have here. > > > > There's also a bit of a question right now about "fake" data and > > control dependency preservation on ARM; hopefully that will become > > clear soon. > > what is the question, here? Is that a semantical question, or are you > wondering about the performance? > the former...