From dl at cs.oswego.edu Tue Apr 1 11:58:19 2014
From: dl at cs.oswego.edu (Doug Lea)
Date: Tue, 01 Apr 2014 07:58:19 -0400
Subject: [jmm-dev] avoiding thin-air via an operational out-of-order semantics construction
In-Reply-To:
References: <53397C22.90004@cs.oswego.edu>
Message-ID: <533AA9DB.9070004@cs.oswego.edu>

On 03/31/2014 01:20 PM, Peter Sewell wrote:
> On 31 March 2014 15:30, Doug Lea wrote:
>> On 03/31/2014 09:01 AM, Peter Sewell wrote:
>>> http://www.cl.cam.ac.uk/~pes20/cpp/notes44.html (Jean Pichon + Peter Sewell)
>>> At present it only covers reorderings, not eliminations
>>
>> I'm not sure eliminations per se are a target?
>
> Some eliminations could enable other reorderings, so (except perhaps for ACCESS_ONCE) I think we have to deal with them.

Removing an unused access should not be too interesting in itself. The same visible outcomes should be allowed whether you actually remove it or not. So the interesting part is introducing substitutions that cause accesses to not be used.

>> There seem to be some non-horrible ways to enable all inter-thread analyses that might ever arise in practice: Allow actions of one thread to be moved to another under any SC interleaving that preserves liveness.
>
> Not sure I really understand what you mean there. Got a slightly concrete example in mind?

Not that any of them are realistic, but see TC1 and some others at
http://www.cs.umd.edu/~pugh/java/memoryModel/unifiedProposal/testcases.html

The main idea is that by "manually" time-slicing across threads, you can reduce whole-program inter-thread analysis to whole-program intra-thread analysis, but only under closed-world SC. Of all possible global analyses, one would think that at least this form should be allowed. And in fact, taken alone, it seems always allowable in any plausible memory model. The uncertain part is whether and how you can apply value constraints etc. from such an analysis to non-SC executions.

A quick first stab at this in the style of your or Alan's approaches looks problematic, so I'll explore more before further commenting.

-Doug

From Peter.Sewell at cl.cam.ac.uk Tue Apr 1 13:55:32 2014
From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell)
Date: Tue, 1 Apr 2014 14:55:32 +0100
Subject: [jmm-dev] avoiding thin-air via an operational out-of-order semantics construction
In-Reply-To: <20140331200309.GO4284@linux.vnet.ibm.com>
References: <20140331200309.GO4284@linux.vnet.ibm.com>
Message-ID:

On 31 March 2014 21:03, Paul E. McKenney wrote:
> On Mon, Mar 31, 2014 at 02:01:27PM +0100, Peter Sewell wrote:
>> Dear all,
>>
>> building on the summary of the thin-air problem that we sent around earlier (http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html, Mark Batty+Peter Sewell), we've sketched a semantics that gives the desired non-thin-air behaviour for at least some examples, for C11-style relaxed atomics / Linux-C ACCESS_ONCE accesses / Java volatiles. At present it only covers reorderings, not eliminations and introductions, though we think it can probably be extended to those.
>>
>> The basic idea is to assume that compilers do not do inter-thread value range optimisations, using that to let us construct a thread-local out-of-order operational semantics out of a conventional thread-local (value-passing) labelled transition system. We then combine the resulting thread semantics with an operational non-multi-copy-atomic storage subsystem, as in our PLDI11 Power model, to be sound w.r.t. Power and ARM hardware. The result is (we think) attractively simple to understand. The big question is whether that assumption is reasonable (or can be made so), and we'd welcome thoughts on that.
>>
>> The semantics is described here:
>>
>> http://www.cl.cam.ac.uk/~pes20/cpp/notes44.html (Jean Pichon + Peter Sewell)
>>
>> (we've also got a formal version in Lem and a command-line prototype tool for exploring it).
>> Comments would be very welcome
>
> I like the restriction that a given action can be hoisted ahead of conditionals only if that action is performed in all cases, and if there are no preceding fences or preceding actions to the same location as the action in question.
>
> I take it that the need to exclude value-range analysis is also behind your discomfort with Linus Torvalds's and Torvald Riegel's recent proposal for dependency ordering.

I've not got that swapped in right now, but I think it suffered from different issues. I'm perfectly fine with excluding value-range analysis so long as compiler folk can be persuaded to.

> On status of variables, some ways of approximating the different sets:
>
> o Auto variables whose address is never taken are local. (Yes, you could infer one auto variable's address from one of its counterparts, but when you do that you are taking your program's life into your hands anyway!)
>
> o Variables marked __thread whose addresses are not taken are local. The Linux kernel has a similar notion of per-CPU variables.
>
> o Memory just allocated is private until an address is published somewhere.
>
> All that aside, I don't know of any way to tell the compiler that a given location that was previously public has now been privatized (such privatization is common in RCU code).
>
> On memory action elimination and introduction, when you say (for example) read-after-write, how long after the write? I am guessing that if there is a fence or an RMW atomic between the read and the write, the read cannot be eliminated.

We'd have to think about that (in the little language for which Jean and I have worked out the details, there are no RMWs).

> One simple example that would be broken by overly energetic read-after-write elimination is the queued spinlock, which initializes its queue element, then spins waiting for the lock holder to change the value.

One might perhaps want to distinguish between eliminating finitely many occurrences of a read and eliminating all of them.

> (But you could reasonably argue that such uses should have volatile semantics.)
>
> On non-atomics, you lost me on the "if there is a data race on non-atomics performed out-of-order, then there is one with non-atomics performed in order". This might be because I am assuming that non-out-of-order non-atomics might benefit from hardware control dependencies, and perhaps you are excluding those.

I think so - our semantics doesn't explicitly know about h/w control dependency preservation.

thanks,
peter

> For C/C++, data races on non-atomic accesses result in undefined behavior, which can certainly include out-of-thin-air values, right?
>
> Thanx, Paul

From boehm at acm.org Tue Apr 1 20:47:58 2014
From: boehm at acm.org (Hans Boehm)
Date: Tue, 1 Apr 2014 13:47:58 -0700
Subject: [jmm-dev] avoiding thin-air via an operational out-of-order semantics construction
In-Reply-To:
References: <20140331200309.GO4284@linux.vnet.ibm.com>
Message-ID:

It seems to me that there are much more common compiler analyses than value-range analysis per se that cause problems here. I think that any sort of conventional alias or pointer analysis would also be disallowed, as is, since it's based on reasoning that no thread can perform certain actions. That's not an issue until we get to references, but we will clearly have to get there.
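As a concrete reference point, the kind of outcome that this sort of cross-thread reasoning must never be allowed to justify is the load-buffering shape from the notes42 summary. A minimal compilable C11 sketch follows; the thread structure and all names are mine, purely for illustration, and are not taken from the notes. No thread ever creates the value 42, so an execution ending with r1 == r2 == 42 would be out of thin air, yet (as the notes discuss) the C11 relaxed-atomics rules as written do not exclude it.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Both shared atomics start at zero; nothing in the program writes 42. */
    static atomic_int x = ATOMIC_VAR_INIT(0);
    static atomic_int y = ATOMIC_VAR_INIT(0);
    static int r1, r2;

    /* Thread 1: copy x into y using relaxed accesses. */
    static void *thread1(void *arg) {
        (void)arg;
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        atomic_store_explicit(&y, r1, memory_order_relaxed);
        return NULL;
    }

    /* Thread 2: copy y into x using relaxed accesses. */
    static void *thread2(void *arg) {
        (void)arg;
        r2 = atomic_load_explicit(&y, memory_order_relaxed);
        atomic_store_explicit(&x, r2, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, thread1, NULL);
        pthread_create(&b, NULL, thread2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* The only sensible outcome is r1 == r2 == 0.  Something like
           r1 == r2 == 42 could only be "justified" by circular reasoning
           about values that no thread ever produces. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }

Making either store independent of its preceding load turns the analogous non-zero outcome into an ordinary reordering result that compilers and hardware really do produce, which is part of what makes the line so hard to draw.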
I think escape analysis has similar issues. It allows the compiler to prove that certain "shared" variable reads are in fact locally determined, and thus not all edges in the local transition system need to be considered.

This does strike me as an improvement over prior attempts to avoid over-ordering relaxed memory accesses, but I remain very worried about the complexity we're going to end up with, especially if we also need a Power-like model of the memory system.

I'm perfectly happy to ignore progress issues, like the read-elimination issue, for now. My conclusion from C++ experiences is that we need to weasel-word progress guarantees anyway (we don't get much in the way of absolute guarantees from the hardware, and if the hardware doesn't make progress, Java doesn't either), and we can probably do that after the fact in a way that's unlikely to be misunderstood. Partial correctness seems hard enough.

Hans

From Peter.Sewell at cl.cam.ac.uk Tue Apr 1 23:21:02 2014
From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell)
Date: Wed, 2 Apr 2014 00:21:02 +0100
Subject: [jmm-dev] avoiding thin-air via an operational out-of-order semantics construction
In-Reply-To:
References: <20140331200309.GO4284@linux.vnet.ibm.com>
Message-ID:

On 1 April 2014 21:47, Hans Boehm wrote:
> It seems to me that there are much more common compiler analyses than value-range analysis per se that cause problems here.
>
> I think that any sort of conventional alias or pointer analysis would also be disallowed, as is, since it's based on reasoning that no thread can perform certain actions. That's not an issue until we get to references, but we will clearly have to get there.

That's a good point, though there are some partial mitigations:

a) we only need this for shared accesses, so whenever the compiler can persuade itself that a location is thread-local, any value restriction becomes legitimate

b) it would be easy to restrict to type-correct values (modulo exactly what C one is aiming at - whether you want to enable type-based alias analysis or not, for example)

c) as Viktor points out in other email, we could add a dynamic value-range-restriction transition, aborting if that gets violated (or adding a proof obligation). That's not so nice for an executable semantics, but not hard to deal with semantically.

Whether those take us to a viable proposal I don't know.

> I think escape analysis has similar issues. It allows the compiler to prove that certain "shared" variable reads are in fact locally determined, and thus not all edges in the local transition system need to be considered.
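To make the quoted point concrete, here is a minimal sketch of a "locally determined" read; the publication variable and payload field names are invented for illustration and come from neither message. Until the freshly allocated object is published through the atomic store, no other thread can name it, so the compiler may fold the read of its field to the value just written, and the local transition system never needs an edge in which another thread supplies a different value.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node { int payload; };

    /* Illustrative publication point; consumers would acquire-load this. */
    static _Atomic(struct node *) published;

    static int producer(void) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            return -1;
        n->payload = 1;
        /* Up to the release store below, *n has not escaped this thread,
           so escape analysis may treat the next read as locally determined
           and replace it with the constant 1.  Once the pointer has been
           published, that reasoning is no longer available. */
        int snapshot = n->payload;
        atomic_store_explicit(&published, n, memory_order_release);
        return snapshot;
    }

    int main(void) {
        printf("%d\n", producer());
        return 0;
    }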
Semantically we have to figure out the shared accesses in any case.

> This does strike me as an improvement over prior attempts to avoid over-ordering relaxed memory accesses, but I remain very worried about the complexity we're going to end up with, especially if we also need a Power-like model of the memory system.

The latter is (I'd say) the "simple" part of the Power model - any non-multi-copy-atomic operational semantics will have to have something a bit like it. Overall complexity is hard to assess without coping with all the C11 memory orders and RMWs, but, if this does get us a satisfactory non-thin-air definition, I don't see any reason to think it'd be unmanageable.

> I'm perfectly happy to ignore progress issues, like the read-elimination issue, for now. My conclusion from C++ experiences is that we need to weasel-word progress guarantees anyway (we don't get much in the way of absolute guarantees from the hardware, and if the hardware doesn't make progress, Java doesn't either), and we can probably do that after the fact in a way that's unlikely to be misunderstood. Partial correctness seems hard enough.

y, indeed

thanks, Peter
From Stephan.Diestelhorst at arm.com Thu Apr 3 12:17:02 2014
From: Stephan.Diestelhorst at arm.com (Stephan Diestelhorst)
Date: Thu, 3 Apr 2014 13:17:02 +0100
Subject: [jmm-dev] Sequential Consistency
In-Reply-To: <20140224225421.GE8264@linux.vnet.ibm.com>
References: <5308C959.80502@cs.oswego.edu> <5309BA22.9090900@oracle.com> <20140224225421.GE8264@linux.vnet.ibm.com>
Message-ID: <1654668.eQXiBj21o0@ubuntu>

On Monday 24 February 2014 22:54:21 Paul E. McKenney wrote:
> On Sun, Feb 23, 2014 at 01:06:42PM +0400, Aleksey Shipilev wrote:
> > On 02/22/2014 07:59 PM, Doug Lea wrote: [...]
> > Being the language guy, I think the hardware not being able to provide the sane SC primitives should pay up the costs. The hardware which makes it relatively easy to implement the non-tricky language memory model should be in the sweet spot.
>
> All hardware I know of has a non-trivial penalty for its SC primitives, so there is a place for non-SC algorithms.

I am not quite sure how HW can pay a cost. Ultimately, it is program performance or correctness that suffers from too much / too little SC-ification. (And yes, there is a myriad of reasons for correctness issues, even with SC.)
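As one concrete instance of correctness suffering from too little SC-ification, consider the store-buffering (Dekker) shape, sketched below in C11 with illustrative names of my own choosing. With memory_order_seq_cst, the outcome in which both loads return zero is forbidden; weakening the four accesses to release/acquire or relaxed makes that outcome legal, and it is observable on common hardware. The seq_cst version is also the one that typically costs a heavyweight barrier on Power/ARM and an mfence-class barrier or locked instruction on x86.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Two flags, initially zero; each thread raises its own flag and then
       checks the other's (the classic store-buffering / Dekker shape). */
    static atomic_int flag0 = ATOMIC_VAR_INIT(0);
    static atomic_int flag1 = ATOMIC_VAR_INIT(0);
    static int saw1, saw0;

    static void *t0(void *arg) {
        (void)arg;
        atomic_store_explicit(&flag0, 1, memory_order_seq_cst);
        saw1 = atomic_load_explicit(&flag1, memory_order_seq_cst);
        return NULL;
    }

    static void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&flag1, 1, memory_order_seq_cst);
        saw0 = atomic_load_explicit(&flag0, memory_order_seq_cst);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* With seq_cst accesses, saw0 == 0 && saw1 == 0 is forbidden: in any
           total order over the four accesses, the first store precedes both
           loads.  With release/acquire or relaxed accesses, both loads may
           read the initial zeros, so mutual exclusion built on this pattern
           would silently break. */
        printf("saw0=%d saw1=%d\n", saw0, saw1);
        return 0;
    }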
So in my mind, the question is how do we (1) make it easy for the programmer to specify the right intent of when SC (etc.) is needed (or how can we automate the process) and then (2) which patterns will emerge from that that we (the HW dudes) need to efficiently support without going SC everywhere.

--
Thanks,
Stephan

Stephan Diestelhorst
Staff Engineer
ARM R&D Systems
+44 (0)1223 405662