Spin Loop Hint support: Draft JEP proposal
john.r.rose at oracle.com
Fri Oct 9 21:32:01 UTC 2015
On Oct 8, 2015, at 11:56 PM, Gil Tene <gil at azul.com> wrote:
>> On Oct 8, 2015, at 6:18 PM, John Rose <john.r.rose at oracle.com> wrote:
>> On Oct 8, 2015, at 12:39 AM, Gil Tene <gil at azul.com> wrote:
> … If/when MONITOR/MWAIT becomes available in user mode, it will join ARM v8 and SPARC M7 in a common useful paradigm.
>> Also, from a cross-platform POV, a boolean would provide an easy to use "hook" for profiling how often the polling is failing. Failure frequency is an important input to the tuning of spin loops, isn't it? Why not feed that info through to the JVM?
> I don't follow. Perhaps I'm missing something. Spin loops are "strange" in that they tend to not care about how "fast" they spin, but do care about their reaction time to a change in the thing(s) they are spinning on. I don't think profiling will help here…
There might be a question of how to scale (or otherwise arrange) the delay until next poll. But I see the x86 PAUSE instruction has no delay parameter. (It's my first time reading the SDM entry. Now I see why it was dubbed spinLoopHint.)
> E.g. in the example tests for this JEP on Ivy Bridge Xeons, adding an intrinsified spinLoopHint() to a simple volatile-value spin loop appears to reduce the "spin throughput" by a significant ratio (3x-5x for L1-sharing threads), but also reduces the reaction time by 35-50%.
>>> and if/when it does, I'm not sure the semantics of passing the boolean through are enough to cover the actual way to use such hardware when it becomes available.
>> The alternative is to have the JIT pattern-match for loop control around the call to Thread.yield. That is obviously less robust than having the user thread the poll condition bit through the poll primitive.
> I don't think that's the alternative. The alternative(s) I suggest require no analysis by the JIT:
Got it. No analysis, no profiling needed.
> The main means of spin loop hinting I am suggesting is a simple no args hint. [Folks seem to be converging on using Thread as the home for this stuff, so I'll use that]:
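A minimal sketch of how such a no-args hint might be used in a spin loop; it is not the code from the original message. The SpinFlag class and its method names are hypothetical. Thread.onSpinWait(), the form in which this hint later shipped in Java 9, stands in for the spinLoopHint() discussed here.

```java
class SpinFlag {
    // The single value the waiter spins on.
    private volatile boolean ready;

    void signal() {
        ready = true;
    }

    // Busy-waits until signal() has been called and returns how many
    // times the flag was polled (always at least 1).
    long awaitReady() {
        long polls = 1;
        while (!ready) {
            // No-args hint: the JIT may emit something like PAUSE here on
            // x86; on platforms without an equivalent it can be a no-op.
            Thread.onSpinWait();
            polls++;
        }
        return polls;
    }

    // Runs one signaller thread against one spinning waiter.
    static long demo() {
        SpinFlag flag = new SpinFlag();
        Thread signaller = new Thread(flag::signal);
        signaller.start();
        long polls = flag.awaitReady();
        try {
            signaller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return polls;
    }

    public static void main(String[] args) {
        System.out.println("polls before the flag was seen: " + demo());
    }
}
```

The hint carries no information about what is being polled, which is why it needs no JIT analysis: the loop is correct with or without it.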
> (for Java 9, a varhandle variant of the above reflection based model is probably more appropriate. I spelled this with the reflection form for readability by pre-varhandles-speakers).
OK, I agree it would be better to use reified (encapsulated) memory locations and explicit wait-to-poll operations on them.
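One way a reified poll location could look, as a rough sketch only. WatchedPoll and spinWhile are hypothetical names, not a proposed or existing API; the VarHandle calls are the ones that shipped in Java 9. This sketch only performs a hinted spin on the designated location, which is what an implementation on PAUSE-style hardware would presumably fall back to.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.function.IntPredicate;

class WatchedPoll {
    // The single designated poll location.
    volatile int state;

    static final VarHandle STATE;
    static {
        try {
            STATE = MethodHandles.lookup()
                    .findVarHandle(WatchedPoll.class, "state", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical poll operator: re-reads exactly one location until the
    // predicate says to stop, then returns the value that ended the wait.
    // Because the location is explicit, a runtime could in principle arm a
    // hardware watch on it; here we only issue the spin hint.
    static int spinWhile(VarHandle location, Object holder, IntPredicate keepSpinning) {
        for (;;) {
            int v = (int) location.getVolatile(holder);
            if (!keepSpinning.test(v)) {
                return v;
            }
            Thread.onSpinWait();
        }
    }

    // One writer stores 7; the caller spins until it observes a non-zero value.
    static int demo() {
        WatchedPoll p = new WatchedPoll();
        Thread writer = new Thread(() -> { STATE.setVolatile(p, 7); });
        writer.start();
        int seen = spinWhile(STATE, p, v -> v == 0);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("observed state: " + demo());
    }
}
```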
> Neither of these forms requires any specific JIT matching or exploration. We know the first form is fairly robust on architectures that support stuff like PAUSE. The second form will probably be robust both on architectures that support MWAIT or WFE, and on those that support PAUSE (those just won't watch anything).
> On how this differs from a single boolean parameter: My notion (in the example above) of a single poll variable would be one that specifically designates the poll variable as a field (or maybe array index as an option), rather than provide a boolean parameter that is potentially evaluated based on data read from more than one memory location.
> The issue is that while it's an easy fit if the boolean is computed based on evaluating a single address, it becomes fragile if multiple addresses are involved and the hardware can only watch one (which is the current trend for ARM v8, SPARC M7, and a potential MONITOR/MWAIT x86). It would be "hard" for a JIT to figure out which of the addresses read to compute the boolean should be watched in the spin.
Yes, it's hard to break down a multi-input boolean into its component effects. Not impossible, but hard. Given that, the extra complexity of "exploring" a loop looking for exit gating is probably insignificant.
I was thinking the javadoc for a boolean-accepting spin loop hint would ask programmers to pick one input, and tell them that the benefits go down if they use complex predicates. But the VH-based API is much crisper.
> And getting it wrong can have potentially surprising consequences (not just lack of benefit, but terribly slow execution due to waiting for something that is not going to be externally modified and timing out each time before spinning).
> None of these are "right". And there is nothing in the semantics that suggests which one to expect.
> You could fall back and say that you would only get the benefit if there is exactly one address used in deriving the boolean, but this would probably make it hard to code to and maintain. A form that forces you to specify the polling parameter would be less generic in expression, but will be less fragile to program to as well, IMO.
> I guess that's where we differ: I don't see a benefit in profiling the spin loop, so we disagree on (a). And hence (b) is not relevant…
> Maybe I'm mis-reading what you mean by "profiling" and "optimizing" above?
You are probably reading me correctly. When tuning such things in software it is useful to know (or guess correctly) what is the likely time until the polling condition changes. Online profiling can derive some of that information, if history predicts the future. The boolean parameter provides a direct and explicit source to collect that information.
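To make the profiling hook concrete, here is a small sketch of a boolean pass-through in library code. ProfiledSpin and its methods are hypothetical; a real implementation would presumably live in the JVM and feed its own profile data, whereas this version only counts outcomes and issues the plain hint via Thread.onSpinWait().

```java
import java.util.concurrent.atomic.LongAdder;

class ProfiledSpin {
    private final LongAdder failedPolls = new LongAdder();
    private final LongAdder completedWaits = new LongAdder();

    // Hypothetical boolean-in, boolean-out poll primitive: the caller
    // passes "still waiting?" and gets the same bit back, so the call can
    // sit directly in the loop condition.
    boolean poll(boolean stillWaiting) {
        if (stillWaiting) {
            failedPolls.increment();   // this poll did not see the change
            Thread.onSpinWait();
        } else {
            completedWaits.increment(); // the polled condition has changed
        }
        return stillWaiting;
    }

    long failedPolls() {
        return failedPolls.sum();
    }

    long completedWaits() {
        return completedWaits.sum();
    }

    public static void main(String[] args) {
        ProfiledSpin profile = new ProfiledSpin();
        int remaining = 5; // stand-in for a condition that changes after 5 polls
        while (profile.poll(remaining-- > 0)) {
            // spin
        }
        System.out.println(profile.failedPolls() + " failed polls, "
                + profile.completedWaits() + " completed wait(s)");
    }
}
```

The ratio of failed polls to completed waits is the history that a tuning policy could use to guess the likely time until the condition changes.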
>> The idea would be that programmers would take a little extra thought when using yield(Z)Z, and get paid immediately from good profiling. They would get paid again later if and when platforms analyze data dependencies on the Z.
>> If there's no initial payoff, then, yes, it is hard asking programmers to expend extra thought that only benefits on some platforms.
> Whatever the choices end up being, we could provide multiple signatures or APIs. E.g. I think that the no-args spinLoopHint() is the de-facto spinning model for x86 and Power (and has been for over a decade for everything outside of Java). So it's a safe bet and a natural form. The spin-execute-something-while-watching-a-single-address model is *probably* a good fit for some relatively young but very useful hardware capabilities, and can probably be captured in a long-lasting API as well.
> More complicated boolean-derived-from-pretty-much-anything or multi-address watching schemes are IMO too early to evaluate. E.g. they could potentially leverage some just-around-the-corner (or recently arrived) features like TSX and NCAS schemes, but since there is relatively little experience with using such things for spinning (outside of Java), it is probably premature to solidify a Java API for them.
> BTW, even with user-mode MWAIT and cousins, and with the watch-a-single-address API forms, we may be looking at two separate motivations, and may want to consider a hint of which one is intended. E.g. one of spinLoopHint()'s main drivers is latency improvement, and the other is power reduction (with potential speed benefits or just power savings benefits). It appears that on x86 a PAUSE provides both, so there is no choice needed there. But MWAIT may be much more of a power-centric approach that sacrifices latency, and that may be OK for some and un-OK for others. We may want to have API variants that allow a hint about whether power-reduction or latency-reduction is the preferred driver.
Bottom line: I'm content to wait for a VH-based poll operator.
Thanks for the clear explanations.