RFR(XL): 8185640: Thread-local handshakes
erik.osterlund at oracle.com
Fri Oct 27 06:51:48 UTC 2017
Regarding a conditional branch on the TLS, I have the following to say:
1) Mikael Gerdin tried an earlier prototype doing that, and found that the indirect load was more desirable for now. The reason is that the performance of the branch variant is more sensitive to chip details such as the number of branch ports on the reservation stations (a second branch port was introduced in Haswell). On some chips the branch would marginally win, on some it would marginally lose. But there are more pathological cases for the branch, e.g. a nonsense loop that does nothing but loop. Arguably that is a nonsense benchmark, though. But since the indirect load was less sensitive to chip details, always performed consistently well, and was more predictable and deterministic, that approach was selected. Perhaps this decision will change in a few years, but it seems a bit early for that now.
2) As for the number of bytes in the code stream, going from the global testl (x86) to a conditional branch on TLS, you can get an optimal encoding of the branch variant of the same length, 6 bytes, on x86: the optimal testb at offset zero is 4 bytes and a short branch is 2 bytes. For the curious reader: in the past (years ago now) I prototyped getting the optimal machine encoding of a TLS conditional-branch poll. I ended up exposing different thread pointers to the TLS register at an offset into Thread in the JIT to be able to get that zero offset, changing locking code to deal with the owner being misaligned, and all sorts of fun. But it ultimately didn't make any measurable difference at all. But I got the T-shirt anyway.
Hope this explains why the indirect load was chosen over the conditional branch.
> On 27 Oct 2017, at 01:54, Hohensee, Paul <hohensee at amazon.com> wrote:
> As a reference point, Android Java branches on a flag in the TLS rather than issuing a poisoned page probe. On x86 at least, there’s no performance disadvantage: branch prediction makes the compare-and-branch pair a single-cycle operation in the vast majority of cases.
> The interpreter was built at a time when branches had non-zero cost, as evidenced by the prediction bits in the sparc64 predicted branch instructions. The compare-and-branch code sequence takes up icache space in the interpreter (vs. zero for switching the dispatch table) and icache is still a limited resource on modern processors, so that’s an argument for switching dispatch tables. For compiled code, compare-and-branch takes a bit more space than the current poison page probe, but not enough to matter imo. Compiled code is executed far more than interpreter code, so I’d go with optimizing compiled code performance.
> On 10/26/17, 10:20 AM, "hotspot-dev on behalf of Andrew Haley" <hotspot-dev-bounces at openjdk.java.net on behalf of aph at redhat.com> wrote:
> On 26/10/17 18:00, Erik Osterlund wrote:
>> Hi Andrew,
>>>> On 26 Oct 2017, at 18:05, Andrew Haley <aph at redhat.com> wrote:
>>>> On 26/10/17 15:39, Erik Österlund wrote:
>>>> The reason we do not poll the page in the interpreter is that we
>>>> need to generate appropriate relocation entries in the code blob for
>>>> the PCs that we poll on, so that we in the signal handler can look
>>>> up the code blob, walk the relocation entries, and find precisely
>>>> why we got the trap, i.e. due to the poll, and precisely what kind
>>>> of poll, so we know what trampoline needs to be taken into the
>>> Not really, no. If we know that we're in the interpreter and the
>>> faulting address is the safepoint poll, then we can read all of the
>>> context we need from the interpreter registers and the frame.
>> That sounds like what I said.
> Not exactly. We do not need to generate any more relocation entries.
>> But the cost of the conditional branch is empirically (this was
>> attempted and measured a while ago) approximately the same as the
>> indirect load during "normal circumstances". The indirect load was
>> only marginally better.
> That's interesting. The cost of the SEGV trap going through the
> kernel is fairly high, and I'm now wondering if, for very fast
> safepoint responses, we'd be better off not doing it. The cost of the
> write protect, given that it probably involves an IPI on all
> processors, isn't cheap either.
>>>> While constructing something that does that is indeed possible, it
>>>> simply did not seem worth the trouble compared to using a branch in
>>>> these paths. The same reasoning applies for the poll performed in
>>>> the native wrapper when waking up from native and transitioning into
>>>> Java. It performs a conditional branch instead of indirect load to
>>>> avoid signal handler logic for polls that are not performance
>>> If we're talking about performance, the existing bytecode interpreter
>>> is exquisitely carefully coded, even going to the extent of having
>>> multiple dispatch tables for safepoint- and non-safepoint cases.
>>> Clearly the original authors weren't thinking that code was not
>>> performance critical or they wouldn't have done what they did. I
>>> suppose, though, that the design we have is from the early days when
>>> people diligently strove to make the interpreter as fast as possible.
>> On the other hand, branches have become a lot faster in "recent"
>> years, and this one is particularly trivial to predict. Therefore I
>> prefer to base design decisions on empirical measurements. And
>> introducing that complexity for a close-to-insignificantly faster
>> interpreter poll does not seem encouraging to me. Do you agree?
> Perhaps. It's interesting that the result falls one way in compiled
> code and the other in interpreted code. If the choice is so very
> finely balanced, though, it sort-of makes sense.
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671