RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
aph at redhat.com
Wed Sep 6 12:43:23 UTC 2017
On 06/09/17 12:50, Dmitrij wrote:
> On 06.09.2017 12:53, Andrew Haley wrote:
>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>> The reasons for this are:
>>> 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it
>>> can’t be changed within the function signature. Thus we can’t fully
>>> utilize the potential of 64-bit multiplication.
>>> 2. umulh instruction is more expensive than mul instruction.
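To make the first point concrete, here is a minimal Java sketch of a mulAdd-style loop. It is modeled from memory on the shape of BigInteger.mulAdd (int limbs, most significant first, an int multiplier k), so the exact signature and the class name MulAddSketch are assumptions rather than a copy of the JDK source. The thing to notice is that each step is a 32x32 multiply plus two 32-bit addends, which can never exceed 64 bits, so the high half of a 64x64 product is never needed.

```java
class MulAddSketch {
    static final long LONG_MASK = 0xFFFFFFFFL;

    /**
     * Adds in[0..len) * k into out, starting 'offset' limbs from the least
     * significant end of out, and returns the final carry.
     * Limbs are unsigned 32-bit values stored most significant first.
     * (Hypothetical sketch; the signature is an assumption.)
     */
    static int mulAdd(int[] out, int[] in, int offset, int len, int k) {
        long kLong = k & LONG_MASK;          // treat the multiplier as unsigned 32-bit
        long carry = 0;
        offset = out.length - offset - 1;    // index of the least significant target limb
        for (int j = len - 1; j >= 0; j--) {
            // Worst case: (2^32-1)^2 + (2^32-1) + (2^32-1) = 2^64 - 1,
            // so the sum fits in one 64-bit register.
            long product = (in[j] & LONG_MASK) * kLong
                         + (out[offset] & LONG_MASK) + carry;
            out[offset--] = (int) product;   // low 32 bits go back into out
            carry = product >>> 32;          // high 32 bits carry into the next limb
        }
        return (int) carry;
    }

    public static void main(String[] args) {
        int[] out = {0, 0, 1};
        int[] in = {0xFFFFFFFF, 0xFFFFFFFF};
        int carry = mulAdd(out, in, 0, 2, 0xFFFFFFFF);
        System.out.println(Integer.toHexString(out[1]) + " "
                + Integer.toHexString(out[2]) + " carry="
                + Integer.toHexString(carry));
    }
}
```

Because k is an int, widening the loop to 64-bit limbs would still multiply by a 32-bit value, which is consistent with the claim that the signature limits what a wider multiply can buy here.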
>> Ah, my apologies. I wasn't thinking about mulAdd, but about
>> squareToLen(). But did you look at the way x86 uses 64-bit
> Yes. It uses a single x86 mulq instruction, which performs a 64x64
> multiplication and places the 128-bit result in 2 registers. There is no
> such single instruction on aarch64, and the most effective aarch64
> instruction sequence I've found doesn't seem to be as fast as mulq.
I think there is effectively a 64x64 -> 128-bit instruction: it's just
that you have to represent it as a mul and a umulh. But I take your
>> One other thing I
>> haven't checked: is the multiplyToLen() intrinsic called when
>> squareToLen() is absent?
> It could have been a good alternative, but it's not used instead of
> squareToLen when squareToLen is not implemented. A java implementation
> of squareToLen will be eventually compiled and used instead:
Please compare your squareToLen with the
MacroAssembler::multiply_to_len we already have.
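
On the mul/umulh point above, a small Java sketch may help show how the two halves fit together. It is only an illustration: mulLow stands in for what mul returns (the low 64 bits of the product), umulh is a portable stand-in for the instruction of the same name (the high 64 bits of the unsigned product), and the class name MulHighSketch and the helpers toUnsigned and mul128 are hypothetical names chosen for this example.

```java
import java.math.BigInteger;

class MulHighSketch {
    /** Low 64 bits of a * b; the half that a plain 64-bit multiply yields. */
    static long mulLow(long a, long b) {
        return a * b;
    }

    /** High 64 bits of the unsigned 64x64 product, built from 32-bit pieces. */
    static long umulh(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
        long lolo = aLo * bLo;
        long hilo = aHi * bLo;
        long lohi = aLo * bHi;
        long hihi = aHi * bHi;
        // Sum of everything that lands in bits 32..63; its upper part
        // is the carry into the high word.
        long mid = (lolo >>> 32) + (hilo & 0xFFFFFFFFL) + (lohi & 0xFFFFFFFFL);
        return hihi + (hilo >>> 32) + (lohi >>> 32) + (mid >>> 32);
    }

    /** Interprets a long as an unsigned 64-bit value. */
    static BigInteger toUnsigned(long v) {
        return new BigInteger(Long.toUnsignedString(v));
    }

    /** Full 128-bit unsigned product assembled from the two halves. */
    static BigInteger mul128(long a, long b) {
        return toUnsigned(umulh(a, b)).shiftLeft(64).or(toUnsigned(mulLow(a, b)));
    }

    public static void main(String[] args) {
        long a = 0xFFFFFFFFFFFFFFFFL, b = 0x123456789ABCDEF0L;
        BigInteger expected = toUnsigned(a).multiply(toUnsigned(b));
        // Compares the mul/umulh pairing against BigInteger as a reference.
        System.out.println(mul128(a, b).equals(expected));
    }
}
```

The sketch says nothing about relative cost; it only shows that the pair of results together carries the same information as a single 128-bit product.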
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
More information about the hotspot-compiler-dev