RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
dmitrij.pochepko at bell-sw.com
Wed Sep 6 11:50:24 UTC 2017
On 06.09.2017 12:53, Andrew Haley wrote:
> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>> As you can see, it's up to 26% worse throughput with wider multiplication.
>> The reasons for this are:
>> 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it
>> can’t be changed within the function signature. Thus we can’t fully
>> utilize the potential of 64-bit multiplication.
>> 2. umulh instruction is more expensive than mul instruction.
> Ah, my apologies. I wasn't thinking about mulAdd, but about
> squareToLen(). But did you look at the way x86 uses 64-bit
Yes. It uses a single x86 mulq instruction, which performs a 64x64-bit
multiplication and places the 128-bit result in two registers. There is no
such single instruction on aarch64, and the most effective aarch64
instruction sequence I've found doesn't seem to be as fast as mulq.
Simpler 32x32-bit multiplication works faster according to my measurements.
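To illustrate the difference in plain Java (a rough sketch, not the
intrinsic code itself): a full 64x64 product needs two separate results,
the low half (what mul gives) and the high half (what umulh gives), while
a 32x32 product fits in one 64-bit multiply. The class and method names
below are hypothetical, and Math.multiplyHigh (Java 9+) is used only as a
stand-in for the high-half instruction.

```java
import java.math.BigInteger;

// Hypothetical illustration class; not part of the JDK or of the patch.
class WideMultiplySketch {

    /** Low 64 bits of a * b (the half that a plain 64-bit multiply yields). */
    static long lowHalf(long a, long b) {
        return a * b;
    }

    /** High 64 bits of the unsigned 64x64 product (the half umulh yields). */
    static long highHalfUnsigned(long a, long b) {
        // Math.multiplyHigh is signed, so correct it to the unsigned result.
        return Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a);
    }

    /** Unsigned 32x32 -> 64-bit product: one multiply, no high-half step. */
    static long mul32(int a, int b) {
        return (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);
    }

    /** Interprets a long as an unsigned 64-bit value. */
    static BigInteger unsigned(long v) {
        return new BigInteger(Long.toUnsignedString(v));
    }

    /** Reassembles the 128-bit product from its two halves. */
    static BigInteger product128(long a, long b) {
        return unsigned(highHalfUnsigned(a, b)).shiftLeft(64)
                .or(unsigned(lowHalf(a, b)));
    }

    public static void main(String[] args) {
        long a = 0xFEDCBA9876543210L;
        long b = 0x0123456789ABCDEFL;
        // Cross-check the two-half result against BigInteger arithmetic.
        BigInteger expected = unsigned(a).multiply(unsigned(b));
        System.out.println(product128(a, b).equals(expected));
    }
}
```

This only shows why the wide product needs two results per multiplication;
it says nothing about instruction timings, which come from the measurements
mentioned above.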
>> I haven't implemented wider multiplication for the squareToLen intrinsic,
>> since it would require much more code due to more corner cases. Also, the
>> squaring algorithm in BigInteger doesn't handle more than 127 integers
>> in one squareToLen call (large integer arrays are divided into smaller
>> parts for squaring, so 1..127 integers are squared at once), which
>> makes all additional off-loop penalties expensive in comparison to loop
>> execution time.
> Should we intrinsify squareToLen() at all?
Yes, we should intrinsify it, because we can see a performance boost. It is
not as significant as on x86, but still noticeable.
> It's only used AFAICS by
> C1 and interpreter when doing integer crypto.
This intrinsic is known to
squareToLen is called in BigInteger multiplication when a value is
multiplied by itself, and in the pow(...) method:
> One other thing I
> haven't checked: is the multiplyToLen() intrinsic called when
> squareToLen() is absent?
It could have been a good alternative, but it's not used in place of
squareToLen when squareToLen is not implemented. The Java implementation
of squareToLen will eventually be compiled and used instead: