[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Dmitrij dmitrij.pochepko at bell-sw.com
Wed Sep 6 11:50:24 UTC 2017

On 06.09.2017 12:53, Andrew Haley wrote:
> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>> As you can see, it's up to 26% worse throughput with wider multiplication.
>> The reasons for this are:
>> 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it
>> can’t be changed within the function signature. Thus we can’t fully
>> utilize the potential of 64-bit multiplication.
>> 2. umulh instruction is more expensive than mul instruction.
> Ah, my apologies.  I wasn't thinking about mulAdd, but about
> squareToLen().  But did you look at the way x86 uses 64-bit
> multiplications?
Yes. It uses a single x86 mulq instruction, which performs a 64x64-bit
multiplication and places the 128-bit result in two registers. There is no
such single instruction on aarch64, and the most efficient aarch64
instruction sequence I've found doesn't seem to be as fast as mulq.
Simpler 32x32-bit multiplication works faster according to my measurements.
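
To make the comparison concrete, here is a minimal Java sketch of how an
unsigned 64x64 -> 128-bit product breaks down into four 32x32-bit partial
products. On aarch64 the two halves of such a product come from a mul
(low 64 bits) plus a umulh (high 64 bits), whereas mulq returns both at
once. The class and method names (Mul128Sketch, mulUnsigned128,
matchesBigInteger) are hypothetical and exist only for this illustration;
this is not the code from the patch.

```java
import java.math.BigInteger;

// Hypothetical illustration only; not taken from the intrinsic under review.
final class Mul128Sketch {
    private static final long MASK = 0xFFFFFFFFL;

    /** Returns {high, low}: the 64-bit halves of the unsigned product a * b. */
    static long[] mulUnsigned128(long a, long b) {
        long aLo = a & MASK, aHi = a >>> 32;
        long bLo = b & MASK, bHi = b >>> 32;

        // Four 32x32 -> 64-bit partial products.
        long ll = aLo * bLo;
        long lh = aLo * bHi;
        long hl = aHi * bLo;
        long hh = aHi * bHi;

        // Middle column: three values below 2^32, so the sum fits in a long.
        long mid = (ll >>> 32) + (lh & MASK) + (hl & MASK);

        long low = (mid << 32) | (ll & MASK);
        long high = hh + (lh >>> 32) + (hl >>> 32) + (mid >>> 32);
        return new long[] { high, low };
    }

    /** Cross-checks the sketch against BigInteger arithmetic. */
    static boolean matchesBigInteger(long a, long b) {
        BigInteger expected = unsigned(a).multiply(unsigned(b));
        long[] p = mulUnsigned128(a, b);
        BigInteger actual = unsigned(p[0]).shiftLeft(64).or(unsigned(p[1]));
        return expected.equals(actual);
    }

    private static BigInteger unsigned(long v) {
        return new BigInteger(Long.toUnsignedString(v));
    }
}
```

The sketch only shows the arithmetic; it says nothing about instruction
latencies, which is what the measurements above are about.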
>> I haven't implemented wider multiplication for the squareToLen intrinsic,
>> since it would require much more code due to more corner cases. Also, the
>> squaring algorithm in BigInteger doesn't handle more than 127 integers
>> in one squareToLen call (large integer arrays are divided into smaller
>> parts for squaring, so 1..127 integers are squared at once), which
>> makes all additional off-loop penalties expensive in comparison to loop
>> execution time.
> Should we intrinsify squareToLen() at all?

Yes, we should intrinsify it, because we can see a performance boost. It is
not as significant as on x86, but still noticeable.
>    It's only used AFAICS by
> C1 and interpreter when doing integer crypto.
This intrinsic is known to 
squareToLen is called in BigInteger multiplication when a value is
multiplied by itself, and in the pow(...) method:
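
For illustration, these are the two public BigInteger entry points meant
above. Whether a given call actually reaches squareToLen depends on
internal details (operand size, JDK version) that are assumptions here;
the class and method names in the sketch are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical illustration of the two public squaring paths.
final class SquarePathsSketch {
    /** A value with the given number of bits, all set to one. */
    static BigInteger sample(int bits) {
        return BigInteger.ONE.shiftLeft(bits).subtract(BigInteger.ONE);
    }

    /** Squares x via self-multiplication and via pow, and compares them. */
    static boolean squaresAgree(BigInteger x) {
        BigInteger viaMultiply = x.multiply(x); // multiplied by itself
        BigInteger viaPow = x.pow(2);           // pow(...) path
        return viaMultiply.equals(viaPow);
    }
}
```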

>    One other thing I
> haven't checked: is the multiplyToLen() intrinsic called when
> squareToLen() is absent?
It could have been a good alternative, but it's not used instead of
squareToLen when squareToLen is not implemented. A Java implementation
of squareToLen will eventually be compiled and used instead:
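
As a rough idea of what such a pure-Java fallback does, here is a
simplified schoolbook squaring sketch built from 32x32-bit products. It is
not the JDK source: it uses little-endian int limbs for readability (the
BigInteger magnitude array is big-endian), and all names (SquareSketch,
square, matchesBigInteger) are hypothetical. It relies on the usual
squaring shortcut: off-diagonal products are computed once and doubled,
then the diagonal squares are added.

```java
import java.math.BigInteger;

// Simplified sketch; not the JDK's squareToLen implementation.
final class SquareSketch {
    private static final long MASK = 0xFFFFFFFFL;

    /** Squares a little-endian array of unsigned 32-bit limbs. */
    static int[] square(int[] x) {
        int n = x.length;
        int[] z = new int[2 * n];

        // 1. Sum the off-diagonal products x[i] * x[j] for i < j.
        for (int i = 0; i < n; i++) {
            long xi = x[i] & MASK;
            long carry = 0;
            for (int j = i + 1; j < n; j++) {
                long p = xi * (x[j] & MASK) + (z[i + j] & MASK) + carry;
                z[i + j] = (int) p;
                carry = p >>> 32;
            }
            z[i + n] = (int) carry;
        }

        // 2. Double that sum with a one-bit left shift across all limbs.
        int top = 0;
        for (int k = 0; k < 2 * n; k++) {
            int w = z[k];
            z[k] = (w << 1) | top;
            top = w >>> 31;
        }

        // 3. Add the diagonal squares x[i] * x[i] at limb position 2 * i.
        long carry = 0;
        for (int i = 0; i < n; i++) {
            long xi = x[i] & MASK;
            long sq = xi * xi;
            long lo = (sq & MASK) + (z[2 * i] & MASK) + carry;
            z[2 * i] = (int) lo;
            long hi = (sq >>> 32) + (z[2 * i + 1] & MASK) + (lo >>> 32);
            z[2 * i + 1] = (int) hi;
            carry = hi >>> 32;
        }
        return z;
    }

    /** Cross-checks the sketch against BigInteger.multiply. */
    static boolean matchesBigInteger(int[] x) {
        BigInteger v = toBigInteger(x);
        return v.multiply(v).equals(toBigInteger(square(x)));
    }

    private static BigInteger toBigInteger(int[] limbs) {
        BigInteger r = BigInteger.ZERO;
        for (int i = limbs.length - 1; i >= 0; i--) {
            r = r.shiftLeft(32).or(BigInteger.valueOf(limbs[i] & MASK));
        }
        return r;
    }
}
```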


More information about the hotspot-compiler-dev mailing list