RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
aph at redhat.com
Wed Sep 6 09:53:05 UTC 2017
On 05/09/17 18:34, Dmitrij Pochepko wrote:
> As you can see, it's up to 26% worse throughput with wider multiplication.
> The reasons for this are:
> 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it
> can’t be changed within the function signature. Thus we can’t fully
> utilize the potential of 64-bit multiplication.
> 2. umulh instruction is more expensive than mul instruction.
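
For reference, point 1 above can be sketched in Java roughly as follows.
This is only an illustration modelled on the general shape of a
BigInteger-style mulAdd loop, not the JDK source or the intrinsic itself;
the class name MulAddSketch is hypothetical. It shows that the multiplier
k is a single 32-bit value, so each step is a 32x32->64 multiply whose
result already fits in one 64-bit register and needs no high-half (umulh)
multiply.

```java
// Hypothetical sketch: out[] += in[] * k over len words, big-endian word
// order, returning the final carry. Not copied from the JDK.
final class MulAddSketch {
    private static final long LONG_MASK = 0xffffffffL;

    static int mulAdd(int[] out, int[] in, int offset, int len, int k) {
        long kLong = k & LONG_MASK;       // the multiplier is only 32 bits wide
        long carry = 0;
        int o = out.length - offset - 1;  // start at the least significant word
        for (int j = len - 1; j >= 0; j--) {
            // 32x32->64 multiply plus two 32-bit addends cannot overflow 64 bits
            long product = (in[j] & LONG_MASK) * kLong
                         + (out[o] & LONG_MASK) + carry;
            out[o--] = (int) product;     // low 32 bits go back into out[]
            carry = product >>> 32;       // high 32 bits carry to the next word
        }
        return (int) carry;
    }
}
```

For example, with out = {0, 1}, in = {0xFFFFFFFF}, offset = 0, len = 1 and
k = 2, the loop stores 0xFFFFFFFF in out[1] and returns a carry of 1.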
Ah, my apologies. I wasn't thinking about mulAdd, but about
squareToLen(). But did you look at the way x86 uses 64-bit
> I haven't implemented wider multiplication for squareToLen intrinsic,
> since it'll require much more code due to more corner cases. Also,
> the squaring algorithm in BigInteger doesn't handle more than 127 integers
> in one squareToLen call (large integer arrays are divided into smaller
> parts for squaring, so 1..127 integers are squared at once), which
> makes all additional off-loop penalties expensive in comparison to loop
> execution time.
Should we intrinsify squareToLen() at all? It's only used AFAICS by
C1 and the interpreter when doing integer crypto. One other thing I
haven't checked: is the multiplyToLen() intrinsic called when
squareToLen() is absent?
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671