RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
dmitrij.pochepko at bell-sw.com
Mon Sep 25 15:46:43 UTC 2017
Please take a look at v2. I've modified the code to use multiplyToLen in
squareToLen. An additional benefit: no more code in the common part. I've left
I've also rerun the benchmark on ThunderX and got these results:
On 22.09.2017 11:12, Andrew Haley wrote:
> On 21/09/17 19:19, Dmitrij Pochepko wrote:
>> thank you for looking into this and trying it on APM (I have no access to
>> this h/w).
>> I've used the modified benchmark you sent and ran it on ThunderX, and
>> implSquareToLen still shows better results than implMultiplyToLen in
>> most cases on ThunderX (up to 10% at size=127). Results:
> For 10%, it's not worth doing, given the risks and that it's not used
> by crypto operations when C2-compiled.
>> However, since the performance difference on APM is larger than on
>> ThunderX, I think it is more logical to go back to your idea
>> and call the multiplyToLen intrinsic inside squareToLen. An alternative
>> solution is to generate different code for APM and ThunderX, but I
>> prefer to have a single version given such a relatively small
>> difference in performance, and it's still much faster than having no
>> intrinsic at all. What do you think?
> Yes. Calling multiplyToLen would be fine.
>> fyi: regarding sizes 200 and 1000 - it's incorrect to measure these
>> sizes for squareToLen, because squareToLen is never called for sizes
>> larger than 127 (I've mentioned it before).
> It's not incorrect: it's a test for asymptotic behaviour.
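
To illustrate the approach agreed on above (squareToLen implemented by calling multiplyToLen with the same operand twice), here is a minimal plain-Java sketch. It is not the AArch64 stub code from the webrev: the class SquareViaMultiply, its simplified method signatures, and the toBigInteger helper are hypothetical names introduced only for this example, and the multiply is an ordinary schoolbook loop over big-endian int magnitudes.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the intrinsic code under review.
public class SquareViaMultiply {
    private static final long LONG_MASK = 0xffffffffL;

    // Schoolbook multiply of two big-endian magnitudes (index 0 is the
    // most significant word). The result has xlen + ylen words.
    static int[] multiplyToLen(int[] x, int xlen, int[] y, int ylen) {
        int[] z = new int[xlen + ylen];
        for (int i = xlen - 1; i >= 0; i--) {
            long carry = 0;
            for (int j = ylen - 1, k = i + j + 1; j >= 0; j--, k--) {
                // Fits in 64 unsigned bits: (2^32-1)^2 + 2*(2^32-1) = 2^64-1.
                long product = (x[i] & LONG_MASK) * (y[j] & LONG_MASK)
                             + (z[k] & LONG_MASK) + carry;
                z[k] = (int) product;
                carry = product >>> 32;
            }
            z[i] = (int) carry;
        }
        return z;
    }

    // Squaring expressed as a multiply of the operand by itself.
    static int[] squareToLen(int[] x, int len) {
        return multiplyToLen(x, len, x, len);
    }

    // Helper for checking results against java.math.BigInteger.
    static BigInteger toBigInteger(int[] mag) {
        ByteBuffer buf = ByteBuffer.allocate(4 * mag.length); // big-endian by default
        buf.asIntBuffer().put(mag);
        return new BigInteger(1, buf.array());
    }

    public static void main(String[] args) {
        int[] x = {0xffffffff, 0xffffffff};
        int[] z = squareToLen(x, x.length);
        BigInteger expected = toBigInteger(x).pow(2);
        System.out.println("matches BigInteger: " + toBigInteger(z).equals(expected));
    }
}
```

The trade-off is the one discussed in this thread: a dedicated squaring routine can skip roughly half of the partial products, while delegating to the multiply keeps a single code path to maintain.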