[10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Doerr, Martin martin.doerr at sap.com
Thu Aug 17 10:43:34 UTC 2017

Hi Gustavo,

thanks for the update.

1. Sign extending offset and len
Right, sign and zero extending is equivalent for offset and len because they are guaranteed to be >=0 (by checks in Java). But you can only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain garbage.
rldicl was incorrect. My mistake, sorry for that. Correct would be rldic which also clears the least significant bits.
len should also get fixed e.g. by replacing cmpdi by extsw_ in muladd.

2. Using 8 byte instructions for int
The code which feeds stdu is endianess specific. Doesn't work on all PPC64 platforms.

3.Regarding Andrew's point: Superseded by Montgomery?
The Montgomery change got backported to jdk8u (JDK-8150152  in 8u102). I'd expect the performance improvement of these intrinsics to be irrelevant for crypto.rsa. Did you measure with an older jdk8 release?
(I think the change is still acceptable as the intrinsics could be used elsewhere and the implementation also exists on other platforms.)

Best regards,

-----Original Message-----
From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br] 
Sent: Mittwoch, 16. August 2017 18:50
To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Hi Martin,

Thanks for dedicated review. It took me a while to be able to work on this but I hope to have your points solved. Please check below the review as well as my comments quoting your email:

> -----Original Message-----
> First of all, C2 does not perform sign extend when calling stubs. The
> int parms need to get zero/sign extended. (Could even be done without
> extra instructions by replacing sldi -> rldicl, cmpdi -> extsw_ in some
> cases.)

Does it make a difference on my case?

I guess you are talking about mulAdd preparation code. The only aspect I found about him is to force the cast from 32 bits -> 64 bits by cleaning higher bits. Offset is a signed integer but it can't be negative anyway. 

So I changed from:
sldi   (R5_ARG3, R5_ARG3, 2);

rldicl (R5_ARG3, R5_ARG3, 2, 32);  // always positive

> macroAssembler_ppc.cpp:
> - Indentation should be 2 spaces.


> stubGenerator_ppc:cpp:
> - or_, addi_ should get replaced by orr, addi when CR0 result is not
> needed.


> - Where is lplw initialized?

It should be initialized with 0, I missed that...

> - I believe that the updating load/store instructions e.g. lwzu don't
> perform well on some processors. At least using stwu 2 times in the loop
> doesn't make sense.

You are right. I could manipulate the bits differently and ended up with a single stdu in the loop. Neat! Although I could not reduce the total number of instructions.

> - Note: It should be possible to use 8 byte instead of 4 byte
> instructions: MacroAssembler::multiply64, addc, adde. But I'm not
> requesting to change that because I guess it would make the code very
> complicated, especially when supporting both endianess versions.

Yes, that would require a new analysis on this code. May we consider it next? As you said, I prefer having an initial version that looks as simple as the original java code.
> - The squareToLen stub implementation is very close the Java
> implementation. So it'd be interesting to understand what C2 doesn't do
> as well as the hand written assembly code. Do you know that? (Not
> absolutely necessary for accepting this change as long as the stub is
> measurably faster.)

I don't know either. Basically I chose doing it because I noticed some performance gain on SpecJVM2008 when analyzing X64. Then, taking a closer look, I didn't notice any AVX or some special instructions on X64 so I decided to try it on ppc64 by using some basic assembly.


> Best regards,
> Martin
> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> Sent: Donnerstag, 10. August 2017 19:22
> To: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> dev at openjdk.java.net>
> Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
> -----Original Message-----
> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> Sent: terça-feira, 8 de agosto de 2017 17:19
> To: ppc-aix-port-dev at openjdk.java.net
> Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen
> intrinsics
> Hi,
> Could you please review this specific PPC64 change to hotspot? By
> implementing these intrinsics I noticed a small improvement with
> microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only
> when backporting to JDK8 an improvement was noticed.
> JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> Motivation for this implementation:
> https://twitter.com/ijuma/status/698309312498835457
> Best regards,
> Gustavo Serra Scalet

More information about the hotspot-compiler-dev mailing list