RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration]
aph at redhat.com
Thu Jun 4 18:41:53 UTC 2015
On 04/06/15 19:32, Anthony Scarpino wrote:
> On 06/04/2015 10:08 AM, Andrew Haley wrote:
>> I'm sorry this is a rather long email, and I pray for your patience.
>> I'm getting close to something I can put forward for review. The
>> performance is encouraging.
>> [ Some background: The kernel of RSA and Diffie-Hellman key exchange
>> is Montgomery multiplication. Optimizing RSA basically comes down to
>> optimizing Montgomery multiplication. The core of OpenJDK's RSA is
>> BigInteger.oddModPow(). ]
>> My Montgomery multiply routine (mostly portable C, with a small
>> assembly-language insert) executes a 1024-bit multiply/reduce in about
>> 2000ns; the hand-coded assembly language equivalent in OpenSSL is
>> faster (as you'd expect) but not very much faster: about 1700ns. In
>> other words, compiled C is only about 20% slower.
>> Firstly, this is pretty remarkable performance by GCC (Yay! Go GCC!)
>> and it shows I'm on the right track. It also shows that there isn't a
>> huge amount to be gained by hand-coding Montgomery multiplication, but
>> anybody who fancies their hand can try to improve on GCC. This is
>> also very nice because porting it to non-x86 hardware is fairly
>> straightforward -- certainly far easier than writing a large assembly-
>> language routine.
> I'm sure when I see the code it will become clearer, but I'm guessing
> you are taking a C version of Montgomery multiply, using GCC to turn it
> into assembly with some cpu acceleration instructions and putting that
> into an intrinsic?
I have written a Montgomery multiply routine: it is mostly C, with a
tiny bit (really, just a few instructions) of inline assembly
language. It's called from a HotSpot intrinsic.
>> It would be nice to keep all of the data in an array of jlongs for the
>> duration of oddModPow(). Here's one idea: I could write a version of
>> oddModPow() which is guaranteed never to use the Java version of the
>> Montgomery multiply. This would use a JNI method which calls the
>> native Montgomery multiply routine, guaranteeing that that we always
>> use that native routine, even from the interpreter and C1. Then I
>> could keep all the internal state in native word order, and all this
>> word-swapping would just go away. This would have the additional
>> benefit that it would be faster when using the interpreter and C1.
>> So, we'd have two versions of oddModPow() in BigInteger, and switch
>> between them depending on whether the platform had support for a
>> native Montgomery multiplication.
> I'm assuming the JNI method is closer to OpenSSL numbers?
It's the same Montgomery multiplication as the intrinsic, but called
from a JNI method instead of a HotSpot intrinsic.
>> The downside of having two versions of oddModPow() would, of course,
>> be some code duplication.
>> Or am I just making too much fuss about this? Maybe I should be happy
>> with what I've got.
> Looking at the 2k RSA sign numbers 2.3x better after your patch vs
> hs-comp. I'd be happy with that improvement.
OK. I kinda guessed that would be the response, really.
More information about the hotspot-compiler-dev