RFR: 8046943: RSA Acceleration
Andrew Haley
aph at redhat.com
Fri Jun 12 11:06:22 UTC 2015
http://cr.openjdk.java.net/~aph/8046943-hs-1/
http://cr.openjdk.java.net/~aph/8046943-jdk-1/
Before:
rsa 512 bits 0.000134s 0.000009s 7444.2 111853.3
rsa 1024 bits 0.000674s 0.000029s 1483.9 34456.9
rsa 2048 bits 0.003945s 0.000100s 253.5 9994.7
rsa 4096 bits 0.027015s 0.000373s 37.0 2681.6
After:
rsa 512 bits 0.000059s 0.000004s 17022.3 224141.1
rsa 1024 bits 0.000258s 0.000013s 3871.5 78851.0
rsa 2048 bits 0.001506s 0.000042s 664.1 23844.3
rsa 4096 bits 0.010214s 0.000153s 97.9 6516.0
There are some issues we need to discuss.
The runtime code is in sharedRuntime_x86_64.cpp even though it's
mostly written in C. My thinking here is that porters will want to
tweak the code for their back ends, or maybe write it in assembler.
It's little-endian and would need some reworking for big-endian
machines. But maybe there should be a common version of the code in
share/ ?
Should it be in optoRuntime instead? It could be called from C1 or
even the interpreter, but it's C2-only right now.
I've done nothing about 32-bit targets, but I think they would
benefit.
I had to make some small changes to the Java library.
1. Montgomery multiplication works on whole words. Given that this
is a 64-bit implementation I had to force the size of the arrays in
oddModPow to be a multiple of 64 bits, i.e. the length of the arrays
must be even. Given that RSA and Diffie-Hellman keys are usually a
multiple of 64 bits in length I don't think this will be a real
performance issue in practice.
2. I needed a 64-bit inverse rather than a 32-bit inverse. This is a
single computation, done once for each modular exponentiation, so it
makes an immeasurably small difference to the total runtime.
3. I fused squaring and multiplication into a single
montgomeryMultiply method. This has two advantages. Firstly, we only
need a single intrinsic, and secondly the decision whether to use
multiply or squaring can be made in the runtime library. If the
target does not support the montgomeryMultiply intrinsic there is no
slowdown when using C2 because it removes the if (a == b) test in
if (a == b) {
product = squareToLen(a, len, product);
} else {
product = multiplyToLen(a, len, b, len, product);
}
Andrew.
More information about the hotspot-compiler-dev
mailing list