RFR: 8134869: AARCH64: GHASH intrinsic is not optimal
aph at redhat.com
Tue Sep 1 14:45:17 UTC 2015
I've been looking at the intrinsic we have for GHASH. While it is
decent as far as it goes, it is considerably slower than some other
implementations of GHASH on the same processor.
Thanks are due to Alexander Alexeev, who did a fine job implementing
the x86 algorithm on AArch64, but the result is not optimal. On
AArch64 we have the advantage of a bit-reversal instruction, which x86
parts don't have, and this makes it possible to write a fully
little-endian implementation of GHASH, which is far more idiomatic on
AArch64 than the big-endian implementation the x86 version uses. This
improves overall AES/GCM performance by 10-20%.
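
To illustrate why a bit-reversal instruction helps, here is a minimal Java sketch, not the code in the patch. It reverses all 128 bits of each block so that bit i is the coefficient of x^i, multiplies with ordinary left shifts, and reverses the result back. Long.reverse plays the role of the bit-reversal instruction here; the class and method names (ReflectedGhash, reflect, mulNatural, mulGcm) are hypothetical, and the bit-at-a-time multiply is for clarity only, where a real intrinsic would use a carry-less multiply.

```java
public class ReflectedGhash {
    /** Reverse all 128 bits of a block held as {hi, lo}. */
    static long[] reflect(long[] b) {
        return new long[] { Long.reverse(b[1]), Long.reverse(b[0]) };
    }

    /**
     * Multiply in GF(2^128) with operands in natural order:
     * bit i of the 128-bit value {hi, lo} is the coefficient of x^i.
     */
    static long[] mulNatural(long[] a, long[] b) {
        long zHi = 0, zLo = 0;
        long vHi = b[0], vLo = b[1];
        for (int i = 0; i < 128; i++) {
            long word = (i < 64) ? a[1] : a[0];
            if (((word >>> (i & 63)) & 1L) != 0) {
                zHi ^= vHi;
                zLo ^= vLo;
            }
            // Multiply V by x: shift left, then reduce if x^127 was set.
            boolean carry = vHi < 0;
            vHi = (vHi << 1) | (vLo >>> 63);
            vLo <<= 1;
            if (carry) {
                vLo ^= 0x87L; // x^128 = x^7 + x^2 + x + 1
            }
        }
        return new long[] { zHi, zLo };
    }

    /** Multiply two blocks given in GCM bit order (first 8 bytes in hi, big-endian). */
    static long[] mulGcm(long[] x, long[] y) {
        return reflect(mulNatural(reflect(x), reflect(y)));
    }
}
```

In this sketch the element 1 is {0x8000000000000000L, 0} in GCM order, so mulGcm(one, y) returns y unchanged.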
I've also taken the opportunity to add a lot of comments. The
algorithms used are (fairly) obscure and most open source software
implementations don't really explain what they're doing. In
particular, the bizarre representation of polynomials over GF(2) (where
byte ordering is little endian but bit ordering is big endian) is very
confusing and surely deserves a comment or two.
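
As a small illustration of that representation, the following Java sketch maps a polynomial term to its position in a 16-byte block. It assumes the convention of the GCM specification: low-degree terms sit in the first bytes, but within each byte the lowest degree is the most significant bit. The names GcmBitOrder, coefficient and monomial are hypothetical and do not come from the patch.

```java
public class GcmBitOrder {
    /** Coefficient (0 or 1) of x^i in a 16-byte block, GCM convention. */
    static int coefficient(byte[] block, int i) {
        int byteIndex = i / 8;        // low-degree terms live in the first bytes
        int bitInByte = 7 - (i % 8);  // lowest degree in a byte is its top bit
        return (block[byteIndex] >>> bitInByte) & 1;
    }

    /** Build the block that represents the single term x^i. */
    static byte[] monomial(int i) {
        byte[] block = new byte[16];
        block[i / 8] = (byte) (0x80 >>> (i % 8));
        return block;
    }
}
```

Under this convention the polynomial 1 is the byte 0x80 followed by fifteen zero bytes, and x^127 is fifteen zero bytes followed by 0x01.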
One other remark: the AES/GCM implementation has a lot of overhead.
Some profile data (on x86) looks like this:
samples  cum. samples  %        cum. %   image name  symbol name
479605   479605        36.8408  36.8408  31156.jo    aescrypt_encryptBlock
301014   780619        23.1224  59.9632  31156.jo    ghash_processBlocks
196563   977182        15.0990  75.0621  31156.jo    int com.sun.crypto.provider.GCTR.doFinal(byte, int, int, byte, int)
50061    1027243       3.8454   78.9076  31156.jo    void TestAESEncode.run()
48159    1075402       3.6993   82.6069  31156.jo    void TestAESDecode.run()
18506    1093908       1.4215   84.0284  libjvm.so   TypeArrayKlass::allocate_common(int, bool, Thread*)
GCTR.doFinal() doesn't need to do anything except increment a counter
and call aescrypt_encryptBlock, but it still takes 15% of the total
runtime. Intrinsifying GCTR.update() would solve this problem.
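
For reference, the work a counter-mode loop has to do can be sketched in a few lines of Java. This is not the com.sun.crypto.provider.GCTR source; GctrSketch, inc32 and gctr are hypothetical names, and the sketch assumes the GCM convention that only the last 32 bits of the counter block are incremented. The block cipher is reached through the standard "AES/ECB/NoPadding" transformation.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class GctrSketch {
    /** Increment the last 32 bits of a 16-byte counter block (big-endian, wraps). */
    static void inc32(byte[] counter) {
        for (int i = 15; i >= 12; i--) {
            if (++counter[i] != 0) {
                return; // no carry into the next byte
            }
        }
    }

    /** XOR data with AES(counter), AES(counter + 1), ...; the same call decrypts. */
    static byte[] gctr(byte[] key, byte[] initialCounter, byte[] data) {
        try {
            Cipher aes = Cipher.getInstance("AES/ECB/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            byte[] counter = initialCounter.clone();
            byte[] out = new byte[data.length];
            for (int off = 0; off < data.length; off += 16) {
                byte[] keystream = aes.doFinal(counter); // one AES block per 16 bytes
                int n = Math.min(16, data.length - off);
                for (int j = 0; j < n; j++) {
                    out[off + j] = (byte) (data[off + j] ^ keystream[j]);
                }
                inc32(counter);
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Per block, everything other than the AES call is a counter increment and a 16-byte XOR.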