RFR: 8134869: AARCH64: GHASH intrinsic is not optimal

Andrew Haley aph at redhat.com
Tue Sep 1 14:45:17 UTC 2015

I've been looking at the intrinsic we have for GHASH.  While it is
decent as far as it goes, its performance is considerably worse than
some other implementations of GHASH on the same processor.

Thanks are due to Alexander Alexeev who did a fine job implementing
the x86 algorithm on AArch64, but the result is not optimal.  On
AArch64 we have the advantage of a bit-reversal instruction which x86
parts don't have, and this makes it possible to write a fully
little-endian implementation of GHASH which is far more idiomatic on
AArch64 than the big-endian implementation the x86 version uses.  This
gets us an overall performance improvement of AES/GCM of 10-20%.

I've also taken the opportunity to add a lot of comments.  The
algorithms used are (fairly) obscure and most open source software
implementations don't really explain what they're doing.  In
particular, the bizarre representation of polynomials over GF(2) (where
byte ordering is little endian but bit ordering is big endian) is very
confusing and surely deserves a comment or two.
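
To make the representation concrete, here is a minimal Java sketch of
the bit-at-a-time multiplication described in the GCM specification
(NIST SP 800-38D).  It is not the intrinsic from this patch, and the
class and method names (GhashSketch, gfMul, toNatural) are made up for
illustration.  It assumes each 128-bit field element is held as two
longs, the first 8 bytes of the block loaded big-endian into word 0
and the last 8 bytes into word 1, so the coefficient of x^0 is the
most significant bit of the first byte.

```java
public final class GhashSketch {
    // Reduction constant R from the GCM spec: 11100001 followed by 120 zero bits.
    private static final long R_HI = 0xE100000000000000L;

    /**
     * Multiply x by y in GF(2^128) using the GCM bit ordering.
     * Each element is long[2]: {first 8 bytes, last 8 bytes}, both big-endian.
     */
    public static long[] gfMul(long[] x, long[] y) {
        long z0 = 0, z1 = 0;        // accumulator Z
        long v0 = y[0], v1 = y[1];  // V starts as y and becomes y * x^i
        for (int i = 0; i < 128; i++) {
            long word = (i < 64) ? x[0] : x[1];
            // Bit i of x, where bit 0 is the MSB of the first byte.
            if (((word >>> (63 - (i & 63))) & 1L) != 0) {
                z0 ^= v0;
                z1 ^= v1;
            }
            // Multiplying V by the polynomial "x" is a RIGHT shift in this
            // representation; a bit falling off the end means x^128, which
            // is reduced by XORing in R.
            boolean carry = (v1 & 1L) != 0;
            v1 = (v1 >>> 1) | (v0 << 63);
            v0 >>>= 1;
            if (carry) {
                v0 ^= R_HI;
            }
        }
        return new long[] {z0, z1};
    }

    /**
     * Bit-reverse each word so that bit k of word 0 is the coefficient of
     * x^k and bit k of word 1 is the coefficient of x^(64+k).
     */
    public static long[] toNatural(long[] e) {
        return new long[] {Long.reverse(e[0]), Long.reverse(e[1])};
    }
}
```

Note that the multiplicative identity in this encoding is
{0x8000000000000000L, 0}, not the integer 1; after toNatural() it
becomes {1, 0}.  That reversal is the sort of thing a bit-reversal
instruction makes cheap.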


One other remark: the AES/GCM implementation has a lot of overhead.
Some profile data (on x86) looks like this:

samples  cum. samples  %        cum. %     image name               symbol name
479605   479605        36.8408  36.8408    31156.jo                 aescrypt_encryptBlock
301014   780619        23.1224  59.9632    31156.jo                 ghash_processBlocks
196563   977182        15.0990  75.0621    31156.jo                 int com.sun.crypto.provider.GCTR.doFinal(byte[], int, int, byte[], int)
50061    1027243        3.8454  78.9076    31156.jo                 void TestAESEncode.run()
48159    1075402        3.6993  82.6069    31156.jo                 void TestAESDecode.run()
18506    1093908        1.4215  84.0284    libjvm.so                TypeArrayKlass::allocate_common(int, bool, Thread*)

GCTR.doFinal() doesn't need to do anything except increment a counter
and call aescrypt_encryptBlock, but it still takes 15% of the total
runtime.  Intrinsifying GCTR.update() would solve this problem.
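
For reference, the work involved is roughly the following.  This is a
minimal sketch of counter mode as used by GCM, not the JDK's GCTR
source: the names GctrSketch, inc32 and gctr are hypothetical, and it
uses the standard "AES/ECB/NoPadding" Cipher as a stand-in for the
single-block AES encryption.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public final class GctrSketch {
    /** Increment the last 32 bits of a 16-byte counter block, big-endian, mod 2^32. */
    public static void inc32(byte[] counter) {
        for (int i = 15; i >= 12; i--) {
            if (++counter[i] != 0) {
                break;  // no carry into the next byte
            }
        }
    }

    /**
     * XOR the input with the keystream AES(key, counter), AES(key, counter+1), ...
     * The initial counter block icb is not modified.
     */
    public static byte[] gctr(byte[] key, byte[] icb, byte[] in) {
        try {
            Cipher aes = Cipher.getInstance("AES/ECB/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            byte[] counter = icb.clone();
            byte[] out = new byte[in.length];
            for (int off = 0; off < in.length; off += 16) {
                byte[] keystream = aes.doFinal(counter);  // one AES block
                int n = Math.min(16, in.length - off);    // last block may be short
                for (int j = 0; j < n; j++) {
                    out[off + j] = (byte) (in[off + j] ^ keystream[j]);
                }
                inc32(counter);
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Per block that is one AES call, sixteen XORs and a counter increment,
so anything beyond that in the profile is overhead.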

