RFR (S) : 8014362 : Need to expose some processor features via Unsafe interface

David Chase david.r.chase at oracle.com
Fri May 10 14:21:18 PDT 2013

On 2013-05-10, at 4:44 PM, Aleksey Shipilev <aleksey.shipilev at oracle.com> wrote:

> On 05/11/2013 12:33 AM, David Chase wrote:
>> FIX:
>> 1) add a bit to cpuFeatureFlags for CLMUL.
>> 2) expose the CLMUL bit in StdCpuid1Ecx
>> 3) add a flag (in the style of UseAES)
>> 4) add an unsafe intrinsic.
> 5) provide fast-path for C1 in c1_GraphBuilder.cpp
> 6) provide fast-path for C2 in library_call.cpp

I'm not sure I understand these two; isn't there just one place that the unsafe methods go?
Perhaps "intrinsic" is the wrong word -- there is no inline substitution, just a call to a C method
that accesses the flags and returns them.

Since the method is only called once, there's no need for it to be terribly fast.
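
For illustration, a C-level flag query of the kind described above might look like the sketch below. This is not the HotSpot code: ecx_has_clmul and cpu_supports_clmul are hypothetical names, and the only hardware fact relied on is that CPUID leaf 1 reports PCLMULQDQ support in ECX bit 1. The CPUID call uses the GCC/Clang <cpuid.h> helper and simply returns 0 on other compilers or architectures.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* CPUID leaf 1 reports PCLMULQDQ (carry-less multiply) in ECX bit 1. */
#define CPUID1_ECX_PCLMULQDQ (UINT32_C(1) << 1)

/* Hypothetical helper: decode the CLMUL bit from a raw ECX value. */
int ecx_has_clmul(uint32_t ecx) {
    return (ecx & CPUID1_ECX_PCLMULQDQ) != 0;
}

/* Hypothetical query: 1 if the CPU reports CLMUL, 0 otherwise
 * (including when CPUID is not available on this platform). */
int cpu_supports_clmul(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx_has_clmul(ecx);
#else
    return 0;
#endif
}
```

A caller would invoke cpu_supports_clmul() once at startup and cache the result, which is why its speed does not matter.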

> But by far the better option seems to invert control: expose the
> PCLMULQDQ-compatible method on Java side, and then intrinsify it in
> compiler, like we do with AES. See vmSymbols.hpp and library_call.cpp,
> search for "_aescrypt_encryptBlock".

It's not useful to intrinsify a single instruction unless the JVM also exposes a 128-bit datatype mapped to xmm registers; without that it is too slow, which defeats the purpose of doing this. I've benchmarked it in C -- if you do it with single instructions connected to "ordinary" code, it's much slower than the alternative.

This code:

            for (i = 4; i < len_128bit - 3; i += 4) {
                /* the only memory accesses: four 128-bit loads */
                x0a = b[i];
                x1a = b[i+1];
                x2a = b[i+2];
                x3a = b[i+3];

                /* 0x00 multiplies the low 64-bit halves, 0x11 the high halves */
                x0b = __builtin_ia32_pclmulqdq128(K, x0, 0x00);
                x0 = __builtin_ia32_pclmulqdq128(K, x0, 0x11);
                x1b = __builtin_ia32_pclmulqdq128(K, x1, 0x00);
                x1 = __builtin_ia32_pclmulqdq128(K, x1, 0x11);

                x2b = __builtin_ia32_pclmulqdq128(K, x2, 0x00);
                x2 = __builtin_ia32_pclmulqdq128(K, x2, 0x11);
                x3b = __builtin_ia32_pclmulqdq128(K, x3, 0x00);
                x3 = __builtin_ia32_pclmulqdq128(K, x3, 0x11);

                /* combine both products with the freshly loaded data */
                x0 ^= x0a ^ x0b;
                x1 ^= x1a ^ x1b;
                x2 ^= x2a ^ x2b;
                x3 ^= x3a ^ x3b;
            }

has to compile into something that uses 13 128-bit xmm registers (K, plus x0-x3, x0a-x3a and x0b-x3b), with the only memory accesses being those 128-bit loads at the top of the loop.
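
To make clear what each lane of that loop computes, here is a portable software sketch of the same arithmetic. It is only a model, not a replacement for the instruction: the u128 struct and the names clmul64, clmul_select and fold_lane are hypothetical, and the shift-and-XOR multiply is a slow stand-in for what PCLMULQDQ does in hardware. The immediate is modelled as bit 0 selecting the half of one operand and bit 4 the half of the other, so 0x00 is low*low and 0x11 is high*high.

```c
#include <stdint.h>

/* Hypothetical 128-bit value stored as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* Carry-less 64x64 -> 128 multiply: like schoolbook multiplication,
 * but partial products are combined with XOR instead of addition. */
u128 clmul64(uint64_t a, uint64_t b) {
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0)                 /* avoid an undefined 64-bit shift */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* Software model of pclmulqdq(x, y, imm): imm bit 0 picks the half
 * of x, imm bit 4 picks the half of y (0 = low, 1 = high). */
u128 clmul_select(u128 x, u128 y, int imm) {
    uint64_t a = (imm & 0x01) ? x.hi : x.lo;
    uint64_t b = (imm & 0x10) ? y.hi : y.lo;
    return clmul64(a, b);
}

/* One lane of the loop body: x = (K.hi * x.hi) ^ data ^ (K.lo * x.lo). */
u128 fold_lane(u128 K, u128 x, u128 data) {
    u128 lo = clmul_select(K, x, 0x00);
    u128 hi = clmul_select(K, x, 0x11);
    u128 r = { hi.lo ^ data.lo ^ lo.lo, hi.hi ^ data.hi ^ lo.hi };
    return r;
}
```

Each call to fold_lane corresponds to one of the four x0..x3 lanes. In the real loop every u128 here would have to live in an xmm register for the whole iteration, which is the point about needing a 128-bit datatype.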


More information about the hotspot-compiler-dev mailing list