RFR(M):8214751: X86: Support for VNNI instruction
vladimir.kozlov at oracle.com
Thu Dec 6 19:59:29 UTC 2018
What applications benefit from this optimization?
This optimization may prevent constant folding and other IGVN optimizations, and hurt register
allocation, since MulAddS2INode is generated too early, I think. We only benefit if vectors are
actually generated. Can you generate vectors without MulAddS2INode? Or create MulAddS2INode just
before vectorization and expand it if vectorization fails? I would prefer the first solution: have
code in SuperWord which finds such a pattern and tries to vectorize it.
You need to add a test to verify correctness of the results.
Add a UseAVX == 0 check to the predicates which use SSE2 code. Otherwise they may be selected even
when UseAVX > 0.
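As a sketch of the kind of correctness test requested above (class name, array sizes, and input
values are all illustrative, not from the webrev), a regression test could warm up the loop so C2
compiles it and then compare the result against the same computation done before compilation:

```java
// Hypothetical correctness check for the multiply-add pattern the
// patch vectorizes; names and sizes are illustrative only.
public class MulAddVNNITest {
    static final int N = 1024;

    // Loop shaped like the pattern in question:
    // out[i] += in1[2*i]*in2[2*i] + in1[2*i+1]*in2[2*i+1]
    static int[] mulAdd(short[] in1, short[] in2, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] += (in1[2*i] * in2[2*i]) + (in1[2*i+1] * in2[2*i+1]);
        }
        return out;
    }

    public static void main(String[] args) {
        short[] in1 = new short[2 * N];
        short[] in2 = new short[2 * N];
        for (int i = 0; i < 2 * N; i++) {
            in1[i] = (short) (i - N);
            in2[i] = (short) (3 * i + 1);
        }
        // Result computed before warmup (interpreter/C1).
        int[] expected = mulAdd(in1, in2, new int[N]);
        // Warm up so C2 compiles (and possibly vectorizes) the loop.
        int[] out = new int[N];
        for (int iter = 0; iter < 20_000; iter++) {
            java.util.Arrays.fill(out, 0);
            mulAdd(in1, in2, out);
        }
        if (!java.util.Arrays.equals(expected, out)) {
            throw new RuntimeException("vectorized result differs from scalar result");
        }
        System.out.println("ok");
    }
}
```

Running it with and without the new instructions enabled (e.g. different UseAVX settings) would
cover both the vectorized and scalar paths.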
On 12/3/18 8:58 PM, Deshpande, Vivek R wrote:
> Hi All
> Could you please review the VNNI VPDPWSSD instruction support with autovectorization.
> It can vectorize this operation in the loop:
> out[i] += ((in1[2*i] * in2[2*i]) + (in1[2*i+1] * in2[2*i+1]));
> More information on VNNI can be found here:
> The initial performance gain with a microbenchmark on Skylake with AVX3 is 10.8x.
> It generates:
> vmovdqu xmm3, xmmword ptr [rbp+r8*2+0x10]
> vmovdqu xmm6, xmmword ptr [rdx+r8*2+0x10]
> vpmaddwd xmm3, xmm6, xmm3
> vpaddd xmm3, xmm3, xmmword ptr [r9+rdi*4+0x10]
> vmovdqu xmmword ptr [r9+rdi*4+0x10], xmm3
> It can generate the vpdpwssd instruction on Cascade Lake.
> The webrev is here:
> The jbs entry for the same is here:
More information about the hotspot-compiler-dev mailing list