CR for RFR 8149421
vladimir.kozlov at oracle.com
Thu Feb 11 18:08:42 UTC 2016
On 2/11/16 9:55 AM, Berg, Michael C wrote:
> Yes, that is pretty close to it. The unrolled loop, after it initially succeeds as an atomic (unity-unroll) segment of one vector size, is what I replicated to the drain loop before super unrolling occurs. In fact, it's precisely what we need.
Will you do more to improve it?
> I migrated the ss/sd insns to k1 usage for uniformity reasons. You will notice some b and w SIMD components going the other way; that is preparatory, not a bug yet, but it could have become one if left to use k0, which has all bits set for masking in the auto code generation path. The only exception is stub code, for which a webrev will soon be made available; it has programmable versions of the w and b components that do not fit in the auto code generation path, and there the contents of k1 are left to the responsibility of the stub writer. The other insns, like movdqu, are set to false in preparation for programmable SIMD, which will need to apply programmed masks in fix-up segments. Since programmable SIMD is for int/float and long/double sizes only, there will be no conflict. Basically, the w and b components do not have enough ISA mapping to complete more than very basic vector expressions, so we confine their usage model by idiom with respect to masking and exclude them from programmable SIMD.
Please add comments about this in InstructionAttr, and add comments to all fields of InstructionAttr - briefly describe
them. It will help us in the future to set correct values.
I may need to be educated about "programmable SIMD" :)
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Wednesday, February 10, 2016 10:05 PM
> To: hotspot-compiler-dev at openjdk.java.net
> Cc: Berg, Michael C
> Subject: Re: CR for RFR 8149421
> What are the changes in assembler_x86.cpp? You changed the no_mask_reg argument values. Was it a bug?
> Looks like you copy-pasted code from insert_pre_post_loops(), which is fine.
> One thing that worries me is that, due to the ratio of unrolling done before vectorization and the vector size, you can have several repetitive vector operations. It would be nice if we did unrolling equal to the vector size, then did vectorization to generate one vector instruction, then cloned it to create the vector_post_loop, and then unrolled main more.
> Or you are already doing something like that?
> On 2/9/16 3:16 PM, Berg, Michael C wrote:
>> Hi Folks,
>> I would like to contribute vectorized post loops. This patch is
>> initially targeted for x86. The design is versatile so as to be
>> portable to other targets as well. This code proposes the addition of
>> atomic unrolled drain loops which precede fix-up segments and which
>> are significantly faster than scalar code. The requirement is that the
>> main loop is super unrolled after vectorization. I see up to 54% uplift on micro-benchmarks on x86 targets for loops which pass superword vectorization and meet the above criteria. Scimark metrics in SpecJvm2008, like lu.small and fft.small, also show the benefit of this design on x86.
>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8149421