CR for RFR 8151573

Vladimir Kozlov vladimir.kozlov at
Wed Mar 16 15:41:57 UTC 2016

On 3/15/16 5:29 PM, Berg, Michael C wrote:
> Vladimir:
> The reason programmable SIMD depends upon this is that all versions of the final post loop have range checks in them until very late, after register allocation. They might be cleaned up in cfg optimizations, but not always. With multiversioning, we always remove the range checks in our key loop.

I understand that we can get some benefits. But in the general case they will not be visible.

> With regard to the pre loop: pre loops have special checks too, do they not, requiring flow in many cases?
> Programmable SIMD needs tight loops to accurately facilitate masked iteration mapping.

Yes, after you explained vector masking to me, I now understand why it could be used for the post loop.

After thinking about this, I would suggest that you look instead at the arraycopy and generate_fill stubs in 
stubGenerator_x86*.cpp (maybe only 64-bit). They also have post loops, but the changes would be platform specific only, 
smaller, and easy to understand and test. Also, arraycopy and 'fill' code are used very frequently by Java applications, so 
we may get more benefit than from optimizing general loops.
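
As a rough illustration of the idea (not the actual stub code), a fill routine with a full-width main loop and a single masked step in place of a scalar post loop could be sketched in plain Java as below. The class name, LANES, and maskedStore are hypothetical, and the "vector" store is simulated lane by lane; the real stubs are generated assembly.

```java
// Illustrative only: a scalar simulation of a fill loop whose tail is
// handled by one masked "vector" step instead of a scalar post loop.
// LANES and all names here are hypothetical, not HotSpot stub code.
final class MaskedTailFill {
    static final int LANES = 16; // assumed: 16 ints per 512-bit vector

    // Models one masked vector store: lane k is written only if mask[k] is set.
    static void maskedStore(int[] dst, int base, int value, boolean[] mask) {
        for (int k = 0; k < LANES; k++) {
            if (mask[k]) {
                dst[base + k] = value;
            }
        }
    }

    // Fills dst[from, to) with value; assumes 0 <= from <= to <= dst.length.
    static void fill(int[] dst, int from, int to, int value) {
        boolean[] all = new boolean[LANES];
        java.util.Arrays.fill(all, true);

        int i = from;
        // Main loop: full-width steps only.
        for (; i + LANES <= to; i += LANES) {
            maskedStore(dst, i, value, all);
        }

        // Post loop replaced by a single masked step covering the residue.
        int remaining = to - i;
        if (remaining > 0) {
            boolean[] tail = new boolean[LANES];
            for (int k = 0; k < remaining; k++) {
                tail[k] = true;
            }
            maskedStore(dst, i, value, tail);
        }
    }
}
```

With LANES = 16, the residue is always between 0 and 15 elements, so the tail costs at most one masked step regardless of its length.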


> Regards,
> Michael
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at]
> Sent: Tuesday, March 15, 2016 4:37 PM
> To: Berg, Michael C; 'hotspot-compiler-dev at'
> Subject: Re: CR for RFR 8151573
> As we all know, we can always construct microbenchmarks which show a 30% - 50% difference, while in a real application we will never see a difference. I still don't see a real reason why we should spend time optimizing
> *POST* loops. We already have a vectorized post loop to improve performance. Note that additional loop opts code will raise the maintenance cost.
> Why does "programmable SIMD" depend on it? What about the pre-loop?
> Thanks,
> Vladimir
> On 3/15/16 4:14 PM, Berg, Michael C wrote:
>> Correction below...
>> -----Original Message-----
>> From: hotspot-compiler-dev
>> [mailto:hotspot-compiler-dev-bounces at] On Behalf Of
>> Berg, Michael C
>> Sent: Tuesday, March 15, 2016 4:08 PM
>> To: Vladimir Kozlov; 'hotspot-compiler-dev at'
>> Subject: RE: CR for RFR 8151573
>> Vladimir, for programmable SIMD, which is the optimization that uses this implementation, I get the following on micros, and on code in general, that look like this:
>>       for(int i = 0; i < process_len; i++)
>>       {
>>         d[i]= (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
>>       }
>> The above code makes 9 vector ops.
>> For float with vector length VecZ, I get as much as 1.3x and for int as much as 1.4x uplift.
>> For double and long on VecZ it is smaller, but then so is the value of vectorization on those types anyways.
>> The value process_len is some fraction of the array length in my measurements.  The idea of the metrics is to pose a post loop with a modest amount of iterations in it.  For instance, if N is the max trip count of the post loop and N is in the range 1..VecZ-1, then for float we could do as many as 15 iterations in the fixup loop.
>> An example would be array_length = 512 with process_len in the range 81..96. We create a VecZ loop which was superunrolled 4 times with vector length 16, for an unroll of 64; the alignment loop processes 4 iterations, and the vectorized post loop is executed 1 time, leaving the remaining work in the final post loop, in this case possibly a multiversioned post loop.  We start that final loop at iteration 81, so we always do at least 1 fixup iteration, and as many as 15.  If we left the fixup loop as a scalar loop, that would mean 1 to 15 iterations plus our initial loops, which have {4,1,1} iterations as a group, or 6, to get us to index 80.  By vectorizing the fixup loop to one iteration, we now always have 7 iterations in our loops for all ranges of 81..96. Without this optimization and programmable SIMD, we would have the initial 6 plus 1 to 15 more, or a range of 7 to 21 iterations.
>> Would you prefer I integrate this with programmable SIMD and submit the patches as one?
>> I thought it would be easier to do them separately.  Also, exposing the post loops to this path offloads cfg processing to earlier compilation, making the graph less complex through register allocation.
>> Regards,
>> Michael
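
The iteration counts in the 81..96 example above can be checked with a few lines of Java. This is only a sketch of the arithmetic quoted in the message: the {4,1,1} figure and the residual range 1..15 are taken from the text, while the class and method names are hypothetical.

```java
// Illustrative arithmetic only; names are hypothetical.
final class FixupIterationCount {
    // Figures quoted in the message: the initial loops run {4, 1, 1}
    // iterations as a group, 6 in total, before the final post loop.
    static final int INITIAL_ITERATIONS = 4 + 1 + 1;

    // Scalar fixup loop: one iteration per residual element.
    static int withScalarFixup(int residual) {
        return INITIAL_ITERATIONS + residual;
    }

    // Masked fixup loop: a single iteration covers any non-empty residue.
    static int withMaskedFixup(int residual) {
        return INITIAL_ITERATIONS + (residual > 0 ? 1 : 0);
    }

    public static void main(String[] args) {
        for (int residual = 1; residual <= 15; residual++) {
            System.out.println("residual=" + residual
                    + " scalar=" + withScalarFixup(residual)
                    + " masked=" + withMaskedFixup(residual));
        }
    }
}
```

For residuals of 1 to 15 the scalar variant gives 7 to 21 total iterations, while the masked variant stays at 7, matching the figures in the message.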
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at]
>> Sent: Tuesday, March 15, 2016 2:42 PM
>> To: Berg, Michael C; 'hotspot-compiler-dev at'
>> Subject: Re: CR for RFR 8151573
>> Hi Michael,
>> Changes are significant so they have to be justified, especially since we are in a late stage of jdk9 development. Do you have performance numbers (not only for microbenchmarks) which show the benefit of these changes?
>> Thanks,
>> Vladimir
>> On 3/15/16 2:04 PM, Berg, Michael C wrote:
>>> Hi Folks,
>>> I would like to contribute multi-versioning post loops for range
>>> check elimination.  Previously, post loop optimizations for range
>>> checks were done in cfg optimizations after register allocation.
>>> I have added code which produces the desired effect much
>>> earlier by introducing a safe transformation which will minimally
>>> allow a range check free version of the final post loop to execute up
>>> until the point it actually has to take a range check exception by
>>> re-ranging the limit of the rce'd loop, then exit the rce'd post loop
>>> and take the range check exception in the legacy loops execution if required.
>>> If during optimization we discover that we know enough to remove the
>>> range check version of the post loop, mostly by exposing the load
>>> range values into the limit logic of the rce'd post loop, we will
>>> eliminate the range check post loop altogether much like cfg
>>> optimizations did, but much earlier.  This gives optimizations like
>>> programmable SIMD (via SuperWord) the opportunity to vectorize the
>>> rce'd post loops to a single iteration based on mask vectors which
>>> map to the residual iterations. Programmable SIMD will be a follow on
>>> change set utilizing this code to stage its work. This optimization
>>> also exposes the rce'd post loop without flow to other optimizations.
>>> Currently I have enabled this optimization for x86 only.  We base
>>> this loop on successfully rce'd main loops and if for whatever reason, multiversioning fails, we eliminate the loop we added.
>>> This code was tested as follows:
>>> Bug-id:
>>> webrev:
>>> Thanks,
>>> Michael
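
The transformation described in the original message can be pictured at the source level roughly as follows. This is a hand-written sketch under stated assumptions, not the compiler change itself: the class name, method name, and loop body are hypothetical, and in real Java both loops carry language-level checks that only the JIT can remove. The first loop stands for the rce'd post loop with a re-ranged limit, the second for the legacy loop that takes the range check exception if required.

```java
// Illustrative only: source-level picture of a multiversioned post loop.
// All names are hypothetical; the real transformation happens in the JIT.
final class MultiversionedPostLoop {
    // Writes 2 * src[i] into dst[i] for i in [start, limit).
    static void postLoop(int[] src, int[] dst, int start, int limit) {
        // Re-ranged limit: largest bound for which every access is in range.
        int safeLimit = Math.min(limit, Math.min(src.length, dst.length));

        int i = start;
        if (start >= 0) {
            // "rce'd" version: every index in [start, safeLimit) is provably
            // in bounds, so a compiler could drop the range checks here.
            for (; i < safeLimit; i++) {
                dst[i] = 2 * src[i];
            }
        }

        // Legacy version: runs only if iterations remain, and is where an
        // ArrayIndexOutOfBoundsException is thrown if one is required.
        for (; i < limit; i++) {
            dst[i] = 2 * src[i];
        }
    }
}
```

When limit is within both array lengths the second loop executes zero iterations, which corresponds to the case where the range check version of the post loop can be eliminated altogether.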
