RFR(S): 8076284: Improve vectorization of parallel streams

Civlin, Jan jan.civlin at intel.com
Wed Apr 22 03:26:36 UTC 2015


I use MSVS12 (Microsoft (R) C/C++ Optimizing Compiler Version 18.00.21005.1 for x64) but found today that my last night patch may not compile in previous versions of the Microsoft compiler.
Here is the patch that compiles in both: MSVS12 and MSVS10.

Please use this patch.

Thank you,


-----Original Message-----
From: Civlin, Jan 
Sent: Monday, April 20, 2015 11:45 PM
To: hotspot-compiler-dev at openjdk.java.net
Cc: Civlin, Jan
Subject: RE: RFR(S): 8076284: Improve vectorization of parallel streams


Here is the description and new patch with the changes you recommended (except the last one - see below my explanation).

The patch description.

This patch provides on-demand vectorization/SIMD'ing of a <method> specified in JVM command as -XX:CompileCommand=option,<method>,Vectorize.
This optimization may be globally disabled by setting the flag -XX: -AllowVectorizeOnDemand (by default it is true). 

For each method that was specified with Vectorize option we do the following: 

1. On each iteration of loop unroll for a given method (loopopts.cpp) we generate the next _clone_idx (which will be common for all the nodes cloned in this iteration); and on each node cloning we hash _idx of the origin of the node that is cloned (_idx_clone_orig ) and the _clone_idx (cm.verify_insert_and_clone).
CloneMap belongs to Compile and is created in CompilerWrapper.  

2. In SuperWord optimization, after max_depth has been built, we are hoisting the loads. 
For this we for each Load_X (subject of the hoisting) find some Load_0 that has the same origin as Load_X but belongs to the first iteration, i.e. if the MemNode::Memory input of Load_0 is memory Phi (collected previously in memory_slice) we set this Phi also as the MemNode::Memory input of Load_X. After this rebuild of the graph we restart the Superword optimization.

The major routines here.
- SuperWord::mark_generations: computes _ii_first (the index _clone_idx) of the nodes that have MemNode::Memory input coming from a phi node in some slice; computes list of the nodes in the first and last iterations of the loop.
- SuperWord::hoist_loads_in_graph: for each memory slice (a phi node) visits each load that has this phi as a memory input and then for each other load that has the same origin makes the memory input coming from the phi.  This routine does not use marking generations mechanism.
- SuperWord::pack_parallel - this routine is called only if SuperWord fails to produce any pack after extend_packlist(); it is another algorithm for packing instructions into SIMD.
It goes thru the list of all instructions in the _iteration_first, and if it is a Load, Store, Add or Mul it starts a new pack and adds this instruction to this pack. Then the algorithm circulates thru the iterations of the loop (gen < _ii_order.length()) and over the instructions list and finds the node with the origin coinciding with the origin of the nodes already in the pack - then it adds this node to the pack. Once packs are built, SuperWord returns to the normal processing (combine_packs()).
Note, that neither 2 or 3 goes thru the data dependency analysis, since the correctness of parallelization was guaranteed by the user.
Note, that some checks in added code could be omitted. But we are not assume that there is no optimization (now or in the future) that can change the graph structure between loop unrolling and the SuperWord, so we prefer to run many (probably unneeded) checks in the SuperWord. 

>>Can you also utilize changes done by Michael Berg for reduction 
>>optimization (the code in jdk9/hs-comp already)? I mean marking some 
>>nodes before unrolling and searching Phis.
Michael Berg and I looked at what you suggested (phi marking before unroll?) and we both think this marking is very different than what I do: Michael's phi marking is just one bit per node (in the flags) whereas I collect the _idx_clone_orig and the unroll generation (clone_idx).

 -----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
Sent: Thursday, April 16, 2015 3:31 PM
To: Civlin, Jan; hotspot-compiler-dev at openjdk.java.net
Subject: Re: RFR(S): 8076284: Improve vectorization of parallel streams

Hi Jan,

You did not describe your changes in details (what they do).

IgnoreVectorizeMethod flag should positive and enabled by default. 
Rename it to AllowVectorizeOnDemand (or something similar):

+  product(bool, AllowVectorizeOnDemand, true,

Instead of next you should add intrinsic definition to
and classfile/vmSymbols.hpp and then check method()->intrinsic_id():

+    if (strcmp("forEachRemaining", method()->name()->as_quoted_ascii()) 
== 0 && method()->signature() != 0
+      && method()->signature()->as_symbol() != 0 && 
method()->signature()->as_symbol()->as_quoted_ascii() != 0 ) {
+      if 
+        set_do_vector_loop(true);
+      }
+    }

And that should be under flag too because in general forEachRemaining 
should be vectorized only if it is safe.

Can you also utilize changes done by Michael Berg for reduction 
optimization (the code in jdk9/hs-comp already)? I mean marking some 
nodes before unrolling and searching Phis.


On 4/13/15 3:33 AM, Civlin, Jan wrote:
> Hi All,
>   We would like to contribute the improvement of vectorization of
>   parallel streams  from Intel.
> The contribution Bug ID: 8076284.
> Please review this patch:
> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8076284
> webrev: http://cr.openjdk.java.net/~kvn/8076284/webrev/
>       *Description*
> Improve vectorization of the unordered parallel streams (by vectorizing
> forEachRemaining method).
> For example, this forEach will be vectorized:
> java.util.stream.IntStream iStream = java.util.stream.IntStream.range(0,
> RANGE - 1).parallel();
> iStream.forEach( id -> c[id] = c[id] + c[id+1] );
> It also enables on-demand loop vectorization in a given method (by
> providing more hints to SuperWord optimization).
> For example, use -XX:CompileCommand=option,computeCall,Vectorizeto
> vectorize this loop
> void computeCall(double [] Call, double  puByDf, double  pdByDf)
> {
> for(int i = timeStep; i > 0; i--)
> for(int j = 0; j <= i - 1; j++)
> Call[j] = puByDf * Call[j + 1] + pdByDf * Call[j];
> }
> This enhancement is contributed by Intel and sponsored by the hotspot
> compiler team.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: webrev.042115.1.tar.bz2
Type: application/octet-stream
Size: 935230 bytes
Desc: webrev.042115.1.tar.bz2
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20150422/9c4f65f2/webrev.042115.1.tar-0001.bz2>

More information about the hotspot-compiler-dev mailing list