RFR(S): 8159016: Over-unrolled loop is partially removed

Tobias Hartmann tobias.hartmann at oracle.com
Wed Jun 22 13:22:46 UTC 2016

Hi Vladimir,

thanks for the review!

On 21.06.2016 21:59, Vladimir Kozlov wrote:
> Thank you, Tobias
> Your changes are good to have.
> You can still have a problem if unrolling is done before CCP phase which calculates array size and/or limit.
> Try to delay size calculation in test until CCP phase. Something like next (please, verify):
>         // Simplified during CCP:
>         int size = 2;
>         for (; size < 4; size *= 2);
>         Object arr[] = new Object[size - 1];

With a similar piece of code, I'm able to delay the size calculation until after CCP, but then the out-of-bounds stores are not removed because the trip-count Phi has type "int" (instead of "int:>=1"). I think this is because the CastIINode::Value() optimization for CastIIs with _carry_dependency cannot determine a more restricted type before CCP.
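For reference, a compilable shape of that idea (a sketch based on Vladimir's snippet; the class and method names are made up):

```java
public class DelayedSize {
    // Sketch: the array length is only known once CCP folds the size loop
    // to size = 4, so it is not a constant during earlier loop opts.
    static Object[] allocate() {
        int size = 2;
        for (; size < 4; size *= 2); // simplified to size = 4 during CCP
        return new Object[size - 1]; // length 3, not constant before CCP
    }

    public static void main(String[] args) {
        System.out.println(allocate().length); // prints 3
    }
}
```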
> Why over-unrolling check code in IdealLoopTree::policy_unroll() does not work?

The code after "// Protect against over-unrolling when init or/and limit are not constant" can only detect over-unrolling if the type of the iv-Phi is narrow enough. In this case, the Phi's type is "#int:>=1:www".
> I think the trip count check before main loop should collapse whole main loop code. Then we will not have this problem. Why it is not collapsed?

The trip count check before the main loop is guarded by Opaque nodes and not folded until after loop unrolling. I think it's also better to unroll the loop at least twice instead of over-unrolling and then removing the loop.

> About test.
> You do & 3 two times:
> int lim = (arg & 3);
> test(i & 3);
> Use 11_000 iterations because 10000 may not trigger compilation.

Right, I had already fixed this but accidentally uploaded an old version of the test. We actually don't even need 10_000 iterations because the test runs with -Xcomp to avoid gathering "too much" profiling information for the loop. A loop with 42 iterations (for some reason, a few iterations are needed) triggers the problem 100% of the time on my machine.

> Let array escape to avoid EA. For example, return it.

Done, but this triggers another problem: we now crash in PhaseMacroExpand::expand_allocate_common() at line 1600 while adding a membar after an InitializeNode, because all memory users were removed. The problem is that parts of the post loop are now removed before we can determine that the entire loop is dead (and it is dead because the pre loop executes one iteration and the unrolled main loop executes the remaining two). The post loop is only removed after the Opaque1Node that guards the predicate is removed, which happens later during macro expansion.

I think it is safe to simply add a != NULL guard, like we do in PhaseMacroExpand::process_users_of_allocation(). I verified that the post loop is removed once the Opaque1Node goes away.
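Put together, the updated reproducer looks roughly like this (a sketch; the class name is made up and the actual test is in the webrev):

```java
// Run with -Xcomp so the method is compiled without gathering profiling
// information; the array escapes via the return value to keep EA from
// removing the allocation.
public class TestOverUnrolling {
    static Object[] test(int lim) {
        Object[] arr = new Object[3];
        for (int i = 0; i < lim; ++i) { // lim <= 3: loop is split and unrolled
            arr[i] = new Object();
        }
        return arr; // escape
    }

    public static void main(String[] args) {
        // about 42 iterations suffice to trigger the problem with -Xcomp
        for (int i = 0; i < 42; ++i) {
            test(i & 3);
        }
    }
}
```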

New webrev:

I'm re-running the testing.


> Thanks,
> Vladimir
> On 6/21/16 8:19 AM, Tobias Hartmann wrote:
>> Hi,
>> please review the following fix:
>> https://bugs.openjdk.java.net/browse/JDK-8159016
>> http://cr.openjdk.java.net/~thartmann/8159016/webrev.00/
>> I was able to reproduce the problem with a simple test:
>>  Object arr[] = new Object[3];
>>  int lim = (arg & 3);
>>  for (int i = 0; i < lim; ++i) {
>>    arr[i] = new Object();
>>  }
>> The loop is split into pre, main and post loops. The pre loop is executed for at least one iteration, initializing arr[0]. The main loop is unrolled twice, initializing at least arr[1], arr[2], arr[3] and arr[4]. The arr[3] and arr[4] stores are always out of bounds and removed. However, C2 is unable to remove the "over-unrolled", dead main loop. As a result, there is a control path from the main loop to the PostScalarRce loop (see JDK-8151573) without a memory path since the last store was replaced by TOP. We crash because we use a memory edge from a non-dominating region.
>> The out of bound stores in the over-unrolled loop are removed because their range check dependent CastII nodes are removed by the first optimization in CastIINode::Ideal():
>>  CastII (AddI Phi 2) -> AddI (CastII Phi) 2 -> AddI TOP 2
>> The CastIINodes are replaced by TOP because the type of the loop Phi is >= 1 while the CastII's type is [1..2]: no value >= 1 can still fall into the [1..2] range once 2 is added.
>> The underlying problem is the over-unrolling of the loop. Since lim <= 3, the loop executes at most 3 iterations. With the pre loop executing at least one iteration, the main loop should not be unrolled more than once. This problem is similar to JDK-4834191, where we over-unrolled a loop with a constant limit.
>> I changed the implementation of IdealLoopTree::compute_exact_trip_count() to not only compute the exact trip count when the loop limit is constant but to also set the _trip_count field when we can determine an upper bound. This way, the checks in IdealLoopTree::policy_unroll() prevent additional unrolling once the trip count becomes too small, and we keep the maximally unrolled version.
>> An alternative fix would be to disable the CastII optimization for range check dependent CastII nodes but this does not fix the underlying problem of over-unrolling.
>> Tested with regression test, JPRT and RBT (running).
>> Thanks,
>> Tobias
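To illustrate the quoted explanation above: after splitting and 2x unrolling, the loop conceptually has the following shape (a simplified sketch that ignores zero-trip guards and the exact limit adjustments C2 emits):

```java
public class UnrolledShape {
    // Simplified view of the pre/main/post split with the main loop unrolled
    // twice (four stores per main-loop iteration). Since lim <= 3 and the
    // pre loop already stores arr[0], the main-loop body is dead: arr[3]
    // and arr[4] could never be stored in bounds.
    static void fill(Object[] arr, int lim) {
        int i = 0;
        if (i < lim) { arr[i++] = new Object(); }   // pre loop: arr[0]
        while (i + 3 < lim) {                       // main loop: never taken for lim <= 3
            arr[i]     = new Object();
            arr[i + 1] = new Object();
            arr[i + 2] = new Object();
            arr[i + 3] = new Object();              // always out of bounds here
            i += 4;
        }
        while (i < lim) { arr[i++] = new Object(); } // post loop
    }

    public static void main(String[] args) {
        Object[] arr = new Object[3];
        fill(arr, 3);
        System.out.println(arr[0] != null && arr[1] != null && arr[2] != null); // prints true
    }
}
```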
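The CastII folding described in the quoted message can be restated in interval terms (illustrative only, this is not C2 code): the trip-count Phi has type [1, MAX], so the AddI yields [3, MAX], while the range-check dependent CastII requires [1, 2]; the intersection is empty, hence TOP.

```java
public class IntervalDemo {
    // Two intervals are disjoint iff the lower bound of their intersection
    // exceeds its upper bound.
    static boolean emptyIntersection(long lo1, long hi1, long lo2, long hi2) {
        return Math.max(lo1, lo2) > Math.min(hi1, hi2);
    }

    public static void main(String[] args) {
        // Phi >= 1, so Phi + 2 lies in [3, MAX]; the CastII requires [1, 2].
        System.out.println(emptyIntersection(1 + 2, Integer.MAX_VALUE, 1, 2)); // prints true
    }
}
```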
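The fix described in the quoted message can be sketched as follows (hypothetical names; the real change lives in IdealLoopTree::compute_exact_trip_count in the webrev):

```java
public class TripCountBound {
    // Even when init and limit are not constants, their types give bounds
    // from which an upper bound on the trip count follows (stride > 0 assumed).
    static long tripCountUpperBound(long initLo, long limitHi, long stride) {
        if (limitHi <= initLo) return 0;                 // zero-trip loop
        return (limitHi - initLo + stride - 1) / stride; // ceiling division
    }

    public static void main(String[] args) {
        // init >= 0, limit <= 3, stride 1: at most 3 iterations, so
        // unrolling the main loop more than once cannot pay off.
        System.out.println(tripCountUpperBound(0, 3, 1)); // prints 3
    }
}
```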

More information about the hotspot-compiler-dev mailing list