C2: Unrolling and hoisting trivial expressions out of loops

Andrew Haley aph at redhat.com
Fri Apr 27 15:08:10 UTC 2018

AArch64 doesn't have a [reg + offset + (index << n)] addressing mode.
On AArch64 (and probably other targets) we generate dreadful code in C2
when hoisting constants out of loops:

for (int i = 0; i < left.length && i < right.length; ++i) {
    dp += left[i] * right[i];
}

generates this in the loop prologue:

str	x12, [sp,#56]
str	x12, [sp,#64]
str	x12, [sp,#72]
str	x12, [sp,#80]
str	x12, [sp,#88]

... and in the loop body

ldr	w11, [x6,w10,sxtw #2]
ldr	x12, [sp,#32]
ldr	w13, [x12,w10,sxtw #2]
ldr	x11, [sp,#40]
ldr	w13, [x11,w10,sxtw #2]
ldr	x12, [sp,#48]
ldr	w12, [x12,w10,sxtw #2]
ldr	x11, [sp,#64]
ldr	w11, [x11,w10,sxtw #2]
ldr	x12, [sp,#56]
ldr	w12, [x12,w10,sxtw #2]
ldr	x12, [sp,#72]
ldr	w13, [x12,w10,sxtw #2]
ldr	x0, [sp,#80]
ldr	w0, [x0,w10,sxtw #2]

... etc.

So, we are spilling trivial offsets and reloading them in a loop.
Each load of a trivial offset from the stack takes 5 cycles, whereas
recalculating it would take 1 cycle.
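
To make it concrete where the (reg+offset) terms come from, here is a hand-written, source-level analogue of the loop after unrolling by four. It is only a sketch, not C2 output: the class and method names are invented for illustration, the unroll factor of four is arbitrary (the real loop is evidently unrolled further), and HEADER in the comments stands for the array header size, which depends on the VM configuration.

```java
// Hand-written analogue of the loop after unrolling by four.
// This is NOT what C2 emits; names and unroll factor are illustrative.
final class UnrolledDot {
    // Rolled form, as in the original loop.
    static int dot(int[] left, int[] right) {
        int dp = 0;
        for (int i = 0; i < left.length && i < right.length; ++i) {
            dp += left[i] * right[i];
        }
        return dp;
    }

    // Unrolled by four. The address of left[i + k] is
    //   left + HEADER + 4*k + (i << 2)
    // where HEADER is the (VM-dependent, assumed) array header size.
    // The part (left + HEADER + 4*k) does not change inside the loop,
    // so each k for each array yields one loop-invariant pointer:
    // 2 arrays * 4 copies = 8 invariant pointers in this sketch.
    static int dotUnrolled(int[] left, int[] right) {
        int n = Math.min(left.length, right.length);
        int dp = 0;
        int i = 0;
        for (; i < n - 3; i += 4) {
            dp += left[i] * right[i];
            dp += left[i + 1] * right[i + 1];
            dp += left[i + 2] * right[i + 2];
            dp += left[i + 3] * right[i + 3];
        }
        for (; i < n; ++i) { // remainder iterations
            dp += left[i] * right[i];
        }
        return dp;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        int[] b = {7, 6, 5, 4, 3, 2, 1, 0};
        // Both forms compute the same value.
        System.out.println(dot(a, b) == dotUnrolled(a, b));
    }
}
```

With a larger unroll factor the number of such invariant pointers grows in proportion, which is presumably why they no longer all fit in registers and end up in stack slots.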

I did the experiment of defining a pattern which generates two

operand indIndexScaledOffsetI2L(iRegP reg, iRegI ireg, immIScale scale, immLU12 off)
%{
constraint(ALLOC_IN_RC(ptr_reg));

and this solves the problem.  There is no prologue which calculates
the offsets, and the loop looks like this:

But it feels like a bit of a kludge.  It would be nicer if C2 didn't
hoist trivial expressions of the form (reg+offset) out of loops.
Alternatively, I could turn down the amount of unrolling so we only,
say, unroll 8 times, but this too feels like a kludge.  Could we
dissuade C2 from hoisting trivial reg+const expressions?

[This simple example requires UseSuperWord to be disabled.]
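
For anyone wanting to try this, a minimal stand-alone reproducer might look like the sketch below. Only the loop itself comes from the example above; the class name, the helper `run`, the array length and the iteration count are assumptions chosen so that the method gets hot enough for C2 to compile it.

```java
// Hypothetical reproducer; everything except the loop body is an
// illustrative choice.
final class DotProductRepro {
    // The loop under discussion.
    static int dotProduct(int[] left, int[] right) {
        int dp = 0;
        for (int i = 0; i < left.length && i < right.length; ++i) {
            dp += left[i] * right[i];
        }
        return dp;
    }

    // Calls dotProduct repeatedly so the JIT has a reason to compile it.
    // The accumulated result is returned so the work is not dead code.
    static int run(int length, int iterations) {
        int[] left = new int[length];
        int[] right = new int[length];
        for (int i = 0; i < length; i++) {
            left[i] = i;
            right[i] = 2;
        }
        int sink = 0;
        for (int n = 0; n < iterations; n++) {
            sink += dotProduct(left, right);
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(1024, 50_000));
    }
}
```

One way to run it with auto-vectorization off is `javac DotProductRepro.java && java -XX:-UseSuperWord DotProductRepro`. Inspecting the generated code additionally needs `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` and the hsdis disassembler library; whether this exact harness reproduces the spills will depend on the JDK build and the machine.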

--
Andrew Haley