C2: Unrolling and hoisting trivial expressions out of loops

Andrew Haley aph at redhat.com
Fri Apr 27 15:08:10 UTC 2018

AArch64 doesn't have a [reg + offset + (index << n)] addressing mode.
On AArch64 (and probably other targets) we generate dreadful code in C2
when hoisting constants out of loops:

        for (int i = 0; i < left.length && i < right.length; ++i) {
            dp += left[i] * right[i ];

generates this in the loop prologue:

   add	x12, x11, #0x18
   str	x12, [sp,#56]
   add	x12, x13, #0x18
   str	x12, [sp,#64]
   add	x12, x11, #0x1c
   str	x12, [sp,#72]
   add	x12, x13, #0x1c
   str	x12, [sp,#80]
   add	x12, x11, #0x20
   str	x12, [sp,#88]

... and in the loop body

   ldr	w11, [x6,w10,sxtw #2]
   ldr	x12, [sp,#32]
   ldr	w13, [x12,w10,sxtw #2]
   madd	w0, w13, w11, w0
   ldr	x11, [sp,#40]
   ldr	w13, [x11,w10,sxtw #2]
   ldr	x12, [sp,#48]
   ldr	w12, [x12,w10,sxtw #2]
   ldr	x11, [sp,#64]
   madd	w13, w13, w12, w0
   ldr	w11, [x11,w10,sxtw #2]
   ldr	x12, [sp,#56]
   ldr	w12, [x12,w10,sxtw #2]
   madd	w11, w12, w11, w13
   ldr	x12, [sp,#72]
   ldr	w13, [x12,w10,sxtw #2]
   ldr	x0, [sp,#80]
   ldr	w0, [x0,w10,sxtw #2]

... etc.

So, we are spilling trivial offsets and reloading them in a loop.
Each load of a trivial offset from the stack takes 5 cycles, whereas
recalculating it would take 1 cycle.

I did the experiment of defining a pattern which generates two
instructions, an add and an indexed load:

operand indIndexScaledOffsetI2L(iRegP reg, iRegI ireg, immIScale scale, immLU12 off)
  match(AddP (AddP reg off) (LShiftL (ConvI2L ireg) scale));

and this solves the problem.  There is no prologue which calculates
the offsets, and the loop looks like this:

  0x000003ffad233e78: add	xscratch1, x4, #0x1c
  0x000003ffad233e7c: ldr	w12, [xscratch1,x17,lsl #2]
  0x000003ffad233e80: madd	w13, w13, w11, w0
  0x000003ffad233e84: add	xscratch1, x3, #0x1c
  0x000003ffad233e88: ldr	w11, [xscratch1,x17,lsl #2]
  0x000003ffad233e8c: add	xscratch1, x4, #0x20
  0x000003ffad233e90: ldr	w15, [xscratch1,x17,lsl #2]
  0x000003ffad233e94: madd	w11, w11, w12, w13
  0x000003ffad233e98: add	xscratch1, x3, #0x20
  0x000003ffad233e9c: ldr	w13, [xscratch1,x17,lsl #2]
  0x000003ffad233ea0: add	xscratch1, x4, #0x24
  0x000003ffad233ea4: ldr	w12, [xscratch1,x17,lsl #2]
  0x000003ffad233ea8: madd	w13, w13, w15, w11
  0x000003ffad233eac: add	xscratch1, x3, #0x24
  0x000003ffad233eb0: ldr	w11, [xscratch1,x17,lsl #2]
  0x000003ffad233eb4: add	xscratch1, x4, #0x28
  0x000003ffad233eb8: ldr	w15, [xscratch1,x17,lsl #2]
  0x000003ffad233ebc: madd	w11, w11, w12, w13
  0x000003ffad233ec0: add	xscratch1, x3, #0x28

But it feels like a bit of a kludge.  It would be nicer if C2 didn't
hoist trivial expressions of the form (reg+offset) out of loops.
Alternatively, I could turn down the amount of unrolling so we only,
say, unroll 8 times, but this too feels like a kludge.  Could we
dissuade C2 from hoisting trivial reg+const expressions?

[This simple example requires UseSuperWord to be disabled.]

Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

More information about the hotspot-compiler-dev mailing list