[9] RFR(S): 8078497: C2's superword optimization causes unaligned memory accesses

Tobias Hartmann tobias.hartmann at oracle.com
Wed Apr 29 12:44:39 UTC 2015


please review the following patch.


Background information (simplified):
After loop optimizations, C2 tries to vectorize memory operations by finding and merging adjacent memory accesses in the loop body. The superword optimization matches MemNodes with the following pattern as address input:

  ptr + k*iv + constant [+ invar]

where iv is the loop induction variable, k is a scaling factor that may be 0 and invar is an optional loop invariant value. C2 then picks the memory operation with most similar references and tries to align it in the main loop by setting the number of pre-loop iterations accordingly. Other adjacent memory operations that conform to this alignment are merged into packs. After extending and filtering these packs, final vector operations are emitted.

The problem is that some architectures (for example, Sparc) require memory accesses to be aligned and in special cases the superword optimization generates code that contains unaligned vector instructions.

(1) If two memory operations have different loop invariant offset values and C2 decides to align the main loop to one of them, we can always set the invariant of the other operation such that we break the alignment constraint. For example, if we have a store to a byte array a[i+inv1] and a load from a byte array b[i+inv2] (where i is the induction variable), C2 may decide to align the main loop such that (i+inv1 % 8) == 0 and replace the adjacent byte stores by a 8-byte double word store. Also the byte loads will be replaced by a double word load. If we then set inv2 = inv1 + 1 at runtime, the load will fail with a SIGBUS because the access is not 8-byte aligned, i.e., (i+inv2 % 8) != 0. Vectorization should not take place in this case.

(2) If C2 decides to align the main loop to a memory operation, the necessary adjustment of the induction variable by the pre-loop is computed such that the resulting offset in the main loop is aligned to the vector width (see 'SuperWord::get_iv_adjustment'). If the offset of a memory operation is independent of the loop induction variable (i.e., scale k is 0), the iv adjustment should be 0. 

(3) If the loop span is greater than the vector width 'vw' of a memory operation, the superword optimization assumes that the pre-loop is never able to align this operation in the main loop (see 'SuperWord::ref_is_alignable'). This is wrong because if the loop span is a multiple of vw and depending on the initial offset, it may very well be possible to align the operation to vw.

These problems originally only showed up in the string density code base (JDK-8054307) where we have a putChar intrinsic that writes a char value to two entries of a byte array. To reproduce the issue with JDK9 we have to make sure that the memory operations:
(i) are independent,
(ii) have different invariants,
(iii) are not too complex (because then vectorization will not take place).

To guarantee (i) we either need two different arrays (for example, byte[] and char[]) for the load and store operations or the same array but different offsets. Since the offsets should include a loop invariant part (ii), the superword optimization will not be able to determine that the runtime offset values do not overlap. We therefore use Unsafe.getChar/putChar to read/write a char value from/to a byte/char array and thereby guarantee independence.

I came up with the following test (see 'TestVectorizationWithInvariant.java'):
  byte[] src = new byte[1000];
  byte[] dst = new char[1000];

  for (int i = (int) CHAR_ARRAY_OFFSET; i < 100; i = i + 8) {
    // Copy 8 chars from src to dst
    unsafe.putChar(dst, i + 0, unsafe.getChar(src, off + 0));
    unsafe.putChar(dst, i + 14, unsafe.getChar(src, off + 14));

Problem (1) shows up since the main loop will be aligned to the StoreC[i + 0] which has no invariant. However, the LoadUS[off + 0] has the loop invariant 'off'. Setting off to BYTE_ARRAY_OFFSET + 2 will break the alignment of the emitted double word load and result in a crash.

The LoadUS[off + 0] in above example is independent of the loop induction variable and therefore has no scale value. Because of problem (2), the iv adjustment is computed as -4 (see 'SuperWord::get_iv_adjustment').

  offset = 16 iv_adjust = -4 elt_size = 2 scale = 0 iv_stride = 16 vect_size 8

The regression test contains additional test cases that also trigger problem (3). 

(1) I added a check to 'SuperWord::find_adjacent_refs' to make sure that memory accesses with different invariants are only vectorized if the architecture supports unaligned memory accesses.
(2) 'SuperWord::get_iv_adjustment' is modified such that it returns 0 if the memory operation is independent of iv.
(3) In 'SuperWord::ref_is_alignable' we need to additionally check if span is divisible by vw. If so, the pre-loop will add multiples of vw to the initial offset and if that initial offset is divisible by vm, the final offset (after the pre-loop) will be divisible as well and we should return true. I added comments to the code describing this in detail.

- original test with string-density codebase
- regression test


More information about the hotspot-compiler-dev mailing list