Vector API performance of two-vector rearrange() overload

Kai Burjack kburjack at
Fri Feb 12 10:13:05 UTC 2021

I was just checking on the performance of shuffle operations, and noticed
that the solution proposed in has
been implemented, making single-vector shuffle/rearrange quite fast now.

Next, I was implementing 4x4 matrix inversion with the Vector API and hit
another performance roadblock. Essentially, I needed some way to use SSE's
MOVHLPS and MOVLHPS, which are two-source shuffles of the kind LLVM emits
for builtins such as `__builtin_shufflevector(v0, v1, 0, 4, 1, 5)`.

While searching for a way to do this with the Vector API, I discovered the
two-vector overload of rearrange() and concluded that the above LLVM
builtin could be expressed as
`v0.rearrange(SPECIES_128.shuffleFromValues(0, 4, 1, 5), v1)`.
However, there is a HUGE performance overhead in using that two-vector
overload, in particular when the shuffle actually selects lanes from the
second vector argument via what the documentation of rearrange() calls
"exceptional indexes".
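
To make the lane numbering concrete, here is a minimal scalar sketch in
plain Java. It is not Vector API code and says nothing about performance;
it only models the LLVM-style convention assumed above, where indexes
0..3 select from the first vector and 4..7 from the second. The class
name `TwoSourceShuffleModel` and the method `shuffle` are hypothetical
names made up for this illustration. The MOVLHPS and MOVHLPS index
patterns in main() follow the documented semantics of those instructions,
and are shown next to the (0, 4, 1, 5) pattern, which interleaves the low
halves of the two inputs.

```java
import java.util.Arrays;

// Hypothetical scalar model of a two-source shuffle (illustration only).
final class TwoSourceShuffleModel {

    // Index i in [0, n) selects a[i]; index i in [n, 2n) selects b[i - n].
    static float[] shuffle(float[] a, float[] b, int... indexes) {
        int n = a.length;
        if (b.length != n || indexes.length != n) {
            throw new IllegalArgumentException("all inputs must have the same length");
        }
        float[] result = new float[n];
        for (int lane = 0; lane < n; lane++) {
            int i = indexes[lane];
            if (i < 0 || i >= 2 * n) {
                throw new IndexOutOfBoundsException("index out of range: " + i);
            }
            result[lane] = (i < n) ? a[i] : b[i - n];
        }
        return result;
    }

    public static void main(String[] args) {
        float[] v0 = {10f, 11f, 12f, 13f};
        float[] v1 = {20f, 21f, 22f, 23f};

        // Pattern from the post: interleaves the low halves of v0 and v1.
        System.out.println(Arrays.toString(shuffle(v0, v1, 0, 4, 1, 5)));
        // MOVLHPS semantics: low half of v1 moved into the high half of v0.
        System.out.println(Arrays.toString(shuffle(v0, v1, 0, 1, 4, 5)));
        // MOVHLPS semantics: high half of v1 moved into the low half of v0.
        System.out.println(Arrays.toString(shuffle(v0, v1, 6, 7, 2, 3)));
    }
}
```

With the inputs above, the three calls yield [10, 20, 11, 21],
[10, 11, 20, 21] and [22, 23, 12, 13] respectively.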

Even when not using "exceptional" indexes, this two-vector overload of
rearrange() is rather slow.
I compared this:

`v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1), v0)`

to the equivalent form:

`v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1))`

and the two-vector form was again considerably slower.

Is it simply that the two-vector overload of rearrange() has not yet been
given a fast intrinsic path, for example via MOVHLPS/MOVLHPS?


More information about the panama-dev mailing list