[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support
ningsheng.jian at arm.com
Fri Aug 21 07:56:19 UTC 2020
Thanks a lot for looking at this!
On 8/20/20 8:29 PM, Vladimir Ivanov wrote:
> Hi Ningsheng,
> Impressive work, Ningsheng!
> "Since the bottom 128 bits are shared with the NEON, we extend current
> register mask definition of V0-V31 registers. Currently, c2 uses one bit
> mask for a 32-bit register slot, so to define at most 2048 bits we will
> need to add 64 slots in AD file. That's a really large number, and will
> also break current regmask assumption."
> Can you, please, elaborate on the last point? What RegMask assumptions
> are broken for 2048-bit vectors? I'm looking at [1] and trying to
> understand the motivation for the changes in shared code.
The current regmask is an array of ints in which each bit stands for one
32-bit slot, so one element of the regmask array can describe at most
32*32=1024 bits of register state. Some regmask handling functions, e.g.
clear_to_sets() for alignment, need to be re-examined to support 2048
bits. And we may even want to support non-power-of-two physical register
sizes, which could be a lot more work.
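
To make the arithmetic above concrete, here is a minimal sketch. It is not HotSpot code: the names kSlotBits, kSlotsPerWord, slots_for_vector and register_bits_per_word are made up for illustration, and the only assumptions are the ones stated above (one mask bit per 32-bit slot, 32-bit mask words).

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants, not taken from the HotSpot sources.
constexpr uint32_t kSlotBits = 32;      // one mask bit describes one 32-bit slot
constexpr uint32_t kSlotsPerWord = 32;  // one 32-bit int holds 32 mask bits

// Number of 32-bit slots needed to describe a register of the given width.
constexpr uint32_t slots_for_vector(uint32_t vector_bits) {
  return (vector_bits + kSlotBits - 1) / kSlotBits;
}

// Register bits that a single mask word can describe.
constexpr uint32_t register_bits_per_word() {
  return kSlotBits * kSlotsPerWord;
}
```

With these definitions, slots_for_vector(2048) is 64 while one mask word covers only 32 slots (1024 register bits), so a 2048-bit register would straddle two mask words.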
> Compared to x86 w/ AVX512, architectural state for vector registers is
> 4x larger in the worst case (ignoring predicate registers for now). Here
> are the relevant constants on x86:
> // the number of reserved registers + machine registers.
> #define REG_COUNT 545
> // Size of register-mask in ints
> #define RM_SIZE 22
> My estimate is that for AArch64 with SVE support the constants will be:
> REG_COUNT < 2500
> RM_SIZE < 100
> which don't look too bad.
Right, but I think most real hardware implementations will be no larger
than 512 bits, so a large bitmask array with most of its bits unused
will be less efficient for regmask computation.
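
As a rough illustration of the sizes involved, the sketch below computes a lower bound on the number of mask words and the number of vector slots. The function names are hypothetical and this is not how ADLC derives RM_SIZE: the generated value (22 on x86) is larger than the lower bound, and the sketch does not model that difference.

```cpp
#include <cassert>
#include <cstdint>

// Lower bound on the number of 32-bit mask words needed for reg_count bits.
constexpr uint32_t min_mask_words(uint32_t reg_count) {
  return (reg_count + 31) / 32;
}

// Mask bits (32-bit slots) taken by num_regs vector registers of
// vector_bits each.
constexpr uint32_t vector_slots(uint32_t num_regs, uint32_t vector_bits) {
  return num_regs * (vector_bits / 32);
}
```

For 32 vector registers defined at 2048 bits, vector_slots(32, 2048) gives 2048 mask bits, i.e. at least 64 words. On a 512-bit implementation only vector_slots(32, 512) = 512 of those bits would carry information, which is the inefficiency I am worried about.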
> Also, I don't see any changes related to stack management. So, I assume
> it continues to be managed in slots. Any problems there? As I
> understand, wide SVE registers are caller-save, so there may be many
> spills of huge vectors around a call. (Probably, not possible with C2
> auto-vectorizer as it is now, but Vector API will expose it.)
Yes, the stack is still managed in slots, but it will be allocated with
the real vector register length instead of the 'virtual' slots for VecA.
See the usages of scalable_reg_slots(), e.g. in chaitin.cpp:1587. We have
also applied the patch to the Vector API, and did see a lot of vector
spills, with correct results as expected.
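
The idea can be sketched as follows. This is only an illustration and not the code in the webrev: VecOperand, spill_slots and kSlotBytes are hypothetical names, and the real logic lives around scalable_reg_slots().

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kSlotBytes = 4;  // a stack slot is 32 bits wide

// Hypothetical description of a vector operand as seen by the allocator.
struct VecOperand {
  bool scalable;        // true for a VecA-like, length-agnostic register
  uint32_t mask_slots;  // slots the operand occupies in the register mask
};

// Stack slots to reserve for a spill: a scalable operand uses the vector
// length detected at run time, any other operand uses its fixed mask size.
constexpr uint32_t spill_slots(const VecOperand& op, uint32_t hw_vector_bytes) {
  return op.scalable ? hw_vector_bytes / kSlotBytes : op.mask_slots;
}
```

So a scalable operand spilled on a 512-bit (64-byte) machine reserves 16 slots no matter how many 'virtual' slots it has in the mask, while a fixed-size operand keeps its mask size.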
> Have you noticed any performance problems? If that's the case, then
> AVX512 support on x86 would benefit from similar optimization as well.
Do you mean register allocation performance problems? I have not noticed
any so far. Do you have any suggestions on how to measure that?
> FTR there was a similar exercise [2] on x86 to abstract away exact sizes
> of vector registers, but it didn't have to worry about RA since all the
> operands were already available. Also, vectors of all different sizes
> may be used. So, it makes it hard to compare.
I've also noticed that. That's excellent work indeed. It could save a
lot of backend match rules for different vector register sizes. Needing
that many rules was one of our concerns when we started to work on SVE
RA, had we defined regmasks for every SVE vector register size. And yes,
our current approach will also solve that problem. :-)
> Best regards,
> Vladimir Ivanov
> [1] http://cr.openjdk.java.net/~njian/8231441/webrev.03-ra/
> [2] https://bugs.openjdk.java.net/browse/JDK-8230015