[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support
vladimir.x.ivanov at oracle.com
Fri Aug 21 22:34:40 UTC 2020
Thanks for clarifications, Ningsheng.
Let me share my thoughts on the topic and I'll start with summarizing
the experience of migrating x86 code to generic vectors.
JVM has quite a bit of special logic to support vectors. It hasn't
exhausted the complexity budget yet, but it's quite close to the limit
(as you probably noticed). While extending x86 backend to support Vector
API, we pushed it over the limit and had to address some of the issues.
The ultimate goal was to move to vectors which represent full-width
hardware registers. After we were convinced that it would work well in AD
files, we encountered some inefficiencies with vector spills: depending
on the actual hardware, smaller (than available) vectors may be used (e.g.,
for integer computations on an AVX-capable CPU). So, we stopped halfway and
left the post-matching part intact: depending on the actual vector value
width, an appropriate operand (vecX/vecY/vecZ + legacy variants) is chosen.
(I believe you may be in a similar situation on AArch64 with NEON vs SVE
where both 128-bit and wide SVE vectors may be used at runtime.)
Now back to the patch.
What I see in the patch is that you try to attack the problem from the
opposite side: you introduce a new concept of a size-agnostic vector
register on the RA side and then directly use it during matching: vecA is
used in aarch64_sve.ad while aarch64.ad relies on vecD/vecX.
Unfortunately, it extends the implementation in an orthogonal direction
which looks too aarch64-specific to benefit other architectures, x86 in
particular. I believe there's an alternative approach which can benefit
both aarch64 and x86, but it requires more experimentation.
If I were to start from scratch, I would choose between 3 options:
#1: reuse existing VecX/VecY/VecZ ideal registers and limit supported
vector sizes to 128-/256-/512-bit values.
#2: lift limitation on max size (to 1024/2048 bits), but ignore
non-power-of-2 sizes;
#3: introduce support for full range of vector register sizes
(128-/.../2048-bit with 128-bit step);
I see 2 (mostly unrelated) limitations: maximum vector size and size
granularity.
My understanding is that you don't try to accurately represent SVE for
now, but lay some foundations for future work: you give up on
non-power-of-2 sized vectors, but still enable support for arbitrarily
sized vectors (addressing both limitations on maximum size and size
granularity) in RA (and it affects only spills). So, it is somewhere
between #2 and #3.
The ultimate goal is definitely #3, but how much more work will be
required to teach the JVM about non-power-of-2 vectors? As I see in the
patch, you don't have auto-vectorizer support yet, but Vector API will
provide access to whatever size hardware exposes. What do you expect on
hardware front in the near/mid-term future? Anything supporting vectors
larger than 512-bit? What about 384-bit vectors?
I don't have a good understanding where SVE/SVE2-capable hardware is
moving and would benefit a lot from your insights about what to expect.
If 256-/512-bit vectors end up as the only option, then #1 should fit well.
For larger vectors, #2 (or a mix of #1 and #2) may be a good fit. My
understanding is that the existing RA machinery should support 1024-bit
vectors well. So, unless 2048-bit vectors are needed, we could live with
the framework we have right now.
If hardware has non-power-of-2 vectors, but the JVM doesn't support them,
then the JVM can work with just the power-of-2 portion of them (384-bit =>
256-bit).
Giving up on #3 for now and starting with less ambitious goals (#1 or
#2) would reduce pressure on RA and give more time for additional
experiments to come up with a better and more universal
support/representation of generic/size-agnostic vectors. And, in the
longer term, it would help reduce complexity and technical debt in the area.
Some more comments follow inline.
>> Compared to x86 w/ AVX512, architectural state for vector registers is
>> 4x larger in the worst case (ignoring predicate registers for now).
>> Here are the relevant constants on x86:
>> // the number of reserved registers + machine registers.
>> #define REG_COUNT 545
>> // Size of register-mask in ints
>> #define RM_SIZE 22
>> My estimate is that for AArch64 with SVE support the constants will be:
>> REG_COUNT < 2500
>> RM_SIZE < 100
>> which don't look too bad.
> Right, but given that most real hardware implementations will be no
> larger than 512 bits, I think. Having a large bitmask array, with most
> bits useless, will be less efficient for regmask computation.
Does it make sense to limit the maximum supported size to 512-bit then
(at least, initially)? In that case, the overhead won't be worse than it
is on x86 now.
>> Also, I don't see any changes related to stack management. So, I
>> assume it continues to be managed in slots. Any problems there? As I
>> understand, wide SVE registers are caller-save, so there may be many
>> spills of huge vectors around a call. (Probably, not possible with C2
>> auto-vectorizer as it is now, but Vector API will expose it.)
> Yes, the stack is still managed in slots, but it will be allocated with
> real vector register length instead of 'virtual' slots for VecA. See the
> usages of scalable_reg_slots(), e.g. in chaitin.cpp:1587. We have also
> applied the patch to vector api, and did find a lot of vector spills
> with expected correct results.
I'm curious whether similar problems may arise for spills. Considering
that wide vector registers are caller-saved, it's possible for lots of
256-byte values to end up on the stack (especially with Vector API). Any
concerns with that?
>> Have you noticed any performance problems? If that's the case, then
>> AVX512 support on x86 would benefit from similar optimization as well.
> Do you mean register allocation performance problems? I did not notice
> that before. Do you have any suggestion on how to measure that?
I'd try to run some applications/benchmarks with -XX:+CITime to get a
sense of how much RA may be affected.