PPC64 VSX load/store instructions in stubs

Doerr, Martin martin.doerr at sap.com
Thu May 12 09:33:03 UTC 2016

Hi Gustavo,

thanks for providing the webrevs. The change looks basically good.

I only have the following concerns:
- We support configuring the DSCR via the various DSCR switches. Your code resets the value to the hardware default instead of to the possibly modified value. We currently use only the default DSCR values, but we may want to experiment with them in the future.
We could use a static variable for the default dscr value. It could be modified in VM_Version::config_dscr() and used by your restore code (load_const_optimized(tmp1, ...) instead of li(tmp1, 0)).

- The PPC-elf64abi-1.9 says: "Functions must ensure that the appropriate bits in the vrsave register are set for any vector registers they use. ...". I think not touching vrsave is the right thing for AIX and ppc64le, but on ppc64 big endian I think we will either have to skip the optimization or handle vrsave. Do you agree?

Best regards,

-----Original Message-----
From: Gustavo Romero [mailto:gromero at linux.vnet.ibm.com] 
Sent: Mittwoch, 11. Mai 2016 23:07
To: Volker Simonis <volker.simonis at gmail.com>
Cc: Doerr, Martin <martin.doerr at sap.com>; Simonis, Volker <volker.simonis at sap.com>; ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; brenohl at br.ibm.com
Subject: Re: PPC64 VSX load/store instructions in stubs
Importance: High

Hi Volker, Hi Martin

Sincere apologies for the long delay.

My initial approach to testing the VSX load/store was to extract just the
mass-copy loop, "graft" it into an inline asm snippet, and run isolated tests
with the "perf" tool, focusing only on aligned source and destination (the
best case).

The extracted code, called "Original" in the plot below (black line), is here:

After some experiments, that extracted code evolved into a version that employs
VSX load/store, deepest Data Stream prefetch, d-cache touch, and a backbranch
aligned to a 32-byte boundary:

All runs were "pinned" using `numactl --cpunodebind --membind` to avoid any
scheduler decision that could add noise to the measurements.

VSX, deepest data prefetch, d-cache touch, and 32-byte alignment proved better
in the isolated code (red line) than the original extracted code (black line):

So I proceeded to implement the VSX loop in OpenJDK based on the best-case
result (VSX, deepest prefetch, d-cache touch, and backbranch target alignment -
goetz TODO note).

OpenJDK 8 webrev:

OpenJDK 9 webrev:

I've tested the change on OpenJDK 8 using this script that calls
System.arraycopy() on shorts:

The results for all data alignment cases:

Martin, I added the vsx test to the feature-string. Regarding the ABI, I'm just
using two VSRs: vsr0 and vsr1, both volatile.

Volker, since the loop unrolling was removed, the loop now copies 16 elements
at a time, like the non-VSX loop, not 32 elements. I have only verified the
change on little endian. Sorry, I didn't understand your question regarding
"instructions for aligned load/stores". Did you mean instructions for unaligned
load/stores? I think both fixed-point (ld/std) and VSX instructions load/store
more slowly in the unaligned scenario. VMX load/store, however, is different
and expects aligned operands. Thank you very much for opening the bug.

I don't have per-function profiling for each SPEC{jbb,jvm} benchmark, so I
can't determine which one would best stress the proposed change. Is there a
better benchmark I could use?

Thank you!

Best regards,

On 05-04-2016 14:23, Volker Simonis wrote:
> Hi Gustavo,
> thanks a lot for your contribution.
> Can you please describe if you've run benchmarks and which performance
> improvements you saw?
> With your change if we're running on Power 8, we will only use the
> fast path for arrays with at least 32 elements. For smaller arrays, we
> will fall-back to copying only 2 elements at a time which will be
> slower than the initial version which copied 4 at a time in that case.
> Did you verify your changes on both little and big endian?
> And what about unaligned memory accesses? As far as I read,
> lxvd2x/stxvd2x still work, but may be slower. I saw there also exist
> instructions for aligned load/stores. Would it make sense
> (performance-wise) to use them for the cases where we can be sure that
> we have aligned memory accesses?
> Thank you and best regards,
> Volker
> On Fri, Apr 1, 2016 at 10:36 PM, Gustavo Romero
> <gromero at linux.vnet.ibm.com> wrote:
>> Hi Martin, Hi Volker
>> Currently VSX load/store instructions are not being used in PPC64 stubs,
>> particularly in arraycopy stubs inside generate_arraycopy_stubs() like,
>> but not limited to, generate_disjoint_{byte,short,int,long}_copy.
>> We can speed up mass copy using VSX (Vector-Scalar Extension) load/store
>> instruction in processors >= POWER8, the same way it's already done for
>> libc memcpy().
>> This is an initial patch just for jshort_disjoint_arraycopy() VSX vector
>> load/store:
>> http://81.de.7a9f.ip4.static.sl-reverse.com/202539/webrev
>> What are your thoughts on that? Is there any impediment to use VSX
>> instructions in OpenJDK at the moment?
>> Thank you.
>> Best regards,
>> Gustavo
