[aarch64-port-dev ] aarch64: RFR: Block zeroing by 'DC ZVA'

Edward Nevill edward.nevill at gmail.com
Mon Apr 25 09:09:33 UTC 2016


On Wed, 2016-04-20 at 18:08 +0100, Edward Nevill wrote:
> On Tue, 2016-04-19 at 14:19 +0100, Andrew Haley wrote:
> > On 04/19/2016 01:54 PM, Long Chen wrote:
> > > Would this be fine?
> > 
> > It might well be.  I'd like Ed to do a few measurements of large and
> > small block zeroing.  My guess is that a reasonably small unrolled loop
> > doing  STP ZR, ZR  will work better than anything else, but we'll see.
> OK. So I started by doing some basic measurements of how long it takes to clear a cache line on 3 different partners HW using 3 different methods.

I have redone these benchmarks using a JMH test provided by Andrew Haley. Thanks Andrew!

The test is here


And the results are here


As a reminder, the different patches are


Uses stp instead of str (no use of dc zva)


Long Chen's V01 patch


Long Chen's V02 patch


<zero single word to align base to 128 bit aligned address>
if (!small) {
  <zero remainder of first cache line using unrolled stp>
  <zero cache lines using dc zva>
}
<zero tail using unrolled stp>
<zero final word>


Long Chen's v02 patch modified to avoid unaligned stp instructions.
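
To make the outline above concrete, here is a rough C model of the same control flow. It is only a sketch, not the code from any of the patches: memset stands in for str, stp and dc zva, the 64-byte line size is an assumption (real hardware reports the DC ZVA block size in DCZID_EL0), and the names block_zero and zero_stats are made up for illustration. The "small" test is assumed to be "fewer than 2 cache lines remaining".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes: 8-byte word (str), 16-byte pair (stp), 64-byte line (dc zva). */
enum { WORD = 8, PAIR = 16, LINE = 64 };

/* Hypothetical counters, so a caller can see which stores were used. */
typedef struct { size_t words, pairs, lines; } zero_stats;

/* Zero nbytes at base. base must be word aligned, nbytes a multiple of a word. */
static zero_stats block_zero(uint8_t *base, size_t nbytes) {
    zero_stats s = {0, 0, 0};
    uint8_t *p = base, *end = base + nbytes;
    assert((uintptr_t)base % WORD == 0 && nbytes % WORD == 0);

    /* Zero a single word to bring p up to a 16-byte boundary. */
    if ((uintptr_t)p % PAIR != 0 && p < end) {
        memset(p, 0, WORD); p += WORD; s.words++;
    }

    if ((size_t)(end - p) >= 2 * LINE) {            /* !small */
        /* Zero the remainder of the first cache line with aligned pairs. */
        while ((uintptr_t)p % LINE != 0) {
            memset(p, 0, PAIR); p += PAIR; s.pairs++;
        }
        /* Zero whole cache lines; this is where dc zva would be used. */
        while ((size_t)(end - p) >= LINE) {
            memset(p, 0, LINE); p += LINE; s.lines++;
        }
    }

    /* Zero the tail with aligned pairs. */
    while ((size_t)(end - p) >= PAIR) {
        memset(p, 0, PAIR); p += PAIR; s.pairs++;
    }

    /* Zero the final word, if one is left over. */
    if (p < end) {
        memset(p, 0, WORD); p += WORD; s.words++;
    }
    return s;
}
```

With these assumed sizes, zeroing 256 bytes starting 8 bytes past a 64-byte boundary takes the large path: 2 single-word stores, 3 pair stores and 3 whole-line clears. A 32-byte block on a 16-byte boundary takes the small path with 2 pair stores only.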

From this it seems that patches bzero3 and bzero4 produce better performance for everything except very small blocks (<= 16 bytes).

bzero3 is significantly larger than bzero4 and would probably need outlining.

Also, the cutoff point for switching from stp/str to dc zva is currently set at 2 x cache lines (to guarantee there is at least 1 use of dc zva). A larger value may be better.
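
As a quick check of the "2 x cache lines" reasoning, the sketch below counts how many whole, aligned cache lines fit inside a block for the worst-case starting offset. The function names are hypothetical, and the 8-byte step for starting offsets is an assumption (word-aligned bases).

```c
#include <assert.h>
#include <stddef.h>

/* Whole, aligned cache lines inside [offset, offset + nbytes). */
static size_t full_lines(size_t offset, size_t nbytes, size_t line) {
    assert(line != 0);
    size_t first = (offset + line - 1) / line * line;  /* round start up to a line */
    size_t end = offset + nbytes;
    return end > first ? (end - first) / line : 0;
}

/* Worst case over every word-aligned starting offset within a line. */
static size_t min_full_lines(size_t nbytes, size_t line) {
    size_t worst = (size_t)-1;
    for (size_t off = 0; off < line; off += 8) {
        size_t n = full_lines(off, nbytes, line);
        if (n < worst) worst = n;
    }
    return worst;
}
```

For an assumed 64-byte line, min_full_lines(128, 64) is 1, so a block of 2 cache lines always contains at least one line that dc zva can clear, while min_full_lines(64, 64) is 0 because a misaligned single-line block straddles two lines without covering either.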

What I propose next is to look only at bzero3 and bzero4: to modify bzero3 so that the dc zva loop is moved out of line, and to look at the cutoff point from stp/str to dc zva to determine the optimum value.


