[aarch64-port-dev ] RFR(s): AArch64: 8149080: Recognize disjoint array copy in stub code
hui.shi at linaro.org
Sat Feb 6 11:52:19 UTC 2016
Thanks Andrew and Edward!
The code sequences for backward and forward array copy are almost the same,
except for prefetch. The performance test is based on
with "java StringConcatTest 5000". I tried disabling prefetch and comparing
performance between backward and forward array copy (all forward with my
patch; all backward forced by commenting out the branch to the nooverlap
target and forcing jshort_disjoint_copy to generate a conjoint copy). Forward
array copy is much faster than backward: backward takes about 85s and forward copy takes about
This test tries to reflect common cases like string builder/buffer append
and string concatenation. These are disjoint array copies, and forward array
copy is better than backward in the following two aspects:
1. Forward array copy can prefetch dest address needed in next string
Most string append/concatenation operations append chars after previously
appended char arrays.
For example, str = str1 + str2 + str3
1. when str1 is appended in forward order, the result value array (str.value)
is prefetched beyond str1's length by the hardware prefetcher
2. when str2.value is stored into str.value, str.value is already prefetched,
so there are fewer cache misses when copying str2.value into str.value
If copying in backward order, after str1.value is copied into str.value, it is
the addresses before str.value that get prefetched; this is not useful for next
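The append pattern above can be modelled with a small sketch. It assumes a simplified streaming prefetcher that keeps going in the direction of the copy just finished; the class and method names are hypothetical and the model says nothing about the real A57 prefetcher.

```java
// Hypothetical model of str = str1 + str2 + str3; offsets index str.value.
class AppendOrderSketch {
    // Start offset in str.value of part i (sum of the lengths before it).
    static int startOf(int[] lens, int i) {
        int start = 0;
        for (int k = 0; k < i; k++) start += lens[k];
        return start;
    }

    // Offset the assumed streaming prefetcher heads toward once copy i is done:
    // just past the part for a forward copy, just before it for a backward copy.
    static int streamHeading(int[] lens, int i, boolean forward) {
        int start = startOf(lens, i);
        return forward ? start + lens[i] : start - 1;
    }

    // First offset of str.value written by copy i.
    static int firstWrite(int[] lens, int i, boolean forward) {
        int start = startOf(lens, i);
        return forward ? start : start + lens[i] - 1;
    }

    public static void main(String[] args) {
        int[] lens = {10, 20, 30}; // lengths of str1, str2, str3 (made-up values)
        for (boolean forward : new boolean[] {true, false}) {
            System.out.println((forward ? "forward" : "backward")
                + ": stream heads to " + streamHeading(lens, 0, forward)
                + ", next copy starts writing at " + firstWrite(lens, 1, forward));
        }
    }
}
```

In this model the forward stream heads exactly to where the next part starts being written, while the backward stream heads away from it.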
Checking following PMU events on A57 (
array copy has more accurate hardware prefetcher results (more of the issued
requests are used). Comparing with/without the prefetch instruction in forward
copy, there is no performance difference; the hardware prefetcher might be good enough.
0x167 Level 2 prefetcher request used (or demanded)
0x168 Level 2 prefetcher request issued
In forward array copy, 94% of issued requests are used (r167/r168)
In backward array copy, 67% of issued requests are used
testing array copy not in append mode: each array copy operates on a separate
address. Run with "java DiscreteCopy 3000": forward copy takes 58s and
backward array copy takes 70s. The gap is smaller.
2. Backward array copy might cause much more unaligned memory access in
The current array copy implementation is:
1. peel the source array address to 16-byte alignment (backward copy
peels from the source end), copying 8, 4, 2, 1 bytes
2. perform copy_longs
3. copy the tail of less than 16 bytes, copying 8, 4, 2, 1 bytes
In string append/concatenation cases, the source string value array is
usually 8-byte or 16-byte aligned. Suppose the source address is 16-byte
aligned and the size is n*16+14.
With forward array copy: n ld/st pairs, then an aligned 8-byte store, then
an aligned 4-byte store, then an aligned 2-byte store.
With backward array copy, the source end address must be peeled first (see
copy_memory_small): an unaligned 8-byte store, an unaligned 4-byte store,
an aligned 2-byte store, then n ld/st pairs.
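The alignment argument above can be checked with a small model. This is only a sketch of the 8/4/2/1-byte peel and tail steps as described here, not the stub code: the class and method names are hypothetical, the source address is assumed to be 16-byte aligned, and an access counts as aligned when its address is a multiple of its size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the peel / tail steps; not the HotSpot stub code.
class PeelAlignmentSketch {
    // Chunk sizes tried in order for the part smaller than 16 bytes.
    static final int[] CHUNKS = {8, 4, 2, 1};

    // Forward copy from a 16-byte aligned source: the 16-byte blocks go
    // first, then the tail is copied at increasing addresses.
    static List<String> forwardTail(long src, int size) {
        List<String> out = new ArrayList<>();
        long addr = src + (size / 16) * 16L; // first byte after the ld/st pairs
        int rest = size % 16;
        for (int c : CHUNKS) {
            if ((rest & c) != 0) {
                out.add(c + (addr % c == 0 ? " aligned" : " unaligned"));
                addr += c;
            }
        }
        return out;
    }

    // Backward copy: peel from the source end down to a 16-byte boundary,
    // so each chunk sits just below the previous one.
    static List<String> backwardPeel(long src, int size) {
        List<String> out = new ArrayList<>();
        long end = src + size;
        int rest = (int) (end % 16);
        for (int c : CHUNKS) {
            if ((rest & c) != 0) {
                end -= c;
                out.add(c + (end % c == 0 ? " aligned" : " unaligned"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long src = 0x1000;      // assumed 16-byte aligned source address
        int size = 3 * 16 + 14; // n*16+14 with n = 3
        System.out.println("forward tail:  " + forwardTail(src, size));
        System.out.println("backward peel: " + backwardPeel(src, size));
    }
}
```

With a size that is a multiple of 16, both methods return an empty list, so in this model the two directions differ only in the accesses smaller than 16 bytes.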
Unaligned access profiling with perf on DiscreteCopy shows massive
unaligned accesses for backward array copy, while none are found for forward array
array copy with 16-byte aligned size, performance is identical for
backward and forward array copy; both take about 64s.
Performing forward array copy when possible will not make things worse and
benefits common cases like string append/concatenation. This is the original
logic when generating conjoint array copy; this patch completes this logic by
recognizing all disjoint array copies. Does this make sense?
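As an illustration of "forward when possible", here is a minimal sketch of choosing the copy direction from an overlap test. It is not the stub code: the class and method names are hypothetical, and array indices stand in for addresses.

```java
// Hypothetical sketch; array indices stand in for addresses.
class CopyDirectionSketch {
    // Returns true when a forward copy cannot overwrite source elements
    // before they are read. The unsigned compare covers both dst < src (the
    // difference wraps to a huge value) and dst >= src + count. The check is
    // conservative: dst == src is reported as overlapping, which is harmless.
    static boolean canCopyForward(long src, long dst, long count) {
        return Long.compareUnsigned(dst - src, count) >= 0;
    }

    // Copy within one array, going forward whenever that is safe.
    static void copy(char[] a, int srcPos, int dstPos, int length) {
        if (canCopyForward(srcPos, dstPos, length)) {
            for (int i = 0; i < length; i++) {
                a[dstPos + i] = a[srcPos + i];
            }
        } else {
            for (int i = length - 1; i >= 0; i--) {
                a[dstPos + i] = a[srcPos + i];
            }
        }
    }

    public static void main(String[] args) {
        char[] a = "abcdefgh".toCharArray();
        copy(a, 0, 2, 4); // dst lies inside the source range: backward copy
        System.out.println(new String(a));
        copy(a, 4, 0, 4); // dst below src: forward copy is safe
        System.out.println(new String(a));
    }
}
```

Copies between two different arrays never overlap, so in this model they would always take the forward path.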
On 5 February 2016 at 22:37, Andrew Haley <aph at redhat.com> wrote:
> On 02/05/2016 02:32 PM, Edward Nevill wrote:
> > On Fri, 2016-02-05 at 12:58 +0000, Andrew Haley wrote:
> >> On 02/05/2016 12:47 PM, Hui Shi wrote:
> >>> Arraycopy without overlapping is faster than overlapped copy.
> >> The only thing which varies is the direction of copying. I'm not
> >> aware of anything which makes one direction faster than the other.
> >> Measurements, please.
> > Copy backwards doesn't prefetch. The difference with and without
> > prefetch can be very significant on some micro-arches.
> > if (direction == copy_forwards && PrefetchCopyIntervalInBytes > 0)
> > __ prfm(Address(s, PrefetchCopyIntervalInBytes), PLDL1KEEP);
> > I have done some experiments with prefetch enabled for backwards copy
> > and it shows almost identical performance to forwards copy.
> OK, so let's do that, then.