[aarch64-port-dev ] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays
dmitrij.pochepko at bell-sw.com
Fri Mar 16 16:21:38 UTC 2018
Hi Andrew, all,
I modified the patch according to your comments.
First, regarding size statistics: I looked into the string size statistics
that were measured for JEP 254: Compact Strings.
The conclusion is that 75% of strings are <= 32 characters, so we can take
this number into account. It also means that changing the string equals
code to be capable of prefetching (and thus adding a branch for it) is not
practical. Arrays are another story. So, I split the array equals and
string equals code paths.
String equals remains basically unchanged; the only noticeable change is
that I removed the size calculations for UTF strings (an asr to convert
length-in-bytes into length-in-characters, then converting it back via
lsr). This saves 2 instructions on that code path: about a 5%
improvement for small sizes.
Array equals was completely rewritten.
For array equals I have 2 algorithms, controlled by the VM option
UseSimpleArrayEquals. The first algorithm (the simple one) is basically
the same as the original, with slightly different branch logic. It seems
that the Cortex-A7* series prefers shorter code, which fits speculative
execution better. All other CPUs I was able to check (Cortex-A53 and
Cavium hardware) prefer the algorithm with a large loop and possible
prefetching.
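To illustrate the difference between the two strategies, here is a
hypothetical C sketch (the real code is AArch64 macro-assembler; the
prefetch distance, chunk size, and function name below are my own
illustrative assumptions, not the patch's actual values):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the "large loop" strategy: compare 64 bytes per
 * iteration and issue a software prefetch some distance ahead, so memory
 * for later iterations is already in flight. The 256-byte prefetch
 * distance is an illustrative guess, not the value used by the patch. */
static int equals_large_loop(const uint8_t *a, const uint8_t *b, size_t len) {
    size_t i = 0;
    for (; i + 64 <= len; i += 64) {
        __builtin_prefetch(a + i + 256);  /* GCC/Clang prefetch hint */
        __builtin_prefetch(b + i + 256);
        if (memcmp(a + i, b + i, 64) != 0)
            return 0;
    }
    /* tail: whatever is left after the 64-byte loop */
    return memcmp(a + i, b + i, len - i) == 0;
}
```

The "simple" variant would be the same idea without the prefetch and with
a smaller step: less peak bandwidth, but shorter code that speculates
better on the Cortex-A7* cores mentioned above.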
Regarding your comment about addresses at the end of a memory page:
you're right. It is indeed possible in theory to use this code for some
sub-array; however, due to the nature of the array equals algorithm (it
reads the array length from the array header directly and then jumps to
the first array element), memory accesses will always be 8-byte aligned,
so it's safe to use 8-byte loads for array tails. So, I believe this
issue is now naturally resolved, since I left the string equals logic
unchanged.
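As a rough C model of why the aligned 8-byte tail load is safe (this is
an illustration, not the patch's code): an 8-byte load at an 8-byte-aligned
address can never cross a 4 KiB page boundary, so reading up to 7 bytes
past the logical end of the array stays within mapped memory. The mask
below assumes a little-endian byte layout, as on AArch64:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the tail comparison: always load a full 8-byte
 * word (safe when the buffer is 8-byte aligned, since the load cannot
 * straddle a page) and mask off the bytes past `len`. */
static int equals_tail(const uint8_t *a, const uint8_t *b, size_t len) {
    uint64_t wa, wb;
    memcpy(&wa, a, 8);               /* stands in for a single 8-byte ldr */
    memcpy(&wb, b, 8);
    if (len >= 8)
        return wa == wb;
    /* little-endian: keep only the low `len` bytes */
    uint64_t mask = (UINT64_C(1) << (8 * len)) - 1;
    return (wa & mask) == (wb & mask);
}
```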
I modified the benchmark to improve measurement accuracy by increasing
the amount of data processed in each iteration.
The patch has the following average improvement numbers for array equals:
Cavium "ThunderX 2":
length 1..8: 10-20%
length 9..16: 1%
length 17..32: 5-10%
large (512+): almost 2 times faster
Cavium "Thunder X":
length 1..8: 1%
length 9..16: 2-3%
length 17..32: 5%
large arrays (512+): up to 5 times faster
Cortex A53 (R-Pi):
all ranges are about 5% better on average (note: large dispersion of
results (about 10%) on the R-Pi)
basically no changes, because the implementation is almost the same
(better by about 0.5% on average, but dispersion is about 4%, so it's
not a statistically significant result)
More detailed benchmarking results can be found here:
updated webrev: http://cr.openjdk.java.net/~dpochepk/8187472/webrev.08/
Testing: passed jtreg hotspot tests on an AArch64 fastdebug build; no new
failures found in comparison with the non-patched build.
I also additionally used a "brute-force" test which checks all array
equality combinations for any given length range:
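The actual test is not shown here; as a sketch of the idea (the function
names and the memcmp stand-in are my own, not the test's code, and the
real test would exercise the intrinsified comparison), such a harness
might look like:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder for the comparison under test; in the real test this
 * would call through to the intrinsified array-equals path. */
static int arrays_equal(const uint8_t *a, const uint8_t *b, size_t len) {
    return memcmp(a, b, len) == 0;
}

/* For every length in the range, check that identical arrays compare
 * equal and that flipping each single byte position breaks equality. */
static void brute_force_check(size_t max_len) {
    uint8_t a[64], b[64];
    if (max_len > sizeof a)
        max_len = sizeof a;
    for (size_t len = 0; len <= max_len; len++) {
        memset(a, 0xAB, sizeof a);
        memcpy(b, a, sizeof b);
        assert(arrays_equal(a, b, len));
        for (size_t i = 0; i < len; i++) {
            b[i] ^= 0xFF;            /* differ at exactly one index */
            assert(!arrays_equal(a, b, len));
            b[i] ^= 0xFF;            /* restore */
        }
    }
}
```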
On 08.02.2018 13:11, Andrew Haley wrote:
> On 07/02/18 19:39, Dmitrij Pochepko wrote:
>> In general, this patch changes the handling of very short arrays
>> (performing an 8-byte read instead of a few smaller reads, using the
>> fact of 8-byte alignment) and jumps into a stub with a large 64-byte
>> read loop for larger arrays.
>> Measurements (measured array lengths 7, 64, 128, 256, 512, 1024,
>> 100000; improvement in %; an 80% improvement means the new version is
>> 80% faster, i.e. 5 times):
>> ThunderX: 2%, -4%, 0%, 2%, 32%, 55%, 80%
>> ThunderX2: 0%, -3%, 17%, 19%, 29%, 31%, 47%
>> Cortex A53 at 533MHz: 8%, -1%, -2%, 4%, 6%, 5%, 3%
>> Cortex A73 at 903MHz: 8%, -3%, 0%, 7%, 8%, 9%, 8%
>> Note: medium sizes are a bit slower because of the additional branch
>> added (which checks the size and jumps to the stub).
> This indentation is messed up:
> @@ -5201,40 +5217,23 @@
> // length == 4.
> if (log_elem_size > 0)
> lsl(cnt1, cnt1, log_elem_size);
> - ldr(tmp1, Address(a1, cnt1));
> - ldr(tmp2, Address(a2, cnt1));
> + ldr(tmp1, Address(a1, cnt1));
> + ldr(tmp2, Address(a2, cnt1));
> I'm not convinced that this works correctly if passed the address of a pair
> of arrays at the end of a page. Maybe it isn't used on sub-arrays today
> in HotSpot, but one day it might be.
> It pessimizes a very common case of strings, those of about 32 characters.
> Please think again. Please also think about strings that are long enough
> for the SIMD loop but differ in their early substrings.