[aarch64-port-dev ] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays
dmitrij.pochepko at bell-sw.com
Fri Apr 6 11:15:42 UTC 2018
Thank you for review
On 06.04.2018 10:00, Dmitry Samersoff wrote:
> This version looks good to me.
> On 16/03/2018 19:21, Dmitrij Pochepko wrote:
>> Hi Andrew, all,
>> I modified patch according to your comments.
>> First, regarding size statistics: I looked into string size statistics
>> which was measured for JEP254: Compact Strings
>> Conclusions are that 75% of strings <= 32 symbols, so we can take this
>> number into account. This also means that changing string equals code to
>> be capable of prefetching(and thus adding branch for that) is not
>> practical. Arrays are another story. So, I split array equals and string
>> String equals remains basically unchanged(the only noticeable change is
>> that I removed size calculations for UTF strings (asr to convert
>> length-in-bytes into length-in-characters, and than converting it back
>> via lsr). It save 2 instructions for this code path: about 5%
>> improvement on small sizes.
>> Array equals was completely re-written.
>> For array equals I have 2 algorithms, controlled by vm option:
>> UseSimpleArrayEquals. First algorithm(simple one) is basically the same
>> as original one with slightly another branch logic. It seems like Cortex
>> A7* series prefer shorter code, which fits better for speculative
>> execution. All other CPUs I was able to check(Cortex A53 and Cavium h/w)
>> prefer algorithm with large loop and possible prefetching.
>> Regarding your comment about addresses at the end of memory page: you're
>> right. It is indeed possible in theory to use this code for some
>> substring, however, due to the nature of algorithm for array equals(it
>> reads array length from array header directly and then jump to first
>> array element), memory access will always be 8-byte aligned, then it's
>> safe to use 8-byte loads for array tails. So, I believe this issue is
>> now naturally resolved, since I left string equals logic unchanged.
>> Now, benchmarks:
>> I modified benchmark to have better measurements accuracy by increasing
>> amount of data processed in each iteration
>> (http://cr.openjdk.java.net/~dpochepk/8187472/ArrayAltEquals.java) and
>> has following improvement average numbers for array equals:
>> Cavium "ThunderX 2":
>> length 1..8: 10-20%
>> length 9..16: 1%
>> length 17..32: 5-10%
>> large(512+): almost 2 times
>> Cavium "Thunder X":
>> length 1..8: 1%
>> length 9..16: 2-3%
>> length 17..32: 5%
>> large arrays(512+): up to 5 times
>> Cortex A53 (R-Pi):
>> all ranges are about 5% average (note: large results dispersion (about
>> 10%) on R-PI)
>> Cortex A73:
>> basically no changes because implementation is almost the same (better
>> by about 0.5% average, but dispersion is about 4%, so it's not a
>> statistically significant result)
>> More detailed benchmarking results can be found here:
>> updated webrev: http://cr.openjdk.java.net/~dpochepk/8187472/webrev.08/
>> Testing: passed jtreg hotspot tests on AArch64 fastdebug build. no new
>> failures found in comparison with non-patched build.
>> I also additionally used "brute-force" test which checks all array
>> equality combinations for any given length ranges:
>> On 08.02.2018 13:11, Andrew Haley wrote:
>>> On 07/02/18 19:39, Dmitrij Pochepko wrote:
>>>> In general, this patch changes very short arrays handling(performing
>>>> 8-byte read instead of few smaller reads, using the fact of 8-byte
>>>> alignment) and jumping into stub with large 64-byte read loop for larger
>>>> Measurements(measured array length 7,64,128,256,512,1024,100000.
>>>> Improvement in %. 80% improvement means that new version is 80% faster,
>>>> i.e. 5 times.):
>>>> ThunderX: 2%, -4%, 0%, 2%, 32%, 55%, 80%
>>>> ThunderX2: 0%, -3%, 17%, 19%, 29%, 31%, 47%
>>>> Cortex A53 at 533MHz: 8%, -1%, -2%, 4%, 6%, 5%, 3%
>>>> Cortex A73 at 903MHz: 8%, -3%, 0%, 7%, 8%, 9%, 8%
>>>> Note: medium sizes are a bit slower because of additional branch
>>>> added(which checks size and jumps to stub).
>>> This indentation is messed up:
>>> @@ -5201,40 +5217,23 @@
>>> // length == 4.
>>> if (log_elem_size > 0)
>>> lsl(cnt1, cnt1, log_elem_size);
>>> - ldr(tmp1, Address(a1, cnt1));
>>> - ldr(tmp2, Address(a2, cnt1));
>>> + ldr(tmp1, Address(a1, cnt1));
>>> + ldr(tmp2, Address(a2, cnt1));
>>> I'm not convinced that this works correctly if passed the address of a
>>> of arrays at the end of a page. Maybe it isn't used on sub-arrays today
>>> in HotSpot, but one day it might be.
>>> It pessimizes a very common case of strings, those of about 32
>>> Please think again. Please also think about strings that are long enough
>>> for the SIMD loop but differ in their early substrings.
More information about the hotspot-compiler-dev