RFR: AARCH64: optimize string compare intrinsic
dmitrij.pochepko at bell-sw.com
Fri May 4 11:28:37 UTC 2018
On 03.05.2018 22:41, Paul Sandoz wrote:
> Hi Dmitrij,
>> On May 3, 2018, at 11:58 AM, Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> wrote:
>> Hi Paul,
>> Actually, vectorizedMismatch has more in common with array equals, which is a more generic version of the same algorithm.
> Since vectorizedMismatch is also used for lexicographical array comparison it still might be applicable for string comparison *if* the character encodings of the two strings are the same.
> Opportunistically, my hope was that the string comparison intrinsics code could be reduced to focus on strings of different encodings, thereby potentially simplifying HotSpot code. That could apply across all platforms that support the vectorizedMismatch intrinsic.
Yes, we can consider reusing vectorizedMismatch code. However, there are
a few issues which make me suggest we postpone considering reusing
vectorisedMismatch until we get the performance numbers.
1. there is no vectorizedMismatch intrinsic implemented for AARCH64 as I
mentioned before. This patch optimizes existing string compareTo.
vectorizedMismatch is a separate piece of work which is underway by
Boris, and we will check the performance impact of using
vectorizedMismatch in String::compareTo when this is implemented. It may
not get into JDK 11.
2. vectorizedMismatch intrinsic contract moves type information from
compile time to runtime (to have single entry point), so, no compile
time type information is available to intrinsic. This information is
moved to parameter: log2ArrayIndexScale. As a result, hotspot can use
single but not optimal intrinsic (as always, generic vs fast).
vectorizedMismatch implementation will have to add a few branches to
check this log2ArrayIndexScale in runtime (checking 1, 2, 4 or 8 -byte
type is used, so, +1 branch on shortest code path and +3 branches on
longest). This is where compareTo and vectorizedMismatch are different.
For example, String::compareTo for UTF-16/UTF-16 encoding won't use
single byte read and no such branch is needed, while single
vectorizedMismatch version will have 2 options:
a) use 1-3 additional branches to check log2ArrayIndexScale parameter at
start and then jump to a specific entry point
b) use 1-3 additional branches to check odd/even byte length at start or
Both options will add additional branches which will noticeably affect
performance for short strings. Depending on the specifics of branch
predictor and code shape, it may affect the performance by tens of
percents. Since about 75% of strings are 32 symbols or less, it's quite
important to care about small strings. That's probably one of the
reasons why x86 implementation has separate intrinsics for
vectorizedMismatch and String::compareTo.
I remember a discussion, when improving large strings handling for array
equals by 4-6 times for large strings was rejected because small strings
performance was slowed down by 2% in the same patch. The root cause was
1 added branch, and we're now talking about 3 branches worst case. This
is why I'd suggest having the performance numbers for vectorisedMismatch
on the table (we are working on it) and have the discussion around
reusing it in String::comapareTo after we have them.
>> Unfortunately, vectorizedMismatch intrinsic is not yet implemented for AARCH64 (we're working on it as well and will try to reuse the code assuming there is no significant performance impact).
>> CC'ing Boris, who is working on vectorizedMismatch.
>> On 01.05.2018 01:21, Paul Sandoz wrote:
>>> Hi Dmitrij,
>>> Here is a somewhat lateral thought, it might have some legs...
>>> For the case when the encoding of the compared strings are the same have you considered changing the string compare implementations to use the array mismatch functionality (see jdk.internal.util.ArraysSupport.vectorizedMismatch) and then optimize that for AARCH64, if not already done so. It may simplify things in some respects but it would also broaden the performance impact to arrays and buffers.
>>>> On Apr 28, 2018, at 11:29 AM, Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> wrote:
>>>> Hi all,
>>>> please review patch for 8202326: AARCH64: optimize string compare intrinsic
>>>> This patch introduces string compareTo stub, which uses large loops with prefetch instructions. Stub is called for long strings and improves String::compareTo up to 4 times on systems without hardware prefetching (ThunderX) and up to 2 times on systems with hardware prefetching (ThunderX2). Also inlined code is re-arranged with more optimal pipelining, which helps in-order systems, so small strings are also slightly improved.
>>>> There are no noticeable regressions according to benchmark results.
>>>> I created benchmark to measure improvement: http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java
>>>> Execution matrix is large and can be seen here: http://cr.openjdk.java.net/~dpochepk/8202326/str_compare.xls
>>>> Raw results are *.txt files here: http://cr.openjdk.java.net/~dpochepk/8202326/
>>>> webrev: http://cr.openjdk.java.net/~dpochepk/8202326/webrev.01/
>>>> CR: https://bugs.openjdk.java.net/browse/JDK-8202326
>>>> testing: I run jtreg hotspot tests: compiler/* gc/* runtime/* using fastdebug build and found no new failures. I also run long "bruteforce" test which checks all combinations of different character index for all strings up to size 512: http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java
>>>> Additional note: this patch depends on zip2 instruction encoding fix: JDK-8202395
More information about the hotspot-compiler-dev