RFR: AARCH64: optimize string compare intrinsic
dmitrij.pochepko at bell-sw.com
Thu May 10 12:50:19 UTC 2018
On 09.05.2018 01:36, Paul Sandoz wrote:
>> On May 8, 2018, at 6:31 AM, Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> wrote:
>> On 04.05.2018 16:46, Andrew Haley wrote:
>>> On 05/04/2018 02:24 PM, Dmitrij Pochepko wrote:
>>>> Do you suggest to change vectorizedMismatch from generic single entry
>>>> point to 4 versions (1-, 2-, 4- and 8-byte), each optimized for its respective
>>>> size (and possibly re-using code generation logic)? Then it can be
>>>> re-used for same-encoded strings without penalties, indeed, but it
>>>> requires changes in jdk.internal.util.ArraysSupport.java
>>> I don't see why it's absolutely necessary. On the other hand, it might
>>> be an excellent idea to have a switch statement in the Java code which
>>> will almost always be optimized away. It's worth trying.
>> As this is a separate possibly multiplatform effort which potentially affects common code and x86 platform intrinsic implementation I created a separate enhancement for this issue: https://bugs.openjdk.java.net/browse/JDK-8202783
> Thank you (I cannot look at the issue right now as JBS is down for maintenance).
> I don’t understand why you need to convert vectorizedMismatch to 4 versions; the stub is generated using the most optimal vector instructions for the platform (on x86). A single version should suffice, with the surrounding Java code detecting thresholds and managing the tail elements. See the wrapping Java methods in jdk.internal.util.ArraysSupport and similar methods in java.nio.BufferMismatch. (Note that fixed thresholds are used, and I did not do any measurements on platforms with larger vector sizes to determine if the thresholds should be adjusted, but the vectorizedMismatch implementation will use smaller vector sizes if need be.)
I was looking at the existing vectorizedMismatch implementation (x86). It
handles the whole array length, including the tail (which basically
duplicates the tail handling in the Java code), so the last "for" block
is always skipped. I suspect it was done this way for better performance.
If the final solution is to handle the whole array in vectorizedMismatch,
I think there is one more way to improve it: for example, the 2-byte
version doesn't need the 1-byte loop inside the intrinsic code (and
doesn't need the respective check). Those were my thoughts and the reason
for considering 4 versions. If vectorizedMismatch is used strictly for
8-byte loads, then there is obviously no need for 4 implementations.
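
A rough plain-Java sketch of the arrangement discussed above may make the
trade-off easier to see. It is only an illustration under assumptions:
MismatchSketch and wideMismatch are hypothetical names, wideMismatch is a
pure-Java stand-in for the stub (not the actual intrinsic), and the
"length > 7" threshold is assumed for the example. Here the stand-in does
strictly 8-byte loads and reports how many elements it left unchecked, so
the Java "for" loop handles the tail; a stub that also consumed the tail
would report zero remaining elements and the loop would be skipped.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class MismatchSketch {

    // Hypothetical stand-in for the stub: compares 8 bytes per step only.
    // log2ElemSize is 0 for 1-byte, 1 for 2-byte, 2 for 4-byte, 3 for 8-byte elements.
    // Returns the element index of the first mismatch, or the bitwise
    // complement of the number of elements left unchecked in the tail.
    static int wideMismatch(byte[] a, byte[] b, int length, int log2ElemSize) {
        // Little-endian view: the lowest-addressed byte ends up in the low bits.
        ByteBuffer x = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer y = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int bytes = length << log2ElemSize;
        int i = 0;
        for (; i + 8 <= bytes; i += 8) {
            long diff = x.getLong(i) ^ y.getLong(i);
            if (diff != 0) {
                // The first differing byte is the lowest non-zero byte of diff.
                int byteIndex = i + (Long.numberOfTrailingZeros(diff) >>> 3);
                return byteIndex >> log2ElemSize;
            }
        }
        return ~((bytes - i) >> log2ElemSize);
    }

    // Java-side wrapper for 1-byte elements: threshold check, wide
    // comparison, then a scalar loop over whatever tail is left.
    static int mismatch(byte[] a, byte[] b, int length) {
        int i = 0;
        if (length > 7) { // assumed fixed threshold
            i = wideMismatch(a, b, length, 0);
            if (i >= 0) {
                return i;
            }
            i = length - ~i; // index of the first unchecked element
        }
        for (; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return -1; // no mismatch in the first 'length' elements
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        byte[] b = a.clone();
        b[9] = 0;
        System.out.println(mismatch(a, b, a.length));
    }
}
```

With this split, a single wide routine serves every element size and only
the final shift depends on log2ElemSize. The 1-byte loop and its check
matter only if the routine itself has to finish the tail, which is where
size-specific versions could save work.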
>> Let me know if you have any comments on the patch.
More information about the hotspot-compiler-dev