RFR: AARCH64: optimize string compare intrinsic
paul.sandoz at oracle.com
Thu May 10 16:19:05 UTC 2018
> On May 10, 2018, at 5:50 AM, Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> wrote:
> On 09.05.2018 01:36, Paul Sandoz wrote:
>>> On May 8, 2018, at 6:31 AM, Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> wrote:
>>> On 04.05.2018 16:46, Andrew Haley wrote:
>>>> On 05/04/2018 02:24 PM, Dmitrij Pochepko wrote:
>>>>> Do you suggest changing vectorizedMismatch from a generic single entry
>>>>> point to 4 versions (1-, 2-, 4- and 8-byte), each optimized for its
>>>>> respective size (possibly re-using code generation logic)? Then it can
>>>>> indeed be re-used for same-encoded strings without penalties, but it
>>>>> requires changes in jdk.internal.util.ArraysSupport.java
>>>> I don't see why it's absolutely necessary. On the other hand, it might
>>>> be an excellent idea to have a switch statement in the Java code which
>>>> will almost always be optimized away. It's worth trying.
>>> As this is a separate, possibly multiplatform effort which potentially affects common code and the x86 platform intrinsic implementation, I created a separate enhancement for this issue: https://bugs.openjdk.java.net/browse/JDK-8202783
>> Thank you (I cannot look at the issue right now as JBS is down for maintenance).
>> I don’t understand why you need to convert vectorizedMismatch to 4 versions. The stub is generated using the best available vector instructions for the platform (on x86). A single version should suffice, with the surrounding Java code detecting thresholds and managing the tail elements. See the wrapping Java methods in jdk.internal.util.ArraysSupport and similar methods in java.nio.BufferMismatch. (Note that fixed thresholds are used, and I did not do any measurements on platforms with larger vector sizes to determine if the thresholds should be adjusted, but the vectorizedMismatch implementation will use smaller vector sizes if need be.)
> I was looking at the existing vectorizedMismatch implementation (x86). It handles the whole array length, including the tail (which basically duplicates the tail handling in the Java code), so the last "for" block is always skipped. I suspect it was done this way for better performance.
Yes, the specification is written such that a vectorizedMismatch implementation can bail out at any index and the Java code can deal with the rest, but the implementation does not have to; it's an optimization decision. Note that the Java code for the tail also deals with lengths under the threshold to call vectorizedMismatch.
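To illustrate that contract, here is a rough sketch. It is a pure-Java stand-in, not the actual code in jdk.internal.util.ArraysSupport or the intrinsic: the names MismatchSketch and wideMismatch, the threshold of 8, and the 8-bytes-at-a-time comparison are all assumptions made for the example. The wide routine is assumed to return the mismatch index if it finds one, and otherwise the bitwise complement of the number of elements it left unchecked, so the caller can finish the tail.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class MismatchSketch {
    // Hypothetical stand-in for the intrinsic: compares 8 bytes at a time and
    // stops before the tail. Returns the index of the first mismatch, or the
    // bitwise complement of the number of bytes it did not check.
    static int wideMismatch(byte[] a, byte[] b, int length) {
        ByteBuffer x = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer y = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int i = 0;
        for (; i + 8 <= length; i += 8) {
            long diff = x.getLong(i) ^ y.getLong(i);
            if (diff != 0) {
                // Little-endian load: the lowest set bit lies in the first differing byte.
                return i + (Long.numberOfTrailingZeros(diff) >>> 3);
            }
        }
        return ~(length - i);
    }

    // Wrapper in the style described above: threshold check, wide path, scalar tail.
    static int mismatch(byte[] a, byte[] b, int length) {
        int i = 0;
        if (length >= 8) {              // assumed threshold for this sketch
            i = wideMismatch(a, b, length);
            if (i >= 0) {
                return i;               // mismatch found by the wide path
            }
            i = length - ~i;            // resume where the wide path stopped
        }
        for (; i < length; i++) {       // tail, and all lengths under the threshold
            if (a[i] != b[i]) {
                return i;
            }
        }
        return -1;                      // no mismatch in the first length bytes
    }
}
```

If the wide routine handled the whole length itself it would return ~0, and the scalar loop would simply never run for lengths at or above the threshold.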
> If the final solution is to handle the whole array in vectorizedMismatch, I think we have one more way to improve it since, for example, the 2-byte version doesn't need the 1-byte loop inside the intrinsic code (and doesn't need the respective check).
As lengths get larger it might matter less. I think it's worth doing some measurements to help inform which direction to take.
> Those were my thoughts and my reason for considering 4 versions. If vectorizedMismatch is used strictly for 8-byte loads, then there is obviously no need to have 4 implementations.
>>> Let me know if you have any comments on the patch.
More information about the hotspot-compiler-dev