[10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Dmitrij Pochepko dmitrij.pochepko at bell-sw.com
Thu Jul 20 10:03:45 UTC 2017

Hi everyone,

Please review this small webrev [1], which implements an enhancement [2] 
that adds a has_negatives intrinsic to the AARCH64 OpenJDK port. The 
intrinsic performs better than C2-compiled code for every array size tried:

ThunderX T88: about 2% faster for array size = 1 and up to 8.5x faster 
for large arrays.

Cortex A53 (R-Pi): about the same numbers (really large sizes cannot be 
tested properly there because of the small amount of available memory).

The intrinsified hasNegatives method checks whether the provided byte 
array contains any byte with a negative value (highest bit set). In 
general, the intrinsic works as follows (with various minor optimizations):

1) Check whether the low bits of the array length (0x1, 0x2, 0x4, 0x8) 
are set and, for each set bit, issue the corresponding load instruction 
(ldrb, ldrh, ldrw, ldr), reducing the remaining length accordingly. The 
remaining length is therefore 16*N after this code. Proceed to 2).

2) If the remaining length is >= 64, load data in a loop with 4 ldp 
instructions (16 bytes each), issuing prfm (prefetch hint) once per 
iteration if SoftwarePrefetchHintDistance >= 0. This new flag 
(SoftwarePrefetchHintDistance) is introduced to provide configurable 
software prefetching in dynamically compiled code. It can disable the 
software prefetch hint or set the prefetch distance. The default distance 
is 3 * dcache_line, which shows the best performance on the ARMv8 CPUs 
we have. The 64-byte loop runs until length < 64, then proceeds to 3).

3) A simple 16-byte load loop runs until the remaining length is 0.
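
The three steps above can be sketched in plain Java to make the control 
flow concrete. This is only an illustration under stated assumptions, not 
the code in the webrev: the class and method names are hypothetical, the 
short leading loads are assumed to be taken from the front of the array, 
the loaded words are assumed to be OR-ed together before the sign bits are 
tested, and the prfm hint is omitted because it has no Java equivalent.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; it is not part of the webrev.
final class HasNegativesSketch {
    // Sign-bit mask for eight bytes packed into one long.
    private static final long SIGN_BITS_8 = 0x8080808080808080L;

    // Reference semantics: true if any byte has its highest bit set.
    static boolean hasNegativesReference(byte[] a) {
        for (byte b : a) {
            if (b < 0) return true;
        }
        return false;
    }

    // Staged variant mirroring steps 1) to 3) of the description.
    static boolean hasNegativesStaged(byte[] a) {
        ByteBuffer buf = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        int len = a.length;
        int pos = 0;

        // Step 1: one load per set low bit of the length (1, 2, 4, 8 bytes),
        // standing in for ldrb, ldrh, ldrw and ldr respectively.
        if ((len & 0x1) != 0) {
            if (buf.get(pos) < 0) return true;
            pos += 1;
        }
        if ((len & 0x2) != 0) {
            if ((buf.getShort(pos) & 0x8080) != 0) return true;
            pos += 2;
        }
        if ((len & 0x4) != 0) {
            if ((buf.getInt(pos) & 0x80808080) != 0) return true;
            pos += 4;
        }
        if ((len & 0x8) != 0) {
            if ((buf.getLong(pos) & SIGN_BITS_8) != 0) return true;
            pos += 8;
        }
        int remaining = len - pos; // now a multiple of 16

        // Step 2: 64 bytes per iteration (eight 8-byte loads here, standing
        // in for four 16-byte ldp instructions); prefetch hint omitted.
        while (remaining >= 64) {
            long acc = 0;
            for (int i = 0; i < 64; i += 8) {
                acc |= buf.getLong(pos + i);
            }
            if ((acc & SIGN_BITS_8) != 0) return true;
            pos += 64;
            remaining -= 64;
        }

        // Step 3: 16 bytes per iteration until nothing is left.
        while (remaining > 0) {
            long acc = buf.getLong(pos) | buf.getLong(pos + 8);
            if ((acc & SIGN_BITS_8) != 0) return true;
            pos += 16;
            remaining -= 16;
        }
        return false;
    }
}
```

Because step 1 consumes exactly the low four bits of the length, both 
loops only ever see whole 16-byte units and need no bounds handling of 
their own. Both methods should return the same result for any input.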

Note: the software prefetch hint was observed to improve performance not 
only on platforms without hardware prefetching (ThunderX T88) but also on 
the platforms we have on hand that do have hardware prefetching 
(Cortex A53).

Performance testing:

A JMH-based microbenchmark [3] was developed to test the performance of 
this enhancement. On Cortex A53 [4] and ThunderX T88 [5], the intrinsic 
is on par with C2-compiled Java code for very small strings, and the 
speedup grows with string length, starting from a length of 3 and 
reaching up to 8x for long strings.

Functional testing:

Tested by running the hotspot jtreg tests on Cortex A53 and ThunderX T88 
and comparing the results with those of a vanilla build. No regressions 
were observed. Specifically, the test 
hotspot/test/compiler/intrinsics/string/TestHasNegatives.java passed on 
both Cortex A53 and ThunderX T88.

[1] webrev: http://cr.openjdk.java.net/~dpochepk/8184943/webrev.01/
[2] CR: https://bugs.openjdk.java.net/browse/JDK-8184943
[3] JMH micro benchmark: 
[4] A53 graph: 
[5] T88 graph: 

I'll be happy to incorporate any suggestions for improving this intrinsic 
that come up during this review.

