RFR (XS) CR 8058643: (str) Re-examine hashCode implementation
vitalyd at gmail.com
Thu Sep 25 16:00:07 UTC 2014
Note that compiler does not use multiplication here. This is strength
reduced into a bit shift followed by a subtraction (i.e. h * 31 = h * 32
(i.e. sal 5) - h). The point about allowing execution units to run in
parallel (given the break in data dependency) is very valid though.
For kicks and giggles, I looked at what gcc, clang, and icc generate for
similar C code at O2 and O3, and did not observe them making any
transformations to break the data dependency.
While it would be cool if compiler could do this transformation on its own,
as you say Peter, the scope of where this transform is applicable is
somewhat narrow(?). If speeding up the non-cached String.hashCode() or
Arrays.hashCode() methods is desired, perhaps intrinsics could be
implemented (utility for String.hashCode seems low though). In that case,
one could then do this transformation manually and also vectorize it for
On Thu, Sep 25, 2014 at 7:09 AM, Peter Levart <peter.levart at gmail.com>
> On 09/25/2014 09:40 AM, Aleksey Shipilev wrote:
>> Hi Peter,
>> On 09/25/2014 02:46 AM, Peter Levart wrote:
>> I have to say it once again:
>> a) It is *an error* to use static finals as the benchmark input. They
>> are perfectly constant-foldable in way too many cases. Break this habit,
> Hi Aleksey,
> The "constant" in this example is only a reference to the char array.
> It's content is not. In String, this is a final instance field, which
> behaves similarly inside an instance method (usually it is read only once
> per method invocation).
> b) Explicit Blackholes are not needed, and just returning "int" from
>> @Benchmark method helps readability a lot. Please break this habit as
>> well. Having readable and maintainable benchmarks is a key.
> Ok, here's a modified benchmark:
> Which behaves similarly.
> Here are it's results:
> * Results on JDK9, Linux, i7-2600K CPU, JMH args: -f 1 -wi 5 -i 8 -gc true
> * Benchmark Mode Samples Score Score error
> * j.t.HashBench2._hashCode thrpt 8 8308858.217 353019.084
> * j.t.HashBench2.hashCode0 thrpt 8 8207337.729 217048.634
> * j.t.HashBench2.hashCode1 thrpt 8 13359572.359 345736.675
> * j.t.HashBench2.hashCode2 thrpt 8 15310621.202 238369.017
> * j.t.HashBench2.hashCode3 thrpt 8 17637944.829 232155.847
> * j.t.HashBench2.hashCode3i thrpt 8 17724181.444 509913.288
> * j.t.HashBench2.hashCode3x thrpt 8 8344128.432 159508.813
> * j.t.HashBench2.hashCode4 thrpt 8 16526850.489 969549.448
> * j.t.HashBench2.hashCode5 thrpt 8 17567765.554 917934.885
> * j.t.HashBench2.hashCode6 thrpt 8 17705074.332 419405.652
> * j.t.HashBench2.hashCode7 thrpt 8 18805633.563 209181.299
> * j.t.HashBench2.hashCode8 thrpt 8 18300123.201 376681.550
> It would be interesting to see how it behaves on different CPUs.
>> This is really great!
>>>> Couldn't this be a tweak via HotSpot, instead uglifying and bloating
>>>> the Java and hence the byte code?
>> This is for HotSpot compiler guys to answer. Theoretically I think it is
>>> possible. But it would have to be tailored to the very specific use case
>>> and I don't know if such specific transformation would have wide-enough
>>> applicability. If it would only speed-up String.hashCode and very
>>> similar loops, it is less trouble to do that by hand in one or few
>> I would think this happens in user-specified hashCode() over arrays.
>> IDEs would routinely inject the loop like that or delegate to
>> Arrays.hashCode, that does the same loop.
> Arrays.hasCode() can be "improved" this way too.
>> In other words, I would like to see this fixed on compiler side. This
>> seems to be the strength-reduction playing tricks with loop unrolling,
>> I'll submit a bug shortly.
> As I said, I don't think it has anything to do with loop unrolling. The
> transformation I applied in hashCode1,2,3, ... 8 produces code that
> executes 2, 3, 4, ... 9 independent multiplications in each chunk, which
> allow CPU's pipeline to execute them in parallel. I had to manually unroll
> the loop a bit just to achieve this transformation. But my manual unrolling
> does not bring the speed-up per se. The parallelization of multiplication
> does. This can be seen by observing the score of hashCode3x benchmark,
> which has the same loop structure as hashCode3, but performs
> multiplications in a way where each of them depends on the result of a
> previous one, which prevents the CPU from parallelizing them.
> This is not to say that such transformation couldn't be done on the JIT
> side. I just have a feeling that such transformation won't be widely used
> because it is very specific. It can only be used within integer arithmetic
> of the homogeneous width (since it changes the order of operations applied,
> the final result depends on which width is used when overflow happens).
> Floating arithmetic is equally unsiutable for such transformations that
> change order of operations. It can only help when the sequence of
> operations that are dependent on one another are changed into a sequence of
> independent operations and those operations have a weight that matters
> (such as multiplication).
> Regards, Peter
More information about the core-libs-dev