RFR (XS) CR 8058643: (str) Re-examine hashCode implementation
peter.levart at gmail.com
Thu Sep 25 16:58:21 UTC 2014
On 09/25/2014 06:00 PM, Vitaly Davidovich wrote:
> Note that compiler does not use multiplication here. This is strength
> reduced into a bit shift followed by a subtraction (i.e. h * 31 = h *
> 32 (i.e. sal 5) - h).
I've noticed that in the disassembly files provided by Aleksey. Good to
know, so one need not resort to bit shifts in situations where
multiplication / division read more clearly in source code.
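As a quick sanity check (a minimal sketch, not from the thread), the strength-reduced form really is equivalent to the multiplication for every int, overflow included:

```java
// Minimal sketch (not from the thread): h * 31 and its strength-reduced
// form (h << 5) - h agree for every int, including overflow cases,
// because int arithmetic wraps modulo 2^32: (h << 5) == 32*h mod 2^32,
// so (h << 5) - h == 31*h mod 2^32.
public class Mul31 {
    static int mul(int h)      { return h * 31; }
    static int shiftSub(int h) { return (h << 5) - h; }

    public static void main(String[] args) {
        int[] probes = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : probes) {
            if (mul(h) != shiftSub(h)) {
                throw new AssertionError("mismatch at " + h);
            }
        }
        System.out.println("ok");
    }
}
```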
> The point about allowing execution units to run in parallel (given
> the break in data dependency) is very valid though.
> For kicks and giggles, I looked at what gcc, clang, and icc generate
> for similar C code at O2 and O3, and did not observe them making any
> transformations to break the data dependency.
So HotSpot can break the ice here ;-)
> While it would be cool if compiler could do this transformation on its
> own, as you say Peter, the scope of where this transform is applicable
> is somewhat narrow(?). If speeding up the non-cached
> String.hashCode() or Arrays.hashCode() methods is desired, perhaps
> intrinsics could be implemented (utility for String.hashCode seems low
> though). In that case, one could then do this transformation manually
> and also vectorize it for long inputs.
I see there are multiple options - each with its own benefits and
drawbacks. Hard to choose.
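For context, the loop under discussion (String.hashCode() and Arrays.hashCode(char[]) both compute the classic 31-based polynomial hash) is a chain of dependent multiply-adds:

```java
// The loop under discussion: String.hashCode() and Arrays.hashCode
// compute the classic polynomial hash h = 31*h + c. Each iteration
// reads the previous h, so the multiplications form a serial
// dependency chain the CPU cannot overlap.
public class ClassicHash {
    static int hash(char[] value) {
        int h = 0;
        for (char c : value) {
            h = 31 * h + c;
        }
        return h;
    }

    public static void main(String[] args) {
        // Matches String.hashCode by construction.
        System.out.println(hash("hello".toCharArray()) == "hello".hashCode());
    }
}
```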
> On Thu, Sep 25, 2014 at 7:09 AM, Peter Levart <peter.levart at gmail.com> wrote:
> On 09/25/2014 09:40 AM, Aleksey Shipilev wrote:
> Hi Peter,
> On 09/25/2014 02:46 AM, Peter Levart wrote:
> I have to say it once again:
> a) It is *an error* to use static finals as the benchmark
> input. They
> are perfectly constant-foldable in way too many cases. Break
> this habit,
> Hi Aleksey,
> The "constant" in this example is only a reference to the char
> array. Its content is not. In String, this is a final instance
> field, which behaves similarly inside an instance method (usually
> it is read only once per method invocation).
> b) Explicit Blackholes are not needed, and just returning
> "int" from
> @Benchmark method helps readability a lot. Please break this
> habit as
> well. Having readable and maintainable benchmarks is a key.
> Ok, here's a modified benchmark:
> http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench2.java
> Which behaves similarly. Here are its results:
> * Results on JDK9, Linux, i7-2600K CPU, JMH args: -f 1 -wi 5 -i 8 -gc true
> *
> * Benchmark                   Mode  Samples         Score  Score error  Units
> * j.t.HashBench2._hashCode   thrpt        8   8308858.217   353019.084  ops/s
> * j.t.HashBench2.hashCode0   thrpt        8   8207337.729   217048.634  ops/s
> * j.t.HashBench2.hashCode1   thrpt        8  13359572.359   345736.675  ops/s
> * j.t.HashBench2.hashCode2   thrpt        8  15310621.202   238369.017  ops/s
> * j.t.HashBench2.hashCode3   thrpt        8  17637944.829   232155.847  ops/s
> * j.t.HashBench2.hashCode3i  thrpt        8  17724181.444   509913.288  ops/s
> * j.t.HashBench2.hashCode3x  thrpt        8   8344128.432   159508.813  ops/s
> * j.t.HashBench2.hashCode4   thrpt        8  16526850.489   969549.448  ops/s
> * j.t.HashBench2.hashCode5   thrpt        8  17567765.554   917934.885  ops/s
> * j.t.HashBench2.hashCode6   thrpt        8  17705074.332   419405.652  ops/s
> * j.t.HashBench2.hashCode7   thrpt        8  18805633.563   209181.299  ops/s
> * j.t.HashBench2.hashCode8   thrpt        8  18300123.201   376681.550  ops/s
> It would be interesting to see how it behaves on different CPUs.
> This is really great!
> Couldn't this be a tweak in HotSpot, instead of uglifying and
> bloating the Java and hence the byte code?
> This is for the HotSpot compiler guys to answer. Theoretically
> I think it is possible. But it would have to be tailored to the
> very specific use case and I don't know if such a specific
> transformation would have wide enough applicability. If it would
> only speed up String.hashCode and very similar loops, it is less
> trouble to do that by hand in one or a few places.
> I would think this happens in user-specified hashCode() over
> IDEs would routinely inject the loop like that or delegate to
> Arrays.hashCode, that does the same loop.
> Arrays.hashCode() can be "improved" this way too.
> In other words, I would like to see this fixed on the compiler
> side. This seems to be the strength-reduction playing tricks with
> loop unrolling. I'll submit a bug shortly.
> As I said, I don't think it has anything to do with loop
> unrolling. The transformation I applied in hashCode1,2,3, ... 8
> produces code that executes 2, 3, 4, ... 9 independent
> multiplications in each chunk, which allows the CPU's pipeline to
> execute them in parallel. I had to manually unroll the loop a bit
> just to achieve this transformation. But my manual unrolling does
> not bring the speed-up per se. The parallelization of
> multiplication does. This can be seen by observing the score of
> hashCode3x benchmark, which has the same loop structure as
> hashCode3, but performs multiplications in a way where each of
> them depends on the result of a previous one, which prevents the
> CPU from parallelizing them.
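The transformation described above, in its simplest 2-way form, can be sketched like this (a hedged sketch; the method name is mine and the benchmark's actual hashCode1..8 variants unroll further, but the idea is the same):

```java
// Hedged sketch of a 2-way version of the transformation described
// above (illustrative, not the benchmark's actual code). Per chunk,
// h * (31*31) and value[i] * 31 have no data dependency on each other,
// so the CPU can issue the multiplications in parallel. The result
// equals the classic loop because int arithmetic is exact modulo 2^32:
// 31*(31*h + c0) + c1 == (31*31)*h + 31*c0 + c1.
public class UnrolledHash {
    static int hash2(char[] value) {
        int h = 0;
        int i = 0;
        for (int end = value.length & ~1; i < end; i += 2) {
            h = h * (31 * 31) + value[i] * 31 + value[i + 1];
        }
        if (i < value.length) {
            h = h * 31 + value[i];   // odd-length tail
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "the quick brown fox";
        System.out.println(hash2(s.toCharArray()) == s.hashCode());
    }
}
```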
> This is not to say that such transformation couldn't be done on
> the JIT side. I just have a feeling that such transformation won't
> be widely used because it is very specific. It can only be used
> within integer arithmetic of the homogeneous width (since it
> changes the order of operations applied, the final result depends
> on which width is used when overflow happens). Floating-point
> arithmetic is equally unsuitable for such order-changing
> transformations. It can only help when the sequence of operations
> that are dependent on one another are changed into a sequence of
> independent operations and those operations have a weight that
> matters (such as multiplication).
> Regards, Peter
More information about the core-libs-dev mailing list