perspectives on streams performance
biboudis at gmail.com
Tue Mar 10 18:22:05 UTC 2015
As a starting point for this investigation, the high-level performance
assessment we recently did (http://biboudis.github.io/clashofthelambdas/)
may be useful, with a magnifying glass put over the Java cases. We have 4
simple JMH micro-benchmarks (and their corresponding for-based baselines)
that could be enhanced. Afterwards, the equivalence between JIT-compiled
streams and hand-optimized for-loops could be evaluated. The cases that are
missing are mostly
type-polluted pipelines that will help us to understand the costs of
On Fri, Mar 6, 2015 at 3:01 AM, John Rose <john.r.rose at oracle.com> wrote:
> In order to get the full benefit from JDK 8 streams we will need to make
> them optimize fully. Here are a few thoughts about that.
> I think of streams as a more concise and orderly replacement of classic
> "for" loops. Every stream can be rewritten as one or more for-loops, at
> the cost of verbosity and commitment to hard-coding optimizations (like FJ).
> A classic "for" loop is an external iterator notation: The iteration
> machinery is outside of (lexically around) the data structure access.
> This notation is at least as old as Fortran. Streams are an internal
> iteration notation: The iteration machinery (crucially, the looping part
> of the algorithm) is inside the data structure, and only the loop body
> appears in the user code (as a lambda). This notation is also old, found
> in Lisp and Smalltalk.
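
The contrast between the two notations can be sketched in Java roughly as follows. This is only an illustration: the class and method names (IterationStyles, sumOfSquaresLoop, sumOfSquaresStream) are hypothetical and do not come from the original message or from the benchmark suite linked above.

```java
import java.util.stream.IntStream;

public class IterationStyles {
    // External iteration: the user writes the loop machinery explicitly,
    // so it is lexically visible to the compiler at this point.
    static long sumOfSquaresLoop(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += (long) data[i] * data[i];
        }
        return sum;
    }

    // Internal iteration: the loop lives inside the stream library;
    // only the loop body appears here, as a lambda.
    static long sumOfSquaresStream(int[] data) {
        return IntStream.of(data)
                .mapToLong(x -> (long) x * x)
                .sum();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sumOfSquaresLoop(data));   // prints 30
        System.out.println(sumOfSquaresStream(data)); // prints 30
    }
}
```

Both methods compute the same value; whether the second compiles to code as good as the first depends on how far the lambda is inlined into the library's hidden loop.
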
> External iterators are easier to optimize, because their crucial iteration
> logic is always inlined into the specific loop request as coded by the
> user. (It has to be, because the user writes it explicitly.) Existing
> compilers, like HotSpot's, are good at optimizing "for" loops.
> HotSpot is less good at internal iterators. If the original point of the
> user request fails to inline all the way into the internal looping part of
> the algorithm (a hidden "for" loop), the quality of the loop will be very
> poor. (Exception: With micro-benchmarks, the loop quality can be
> partially recovered by relying on a pure, clean profile. But with real
> system code, the hidden "for" loop will have a polluted profile.) This is
> the problem I have referred to as "loop customization", even though it
> applies to non-looping code as well (as long as there is a template that
> needs expanding in order to gain performance).
> If streams are to perform at peak, we need to be able to connect the user
> request (where the user would have written a classic "for" loop, but chose
> a stream instead) to the expansion, customization, and optimization of the
> hidden loop. (Note that the hidden loop may run in a different thread!
> This defeats the usual forms of inlining.) Somehow the conditions of the
> user request need to be communicated to the code that actually does the
> work. The code that actually runs the loop must be customized to take into
> account whatever "syndrome" of template parameters (such as closure bodies
> or operand types) is critical to optimizing the loop code.
> There are two natural scales of optimization: Per-chunk (sped up by
> multi-threading) and per-loop-body (sped up by vectorization and
> unrolling). Out of the box, the "parallel" modifier gives good
> multi-threading. But I am afraid that the loop body optimizations are less
> well behaved, for streams.
> I would like to encourage any interested colleagues to examine streams
> performance under the following conditions:
> 1. Head-to-head comparison of "for" loops and equivalent streams.
> 2. Proper vectorization of both forms of loops, at least for arraywise
> elemental operations, searches, and reductions. This would include issuing
> the best vectorizing instructions available for the platform (I'm thinking
> Haswell, etc.).
> 3. Benchmark management which operates multiple loop examples per JVM, to
> simulate realistically "dirty" hidden-loop profiles.
> 4. Artificial suppression of inlining from the request point (of a
> stream-based loop) to the algorithm's hidden loop, again to simulate
> realistically the compilation of loop kernels to run in multiple threads.
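
A minimal sketch of what a "dirty" hidden-loop profile (condition 3) might look like, without JMH and under stated assumptions: the class PollutedProfile and its apply method are hypothetical stand-ins for a library's hidden loop, and the three lambdas are arbitrary. The idea is that several loop bodies funnel through one shared call site, so the profile collected there is no longer specific to any single request.

```java
import java.util.function.IntUnaryOperator;

public class PollutedProfile {
    // Stand-in for the "hidden loop": one shared method, so every caller
    // contributes to the same profile for the op.applyAsInt call site.
    static long apply(int[] data, IntUnaryOperator op) {
        long sum = 0;
        for (int x : data) {
            sum += op.applyAsInt(x);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }

        // Three distinct lambda classes reach the same call site, so its
        // receiver-type profile records several types instead of one.
        IntUnaryOperator[] ops = { x -> x + 1, x -> x * 2, x -> -x };
        long total = 0;
        for (int round = 0; round < 10_000; round++) {
            total += apply(data, ops[round % ops.length]);
        }
        System.out.println(total);
    }
}
```

In a real JMH harness the same effect would come from running several benchmark loops in one JVM fork. For condition 4, JMH's @CompilerControl(CompilerControl.Mode.DONT_INLINE) annotation on the method holding the hidden loop is one possible way to suppress inlining from the request point, though that is an assumption about how such a test might be set up, not something the original message specifies.
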
> All of the above examples are focused on measuring and improving
> per-loop-body optimizations (vectorization and other loop transforms).
> None of them need to be run with FJ or multiple threads. The JMH framework
> would be very useful for running the tests.
> It may be that after a head-to-head comparison we will find that the
> HotSpot optimizer is better than I'm giving it credit for. I have not made
> these studies myself. But my usual experience is that rocks like this,
> when you turn them over, have something "interesting" under them.
> None of this is urgent. I'm putting it out as a possibly interesting
> project for people to collaborate on.
> — John
More information about the hotspot-compiler-dev