perspectives on streams performance

Paul Sandoz paul.sandoz at
Tue Mar 10 11:27:42 UTC 2015

Hi John,

A common thing people run up against with simple benchmarks is the sequential benchmark "falling off the performance cliff" due to inlining thresholds being reached. 

Things start out fine, but then after a certain number of iterations the performance drops. Usually this happens after 10000 iterations when the stream pipeline gets inlined and the inline threshold is reached within the hot loop, thus effectively deoptimizing it. Can be quite mysterious for those uninitiated in hotspot behaviour.

I dunno how likely that would be in more real world scenarios, but perhaps it could happen for longer running applications. In such cases do we need to prioritize the hot loops over them being pulled partially into larger generated code blocks?

I suspect that is probably a smaller issue compared to profile pollution.

The stream pipeline is executed by creating a Consumer, that represents the fused pipeline calling behavioural parameters etc. It passes the Consumer to Spliterator.forEachRemaining. So if we wanted to start out with simpler investigations perhaps we could pick say the Spliterator for int[] (avoid boxing issues) with multiple IntConsumer instances. 


On Mar 6, 2015, at 2:01 AM, John Rose <john.r.rose at> wrote:

> In order to get the full benefit from JDK 8 streams we will need to make them optimize fully.  Here are a few thoughts about that.
> I think of streams as a more concise and orderly replacement of classic "for" loops.  Every stream can be rewritten as one or more for-loops, at the cost of verbosity and commitment to hard-coding optimizations (like FJ).
> A classic "for" loop is a external iterator notation:  The iteration machinery is outside of (lexically around) the data structure access.   This notation is at least as old as Fortran.  Streams are an internal iteration notation:  The iteration machinery (crucially, the looping part of the algorithm) is inside the data structure, and only the loop body appears in the user code (as a lambda).  This notation is also old, found in Lisp and Smalltalk.
> External iterators are easier to optimize, because their crucial iteration logic is always inlined into the specific loop request as coded by the user.  (It has to be, because the user writes it explicitly.)  Existing compilers, like HotSpot's, are good at optimizing "for" loops.
> HotSpot are less good at internal iterators.  If the original point of the user request fails to inline all the way into the internal looping part of the algorithm (a hidden "for" loop), the quality of the loop will be very poor.  (Exception:  With micro-benchmarks, the loop quality can be partially recovered by relying on a pure, clean profile.  But with real system code, the hidden "for" loop will have a polluted profile.)  This is the problem I have referred to as "loop customization", even though it applies to non-looping code as well (as long as there is a template that needs expanding in order to gain performance).
> If streams are to perform at peak, we need to be able to connect the user request (where the user would have written a classic "for" loop, but chose a stream instead) to the expansion, customization, and optimization of the hidden loop.  (Note that the hidden loop may run in a different thread!  This defeats the usual forms of inlining.)  Somehow the conditions of the user request need to be communicated to the code that actually does the work.  The code that actually runs the loop must be customized to take into account whatever "syndrome" of template parameters (such as closure bodies or operand types) that are critical to optimizing the loop code.
> There are two natural scales of optimization:  Per-chunk (sped up by multi-threading) and per-loop-body (sped up by vectorization and unrolling).  Out of the box, the "parallel" modifier gives good multi-threading.  But I am afraid that the loop body optimizations is less well behaved, for streams.
> I would like to encourage any interested colleagues to examine streams performance under the following conditions:
> 1. Head-to-head comparison of "for" loops and equivalent streams.
> 2. Proper vectorization of both forms of loops, at least for arraywise elemental operations, searches, and reductions.  This would include issuing the best vectorizing instructions available for the platform (I'm thinking Haswell, etc.).
> 3. Benchmark management which operates multiple loop examples per JVM, to simulate realistically "dirty" hidden-loop profiles.
> 4. Artificial suppression of inlining from the request point (of a stream-based loop) to the algorithm's hidden loop, again to simulate realistically the compilation of loop kernels to run in multiple threads.
> All of the above examples are focused on measuring and improving per-loop-body optimizations (vectorization and other loop transforms).  None of them need to be run with FJ or multiple threads.  The JMH framework would be very useful for running the tests.
> It may be that after a head-to-head comparison we will find that the HotSpot optimizer is better than I'm giving it credit for.  I have not made these studies myself.  But my usual experience is that rocks like this, when you turn them over, have something "interesting" under them.
> None of this is urgent.  I'm putting it out as a possibly interesting project for people to collaborate on.
> — John

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <>

More information about the hotspot-compiler-dev mailing list