perspectives on streams performance
john.r.rose at oracle.com
Fri Mar 6 01:01:20 UTC 2015
In order to get the full benefit from JDK 8 streams we will need to make them optimize fully. Here are a few thoughts about that.
I think of streams as a more concise and orderly replacement of classic "for" loops. Every stream can be rewritten as one or more for-loops, at the cost of verbosity and commitment to hard-coding optimizations (like FJ).
A classic "for" loop is a external iterator notation: The iteration machinery is outside of (lexically around) the data structure access. This notation is at least as old as Fortran. Streams are an internal iteration notation: The iteration machinery (crucially, the looping part of the algorithm) is inside the data structure, and only the loop body appears in the user code (as a lambda). This notation is also old, found in Lisp and Smalltalk.
External iterators are easier to optimize, because their crucial iteration logic is always inlined into the specific loop request as coded by the user. (It has to be, because the user writes it explicitly.) Existing compilers, like HotSpot's, are good at optimizing "for" loops.
HotSpot are less good at internal iterators. If the original point of the user request fails to inline all the way into the internal looping part of the algorithm (a hidden "for" loop), the quality of the loop will be very poor. (Exception: With micro-benchmarks, the loop quality can be partially recovered by relying on a pure, clean profile. But with real system code, the hidden "for" loop will have a polluted profile.) This is the problem I have referred to as "loop customization", even though it applies to non-looping code as well (as long as there is a template that needs expanding in order to gain performance).
If streams are to perform at peak, we need to be able to connect the user request (where the user would have written a classic "for" loop, but chose a stream instead) to the expansion, customization, and optimization of the hidden loop. (Note that the hidden loop may run in a different thread! This defeats the usual forms of inlining.) Somehow the conditions of the user request need to be communicated to the code that actually does the work. The code that actually runs the loop must be customized to take into account whatever "syndrome" of template parameters (such as closure bodies or operand types) that are critical to optimizing the loop code.
There are two natural scales of optimization: Per-chunk (sped up by multi-threading) and per-loop-body (sped up by vectorization and unrolling). Out of the box, the "parallel" modifier gives good multi-threading. But I am afraid that the loop body optimizations is less well behaved, for streams.
I would like to encourage any interested colleagues to examine streams performance under the following conditions:
1. Head-to-head comparison of "for" loops and equivalent streams.
2. Proper vectorization of both forms of loops, at least for arraywise elemental operations, searches, and reductions. This would include issuing the best vectorizing instructions available for the platform (I'm thinking Haswell, etc.).
3. Benchmark management which operates multiple loop examples per JVM, to simulate realistically "dirty" hidden-loop profiles.
4. Artificial suppression of inlining from the request point (of a stream-based loop) to the algorithm's hidden loop, again to simulate realistically the compilation of loop kernels to run in multiple threads.
All of the above examples are focused on measuring and improving per-loop-body optimizations (vectorization and other loop transforms). None of them need to be run with FJ or multiple threads. The JMH framework would be very useful for running the tests.
It may be that after a head-to-head comparison we will find that the HotSpot optimizer is better than I'm giving it credit for. I have not made these studies myself. But my usual experience is that rocks like this, when you turn them over, have something "interesting" under them.
None of this is urgent. I'm putting it out as a possibly interesting project for people to collaborate on.
More information about the hotspot-compiler-dev