perspectives on streams performance
kirk at kodewerk.com
Fri Mar 6 09:16:54 UTC 2015
I can comment on what I’m seeing out in the wild with streams. We just had an interesting experience with them yesterday, but unfortunately the chap who put the test together did it in such a way that it suffers from the classic hot-path/cold-path optimization problem, so there is some work to be done before we can look at what HotSpot is doing to the code. That said, the code written imperatively runs at 25,000,000 TPS. The code run with lambdas and parallel streams hits 23,000,000 TPS. In the course of the latter runs, we cannot pin the CPU; it tops out at about 80%. In this case the code is running through millions of “FpML-like documents” and summing up certain fields. I suspect that the problem is due to one of, or a combination of, boxing and being unable to “tune” the spliterator. The overall workload is large enough that it should split nicely into chunks big enough not to be bothered by FJ overhead; however, that’s not what we appear to be seeing. Additionally, we see a performance hit of the same magnitude when we run without parallel streams.
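A minimal sketch of the boxing suspect described above. The `Trade` type and `notional` field are hypothetical stand-ins for the FpML-like documents, not the actual benchmark; the point is only the difference between a boxed `Stream<Integer>` reduction and a primitive `IntStream` sum:

```java
import java.util.Arrays;
import java.util.List;

public class SumBoxing {
    // Hypothetical stand-in for an "FpML-like document"; not the actual benchmark type.
    static class Trade {
        final int notional;
        Trade(int notional) { this.notional = notional; }
        int notional() { return notional; }
    }

    // Boxed pipeline: map(...) yields Stream<Integer>, allocating one Integer per element
    // before the reduction runs.
    static int sumBoxed(List<Trade> trades) {
        return trades.stream().map(Trade::notional).reduce(0, Integer::sum);
    }

    // Primitive pipeline: mapToInt keeps the whole reduction on unboxed ints.
    static int sumUnboxed(List<Trade> trades) {
        return trades.stream().mapToInt(Trade::notional).sum();
    }

    public static void main(String[] args) {
        List<Trade> trades = Arrays.asList(new Trade(100), new Trade(250), new Trade(50));
        System.out.println(sumBoxed(trades) + " " + sumUnboxed(trades));
    }
}
```

Both variants return the same answer; the difference shows up only in allocation and in what the JIT can do with the hidden loop.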
I caution that these are very preliminary results as I’ve only had an hour with the bench. I’m planning on spending more time sorting out what’s going on with it next week. It’s (unfortunately) not running in JMH yet. I’m also not in a position to share the code, though if this use case is of interest I’ll take the steps to see if I can get it to you. I suspect that it will be representative of a fairly common use case in certain industries.
On Mar 6, 2015, at 1:01 AM, John Rose <john.r.rose at oracle.com> wrote:
> In order to get the full benefit from JDK 8 streams we will need to make them optimize fully. Here are a few thoughts about that.
> I think of streams as a more concise and orderly replacement of classic "for" loops. Every stream can be rewritten as one or more for-loops, at the cost of verbosity and commitment to hard-coding optimizations (like FJ).
> A classic "for" loop is an external iteration notation: The iteration machinery is outside of (lexically around) the data structure access. This notation is at least as old as Fortran. Streams are an internal iteration notation: The iteration machinery (crucially, the looping part of the algorithm) is inside the data structure, and only the loop body appears in the user code (as a lambda). This notation is also old, found in Lisp and Smalltalk.
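The two notations side by side, as a minimal sketch (the names are illustrative, not from the original message). In the external form the caller writes the loop; in the internal form the loop lives inside the collection and the caller supplies only the body:

```java
import java.util.Arrays;
import java.util.List;

public class IterationStyles {
    // External iteration: the looping machinery is written out by the caller,
    // lexically around the data structure access.
    static String concatExternal(List<String> names) {
        StringBuilder sb = new StringBuilder();
        for (String n : names) {
            sb.append(n);
        }
        return sb.toString();
    }

    // Internal iteration: the loop is inside the collection; the caller passes
    // only the loop body, here as a method reference.
    static String concatInternal(List<String> names) {
        StringBuilder sb = new StringBuilder();
        names.forEach(sb::append);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "c");
        System.out.println(concatExternal(names) + " " + concatInternal(names));
    }
}
```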
> External iterators are easier to optimize, because their crucial iteration logic is always inlined into the specific loop request as coded by the user. (It has to be, because the user writes it explicitly.) Existing compilers, like HotSpot's, are good at optimizing "for" loops.
> HotSpot is less good at internal iterators. If the original point of the user request fails to inline all the way into the internal looping part of the algorithm (a hidden "for" loop), the quality of the loop will be very poor. (Exception: With micro-benchmarks, the loop quality can be partially recovered by relying on a pure, clean profile. But with real system code, the hidden "for" loop will have a polluted profile.) This is the problem I have referred to as "loop customization", even though it applies to non-looping code as well (as long as there is a template that needs expanding in order to gain performance).
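A minimal illustration of the polluted-profile point (a hand-written stand-in for a stream's hidden loop, not JDK code): one shared loop body call site is reached with many distinct lambda bodies, so in real systems its profile becomes megamorphic and defeats inlining unless the loop is customized per request point:

```java
import java.util.function.IntUnaryOperator;

public class ProfilePollution {
    // One hidden loop shared by every caller. The op.applyAsInt call site sees
    // every lambda ever passed in, so in a real system its type profile is
    // polluted (megamorphic), which blocks inlining of the loop body.
    static int apply(int[] data, IntUnaryOperator op) {
        int acc = 0;
        for (int v : data) {
            acc += op.applyAsInt(v);   // call site shared by all callers
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        int a = apply(data, v -> v + 1);   // body #1
        int b = apply(data, v -> v * 2);   // body #2
        int c = apply(data, v -> -v);      // body #3
        System.out.println(a + " " + b + " " + c);
    }
}
```

A clean micro-benchmark would call `apply` with a single lambda, leaving the call site monomorphic and easy to inline, which is exactly why micro-benchmarks can paint too rosy a picture here.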
> If streams are to perform at peak, we need to be able to connect the user request (where the user would have written a classic "for" loop, but chose a stream instead) to the expansion, customization, and optimization of the hidden loop. (Note that the hidden loop may run in a different thread! This defeats the usual forms of inlining.) Somehow the conditions of the user request need to be communicated to the code that actually does the work. The code that actually runs the loop must be customized to take into account whatever "syndrome" of template parameters (such as closure bodies or operand types) that are critical to optimizing the loop code.
> There are two natural scales of optimization: Per-chunk (sped up by multi-threading) and per-loop-body (sped up by vectorization and unrolling). Out of the box, the "parallel" modifier gives good multi-threading. But I am afraid that loop-body optimization is less well behaved for streams.
> I would like to encourage any interested colleagues to examine streams performance under the following conditions:
> 1. Head-to-head comparison of "for" loops and equivalent streams.
> 2. Proper vectorization of both forms of loops, at least for arraywise elemental operations, searches, and reductions. This would include issuing the best vectorizing instructions available for the platform (I'm thinking Haswell, etc.).
> 3. Benchmark management which operates multiple loop examples per JVM, to simulate realistically "dirty" hidden-loop profiles.
> 4. Artificial suppression of inlining from the request point (of a stream-based loop) to the algorithm's hidden loop, again to simulate realistically the compilation of loop kernels to run in multiple threads.
> All of the above examples are focused on measuring and improving per-loop-body optimizations (vectorization and other loop transforms). None of them need to be run with FJ or multiple threads. The JMH framework would be very useful for running the tests.
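The head-to-head pair in point 1 can be sketched as below. This is only the kernel pair, not a benchmark: real measurements should wrap each method in a JMH `@Benchmark` and follow the harness's rules about dead-code elimination and warmup.

```java
import java.util.stream.IntStream;

public class HeadToHead {
    // Classic external "for" loop over an int[]: the iteration logic is inlined
    // at the request point because the user wrote it there.
    static long sumLoop(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Equivalent stream pipeline: the hidden loop lives inside the IntStream
    // machinery and must inline back to this request point to optimize well.
    static long sumStream(int[] data) {
        return IntStream.of(data).asLongStream().sum();
    }

    public static void main(String[] args) {
        int[] data = IntStream.range(0, 1_000).toArray();
        System.out.println(sumLoop(data) + " " + sumStream(data));
    }
}
```

Conditions 3 and 4 then amount to running several such kernel pairs in one JVM (to dirty the hidden-loop profile) and suppressing inlining from the request point, respectively.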
> It may be that after a head-to-head comparison we will find that the HotSpot optimizer is better than I'm giving it credit for. I have not made these studies myself. But my usual experience is that rocks like this, when you turn them over, have something "interesting" under them.
> None of this is urgent. I'm putting it out as a possibly interesting project for people to collaborate on.
> — John