C/P/N/Q par vs. seq break-even analysis with 10ms think time
edharned at gmail.com
Tue Oct 16 10:51:05 PDT 2012
Perhaps you are barking up the wrong tree. You can't fix something that
isn't broken.
The framework is not a parallel engine. The framework is not even a
general-purpose application development service.
The framework is an academic experiment underpinning a research paper. It
is performing exactly as designed when used for graph processing of large
aggregate data structures.
That “the execution is dominated by threads wandering around” is the result
of faulty work stealing from the submission queue(s)
and of threads being unable to find work when the computation is modest.
The excessive thread creation is well documented and is probably the
dooming factor for this framework.
Fork/Join is a niche product. F/J simply means splitting the work into
fragments and joining the results together. You can split the work into
components and let each component execute independently, or you can
decompose the work dynamically. In either case, it is still a small niche
product. By itself, F/J is not parallel processing; F/J is just one aspect
of parallel processing, and dynamic decomposition is just one facet of F/J.
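For readers unfamiliar with the term, the split-then-join idea can be sketched with the standard RecursiveTask API. The array-sum task, threshold, and class names here are illustrative, not anything from this thread:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Dynamic decomposition: recursively split an array sum until fragments
// fall below a threshold, then join the partial results back together.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;  // illustrative cutoff
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();              // split: left half runs asynchronously
        long r = right.compute(); // reuse this thread for the right half
        return r + left.join();   // join: combine the fragment results
    }
}

public class Demo {
    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total);  // 49995000
    }
}
```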
What you want for lambda is a parallel engine, probably something
under-the-covers for API developers only (akin to the Unsafe class),
something like Microsoft's TPL. You need to start at the beginning and
design a parallel engine for general-purpose commercial application
development. An academic-centric niche product just doesn't cut it.
Using this framework for other than its intended use will only reveal its
flaws in ways that embarrass the team.
On Tue, Oct 16, 2012 at 11:27 AM, Aleksey Shipilev <
aleksey.shipilev at oracle.com> wrote:
> This is a more thorough analysis of what's going on at the break-even
> point in the C/P/N/Q experiment. I've taken the fjp-trace profiling
> at the break-even point, and the results are here. The new feature
> for fjp-trace can reconstruct the entire decomposition tree, which you
> might want to peek at here.
> - notice that the handoff from the submitter to the FJP takes quite a bit
> of time, roughly 70us in this case;
> - the entire task finishes in ~500us, but the trace shows execution for
> only ~310us. This is due to the fjp-trace architecture, which cannot record
> the JOIN in the external submitters (yet). This might very well mean the
> handoff back to the blocked submitter takes another 100us.
> - threads are waking up rather slowly (on this timescale); full-blown
> parallelism lasts for only ~50us.
> So, here's what we have on the table. If I understand this data
> correctly, then the 500us execution divides as:
> ~70us: handoff to FJP
> ~200us: FJP rampup
> ~50us: FJP steady (even though lots of balancing)
> ~100us: result handoff
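The fixed handoff cost in that breakdown can also be probed crudely, outside fjp-trace, with a plain nanoTime round trip on a no-op task: its wall time is almost entirely submission and wakeup overhead. This is an illustrative sketch (class name and iteration counts are mine), not the methodology used above:

```java
import java.util.concurrent.ForkJoinPool;

// Crude probe: the best-case round trip of a no-op task submitted from an
// external thread approximates the fixed handoff + join overhead of FJP.
public class HandoffProbe {
    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();
        // Warm up so JIT compilation and worker creation don't dominate.
        for (int i = 0; i < 10_000; i++) pool.submit(() -> {}).join();
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 100; i++) {
            long t0 = System.nanoTime();
            pool.submit(() -> {}).join();  // external submit + blocked join
            best = Math.min(best, System.nanoTime() - t0);
        }
        System.out.println("best no-op round trip: " + best + " ns");
        pool.shutdown();
    }
}
```

The number this prints is machine-dependent and noisier than fjp-trace, but it gives a quick lower bound on the per-task fixed cost discussed above.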
> That means that if we want to pursue parallel decompositions on a smaller
> scale, we need to figure out the rampup effects first. I have yet to
> figure out whether the rampup effects are due to sequential decomposition
> in the lambda code, or whether they are genuine threading lags.
> Another thing is the interface between the submitter and the FJP. I
> vaguely recall the infrastructure for allowing submitters to run the
> tasks themselves in place, but how much effort would it take to get that
> to at least experimental readiness? (Also, I don't see how/if the
> CountedCompleters could interoperate with submitters in this case; is
> there any option to make the submitter the last completer?)
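For reference, the status quo the question is about: with CountedCompleters today the external submitter simply blocks in invoke() on the root completer rather than participating as the last completer itself. A minimal sketch (task, threshold, and names are illustrative):

```java
import java.util.concurrent.CountedCompleter;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicLong;

// Completion-based decomposition: leaves add their partial sums and call
// tryComplete(), which propagates completion up to the root. The external
// submitter is not part of the completion chain; it just blocks in invoke().
class RangeSum extends CountedCompleter<Void> {
    static final AtomicLong SUM = new AtomicLong();
    final int lo, hi;

    RangeSum(CountedCompleter<?> parent, int lo, int hi) {
        super(parent);
        this.lo = lo; this.hi = hi;
    }

    @Override
    public void compute() {
        if (hi - lo > 100) {
            int mid = (lo + hi) >>> 1;
            addToPendingCount(1);                // one forked child completes us
            new RangeSum(this, mid, hi).fork();
            new RangeSum(this, lo, mid).compute();
        } else {
            long s = 0;
            for (int i = lo; i < hi; i++) s += i;
            SUM.addAndGet(s);
            tryComplete();                       // propagate completion upward
        }
    }
}

public class CompleterDemo {
    public static void main(String[] args) {
        // The submitter blocks here until the root completer fires.
        new ForkJoinPool().invoke(new RangeSum(null, 0, 1_000));
        System.out.println(RangeSum.SUM.get());  // 499500
    }
}
```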
>  https://github.com/shipilev/fjp-trace
>  http://shipilev.net/pub/jdk/lambda/20121003-fjpwakeup/