C/P/N/Q par vs. seq break-even analysis with 10ms think time
aleksey.shipilev at oracle.com
Tue Oct 16 08:27:04 PDT 2012
This is a more thorough analysis of what's going on at the break-even
point in the C/P/N/Q experiment. I've taken an fjp-trace profile
at the break-even point, and the results are here. The new feature
in fjp-trace can reconstruct the entire decomposition tree, which you
might want to peek at here.
- notice that the handoff from the submitter to FJP takes quite a bit
of time, about 70us in this case;
- the entire task finishes in ~500us, but the trace shows execution for
only ~310us. This is due to the fjp-trace architecture, which cannot
record the JOIN in external submitters (yet). This might very well mean
the handoff back to the blocked submitter takes another 100us;
- threads are waking up rather slowly (on this timescale), so full-blown
parallelism lasts for only about 50us.
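
The two handoffs in the list above can also be estimated without
fjp-trace. Below is a minimal sketch, not the fjp-trace method: the
class HandoffProbe and the task StampTask are hypothetical names made
up for illustration. It stamps System.nanoTime() in the submitter and
inside the task, and reports submit-to-start and finish-to-resume
latencies for a trivial task. It assumes the task runs on a pool
worker; some JDK versions may let the submitter help, which would make
the numbers look smaller.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class HandoffProbe {
    /** Hypothetical trivial task: records when it starts and finishes. */
    static final class StampTask extends RecursiveTask<Long> {
        volatile long startedAt;
        volatile long finishedAt;

        @Override
        protected Long compute() {
            startedAt = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < 1_000; i++) {
                sum += i;
            }
            finishedAt = System.nanoTime();
            return sum;
        }
    }

    /** Returns {submit-to-start, finish-to-resume} in nanoseconds. */
    static long[] measureOnce(ForkJoinPool pool) {
        StampTask task = new StampTask();
        long submittedAt = System.nanoTime();
        pool.invoke(task);                      // submit and block until done
        long resumedAt = System.nanoTime();
        return new long[] {
            task.startedAt - submittedAt,       // handoff to FJP
            resumedAt - task.finishedAt         // handoff back to submitter
        };
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();
        for (int i = 0; i < 10_000; i++) {
            measureOnce(pool);                  // warm up JIT and workers
        }
        long[] r = measureOnce(pool);
        System.out.printf("submit -> start: %d us, finish -> resume: %d us%n",
                r[0] / 1_000, r[1] / 1_000);
        pool.shutdown();
    }
}
```

A single sample is noisy; in practice one would collect many samples
and look at the distribution, and insert a think time between
invocations so the workers actually go idle as in the experiment.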
So, here's what we've got on the table. If I understand this data
correctly, then the 500us execution divides as:
~70us: handoff to FJP
~200us: FJP rampup
~50us: FJP steady state (albeit with lots of balancing)
~100us: result handoff
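
A quick tally of the breakdown above, as a sketch only: the class name
BreakEvenTally is made up, and the inputs are the approximate figures
quoted in this message, so the results are rough as well.

```java
final class BreakEvenTally {
    // Approximate figures from the breakdown above, in microseconds.
    static final long TOTAL_US = 500;
    static final long HANDOFF_IN_US = 70;
    static final long RAMPUP_US = 200;
    static final long STEADY_US = 50;
    static final long HANDOFF_OUT_US = 100;

    /** Sum of the four listed phases. */
    static long accountedUs() {
        return HANDOFF_IN_US + RAMPUP_US + STEADY_US + HANDOFF_OUT_US;
    }

    /** Part of the ~500us total not covered by the listed phases. */
    static long unaccountedUs() {
        return TOTAL_US - accountedUs();
    }

    /** Share of the total spent at full parallelism, in percent. */
    static long steadySharePercent() {
        return 100 * STEADY_US / TOTAL_US;
    }

    public static void main(String[] args) {
        System.out.println("accounted:    " + accountedUs() + " us");       // 420 us
        System.out.println("unaccounted:  " + unaccountedUs() + " us");     // 80 us
        System.out.println("steady share: " + steadySharePercent() + " %"); // 10 %
    }
}
```

With these estimates only about a tenth of the wall time is spent at
full parallelism, and the listed phases sum to ~420us of the ~500us.
Given how rough the inputs are, the remainder should not be read as a
separate phase.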
That means if we want to pursue parallel decompositions at a smaller
scale, we need to figure out the rampup effects first. I have yet to
figure out whether the rampup effects are due to sequential
decomposition in the lambda code, or to genuine threading lag.
Another thing is the interface between the submitter and the FJP. I
vaguely recall the infrastructure for allowing submitters to run the
tasks themselves in place, but how much effort would it take to get
that to at least experimental readiness? (Also, I don't see how/if
CountedCompleters could interoperate with submitters in this case; is
there any option to make the submitter the last completer?)