RFR: Optimizing best of two work stealing queue selection
thomas.schatzl at oracle.com
Fri Jun 22 14:56:45 UTC 2018
On Fri, 2018-06-22 at 09:32 -0400, Zhengyu Gu wrote:
> This optimization is based on this paper . The idea is to keep
> last successfully stolen queue as one of best of two candidates.
> My experiments showed that it only marginally improved successful
> stealing ratio, given the ratio is *very* high to begin with, north
> of 95%, even under extremely unbalanced scenarios (e.g. concurrent
> threads = 3, parallel threads = 11, of course, based on Shenandoah
> ParallelClaimableQueueSet ).
> However, it showed remarkable improvement on task termination:
> I measured how often worker threads have to exit termination
> protocol and retry (failed termination)
The other change suggested in the paper that only checks
num_live_threads instead of max_num_threads during termination would
also be nice to have, but I think (and you were imho right about this)
you wanted to have this separate :)
That other change probably works because then we do the can-we-
terminate check more often the less work there is, see my question
about why not only query the queues with known-elements (at the time of
waking up the threads).
> Before patch:
> Avg: failed termination 4.747 Worst case: 209
> After patch:
> Avg: failed termination 0.513 Worst case: 11
> Although, above result is from one particular run, but it is *very
> reproducible* and also with other benchmarks.
> Hopefully, Aleksey and Thomas have better ways to measure the impact
> to overall termination.
I can do some perf measurements.
My current overall comment of this change is threefold:
- why not immediately contribute in jdk/jdk? There does not seem to be
anything Shenandoah-specific about it.
- I do not see a point in having an optional switch; it seems obviously
superior, and the paper already provided significant feedback on its
usefulness. Aleksey and me will also do measurements. If there were a
switch to enable this I would make this one runtime (i.e. as parameter
to the queueset), and only maybe, if I really wanted a global, use it
as default value for that parameter.
- I would try to make some effort to make this "last-queue-id" thread-
local. I guess there will be lots of memory ping-pong between threads
reading/writing to the same cacheline(s) all the time.
I.e. at least use a PaddedArray, or pass the last-queue-id to
steal_best_of_2() like the seed parameter.
More information about the shenandoah-dev