Stefan Karlsson stefan.karlsson at oracle.com
Fri Jun 12 14:52:11 UTC 2015

Hi all,

The current implementation for distributing tasks to GC worker threads 
often causes long latencies (multiple milliseconds) when the threads are 
started and stopped.

The main reason is that the worker threads have to fight over the 
Monitor lock when they are woken up from the call to Monitor::wait. 
Another reason is that all worker threads call notify_all when they 
finish a task, which wakes all sleeping worker threads and yet again 
forces them to fight over the lock.

I propose that we use semaphores instead, so that the worker threads 
don't have to fight over a lock when they are woken up.

The patches build upon JDK-8087322, which introduces a Semaphore 
utility class. That patch will be sent out for review on hotspot-dev, 
since it affects non-GC parts of the code.

The first patch that I would like to get reviewed is:
  https://bugs.openjdk.java.net/browse/JDK-8087323 - Unify and split the 
work gang classes

It prepares for JDK-8087324, by separating the generic WorkGang 
implementation from the more elaborate YieldingFlexibleWorkGang (CMS) 
implementation. By having this part as a separate patch, I hope it will 
be easier to review JDK-8087324. The patch changes the work gang 
inheritance from:




The parts of the FlexibleWorkGang and WorkGang code that will be used 
by both concrete work gang classes have been moved into 
AbstractWorkGang. I've duplicated some code in WorkGang and 
YieldingFlexibleWorkGang, but that code will be removed from WorkGang in 
the following patch.
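Here is a rough C++ illustration of the resulting shape. Shared state lives in AbstractWorkGang, and WorkGang and YieldingFlexibleWorkGang derive from it directly. The sketch is inferred from the description above and is not taken from the patch. The member names (`update_active_workers`, `run_task`, `AbstractGangTask::work`) are assumptions. `run_task` runs the task serially only as a placeholder.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical task interface: each participating worker calls work().
class AbstractGangTask {
 public:
  virtual ~AbstractGangTask() = default;
  virtual void work(unsigned worker_id) = 0;
};

// Hypothetical base class holding the state both concrete gangs share.
class AbstractWorkGang {
 public:
  AbstractWorkGang(std::string name, unsigned total_workers)
      : _name(std::move(name)),
        _total_workers(total_workers),
        _active_workers(total_workers) {
    assert(total_workers > 0);
  }
  virtual ~AbstractWorkGang() = default;

  const std::string& name() const { return _name; }
  unsigned total_workers() const { return _total_workers; }
  unsigned active_workers() const { return _active_workers; }

  // Shared logic (assumed): clamp the requested number of active workers
  // to the range [1, total_workers].
  unsigned update_active_workers(unsigned requested) {
    _active_workers = std::max(1u, std::min(requested, _total_workers));
    return _active_workers;
  }

  // Each concrete gang decides how a task reaches its worker threads.
  virtual void run_task(AbstractGangTask* task) = 0;

 private:
  std::string _name;
  unsigned _total_workers;
  unsigned _active_workers;
};

// Generic gang. Placeholder body: runs the task serially on the caller,
// where the real gang would hand it to its worker threads.
class WorkGang : public AbstractWorkGang {
 public:
  using AbstractWorkGang::AbstractWorkGang;
  void run_task(AbstractGangTask* task) override {
    for (unsigned i = 0; i < active_workers(); i++) task->work(i);
  }
};

// CMS gang. Its yielding support is omitted; same placeholder body.
class YieldingFlexibleWorkGang : public AbstractWorkGang {
 public:
  using AbstractWorkGang::AbstractWorkGang;
  void run_task(AbstractGangTask* task) override {
    for (unsigned i = 0; i < active_workers(); i++) task->work(i);
  }
};
```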

The second patch I'd like to get reviewed is:
  https://bugs.openjdk.java.net/browse/JDK-8087324 - Use semaphores when 
starting and stopping GC task threads

It first simplifies the way we distribute the tasks to the GC worker 
threads. For example, the coordinator thread dispatches a task to a 
specific number of workers, and then waits for all work to be completed. 
There's no risk that multiple tasks will be scheduled simultaneously, so 
there's no need for the sequence number that is used in the current 
implementation.
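Here is a minimal sketch of why the sequence number becomes unnecessary. It assumes exactly one task is in flight at a time, so the coordinator can simply reset the per-dispatch counters before each dispatch. `DispatchState`, `GangTask` and the method names are hypothetical. This single-threaded version leaves out the wake-up mechanism (Monitor or semaphore) that would publish `_task` and `_num_workers` to the workers.

```cpp
#include <atomic>
#include <cassert>

struct GangTask {};  // hypothetical placeholder for the task type

// Hypothetical per-dispatch bookkeeping. The coordinator publishes one
// task, waits until every worker has finished it, and only then publishes
// the next one, so resetting two counters replaces a sequence number.
class DispatchState {
 public:
  // Coordinator: publish a task for num_workers workers.
  void begin(GangTask* task, unsigned num_workers) {
    assert(_task == nullptr && "only one task may be in flight");
    assert(num_workers > 0);
    _task = task;
    _num_workers = num_workers;
    _started.store(0);
    _finished.store(0);
  }

  // Worker: claim a unique worker id in [0, num_workers).
  unsigned claim_worker_id() {
    unsigned id = _started.fetch_add(1);
    assert(id < _num_workers);
    return id;
  }

  GangTask* task() const { return _task; }

  // Worker: report completion. Returns true only for the last finisher,
  // the one worker that needs to wake the coordinator.
  bool finish() {
    unsigned n = _num_workers;  // read before this finish can end the task
    return _finished.fetch_add(1) + 1 == n;
  }

  // Coordinator: retire the task after all workers have finished.
  void end() {
    assert(_finished.load() == _num_workers);
    _task = nullptr;
  }

 private:
  GangTask* _task = nullptr;
  unsigned _num_workers = 0;
  std::atomic<unsigned> _started{0};
  std::atomic<unsigned> _finished{0};
};
```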

The patch contains two task dispatch / thread synchronization 
implementations.

The first implementation uses Monitors, similar to what we did before 
the patch, but with slightly lower overhead, since the code calls 
notify_all less often. It still suffers from the "thundering herd" 
problem: when the coordinator thread signals that the worker threads 
should start, they all wake up from Monitor::wait and all try to lock 
the Monitor.
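Here is a sketch of how a Monitor-style dispatcher along these lines might look. `std::mutex` and `std::condition_variable` stand in for HotSpot's Monitor (one lock, one wait set). All names are hypothetical. The `shutdown()` method exists only so the example can join its threads. The comments mark where the thundering herd occurs and where only the last finishing worker notifies.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical Monitor-style dispatcher (names are illustrative).
class MonitorDispatcher {
 public:
  using Task = std::function<void(unsigned worker_id)>;

  // Coordinator: run `task` in num_workers slots (worker ids
  // 0..num_workers-1) and block until every slot is done.
  void coordinator_execute(const Task& task, unsigned num_workers) {
    assert(num_workers > 0);
    std::unique_lock<std::mutex> lock(_mutex);
    _task = &task;
    _num_workers = num_workers;
    _started = 0;
    _finished = 0;
    // Wakes every waiting worker; each must reacquire _mutex before its
    // wait() returns, so they serialize here: the "thundering herd".
    _cv.notify_all();
    _cv.wait(lock, [this] { return _finished == _num_workers; });
    _task = nullptr;
  }

  // Worker thread body: claim a slot, run it outside the lock, report back.
  void worker_loop() {
    std::unique_lock<std::mutex> lock(_mutex);
    for (;;) {
      _cv.wait(lock, [this] {
        return _shutdown || (_task != nullptr && _started < _num_workers);
      });
      if (_shutdown) return;
      unsigned id = _started++;
      const Task* task = _task;
      lock.unlock();
      (*task)(id);
      lock.lock();
      // Only the last finisher notifies, not every finishing worker.
      if (++_finished == _num_workers) _cv.notify_all();
    }
  }

  // Example-only shutdown so worker threads can be joined.
  void shutdown() {
    std::lock_guard<std::mutex> lock(_mutex);
    _shutdown = true;
    _cv.notify_all();
  }

 private:
  std::mutex _mutex;
  std::condition_variable _cv;
  const Task* _task = nullptr;
  unsigned _num_workers = 0;
  unsigned _started = 0;
  unsigned _finished = 0;
  bool _shutdown = false;
};

// Starts num_threads workers, dispatches `rounds` tasks with num_workers
// slots each, and returns the sum of (worker_id + 1) over all executions.
unsigned run_rounds(unsigned num_threads, unsigned num_workers, unsigned rounds) {
  MonitorDispatcher dispatcher;
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_threads; i++) {
    threads.emplace_back([&dispatcher] { dispatcher.worker_loop(); });
  }
  std::atomic<unsigned> sum{0};
  const MonitorDispatcher::Task task = [&sum](unsigned id) { sum += id + 1; };
  for (unsigned r = 0; r < rounds; r++) {
    dispatcher.coordinator_execute(task, num_workers);
  }
  dispatcher.shutdown();
  for (std::thread& t : threads) t.join();
  return sum.load();
}
```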

The second, and the more interesting, implementation uses semaphores. 
When the worker threads wake up from the semaphore wait, they don't have 
to serialize the execution by taking a lock. This greatly decreases the 
time it takes to start and stop the worker threads.
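For comparison, here is a sketch of a semaphore-based dispatcher. It uses unnamed POSIX semaphores (`sem_init`, `sem_post`, `sem_wait`), so it assumes a platform such as Linux that supports them; macOS, for example, does not support `sem_init`. The coordinator posts the start semaphore once per worker slot. Workers then claim a worker id with an atomic increment instead of taking a lock, and only the last finisher posts the end semaphore. The class names and `shutdown()` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <functional>
#include <semaphore.h>
#include <thread>
#include <vector>

// Minimal RAII wrapper over an unnamed POSIX semaphore.
class PosixSemaphore {
 public:
  PosixSemaphore() {
    int rc = sem_init(&_sem, /*pshared=*/0, /*value=*/0);
    assert(rc == 0);
    (void)rc;
  }
  ~PosixSemaphore() { sem_destroy(&_sem); }
  PosixSemaphore(const PosixSemaphore&) = delete;
  PosixSemaphore& operator=(const PosixSemaphore&) = delete;

  void signal(unsigned count = 1) {
    for (unsigned i = 0; i < count; i++) sem_post(&_sem);
  }
  void wait() {
    // Retry if the wait is interrupted by a signal handler.
    while (sem_wait(&_sem) == -1 && errno == EINTR) {
    }
  }

 private:
  sem_t _sem;
};

// Hypothetical dispatcher: no lock on the start or stop path.
class SemaphoreDispatcher {
 public:
  using Task = std::function<void(unsigned worker_id)>;

  void coordinator_execute(const Task& task, unsigned num_workers) {
    assert(num_workers > 0);
    _task = &task;  // made visible to workers by sem_post/sem_wait
    _num_workers = num_workers;
    _started.store(0);
    _finished.store(0);
    _start.signal(num_workers);  // release exactly num_workers waits
    _end.wait();                 // posted once, by the last finisher
    _task = nullptr;
  }

  void worker_loop() {
    for (;;) {
      _start.wait();
      if (_shutdown.load()) return;
      // Read shared state before this worker's finish can let the
      // coordinator move on to the next task.
      const Task* task = _task;
      const unsigned n = _num_workers;
      const unsigned id = _started.fetch_add(1);  // lock-free id claim
      (*task)(id);
      if (_finished.fetch_add(1) + 1 == n) _end.signal();
    }
  }

  // Example-only shutdown: release every thread once with the flag set.
  void shutdown(unsigned num_threads) {
    _shutdown.store(true);
    _start.signal(num_threads);
  }

 private:
  PosixSemaphore _start;
  PosixSemaphore _end;
  const Task* _task = nullptr;
  unsigned _num_workers = 0;
  std::atomic<unsigned> _started{0};
  std::atomic<unsigned> _finished{0};
  std::atomic<bool> _shutdown{false};
};

// Starts num_threads workers, dispatches `rounds` tasks with num_workers
// slots each, and returns the sum of (worker_id + 1) over all executions.
unsigned run_rounds(unsigned num_threads, unsigned num_workers, unsigned rounds) {
  SemaphoreDispatcher dispatcher;
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_threads; i++) {
    threads.emplace_back([&dispatcher] { dispatcher.worker_loop(); });
  }
  std::atomic<unsigned> sum{0};
  const SemaphoreDispatcher::Task task = [&sum](unsigned id) { sum += id + 1; };
  for (unsigned r = 0; r < rounds; r++) {
    dispatcher.coordinator_execute(task, num_workers);
  }
  dispatcher.shutdown(num_threads);
  for (std::thread& t : threads) t.join();
  return sum.load();
}
```

For example, `run_rounds(4, 3, 10)` starts four threads and dispatches ten tasks of three slots each. It returns 60, because each round contributes 1 + 2 + 3.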

The semaphore implementation is used on all platforms where the 
Semaphore class has been implemented in JDK-8087322. On other OSes, the 
code falls back to the Monitor-based solution until a Semaphore class 
has been implemented for that OS, so porters might want to consider 
implementing the Semaphore class.

There's also a diagnostic VM option 
(-XX:+/-UseSemaphoreGCThreadsSynchronization) to turn off the 
Semaphore-based implementation, which can be used to debug this new 
code. It's mainly targeted towards support and sustaining engineering.
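Here is a sketch of the selection rule described above. It takes two boolean inputs: whether the OS port provides a Semaphore class, and the value of UseSemaphoreGCThreadsSynchronization. `GangDispatch` and `select_dispatch` are hypothetical names. Like other HotSpot diagnostic flags, this one must be unlocked first, e.g. `-XX:+UnlockDiagnosticVMOptions -XX:-UseSemaphoreGCThreadsSynchronization`.

```cpp
#include <cassert>

// Which task dispatch / thread synchronization strategy a gang uses.
enum class GangDispatch { Semaphore, Monitor };

// Hypothetical rule: use semaphores only when the OS port provides a
// Semaphore class and the diagnostic flag has not turned them off;
// otherwise fall back to the Monitor-based dispatcher.
constexpr GangDispatch select_dispatch(bool os_has_semaphore,
                                       bool use_semaphore_flag) {
  return (os_has_semaphore && use_semaphore_flag) ? GangDispatch::Semaphore
                                                  : GangDispatch::Monitor;
}

// Ports without a Semaphore class always get the Monitor-based code.
static_assert(select_dispatch(false, true) == GangDispatch::Monitor,
              "fallback when the OS has no Semaphore class");
```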

The patches have been performance tested on Linux, Solaris, OSX, and 

The effects of the patch can be seen by running benchmarks with small 
young gen sizes, which trigger frequent, short GCs.

For example, here are results from the SPECjvm2008 xml.transform 
benchmark, run with:
  -Xmx1g -Xms1g -Xmn64m -XX:+PrintGC -XX:+UseG1GC -jar SPECjvm2008.jar -ikv xml.transform -it 30 -wt 30

I got the following GC times:

             Average   Median    99.9 percentile   Max
Baseline:    8.76 ms   8.44 ms   25.9 ms           34.7 ms
Monitor:     6.17 ms   5.88 ms   26.0 ms           49.1 ms
Semaphore:   3.43 ms   3.26 ms   13.4 ms           33.4 ms

If I run an empty GC task 10 times per GC, by running the following code:

I get the following numbers to complete the empty GC tasks:

             Average   Median    99.9 percentile   Max
Baseline:    1.43 ms   0.92 ms   3.43 ms           9.30 ms
Monitor:     0.75 ms   0.72 ms   1.74 ms           2.78 ms
Semaphore:   0.07 ms   0.07 ms   0.17 ms           0.26 ms

The code has been tested with JPRT and our nightly testing suites.

I've created a unit test to run a small test with both the semaphore 
implementation and the monitor implementation:

But since we currently don't have code to shut down worker threads after 
they have been started, I don't want to push this test (or clean it up) 
until we have that in place. I created this bug for that:


