RFR(m): 8214271: Fast primitive to wake many threads

Daniel D. Daugherty daniel.daugherty at oracle.com
Fri Dec 21 22:21:45 UTC 2018

On 12/21/18 4:45 AM, Robbin Ehn wrote:
> Hi David,
> On 2018-12-21 04:17, David Holmes wrote:
>> Got it - subtle.
>> Further this sounds like a race that could lead to bugs if not used 
>> very carefully ie. you can't assume between disarm() and wake() that 
>> all threads are blocked.
> I didn't realize how subtle this is. I think your original comment that
> disarm/wake should be one operation was spot on.
> Investigating... thinking... testing... yes I think this will work, 
> fixed!
> Sorry for not looking more into this before.
>> I think perhaps this needs to be expanded to make this more obvious:
>>    68 //    - A call to wait(tag) will block if the barrier is armed 
>> with the value
>>    69 //      'tag'; else it will return immediately.
>>    70 //    - A blocked thread is eligible to execute again once the 
>> barrier is
>>    71 //      disarmed and wake() has been called.
>> +          - A call to wait(tag) that would block if it continued, 
>> but instead
>> +            is descheduled, may return immediately if scheduled after a
>> +           call to disarm(), but before the call to wake().
>> It also made me realize that in the general case (not when used with 
>> safepoints I think due to other state checks) a wake() may stall due 
>> to threads with a previous tag entering the wait() late.
> I added a double checking in the semaphore version, this means both
> implementation should have progress guarantee.
> Making this v5 a bit large due to a lot of comments being changed.
> Inc:
> http://cr.openjdk.java.net/~rehn/8214271/5/inc/webrev/
> Full:
> http://cr.openjdk.java.net/~rehn/8214271/5/full/webrev/

It's been a while since I've reviewed this topic so I'm going with
the full webrev. Sorry for the long delay in getting back to this:

     No comments.

     L35:     check_type(cond, "%s; error='%s' (errno=%s)", msg, 
os::strerror(err),   \
         nit - The '%s;' should probably be a '%s:' to match other mesg 

     L41: static int futex(volatile int *addr, int futex_op, int op_arg)
     L42: {
         nit - the '{' on L42 should be at the end of L41 (with a space 
in front of it).

     L47:   assert(_futex_barrier == 0, "Already armed");
         Please consider:
           assert(_futex_barrier == 0, "Should not be already armed: "
                  "_futex_barrier=%d", _futex_barrier);

     L53:   assert(_futex_barrier != 0, "Not armed");
         Please consider:
           assert(_futex_barrier != 0, "Should be armed/non-zero.");

     L58:   guarantee_with_errno(s > -1, "futex FUTEX_WAKE");
         Please consider:
            guarantee_with_errno(s > -1, "futex FUTEX_WAKE failed: 
s=%d", s);

     L72:     guarantee_with_errno((s == 0) ||
     L73:                          (s == -1 && errno == EAGAIN) ||
     L74:                          (s == -1 && errno == EINTR),
     L75:                          "futex FUTEX_WAIT");
         Please consider:
           "futex FUTEX_WAIT failed: s=%d", s);

     L76:     // Return value 0: woken up, but re-check in case of 
spurious wakeup
     L78:     // Error EAGAIN: we are already disarmed and so will pass 
the check
         Please add a period to the end of these two lines.

     I like the new header comment.

     L102:   // Returns implementation type.
         "type" or "description"?

     L109-111: indent needs two more spaces.

     L31: // Except for the barrier tag itself, it uses two counters to 
keep the semaphore

           // In addition to the barrier tag, it uses two counters to 
keep the semaphore

     L61:   // Loads of _barrier_threads/_waiters must not float above 
disarm store.
     L62:   OrderAccess::fence();
         Since you are using fence() here, you should also say that "and 
disarm store
         must not float below."

     L73:   OrderAccess::fence();
         Missing your rationale comment for this fence().

     L79:     OrderAccess::fence();
         I was expecting this one to be OrderAccess::loadload().

         Also, missing your rationale comment for this OrderAccess use.

     L59:       OrderAccess::storeload(); // Loads in WB must not float up.
         Not sure why this storeload() is here instead of being in wait().
         Seems like wait() should do the "right thing".

Thumbs up on the code! I don't need to see a new webrev if you choose
to fix the minor things I pointed out above.


> gtest passes thousands of loops locally and hundreds in mach5.
> Thanks, Robbin
>> Thanks,
>> David
>>>> s/Implementation/Implementations/
>>> Fixed
>>>> The fourth line is no longer needed.
>>> Above is the reason I would like to keep the fourth line, since only 
>>> if you call
>>> both disarm() and wake() you have that guarantee that waiter threads 
>>> will
>>> return.
>>> Thanks, Robbin
>>>> Thanks,
>>>> David
>>>>> Inc:
>>>>> http://cr.openjdk.java.net/~rehn/8214271/4/inc/webrev/
>>>>> Full:
>>>>> http://cr.openjdk.java.net/~rehn/8214271/4/full/webrev/
>>>>> /Robbin
>>>>>> Otherwise this all looks good!
>>>>>> Thanks,
>>>>>> David
>>>>>> -----
>>>>>>> Full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/3/full/webrev/
>>>>>>> Thanks, Robbin
>>>>>>> On 11/23/18 5:55 PM, Robbin Ehn wrote:
>>>>>>>> Forgot RFR in subject.
>>>>>>>> /Robbin
>>>>>>>> On 2018-11-23 17:51, Robbin Ehn wrote:
>>>>>>>>> Hi all, please review.
>>>>>>>>> When a safepoint is ended we need a way to get back to 100% 
>>>>>>>>> utilization as fast
>>>>>>>>> as possible. 100% utilization means no idle cpu in the system 
>>>>>>>>> if there is a
>>>>>>>>> JavaThread that could be executed. The traditional ways to 
>>>>>>>>> wake many, e.g.
>>>>>>>>> semaphore, pthread_cond, is not implemented with a single 
>>>>>>>>> syscall instead they
>>>>>>>>> typical do one syscall per thread to wake.
>>>>>>>>> This change-set contains that primitive, the WaitBarrier, and 
>>>>>>>>> a gtest for it.
>>>>>>>>> No actual users, which is in coming patches.
>>>>>>>>> The WaitBarrier solves by doing a cooperative semaphore 
>>>>>>>>> posting, threads woken
>>>>>>>>> will also post. On Linux we can instead directly use a futex 
>>>>>>>>> and with one
>>>>>>>>> syscall wake all. Depending on how many threads and cpus the 
>>>>>>>>> performance vary,
>>>>>>>>> but a good utilization of the machine, just on the edge of 
>>>>>>>>> saturated, the time to reach 100% utilization is around 3 
>>>>>>>>> times faster with the WaitBarrier (where futex is faster than 
>>>>>>>>> semaphore).
>>>>>>>>> Webrev:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/webrev/
>>>>>>>>> CR:
>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214271
>>>>>>>>> Passes 100 iterations of gtest on our platforms, both 
>>>>>>>>> fastdebug and release.
>>>>>>>>> And have been stable when used in safepoints (t1-8) (coming 
>>>>>>>>> patches).
>>>>>>>>> Thanks, Robbin

More information about the hotspot-dev mailing list