RFR (S) 8235962: os::current_thread_id() is not signal safe on macOS
gerard.ziemski at oracle.com
Thu Jan 16 19:14:19 UTC 2020
Thank you very much for the review.
On 1/13/20 8:49 PM, David Holmes wrote:
> Hi Gerard,
> Thanks for tackling this. All extremely confusing as usual :) far too
> many different notions of "thread identity".
> On 14/01/2020 4:43 am, gerard ziemski wrote:
>> hi all,
>> Before JDK-8022808 we used to call "mach_thread_self()" to obtain a
>> thread id, but back then we forgot to release the returned mach port,
>> which would make a long running java process (like Kitchensink)
>> eventually run out of resources and hang it.
>> To solve that, in JDK-8022808, we switched to using
>> "pthread_mach_thread_np(pthread_self())", which returns a static mach
>> port, which does not need to be released.
>> However, the "pthread_mach_thread_np()" API, which we started using
>> is specifically knownn as not async-signal-safe API and that might
>> cause an issue, because we use it to generate thread ids in
>> hs_err_pid.log crash log files in our signal handler, which requires
>> that only async-signal-safe API be used.
>> Our crash handler is best effort approach, and to this extend I'd
>> like to propose a fix, which reverts back to using
>> "mach_thread_self()", which I must admit is unknown whether it itself
>> is async-signal-safe, though it's not part of pthread API, like
>> "pthread_mach_thread_np()" which itself is not part of
>> async-signal-safe APIs (IEEE Std 1003.1, 2016). Additionally, the
>> reporter of the issue suggest that it solves the actual issue with
>> using "pthread_mach_thread_np()" for them in their project, so let's
>> benefit from their experience.
> Okay that seems reasonable.
>> In this fix we revert back to using "mach_thread_self()" API, but
>> make sure to release the mach port, and use MACH_PORT_VALID() macro
>> to verify the validity of mach port.
> I'm unclear on this aspect. Now that we are releasing it doesn't that
> mean that the port value could be reused and applied to a different
> thread? It isn't at all clear to me what a "mach port" actually means
> and what releasing it (or failing to) means.
I've done more research and testing and here is how I understand our
usage of mach ports:
Mach ports are a resource a code needs in order to talk to mach thread
(or task), though we simply use its value as thread id. They (i.e. the
mach kernel) use reference counting to keep track of whether they need
to exist, and get automatically destroyed when the ref count reaches 0.
They can be created via APIs like mach_port_allocate(), and deleted
directly via mach_port_destroy(), or indirectly via
mach_port_deallocate(). The difference between these two (inaptly named
imho) APIs is that while mach_port_destroy() will outright delete the
port, mach_port_deallocate() will only decrease the ref count and will
call mach_port_destroy() implicitly only if it reaches 0.
If the user obtains mach port via mach_thread_self(), the ref count of
that thread will increase, which needs to be matched by a call to
mach_port_deallocate(). However, one can not delete such special mach
port via mach_port_destroy() (only user own created ports can de deleted).
Back in 2013 a call to mach_thread_self() would create a brand new mach
port, so if you did not call mach_port_deallocate() after being done
with it, the resource would be leaked. I believe that was actually a bug
in early OS X, or possibly a bad API design, which affected many other
users, not just us (i.e. Google's Chrome app for Mac). The confusion was
primary due to the fact that mach_task_self() would not create a new
port, so developers expected the same from mach_thread_self(). There are
only about 300,000 ports available to a process (verified via a native
sample that keeps creating new ports via mach_port_allocate() API), so
eventually long lived apps, like our kitchensink, would run out of them.
We could have fixed our issue by matching mach_thread_self() with
mach_port_deallocate() even back then, but instead we chose to use
pthread_mach_thread_np() API, which is documented to return thread's
static mach port, which does not need to be returned via
mach_port_deallocate(). I think that solution just bypassed the mystery
of mach ports and when and if we need to delete them, but also was more
efficient as it did not create a new port.
Today, mach_thread_self() on OS X behaves differently, to how it did
back then - it no longer creates new mach ports, and so in most cases
the matching call to mach_port_deallocate() could be skipped, as the
counter the mach kernel uses for ref count is a very big number. Though
it would eventually overflow, and the kernel then would probably delete
the mach port with tragic results. I tried to see if I could make that
happen, but no luck in 24 hours of a test case continuously asking for
mach_thread_self() every 10 micro seconds.
So today we could have simply reverted to how the code used to be, and
not even bother to call mach_port_deallocate() after mach_thread_self(),
though theoretically that could overflow the ref count eventually.
The proposed fix, reverts back to using mach_thread_self() (which
increases the ref count) because it's probably signal-async-safe, unlike
pthread_mach_thread_np(), AND calls mach_port_deallocate() (which
decreases the refcount)
Someone please correct me if I got anything wrong here.
>> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
>> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev1/index.html
> I found the refactoring around _unique_thread_id a bit hard to follow.
> Your new set_unique_thread_id needs to use "#ifdef __APPLE__"
> otherwise it won't compile on other BSD systems.
I will fix it, test it and post a new webrev, assuming everyone is
satisfied with my answers above.
>> Tested with Mach5 tier1,2,3,4,5 and Kitchensink24 and
>> Kitchensink24Stress locally on Mac laptop over 2 days.
More information about the hotspot-runtime-dev