RFR (S) 8235962: os::current_thread_id() is not signal safe on macOS
david.holmes at oracle.com
Thu Jan 16 22:07:15 UTC 2020
Thank you for the very detailed discussion on mach ports! Please add all
of that to the bug report (plus any references/links to where you found
I'm good with the current fix to "do the right thing" and keep the
reference count in sync.
On 17/01/2020 5:14 am, gerard ziemski wrote:
> Hi David,
> Thank you very much for the review.
> On 1/13/20 8:49 PM, David Holmes wrote:
>> Hi Gerard,
>> Thanks for tackling this. All extremely confusing as usual :) far too
>> many different notions of "thread identity".
>> On 14/01/2020 4:43 am, gerard ziemski wrote:
>>> hi all,
>>> Before JDK-8022808 we used to call "mach_thread_self()" to obtain a
>>> thread id, but back then we forgot to release the returned mach port,
>>> which would make a long running java process (like Kitchensink)
>>> eventually run out of resources and hang it.
>>> To solve that, in JDK-8022808, we switched to using
>>> "pthread_mach_thread_np(pthread_self())", which returns a static mach
>>> port, which does not need to be released.
>>> However, the "pthread_mach_thread_np()" API, which we started using
>>> is specifically knownn as not async-signal-safe API and that might
>>> cause an issue, because we use it to generate thread ids in
>>> hs_err_pid.log crash log files in our signal handler, which requires
>>> that only async-signal-safe API be used.
>>> Our crash handler is best effort approach, and to this extend I'd
>>> like to propose a fix, which reverts back to using
>>> "mach_thread_self()", which I must admit is unknown whether it itself
>>> is async-signal-safe, though it's not part of pthread API, like
>>> "pthread_mach_thread_np()" which itself is not part of
>>> async-signal-safe APIs (IEEE Std 1003.1, 2016). Additionally, the
>>> reporter of the issue suggest that it solves the actual issue with
>>> using "pthread_mach_thread_np()" for them in their project, so let's
>>> benefit from their experience.
>> Okay that seems reasonable.
>>> In this fix we revert back to using "mach_thread_self()" API, but
>>> make sure to release the mach port, and use MACH_PORT_VALID() macro
>>> to verify the validity of mach port.
>> I'm unclear on this aspect. Now that we are releasing it doesn't that
>> mean that the port value could be reused and applied to a different
>> thread? It isn't at all clear to me what a "mach port" actually means
>> and what releasing it (or failing to) means.
> I've done more research and testing and here is how I understand our
> usage of mach ports:
> Mach ports are a resource a code needs in order to talk to mach thread
> (or task), though we simply use its value as thread id. They (i.e. the
> mach kernel) use reference counting to keep track of whether they need
> to exist, and get automatically destroyed when the ref count reaches 0.
> They can be created via APIs like mach_port_allocate(), and deleted
> directly via mach_port_destroy(), or indirectly via
> mach_port_deallocate(). The difference between these two (inaptly named
> imho) APIs is that while mach_port_destroy() will outright delete the
> port, mach_port_deallocate() will only decrease the ref count and will
> call mach_port_destroy() implicitly only if it reaches 0.
> If the user obtains mach port via mach_thread_self(), the ref count of
> that thread will increase, which needs to be matched by a call to
> mach_port_deallocate(). However, one can not delete such special mach
> port via mach_port_destroy() (only user own created ports can de deleted).
> Back in 2013 a call to mach_thread_self() would create a brand new mach
> port, so if you did not call mach_port_deallocate() after being done
> with it, the resource would be leaked. I believe that was actually a bug
> in early OS X, or possibly a bad API design, which affected many other
> users, not just us (i.e. Google's Chrome app for Mac). The confusion was
> primary due to the fact that mach_task_self() would not create a new
> port, so developers expected the same from mach_thread_self(). There are
> only about 300,000 ports available to a process (verified via a native
> sample that keeps creating new ports via mach_port_allocate() API), so
> eventually long lived apps, like our kitchensink, would run out of them.
> We could have fixed our issue by matching mach_thread_self() with
> mach_port_deallocate() even back then, but instead we chose to use
> pthread_mach_thread_np() API, which is documented to return thread's
> static mach port, which does not need to be returned via
> mach_port_deallocate(). I think that solution just bypassed the mystery
> of mach ports and when and if we need to delete them, but also was more
> efficient as it did not create a new port.
> Today, mach_thread_self() on OS X behaves differently, to how it did
> back then - it no longer creates new mach ports, and so in most cases
> the matching call to mach_port_deallocate() could be skipped, as the
> counter the mach kernel uses for ref count is a very big number. Though
> it would eventually overflow, and the kernel then would probably delete
> the mach port with tragic results. I tried to see if I could make that
> happen, but no luck in 24 hours of a test case continuously asking for
> mach_thread_self() every 10 micro seconds.
> So today we could have simply reverted to how the code used to be, and
> not even bother to call mach_port_deallocate() after mach_thread_self(),
> though theoretically that could overflow the ref count eventually.
> The proposed fix, reverts back to using mach_thread_self() (which
> increases the ref count) because it's probably signal-async-safe, unlike
> pthread_mach_thread_np(), AND calls mach_port_deallocate() (which
> decreases the refcount)
> Someone please correct me if I got anything wrong here.
>>> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
>>> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev1/index.html
>> I found the refactoring around _unique_thread_id a bit hard to follow.
>> Your new set_unique_thread_id needs to use "#ifdef __APPLE__"
>> otherwise it won't compile on other BSD systems.
> I will fix it, test it and post a new webrev, assuming everyone is
> satisfied with my answers above.
>>> Tested with Mach5 tier1,2,3,4,5 and Kitchensink24 and
>>> Kitchensink24Stress locally on Mac laptop over 2 days.
More information about the hotspot-runtime-dev