segments and confinement
samuel.audet at gmail.com
Tue May 19 07:03:13 UTC 2020
Have you tried to conduct those experiments with thread-local storage in
C++? The overhead produced by C++ compilers is usually negligible, at
least on Linux:
If performance issues with ThreadLocal turn out to be caused by one of
those limitations of C2, I wonder if GraalVM has the same limitation. In
any case, it would probably need to be turned into some compiler hint
similar to `volatile` or something... :/
Assuming we could prove that we can do what we need to do with efficient
thread-local storage, do you think it would have a chance to spark
awareness of the need to get this working within the JVM?
On 5/15/20 6:21 PM, Maurizio Cimadamore wrote:
> On 15/05/2020 04:10, Samuel Audet wrote:
>> Thanks for the summary!
>> I was about to say that we can probably do funky stuff with
>> thread-local storage, and not only with GC, but for example to prevent
>> threads from trying to access addresses they must not access, but I
>> see you've already started looking at that, at least for GC, so keep
>> going. :)
> For the record - one of the experiments I've tried (but not listed
> here) was specifically by using ThreadLocal storage (to emulate some
> kind of thread group concept) - but that also gave pretty poor results
> performance-wise (not too far from locking) - which seems to suggest
> that, if a solution exists (and this might not be _that_ obvious - after
> all the ByteBuffer API has been struggling with this problem for many
> many years) - it exists at a lower level.
>> In any case, if the final solution could be applied to something else
>> than memory segments that have to be allocated by the VM, then it
>> would have great value for native interop. I hope it goes there.
> The more we can make the segment lifetime general and shareable across
> threads, the more we increase the likelihood of that happening.
> Currently, segments have a fairly restricted lifetime handling (because
> of confinement, which is because of safety) - and the same guarantees
> don't seem useful (or outright harmful) when thinking about native
> libraries and other resources (I don't think the concept of a confined
> native library is very appealing).
> So, IMHO, it all hinges on if and how we can make segments more general
> and useful.
>> On 5/13/20 8:51 PM, Maurizio Cimadamore wrote:
>>> This is an attempt to address some of the questions raised here [1],
>>> in a dedicated thread. None of the info here is new and some of these
>>> things have already been discussed, but it might be good to recap as
>>> to where we are when it comes to memory segment and confinement.
>>> The foreign memory access API has three goals:
>>> * efficiency: access should be as fast as possible (hopefully close to
>>> unsafe access)
>>> * deterministic deallocation: the programmer has a say as to *when*
>>> things should be deallocated
>>> * safety: memory accesses should never cause a hard VM crash
>>> (e.g. by accessing memory out of bounds, or by accessing
>>> memory that has already been deallocated)
>>> Now, as long as memory segments are used by _one thread at a time_
>>> (this pattern is also known as serial confinement), everything works
>>> out nicely. In such a scenario, it is not possible for memory to be
>>> accessed _while_ it is being deallocated. Memory segment spatial
>>> bounds ensure that out-of-bound access is not possible, and the
>>> memory segment liveness check ensures that memory cannot be accessed
>>> _after_ it has been deallocated. All good.
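To make the three checks above concrete, here is a minimal sketch in plain Java. It is not the actual MemorySegment implementation: the class name ConfinedSegment is hypothetical, and a heap ByteBuffer stands in for off-heap memory. It only illustrates that, under confinement, the owner check, the liveness check and the bounds check can all be done without any synchronization.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the real MemorySegment implementation.
final class ConfinedSegment {
    private final Thread owner = Thread.currentThread();
    private final ByteBuffer memory;   // stands in for off-heap memory
    private boolean alive = true;      // plain field: only the owner ever touches it

    ConfinedSegment(int size) {
        this.memory = ByteBuffer.allocate(size);
    }

    private void checkAccess(int offset, int length) {
        // Confinement check: only the creating thread may use the segment.
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("accessed outside owner thread");
        }
        // Liveness (temporal) check: no access after close().
        if (!alive) {
            throw new IllegalStateException("segment already closed");
        }
        // Spatial check: no access outside the segment bounds.
        if (offset < 0 || length < 0 || offset > memory.capacity() - length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
    }

    byte get(int offset) {
        checkAccess(offset, 1);
        return memory.get(offset);
    }

    void put(int offset, byte value) {
        checkAccess(offset, 1);
        memory.put(offset, value);
    }

    void close() {
        checkAccess(0, 0);
        alive = false; // a real implementation would free the memory here
    }
}
```

Because close() and every access run on the same thread, the liveness flag can never change in the middle of an access, which is why a plain boolean is enough in this sketch.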
>>> When we start considering situations where multiple threads want to
>>> access the same segment at the same time, one of the pillars on which
>>> safety relied goes away: namely, we can have races between a thread
>>> accessing memory and a thread deallocating the same memory (e.g. by
>>> closing the segment it is associated with). In other words, safety,
>>> one of the three pillars of the API, is undermined. What are the
>>> options?
>>> *Locking*
>>> The first, obvious solution would be to use some kind of locking
>>> scheme so that, while memory is accessed, it cannot be closed.
>>> Unfortunately, memory access is such a short-lived operation that the
>>> cost of putting a lock acquire/release around it vastly exceeds the
>>> cost of the memory access itself. Furthermore, optimistic locking
>>> strategies, while possible when reading, are not possible when
>>> writing (e.g. you can still write to memory you are not supposed to).
>>> So, unless we want memory access to be super slow (some benchmarks
>>> revealed that, even with the best strategies, we are looking at a
>>> cost of at least 100x over plain access), this is not a feasible solution.
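For illustration, a locking scheme of the kind described above could look roughly like the sketch below. The class LockedSegment is hypothetical and uses a ReentrantReadWriteLock as one possible choice: accessors share the read lock, and close() takes the write lock so it can never overlap with an access. A heap ByteBuffer again stands in for off-heap memory.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of lock-protected access; not an actual API.
final class LockedSegment {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final ByteBuffer memory;  // stands in for off-heap memory
    private boolean alive = true;     // guarded by lock

    LockedSegment(int size) {
        this.memory = ByteBuffer.allocate(size);
    }

    byte get(int offset) {
        lock.readLock().lock();       // shared: many accessors may hold it at once
        try {
            if (!alive) throw new IllegalStateException("segment already closed");
            return memory.get(offset);
        } finally {
            lock.readLock().unlock();
        }
    }

    void put(int offset, byte value) {
        // Also the shared lock: it only excludes close(), not other accessors.
        lock.readLock().lock();
        try {
            if (!alive) throw new IllegalStateException("segment already closed");
            memory.put(offset, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    void close() {
        lock.writeLock().lock();      // exclusive: waits for in-flight accesses
        try {
            alive = false;            // a real implementation would free memory here
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Every single-byte access now pays for a lock acquire and release, which is the overhead the paragraph above refers to.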
>>> *Atomic reference counting*
>>> The solution implemented in Java SE 14 was based on atomic reference
>>> counting - a MemorySegment can be "acquired" by another thread, which
>>> increments the count; closing the acquired view decrements the count.
>>> Safety is achieved by enforcing an additional constraint: a segment
>>> cannot be closed if it has pending acquired views. This scheme is
>>> relatively flexible, allows for efficient, lock-free access, and is
>>> still deterministic. But
>>> the feedback we received was somewhat underwhelming - while access
>>> was allowed to multiple threads, the close() operation was still only
>>> allowed to the original segment owner. This restriction seemed to
>>> defeat the purpose of the acquire scheme, at least in some cases.
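The counting logic behind an acquire scheme like this can be sketched with an AtomicInteger. This is an assumption-laden illustration rather than the Java 14 implementation: CountedScope and its method names are hypothetical, and the sketch covers only the counter, not the per-thread views.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the reference-counting logic only.
final class CountedScope {
    private static final int CLOSED = -1;
    // Number of pending acquired views, or CLOSED once the owner has closed.
    private final AtomicInteger views = new AtomicInteger(0);

    /** Called by a thread that wants its own view; fails once closed. */
    void acquire() {
        int n;
        do {
            n = views.get();
            if (n == CLOSED) {
                throw new IllegalStateException("already closed");
            }
        } while (!views.compareAndSet(n, n + 1));
    }

    /** Called when an acquired view is closed. */
    void release() {
        views.decrementAndGet();
    }

    /** Called by the owner; succeeds only when no acquired views are pending. */
    void close() {
        if (!views.compareAndSet(0, CLOSED)) {
            throw new IllegalStateException("pending views, or already closed");
        }
    }

    boolean isAlive() {
        return views.get() != CLOSED;
    }
}
```

The atomic operations happen only on acquire, release and close, so plain memory accesses through a view stay lock-free.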
>>> *Divide and conquer*
>>> In the API revamp which we hope to deliver for Java 15, the general
>>> acquire mechanism will be replaced by a more targeted capability -
>>> the ability to divide a segment into multiple chunks (using a spliterator)
>>> and have multiple threads have a go at the non-overlapping slices.
>>> This gives a somewhat simpler API, since now all segments are
>>> similarly confined - and the fact that access to the slices occurs
>>> through the spliterator API makes the API somewhat more accessible,
>>> removing the distinction between acquired segments and non-acquired
>>> ones. This is also a more honest approach: indeed the acquire scheme
>>> was really most useful to process the contents of a segment in
>>> parallel - and this is something that the Spliterator API allows you
>>> to do relatively well (plus, we gained automatic synergy with
>>> parallel streams).
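The general idea can be sketched with the standard Spliterator interface alone. The code below does not use the foreign memory API: SliceSpliterator and ParallelSum are hypothetical names, and a ByteBuffer stands in for a segment. Each element handed out is a non-overlapping slice, so a parallel stream can process the slices without any two threads touching the same bytes.

```java
import java.nio.ByteBuffer;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

// Hypothetical sketch: hands out fixed-size, non-overlapping slices of a buffer.
final class SliceSpliterator implements Spliterator<ByteBuffer> {
    private final ByteBuffer memory;
    private final int sliceSize;
    private int start;        // inclusive byte offset, always a multiple of sliceSize
    private final int end;    // exclusive byte offset

    SliceSpliterator(ByteBuffer memory, int sliceSize) {
        this(memory, sliceSize, 0, memory.capacity());
    }

    private SliceSpliterator(ByteBuffer memory, int sliceSize, int start, int end) {
        this.memory = memory;
        this.sliceSize = sliceSize;
        this.start = start;
        this.end = end;
    }

    @Override
    public boolean tryAdvance(Consumer<? super ByteBuffer> action) {
        if (start >= end) return false;
        int length = Math.min(sliceSize, end - start);
        ByteBuffer view = memory.duplicate(); // independent position and limit
        view.clear();
        view.position(start);
        view.limit(start + length);
        action.accept(view.slice());
        start += length;
        return true;
    }

    @Override
    public Spliterator<ByteBuffer> trySplit() {
        int slices = (end - start + sliceSize - 1) / sliceSize;
        if (slices < 2) return null;
        int mid = start + (slices / 2) * sliceSize;
        SliceSpliterator prefix = new SliceSpliterator(memory, sliceSize, start, mid);
        start = mid;          // this spliterator keeps the second half
        return prefix;
    }

    @Override
    public long estimateSize() {
        return (end - start + sliceSize - 1) / sliceSize;
    }

    @Override
    public int characteristics() {
        return ORDERED | SIZED | SUBSIZED | NONNULL;
    }
}

final class ParallelSum {
    /** Sums all bytes, one slice per task, using a parallel stream. */
    static long sum(ByteBuffer memory, int sliceSize) {
        return StreamSupport.stream(new SliceSpliterator(memory, sliceSize), true)
                .mapToLong(slice -> {
                    long s = 0;
                    while (slice.hasRemaining()) s += slice.get();
                    return s;
                })
                .sum();
    }
}
```

In the real API the slices would themselves be confined segments; here that part is left out and only the splitting is shown.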
>>> *Unsafe hatch*
>>> The new MemorySegment::ofNativeRestricted factory allows creation of
>>> a memory segment without an explicit thread owner. Now, this factory is
>>> meant to be used for unsafe use cases (e.g. those originating from
>>> native interop), and clients of this API will have to provide
>>> explicit opt-in (e.g. a command line flag) in order to use it ---
>>> since improper uses of the segments derived from it can lead to hard
>>> VM crashes. So, while this option is certainly powerful, it cannot be
>>> considered a _safe_ option to deal with shared memory segments and,
>>> at best, it merely provides a workaround for clients using other
>>> existing unsafe API points (such as Unsafe::invokeCleaner).
>>> *GC to the rescue*
>>> What if we wanted a truly shared segment which could be accessed by
>>> any thread w/o restrictions? Currently, the only way to do that is to
>>> let the segment be GC-managed (as already happens with byte buffers);
>>> this gives up one of the principles of the foreign memory access API:
>>> deterministic deallocation. While this is a fine fallback solution,
>>> this also inherits all the problems that are present in the
>>> ByteBuffer implementation: we will have to deal with cases where the
>>> Cleaner doesn't deallocate segments fast enough (to partially counter
>>> that, ByteBuffer implements a very complex scheme, which makes
>>> ByteBuffer::allocateDirect very expensive); furthermore, all memory
>>> accesses will need to be wrapped around reachability fences, since we
>>> don't want the cleaner to kick in in the middle of a memory access. If
>>> all else fails (see below), this is of course something we'll consider.
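As a rough illustration of the GC-managed option, the sketch below combines java.lang.ref.Cleaner with Reference.reachabilityFence. GcManagedSegment is a hypothetical class, and a byte array stands in for native memory, so the cleanup action only records that it ran. The point is the shape: cleanup is registered once, and every access ends with a fence so the object cannot be collected while the access is still in progress.

```java
import java.lang.ref.Cleaner;
import java.lang.ref.Reference;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a GC-managed segment; not an actual API.
final class GcManagedSegment {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup state must not reference the segment itself, otherwise
    // the segment would never become unreachable.
    private static final class Release implements Runnable {
        final AtomicBoolean freed = new AtomicBoolean(false);

        @Override
        public void run() {
            freed.set(true); // a real implementation would free native memory here
        }
    }

    private final Release release = new Release();
    private final byte[] memory;   // stands in for off-heap memory

    GcManagedSegment(int size) {
        this.memory = new byte[size];
        CLEANER.register(this, release); // runs some time after this is unreachable
    }

    byte get(int offset) {
        try {
            return memory[offset];
        } finally {
            // Keeps this segment reachable until the access has completed.
            Reference.reachabilityFence(this);
        }
    }

    void put(int offset, byte value) {
        try {
            memory[offset] = value;
        } finally {
            Reference.reachabilityFence(this);
        }
    }
}
```

There is no close() here at all: deallocation happens whenever the cleaner gets around to it, which is exactly the loss of determinism discussed above.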
>>> *Other (experimental) solutions*
>>> Another approach we're considering is a variation of a scheme
>>> proposed originally by Andrew Haley [2], which uses GC safepoints as a
>>> way to prove that no thread is accessing memory when the close
>>> operation happens. What we are investigating is whether the
>>> cost of this solution (which would require a stop-the-world pause)
>>> can be ameliorated by using thread-local GC handshakes [3]. If this
>>> could be pulled off, that would of course provide the most natural
>>> extension for the memory access API in the multi-threaded case:
>>> safety and efficiency would be preserved, and a small price would be
>>> paid in terms of the performance of the close() operation (which is
>>> something we can live with).
>>> Another experimental solution we're considering is to relax the
>>> confinement constraint so that more coarse-grained confinement units
>>> can also be associated with segments. For instance, Loom is
>>> considering the inclusion of an unbounded executor service [4], which
>>> can be used to schedule fibers. What if we could create a memory
>>> segment that is confined to one such executor service? This way, we
>>> could achieve safety by having the close() operation wait until all
>>> the threads (or fibers!) in the service have completed.
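A very rough sketch of what executor-level confinement might look like with today's java.util.concurrent types follows. ExecutorConfinedSegment is hypothetical, it uses an ordinary ExecutorService rather than anything from Loom, and it does not stop a task from leaking the buffer reference. It only shows the close() protocol: stop accepting work, wait for in-flight tasks, then release.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: a segment whose accesses all run inside one executor.
final class ExecutorConfinedSegment {
    private final ExecutorService executor;
    private final ByteBuffer memory;        // stands in for off-heap memory
    private volatile boolean alive = true;

    ExecutorConfinedSegment(ExecutorService executor, int size) {
        this.executor = executor;
        this.memory = ByteBuffer.allocate(size);
    }

    /** Schedules an access; rejected by the executor once close() has started. */
    void access(Consumer<ByteBuffer> task) {
        executor.execute(() -> task.accept(memory));
    }

    /** Waits for all scheduled accesses to finish, then marks the segment dead. */
    void close() {
        executor.shutdown();                // no new tasks are accepted
        try {
            while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {
                // keep waiting until every task has completed
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while closing", e);
        }
        alive = false;                      // a real implementation would free memory here
    }

    boolean isAlive() {
        return alive;
    }
}
```

The individual accesses carry no per-access synchronization; the whole cost is moved into close(), which may block for as long as the slowest task runs.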
>>> This should summarize where we're at pretty exhaustively. In other
>>> words, no, we did not give up on multi-threaded access, but we need
>>> to investigate more to understand what possibilities are available to
>>> us, especially if we're willing to go lower level.
>>> [1] - https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
>>> [2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
>>> [3] - https://openjdk.java.net/jeps/312
>>> [4] - https://github.com/openjdk/loom/commit/f21d6924