Thoughts about SubstrateVM GC
Peter B. Kessler
Peter.B.Kessler at Oracle.COM
Fri Mar 1 00:49:02 UTC 2019
Mostly "what Christian said". Some additional details and comments inline.
On 02/28/19 02:09 PM, Christian Wimmer wrote:
> Hi Roman,
> Thanks for your interest in Substrate VM!
> On 2/28/19 12:59, Roman Kennke wrote:
>> Hello all,
>> I hope this is the right mailing list to discuss SubstrateVM? If not,
>> please redirect me.
> There is no better list, at least at this time.
>> During the last couple of days, I did have a closer look at
>> SubstrateVM's GC, and also did some experiments. I would like to
>> summarize what I found (so that you can correct me if I'm wrong), and
>> make a case for some improvements that I would like to work on.
>> Here's my findings so far:
>> Substrate GC is a 2-generation, STW, single-threaded GC.
>> The young generation is a single space. When collected, all live objects
>> get scavenged out, directly into the old generation.
>> The old generation is 2 semispaces (actually, 4 with the 2 pinned
>> spaces, which I'll ask about later). When collected, live objects get
>> scavenged back-and-forth between the two spaces.
> yes, in theory. In "reality" there is no contiguous memory for the spaces, so after a GC all the aligned chunks of the from-space are either returned to the OS or cached and immediately reused for young generation allocation.
>> Is that correct so far?
>> It seemed a bit weird at first to write a Java GC in Java language. :-)
> It makes some things easier, e.g., the object layout code used by the GC can immediately be used in other parts of the VM and the compiler. But in the end there is C-style memory access of course to actually process the objects, and that code is more verbose in Java.
>> I analyzed a bit of generated assembly code, comparing it side-by-side
>> with corresponding Java code, and was actually quite impressed by it.
>> It's also got room for improvements, but that was not the major
>> bottleneck. The single major bottleneck in my experiments was waiting
>> for loads of the mark word during scavenging, in other words, it's doing
>> way too much of it ;-)
>> I have noticed a bunch of problems so far:
>> - The promotion rate between young-gen and old-gen seems fairly hot.
>> This is because there is no notion of tenuring objects or so.
>> - This implies that there are relatively many old-gen collections
>> happening, which seriously affect application throughput (once they happen)
>> - Because of the above, the usual wisdoms from other GCs don't apply: I
>> could get significant improvements (i.e. fewer diving into full-GCs) by
>> configuring a small young-gen (like 10%) and large old-gen (like 90%).
>> But that's not really great either.
> Our "design goal" was to start out with the simplest possible GC implementation that is viable (i.e., generational). That was the starting point, and we have not had time yet to do something better. But we know something better is certainly needed.
>> - The policy when to start collecting seems a bit unclear to me. In my
>> understanding, there is (almost) no use (for STW GCs) in starting a
>> collection before allocation space is exhausted. Which means, it seems
>> an obvious trigger to start collection on allocation failure. Yet, the
>> policies I'm looking at are time-based or time-and-space-based. I said
>> 'almost' because the single use for time-based collection would be
>> periodic GC that is able to knock out lingering garbage during
>> no/little-allocation phase of an application, and then only when the GC
>> is also uncommitting the resulting unused pages (which, afaics,
>> Substrate GC would do: bravo!). But that doesn't seem to be the point of
>> the time-based policies: it looks like the goal of those policies is to
>> balance time spent in young-vs-old-gen collections.?!
> The young generation size is fixed, and a collection is started when this space is filled. So from that point of view, the starting trigger of a GC is always space based.
> The policy whether to do an incremental (young) collection or a full collection is time based (but you can easily plug in any policy you want). The goal is to balance time between incremental and full collection. We certainly don't want to fill up the old generation because that is the maximum heap size, and it is by default provisioned to be very big.
The young generation size is variable, at the granularity of chunks. (And the chunk size is variable, at the granularity of powers of 2.) The current defaults were based on not enough real example applications. There is no hard "allocation failure in the young generation".
Most allocation goes fast-path, into chunks handed out to individual threads (AllocationSnippets.newInstance). When fast-path allocation exhausts a chunk, the slow-path code (ThreadLocalAllocation.allocateNewInstance) asks if a collection should be done before the allocation. That invokes the current implementation of HeapPolicy.CollectOnAllocationPolicy.maybeCauseCollection, and one can write a policy of one's choice. The current default allocation policy (HeapPolicy.CollectOnAllocationPolicy.Sometimes.maybeCauseCollection) is to look at the bytes allocated since the last collection and compare it to the current value of the maximum young generation size: imposing an artificial young generation size limit. Time is not involved in any of the policies we currently have for when to request collections.
Once a collection has been requested, there are separate decisions as to whether an incremental collection should be done (CollectionPolicy.collectIncrementally), and then a decision as to whether a complete collection should be done (CollectionPolicy.collectCompletely). The current collection policy is CollectionPolicy.ByTime, which as you observed, tries to balance the accumulated time in incremental collections with the time in complete collections. (See https://dl.acm.org/citation.cfm?id=1542433.) Probably a better default collection policy would be CollectionPolicy.BySpaceAndTime, which, as you want, allows the old generation to fill up (to the current value of the minimum size of the heap). So far, our users have requested small heaps rather than reduced collection time. (Well, they *ask* for both! :-)
>> With a little bit of distance and squinting of eyes, one can see that
>> Substrate GC's young generation is really what is called 'nursery space'
>> elsewhere, which aims to reduce the rate at which objects get introduced
>> into young generation. And the old generation is really what is usually
>> called young generation elsewhere. What's missing is a true old
>> generation space.
> Not really, because the young generation can be collected independently, i.e., there are the generational write barriers, remembered sets, ...
> So the young generation is reduced to the nursery space, but I argue the old generation is really an old generation.
An intermediate generation between a young generation and an old generation would allow us to collect medium-lived objects without a full collection (via remembered sets in the old generation). (Cf. HotSpot survivor spaces, which are effective even if they are difficult to explain to users.) A small matter of programming.
>> Considering all this, I would like to propose some improvements:
>> - Introduce a notion of tenuring objects. I guess we need like 2 age
>> bits in the header or elsewhere for this. Do we have that room?
> You don't need the age bits in the header. You can easily go from the object to the aligned chunk that the object is in (we do that all the time, for example for the write barrier to do the card marking), and store the age in the chunk header. Requiring all objects in one chunk to have the same age is not much of a limitation.
> Adding tenuring is definitely necessary to achieve reasonable GC performance.
>> - Implement a true old-space (and rename the existing young to nursery
>> and old to young?). In my experience, sliding/mark-compact collection
>> using a mark bitmap works best for this: it tends to create a 'sediment'
>> of permanent/very-long-lived objects at the bottom which would never get
>> copied again. Using a bitmap, walking of live objects (e.g. during
>> copying, updating etc) would be very fast: much faster than walking
>> objects by their size.
> A mark-and-compact old generation algorithm definitely makes sense. Again, the only reason why we don't have it yet is that no one had time to implement it.
> Mark-and-compact is also great to reduce memory footprint. Right now, during GC the memory footprint can double because of the temporary space for copying.
Note that the marking bitmaps would have to be per-chunk, the way the remembered sets are per-chunk. There is no larger contiguous address range to leverage. If you slide the live objects to one end of a chunk you would still not be able to free the chunk unless it were completely empty. So you would have to slide between chunks. There is no inherent ordering between the chunks of a generation, but we could impose one, e.g., to get your "sedimentation". One would have to be careful about chunk boundaries, and Murphy would size objects so the "compaction" took up more space than the original, but it should mostly work out in practice.
>> - I am not totally sure about the policies. My current thinking is that
>> this needs some cleanup/straightening-out, or maybe I am
>> misunderstanding something there. I believe (fairly strongly) that
>> allocation failure is the single useful trigger for STW GC, and on top
>> of that an (optional) periodic GC trigger that would kick in after X
>> (milli)seconds no GC.
Ah, another constraint: we used to think that SubstrateVM would be useful in single-threaded, and maybe "tickless" applications. No background collection threads, no alarms. You could write a policy to start a collection after a time interval, but the instigation might still want to be failing over into slow-path allocation. A strictly time-based collection policy would certainly be possible, but you might disappoint applications that were in the middle of servicing a request.
> As I mentioned above, the GC trigger is allocation failure for the young generation.
>> - Low-hanging-fruit improvement that could be done right now: allocate
>> large objects(arrays) straight into old-gen instead of copying them
>> around. Those are usually long-lived anyway, and copying them
>> back-and-forth just costs CPU time for no benefit. This will become even
>> more pronounced with a true old-gen.
> Large arrays are allocated separately in unaligned chunks. Such arrays are never copied, but only logically moved from the young generation into the old generation. An unaligned chunk contains exactly one large array.
>> Oh and a question: what's this pinned object/chunks/spaces all about?
> There are two mechanisms right now to get objects that are never moved by the GC:
> 1) A "normal" object can be temporarily pinned using org.graalvm.nativeimage.PinnedObject. The current implementation then keeps the whole aligned chunk that contains the object alive, i.e., it is designed for pinnings that are released quickly so that no objects are actually ever pinned when the GC runs, unless the GC runs in an unlucky moments. We use such pinning for example to pass pointers into byte arrays directly to C functions without copying.
> 2) A PinnedAllocator can be used to get objects that are non-moving for a long period of time. This is currently used for the metadata of runtime compiled code. We are actively working to make PinnedAllocator unnecessary by putting the metadata into C memory, and then hopefully we can remove PinnedAllocator and all code that is necessary for it in the GC, i.e., the notion of pinned spaces you mentioned before.
Pinned objects also get in the way of "sliding compaction". If one is unlucky and there are pinned objects when the collection starts.
>> What do you think about all this? Somebody else might have thought about
>> all this already, and have some insights that I don't have in my naive
>> understanding? Maybe some of it is already worked on or planned? Maybe
>> there are some big obstactles that I don't see yet, that make it less
> We certainly have ideas and plans, and they match your observations. If you are interested in contributing, we can definitely give you some guidance so that you immediately work into the right direction.
More information about the graal-dev