JEP 132: More-prompt finalization

Kirk Pepperdine kirk.pepperdine at
Sat May 30 05:20:20 UTC 2015

Hi Moh,

> However, I was hoping this would have the effect of improving (non-finalizable) reference handling. We've seen serious issues in WeakReference handling and have had to write some twisted code to deal with this.

Better reference life-cycle handling would actually be beneficial IMHO as many cache implementations suffer because of certain aspects of the current implementation. SoftReference is very difficult to use.

> So I guess the question I have to Kirk and David is: do you feel a GC load of 10K WeakReferences per cycle is also "doing something else wrong"?

Hard to say, as this really has to be evaluated on a case-by-case basis. But I’d wonder if the WeakReference was actually needed if you are recycling them so quickly.


> Sorry if this is going off-topic.
> Thanks
> Moh
>> -----Original Message-----
>> From: core-libs-dev [mailto:core-libs-dev-bounces at] On Behalf
>> Of Kirk Pepperdine
>> Sent: Thursday, May 28, 2015 11:58 PM
>> To: david.holmes at Holmes
>> Cc: hotspot-gc-dev at; core-libs-
>> dev at
>> Subject: Re: JEP 132: More-prompt finalization
>> Hi Peter,
>> It is a very interesting proposal but, to further David's comments, the
>> life-cycle cost of reference objects is horrendous, and the actual process of
>> finalizing an object is only a fraction of that total cost. Unfortunately your
>> micro-benchmark only focuses on one aspect of that cost. In other words, it
>> isn't very representative of a real concern. In the real world the finalizer
>> *must* compete with mutator threads and since F-J is an "all threads on deck"
>> implementation, it doesn't play well with others. It creates a "tragedy of the
>> commons", that is, a situation where everyone behaves rationally with a common
>> resource but to the detriment of the whole group. In short, parallelizing
>> (F-Jing) *everything* in an application is simply not a good idea. We do not
>> live in an infinite compute environment, which means we have to consider the
>> impact of our actions on the entire group.
>> This was one of the points of my recent article in Java Magazine which I wrote
>> to try to counter some of the rhetoric I was hearing at conferences about the
>> universal benefits of being able to easily parallelize streams in Java 8. Yes, I
>> agree it's a great feature but it must be used with discretion. Case in point.
>> After I finished writing the article, I started running into a couple of early
>> adopters that had swallowed the parallel message whole, indiscriminately
>> parallelizing all of their streams. As you can imagine, they were quite
>> surprised by the results and quickly worked to de-parallelize *all* of the
>> streams in the application.
>> To add some ability to parallelize the handling of reference objects seems
>> like a good idea if you are collecting large numbers of reference objects
>> (>10,000 per GC cycle). However if you are collecting large numbers of
>> reference objects you're most likely doing something else wrong. IME,
>> finalization is extremely useful but really only for a limited number of use
>> cases and none of them (to date) have resulted in the app burning through
>> 1000s of final objects / sec.
>> It would be interesting to know why you picked this particular issue.
>> Kind regards,
>> Kirk
>> On May 29, 2015, at 5:18 AM, David Holmes <david.holmes at> wrote:
>>> Hi Peter,
>>> I guess I'm very concerned about the premise that finalization should scale
>>> to millions of objects and be performed highly concurrently. To me that's
>>> sending the wrong message about finalization. It also isn't the most effective
>>> use of CPU resources - most people would want to do useful work on most CPUs
>>> most of the time.
>>> Cheers,
>>> David
>>> On 29/05/2015 3:12 AM, Peter Levart wrote:
>>>> Hi,
>>>> Did you know that the following simple loop:
>>>> public class FinalizableBottleneck {
>>>>    static boolean no;
>>>>    @Override
>>>>    protected void finalize() throws Throwable {
>>>>        // empty finalize() method does not make the object finalizable
>>>>        // (it is not even registered on the finalizer's list)
>>>>        if (no) {
>>>>            throw new AssertionError();
>>>>        }
>>>>    }
>>>>    public static void main(String[] args) {
>>>>        while (true) {
>>>>            new FinalizableBottleneck();
>>>>        }
>>>>    }
>>>> }
>>>> ...quickly fills the entire heap with FinalizableBottleneck and internal
>>>> Finalizer objects and brings the JVM to a halt? After a few seconds of
>>>> running the above program, jmap -histo:live reports:
>>>> num     #instances         #bytes  class name
>>>> ----------------------------------------------
>>>>   1:      50048325     2001933000  java.lang.ref.Finalizer
>>>>   2:      50048278      800772448  FinalizableBottleneck
>>>> There are a couple of bottlenecks that make this happen:
>>>> - ReferenceHandler thread synchronizes with VM to unhook Reference(s)
>>>> from the pending chain one by one and dispatches them to their respective
>>>> ReferenceQueue(s) which also use synchronization for enqueueing each
>>>> Reference.
>>>> - Enqueueing synchronizes with the finalization thread which removes the
>>>> Finalizer(s) (FinalReferences) from the finalization queue and executes
>>>> them.
>>>> - Executing the Finalizer(s) removes them from the doubly-linked list of
>>>> all Finalizer(s) which is used to retain them until they are needed and
>>>> this synchronizes with the threads that link new Finalizer(s) into the
>>>> doubly-linked list as new finalizable objects get registered.
>>>> We see that the creation of a finalizable object only takes one
>>>> synchronization (registering into the doubly-linked list) and is
>>>> performed synchronously, while finalization takes 4 synchronizations
>>>> among 4 different threads (in pairs) and happens when the Finalizer
>>>> instance "travels" over from VM thread to ReferenceHandler thread and
>>>> then to finalization thread. No wonder that finalization can not keep up
>>>> with allocation in a single thread. The situation is even worse when
>>>> finalize() methods do some actual work.
>>>> I have experimented with various approaches to widen these bottlenecks
>>>> and found out that I can not beat the ForkJoinPool when combined with
>>>> some improvements to internal data structures used in reference
>>>> processing. Here's a prototype I came up with:
>>>> And this is the benchmark I use for measuring the throughput:
>>>> The benchmark shows (results inline in source) that using unpatched JDK,
>>>> on my PC (i7-2700K, Linux, JDK8) I can not construct more than 1500
>>>> finalizable objects per ms in a single thread and that while doing so,
>>>> finalization only manages to process approx. 100 - 120 objects at the
>>>> same time. Objects "in-flight" quickly accumulate and bring the VM to a
>>>> halt, where it is not doing anything but full GC cycles.
>>>> When constructing in 4 threads, there's not much difference.
>>>> Construction of finalizable objects simply doesn't scale.
>>>> Patched JDK shows something completely different. Single thread
>>>> construction achieves a rate of 3600 objects / ms. Number of "in-flight"
>>>> objects is kept constant at about 5-6M instances which amounts to approx
>>>> 1.5 s of allocation. I think this is about the rate of GC cycles during
>>>> which VM also processes the references. The benchmark also shows the
>>>> ForkJoinPool statistics, which show that the number of queued tasks is
>>>> also kept low.
>>>> Increasing the allocation threads to 4 increases allocation rate to
>>>> about 4300 objects / ms and finalization keeps up. Increasing allocation
>>>> threads to 8, further increases allocation rate to about 4600 objects /
>>>> ms and finalization still keeps up. The increase in rate is not linear,
>>>> but keep in mind that i7 is a 4-core CPU.
>>>> About the implementation...
>>>> The first improvement I made was to the doubly-linked list of Finalizer
>>>> instances that is used to keep them alive until they are needed. I
>>>> ripped-off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
>>>> Buchholz and just kept the internal link/unlink methods while
>>>> specializing them to Finalizer entries (very straight-forward). I
>>>> experimented with throughput and got some improvement, but throughput
>>>> has increased much more when I used several instances of independent
>>>> lists and distributed registrations among them randomly (unlinking
>>>> consequently is also distributed randomly).
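
The idea of several independent lists with random distribution can be sketched roughly as below. This is only an illustration and not the code from the prototype: the class name StripedRegistry and its methods are hypothetical, and it uses the stock ConcurrentLinkedDeque, whose remove(Object) scans the deque, whereas the prototype is described as unlinking a known entry directly.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: registrations spread over several independent lists. */
final class StripedRegistry<T> {
    private final ConcurrentLinkedDeque<T>[] stripes;

    @SuppressWarnings("unchecked")
    StripedRegistry(int stripeCount) {
        stripes = new ConcurrentLinkedDeque[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ConcurrentLinkedDeque<>();
        }
    }

    /** Links the entry into a randomly chosen stripe and returns that stripe's index. */
    int register(T entry) {
        int idx = ThreadLocalRandom.current().nextInt(stripes.length);
        stripes[idx].add(entry);
        return idx;
    }

    /** Unlinks the entry from the stripe it was registered in (linear scan here). */
    boolean unregister(int stripe, T entry) {
        return stripes[stripe].remove(entry);
    }

    /** Total number of registered entries across all stripes. */
    int size() {
        int n = 0;
        for (ConcurrentLinkedDeque<T> d : stripes) {
            n += d.size();
        }
        return n;
    }
}
```

Because each entry lands in a random stripe, concurrent registrations and unlinks mostly touch different lists, which is presumably where the extra throughput comes from.
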
>>>> I found out that no matter how hard I try to optimize ReferenceQueue
>>>> while keeping the API unchanged, I can only do so much and that was not
>>>> enough. I have been surprised by how well ForkJoinPool distributes tasks
>>>> among threads, so I concluded that leveraging it is the best choice. I
>>>> re-designed the pending-list unhooking loop to unhook pending references
>>>> in chunks which greatly improves the throughput. Since unhooking can be
>>>> performed by a single thread while holding a lock which is mandated by
>>>> interface between VM and Java, I didn't employ multiple threads, but a
>>>> single eternal ForkJoinTask that unhooks in chunks and forks-off other
>>>> processing tasks that process chunks. When there are just a couple of
>>>> References pending at one time and a not-full chunk is unhooked, then
>>>> the processing is performed by the same thread that unhooked the
>>>> references, but when there are more, worker tasks are forked off and the
>>>> unhooking thread continues at full pace. This processing includes
>>>> execution of Cleaners, forking the finalizer tasks and enqueue-ing other
>>>> references. Finalizer(s) are always executed as separate ForkJoinTask(s).
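
A minimal sketch of the chunking scheme described above, under stated assumptions: ChunkedDrain, CHUNK_SIZE and the use of plain Runnable items are hypothetical stand-ins, the pending chain is modelled as an ordinary Queue, and the VM lock is left out. A full chunk is handed to the pool; a partial chunk is processed by the draining thread itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

/** Hypothetical sketch of draining a pending queue in chunks. */
final class ChunkedDrain {
    static final int CHUNK_SIZE = 64; // assumed value, not taken from the prototype

    /**
     * Drains the queue from a single thread. Full chunks are submitted to the
     * pool; a final partial chunk is run inline. Returns the forked tasks so
     * the caller can join them.
     */
    static List<ForkJoinTask<?>> drain(Queue<Runnable> pending, ForkJoinPool pool) {
        List<ForkJoinTask<?>> forked = new ArrayList<>();
        while (!pending.isEmpty()) {
            List<Runnable> chunk = new ArrayList<>(CHUNK_SIZE);
            Runnable r;
            while (chunk.size() < CHUNK_SIZE && (r = pending.poll()) != null) {
                chunk.add(r);
            }
            if (chunk.size() == CHUNK_SIZE) {
                // Enough work to be worth handing off; keep unhooking meanwhile.
                Runnable work = () -> chunk.forEach(Runnable::run);
                forked.add(pool.submit(work));
            } else {
                // Only a few items pending: process them in the draining thread.
                chunk.forEach(Runnable::run);
            }
        }
        return forked;
    }
}
```

With 150 pending items and a chunk size of 64, two chunks are forked and the remaining 22 items are run inline.
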
>>>> It's interesting how Runtime.runFinalization() is implemented in this
>>>> patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
>>>> I also tweaked the ReferenceQueue implementation a bit (it is still used
>>>> for other kinds of references) so that it avoids synchronization with a
>>>> monitor lock when there are no blocking waiters and uses CAS to
>>>> enqueue/dequeue. This improves throughput when the queue is not empty.
>>>> Since in the prototype multiple threads can enqueue into the same queue,
>>>> I thought this would improve throughput in such situations.
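
For readers unfamiliar with CAS-based enqueue/dequeue, here is a rough sketch of the general technique, not the patched ReferenceQueue itself. CasQueue and Node are hypothetical names, the structure is a simple head-linked (LIFO) list in the style of a Treiber stack, and the blocking-waiter path mentioned above is omitted entirely.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a lock-free, head-linked queue using CAS. */
final class CasQueue<T> {
    private static final class Node<T> {
        final T item;
        final Node<T> next;

        Node(T item, Node<T> next) {
            this.item = item;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    /** Pushes the item at the head; retries if another thread won the race. */
    void enqueue(T item) {
        Node<T> h;
        do {
            h = head.get();
        } while (!head.compareAndSet(h, new Node<>(item, h)));
    }

    /** Removes and returns the head item, or null if the queue is empty. */
    T poll() {
        Node<T> h;
        do {
            h = head.get();
            if (h == null) {
                return null;
            }
        } while (!head.compareAndSet(h, h.next));
        return h.item;
    }
}
```

No monitor is taken on either path, so several threads can enqueue into the same queue without blocking each other; a failed compareAndSet simply retries against the new head.
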
>>>> Comments, suggestions, criticism are welcome.
>>>> Regards, Peter