JEP 132: More-prompt finalization

David Holmes david.holmes at oracle.com
Fri May 29 02:18:06 UTC 2015


Hi Peter,

I guess I'm very concerned about the premise that finalization should 
scale to millions of objects and be performed highly concurrently. To me 
that's sending the wrong message about finalization. It also isn't the 
most effective use of CPU resources - most people would want to do 
useful work on most CPUs most of the time.

Cheers,
David

On 29/05/2015 3:12 AM, Peter Levart wrote:
> Hi,
>
> Did you know that the following simple loop:
>
> public class FinalizableBottleneck {
>      static boolean no;
>
>      @Override
>      protected void finalize() throws Throwable {
>          // empty finalize() method does not make the object finalizable
>          // (it is not even registered on the finalizer's list)
>          if (no) {
>              throw new AssertionError();
>          }
>      }
>
>      public static void main(String[] args) {
>          while (true) {
>              new FinalizableBottleneck();
>          }
>      }
> }
>
>
> ...quickly fills the entire heap with FinalizableBottleneck and internal
> Finalizer objects and brings the JVM to a halt? After a few seconds of
> running the above program, jmap -histo:live reports:
>
>   num     #instances         #bytes  class name
> ----------------------------------------------
>     1:      50048325     2001933000  java.lang.ref.Finalizer
>     2:      50048278      800772448  FinalizableBottleneck
>
>
> There are a couple of bottlenecks that make this happen:
>
> - ReferenceHandler thread synchronizes with VM to unhook Reference(s)
> from the pending chain one by one and dispatches them to their respective
> ReferenceQueue(s) which also use synchronization for enqueueing each
> Reference.
> - Enqueueing synchronizes with the finalization thread which removes the
> Finalizer(s) (FinalReferences) from the finalization queue and executes
> them.
> - Executing the Finalizer(s) removes them from the doubly-linked list of
> all Finalizer(s) which is used to retain them until they are needed and
> this synchronizes with the threads that link new Finalizer(s) into the
> doubly-linked list as new finalizable objects get registered.
>
> We see that the creation of a finalizable object only takes one
> synchronization (registering into the doubly-linked list) and is
> performed synchronously, while finalization takes 4 synchronizations
> among 4 different threads (in pairs) and happens when the Finalizer
> instance "travels" over from VM thread to ReferenceHandler thread and
> then to finalization thread. No wonder that finalization cannot keep up
> with allocation in a single thread. The situation is even worse when
> finalize() methods do some actual work.
>
> I have experimented with various approaches to widen these bottlenecks
> and found out that I cannot beat the ForkJoinPool when combined with
> some improvements to internal data structures used in reference
> processing. Here's a prototype I came up with:
>
> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/
>
>
> And this is the benchmark I use for measuring the throughput:
>
> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java
>
>
> The benchmark shows (results inline in source) that using unpatched JDK,
> on my PC (i7-2700K, Linux, JDK8) I cannot construct more than 1500
> finalizable objects per ms in a single thread and that while doing so,
> finalization only manages to process approx. 100 - 120 objects per ms.
> Objects "in-flight" quickly accumulate and bring the VM to a
> halt, where it is not doing anything but full GC cycles.
>
> When constructing in 4 threads, there's not much difference.
> Construction of finalizable objects simply doesn't scale.
>
> Patched JDK shows something completely different. Single thread
> construction achieves a rate of 3600 objects / ms. Number of "in-flight"
> objects is kept constant at about 5-6M instances which amounts to approx
> 1.5 s of allocation. I think this is about the rate of GC cycles during
> which VM also processes the references. The benchmark also prints
> ForkJoinPool statistics, which show that the number of queued tasks is
> also kept low.
>
> Increasing the allocation threads to 4 increases the allocation rate to
> about 4300 objects / ms and finalization keeps up. Increasing allocation
> threads to 8 further increases the allocation rate to about 4600 objects /
> ms and finalization still keeps up. The increase in rate is not linear,
> but keep in mind that the i7 is a 4-core CPU.
>
> About the implementation...
>
> The first improvement I made was to the doubly-linked list of Finalizer
> instances that is used to keep them alive until they are needed. I
> ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
> Buchholz and just kept the internal link/unlink methods while
> specializing them to Finalizer entries (very straightforward). I
> experimented with throughput and got some improvement, but throughput
> increased much more when I used several independent lists and
> distributed registrations among them randomly (unlinking is
> consequently also distributed randomly).
>
> I found out that no matter how hard I try to optimize ReferenceQueue
> while keeping the API unchanged, I can only do so much and that was not
> enough. I have been surprised by how well ForkJoinPool distributes tasks
> among threads, so I concluded that leveraging it is the best choice. I
> re-designed the pending-list unhooking loop to unhook pending references
> in chunks which greatly improves the throughput. Since unhooking can
> only be performed by one thread at a time, while holding a lock that is
> mandated by the interface between VM and Java, I didn't employ multiple
> threads, but a single eternal ForkJoinTask that unhooks in chunks and
> forks off other
> processing tasks that process chunks. When there are just a couple of
> References pending at one time and a not-full chunk is unhooked, then
> the processing is performed by the same thread that unhooked the
> references, but when there are more, worker tasks are forked off and the
> unhooking thread continues at full pace. This processing includes
> execution of Cleaners, forking the finalizer tasks and enqueueing other
> references. Finalizer(s) are always executed as separate ForkJoinTask(s).
>
> It's interesting how Runtime.runFinalization() is implemented in this
> patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
>
> I also tweaked the ReferenceQueue implementation a bit (it is still used
> for other kinds of references) so that it avoids synchronization with a
> monitor lock when there are no blocking waiters and uses CAS to
> enqueue/dequeue. This improves throughput when the queue is not empty.
> Since in the prototype multiple threads can enqueue into the same queue,
> I thought this would improve throughput in such situations.
>
> Comments, suggestions, criticism are welcome.
>
> Regards, Peter
>
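To make the "several independent lists, registrations distributed
randomly" idea from Peter's description concrete, here is a minimal
sketch. It is an illustration only, not code from the webrev:
StripedRegistry, Stripe and Node are made-up names, and each stripe is
guarded by a plain monitor instead of the lock-free link/unlink derived
from ConcurrentLinkedDeque that the prototype is described as using.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: keeps registered objects reachable in N independent lists. */
final class StripedRegistry {

    /** One entry in a stripe's doubly-linked list (plays the role of a Finalizer). */
    static final class Node {
        final Object referent;
        final Stripe owner;      // remembered so that unlinking knows which list to touch
        Node prev, next;
        boolean linked;

        Node(Object referent, Stripe owner) {
            this.referent = referent;
            this.owner = owner;
        }
    }

    /** One independent list with its own lock (assumption: monitor, not lock-free). */
    static final class Stripe {
        private Node head;
        private int size;

        synchronized void link(Node n) {
            n.next = head;
            if (head != null) head.prev = n;
            head = n;
            n.linked = true;
            size++;
        }

        synchronized boolean unlink(Node n) {
            if (!n.linked) return false;          // already removed
            if (n.prev != null) n.prev.next = n.next; else head = n.next;
            if (n.next != null) n.next.prev = n.prev;
            n.prev = n.next = null;
            n.linked = false;
            size--;
            return true;
        }

        synchronized int size() { return size; }
    }

    private final Stripe[] stripes;

    StripedRegistry(int stripeCount) {
        stripes = new Stripe[stripeCount];
        for (int i = 0; i < stripeCount; i++) stripes[i] = new Stripe();
    }

    /** Registers the object in a randomly chosen stripe. */
    Node register(Object referent) {
        Stripe s = stripes[ThreadLocalRandom.current().nextInt(stripes.length)];
        Node n = new Node(referent, s);
        s.link(n);
        return n;
    }

    /** Removes the node from whichever stripe it was registered in. */
    boolean unregister(Node n) { return n.owner.unlink(n); }

    int size() {
        int total = 0;
        for (Stripe s : stripes) total += s.size();
        return total;
    }
}
```

Because each node remembers its stripe, unlinking contends only with
registrations that happened to pick the same list.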

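The chunked unhooking Peter describes (one thread takes entries off the
pending list under a lock, hands full chunks to the ForkJoinPool and
processes a partial chunk itself) might look roughly like the sketch
below. Everything here is an assumption made for illustration:
ChunkedDrain and CHUNK_SIZE are invented, pending references are
modelled as plain Runnables, and drain() runs on the calling thread
rather than as the single eternal ForkJoinTask of the prototype.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ForkJoinPool;

/** Hypothetical sketch of unhooking in chunks and forking off full chunks. */
final class ChunkedDrain {
    static final int CHUNK_SIZE = 64;               // assumed value, not from the patch

    private final Object lock = new Object();       // stands in for the VM/Java pending-list lock
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private final ForkJoinPool pool;

    ChunkedDrain(ForkJoinPool pool) { this.pool = pool; }

    /** Adds one unit of pending work (the VM's job in the real thing). */
    void add(Runnable work) {
        synchronized (lock) { pending.add(work); }
    }

    /** Takes up to CHUNK_SIZE entries while holding the lock once. */
    private List<Runnable> unhookChunk() {
        List<Runnable> chunk = new ArrayList<>(CHUNK_SIZE);
        synchronized (lock) {
            Runnable r;
            while (chunk.size() < CHUNK_SIZE && (r = pending.poll()) != null) {
                chunk.add(r);
            }
        }
        return chunk;
    }

    /** Drains everything currently pending. */
    void drain() {
        while (true) {
            List<Runnable> chunk = unhookChunk();
            if (chunk.isEmpty()) return;
            if (chunk.size() == CHUNK_SIZE) {
                pool.execute(() -> process(chunk)); // full chunk: hand off, keep unhooking
            } else {
                process(chunk);                     // partial chunk: do it on this thread
            }
        }
    }

    private static void process(List<Runnable> chunk) {
        for (Runnable r : chunk) r.run();
    }
}
```

Waiting until all forked chunks have finished is then a matter of
calling pool.awaitQuiescence(timeout, unit) after drain(), which is the
ForkJoinPool call Peter mentions.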

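For the ReferenceQueue tweak (CAS instead of a monitor lock for
enqueue/dequeue), a minimal sketch of the non-blocking part could be a
CAS-updated linked list like the one below. CasQueue is a made-up name,
this is not the patched ReferenceQueue, and the hand-off to blocking
waiters that the prototype still has to deal with is left out entirely.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: lock-free enqueue/poll on a singly-linked list via CAS. */
final class CasQueue<T> {

    private static final class Node<E> {
        final E value;
        Node<E> next;
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    /** Links the new node in front of the current head; retries if another thread won. */
    void enqueue(T value) {
        Node<T> n = new Node<>(value);
        Node<T> h;
        do {
            h = head.get();
            n.next = h;
        } while (!head.compareAndSet(h, n));
    }

    /** Unlinks and returns the most recently enqueued value, or null if empty. */
    T poll() {
        Node<T> h;
        do {
            h = head.get();
            if (h == null) return null;
        } while (!head.compareAndSet(h, h.next));
        return h.value;
    }
}
```

Entries come back newest-first in this sketch. No monitor is taken on
either path, so concurrent enqueuers only retry a CAS instead of
queueing up on a lock.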
More information about the hotspot-gc-dev mailing list