Low-Overhead Heap Profiling

Tony Printezis tprintezis at twitter.com
Thu Jun 25 20:28:50 UTC 2015

Hi Jeremy,


On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremymanson at google.com) wrote:

On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <tprintezis at twitter.com> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremymanson at google.com) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  

As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)

My fault - I misread it.  Doesn't your proposal miss out of TLAB allocs entirely

This is correct: We’ll also have to intercept the outside-TLAB allocs. But, IMHO, this is a feature as it’s helpful to know how many (and which) allocations happen outside TLABs. These are generally very infrequent (and slow anyway), so sampling all of those, instead of only sampling some of them, does not have much of an overhead. But, you could also do sampling for the outside-TLAB allocs too, if you want: just accumulate their size on a separate per-thread counter and sample the one that bumps that counter goes over a limit.

An additional observation (orthogonal to the main point, but I thought I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also know which generation the object ended up in (e.g., young gen or direct-to-old-gen). This is very helpful in some situations when you’re trying to work out which allocation(s) grew the old gen occupancy between two young GCs.

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically generalizes this and removes the 1-1 relationship between object alloc sampling and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse users, you can of course easily just combine the information (but I strongly believe it’s better to report the information separately).

(and, in fact, not work if TLAB support is turned off)? 

Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine question.)

 I might be missing something obvious (and see my response below).

All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  

Maybe on intel is cheap, but maybe it’s not on other platforms that other folks care about.

Really?  A memory read and a subtraction?  Which architectures care about that?

I was not concerned with the read and subtraction, I was more concerned with the conditional that follows them (intel has great branch prediction).

And a personal pet peeve (based on past experience): How many “free” instructions do you have to add before they are not free any more?

Again, notice that no one has complained about the addition that was added for total bytes allocated per thread.  I note that was actually added in the 6u20 timeframe.

Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, 

Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?

Total size allocated per thread.  It isn't possible to calculate that from the TLAB because of out-of-TLAB allocation 

The allocating Thread is passed to the slow (outside-TLAB) alloc path so it would be trivial to update the per-thread allocation stats from there too (in fact, it does; see below).

(and hypothetically disabled TLABs).

Anyone cares? :-)

For some reason, they never included it in the ThreadMXBean interface, but it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.

Thanks for the tip. I’ll look into this...

and no one has complained.

I'm not worried about the ease of implementation here, because we've already implemented it.  

Yeah, but someone will have to maintain it moving forward.

I've been maintaining it internally to Google for 5 years.  It's actually pretty self-contained.  The only work involved is when they refactor something (so I've had to move it), or when a bug in the existing implementation is discovered.  It is very closely parallel to the TLAB code, which doesn't change much / at all.

The TLAB code has really not changed much for a while. ;-) (but haven’t looked at the JDK 9 source very closely though…)

It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  

You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).

Hey, you have a counter in your allocation path you've never noticed, which none of your code uses.  Pipelining is a wonderful thing.  :)

See above re: “free” instructions.

In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.

Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?

(For context: this referred to the hypothetical feature where we can provide a callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but let's call it hypothetical for the moment.)


We invoke native code.  You can't invoke any Java code during allocation, including calling JNI methods, because that would make allocation potentially reentrant, which doesn't work for all sorts of reasons.

That’s what I was worried about….

  The native code doesn't even get passed a JNIEnv * - there is nothing it can do with it without making the VM crash a lot.

So, thanks for the clarification. Being able to attach a callback to this in, say, the JVM it’d be totally fine. I was worried that you wanted to call Java. :-)

Or, rather, you might be able to do that, but it would take a lot of Hotspot rearchitecting.  When I tried to do it, I realized it would be an extremely deep dive.

I believe you. :-)

  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").

Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).

That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).

Yeah, I was thinking of doing something similar (tracking object lifetimes, and other attributes, with WeakRefs). 

But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".

CollectedHeap::common_mem_allocate_noinit() is the first-level of the slow path called when a TLAB allocation fails because the object doesn’t fit in the current TLAB. It checks (alocate_from_tlab() / allocate_from_tlab_slow()) whether to refill the current TLAB or keep the TLAB and delegate to the GC (mem_allocate()) to allocate the object outside a TLAB (either in the young or old gen; the GC might also decide to do a collection at this point if, say, the eden is full...). So, it depends on what you mean by slow path but, yes, any alloocations that go through the above path should be considered as “slow path” allocations.

One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the JFR entry point for outside-TLAB allocs) is fired from common_mem_allocate_noint(). So, if there are other non-TLAB allocation paths outside that method, that entry point has been placed incorrectly (it’s possible of course; but I think that it’s actually placed correctly).

(note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources yet, the above might have been changed)

BTW, when looking at the common_mem_allocate_noinit() code I noticed the following:

THREAD->incr_allocated_bytes(size * HeapWordSize);
(as predicted earlier)

Another one that jumps to mind is that we don't know whether the outside-of-TLAB path actually passes the sampling threshold, especially if we let users configure the sampling threshold.  So how would we know whether to sample it?

See above (IMHO: sample all of them).

You also have to keep track of the sampling interval in the code where we allocate new TLABs, in case the sampling threshold is larger than the TLAB size.  That's not a big deal, of course.

Of course, but that’s kinda trivial. BTW, one approach here would be “given that refilling a TLAB is slow anyway, always sample the first object in each TLAB irrespective of desired sampling frequence”. Another would be “don’t do that, I set the sampling frequency pretty low not to be flooded with data when the TLABs are very small”. I have to say I’m in the latter camp.

And, every time the TLAB code changes, we have to consider whether / how those changes affect this sampling mechanism.

Yes, but how often does the TLAB code change? :-)

I guess my larger point is that there are so many little corner cases with TLAB allocation, including whether it even happens, that basing the sampling strategy around it seems like a cop-out.  

There are not many little corner cases. There are two cases: allocation inside a TLAB, allocation outside a TLAB. The former is by far the most common. The latter is generally very infrequent and has a well-defined code path (I described it earlier). And, as I said, it could be very helpful and informative to treat (and account for) the two cases separately.

And my belief is that the arguments against our strategy don't really hold water, especially given the presence of the per-thread allocation counter that no one noticed.  

I’ve already addressed that.

Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck Rasbold).  All we really need is to write an acceptable JEP, to adjust the code based on the changes the community wants, and someone from Oracle willing to say "yes".

For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.

I’ll again vote for the simplicity of having a simple change in only one place (OK, two places…).

But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).

I agree that sampling based on size is the right approach.  

(And your approach is definitely simpler - I don't mean to discount it.  And if that's what it takes to get this feature accepted, we'll do it, but I'll grumble about it.)

That’s fine. :-)



Tony Printezis | JVM/GC Engineer / VM Team | Twitter

tprintezis at twitter.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20150625/300559a1/attachment-0001.html>

More information about the serviceability-dev mailing list