Low-Overhead Heap Profiling

Jeremy Manson jeremymanson at google.com
Fri Jun 26 05:27:08 UTC 2015


On Thu, Jun 25, 2015 at 1:28 PM, Tony Printezis <tprintezis at twitter.com>
wrote:

> Hi Jeremy,
>
> Inline.
>
> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremymanson at google.com)
> wrote:
>
>
>
> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <tprintezis at twitter.com>
> wrote:
>
>> Hi Jeremy,
>>
>> Please see inline.
>>
>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremymanson at google.com)
>> wrote:
>>
>> I don't want the size of the TLAB, which is ergonomically adjusted, to be
>> tied to the sampling rate.  There is no reason to do that.  I want
>> reasonable statistical sampling of the allocations.
>>
>>
>> As I said explicitly in my e-mail, I totally agree with this. Which is
>> why I never suggested to resize TLABs in order to vary the sampling rate.
>> (Apologies if my e-mail was not clear.)
>>
>
> My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs
> entirely
>
>
> This is correct: We’ll also have to intercept the outside-TLAB allocs.
> But, IMHO, this is a feature as it’s helpful to know how many (and which)
> allocations happen outside TLABs. These are generally very infrequent (and
> slow anyway), so sampling all of those, instead of only sampling some of
> them, does not have much of an overhead. But you could also sample the
> outside-TLAB allocs, if you want: just accumulate their sizes on
> a separate per-thread counter and sample the one that pushes that counter
> over a limit.
>
>
The outside-TLAB allocations generally get caught anyway, because they tend
to be large enough to jump over the sample size immediately.
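
To make the byte-counter idea concrete, here is a minimal sketch of the scheme as I understand it: each thread owns a countdown of bytes, and the allocation that takes it to zero or below is the one sampled. The class and method names are hypothetical (nothing here is HotSpot code), and a real sampler would probably randomize the interval rather than use a fixed one.

```java
// Hypothetical sketch; none of these names exist in HotSpot.
// One instance is assumed per thread, so no synchronization is needed.
final class ByteIntervalSampler {
    private final long intervalBytes;
    private long bytesUntilSample;

    ByteIntervalSampler(long intervalBytes) {
        if (intervalBytes <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.intervalBytes = intervalBytes;
        this.bytesUntilSample = intervalBytes;
    }

    /** Records an allocation; returns true if this allocation should be sampled. */
    boolean recordAllocation(long sizeBytes) {
        bytesUntilSample -= sizeBytes;
        if (bytesUntilSample > 0) {
            return false;
        }
        // The counter was exhausted by this allocation: sample it and
        // start a fresh (fixed) interval.
        bytesUntilSample = intervalBytes;
        return true;
    }
}
```

With a 512 KB interval, a single 1 MB allocation exhausts the counter on its own, which is why large outside-TLAB allocations tend to get sampled without any special handling.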


> An additional observation (orthogonal to the main point, but I thought I’d
> mention it anyway): For the outside-TLAB allocs it’d be helpful to also
> know which generation the object ended up in (e.g., young gen or
> direct-to-old-gen). This is very helpful in some situations when you’re
> trying to work out which allocation(s) grew the old gen occupancy between
> two young GCs.
>

True.  We don't have this implemented, but it would be reasonably
straightforward to glean it from the oop.


> FWIW, the existing JFR events follow the approach I described above:
>
> * one event for each new TLAB + first alloc in that TLAB (my proposal
> basically generalizes this and removes the 1-1 relationship between object
> alloc sampling and new TLAB operation)
>
> * one event for all allocs outside a TLAB
>
> I think the above separation is helpful. But if you think it could confuse
> users, you can of course easily just combine the information (but I
> strongly believe it’s better to report the information separately).
>

I do think it would make a confusing API.  It might make more sense to have
a reporting mechanism that had a set number of fields with very concrete
information (size, class, stacktrace), but allowed for platform-specific
metadata.  We end up with a very long list of things we want in the sample:
generation (how do you describe a generation?), object age (by number of
GCs survived?  What kind of GC?), was it a TLAB allocation, etc.
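
As a rough illustration of that shape (a sketch only; the class, the field names, and the metadata keys are all made up for this example), the sample could carry three concrete fields plus an open-ended map that each platform fills in as it sees fit:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a sample record: fixed, concrete fields plus
// free-form platform-specific metadata. Not an existing JVMTI or JFR type.
final class AllocationSample {
    final long sizeBytes;
    final String className;
    final List<String> stackTrace;
    final Map<String, String> platformMetadata;

    AllocationSample(long sizeBytes, String className,
                     List<String> stackTrace,
                     Map<String, String> platformMetadata) {
        this.sizeBytes = sizeBytes;
        this.className = className;
        // Defensive copies so a sample cannot change after it is reported.
        this.stackTrace = Collections.unmodifiableList(new ArrayList<>(stackTrace));
        this.platformMetadata = Collections.unmodifiableMap(new HashMap<>(platformMetadata));
    }

    /** Looks up an optional platform-specific attribute, e.g. a generation name. */
    String metadataOrDefault(String key, String fallback) {
        String value = platformMetadata.get(key);
        return value != null ? value : fallback;
    }
}
```

A VM with TLABs might add a key saying whether the allocation was in a TLAB; a VM without them would simply omit it, and callers would fall back to a default.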


> (and, in fact, not work if TLAB support is turned off)?
>
>
> Who turns off TLABs? Is -XX:-UseTLAB even tested by Oracle? (This is a
> genuine question.)
>

I don't think they do.  I have turned them off for various reasons
(usually, I'm trying to instrument allocations and I don't want to muck
about with thinking about TLABs), and the code paths seem a little crufty.
ISTR at some point finding something that clearly only worked by mistake,
but I can't remember now what it was.

[snip]



>> However, you can do pretty much anything from the VM itself.  Crucially
>> (for us), we don't just log the stack traces, we also keep track of which
>> are live and which aren't.  We can't do this in a callback, if the callback
>> can't create weak refs to the object.
>>
>> What we do at Google is to have two methods: one that you pass a callback
>> to (the callback gets invoked with a StackTraceData object, as I've defined
>> above), and another that just tells you which sampled objects are still
>> live.  We could also add a third, which allowed a callback to set the
>> sampling interval (basically, the VM would call it to get the integer
>> number of bytes to be allocated before the next sample).
>>
>> Would people be amenable to that?  It makes the code more complex, but,
>> as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB
>> object come from?").
>>
>>
>> Well, that 1GB object would have most likely been allocated outside a
>> TLAB and you could have identified it by instrumenting the “outside-of-TLAB
>> allocation path” (just saying…).
>>
>
> That's orthogonal to the point I was making in the quote above - the point
> I was making there was that we want to be able to detect what sampled
> objects are live.  We can do that regardless of how we implement the
> sampling (although it did involve my making a new kind of weak oop
> processing mechanism inside the VM).
>
>
> Yeah, I was thinking of doing something similar (tracking object
> lifetimes, and other attributes, with WeakRefs).
>

We have all of that implemented, so hopefully I can save you the trouble.
:)
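
For readers who want the gist of the liveness tracking, here is a Java-level analogy. It is only a sketch under stated assumptions: what we have lives inside the VM and uses its own weak oop processing, whereas this uses java.lang.ref.WeakReference and a ReferenceQueue, and the class and method names are invented for the example.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remembers sampled objects weakly and reports which
// allocation sites still have live sampled objects.
final class LiveSampleTable {
    // A weak reference that carries the sample's metadata with it.
    private static final class SampleRef extends WeakReference<Object> {
        final String allocationSite;

        SampleRef(Object referent, ReferenceQueue<Object> queue, String site) {
            super(referent, queue);
            this.allocationSite = site;
        }
    }

    private final ReferenceQueue<Object> deadQueue = new ReferenceQueue<>();
    private final Map<SampleRef, Boolean> samples = new ConcurrentHashMap<>();

    /** Called for each sampled allocation. */
    void recordSample(Object obj, String allocationSite) {
        samples.put(new SampleRef(obj, deadQueue, allocationSite), Boolean.TRUE);
    }

    /** Returns the allocation sites of sampled objects that are still reachable. */
    List<String> liveAllocationSites() {
        // Delete metadata for samples whose referents the GC has cleared.
        SampleRef dead;
        while ((dead = (SampleRef) deadQueue.poll()) != null) {
            samples.remove(dead);
        }
        List<String> live = new ArrayList<>();
        for (SampleRef ref : samples.keySet()) {
            if (ref.get() != null) {
                live.add(ref.allocationSite);
            }
        }
        return live;
    }
}
```

The two-method split is visible here: recordSample plays the role of the sampling callback, and liveAllocationSites is the "which sampled objects are still live" query.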

> But to the question of whether we can just instrument the outside-of-tlab
> allocation path...  There are a few weirdnesses here.  The first one that
> jumps to mind is that there's also a fast path for allocating in the YG
> outside of TLABs, if an object is too large to fit in the current TLAB.
> Those objects would never get sampled.  So "outside of tlab" doesn't always
> mean "slow path".
>
>
> CollectedHeap::common_mem_allocate_noinit() is the first level of the slow
> path called when a TLAB allocation fails because the object doesn’t fit in
> the current TLAB. It checks (allocate_from_tlab() /
> allocate_from_tlab_slow()) whether to refill the current TLAB or keep the
> TLAB and delegate to the GC (mem_allocate()) to allocate the object outside
> a TLAB (either in the young or old gen; the GC might also decide to do a
> collection at this point if, say, the eden is full...). So, it depends on
> what you mean by slow path but, yes, any allocations that go through the
> above path should be considered as “slow path” allocations.
>

Let me be more specific.  Here is a place where allocations go through a
fast path that is outside of a TLAB:

http://hg.openjdk.java.net/jdk9/dev/hotspot/file/972580a0eef8/src/cpu/x86/vm/templateTable_x86.cpp#l3759

If the object won't fit in the TLAB, but will fit in the Eden, it will be
allocated in the Eden, with hand-generated assembly.  This case will be
entirely missed by sampling just the TLAB creation (or your variant) and
the slow path.  I may be missing something about that code, but I can't
really see what it is.
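
Here is the control flow as I read it, written as a toy model. This is a sketch of my description above and not a transcription of the linked assembly; the class, the enum, and the hook predicate are all hypothetical, and the model ignores alignment, contention, and TLAB refill.

```java
// Hypothetical toy model of the three allocation paths described above.
final class AllocationPathModel {
    enum Path { TLAB_FAST, EDEN_FAST_OUTSIDE_TLAB, SLOW }

    private long tlabFreeBytes;
    private long edenFreeBytes;

    AllocationPathModel(long tlabFreeBytes, long edenFreeBytes) {
        this.tlabFreeBytes = tlabFreeBytes;
        this.edenFreeBytes = edenFreeBytes;
    }

    /** Bump-allocates and reports which path served the request. */
    Path allocate(long sizeBytes) {
        if (sizeBytes <= tlabFreeBytes) {
            tlabFreeBytes -= sizeBytes;      // fits in the current TLAB
            return Path.TLAB_FAST;
        }
        if (sizeBytes <= edenFreeBytes) {
            edenFreeBytes -= sizeBytes;      // inline Eden allocation, no runtime call
            return Path.EDEN_FAST_OUTSIDE_TLAB;
        }
        return Path.SLOW;                    // falls through to the runtime
    }

    /** A hook placed only on the slow path sees only SLOW allocations. */
    static boolean reachesSlowPathHook(Path path) {
        return path == Path.SLOW;
    }
}
```

In this model, an object that misses the TLAB but fits in Eden returns EDEN_FAST_OUTSIDE_TLAB, and a hook that lives only on the slow path never sees it.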

> One more piece of data: AllocTracer::send_allocation_outside_tlab_event()
> (the JFR entry point for outside-TLAB allocs) is fired from
> common_mem_allocate_noinit(). So, if there are other non-TLAB allocation
> paths outside that method, that entry point has been placed incorrectly
> (it’s possible of course; but I think that it’s actually placed correctly).
>

What is happening in the line to which I referred, then?  To me, it kind of
reads like "this is close enough to being TLAB allocation that I don't care
that it isn't".

And that's really what's going on here. Your strategy is to tie what I see
as a platform feature to a particular implementation.  If the
implementation changes, or if we really don't understand it as well as we
think we do, the whole thing falls on the floor.  If we mention TLABs in
the docs, and TLABs do change, then it won't mean anything anymore.

A particular example pops to mind: I believe Metronome doesn't have TLABs
at all.  Is that correct?  Can J9 developers implement this feature?

> For reference, to keep track of sampling, the delta to C2 is about 150 LOC
> (much of which is newlines-because-of-formatting for methods that take a
> lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86
> template interpreter is about 20 LOC, and the delta for the assembler is
> about 40 LOC.  It's not completely trivial, but the code hasn't changed
> substantially in the 5 years since I wrote it (other than a couple of
> bugfixes).
>
> Obviously, assembler/template interpreter would have to be dup'd across
> platforms - we can do that for PPC and aarch64, on which we do active
> development, at least.
>
>
> I’ll again vote for the simplicity of having a simple change in only one
> place (OK, two places…).
>

This isn't a simple change anyway, if we're keeping track of live
references.  We have to hook into reference processing - when a weak oop is
detected to be dead, we have to delete the metadata.  And we have to change
JVMTI.

Jeremy