Low-Overhead Heap Profiling
jeremymanson at google.com
Fri Jun 26 06:51:56 UTC 2015
Another thought. Since:
- It would be kind of surprising for Thread->allocated_bytes() to be
different from the number used as the interval for tracking (e.g., if your
interval is, say, 512K, you check allocated bytes, it says 0, you allocate
512K, you check allocated bytes, it says 512K, but no sample was taken),
- We're already taking the maintenance hit to maintain the allocated bytes
counter,
Maybe a good compromise would be to piggyback on the allocated bytes
counter? If the allocated bytes counter is at N, and we pick a next
sampling interval of K, we set a per-thread variable to N+K, and everywhere
we increment the allocated bytes counter, we just test to see if it is
greater than N+K?
This would add an additional load and another easily predicted branch, but
no additional subtraction. Also, it would have very obvious and tractable
modifications to make in existing places that already have logic for the
counter, so there wouldn't be much of an additional maintenance burden.
Finally, it would more-or-less address my concerns, because the non-TLAB
fast paths I'm worried about are already instrumented for it.
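
To make the idea concrete, here is a minimal sketch of the check, not actual HotSpot code. The names ThreadAllocState, make_state, record_sample and on_allocation are all hypothetical, and a fixed interval stands in for however the next K would really be chosen. It uses >= rather than a strict "greater than", so that allocating exactly K bytes takes a sample (the 512K case above).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-thread state; the field names are illustrative only.
struct ThreadAllocState {
    std::uint64_t allocated_bytes = 0;  // the existing counter (N)
    std::uint64_t next_sample_at = 0;   // the new per-thread variable (N + K)
    std::uint64_t interval = 0;         // K; fixed here, could be re-picked per sample
    std::uint64_t samples_taken = 0;    // stands in for "record a stack trace"
};

// Hypothetical setup: the first sample is due after `interval` bytes.
inline ThreadAllocState make_state(std::uint64_t interval) {
    ThreadAllocState t;
    t.interval = interval;
    t.next_sample_at = interval;
    return t;
}

// Placeholder for the real sampling work (stack trace, weak ref, etc.).
inline void record_sample(ThreadAllocState& t, std::size_t size) {
    (void)size;
    ++t.samples_taken;
}

// Called everywhere the allocated bytes counter is already incremented.
inline void on_allocation(ThreadAllocState& t, std::size_t size) {
    t.allocated_bytes += size;                    // existing bookkeeping
    if (t.allocated_bytes >= t.next_sample_at) {  // one extra load and branch
        record_sample(t, size);
        // New threshold is N + K, measured from the current counter value.
        t.next_sample_at = t.allocated_bytes + t.interval;
    }
}
```

With an interval of 512K, two 256K allocations take exactly one sample on the second allocation, and a single allocation larger than the interval is sampled immediately.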
On Thu, Jun 25, 2015 at 10:27 PM, Jeremy Manson <jeremymanson at google.com>
> On Thu, Jun 25, 2015 at 1:28 PM, Tony Printezis <tprintezis at twitter.com>
>> Hi Jeremy,
>> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremymanson at google.com)
>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <tprintezis at twitter.com>
>>> Hi Jeremy,
>>> Please see inline.
>>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremymanson at google.com)
>>> I don't want the size of the TLAB, which is ergonomically adjusted, to
>>> be tied to the sampling rate. There is no reason to do that. I want
>>> reasonable statistical sampling of the allocations.
>>> As I said explicitly in my e-mail, I totally agree with this. Which is
>>> why I never suggested to resize TLABs in order to vary the sampling rate.
>>> (Apologies if my e-mail was not clear.)
>> My fault - I misread it. Doesn't your proposal miss out-of-TLAB allocs
>> This is correct: We’ll also have to intercept the outside-TLAB allocs.
>> But, IMHO, this is a feature as it’s helpful to know how many (and which)
>> allocations happen outside TLABs. These are generally very infrequent (and
>> slow anyway), so sampling all of those, instead of only sampling some of
>> them, does not have much of an overhead. But, you could also do sampling
>> for the outside-TLAB allocs too, if you want: just accumulate their size on
>> a separate per-thread counter and sample the one that bumps that counter
>> over a limit.
> The outside-TLAB allocations generally get caught anyway, because they
> tend to be large enough to jump over the sample size immediately.
>> An additional observation (orthogonal to the main point, but I thought
>> I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also
>> know which generation the object ended up in (e.g., young gen or
>> direct-to-old-gen). This is very helpful in some situations when you’re
>> trying to work out which allocation(s) grew the old gen occupancy between
>> two young GCs.
> True. We don't have this implemented, but it would be reasonably
> straightforward to glean it from the oop.
>> FWIW, the existing JFR events follow the approach I described above:
>> * one event for each new TLAB + first alloc in that TLAB (my proposal
>> basically generalizes this and removes the 1-1 relationship between object
>> alloc sampling and new TLAB operation)
>> * one event for all allocs outside a TLAB
>> I think the above separation is helpful. But if you think it could
>> confuse users, you can of course easily just combine the information (but I
>> strongly believe it’s better to report the information separately).
> I do think it would make a confusing API. It might make more sense to
> have a reporting mechanism that had a set number of fields with very
> concrete information (size, class, stacktrace), but allowed for
> platform-specific metadata. We end up with a very long list of things we
> want in the sample: generation (how do you describe a generation?), object
> age (by number of GCs survived? What kind of GC?), was it a TLAB
> allocation, etc.
> (and, in fact, not work if TLAB support is turned off)?
>> Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a
>> genuine question.)
> I don't think they do. I have turned them off for various reasons
> (usually, I'm trying to instrument allocations and I don't want to muck
> about with thinking about TLABs), and the code paths seem a little crufty.
> ISTR at some point finding something that clearly only worked by mistake,
> but I can't remember now what it was.
>> However, you can do pretty much anything from the VM itself. Crucially
>>> (for us), we don't just log the stack traces, we also keep track of which
>>> are live and which aren't. We can't do this in a callback, if the callback
>>> can't create weak refs to the object.
>>> What we do at Google is to have two methods: one that you pass a
>>> callback to (the callback gets invoked with a StackTraceData object, as
>>> I've defined above), and another that just tells you which sampled objects
>>> are still live. We could also add a third, which allowed a callback to set
>>> the sampling interval (basically, the VM would call it to get the integer
>>> number of bytes to be allocated before the next sample).
>>> Would people be amenable to that? It makes the code more complex, but,
>>> as I say, it's nice for detecting memory leaks ("Hey! Where did that 1 GB
>>> object come from?").
>>> Well, that 1GB object would have most likely been allocated outside a
>>> TLAB and you could have identified it by instrumenting the “outside-of-TLAB
>>> allocation path” (just saying…).
>> That's orthogonal to the point I was making in the quote above - the
>> point I was making there was that we want to be able to detect what sampled
>> objects are live. We can do that regardless of how we implement the
>> sampling (although it did involve my making a new kind of weak oop
>> processing mechanism inside the VM).
>> Yeah, I was thinking of doing something similar (tracking object
>> lifetimes, and other attributes, with WeakRefs).
> We have all of that implemented, so hopefully I can save you the trouble.
> But to the question of whether we can just instrument the outside-of-tlab
>> allocation path... There are a few weirdnesses here. The first one that
>> jumps to mind is that there's also a fast path for allocating in the YG
>> outside of TLABs, if an object is too large to fit in the current TLAB.
>> Those objects would never get sampled. So "outside of tlab" doesn't always
>> mean "slow path".
>> CollectedHeap::common_mem_allocate_noinit() is the first-level of the
>> slow path called when a TLAB allocation fails because the object doesn’t
>> fit in the current TLAB. It checks (allocate_from_tlab() /
>> allocate_from_tlab_slow()) whether to refill the current TLAB or keep the
>> TLAB and delegate to the GC (mem_allocate()) to allocate the object outside
>> a TLAB (either in the young or old gen; the GC might also decide to do a
>> collection at this point if, say, the eden is full...). So, it depends on
>> what you mean by slow path but, yes, any allocations that go through the
>> above path should be considered as “slow path” allocations.
> Let me be more specific. Here is a place where allocations go through a
> fast path that is outside of a TLAB:
> If the object won't fit in the TLAB, but will fit in the Eden, it will be
> allocated in the Eden, with hand-generated assembly. This case will be
> entirely missed by sampling just the TLAB creation (or your variant) and
> the slow path. I may be missing something about that code, but I can't
> really see what it is.
> One more piece of data: AllocTracer::send_allocation_outside_tlab_event()
>> (the JFR entry point for outside-TLAB allocs) is fired from
>> common_mem_allocate_noinit(). So, if there are other non-TLAB allocation
>> paths outside that method, that entry point has been placed incorrectly
>> (it’s possible of course; but I think that it’s actually placed correctly).
> What is happening in the line to which I referred, then? To me, it kind
> of reads like "this is close enough to being TLAB allocation that I don't
> care that it isn't".
> And that's really what's going on here. Your strategy is to tie what I see
> as a platform feature to a particular implementation. If the
> implementation changes, or if we really don't understand it as well as we
> think we do, the whole thing falls on the floor. If we mention TLABs in
> the docs, and TLABs do change, then it won't mean anything anymore.
> A particular example pops to mind: I believe Metronome doesn't have TLABs
> at all. Is that correct? Can J9 developers implement this feature?
>> For reference, to keep track of sampling, the delta to C2 is about 150
>> LOC (much of which is newlines-because-of-formatting for methods that take
>> a lot of parameters), the delta to C1 is about 60 LOC, the delta to each
>> x86 template interpreter is about 20 LOC, and the delta for the assembler
>> is about 40 LOC. It's not completely trivial, but the code hasn't
>> changed substantially in the 5 years since I wrote it (other than a couple
>> of bugfixes).
>> Obviously, assembler/template interpreter would have to be dup'd across
>> platforms - we can do that for PPC and aarch64, on which we do active
>> development, at least.
>> I’ll again vote for the simplicity of having a simple change in only one
>> place (OK, two places…).
> This isn't a simple change anyway, if we're keeping track of live
> references. We have to hook into reference processing - when a weak oop is
> detected to be dead, we have to delete the metadata. And we have to change