JDK-8171119: Low-Overhead Heap Profiling

JC Beyler jcbeyler at google.com
Thu Apr 26 18:02:51 UTC 2018

Hi all,

A question came up between myself and Serguei about how to protect the
newly allocated oop when the collector does the callback. We decided it
might be best to ask the mailing list for help/guidance/opinion?

 Consider the changes done in this file for example:

For example, for obj_allocate, the change becomes:
 oop CollectedHeap::obj_allocate(Klass* klass, int size, TRAPS) {
   assert(!Universe::heap()->is_gc_active(), "Allocation during gc not
   assert(size >= 0, "int won't convert to size_t");
+  HandleMark hm(THREAD);
+  Handle result;
+  {
+    JvmtiSampledObjectAllocEventCollector collector;
   HeapWord* obj = common_mem_allocate_init(klass, size, CHECK_NULL);
   post_allocation_setup_obj(klass, obj, size);
   NOT_PRODUCT(Universe::heap()->check_for_bad_heap_word_value(obj, size));
-  return (oop)obj;
+    result = Handle(THREAD, (oop) obj);
+  }
+  return result();

The question is: does anyone see an issue here in terms of performance or
something we missed? When I measured it via the Dacapo run, I saw no
performance degradation but I wanted to double check with you all if this
would become a big no no for the final webrev?

Were other benchmarks show that there is no overhead incurred, would this
be ok?

Thanks for your help,

On Tue, Apr 17, 2018 at 9:51 PM Jeremy Manson <jeremymanson at google.com>

> Great, thanks!
> Jeremy
> On Tue, Apr 17, 2018 at 2:01 PM serguei.spitsyn at oracle.com <
> serguei.spitsyn at oracle.com> wrote:
>> Hi Jeremy,
>> We had a private discussion with Jc on this and decided the same.
>> We would like to sample all allocations if possible and on a low level.
>> Thanks,
>> Serguei
>> On 4/17/18 12:38, Jeremy Manson wrote:
>> +1 to sampling anything the thread is allocating.  With the bytecode
>> rewriting based version of this, I had complaints about missing allocations
>> from JNI and APIs like Class.newInstance.  (I don't know how the placement
>> of the collectors would affect this, but it did matter).
>> Jeremy
>> On Thu, Apr 12, 2018 at 2:23 PM JC Beyler <jcbeyler at google.com> wrote:
>>> Hi Karen,
>>> I apologize for sending too many webrevs. I try/tend to iterate fast and
>>> move in an iterative fashion. I also try to solve most, if not all, of the
>>> current items that are requested in one go. Perhaps I failed in doing that
>>> recently? I apologize for that.
>>> So I promise to not send a new webrev in this email or until I'm pretty
>>> sure I got all the current (And any incoming comment/reviews) handled :-)
>>> For the points you brought up:
>>>    a) What are we sampling? In my mind, I'd rather have the sampler be
>>> sampling anything the thread is allocating and not only sample bytecode
>>> allocations. It turns out that I was focusing on that first to get it up.
>>> As I was stuck in figuring out how to get the VM collector and the sampling
>>> collector to co-exist, there was a bit of issues there.
>>>       - That has been solved by now delaying the posting of a sampled
>>> object if a VM collector is present. So now that I've better understood
>>> interactions between collectors and when you could post an event, I'm way
>>> more able to talk about the feasibility and validity of the next item about
>>> bigger objects.
>>>    b) You bring up an excellent point of if we have a multi-array object
>>> or a more complex object (such as a cloned object for example), if the
>>> sampler is tripped on an internal allocation, should we send that smaller
>>> allocation or should we send the bigger object
>>>    - Because we get the stacktrace and we only use the oop to figure out
>>> GC information about the liveness of the object in our use-case in the
>>> JVMTI agent, this changes nothing really in practice. I do see value in
>>> sending the multi-array object as a whole to a user.
>>>       - If that is what you think is best, I can work on getting that
>>> supported and the multi-array test would then prove that if part of the
>>> multi-array is sampled, the sampler returns the whole multi-array.
>>> Hopefully that answers your concern on me sending too many webrevs, to
>>> which I sincerely apologize. Probably a learning curve of different
>>> approaches of reviews. And I hope that my other answers do show the
>>> direction you were hoping to see.
>>> Thanks again for all your help,
>>> Jc
>>> On Thu, Apr 12, 2018 at 8:15 AM Karen Kinnear <karen.kinnear at oracle.com>
>>> wrote:
>>>> JC,
>>>> On Apr 11, 2018, at 8:17 PM, JC Beyler <jcbeyler at google.com> wrote:
>>>> Hi Karen,
>>>> I put up a new webrev that is feature complete in my mind in terms of
>>>> implementation. There could be a few tid-bits of optimizations here and
>>>> there but I believe everything is now there in terms of features and there
>>>> is the question of placement of collectors (I've now put it lower than what
>>>> you talk about in this email thread).
>>>> I believe that the primary goal of your JEP is to catch samples of
>>>> allocation due to bytecodes. Given that, it makes sense to
>>>> put the collectors in the code generators, so I am ok with your leaving
>>>> them where they are.
>>>> And it would save us all a lot of cycles if rather than frequent webrev
>>>> updates with subsets of the requested changes - if you
>>>> could wait until you’ve added all the changes people requested - then
>>>> that would increase the signal to noise ratio.
>>>> The incremental webrev is here:
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.12_13/
>>>> and the full webrev is here:
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.13/
>>>> The incremental webrev contains the name change of TLAB fields as per
>>>> the conversation going on in the GC thread and the Graal changes to make
>>>> this webrev not break anything. It also contains a test for when VM and
>>>> sampled events are enabled at the same time.
>>>> I'll answer your questions inlined below:
>>>>> I have a couple of design questions before getting to the detailed
>>>>> code review:
>>>>> 1. jvmtiExport.cpp
>>>>> You have simplified the JvmtiObjectAllocEventCollector to assume there
>>>>> is only a single object.
>>>>> Do you have a test that allocates a multi-dimensional array?
>>>>> I would expect that to have multiple subarrays  - and need the logic
>>>>> that you removed.
>>>> I "think" you misread this. I did not change the implementation of JvmtiObjectAllocEventCollector
>>>> to assume only one object. Actually the implementation is still doing what
>>>> it was doing initially but now JvmtiObjectAllocEventCollector is a main
>>>> class with two inherited behaviors:
>>>>    - JvmtiVMObjectAllocEventCollector is following the same logic as
>>>> before
>>>>    - JvmtiSampledObjectAllocEventCollector has the logic of a single
>>>> allocation during collection since it is placed that low. I'm thinking of
>>>> perhaps separating the two classes now that the sampled case is so low and
>>>> does not require the extra logic of handling a growable array.
>>>> I don't have a test that tests multi-dimensional array. I have not
>>>> looked at what exactly VMObjectAllocEvents track but I did do an example to
>>>> see:
>>>> - On a clean JDK, if I allocate:
>>>>      int[][][] myAwesomeTable = new int[10][10][10];
>>>> I get one single VMObject call back.
>>>> - With my change, I get the same behavior.
>>>> So instead, I did an object containing an array and cloned it. With the
>>>> clean JDK I get two VM callbacks with and without my change.
>>>> I'll add a test in the next webrev to ensure correctness.
>>>>> *** Please add a test in which you allocate a multi-dimensional array
>>>>> - and try it with your earlier version which
>>>>> will show the full set of allocated objects
>>>> As I said, this works because it is so low in the system. I don't know
>>>> the internals of the allocation of a three-dimensional array but:
>>>>   - If I try to collect all allocations required for the new
>>>> int[10][10][10], I get more than a hundred allocation callbacks if the
>>>> sample rate is 0, which is what I'd expect (getting a callback at each
>>>> allocation, so what I expect)
>>>> So though I only collect one object, I get a callback for each if that
>>>> is what we want in regard to the sampling rate.
>>>> I’ll let you work this one out with Serguei - if he is ok with your
>>>> change to not have a growable array and only report one object,
>>>> given that you are sampling, sounds like that is not a functional loss
>>>> for you.
>>>> And yes, please do add the test.
>>>>> 2. Tests - didn’t read them all - just ran into a hardcoded “10% error
>>>>> ensures a sanity test without becoming flaky”
>>>>> do check with Serguei on this - that looks like a potential future
>>>>> test failure
>>>> Because the sampling rate is a geometric variable around a mean of the
>>>> sampling rate, any meaningful test is going to have to be statistical.
>>>> Therefore, I do a bit of error acceptance to allow us to test for the real
>>>> thing and not hack the code to have less "real" tests. This is what we do
>>>> internally, let me know if you want me to do it otherwise.
>>>> Flaky is probably a wrong term or perhaps I need to better explain
>>>> this. I'll change the comment in the tests to explain that potentially
>>>> flakyness comes from the nature of the geometrical mean. Because we don't
>>>> want too long running tests, it makes sense to me to have this error
>>>> percentage.
>>>> Let me now answer the comments from the other email here as well so we
>>>> have all answers and conversations in a single thread:
>>>>>         - Note there is one current caveat: if the agent requests the
>>>>> VMObjectAlloc event, the sampler defers to that event due to a limitation
>>>>> in its implementation (ie: I am not convinced I can safely send out that
>>>>> event with the VM collector enabled, I'll happily white board that).
>>>>> Please work that one out with Serguei and the serviceability folks.
>>>> Agreed. I'll follow up with Serguei if this is a potential problem, I
>>>> have to double check and ensure I am right that there is an issue. I see it
>>>> as just a matter of life and not a problem for now. IF you do want both
>>>> events, having a sample drop due to this limitation does not invalidate the
>>>> system in my mind. I could be wrong about it though and would happily go
>>>> over what I saw.
>>>>>> So the Heap Sampling Monitoring System used to have more methods. It
>>>>>> made sense to have them in a separate category. I now have moved it to the
>>>>>> memory category to be consistent and grouped there. I also removed that
>>>>>> link btw.
>>>>> Thanks.
>>>> Actually it did seem weird to put it there since there was only
>>>> Allocate/Deallocate, so for now the method is still again in its own
>>>> category once again. If someone has a better spot, let me know.
>>>>>>> I was trying to figure out a way to put the collectors farther down
>>>>>>> the call stack so as to both catch more
>>>>>>> cases and to reduce the maintenance burden - i.e. if you were to add
>>>>>>> a new code generator, e.g. Graal -
>>>>>>> if it were to go through an existing interface, that might be a
>>>>>>> place to already have a collector.
>>>>>>> I do not know the Graal sources - I did look at
>>>>>>> jvmci/jvmciRuntime.cpp - and it appears that there
>>>>>>> are calls to instanceKlass::new_instance,
>>>>>>> oopFactory::new_typeArray/new_ObjArray and ArrayKlass::multi-allocate,
>>>>>>> so one possibility would be to put hooks in those calls which would
>>>>>>> catch many? (I did not do a thorough search)
>>>>>>> of the slowpath calls for the bytecodes, and then check the fast
>>>>>>> paths in detail.
>>>>>> I'll come to a major issue with the collector and its placement in
>>>>>> the next paragraph.
>>>>> Still not clear on why you did not move the collectors into
>>>>> instanceKlass::new_instance and oopFactory::newtypeArray/newObjArray
>>>>> and ArrayKlass::multi-allocate.
>>>> As I said above, I am ok with your leaving the collectors in the code
>>>> generators since that is your focus.
>>>> I think what was happening is that the collectors would wrap the
>>>> objects in handles but references to the originally allocated object would
>>>> be on the stack still and would not get updated if required. Due to that
>>>> issue, I believe was getting weird bugs.
>>>> Because of this, it seems that any VM collector enabled has to
>>>> guarantee that either:
>>>>    - its path to destruction (and thus posting of events) has no means
>>>> of triggering a GC that would move things around (as long as you are in VM
>>>> code you should be fine I believe)
>>>>    - if GC is occuring, the objects in its internal array are not
>>>> somewhere on the stack without a handle around them to be able to be moved
>>>> if need by from a GC operation.
>>>> I'm not convinced this holds in the multithreaded with sampling and VM
>>>> collection cases.
>>>> I will let you and Serguei work out whether you have sufficient test
>>>> coverage for the multithreaded cases.
>>>> thanks,
>>>> Karen
>>>>>>> I had wondered if it made sense to move the hooks even farther down,
>>>>>>> into CollectedHeap:obj_allocate and array_allocate.
>>>>>>> I do not think so. First reason is that for multidimensional arrays,
>>>>>>> ArrayKlass::multi_allocate the outer dimension array would
>>>>>>> have an event before storing the inner sub-arrays and I don’t think
>>>>>>> we want that exposed, so that won’t work for arrays.
>>>>>> So the major difficulty is that the steps of collection do this:
>>>>>> - An object gets allocated and is decided to be sampled
>>>>>> - The original pointer placement (where it resides originally in
>>>>>> memory) is passed to the collector
>>>>>> - Now one important thing of note:
>>>>>>     (a) In the VM code, until the point where the oop is going to be
>>>>>> returned, GC is not yet aware of it
>>>>>     (b) so the collector can't yet send it out to the user via JVMTI
>>>>>> otherwise, the agent could put a weak reference for example
>>>>>> I'm a bit fuzzy on this and maybe it's just that there would be more
>>>>>> heavy lifting to make this possible but my initial tests seem to show
>>>>>> problems when attempting this in the obj_allocate area.
>>>>> Not sure what you are seeing here -
>>>>> Let me state it the way I understand it.
>>>>> 1) You can collect the object into internal metadata at allocation
>>>>> point - which you already had
>>>>> Note: see comment in JvmtiExport.cpp:
>>>>> // In the case of the sampled object collector, we don’t want to
>>>>> perform the
>>>>> // oops_do because the object in the collector is still stored in
>>>>> registers
>>>>> // on the VM stack
>>>>>    - so GC will find these objects as roots, once we allow GC to run,
>>>>> which should be after the header is initialized
>>>>> Totally agree with you that you can not post the event in the source
>>>>> code in which you allocate the memory - keep reading
>>>>> 2) event posting:
>>>>>    - you want to ensure that the object has been fully initialized
>>>>>    - the object needs to have the header set up - not just the memory
>>>>> allocated - so that applies to all objects
>>>>>    (and that is done in a caller of the allocation code - so it can’t
>>>>> be done at the location which does the memory allocation)
>>>>>    - the example I pointed out was the multianewarray case - all the
>>>>> subarrays need to be allocated
>>>>> - please add a test case for multi-array, so as you experiment with
>>>>> where to post the event, you ensure that you can
>>>>> access the subarrays (e.g. a 3D array of length 5 has 5 2D arrays as
>>>>> subarrays)
>>>> Technically it's more than just initialized, it is the fact that you
>>>> cannot perform a callback about an object if any object of that thread is
>>>> being held by a collector and also in a register/stack space without
>>>> protections.
>>>>>   - prior to setting up the object header information, GC would not
>>>>> know about the object
>>>>>     - was this  by chance the issue you ran into?
>>>> No I believe the issue I was running into was above where an object on
>>>> the stack was pointing to an oop that got moved during a GC due to that
>>>> thread doing an event callback.
>>>> Thanks for your help,
>>>> Jc
>>>>> 3) event posting
>>>>>    - when you post the event to JVMTI
>>>>>    - in JvmtiObjectAllocEventMark: sets _jobj (object)to_jobject(obj),
>>>>> which creates JNIHandles::make_local(_thread, obj)
>>>>>>> The second reason is that I strongly suspect the scope you want is
>>>>>>> bytecodes only. I think once you have added hooks
>>>>>>> to all the fast paths and slow paths that this will be pushing the
>>>>>>> performance overhead constraints you proposed and
>>>>>>> you won’t want to see e.g. internal allocations.
>>>>>> Yes agreed, allocations from bytecodes are mostly our concern
>>>>>> generally :)
>>>>>> But I think you need to experiment with the set of allocations (or
>>>>>>> possible alternative sets of allocations) you want recorded.
>>>>>>> The hooks I see today include:
>>>>>>> Interpreter: (looking at x86 as a sample)
>>>>>>>   - slowpath in InterpreterRuntime
>>>>>>>   - fastpath tlab allocation - your new threshold check handles that
>>>>>> Agreed
>>>>>>>   - allow_shared_alloc (GC specific): for _new isn’t handled
>>>>>> Where is that exactly? I can check why we are not catching it?
>>>>>>> C1
>>>>>>>   I don’t see changes in c1_Runtime.cpp
>>>>>>>   note: you also want to look for the fast path
>>>>>> I added the calls to c1_Runtime in the latest webrev, but was still
>>>>>> going through testing before pushing it out. I had waited on this one a
>>>>>> bit. Fast path would be handled by the threshold check no?
>>>>>>> C2: changes in opto/runtime.cpp for slow path
>>>>>>>    did you also catch the fast path?
>>>>>> Fast path gets handled by the same threshold check, no? Perhaps I've
>>>>>> missed something (very likely)?
>>>>>>> 3. Performance -
>>>>>>> After you get all the collectors added - you need to rerun the
>>>>>>> performance numbers.
>>>>>> Agreed :)
>>>>>>> thanks,
>>>>>>> Karen
>>>>>>> On Apr 5, 2018, at 2:15 PM, JC Beyler <jcbeyler at google.com> wrote:
>>>>>>> Thanks Boris and Derek for testing it.
>>>>>>> Yes I was trying to get a new version out that had the tests ported
>>>>>>> as well but got sidetracked while trying to add tests and two new features.
>>>>>>> Here is the incremental webrev:
>>>>>>> Here is the full webrev:
>>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11/
>>>>>>> Basically, the new tests assert this:
>>>>>>>   - Only one agent can currently ask for the sampling, I'm currently
>>>>>>> seeing if I can push to a next webrev the multi-agent support to start
>>>>>>> doing a code freeze on this one
>>>>>>>   - The event is not thread-enabled, meaning like the
>>>>>>> VMObjectAllocationEvent, it's an all or nothing event; same as the
>>>>>>> multi-agent, I'm going to see if a future webrev to add the support is a
>>>>>>> better idea to freeze this webrev a bit
>>>>>>> There was another item that I added here and I'm unsure this webrev
>>>>>>> is stable in debug mode: I added an assertion system to ascertain that all
>>>>>>> paths leading to a TLAB slow path (and hence a sampling point) have a
>>>>>>> sampling collector ready to post the event if a user wants it. This might
>>>>>>> break a few thing in debug mode as I'm working through the kinks of that as
>>>>>>> well. However, in release mode, this new webrev passes all the tests in
>>>>>>> hotspot/jtreg/serviceability/jvmti/HeapMonitor.
>>>>>>> Let me know what you think,
>>>>>>> Jc
>>>>>>> On Thu, Apr 5, 2018 at 4:56 AM Boris Ulasevich <
>>>>>>> boris.ulasevich at bell-sw.com> wrote:
>>>>>>>> Hi JC,
>>>>>>>>    I have just checked on arm32: your patch compiles and runs ok.
>>>>>>>>    As I can see, jtreg agentlib name "-agentlib:HeapMonitor" does
>>>>>>>> not
>>>>>>>> correspond to actual library name: libHeapMonitorTest.c ->
>>>>>>>> libHeapMonitorTest.so
>>>>>>>> Boris
>>>>>>>> On 04.04.2018 01:54, White, Derek wrote:
>>>>>>>> > Thanks JC,
>>>>>>>> >
>>>>>>>> > New patch applies cleanly. Compiles and runs (simple test
>>>>>>>> programs) on
>>>>>>>> > aarch64.
>>>>>>>> >
>>>>>>>> >   * Derek
>>>>>>>> >
>>>>>>>> > *From:* JC Beyler [mailto:jcbeyler at google.com]
>>>>>>>> > *Sent:* Monday, April 02, 2018 1:17 PM
>>>>>>>> > *To:* White, Derek <Derek.White at cavium.com>
>>>>>>>> > *Cc:* Erik Österlund <erik.osterlund at oracle.com>;
>>>>>>>> > serviceability-dev at openjdk.java.net; hotspot-compiler-dev
>>>>>>>> > <hotspot-compiler-dev at openjdk.java.net>
>>>>>>>> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>>>>>>>> >
>>>>>>>> > Hi Derek,
>>>>>>>> >
>>>>>>>> > I know there were a few things that went in that provoked a merge
>>>>>>>> > conflict. I worked on it and got it up to date. Sadly my lack of
>>>>>>>> > knowledge makes it a full rebase instead of keeping all the
>>>>>>>> history.
>>>>>>>> > However, with a newly cloned jdk/hs you should now be able to use:
>>>>>>>> >
>>>>>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
>>>>>>>> >
>>>>>>>> > The change you are referring to was done with the others so
>>>>>>>> perhaps you
>>>>>>>> > were unlucky and I forgot it in a webrev and fixed it in another?
>>>>>>>> I
>>>>>>>> > don't know but it's been there and I checked, it is here:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
>>>>>>>> >
>>>>>>>> > I double checked that tlab_end_offset no longer appears in any
>>>>>>>> > architecture (as far as I can tell :)).
>>>>>>>> >
>>>>>>>> > Thanks for testing and let me know if you run into any other
>>>>>>>> issues!
>>>>>>>> >
>>>>>>>> > Jc
>>>>>>>> >
>>>>>>>> > On Fri, Mar 30, 2018 at 4:24 PM White, Derek <
>>>>>>>> Derek.White at cavium.com
>>>>>>>> > <mailto:Derek.White at cavium.com>> wrote:
>>>>>>>> >
>>>>>>>> >     Hi Jc,
>>>>>>>> >
>>>>>>>> >     I’ve been having trouble getting your patch to apply
>>>>>>>> correctly. I
>>>>>>>> >     may have based it on the wrong version.
>>>>>>>> >
>>>>>>>> >     In any case, I think there’s a missing update to
>>>>>>>> >     macroAssembler_aarch64.cpp, in
>>>>>>>> MacroAssembler::tlab_allocate(),
>>>>>>>> >     where “JavaThread::tlab_end_offset()” should become
>>>>>>>> >     “JavaThread::tlab_current_end_offset()”.
>>>>>>>> >
>>>>>>>> >     This should correspond to the other port’s changes in
>>>>>>>> >     templateTable_<cpu>.cpp files.
>>>>>>>> >
>>>>>>>> >     Thanks!
>>>>>>>> >     - Derek
>>>>>>>> >
>>>>>>>> >     *From:* hotspot-compiler-dev
>>>>>>>> >     [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
>>>>>>>> >     <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On
>>>>>>>> Behalf
>>>>>>>> >     Of *JC Beyler
>>>>>>>> >     *Sent:* Wednesday, March 28, 2018 11:43 AM
>>>>>>>> >     *To:* Erik Österlund <erik.osterlund at oracle.com
>>>>>>>> >     <mailto:erik.osterlund at oracle.com>>
>>>>>>>> >     *Cc:* serviceability-dev at openjdk.java.net
>>>>>>>> >     <mailto:serviceability-dev at openjdk.java.net>;
>>>>>>>> hotspot-compiler-dev
>>>>>>>> >     <hotspot-compiler-dev at openjdk.java.net
>>>>>>>> >     <mailto: <hotspot-compiler-dev at openjdk.java.net>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20180426/ba6282fe/attachment-0001.html>

More information about the hotspot-compiler-dev mailing list