Low-Overhead Heap Profiling
thomas.schatzl at oracle.com
Fri Jun 23 10:52:27 UTC 2017
On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
> Hi all,
> First off: Thanks again to Robbin and Thomas for their reviews :)
> Next, I've uploaded a new webrev:
> Here is an update:
> - @Robbin, I forgot to say that yes I need to look at implementing
> this for the other architectures and testing it before it is all
> ready to go. Is it common to have it working on all possible
> combinations or is there a subset that I should be doing first and we
> can do the others later?
> - I've tested slowdebug, built and ran the JTreg tests I wrote with
> slowdebug and fixed a few more issues
> - I've refactored a bit of the code following Thomas' comments
> - I think I've handled all the comments from Thomas (I put
> comments inline below for the specifics)
Thanks for handling all those.
> - Following Thomas' comments on statistics, I want to add some
> quality assurance tests and find that the easiest way would be to
> have a few counters of what is happening in the sampler and expose
> that to the user.
> - I'll be adding that in the next version if no one sees any
> objections to that.
> - This will allow me to add a sanity test in JTreg about number of
> samples and average of sampling rate
> @Thomas: I had a few questions that I inlined below but I will
> summarize the "bigger ones" here:
> - You mentioned constants are not using the right conventions, I
> looked around and didn't see any convention except normal naming then
> for static constants. Is that right?
I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGui
de and the rule is to "follow an existing pattern and must have a
distinct appearance from other names". Which does not help a lot I
guess :/ The GC team started using upper camel case, e.g.
SomeOtherConstant, but very likely this is probably not applied
consistently throughout. So I am fine with not adding another style
(like kMaxStackDepth with the "k" in front with some unknown meaning)
(Chances are you will find that style somewhere used anyway too,
apologies if so :/)
> PS: I've also inlined my answers to Thomas below:
> On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl at oracl
> e.com> wrote:
> > Hi all,
> > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
> > > Dear all,
> > >
> > > I've continued working on this and have done the following
> > webrev:
> > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/
> > [...]
> > > Things I still need to do:
> > > - Have to fix that TLAB case for the FastTLABRefill
> > > - Have to start looking at the data to see that it is
> > consistent and does gather the right samples, right frequency, etc.
> > > - Have to check the GC elements and what that produces
> > > - Run a slowdebug run and ensure I fixed all those issues you
> > saw > Robbin
> > >
> > > Thanks for looking at the webrev and have a great week!
> > scratching a bit on the surface of this change, so apologies for
> > rather shallow comments:
> > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
> > am not sure this is final, please avoid littering the code with
> > TODO remarks :) They tend to be candidates for later wtf moments
> > only.
> > Just file a CR for that.
> Newcomer question: what is a CR and not sure I have the rights to do
> that yet ? :)
Apologies. CR is a change request, this suggests to file a bug in the
bug tracker. And you are right, you can't just create a new account in
the OpenJDK JIRA yourselves. :(
I was mostly referring to the "... but it is a TODO" part of that
comment in macroassembler_x86.cpp. Comments about the why of the code
[Note that I now understand that this is to some degree still work in
progress. As long as the final changeset does no contain TODO's I am
fine (and it's not a hard objection, rather their use in "final" code
is typically limited in my experience)]
5603 // Currently, if this happens, just set back the actual end to
where it was.
5604 // We miss a chance to sample here.
Would be okay, if explaining "this" and the "why" of missing a chance
to sample here would be best.
// If we needed to refill TLABs, just set the actual end point to
// the end of the TLAB again. We do not sample here although we could.
I am not sure whether "miss a chance to sample" meant "we could, but
consciously don't because it's not that useful" or "it would be
necessary but don't because it's too complicated to do.".
Looking at the original comment once more, I am also not sure if that
comment shouldn't referring to the "end" variable (not actual_end)
because that's the variable that is responsible for taking the sampling
path? (Going from the member description of ThreadLocalAllocBuffer).
But maybe I am only confused and it's best to just leave the comment
Thinking about it some more, doesn't this not-sampling in this case
mean that sampling does not work in any collector that does inline TLAB
allocation at the moment? (Or is inline TLAB alloc automatically
disabled with sampling somehow?)
That would indeed be a bigger TODO then :)
> > - calling HeapMonitoring::do_weak_oops() (which should probably be
> > called weak_oops_do() like other similar methods) only if string
> > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
> The call should be at least around 6 lines up outside the if.
> Preferentially in a method like process_weak_jni_handles(), including
> additional logging. (No new (G1) gc phase without minimal logging
> Done but really not sure because:
> I put for logging:
> log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
I would think that "gc, ref" would be more appropriate log tags for
this similar to jni handles.
(I am als not sure what weak reference handling has to do with
G1ConcRegionFreeing, so I am a bit puzzled)
> Since weak_jni_handles didn't have logging for me to be inspired
> from, I did that but unconvinced this is what should be done.
The JNI handle processing does have logging, but only in
process_weak_jni_handles() only overall time is measured (in a G1
specific way, since only G1 supports disabling reference procesing) :/
The code in ReferenceProcessor prints both time taken
referenceProcessor.cpp:254, as well as the count, but strangely only in
I have no idea why this logging is that unimportant to only print that
in a debug VM. However there are reviews out for changing this area a
bit, so it might be useful to wait for that (JDK-8173335).
> > - the change doubles the size of
> > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
> > threshold. Maybe it could be refactored a bit.
> Done I think, it looks better to me :).
In ThreadLocalAllocBuffer::handle_sample() I think the
set_back_actual_end()/pick_next_sample() calls could be hoisted out of
the "if" :)
> > - referenceProcessor.cpp:261: the change should add logging about
> > the number of references encountered, maybe after the corresponding
> > "JNI weak reference count" log message.
> Just to double check, are you saying that you'd like to have the heap
> sampler to keep in store how many sampled objects were encountered in
> the HeapMonitoring::weak_oops_do?
> - Would a return of the method with the number of handled
> references and logging that work?
Yes, it's fine if HeapMonitoring::weak_oops_do() only returned the
number of processed weak oops.
> - Additionally, would you prefer it in a separate block with its
Yes. Both kinds of information is interesting: while the time taken is
typically more important, the next question would be why, and the
number of references typically goes a long way there.
See above though, it is probably best to wait a bit.
> > - threadLocalAllocBuffer.cpp:331: one more "TODO"
> Removed it and added it to my personal todos to look at.
> > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
> > documentation should be updated about the sampling additions. I
> > would have no clue what the difference between "actual_end" and
> > "end" would be from the given information.
> If you are talking about the comments in this file, I made them more
> clear I hope in the new webrev. If it was somewhere else, let me know
> where to change.
Thanks, that's much better. Maybe a note in the comment of the class
that ThreadLocalBuffer provides some sampling facility by modifying the
end() of the TLAB to cause "frequent" calls into the runtime call where
actual sampling takes place.
> > - in heapMonitoring.hpp: there are some random comments about some
> > code that has been grabbed from "util/math/fastmath.[h|cc]". I
> > can't tell whether this is code that can be used but I assume that
> > Noam Shazeer is okay with that (i.e. that's all Google code).
> Jeremy and I double checked and we can release that as I thought. I
> removed the comment from that piece of code entirely.
> > - heapMonitoring.hpp/cpp static constant naming does not correspond
> > to Hotspot's. Additionally, in Hotspot static methods are cased
> > like other methods.
> I think I fixed the methods to be cased the same way as all other
> methods. For static constants, I was not sure. I fixed a few other
> variables but I could not seem to really see a consistent trend for
> constants. I made them as variables but I'm not sure now.
Sorry again, style is a kind of mess. The goal of my suggestions here
is only to prevent yet another style creeping in.
> > - in heapMonitoring.cpp there are a few cryptic comments at the top
> > that seem to refer to internal stuff that should probably be
> > removed.
> Sorry about that! My personal todos not cleared out.
I am happy about comments, but I simply did not understand any of that
and I do not know about other readers as well.
If you think you will remember removing/updating them until the review
proper (I misunderstood the review situation a little it seems).
> > I did not think through the impact of the TLAB changes on collector
> > behavior yet (if there are). Also I did not check for problems with
> > concurrent mark and SATB/G1 (if there are).
> I would love to know your thoughts on this, I think this is fine. I
I think so too now. No objects are made live out of thin air :)
> see issues with multiple threads right now hitting the stack storage
> instance. Previous webrevs had a mutex lock here but we took it out
> for simplificity (and only for now).
:) When looking at this after some thinking I now assume for this
review that this code is not MT safe at all. There seems to be more
synchronization missing than just the one for the StackTraceStorage. So
no comments about this here.
Also, this would require some kind of specification of what is allowed
to be called when and where.
One potentially relevant observation about locking here: depending on
sampling frequency, StackTraceStore::add_trace() may be rather
frequently called. I assume that you are going to do measurements :)
I am not sure what the expected usage of the API is, but
StackTraceStore::add_trace() seems to be able to grow without bounds.
Only a GC truncates them to the live ones. That in itself seems to be
problematic (GCs can be *wide* apart), and of course some of the API
methods add to that because they duplicate that unbounded array. Do you
have any concerns/measurements about this?
Also, these stack traces might hold on to huge arrays. Any
consideration of that? Particularly it might be the cause for OOMEs in
tight memory situations.
- please consider adding a safepoint check in
HeapMonitoring::weak_oops_do to prevent accidental misuse.
- in struct StackTraceStorage, the public fields may also need
underscores. At least some files in the runtime directory have structs
with underscored public members (and some don't). The runtime team
should probably comment on that.
- In StackTraceStorage::weak_oops_do(), when examining the
StackTraceData, maybe it is useful to consider having a non-NULL
reference outside of the heap's reserved space an error. There should
be no oop outside of the heap's reserved space ever.
Unless you allow storing random values in StackTraceData::obj, which I
would not encourage.
- HeapMonitoring::weak_oops_do() does not seem to use the
- I do not understand allowing to call this method with a NULL
complete_gc closure. This would mean that objects referenced from the
object that is referenced by the StackTraceData are not pulled, meaning
they would get stale.
- same with is_alive parameter value of NULL
- heapMonitoring.cpp:590: I do not completely understand the purpose of
this code: in the end this results in a fixed value directly dependent
on the Thread address anyway? In the end this results in a fixed value
directly dependent on the Thread address anyway?
IOW, what is special about exactly 20 rounds?
- also I would consider stripping a few bits of the threads' address as
initialization value for your rng. The last three bits (and probably
more, check whether the Thread object is allocated on special
boundaries) are always zero for them.
Not sure if the given "random" value is random enough before/after,
this method, so just skip that comment if you think this is not
Some more random nits I did not find a place to put anywhere:
- ThreadLocalAllocBuffer::_extra_space does not seem to be used
- Maybe indent the declaration of ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of that group.
More information about the hotspot-compiler-dev