8229189: Improve JFR leak profiler tracing to deal with discontiguous heaps

Peter Kessler OS peter.kessler at os.amperecomputing.com
Wed Aug 7 20:36:47 UTC 2019

A trick I used in the GraalVM/SubstrateVM discontiguous heap is to put the ancillary data structures in the memory allocated for the shards of the discontiguous heap.  A given shard then fits somewhat fewer objects, but things like a bit map for the shard can live in the shard itself, because the bit map has a fixed size given the size of the shard.  By adjusting `start` (or `end`) for the shard, I could keep the allocator from using the space occupied by the ancillary structure for object allocation.  The shard has to be sufficiently aligned, but I think you have that in the ZGC heap.

To find the ancillary data for an object: use the high-order bits of the object address to find the base address of the shard, and use the low-order bits of the object address to find the offset of the object within the objects in the shard.  That object offset can then be scaled to find the offset into the ancillary data for the shard.  Multiple ancillary data tables can be kept in the same shard, since each ancillary table is fixed-size and can live at a fixed offset from the base address of the shard.

The goal is to locate the ancillary data using only address arithmetic, without any additional memory references (e.g., to your hash table) or wasted space (in your sparse hash table).
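The address arithmetic above can be sketched as follows.  This is only an illustration of the scheme; the shard size, object alignment, and all names here are assumptions, not taken from SubstrateVM or ZGC:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch (not SubstrateVM/ZGC code): shards are SHARD_SIZE-
// aligned, and a fixed-size mark bitmap is embedded at the start of each
// shard. All sizes and names are assumptions for the example.
constexpr size_t SHARD_SIZE   = 64 * 1024 * 1024;            // 64 MB shards
constexpr size_t OBJ_ALIGN    = 8;                           // minimum object alignment
constexpr size_t BITMAP_BYTES = SHARD_SIZE / OBJ_ALIGN / 8;  // one bit per slot

// High-order bits: base address of the shard containing 'addr'.
inline uintptr_t shard_base(uintptr_t addr) {
  return addr & ~(uintptr_t)(SHARD_SIZE - 1);
}

// Low-order bits, scaled by the object alignment: bit index of the
// object within the shard's embedded bitmap.
inline size_t bit_index(uintptr_t addr) {
  return (addr & (uintptr_t)(SHARD_SIZE - 1)) / OBJ_ALIGN;
}

// The adjusted 'start': object allocation begins past the bitmap, so
// the allocator never uses the ancillary structure's space.
inline uintptr_t allocation_start(uintptr_t base) {
  return base + BITMAP_BYTES;
}
```

Additional fixed-size tables would simply live at further fixed offsets from `shard_base`, with no extra memory references needed to find them.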

A nit in your webrev: http://cr.openjdk.java.net/~eosterlund/8229189/webrev.00/src/hotspot/share/jfr/leakprofiler/chains/pathToGcRootsOperation.cpp.udiff.html

@@ -55,12 +55,12 @@
   * We will attempt to dimension an initial reservation
   * in proportion to the size of the heap (represented by heap_region).
   * Initial memory reservation: 5% of the heap OR at least 32 Mb
   * Commit ratio: 1 : 10 (subject to allocation granularties)
- static size_t edge_queue_memory_reservation(const MemRegion& heap_region) {
-   const size_t memory_reservation_bytes = MAX2(heap_region.byte_size() / 20, 32*M);
+ static size_t edge_queue_memory_reservation() {
+   const size_t memory_reservation_bytes = MAX2(MaxHeapSize / 20, 32*M);
    assert(memory_reservation_bytes >= (size_t)32*M, "invariant");
    return memory_reservation_bytes;

I don’t think the assert adds anything.  If the assert fails, it is because `MAX2` has failed, in which case there are bigger problems.  Getting rid of the assert also gets rid of the second instance of the constant `32*M`.  Does it matter that `MaxHeapSize / 20` does a truncating division?
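To make the point concrete, here is a minimal stand-in for the webrev's function (with `std::max` in place of HotSpot's `MAX2` macro; this is an illustration, not the actual JDK code): the maximum already guarantees the 32 MB floor, so the assert is implied, and the truncating division can lose at most 19 bytes.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative stand-in for the webrev's logic; std::max replaces
// HotSpot's MAX2 macro, and M is the usual megabyte constant.
constexpr size_t M = 1024 * 1024;

size_t edge_queue_memory_reservation(size_t max_heap_size) {
  // The max() already enforces a 32 MB floor, making a separate assert
  // redundant; the division truncates, losing fewer than 20 bytes.
  return std::max(max_heap_size / 20, (size_t)32 * M);
}
```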

I am not a Reviewer.

                                                … peter

-----Original Message-----
From: hotspot-runtime-dev <hotspot-runtime-dev-bounces at openjdk.java.net> on behalf of Erik Österlund <erik.osterlund at oracle.com>
Date: Wednesday, August 7, 2019 at 1:19 AM
To: "hotspot-runtime-dev at openjdk.java.net" <hotspot-runtime-dev at openjdk.java.net>
Subject: RFR: 8229189: Improve JFR leak profiler tracing to deal with discontiguous heaps


    The JFR leak profiler has marking bit maps that assume a contiguous Java
    heap. ZGC is discontiguous, and therefore does not work with JFR. If one
    tried to use the JFR leak profiler with ZGC, it would allocate a bit map
    for the multi-terabyte "reserved region", even though perhaps only 64 MB
    of it is used, spread out across that address space. That is one of the
    reasons the leak profiler is turned off for ZGC.

    In order to enable leak profiler support on ZGC, the tracing must also
    use the Access API instead of raw oop loads. But that is outside the
    scope of this RFE; here we deal only with the discontiguous address
    space problem.

    My solution involves implementing a segmented bit map that makes no
    assumptions about the layout of the Java heap. Given an address, it
    looks up a bitmap fragment in a hash table keyed by the high-order
    bits of the pointer. If there is no such fragment yet, one is created
    and inserted into the table. The low-order bits (shifted by
    LogMinObjAlignmentInBytes) are then used to find the bit in the bit
    map for marking an object during the traversal.
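    The fragment lookup can be sketched like this (an illustrative stand-in, not the actual JFR code; the fragment size and alignment values are assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative sketch of a segmented mark bitmap (not the JDK code).
// High-order bits of an address select a fragment; if none exists yet,
// one is created lazily. Low-order bits, shifted by the object
// alignment, select the bit within the fragment.
class SegmentedBitMap {
  static constexpr size_t LogFragmentBytes = 26;  // 64 MB fragments (assumption)
  static constexpr size_t LogObjAlignment  = 3;   // 8-byte object alignment (assumption)
  static constexpr size_t BitsPerFragment  =
      (size_t(1) << LogFragmentBytes) >> LogObjAlignment;

  std::unordered_map<uintptr_t, std::vector<bool>> _fragments;

  std::vector<bool>& fragment_for(uintptr_t addr) {
    uintptr_t key = addr >> LogFragmentBytes;      // high-order bits
    auto it = _fragments.find(key);
    if (it == _fragments.end()) {                  // lazily populate the table
      it = _fragments.emplace(key, std::vector<bool>(BitsPerFragment)).first;
    }
    return it->second;
  }

  static size_t bit_of(uintptr_t addr) {           // low-order bits, shifted
    return (addr & ((uintptr_t(1) << LogFragmentBytes) - 1)) >> LogObjAlignment;
  }

 public:
  bool is_marked(uintptr_t addr) { return fragment_for(addr)[bit_of(addr)]; }
  void mark(uintptr_t addr)      { fragment_for(addr)[bit_of(addr)] = true; }
};
```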

    To avoid regressions in tracing speed, some optimizations have been
    made:

    1) The table uses & instead of % to look up buckets, which requires
    the table size to always be a power of two.
    2) The hot paths get inlined.
    3) There is a cache for the last fragment, because two subsequent bit
    accesses for objects found during tracing are unlikely to cross a
    fragment granule (64 MB of heap memory) boundary. This is something
    G1 exploits for its cross-region check, and the same general idea is
    applied here. The code also first asks whether a bit is marked and
    then marks it, as two calls. The cache plus inlining allows the
    compiler to look up the fragment only once for the two operations.
    4) The table is kept sparse.
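    Optimizations 1 and 3 can be sketched together (illustrative names only, not the JDK code; collision handling and the real bitmap payload are omitted from this sketch):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch of a power-of-two fragment table indexed with '&'
// instead of '%', fronted by a one-entry cache remembering the last
// fragment. Names are assumptions; this is not the actual JFR code.
struct Fragment { uintptr_t key; /* bitmap payload elided */ };

class FragmentTable {
  static constexpr size_t TableSize = 1024;  // must be a power of two
  Fragment* _buckets[TableSize] = {};
  Fragment* _last = nullptr;                 // last-fragment cache

 public:
  size_t probes = 0;                         // counts table probes, for demonstration

  Fragment* lookup(uintptr_t key) {
    if (_last != nullptr && _last->key == key) {
      return _last;                          // cache hit: no table probe at all
    }
    size_t idx = key & (TableSize - 1);      // '&' instead of '%'
    probes++;
    Fragment*& slot = _buckets[idx];
    if (slot == nullptr || slot->key != key) {
      // Populate lazily. (A real table would chain or probe on collision;
      // this sketch simply replaces the slot.)
      slot = new Fragment{key};
    }
    _last = slot;
    return slot;
  }
};
```

    With the `is_marked` / `mark` pair inlined, both calls resolve through the one-entry cache, so the fragment is looked up only once for the two operations.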

    As a result, no regressions outside of the noise can be observed
    with this new, more GC-agnostic approach.



