RFR: 8146115 - Improve docker container detection and resource configuration usage

David Holmes david.holmes at oracle.com
Fri Oct 13 03:08:23 UTC 2017

Hi Bob,

On 13/10/2017 1:43 AM, Bob Vandette wrote:
>> On Oct 11, 2017, at 9:04 PM, David Holmes <david.holmes at oracle.com> wrote:
>> Hi Bob,
>> On 12/10/2017 5:11 AM, Bob Vandette wrote:
>>> Here’s an updated webrev for this RFE that contains changes and cleanups based on feedback I’ve received so far.
>>> I’m still investigating the best approach for reacting to cpu shares and quotas.  I do not believe doing nothing is the answer.
>> I do. :) Let me try this again. When you run outside of a container you don't get 100% of the CPUs - you have to share with whatever else is running on the system. You get a fraction of CPU time based on the load. We don't try to communicate load information to the VM/application so it can adapt. Within a container setting shares/quotas is just a way of setting an artificial load. So why should we be treating it any differently?
> Because today we optimize for a lightly loaded system and when running serverless applications in containers we should be
> optimizing for a fully loaded system.  If developers don’t want this, then don’t use shares or quotas and you’ll have exactly
> the behavior you have today.  I think we just have to document the new behavior (and how to turn it off) so people know what
> to expect.

The person deploying the app may not have control over how the app is 
deployed in terms of shares/quotas. It all depends on how (and by whom) 
the containers are managed. This is a big part of my problem/concern 
here: I don't know exactly how all this is organized, who knows what in 
advance, and what they can control.

But I'll let this drop, other than raising an additional concern. I 
don't think just allowing the user to hardwire the number of processors 
to use will necessarily solve the problem with what 
available_processors() returns. I'm concerned the execution of the VM 
may occur in a context where the number of processors is not known in 
advance and the user cannot disable shares/quotas. In that case we may 
need a flag that says to ignore shares/quotas in the processor count 
calculation.
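For context, the shares/quota-to-processor-count mapping under discussion can be sketched as below. This is one plausible reading of what the webrev proposes, not HotSpot's actual code; all names and the exact rounding are illustrative.

```cpp
#include <algorithm>

// Hypothetical mapping of cgroup v1 cpu-controller values to a processor
// count (names are illustrative, not HotSpot's API).
//
// quota_us:  cpu.cfs_quota_us  (-1 means "no quota")
// period_us: cpu.cfs_period_us
// shares:    cpu.shares        (-1 means "not set"; 1024 ~= one CPU's worth)
// host_cpus: CPUs visible to the process (e.g. via sched_getaffinity)
static int container_cpu_count(long quota_us, long period_us,
                               long shares, int host_cpus) {
  int quota_count = 0;
  if (quota_us > 0 && period_us > 0) {
    // Round up: a quota of 150% of one period counts as 2 CPUs.
    quota_count = static_cast<int>((quota_us + period_us - 1) / period_us);
  }
  int share_count = 0;
  if (shares > 0) {
    share_count = static_cast<int>((shares + 1023) / 1024);
  }
  int limit = host_cpus;
  if (quota_count > 0) limit = std::min(limit, quota_count);
  if (share_count > 0) limit = std::min(limit, share_count);
  return std::max(limit, 1);  // always report at least one CPU
}
```

The point of contention above is exactly whether the quota/shares terms should participate in this minimum at all, or whether only the affinity-derived host_cpus value should be reported.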

> You seem to discount the added cost of 100s of VMs creating lots of unnecessary threads.  In the current JDK 10 code base,
> in a heavily loaded system with 88 processors, VmData grows from 60MB (1 cpu) to 376MB (88 cpus).  This is only mapped
> memory, and it depends heavily on how deep in the stack these threads go before it impacts VmRSS, but it shows the potential downside
> of having 100s of VMs each thinking they own the entire machine.

I agree that the default ergonomics does not scale well. Anyone doing 
any serious Java deployment tunes the VM explicitly and does not rely on 
the defaults. How will they do that in a container environment? I don't 
know.

I would love to see some actual deployment scenarios/experiences for 
this to understand things better.

> I haven’t even done any experiments to determine the added context switching cost if the VM decides to use excessive
> pthreads.
>> That's not to say an API to provide load/shares/quota information may not be useful, but that is a separate issue to what the "active processor count" should report.
> I don’t have a problem with active processor count reporting the number of processors we have, but I do have a problem
> with our current usage of this information within the VM and Core libraries.

That is a somewhat separate issue. One worth pursuing separately.

>>> http://cr.openjdk.java.net/~bobv/8146115/webrev.01
>>> Updates:
>>> 1. I had to move the processing of AggressiveHeap since the container memory size needs to be known before this can be processed.
>> I don't like the placement of this - we don't call os:: init functions from inside Arguments - we manage the initialization sequence from Threads::create_vm. Seems to me that container initialization can/should happen in os::init_before_ergo, and the AggressiveHeap processing can occur at the start of Arguments::apply_ergo().
>> That said we need to be sure nothing touched by set_aggressive_heap_flags will be used before we now reach that code - there are a lot of flags being set in there.
> This is exactly the reason why I put the call where I did.  I put the call to set_aggressive_heap_flags in finalize_vm_init_args
> because that is exactly what this call is doing: it's finalizing flags used after the parsing.  The impacted flags are definitely
> used shortly after, and before init_before_ergo is called.

I see that now and it is very unfortunate because I really do not like 
what you had to do here. As you can tell from the logic in create_vm we 
have always refactored to ensure we can progressively manage the 
interleaving of OS initialization with Arguments processing. So having a 
deep part of Argument processing go off and call some more OS 
initialization is not nice. That said I can't see a way around it 
without very unreasonable refactoring.

But I do have a couple of changes I'd like to request please:

1. Move the call to os::initialize_container_support() up a level to 
before the call to finalize_vm_init_args(), with a more elaborate comment:

// We need to ensure processor and memory resources have been properly
// configured - which may rely on arguments we just processed - before
// doing the final argument processing. Any argument processing that
// needs to know about processor and memory resources must occur after
// this point.


// Do final processing now that all arguments have been parsed
result = finalize_vm_init_args(patch_mod_javabase);

2. Simplify and modify os.hpp as follows:

+  LINUX_ONLY(static void pd_initialize_container_support();)

     static void init(void);                      // Called before command line parsing

+   static void initialize_container_support() { // Called during command line parsing
+     LINUX_ONLY(pd_initialize_container_support();)
+   }

     static void init_before_ergo(void);          // Called after command line parsing
                                                  // before VM ergonomics

3. In thread.cpp add a comment here:

    // Parse arguments
+  // Note: this internally calls os::initialize_container_support()
    jint parse_result = Arguments::parse(args);


>>> 2. I no longer use the cpuset.cpus contents since sched_getaffinity reports the correct results
>>> even if someone manually updates the cgroup data.  I originally didn’t think this was the case since
>>> sched_setaffinity didn’t automatically update the cpuset file contents but the inverse is true.
>> Ok.
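As an aside, the behaviour Bob relies on here is easy to check: the kernel keeps the scheduler's affinity mask in sync with cpuset changes made through the cgroup filesystem, so counting the bits of the mask returned by sched_getaffinity() gives the schedulable CPU count without parsing cpuset.cpus. A minimal Linux-only sketch:

```cpp
#include <sched.h>  // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (glibc)

// Count the CPUs the current process may run on, as the scheduler sees it.
// Because the kernel updates this mask when cpuset.cpus changes, the webrev
// can drop its own cpuset file parsing.
int schedulable_cpu_count() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (sched_getaffinity(0 /* current process */, sizeof(mask), &mask) != 0) {
    return -1;  // caller should fall back to sysconf(_SC_NPROCESSORS_ONLN)
  }
  return CPU_COUNT(&mask);
}
```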
>>> 3. I ifdef’d the container function support in src/hotspot/share/runtime/os.hpp to avoid putting stubs in all other os
>>> platform directories.  I can do this if it’s absolutely necessary.
>> You should not need to do this if initialization moves as I suggested above. os::init_before_ergo() in os_linux.cpp can call OSContainer::init().
>> No need for os::initialize_container_support() or os::pd_initialize_container_support.
> But os::init_before_ergo is in shared code.

Yep my bad - point is moot now anyway.


>> src/hotspot/os/linux/os_linux.cpp/.hpp
>> 187         log_trace(os)("available container memory: " JULONG_FORMAT, avail_mem);
>> 188         return avail_mem;
>> 189       } else {
>> 190         log_debug(os,container)("container memory usage call failed: " JLONG_FORMAT, mem_usage);
>> Why "trace" (the third logging level) to show the information, but "debug" (the second level) to show failed calls? You use debug in other files for basic info. Overall I'm unclear on your use of debug versus trace for the logging.
> I use trace for noisy information that is not reporting errors, and debug for failures that are informational and not fatal.
> In this case, the call could return -1 or -2: -1 is unlimited and -2 is an error.  In either case we fall back to the
> standard system call to get available memory.  I would have used warning, but these messages were occurring
> during a test run and causing test failures.

Okay. Thanks for clarifying.
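The fallback Bob describes boils down to something like the following sketch, where -1 means "unlimited" and -2 means "error" and both cases defer to the host value. The names are illustrative, not the actual OSContainer API.

```cpp
// Illustrative return-code convention from the discussion above:
// a positive value is a real container limit; -1 means the container
// imposes no limit; -2 means the query failed. In the last two cases
// the code falls back to the host's value, logging at debug/trace
// rather than warning so test runs stay quiet.
const long OSCONTAINER_UNLIMITED = -1;
const long OSCONTAINER_ERROR     = -2;

static long effective_memory_limit(long container_limit, long host_memory) {
  if (container_limit > 0) {
    return container_limit;  // container imposes a real limit
  }
  // Unlimited (-1) and error (-2) both mean: trust the host number.
  return host_memory;
}
```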

>> ---
>> src/hotspot/os/linux/osContainer_linux.cpp
>> Dead code:
>> 376 #if 0
>> 377   os::Linux::print_container_info(tty);
>> ...
>> 390 #endif
> I left it in for standalone testing.  Should I use some other #if?

We don't generally leave in dead code in the runtime code. Do you see 
this as useful after you've finalized the changes?

Is this testing just for showing the logging? Is it worth making this a 
logging controlled call? Is it suitable for a Gtest test?


> Bob.
>> Thanks,
>> David
>>> Bob.

More information about the hotspot-dev mailing list