RFR: 8203359: Container level resources events
jbachorik at openjdk.java.net
Fri Apr 2 11:16:27 UTC 2021
On Thu, 1 Apr 2021 15:55:59 GMT, Jaroslav Bachorik <jbachorik at openjdk.org> wrote:
>>> Does each getter call result in parsing /proc, or are things aggregated over several calls or hooks?
>> From the looks of it the event emitting code uses the `Metrics.java` interface for retrieving the info. Each call to a method exposed by Metrics results in file IO on some cgroup (v1 or v2) interface file(s) in `/sys/fs/...`. I don't see any aggregation being done.
>> On the hotspot side, we implemented some caching for frequent calls (JDK-8232207, JDK-8227006), but we didn't do that yet for the Java side since there wasn't any need (so far). If calls are becoming frequent with this it should be reconsidered.
>> So +1 on getting some data on what the perf penalty of this is.
> Thanks to all for chiming in!
> I have added the tests to `test/hotspot/jtreg/containers/docker/TestJFREvents.java` where there already were some templates for the container event data.
> As for the performance - as expected, extracting the data from `/proc` is not exactly cheap. On my test c5.4xlarge instance I am getting an average wall-clock time to generate the usage/throttling events (one instance of each) of ~15ms.
> I would argue that 15ms per 30s (the default emission period for those events), i.e. roughly 0.05% of wall-clock time, might be acceptable to start with.
> Caching the parsed cgroups data would help only if the emission period is shorter than the cache TTL. The benefit is further limited by the fact that (almost) each container event type requires data from a different cgroups control file - hence the data will not be shared between the event type instances even if cached. Realistically, caching benefits would become visible only for sub-second emission periods.
> If the caching is still required I would suggest having a follow-up ticket just for that - it will require setting up some benchmarks to justify the changes that would need to be done in the metrics implementation.
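To illustrate the TTL argument above, here is a minimal sketch of a time-bounded cache around a single expensive read. The class `TtlCachedValue` and its injected clock are hypothetical and are not part of the metrics implementation; the point is only that a read is served from the cache when two calls arrive within the TTL, so a 30s emission period with a sub-second TTL would reload on every call.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not taken from the JDK sources.
// Caches the result of one expensive read (e.g. parsing one cgroup file)
// and reloads it once the cached value is older than the TTL.
final class TtlCachedValue<T> {
    private final Supplier<T> loader;   // performs the expensive read
    private final long ttlNanos;        // maximum age of a cached value
    private final LongSupplier clock;   // injected so the sketch can be tested
    private T value;
    private long loadedAt;
    private boolean loaded;

    TtlCachedValue(Supplier<T> loader, long ttlNanos, LongSupplier clock) {
        this.loader = loader;
        this.ttlNanos = ttlNanos;
        this.clock = clock;
    }

    synchronized T get() {
        long now = clock.getAsLong();
        // Reload on first use or when the cached value has expired.
        if (!loaded || now - loadedAt >= ttlNanos) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

In real use the clock would presumably be `System::nanoTime`, and each control file would need its own instance, which is why the events above would not share cached data.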
I tried to measure the startup regression and here are my observations:
* Startup is not affected unless the application is started with JFR
* The extra events and hooks take ~5ms on my work machine
* It is possible to skip registering those events and hooks in a non-container environment; the overhead then drops to the 20-50us it takes to determine whether the JVM is running in a container
To minimize the effect of this change on startup, I suggest using conditional registration unless I hear strong objections.
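A rough sketch of what conditional registration could look like, with the caveat that every name here is hypothetical: `ContainerEventRegistration`, the `isContainerized` probe, and the hook names are placeholders, not the actual JFR or metrics code. The idea is that the cheap container check runs first and the hooks are registered only when it succeeds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from the JDK sources.
final class ContainerEventRegistration {
    // Stand-in for the registry of periodic event hooks.
    private final List<String> registeredHooks = new ArrayList<>();

    // Registers the container hooks only if the probe reports a container.
    // Returns true when the hooks were registered.
    boolean registerIfContainerized(BooleanSupplier isContainerized) {
        if (!isContainerized.getAsBoolean()) {
            // Outside a container only the cost of the probe is paid.
            return false;
        }
        registeredHooks.add("container-usage");       // placeholder hook name
        registeredHooks.add("container-throttling");  // placeholder hook name
        return true;
    }

    // Read-only snapshot of the hooks registered so far.
    List<String> registeredHooks() {
        return List.copyOf(registeredHooks);
    }
}
```

With this shape, a non-container run leaves the registry empty and the startup cost is limited to the probe itself.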