RFR: 8272807: Permit use of memory concurrent with pretouch
shade at openjdk.java.net
Tue Aug 24 11:01:27 UTC 2021
On Mon, 23 Aug 2021 11:35:18 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:
> Please review this change to os::pretouch_memory to permit use of the memory
> concurrently with the pretouch operation. This is accomplished by using an
> atomic add of zero as the operation for touching the memory, ensuring the
> virtual location is backed by physical memory while not changing any values
> being read or written by the application.
> While I was there, fixed some other lurking issues in os::pretouch_memory.
> There was a potential overflow in the iteration that has been fixed. And if
> the range arguments weren't page aligned then the last page might not get
> touched. The latter was even mentioned in the function's description. Both
> of those have been fixed by careful alignment and some extra checks. The
> resulting code is a little more complicated, but more robust and complete.
> This change doesn't make use of the new capability; I have some other
> changes in development to do that.
> Testing: mach5 tier1-3.
> I've been using this change while developing uses of the new capability.
> Performance testing hasn't found any regressions related to this change.
> I _haven't_ written a microbenchmark to commit memory, time touching it with
> one of the approaches, uncommit, repeat. I could do that, though I don't
> expect it to show anything either.
Yeah, the overhead is measurable. See for example Epsilon with a 100G heap (several runs; the most typical result is shown):
$ time ~/trunks/jdk/build/baseline/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
$ time ~/trunks/jdk/build/patched/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
real 0m23.568s ; + 500ms
user 0m2.306s ; + 420ms
sys 0m21.189s ; + 80ms (noise?)
This correlates with 100G / 4K = ~25M pages to touch with atomics, which gives us roughly an additional 500ms / 25M = 20 ns per atomic/page (most likely a cache-missing atomic costing extra). In the test above, this adds up to ~2% overhead. I do believe this overhead is inconsequential (since the user already kinda loses startup performance "privileges" with `-XX:+AlwaysPreTouch` anyway), especially if we would be able to leverage this feature to pre-touch the heap in the background in future RFEs.
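The back-of-the-envelope numbers above can be double-checked with a few lines of C++. This is only a sketch of the arithmetic: the helper names are made up for illustration, and it assumes binary units (100 GiB heap, 4 KiB pages) and that the 500 ms delta is compared against the 23.568 s wall time shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for the arithmetic only; none of these names come
// from HotSpot.

// Number of pages in a region of region_bytes, for a given page size.
inline std::uint64_t page_count(std::uint64_t region_bytes,
                                std::uint64_t page_bytes) {
    assert(page_bytes != 0);
    return region_bytes / page_bytes;
}

// Extra cost per touched page in nanoseconds, given the total extra
// wall time in milliseconds.
inline double extra_ns_per_page(double extra_ms, std::uint64_t pages) {
    assert(pages != 0);
    return extra_ms * 1e6 / static_cast<double>(pages);
}

// Extra time as a percentage of the total run time.
inline double overhead_percent(double extra_ms, double total_ms) {
    assert(total_ms > 0.0);
    return 100.0 * extra_ms / total_ms;
}
```

With binary units, `page_count(100ULL << 30, 4096)` is 26,214,400 pages, so `extra_ns_per_page(500.0, 26214400)` comes out at about 19 ns and `overhead_percent(500.0, 23568.0)` at about 2.1%, in line with the rounded 25M / 20 ns / ~2% figures.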
All of this is on x86_64. On AArch64, I see that the call to the helper seems to always be done with `memory_order_conservative`.
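For reference, the technique under review (touch each page with an atomic add of zero, with care taken over alignment and overflow) can be sketched roughly as below. This is not the HotSpot code: `touch_page` and `pretouch_range` are hypothetical names, and the sketch assumes the GCC/Clang `__atomic_fetch_add` builtin, a power-of-two page size, and that every page overlapping the range is mapped and writable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-page touch: an atomic add of zero.
// The page gets backed by physical memory, but whatever value the
// application stores there concurrently is left unchanged.
// Assumption: GCC/Clang __atomic builtin, sequentially consistent ordering.
inline void touch_page(void* p) {
    __atomic_fetch_add(static_cast<int*>(p), 0, __ATOMIC_SEQ_CST);
}

// Touch every page overlapping [start, end). Returns the number of pages
// touched. page_size must be a power of two.
inline std::size_t pretouch_range(void* start, void* end,
                                  std::size_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    const std::uintptr_t first = reinterpret_cast<std::uintptr_t>(start);
    const std::uintptr_t limit = reinterpret_cast<std::uintptr_t>(end);
    if (first >= limit) return 0;  // empty range: nothing to touch

    const std::uintptr_t mask = ~static_cast<std::uintptr_t>(page_size - 1);
    std::uintptr_t cur = first & mask;               // page holding the first byte
    const std::uintptr_t last = (limit - 1) & mask;  // page holding the last byte

    std::size_t touched = 0;
    for (;;) {
        touch_page(reinterpret_cast<void*>(cur));
        ++touched;
        // Compare before advancing, so cur + page_size is never computed
        // for the final page and cannot wrap around the address space.
        if (cur == last) break;
        cur += page_size;
    }
    return touched;
}
```

Aligning both ends down to page boundaries means an unaligned range still gets its last page touched, and stopping on equality with the last page avoids the overflow an end-pointer comparison could hit at the top of the address space.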