Review Request: UseNUMAInterleaving
dmdabbs at gmail.com
Mon May 16 11:21:03 PDT 2011
> -----Original Message-----
> From: hotspot-compiler-dev-bounces at openjdk.java.net [mailto:hotspot-
> compiler-dev-bounces at openjdk.java.net] On Behalf Of Deneau, Tom
> Sent: Monday, May 16, 2011 12:54 PM
> To: 'hotspot-compiler-dev at openjdk.java.net'
> Subject: Review Request: UseNUMAInterleaving
> Please review this patch which adds a new flag called
> UseNUMAInterleaving. This flag provides a subset of the functionality
> provided by UseNUMA, and its main purpose is to provide that subset on
> OSes like Windows which do not support the full UseNUMA functionality.
> In UseNUMA terminology, UseNUMAInterleaving makes all memory
> "numa_global" which is implemented as interleaved.
> The situations where this shows the biggest benefits would be:
> * Windows platforms with multiple NUMA nodes (e.g., 4)
> * The JVM process is run across all the nodes (not affinitized to
>   one node).
> * A workload that uses the majority of the cores in the machine, so
>   that the heap is being accessed from many cores, including remote
>   ones.
> * Enough memory per node and a heap size such that the default heap
>   placement policy on Windows would end up with the heap (or
>   nursery) placed on one node.
> jbb2005 and SPECPower_ssj2008 are examples of such workloads. In our
> measurements, we have seen some cases where the performance with
> UseNUMAInterleaving was 2.7x vs. the performance without. There were
> gains of varying sizes across all systems.
> As currently implemented this flag is ignored on Linux and Solaris
> since they already support the full UseNUMA flag.
> The webrev is at
> Summary of changes:
> * Other than adding the new UseNUMAInterleaving global flag, all of
>   the changes are in src/os/windows/vm/os_windows.cpp
> * Some static routines were added to set things up at init time. These
>   * check that the required APIs (VirtualAllocExNuma,
>     GetNumaHighestNodeNumber, GetNumaNodeProcessorMask) exist in
>     the OS
>   * build the list of NUMA nodes on which this process has affinity
> * Changes to os::reserve_memory
>   * There was already a routine that reserved pages one page at a
>     time (used for Individual Large Page Allocation on WS2003).
>     This was abstracted to a separate routine, called
>     allocate_pages_individually. This gets called both for the
>     Individual Large Page Allocation case mentioned above and for
>     UseNUMAInterleaving (for both small and large pages).
>   * When used for NUMA interleaving, this just goes through the NUMA
>     node list in a round-robin fashion, using a different node for
>     each chunk (with 4K pages, the minimum allocation granularity
>     is 64K; with 2M pages it is 1 page).
>   * Whether we do just a reserve or a combined reserve/commit is
>     determined by the caller of allocate_pages_individually.
>     * When used with large pages, we do a reserve and commit at
>       the same time, which is the way it always worked and the way
>       it has to work on Windows.
>     * For small pages, only the reserve is done; the commit will
>       come later. (which is the way it worked for
> * os::commit_memory changes
>   * If UseNUMAInterleaving is true, os::commit_memory has to check
>     whether it is being asked to commit memory that might have
>     come from multiple reserve allocations; if so, the commits
>     must also be broken up. We don't keep any data structure to
>     track this; we just use VirtualQuery, which queries the
>     properties of a VA range and can tell us how much came from
>     one VirtualAlloc call.
> I do not have a bug id for this.
> -- Tom Deneau, AMD
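
The round-robin reservation and the split commit described in the quoted message can be sketched roughly as follows. This is a portable simulation and not the code from the webrev: it calls no Windows APIs, and the names Chunk, reserve_interleaved and split_commit are hypothetical, introduced only for illustration. The node list is passed in by the caller, whereas the patch builds it from the process affinity. In the patch, each chunk would presumably correspond to one VirtualAllocExNuma call, and the chunk boundaries would be recovered with VirtualQuery instead of a stored list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One simulated reservation: a contiguous range requested on a single node.
// (Hypothetical type, not taken from os_windows.cpp.)
struct Chunk {
    std::size_t base;  // start offset within the reserved range
    std::size_t size;  // number of bytes in this chunk
    int node;          // NUMA node the chunk was requested on
};

// Split [0, bytes) into pieces of at most chunk_size bytes and assign the
// nodes in round-robin order, one node per piece.
std::vector<Chunk> reserve_interleaved(std::size_t bytes,
                                       std::size_t chunk_size,
                                       const std::vector<int>& nodes) {
    assert(!nodes.empty() && chunk_size > 0);
    std::vector<Chunk> chunks;
    std::size_t next = 0;  // index of the node to use for the next piece
    for (std::size_t off = 0; off < bytes; off += chunk_size) {
        std::size_t remaining = bytes - off;
        std::size_t len = remaining < chunk_size ? remaining : chunk_size;
        chunks.push_back({off, len, nodes[next]});
        next = (next + 1) % nodes.size();
    }
    return chunks;
}

// Break a commit request [addr, addr + bytes) into pieces that never cross
// a chunk boundary, so each piece lies inside exactly one reservation.
std::vector<Chunk> split_commit(const std::vector<Chunk>& chunks,
                                std::size_t addr, std::size_t bytes) {
    std::vector<Chunk> pieces;
    std::size_t end = addr + bytes;
    for (const Chunk& c : chunks) {
        std::size_t lo = addr > c.base ? addr : c.base;
        std::size_t chunk_end = c.base + c.size;
        std::size_t hi = end < chunk_end ? end : chunk_end;
        if (lo < hi) {
            pieces.push_back({lo, hi - lo, c.node});
        }
    }
    return pieces;
}
```

With four nodes and 64K chunks, reserve_interleaved(5 * 65536, 65536, {0, 1, 2, 3}) places the fifth chunk back on node 0, and a commit of 8K that straddles the first chunk boundary comes back from split_commit as two 4K pieces, one per reservation.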
Could this flag help Linux systems with kernel < 2.6.19, or is that the
minimum kernel needed for any JVM NUMA support?
Unfortunately, we run CentOS 5.5 (2.6.18)
Linux node01.int 2.6.18-194.17.4.el5 #1 SMP Mon Oct 25 15:50:53 EDT 2010
x86_64 x86_64 x86_64 GNU/Linux
and so -XX:+UseNUMA does not activate (at least not according to
In the Java HotSpot VM, the NUMA-aware allocator has been implemented to
provide automatic memory placement optimisations for Java applications.
Typically, every processor in the system has a local memory that provides
low access latency and high bandwidth, and remote memory that is
considerably slower to access. The NUMA-aware allocator is implemented for
Solaris (>= 9u2) and Linux (kernel >= 2.6.19, glibc >= 2.6.1) operating
systems, and can be turned on for the Parallel Scavenger garbage collector
with the -XX:+UseNUMA flag. Parallel Scavenger remains the default for a
server-class machine and can also be turned on explicitly by specifying the
-XX:+UseParallelGC option. The impact of the change is significant: When
evaluated against the SPEC JBB 2005 benchmark on an 8 chip Opteron machine,
NUMA-aware systems gave about a 30% (for 32-bit) to 40% (for 64-bit)
increase in performance.
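
The version arithmetic behind the question can be sketched as below. This only compares the numeric prefix of a `uname -r` string against the 2.6.19 minimum quoted above; it is not how HotSpot decides whether to enable UseNUMA, and it ignores any features a distribution may have backported into an older kernel. The names KernelVersion, parse_kernel_version and at_least are hypothetical.

```cpp
#include <cstdio>
#include <string>

// Numeric prefix of a kernel release string (hypothetical helper type).
struct KernelVersion {
    int major;
    int minor;
    int patch;
};

// Parse the leading "major.minor.patch" of a release string such as
// "2.6.18-194.17.4.el5"; the distribution suffix is ignored.
bool parse_kernel_version(const std::string& release, KernelVersion& out) {
    return std::sscanf(release.c_str(), "%d.%d.%d",
                       &out.major, &out.minor, &out.patch) == 3;
}

// True if v is greater than or equal to major.minor.patch.
bool at_least(const KernelVersion& v, int major, int minor, int patch) {
    if (v.major != major) return v.major > major;
    if (v.minor != minor) return v.minor > minor;
    return v.patch >= patch;
}
```

For the CentOS 5.5 kernel shown above, parsing "2.6.18-194.17.4.el5" gives 2.6.18, and at_least(v, 2, 6, 19) is false, which matches the observation that the documented minimum is not met.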