NUMA

From wiki.mikejung.biz

SMP and NUMA

There are two main types of topology used for modern servers:

Symmetric Multi-Processing (SMP) -- All processors access memory in the same way and in the same amount of time. All single-socket servers use this topology. It does not scale well, however, so most larger servers use a NUMA-based topology with more than one CPU socket.

Non-Uniform Memory Access (NUMA) -- Developed more recently than SMP. NUMA systems allow for more than one CPU socket. Typically we see 2- and 4-socket systems.

Each socket has its own local memory bank, which is the fastest memory in terms of access time (apart from the CPU's L1, L2, and L3 caches). Each socket can also use the memory banks attached to other sockets, but those remote accesses carry additional latency. numastat tracks this per node: memory that ends up allocated on a node other than the one the process preferred is counted as numa_miss. To a CPU, the time spent waiting on a remote access is an eternity, so it's really important to make sure your server is configured so that processes use the closest NUMA node possible.

The examples below come from an Ubuntu 14.10 VM on my home PC, which has an Intel i7-4790K, a single-socket CPU. I don't have to worry about NUMA there: with only one socket, there are no remote RAM accesses.

To determine if a server is SMP or NUMA, you can run the following command:

# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3938 MB
node 0 free: 2333 MB
node distances:
node   0 
  0:  10 
 


This will tell you how many nodes the server has. If there is more than one node, then it’s safe to assume that this is a NUMA system.
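If you want to script that check, a minimal sketch could look like the one below. The helper names count_numa_nodes and numa_or_smp are made up for this example and are not part of numactl; the functions only read the "available:" line of the output shown above from stdin.

```shell
#!/usr/bin/env bash

# Print the node count from `numactl --hardware` output read on stdin.
# (count_numa_nodes is a hypothetical helper name, not a numactl feature.)
count_numa_nodes() {
    awk '/^available:/ { print $2; exit }'
}

# Classify the machine from that count. If no "available:" line is found,
# the count defaults to 0 and the machine is treated as a single node.
numa_or_smp() {
    local nodes
    nodes=$(count_numa_nodes)
    if [ "${nodes:-0}" -gt 1 ]; then
        echo "NUMA ($nodes nodes)"
    else
        echo "single node"
    fi
}
```

On a live system with the numactl package installed, this would be used as `numactl --hardware | numa_or_smp`.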

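You can also use lscpu to get more detailed info on the CPU. In this case there is one NUMA node containing all 4 CPUs. Because this is a VMware guest, the 4 virtual CPUs show up as 4 single-core sockets rather than as one 4-core socket.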

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Stepping:              3
CPU MHz:               3990.759
BogoMIPS:              7981.51
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3
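
On a box with several nodes, the "NUMA nodeN CPU(s)" lines are the interesting part of that output. A rough sketch for pulling out just the node-to-CPU mapping is below; lscpu_node_cpus is a name invented for this example, and it assumes the line layout shown above.

```shell
#!/usr/bin/env bash

# Read lscpu output on stdin and print "<node><TAB><cpu list>" for every
# "NUMA nodeN CPU(s):" line. The "NUMA node(s):" summary line is skipped
# because it has no digits after "node".
lscpu_node_cpus() {
    awk '/^NUMA node[0-9]+ CPU\(s\):/ {
        node = $2
        sub(/^node/, "", node)   # "node0" -> "0"
        print node "\t" $NF      # last field is the CPU list, e.g. "0-3"
    }'
}
```

Usage on a live system: `lscpu | lscpu_node_cpus`.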

What is NUMA?

http://developerblog.redhat.com/2013/08/27/numa-hurt-app-perf/

Will fill this in later.

Checking if programs are NUMAing correctly

Let's say you are running multiple Java processes on a NUMA box (quad-socket AMD Opteron 6128). You want these processes to be efficient, so you want them to run on different NUMA nodes, each using a local RAM bank and running on the CPU socket closest to that bank. How do you check?

Start by figuring out where the processes are running

numastat java

This will show you the PIDs of the processes as well as how much RAM they are using on each NUMA node. Ideally, if you run two processes and have 4 NUMA nodes available, you would want to utilize two nodes, one for each process. The command above should show something like the example below if they are correctly configured:

       NODE 1  NODE 2  TOTAL
----------------------------
PID 1    1000       5   1005
PID 2       5    1000   1005
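
To turn a table like that into a quick locality check, something along these lines could work. This is only a sketch: dominant_node is a made-up helper, and the input layout (one line per process: PID followed by the MB used on each node, with no header or total column) is an assumption for the example, so real numastat output would need trimming first.

```shell
#!/usr/bin/env bash

# Input (stdin): "<pid> <MB on node 0> <MB on node 1> ..." per line.
# Output: "<pid> <busiest node index> <percent of memory on that node>".
# Node indexes here are simply column positions starting at 0.
dominant_node() {
    awk '{
        total = 0; best = 2
        for (i = 2; i <= NF; i++) {
            total += $i
            if ($i > $best) best = i
        }
        if (total > 0)
            printf "%s %d %d\n", $1, best - 2, int(100 * $best / total)
    }'
}
```

A process with nearly all of its memory on one node will show a percentage close to 100; a much lower number suggests its memory is spread across nodes.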

Now, to find out which CPUs each process is allowed to run on, run the following command for each PID:

grep Cpus_allowed_list /proc/$PID/status
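
With several processes it gets tedious to do this by hand, so here is a small sketch that loops over them. The function names cpus_allowed and show_allowed_for are invented for this example; it assumes a Linux /proc filesystem and that pgrep is installed.

```shell
#!/usr/bin/env bash

# Print the CPU list that PID $1 is allowed to run on (Linux /proc only).
cpus_allowed() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status"
}

# Print "<pid><TAB><allowed CPUs>" for every process whose name matches $1.
# If pgrep finds nothing, the loop body simply never runs.
show_allowed_for() {
    local pid
    for pid in $(pgrep "$1"); do
        printf '%s\t%s\n' "$pid" "$(cpus_allowed "$pid")"
    done
}
```

Usage: `show_allowed_for java`.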

This will display which CPUs the process can use. We still need to know whether those CPUs belong to the NUMA node where the process is storing its data. To check this, list the CPUs that belong to each node:

numactl --hardware | grep cpus

If everything matches up, you are in pretty good shape. If things do not match up, you might want to start moving processes around so that each process's CPUs and memory sit on the same NUMA node.
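
The "matches up" comparison can be sketched in shell as well. The helpers expand_cpulist and cpus_within_node are hypothetical names for this example. The first argument is a Cpus_allowed_list value such as "0-3,8", and the second is the space-separated CPU list that follows "node N cpus:" in the numactl --hardware output.

```shell
#!/usr/bin/env bash

# Expand a kernel-style CPU list such as "0-3,8" into one CPU id per line.
expand_cpulist() {
    local part lo hi
    local IFS=','
    for part in $1; do
        lo=${part%-*}    # text before the dash (or the whole item)
        hi=${part#*-}    # text after the dash (or the whole item)
        seq "$lo" "$hi"
    done
}

# Succeed only if every CPU in list $1 appears in the node CPU list $2.
cpus_within_node() {
    local cpu
    for cpu in $(expand_cpulist "$1"); do
        case " $2 " in
            *" $cpu "*) ;;       # CPU belongs to this node
            *) return 1 ;;       # CPU is outside this node
        esac
    done
}
```

For example, `cpus_within_node "0-3" "0 1 2 3"` succeeds, while `cpus_within_node "0-7" "0 1 2 3"` fails. If a process turns out to be spread across nodes, one option is to start it under numactl with both the CPUs and the memory pinned to the same node, for example `numactl --cpunodebind=0 --membind=0` followed by the command to run.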

There are other tools that can help dig a bit deeper, such as linpack and perf. I will add more on these tools once I have some time to play around with them.