SMP and NUMA
There are two main types of topology used for modern servers:
Symmetric Multi-Processor (SMP) -- All processors access all memory in the same way and in the same amount of time. All single-socket servers use this topology. SMP does not scale well, however, so most larger servers use a NUMA-based topology, which includes more than one CPU socket.
Non-Uniform Memory Access (NUMA) -- Developed more recently than SMP. NUMA systems allow for more than one CPU socket; 2- and 4-socket systems are the most common.
Each socket has its own local memory bank, which is the fastest in terms of access time (aside from the CPU's L1, L2, and L3 caches). Each socket can also use the remote memory banks attached to the other sockets, but those remote accesses come with additional latency; numastat reports allocations that land on the wrong node as numa_miss events. The thing is, the time spent waiting on remote memory is forever to a CPU, so it's really important to make sure your server is configured to use the closest NUMA node possible.
I'm running Ubuntu 14.10 in a VM on my home PC, which has an Intel i7-4790K, a single-socket CPU. I don't have to worry about NUMA here: with only one socket there are no remote RAM accesses.
To determine if a server is SMP or NUMA, you can run the following command:
root@ubuntu:~# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3938 MB
node 0 free: 2333 MB
node distances:
node   0
  0:  10
This will tell you how many nodes the server has. If there is more than one node, then it’s safe to assume that this is a NUMA system.
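That check is easy to script. Here's a quick sketch (not from the original post) that parses the "available: N nodes" line of `numactl --hardware` output, using the sample output from the VM above:

```python
import re

def count_numa_nodes(numactl_output: str) -> int:
    """Parse the 'available: N nodes (...)' line from `numactl --hardware`."""
    m = re.search(r"available:\s+(\d+)\s+nodes", numactl_output)
    if m is None:
        raise ValueError("unexpected numactl output")
    return int(m.group(1))

# Sample output from the single-socket VM above.
sample = """available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3938 MB
node 0 free: 2333 MB
node distances:
node   0
  0:  10
"""

nodes = count_numa_nodes(sample)
print("NUMA" if nodes > 1 else "SMP (single node)")  # SMP (single node)
```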
You can also use lscpu to get more detailed info on the CPU. In this case there is one NUMA node with 4 CPUs (0-3) in it.
root@ubuntu:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Stepping:              3
CPU MHz:               3990.759
BogoMIPS:              7981.51
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3
Checking if programs are NUMAing correctly
Let's say you are running multiple Java processes on a NUMA box (quad-socket AMD Opteron 6128). You want these processes to be efficient: each should run on a different NUMA node so that it uses a local RAM bank and runs on the CPU socket closest to that bank. How do we check this?
Start by figuring out where the processes are running. One way to do this is with numastat (part of the numactl package), which takes a process name pattern:

numastat -c java

This will show you the PIDs of the matching processes as well as how much RAM they are using on each NUMA node. Ideally, if you run two processes and have 4 NUMA nodes available, you would want to utilize two nodes, one for each process. If they are correctly configured, the output should look something like the example below:
         NODE 1   NODE 2    TOTAL
---------------------------------
PID 1      1000        5     1005
PID 2         5     1000     1005
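A quick way to turn numbers like those into a pass/fail check is to compute, for each process, what fraction of its memory sits on its "home" node (the node holding the most of it). This is a hypothetical helper, using the numbers from the table above:

```python
# Per-node resident memory in MB for each process (numbers from the table).
per_node_mb = {
    "pid1": {0: 1000, 1: 5},
    "pid2": {0: 5, 1: 1000},
}

def locality(per_node):
    """Return (home_node, fraction of memory on that node)."""
    total = sum(per_node.values())
    home_node = max(per_node, key=per_node.get)
    return home_node, per_node[home_node] / total

for pid, nodes in per_node_mb.items():
    node, frac = locality(nodes)
    print(f"{pid}: {frac:.1%} of memory on node {node}")
```

Both processes here are ~99.5% local, which is what a correctly configured pair looks like.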
Now, if you want to find out which CPUs each process is allowed to run on, run the following command for each PID:
grep Cpus_allowed_list /proc/$PID/status
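The value is a kernel CPU list like 0-3 or 0-3,8,10-11. If you want to work with it programmatically, here is a small sketch (assuming a Linux system with /proc mounted) that expands that syntax and reads the current process's own allowed list:

```python
def parse_cpu_list(cpu_list: str) -> set:
    """Expand a kernel CPU list like '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in cpu_list.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Read this process's own allowed CPUs (Linux only).
with open("/proc/self/status") as f:
    for line in f:
        if line.startswith("Cpus_allowed_list:"):
            print(parse_cpu_list(line.split(":", 1)[1]))
```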
This will display which CPUs can be used by the process. We still need to know whether those CPUs belong to the NUMA node where the process's data lives. To see which CPUs belong to each node, run:
numactl --hardware | grep cpus
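Comparing the two lists by eye works for a couple of processes; for more, you can script it. This sketch parses the "node N cpus:" lines and checks whether a process's allowed CPUs all fall inside one node (the two-node output here is made up for illustration):

```python
import re

def node_cpu_map(numactl_output: str) -> dict:
    """Map NUMA node id -> set of CPU ids from `numactl --hardware` output."""
    mapping = {}
    for node, cpus in re.findall(r"node (\d+) cpus: ([\d ]+)", numactl_output):
        mapping[int(node)] = {int(c) for c in cpus.split()}
    return mapping

# Hypothetical two-node output for illustration:
sample = "node 0 cpus: 0 1 2 3\nnode 1 cpus: 4 5 6 7\n"
nodes = node_cpu_map(sample)

# A process whose Cpus_allowed_list is 0-3 is local to node 0:
allowed = {0, 1, 2, 3}
local_nodes = [n for n, cpus in nodes.items() if allowed <= cpus]
print(local_nodes)  # [0]
```

If local_nodes comes back empty, the process can wander across sockets and its memory accesses may cross nodes.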
If everything matches up then you are doing pretty good. If things do not match up then you might want to start moving processes around so that each process's CPUs and memory end up on the same NUMA node.
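Moving a process is really two jobs: pinning its CPUs and moving its memory. For new processes, `numactl --cpunodebind=N --membind=N <cmd>` handles both at launch. For the CPU half of an already-running process, here is a minimal sketch using Linux's sched_setaffinity (the process pins itself here; a real fix would pin to all CPUs of the target node, and the memory side still needs a tool like migratepages):

```python
import os

# Pin the current process to a single CPU from its currently allowed set.
# (Illustration only; a real fix would use the full CPU set of one NUMA node.)
original = os.sched_getaffinity(0)   # 0 means the calling process
target = {min(original)}
os.sched_setaffinity(0, target)
print("now allowed on:", os.sched_getaffinity(0))

# Restore the original affinity.
os.sched_setaffinity(0, original)
```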
There are other tools that can help dig a bit deeper, such as Linpack and perf. I will add more on these tools once I have some time to play around with them.