- 1 Performance Analysis Overview
- 2 Performance Analysis Tools and Utilities
- 3 Performance Tuning Tools and Utilities
- 4 Section 2: CPU
- 5 Section 3: Memory
- 6 Section 4: Storage and File Systems
- 7 Section 5: Networking
- 8 Section 6: Real World Examples
- 9 Section 7: Troubleshooting
- 10 RHEL 6 and 7 Official Performance Tuning Links and Guides
Performance Analysis Overview
This wiki is meant to serve two main purposes:
1) Provide a basic framework for troubleshooting and identifying performance issues, with a focus on CentOS, though most of it applies to any Linux distribution.
2) Explain how the various subsystems of a server operate, including storage, memory, CPU, and networking.
Performance Analysis Tools and Utilities
Using top to analyze bottlenecks
Top provides a dynamic view of the processes running on a Linux system. By default, processes are ordered by the percentage of CPU they are using, so the most CPU intensive processes are always listed first. This command is very useful for getting a quick idea of what is running on the system, and should be one of the first commands to run when investigating a server with high load. Top does not do the best job at displaying every running process, as there is a slight delay between refreshes, so I highly recommend you also use "ps faux" to get a full list of processes. Sometimes a process may only run for a very short period of time, but use a lot of CPU resources or cause a massive IO wait spike; if you only use top you may not notice this.
Some red flags to look out for...
- The Swap row shows a large amount of used swap space. A little swap usage is not an issue, but if you consistently see 10%+ used, there is more than likely not enough RAM to handle all the processes. If this is the case, you should either identify processes that are wasting or using too much RAM, or suggest a resize / upgrade to a server with more RAM. Upgrades should only be recommended if the server is CONSISTENTLY using swap over long periods of time.
- %wa is consistently above 10% - 20%. Typically this means that the storage device / array is slowing down the server. This value displays the amount of time the CPU spends waiting on the storage system to process a request. In general, the lower this value, the more responsive the server will be. Large values here mean that either the server is swapping, which slows down disk IO, or the disk is simply being pushed to its limits. If this is the case, make sure there are no rogue or wasteful processes running on the server. If the server is not swapping and the application is performing a typical workload, it may be a good idea to suggest an upgrade to an SSD or a RAID array to help improve performance.
- Please keep in mind that a "high" load average is not necessarily a bad thing. High load averages usually mean that the server is busy and handling a lot of work, which is usually ideal. If there is no high IO wait or swap usage, and no suspicious processes using a ton of CPU, then there is not really an issue with the server. There comes a point where the customer's workload simply outgrows the server; if this is consistently the case, suggesting an upgrade is recommended.
- %us is the CPU time spent in the User Space. This includes most applications, specifically, anything that is NOT in Kernel Space.
- %sys is the CPU time spent in Kernel Space. This does not include most applications, and is only the amount of time the CPU spends doing Kernel things.
Example output using "top -c"
top - 15:29:39 up 16:00,  1 user,  load average: 0.29, 0.11, 0.03
Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.6%us,  4.5%sy,  0.0%ni, 85.6%id,  3.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24711380k total,  1526880k used, 23184500k free,    47408k buffers
Swap:  3145712k total,        0k used,  3145712k free,   175376k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12451 qemu      20   0 5271m 1.1g 9512 R 125.1  4.5  53:00.49 qemu-system-x86
  130 root      39  19     0    0    0 S   2.0  0.0  10:21.17 kipmi0
    1 root      20   0 21492 1596 1276 S   0.0  0.0   0:00.89 init
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kthreadd
Example output using "top -c", then pressing "1" to view more detailed CPU info.
top - 15:35:14 up 16:06,  1 user,  load average: 2.41, 1.54, 0.65
Tasks: 213 total,   1 running, 212 sleeping,   0 stopped,   0 zombie
Cpu0  :  1.8%us, 14.3%sy,  0.0%ni, 53.4%id, 29.4%wa,  0.0%hi,  1.1%si,  0.0%st
Cpu1  : 38.8%us, 10.5%sy,  0.0%ni, 50.0%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  7.4%us,  2.7%sy,  0.0%ni, 89.0%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.3%us,  2.7%sy,  0.0%ni, 94.3%id,  2.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 26.9%us, 12.4%sy,  0.0%ni, 60.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  7.7%us,  1.3%sy,  0.0%ni, 90.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24711380k total,  1530532k used, 23180848k free,    47616k buffers
Swap:  3145712k total,        0k used,  3145712k free,   176356k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12451 qemu      20   0 5271m 1.1g 9512 S 136.3  4.5  57:59.76 /opt/kvm-stack-3/bin/qemu-system-x86_64 -cpu host,l
  130 root      39  19     0    0    0 S   0.7  0.0  10:23.49 [kipmi0]
    1 root      20   0 21492 1596 1276 S   0.0  0.0   0:00.89 /sbin/init
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.00 [kthreadd]
    3 root      RT   0     0    0    0 S   0.0  0.0   0:00.03 [migration/0]
Explanation of some of the columns displayed. For a full description, please see this (or use man top).
%CPU -- CPU usage: the task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time.
%MEM -- Memory usage (RES): a task's currently used share of available physical memory.
VIRT -- Virtual Image (kb): the total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been swapped out. VIRT = SWAP + RES.
RES -- Resident size (kb): the non-swapped physical memory a task has used. RES = CODE + DATA.
SHR -- Shared Mem size (kb): the amount of shared memory used by a task. It simply reflects memory that could potentially be shared with other processes.
- For more information on Linux CPU stats, please visit this excellent blog - http://blog.scoutapp.com/articles/2015/02/24/understanding-linuxs-cpu-stats
The Virtual Memory Statistics tool, known as vmstat, provides reports on the system's processes, memory, paging, block IO, interrupts and CPU activity. You can change the sample time to get near real time updates on system activity. vmstat is one of my favorite utilities because of its flexibility. Often I find it more useful than top because it allows you to view real time, or one-off, reports on all the main Linux subsystems. I especially like the way it displays si (swap in) and so (swap out), which are significantly more useful than just viewing swap usage. If your server is using half its swap space but only needs to swap in / out every once in a while, that's perfectly fine; but if your server is constantly swapping in and out and you see reduced performance, odds are you need to upgrade to more RAM or move some processes / services off the server to avoid constant swap activity.
Example output of "vmstat"
vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 23181036  48284 176364    0    0    14   866   46   57  1  1 98  0  0
Example output of "vmstat -S M -n 1". This displays in MB with 1 second reports.
vmstat -S M -n 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  22638     47    172    0    0   555   847   47   85  1  1 97  1  0
 0  0      0  22638     47    172    0    0     0     8  241   93  0  0 100 0  0
 0  0      0  22638     47    172    0    0     0     0  206   93  0  0 100 0  0
 0  0      0  22638     47    172    0    0     0     0  208   90  0  0 100 0  0
Example output of "vmstat -s -S M". This provides a little more information than the default command.
vmstat -s -S M
        24132 M total memory
         1494 M used memory
         1181 M active memory
          150 M inactive memory
        22637 M free memory
           47 M buffer memory
          172 M swap cache
         3071 M total swap
            0 M used swap
         3071 M free swap
       423301 non-nice user cpu ticks
           49 nice user cpu ticks
       295469 system cpu ticks
     46874963 idle cpu ticks
       304360 IO-wait cpu ticks
          171 IRQ cpu ticks
        14562 softirq cpu ticks
            0 stolen cpu ticks
     95571539 pages paged in
    410354267 pages paged out
            0 pages swapped in
            0 pages swapped out
    179644394 interrupts
    689671073 CPU context switches
   1407194940 boot time
       222784 forks
Explanation of some of the values displayed by vmstat:
si -- Swap in: memory swapped in (read) from swap space, in KB/s.
so -- Swap out: memory swapped out (written) to swap space, in KB/s.
bi -- Blocks in: blocks read from a block device, per second.
bo -- Blocks out: blocks written to a block device, per second.
wa -- Percentage of CPU time spent waiting for IO operations to complete.
Swap in and swap out can be useful for identifying high memory usage, since swap space is used much more often when free RAM is hard to come by. High values here indicate that more RAM will help to improve IO in general by avoiding constant swap activity, which slows down the block device.
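To make the constant-swapping case concrete, here is a rough sketch that averages si / so over a few samples and flags sustained activity. The vmstat rows and the 100 KB/s threshold below are made-up values for illustration; on a live server feed it real output from something like "vmstat 1 5".

```shell
# Canned vmstat rows (columns: r b swpd free buff cache si so bi bo in cs us sy id wa st)
vmstat_sample='
 1  0  20480  1024   47  172  512  768   14  866   46   57  1  1 98  0  0
 0  0  20480  1000   47  172  600  900    0    8  241   93  0  0 100 0  0
'
# Average si + so (columns 7 and 8) across the samples
swap_rate=$(echo "$vmstat_sample" | awk 'NF >= 17 { si += $7; so += $8; n++ }
    END { if (n) print (si + so) / n; else print 0 }')
echo "average swap traffic: $swap_rate KB/s"
# 100 KB/s is an arbitrary threshold for this sketch
[ "${swap_rate%.*}" -gt 100 ] && echo "WARNING: server is swapping constantly"
```

A one-off spike in si / so is normal; it is the sustained average over many samples that points to a RAM shortage.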
If vmstat shows that the IO system is responsible for reduced performance, you can then use iostat to determine what block device is causing the issue.
The System Activity Reporter, known as sar, collects and reports information about Linux system activity. This information is collected throughout the day and displayed in 10 minute intervals. It can be used to view historical CPU usage, IO wait, RAM usage and many other metrics. This tool is very useful for determining whether server load is consistently an issue, or whether it only happens at certain times of the day.
Some red flags to look out for with sar....
- Consistently high %iowait. If this value is constantly high, as in the example below, it is a sign that the server is IO bound; this will increase the server load because the CPU spends a lot of time waiting for IO requests to complete. If activity is bursty, look at crons or traffic patterns to see what is causing the IO. IO wait can significantly increase CPU load, so identifying periods with IO wait is a good start to determining why performance is reduced.
- Consistently low %idle. If this value is below 5%, then it's safe to assume that CPU utilization is a concern. Keep in mind that high IO wait will cause CPU usage to appear high. If there is low IO wait, and high CPU usage, then odds are, there are some processes running that are extremely CPU intensive. If this is consistently the case, and IO is not an issue, then suggesting a CPU upgrade is recommended.
Here is an example of the output that sar gives
sar
.....
02:30:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
02:40:01 PM     all      0.02      0.00      0.16      0.00      0.00     99.81
02:50:01 PM     all      0.02      0.00      0.16      0.00      0.00     99.82
03:00:01 PM     all      0.02      0.00      0.16      0.00      0.00     99.81
03:10:01 PM     all      0.03      0.00      0.17      0.00      0.00     99.81
03:20:01 PM     all      0.02      0.00      0.16      0.00      0.00     99.82
03:30:01 PM     all      1.79      0.00      1.29      0.86      0.00     96.05
03:40:01 PM     all     10.83      0.00      5.84      4.56      0.00     78.77
03:50:01 PM     all     18.54      0.00      9.97      9.86      0.00     61.64
04:00:01 PM     all      7.07      0.00      4.14      6.77      0.00     82.02
04:10:01 PM     all     14.62      0.00      8.48     20.56      0.00     56.34
04:20:01 PM     all     17.51      0.00     12.22     30.43      0.00     39.84
Average:        all      1.08      0.00      0.78      0.97      0.00     97.16
To view just the load averages and the number of running processes, you can use sar -q. In the sar example below, you can see a period of high load on the server that appears to come out of nowhere. If you notice patterns throughout the day, checking for random / malicious crons is a good place to start. If you don't see any cronjobs causing load, the next thing to check is backups; if you use cPanel and notice high load at night / early morning, odds are cPanel is running backups. If the time of day when backups run is not ideal, change it. Another thing to watch out for is web crawler traffic; if this becomes an issue you can use robots.txt to tell Google / Bing to slow down the crawling a bit.
sar -q
02:30:01 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:40:01 PM         3       249      0.05      0.01      0.00
02:50:01 PM         3       252      0.00      0.00      0.00
03:00:01 PM         3       260      0.00      0.00      0.00
03:10:01 PM         4       254      0.37      0.08      0.02
03:20:01 PM         4       252      0.01      0.03      0.00
03:30:01 PM         5       262      0.40      0.15      0.05
03:40:01 PM         6       272      2.45      2.22      1.18
03:50:01 PM         6       298      5.02      4.02      2.54
04:00:01 PM         5       280      3.82      2.65      2.43
04:10:01 PM         6       294     12.11      8.86      5.47
04:20:01 PM         3       251      8.96     12.69      9.26
Average:            4       255      0.44      0.41      0.30
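As a quick sketch, you can also pull the high-load intervals out of sar -q output with awk, which is handy when there is a full day of samples to scan. The rows below are canned data modeled on the example output, and the threshold of 4 is arbitrary:

```shell
# Canned sar -q rows (time, AM/PM, runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15)
sar_q_sample='
03:40:01 PM         6       272      2.45      2.22      1.18
03:50:01 PM         6       298      5.02      4.02      2.54
04:10:01 PM         6       294     12.11      8.86      5.47
'
# Print intervals where the 1-minute load average ($5) crossed the threshold
high=$(echo "$sar_q_sample" | awk '$5 + 0 > 4 { print $1, $2, "ldavg-1:", $5 }')
echo "$high"
```

On a live server you would pipe "sar -q" straight into the awk filter instead of using canned rows.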
The iostat command is used for monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates. The iostat command generates reports that can be used to change system configuration to better balance the input/output load between physical disks.
If IO wait is an issue on the server, this command will help you determine which block device is causing it. It also displays real time statistics showing how many IO operations the device is performing, along with a lot of other really useful stats. If you are using a RAID array and iostat shows unusual IO activity for a partition on the array, it's possible that the RAID is degraded, or something funny is going on. You can use CLI utilities to check the status of Adaptec or LSI RAID cards.
Example output from "iostat"
iostat
Linux 2.6.32-431.17.1.el6.x86_64 (host53.c.lan.xvps.staging.liquidweb.com)  08/05/2014  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.17    0.00    0.77    0.98    0.00   97.08

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda            2266.08      9378.17     14107.81  577995182  869491886
dm-0           2361.54      9371.79     14102.56  577601564  869168390
For more detailed output and current usage, you can use the following command to get a much better idea of the IO activity: "iostat -d -m -p -x 1"
iostat -d -m -p -x 1
Linux 2.6.32-431.17.1.el6.x86_64 (host53.c.lan.xvps.staging.liquidweb.com)  08/05/2014  _x86_64_  (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda              53.20    41.05  903.56 1359.31     4.92     7.30    11.06     0.30    0.13   0.03   5.66
sda1              0.05     0.29    0.12    0.36     0.00     0.00    23.53     0.00    0.33   0.27   0.01
sda2              0.00     0.00    0.01    0.00     0.00     0.00     8.50     0.00    0.12   0.12   0.00
sda3             53.14    40.76  903.42 1358.95     4.92     7.30    11.06     0.30    0.13   0.02   5.65
dm-0              0.00     0.00  956.56 1399.70     4.92     7.30    10.62     0.34    0.15   0.02   5.66

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00  141.00     0.00    33.91   492.48     1.52   10.79   0.22   3.10
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.00    0.00  141.00     0.00    33.91   492.48     1.52   10.79   0.22   3.10
dm-0              0.00     0.00    0.00  141.00     0.00    33.91   492.48     1.52   10.79   0.22   3.10
The options listed below are the most useful options for identifying disk and partition performance.
-d  Display the device utilization report.
-m  Display statistics in megabytes per second instead of blocks or kilobytes per second.
-p  Display statistics for block devices and all their partitions that are used by the system.
-x  Display extended statistics.
1   Report statistics in 1 second intervals.
The columns listed may seem confusing at first, but here is a quick explanation of the key areas to look at:
r/s -- Read operations performed per second.
w/s -- Write operations performed per second.
avgrq-sz -- The average size (in sectors) of the requests that were issued to the device.
avgqu-sz -- The average queue length of the requests that were issued to the device.
await -- The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
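One detail worth remembering: avgrq-sz is reported in 512-byte sectors, not KB, so it needs a conversion before you can reason about request sizes. A quick sketch using the 492.48 value from the example output above:

```shell
# avgrq-sz is in 512-byte sectors; convert to KB (sample value from the output above)
avgrq_sz=492.48
kb=$(awk -v s="$avgrq_sz" 'BEGIN { printf "%.1f", s * 512 / 1024 }')
echo "avgrq-sz $avgrq_sz sectors is about $kb KB per request"
```

Large average request sizes like this usually mean big sequential writes (e.g. a backup), while small ones point at random IO.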
Performance Tuning Tools and Utilities
tuned and tuned-adm
"tuned is a tuning daemon that can adapt the operating system to perform better under certain workloads by setting a tuning profile. tuned-adm is a command line tool that lets users switch between different tuning profiles."
This can be a great way to quickly and easily tune performance on a customer's server. Installation is simple and activating a profile is even easier. Most of the profiles can be activated on the fly, with no reboot needed. Tuned will make some adjustments to sysctl.conf based on the profile you select.
To install and configure tuned
## Install, enable and start tuned
yum install tuned
chkconfig tuned on
service tuned start

## Set profile to latency-performance
tuned-adm profile latency-performance

## Check what the active profile is set to
tuned-adm active
Current active profile: latency-performance
Service tuned: enabled, running
Service ktune: enabled, running
For RHEL 7, the default profile is "throughput-performance". There are many other profiles that you can select; to view more information on these profiles, visit the tuned manpage.
To enable dynamic tuning behavior, edit the "dynamic_tuning" parameter in:
You can also configure the amount of time in seconds between tuned checking usage and updating tuning details with the "update_interval" parameter.
Tuned profiles can be viewed and modified in this location
The main files used by a profile are
tuned.conf -- Enables or disables various monitoring and tuning plugins.
ktune.sysconfig -- Enables or disables ktune and selects the disk IO scheduler.
sysctl.ktune -- Modifies various /proc/sys/ files.
ktune.sh -- Shell script that can be started or stopped; can be used to tune settings not found in profiles. Reference /etc/tune-profiles/functions for ideas.
perf
This tool uses hardware performance counters and kernel tracepoints to track the impact of other commands and applications on the system. Various subcommands display and record statistics for common performance events, and the tool allows you to analyze these recordings later.
Using this tool is a little outside of the scope for this wiki, but it's mentioned here in case you really want to get detailed performance analysis.
ss
Command line tool that prints statistics about sockets, which lets you assess socket performance over time. By default, the tool lists open non-listening TCP sockets that have established connections.
numastat
This tool displays memory statistics for processes and the OS on a per NUMA node basis, showing the number of NUMA hits and misses since the system booted.
A large number of misses indicates that processes are not performing as efficiently as they otherwise could. Each miss means extra latency was added, because the request had to skip the NUMA node the process is running on and fetch data from another NUMA node.
Here is an example of numastat output. This is on a single socket server, but if this was a NUMA system, there would be multiple nodes listed. If you see a lot of "numa_miss", then you may want to install and enable "numad" to see if this helps reduce numa_misses.
numastat
                           node0
numa_hit                47140699
numa_miss                      0
numa_foreign                   0
interleave_hit             28491
local_node              47140699
other_node                     0
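A quick way to put hits and misses in perspective is to compute the miss percentage. This sketch uses the counter values from the sample above; on a live box you would parse numastat itself:

```shell
# Counter values taken from the sample numastat output above
numa_hit=47140699
numa_miss=0
# Percentage of memory accesses that had to go to a remote node
miss_pct=$(awk -v h="$numa_hit" -v m="$numa_miss" \
    'BEGIN { printf "%.2f", (h + m > 0) ? 100 * m / (h + m) : 0 }')
echo "numa_miss: ${miss_pct}% of node-local lookups"
```

Anything beyond a few percent sustained is worth investigating with numad or CPU pinning.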
"numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management (and therefore system performance)."
"Depending on system workload, numad can provide up to 50 percent improvements in performance benchmarks. It also provides a pre-placement advice service that can be queried by various job management systems to provide assistance with the initial binding of CPU and memory resources for their processes."
To install and enable numad
yum install numad
chkconfig numad on
service numad start
Section 2: CPU
It's important to understand the CPU(s) that are in the server and the topology that is used. It's hard to understand and tune performance if you do not know the following:
- What CPU is the server using? What is the model and the amount of CPUs?
- Is this an SMP, or NUMA based system?
SMP and NUMA
There are two main types of topology used for modern servers:
Symmetric Multi-Processor (SMP) -- Allows all processors to access memory in the same way / amount of time. All single socket servers utilize this topology. However, it does not scale well, so most larger servers use a NUMA based topology, which includes more than one CPU socket.
Non-Uniform Memory Access (NUMA) -- Developed more recently than SMP. NUMA systems allow for more than one CPU socket. Typically we see 2 and 4 socket systems.
Each socket has its own local memory bank, which is the fastest in terms of access time. Each socket can also utilize remote memory banks, but there is additional latency involved in accessing those remote locations. These remote accesses are also counted as numa_misses.
To determine if a server is SMP or NUMA, you can run the following command:
This will tell you how many nodes the server has. If there is more than one node, then it’s safe to assume that this is a NUMA system.
You can also use:
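The exact commands are not shown in this copy of the page; "numactl --hardware" is the usual choice. As a sketch that needs no extra packages, you can also count the node directories that the kernel exposes in sysfs:

```shell
# Count NUMA node directories exposed by the kernel (sysfs layout assumed)
nodes=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)
[ "$nodes" -eq 0 ] && nodes=1   # no node dirs exposed: treat as a single node
echo "memory nodes: $nodes"
if [ "$nodes" -gt 1 ]; then echo "NUMA system"; else echo "single node (SMP)"; fi
```

More than one node means the scheduler and memory-placement concerns below apply.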
The smallest piece of a process is known as a thread. The system scheduler determines which threads run on which processors, and when. By default the scheduler's main goal is to keep the system as busy as possible, so it may not schedule processes optimally, at least for application performance.
An example of this would be a NUMA system with two nodes. If process A has its data stored in NUMA node A, but the scheduler decides to run process A's thread on NUMA node B, the application will not run as fast as it would if the thread ran on node A.
By using the numastat command listed in the previous section, you should be able to identify if this is a common issue by viewing NUMA hits and misses. If there are a lot of misses, and the application is running slow, then making sure that threads run on the same CPU and memory bank is critical for performance tuning.
In previous versions of RHEL (pre 6), the kernel would interrupt the CPU on a regular basis to check what work needed to be done. This interrupt was used to collect data that was then used for future scheduling and CPU load balancing changes.
With RHEL 6 and 7, the kernel no longer interrupts idle CPUs, which helps to lower power usage on near idle systems.
With RHEL 7 or CentOS 7, there is a new dynamic tickless option, which reduces kernel interference with user-space tasks. It can be enabled on a per core basis; when enabled, all timekeeping activities are moved to a core that is less active, which lowers latency.
Interrupt Request (IRQ) Handling
An IRQ is a signal for immediate attention that is sent from a piece of hardware to the CPU. Each device is assigned one or more IRQ numbers which allow it to send its own unique interrupt. A processor that receives an interrupt will immediately pause whatever thread it is running and wait for the interrupt to complete before it resumes processing.
Because of this, a system with a high amount of interrupts will run in a degraded performance state.
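To see which interrupt sources fire the most, you can sum the per-CPU columns of /proc/interrupts. This sketch runs against a canned two-CPU sample so the parsing is easy to follow; point it at the real file on a live server:

```shell
# Canned two-CPU /proc/interrupts sample (IRQ: cpu0-count cpu1-count type device)
interrupts_sample='
  0:  1234567  1200000   IO-APIC-edge      timer
 19:      500   491234   IO-APIC-fasteoi   eth0
 24:       12       30   PCI-MSI-edge      ahci
'
# Sum the two per-CPU counts and sort busiest first
top_irqs=$(echo "$interrupts_sample" | awk -F: 'NF > 1 {
    split($2, f, " ")
    printf "%10d  irq%s  %s\n", f[1] + f[2], $1 + 0, f[4]
}' | sort -rn)
echo "$top_irqs"
```

A single device dominating this list (commonly a busy NIC) is a good lead when system CPU time is unexpectedly high.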
Section 3: Memory
Physical memory is managed in small chunks called pages. Physical memory is mapped to virtual addresses so that the processor can access it. The page table is the mapping of physical to virtual memory; it is essentially an index used for quickly looking up where something is in memory.
The default size of a page is 4KB. For small amounts of RAM this is not a huge issue; however, for large amounts of RAM the table becomes large and lookups begin to slow down. Because of this, the use of Transparent Huge Pages is recommended. Transparent huge pages are 2MB by default, which makes it much easier for applications to scale if they need large amounts of RAM.
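Some rough numbers show why page size matters. For a 24 GB server like the one in the top examples above, the difference in how many pages the kernel has to track is close to three orders of magnitude:

```shell
# Pages needed to map 24 GB of RAM with 4 KB pages vs 2 MB huge pages
ram_kb=$((24 * 1024 * 1024))    # 24 GB expressed in KB
pages_4k=$((ram_kb / 4))        # 4 KB pages
pages_2m=$((ram_kb / 2048))     # 2 MB huge pages
echo "4 KB pages: $pages_4k    2 MB pages: $pages_2m"
```

Fewer pages means a smaller page table and fewer TLB entries to juggle, which is the whole point of huge pages.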
Translation Lookaside Buffer size
This is a cache of recently used address translations. Walking the page table for every memory access would not scale and would slow down applications, so this cache helps speed up the system.
This does not scale well with large amounts of RAM, so there is an alternative called Huge Translation Lookaside Buffer, which allows memory to be managed in very large segments.
Configuring huge pages
"Huge pages rely on contiguous areas of memory, so it is best to define huge pages at boot time, before memory becomes fragmented. To do so, add the following parameters to the kernel boot command line:"
View the current huge pages value:
cat /proc/sys/vm/nr_hugepages
cat /proc/meminfo | grep Huge
To set the number of huge pages:
echo $amount_of_huge_pages > /proc/sys/vm/nr_hugepages
Alternatively, to make the setting persistent, modify the vm.nr_hugepages value in /etc/sysctl.conf. Make sure you run sysctl -p after you make the changes.
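As a sketch of the arithmetic: with 2 MB huge pages, backing a given amount of memory takes amount / 2 pages. The 2 GB figure below is just an example value:

```shell
# How many 2 MB huge pages are needed to back 2 GB (example figure)
want_mb=2048                    # hypothetical allocation to back with huge pages
hp_size_mb=2                    # default huge page size on x86_64
pages=$((want_mb / hp_size_mb))
echo "vm.nr_hugepages = $pages"  # the line you would add to /etc/sysctl.conf
```

Check the actual huge page size in /proc/meminfo (Hugepagesize) before doing this math on a real server.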
By allowing all free memory to be used as cache, performance is increased. Transparent Hugepages are used by default if /sys/kernel/mm/redhat_transparent_hugepage/enabled is set to always.
Section 4: Storage and File Systems
The IO scheduler determines when and for how long IO operations run on a storage device. This is also known as the IO elevator. There are three schedulers for RHEL 7, and there is a large performance difference between the three schedulers, so picking the correct one is very important for performance.
deadline -- In RHEL7, this is the default scheduler for all block devices other than SATA. This scheduler attempts to keep latency as low as possible, and is the best choice for most use cases. Queued IO operations are put into two categories, read and write. Read IO queues are scheduled more often than write IO since read IO typically causes applications to block while the read is happening.
cfq -- Completely Fair Queueing is the default scheduler for SATA devices. CFQ divides processes into three classes: real time, best effort and idle. Real Time processes are always performed before processes in the Best Effort class, which are performed before processes in the Idle class.
Because of the way that CFQ uses historical data to determine scheduling, there is often a lot of idling going on. CFQ is not efficient at handling SSDs or large RAID arrays. This scheduler should really not be used any more; deadline is almost always the best bet.
noop -- This scheduler uses a simple FIFO (first in, first out) algorithm. This can be the best scheduler for CPU bound systems that use very fast storage.
To find the scheduler of a device.
root@server [~]# cat /sys/block/vda/queue/scheduler
noop anticipatory deadline [cfq]
XFS -- This is the default File system in RHEL 7. "XFS uses extent-based allocation, and features a number of allocation schemes, including pre-allocation and delayed allocation, both of which reduce fragmentation and aid performance."
EXT4 -- "Ext4 is a scalable extension of the ext3 file system. Its default behavior is optimal for most workloads. However, it is supported only to a maximum file system size of 50 TB, and a maximum file size of 16 TB."
Mount Options which affect performance
Barriers -- File system barriers make sure that meta data is correctly written and ordered on persistent storage. Barriers also make sure that data transmitted with fsync persists even after a power outage.
Before RHEL 7, there used to be a large performance hit with barriers enabled, however with RHEL 7, performance loss is less than 3% with this option enabled.
Access Time -- Every time a file is read, its metadata is updated with the time that the access occurred (atime). This involves additional write IO. Usually this is not a huge deal, however on very active systems this can start to add up and reduce performance.
If accurate access times are not needed, you can disable them with the noatime option when mounting the volume.
File system maintenance
It is a good idea to discard blocks that are no longer in use by the filesystem for SSDs and thinly-provisioned storage. There are two methods of discarding unused blocks:
Batch discard -- This is part of the fstrim command. You can manually run this command, or use a cron if you wish, but you can also set RHEL 7 to handle physical discard operations by putting values in the following:
Both should be set to a non zero value if you want to utilize this type of discard.
HDD: /sys/block/devname/queue/discard_max_bytes
SSD: /sys/block/sda/queue/discard_granularity
Online discard -- This type of discard is configured at mount time with the discard option. This runs in real time without user intervention, however it only discards blocks that transition from used to free.
Red Hat recommends using batch discard over online discard in most cases.
blktrace -- Provides information about how time is spent in the IO subsystem.
blkparse -- Reads the raw output from blktrace and provides a human readable summary of input and output operations recorded by blktrace.
Tuned and tuned-adm -- Provide a number of profiles which help to improve performance for specific types of workloads. The two most relevant profiles for IO and storage are:
For systems that require fast IO and low latency, choose the latency-performance profile. For systems that require high throughput instead of lower latency, use the throughput-performance profile.
tuned-adm profile $profile_name
Setting the default IO scheduler -- This is the scheduler that is used by default if no scheduler is specified in a device's mount options. To change this on boot, edit /etc/grub2.conf and append this to the kernel command line:
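The parameter itself is omitted above; it is the "elevator=" kernel option. As a sketch, appending it to a (hypothetical) kernel line looks like this:

```shell
# Hypothetical kernel line from /etc/grub2.conf, with elevator= appended
kernel_line='kernel /vmlinuz-2.6.32 ro root=/dev/mapper/vg-root rhgb quiet'
kernel_line="$kernel_line elevator=deadline"
echo "$kernel_line"
```

The change takes effect on the next boot; verify against your distribution's grub documentation before editing the real file.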
Configure IO scheduler on per device basis -- If you wish to use different schedulers for different devices, you can modify the following file to change this on the fly.
echo deadline > /sys/block/$hda/queue/scheduler
$hda should be replaced with the actual device.
Tuning the deadline scheduler
Deadline puts more priority on READ operations than it does WRITE operations. After READ batches are processed, deadline checks to see how long the WRITE operations have been starved for, and processes them accordingly.
There are a few parameters that can be tuned to change how the scheduler behaves.
fifo_batch -- The number of read or write operations to issue in a single batch. Default value is 16. Raising this to a higher value can increase throughput, but it will do so at the expense of higher latency.
front_merges -- If your workload will never generate front merges, this can be set to 0. Unless you have measured the overhead and 100% understand how the application works, please leave this alone. Default value is 1.
read_expire -- The number in milliseconds in which a read request should be scheduled for service. The default value is 500 (0.5 seconds).
write_expire -- The number in milliseconds in which a write request should be scheduled for service. The default value is 5000 (5 seconds).
writes_starved -- The number of read batches that can be processed before processing a write batch. The higher this value is set, the greater the preference given to read batches.
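All of these tunables live under the device's iosched directory in sysfs. Here is a small sketch that prints the current values, assuming the usual /sys/block/<dev>/queue/iosched layout (sda is an example device name, and the files only exist while the deadline scheduler is active):

```shell
# Print the current deadline tunables for a device, if they are exposed
dev=sda                               # example device name
dir=/sys/block/$dev/queue/iosched
params=$(for p in fifo_batch front_merges read_expire write_expire writes_starved; do
    if [ -r "$dir/$p" ]; then
        echo "$p = $(cat "$dir/$p")"
    else
        echo "$p: not readable on this host"
    fi
done)
echo "$params"
```

Writing a new value is the mirror image, e.g. echo 32 > $dir/fifo_batch (as root), and it applies immediately without a reboot.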
Tuning the noop scheduler
"The noop I/O scheduler is primarily useful for CPU-bound systems using fast storage. Requests are merged at the block layer, so noop behavior is modified by editing block layer parameters in the files under the /sys/block/sdX/queue/ directory. "
add_random -- Some IO events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead becomes an issue.
max_sectors_kb -- Specifies the max size of an IO request in KB. The default is 512. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of max_hw_sectors_kb.
Some solid state disks perform poorly when I/O requests are larger than the internal erase block size. In these cases, Red Hat recommends reducing max_hw_sectors_kb to the internal erase block size.
nr_requests -- “Specifies the maximum number of read and write requests that can be queued at one time. The default value is 128; that is, 128 read requests and 128 write requests can be queued before the next process to request a read or write is put to sleep.
For latency-sensitive applications, lower the value of this parameter and limit the command queue depth on the storage so that write-back I/O cannot fill the device queue with write requests. When the device queue fills, other processes attempting to perform I/O operations are put to sleep until queue space becomes available. Requests are then allocated in a round-robin fashion, preventing one process from continuously consuming all spots in the queue. “
optimal_io_size -- “Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible. “
read_ahead_kb -- “Defines the number of kilobytes that the operating system will read ahead during a sequential read operation in order to store information likely to be needed soon in the page cache. Device mappers often benefit from a high read_ahead_kb value; 128 KB for each device to be mapped is a good starting point. “
rotational -- “Some solid state disks do not correctly advertise their solid state status, and are mounted as traditional rotational disks. If your solid state device does not set this to 0 automatically, set it manually to disable unnecessary seek-reducing logic in the scheduler.”
rq_affinity -- “By default, I/O completions can be processed on a different processor than the processor that issued the I/O request. Set rq_affinity to 1 to disable this ability and perform completions only on the processor that issued the I/O request. This can improve the effectiveness of processor data caching.”
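A sketch of adjusting the block-layer parameters above through sysfs (sda is an assumed device name; the writes require root):

```shell
# Stop disk IO from feeding the entropy pool if the overhead matters.
echo 0 > /sys/block/sda/queue/add_random

# Shrink the request queue for a latency-sensitive workload (default 128).
echo 64 > /sys/block/sda/queue/nr_requests

# Read ahead 128 KB during sequential reads.
echo 128 > /sys/block/sda/queue/read_ahead_kb

# Mark a mis-detected SSD as non-rotational, and pin IO completions
# to the CPU that issued the request.
echo 0 > /sys/block/sda/queue/rotational
echo 1 > /sys/block/sda/queue/rq_affinity
```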
File system metadata and striped arrays
If the filesystem metadata is not aligned with the array's chunk boundaries, a single metadata write may be split between two disks in the array, turning one write into two.
On striped arrays, a chunk of data of CHUNKSIZE is written to a single disk before moving to the next disk in the array. Once all disks have been used, the array will return to the first disk again.
If the file system is not lined up correctly on this array, then additional requests might be made to more than one disk, which slows down performance. Additionally, if all the metadata ends up going to a single disk in the array, and not the others, then this single disk may become a hotspot and slow down the entire array.
To counter this type of activity, you need to understand the following:
- The File system block size that will be used
- The chunk size of the array
- The number of disks in the array
- The number of parity disks in the array
The next step is to calculate the filesystem stride and stripe-width
stride = # of filesystem blocks that fit inside one chunk.
Example: if the filesystem block size is 4K and the chunk size is 64K, the stride is 64/4 = 16 blocks.
stripe-width = # of filesystem blocks that fit in one stripe of the RAID array. To determine this, you need to know how many disks in the array actually carry data blocks.
Example: a 6-disk RAID6 has two disks for parity, so 6 - 2 = 4 disks carry block data. Each of the 4 disks holds stride blocks per chunk: 4 (disks) x 16 (stride) = 64 filesystem blocks per stripe.
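The arithmetic above can be scripted; a minimal sketch using the same 4K block, 64K chunk, 6-disk RAID6 numbers:

```shell
BLOCK_KB=4     # filesystem block size in KB
CHUNK_KB=64    # array chunk size in KB
DISKS=6        # total disks in the array
PARITY=2       # parity disks (RAID6)

STRIDE=$((CHUNK_KB / BLOCK_KB))          # 64 / 4 = 16
DATA_DISKS=$((DISKS - PARITY))           # 6 - 2  = 4
STRIPE_WIDTH=$((DATA_DISKS * STRIDE))    # 4 x 16 = 64

echo "stride=$STRIDE stripe-width=$STRIPE_WIDTH"
```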
To create a file system that aligns with the array:
mkfs.ext4 -E stride=16,stripe-width=64 /dev/$my_array
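To confirm the values were recorded, you can read them back from the superblock with tune2fs (part of e2fsprogs; the device name follows the example above):

```shell
# Prints the "RAID stride:" and "RAID stripe width:" superblock fields.
tune2fs -l /dev/$my_array | grep -i raid
```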
Section 5: Networking
The networking subsystem in RHEL 7 consists of many interdependent parts, so manual tuning is not advised unless you fully understand what you are doing. RHEL 7 does a lot of optimization on the fly and usually does a good job at optimizing performance; however, there may be some cases where manual tuning is needed.
Understanding how a packet is received and processed in RHEL
1) A packet sent to the server is received by the Network Interface Card (NIC) and placed in either an internal hardware buffer, or a ring buffer.
2) The NIC then sends a hardware interrupt request, which prompts the creation of a software interrupt operation to handle the interrupt request.
As part of this operation, the packet is transferred from the buffer to the network stack.
3) Depending on the packet, and network configuration, the packet is forwarded, discarded, or passed to a socket receive queue for an application, and is then removed from the network stack.
4) This process continues to happen until there are no packets left in the NIC hardware buffer, or a certain number of packets are transferred (specified in /proc/sys/net/core/dev_weight)
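A few commands to inspect the pieces involved in the steps above (ethtool and the proc path are standard; eth0 is an assumed interface name):

```shell
# Show the current and maximum hardware ring buffer sizes for the NIC.
ethtool -g eth0

# Grow the receive ring toward its hardware maximum (requires root;
# the supported maximum is driver-specific).
ethtool -G eth0 rx 4096

# The per-poll packet budget referenced in step 4.
cat /proc/sys/net/core/dev_weight
```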
Bottlenecks in packet reception
There are a few places during network packet processing that can become a bottleneck and reduce performance.
The NIC hardware buffer or ring buffer -- This might be a bottleneck if a lot of packets are being dropped, however this is not always the case.
The hardware or software interrupt queues -- Too many interrupts can cause increased latency and processor contention
The socket receive queue for the application -- A bottleneck in an application's receive queue is indicated by a large number of packets that are not copied to the requesting application, or by an increase in UDP input errors (InErrors) in /proc/net/snmp
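To find out which stage is dropping packets, each layer exposes its own counters (eth0 is an assumed interface name; exact counter names vary by driver):

```shell
# NIC hardware / ring buffer drops.
ethtool -S eth0 | grep -i drop

# Softirq backlog drops: the second hex column of each per-CPU row.
cat /proc/net/softnet_stat

# UDP socket receive queue overflows appear as InErrors here.
grep Udp: /proc/net/snmp

# Summarized protocol error counters.
netstat -su | grep -i error
```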
The net.ipv4.tcp_low_latency setting optimizes the TCP stack for either high throughput or low latency. By default it is set to 0 (favor throughput). Change it to 1 if you prefer lower latency over high throughput.
sysctl -w net.ipv4.tcp_low_latency=1
You can use qperf $host tcp_lat to test out the changes.
Section 6: Real World Examples
Section 7: Troubleshooting
A lot of useful information can often be found in dmesg:
If there is a hardware error or other issue, it will often show up in dmesg. It never hurts to run this command and look over the output!
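For example, to scan the kernel ring buffer for likely trouble (-T prints human-readable timestamps; reading dmesg may require root on hardened systems):

```shell
dmesg -T | grep -iE 'error|fail|warn'
```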
strace -- Useful for finding what system calls a process is making, and how much time is spent on each system call.
ltrace -- Useful for finding what library calls a process is making.
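Both tools are quick to try; a sketch with strace for system calls and ltrace for library calls (the PID and traced commands are placeholders):

```shell
# Summarize system calls and time spent per call for a running process.
strace -c -p 1234

# Or trace a command from launch instead of attaching.
strace -c ls /tmp

# Show the library calls a command makes.
ltrace ls /tmp
```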