Cloud Performance Tuning


Server Performance / Resource Analysis Overview

If a website owner complains about a slow server or website and you want to quickly rule out the server as the source of the problem then follow the steps below. The general process involves:

  • 1) Check server resources (CPU, RAM, DISK) to find out whether they are saturated and causing a bottleneck that leads to poor performance.
  • 2) Determine how much spare capacity there is. If it's a slow website issue and the server has spare RAM, then you should be enabling caches, or increasing their size if needed.
  • 3) At this point you should know whether the server is overloaded. If it isn't, then you should investigate application settings / configurations.


Does this sound like a good plan to you? I hope so, otherwise why the hell are you reading this in the first place?

View and note server resource usage

Step 1: Use top

Top provides a dynamic view of the processes running on a Linux system. By default, the processes are ordered based on the amount of CPU % they are using, so the most CPU intensive processes will always be listed at the top. This command is very useful for getting a quick idea of what is running on the system. This should be one of the first commands to run when investigating a server with high load. Top does not do the best job at displaying all running processes as there is a slight delay between refreshes. Because of this I highly recommend you also use "ps faux" to get a full list of processes. Sometimes a process may only run for a very short period of time, but use a lot of CPU resources, or cause a massive IO wait spike. If you just use top you may not notice this.

Some red flags to look out for...

  • %wa is consistently above 10% - 20%. Typically this means that the storage device / array is slowing down the server. This value displays the percentage of time the CPU spends waiting on the storage system to process a request. In general, the lower this value is, the more responsive the server will be. Large values here mean that either the server is swapping, which slows down disk IO, or the disk is simply being pushed to its limits. If this is the case then you should make sure there are no rogue or wasteful processes running on the server. If the server is not swapping, and the application is performing a typical workload, then it may be a good idea to suggest an upgrade to either an SSD or an SSD RAID array to help improve performance.

Or just host with Liquid Web's StormVPS which I designed to be fast as hell :D If you don't believe me at this point, go work for AWS, I don't care.

  • Please keep in mind that a "high" load average is not necessarily a bad thing. High load averages usually mean that the server is busy and handling a lot of work, which is usually ideal. If there is not a high IO wait value or Swap usage, and there are no suspicious processes using a ton of CPU, then there is not really an issue with the server. At some point the client's workload simply outgrows the server. If that doesn't happen, then the site is shrinking in terms of viewers. Personally I like growing sites. If a website is consistently popular and consistently growing, suggesting an upgrade is A VERY GOOD IDEA.
  • %us is the CPU time spent in the User Space. This includes most applications, specifically, anything that is NOT in Kernel Space.
  • %sy is the CPU time spent in Kernel Space. This does not include most applications, and is only the amount of time the CPU spends doing Kernel things.


The first thing I like to do is run "top -c" and look at current resource utilization. Usually I'll watch top for at least a few seconds to get an idea of server activity.

top -c


You will want to find out the following information

a) Record the value for "wa". Is it greater than 20%?
b) Record the values for "us" and "sy". Is the combined value greater than 80%?
c) Record any processes that are near the top of the list.

  • wa displays the percent of time that the CPU waits for the disk to do something. Typically values that are lower than 10% - 20% are ok, but anything higher is a sign that the disks are potentially a bottleneck and causing reduced performance.
  • us and sy display the percent of time that the CPU spends in user land and in system land. Most of the time the CPU will spend more time in user land, because that's where applications like apache and mysql run.

I'm using my personal VPS as an example, so the values are fairly low because my server is well optimized and only has 2 websites hosted on it. If you see values like this then there's a very good chance that CPU and DISK are not a bottleneck.

wa = 0.0
us = 0.2
sy = 0.0

It's always a good idea to look at the top processes, because this gives you an idea of the workload the server is dealing with. In this case the top processes are barely using any resources.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                             
 3143 root      20   0  363076  13136   3688 S   0.7  1.3   0:33.19 /opt/draios/bin/dragent --daemon --dragentpid=/var  
  925 mysql     20   0 1214968  91276   2128 S   0.3  9.2   1:13.19 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib

At this point we have confirmed that:

  • Resource utilization is very low, and the server has plenty of resources to spare
  • The most active processes are using reasonable amounts of resources.
  • The server is probably not running out of memory because if it was, it would be swapping, which will cause higher than normal IO (which the "wa" value will show)

Step 2: Use vmstat

The Virtual Memory Statistics tool, known as vmstat, provides reports on the system's processes, memory, paging, block IO, interrupts and CPU activity. You can change the sample time to get near real time updates on system activity. vmstat is one of my favorite utilities because of its flexibility. Oftentimes I find it to be more useful than top because it allows you to view real time, or one off, reports of all the main Linux subsystems. I especially like the way it displays si (Swap In) and so (Swap Out), which are significantly more useful than just viewing swap usage. If your server is using half its swap space, but only needs to swap in / out every once in a while, that's perfectly fine. But if your server is constantly swapping in and out and you see reduced performance, odds are you need to upgrade to more RAM or move some processes / services off the server to avoid constant SWAP activity.


The next thing to check is active swap usage. Sometimes even if there is plenty of RAM to spare, Linux still wants to use SWAP. This is TOTALLY OK as long as Linux is not actively swapping in and out a ton of data. Active swapping causes latency which can slow down applications and websites. To find out what SWAP activity looks like, run "vmstat 1" and watch the output for a few seconds.

vmstat 1

We are looking at the SWAP columns, specifically for very high amounts of "si" and "so". Some activity is not a problem but if you are seeing 99999956756756799 si and 251576575614561 so constantly that is a sign that either the server needs more RAM, or application settings need to be adjusted to conserve ram.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 518216 118772  21580 414292    0    1   272    31   84   49  1  0 99  0  0
 0  0 518216 118756  21588 414292    0    0     0    12  231  442  0  0 100  0  0
 0  0 518216 118756  21588 414292    0    0     0    16  191  384  0  0 100  0  1
 0  0 518216 118756  21588 414292    0    0     0     0  201  398  0  0 99  0  0
 0  0 518216 118756  21588 414292    0    0     0    84  217  414  0  1 100  0  0
 0  0 518216 118756  21588 414292    0    0     0     0  196  391  0  0 100  0  0

si - Swap in: memory read back in from swap space on disk, in KB per second

so - Swap out: memory written out to swap space on disk, in KB per second

bi - Blocks in: blocks read from a block device, per second

bo - Blocks out: blocks written to a block device, per second

wa - The percentage of CPU time spent waiting for IO operations to complete.

At this point we've confirmed that there's no active swapping.
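
If you want a quick yes / no answer instead of watching columns scroll by, here is a rough sketch. swap_activity is a made-up helper name, not a vmstat feature. It assumes the usual vmstat layout where the second header row names the columns, and it skips the first data row because vmstat reports averages since boot on that row.

```shell
# Hypothetical helper: reads vmstat output on stdin and counts how many
# samples showed any swap-in (si) or swap-out (so) traffic.
swap_activity() {
    awk '
    $1 == "r" {                        # column header row: locate si / so by name
        for (i = 1; i <= NF; i++) {
            if ($i == "si") si_col = i
            if ($i == "so") so_col = i
        }
        next
    }
    si_col && $1 ~ /^[0-9]+$/ {        # data rows only
        if (++rows == 1) next          # first row is the since-boot average
        samples++
        if ($si_col > 0 || $so_col > 0) busy++
    }
    END { printf "samples=%d swapping=%d\n", samples, busy }'
}

# Usage on a live system (one sample per second, 11 rows in total):
#   vmstat 1 11 | swap_activity
```

If "swapping" is close to "samples" every time you run it, the server is actively swapping and it is worth moving on to the RAM/SWAP checks.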

Let's see how much spare ram there is

vmstat -s -S M

Usually I look at "used memory" and subtract that from "total memory" to get an idea of how much RAM we have to play with. Linux will "use" as much RAM as it can for buffers and caches, but usually it's best to allocate this ram for Memcached and other application / LAMP caches.

          969 M total memory
          427 M used memory
          302 M active memory
          465 M inactive memory
          109 M free memory
           23 M buffer memory
          408 M swap cache
         1999 M total swap
          506 M used swap
         1493 M free swap
       252703 non-nice user cpu ticks
         8242 nice user cpu ticks
        84652 system cpu ticks
     49428723 idle cpu ticks
        26620 IO-wait cpu ticks
            1 IRQ cpu ticks
          350 softirq cpu ticks
        66533 stolen cpu ticks
    135097813 pages paged in
     15489170 pages paged out
        43145 pages swapped in
       171392 pages swapped out
     41757223 interrupts
     67516216 CPU context switches
   1445284938 boot time
       363668 forks
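
The "total minus used" arithmetic can be sketched as a one-liner. spare_ram_mb is a made-up name, and it assumes the output was produced with "-S M" so the first field of each line is a value in megabytes. Treat the result as an upper bound, since part of it is page cache that Linux is already putting to work.

```shell
# Hypothetical helper: reads "vmstat -s -S M" output on stdin and prints
# total memory minus used memory, in MB.
spare_ram_mb() {
    awk '
    /total memory/ { total = $1 }
    /used memory/  { used = $1 }
    END { print total - used }'
}

# Usage on a live system:
#   vmstat -s -S M | spare_ram_mb
```

Fed the sample output above, this prints 542 (969 - 427).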

At this point we know:

  • There's well over 90% CPU to spare
  • There's well over 90% DISK IO to spare
  • There's almost no SWAP activity and we have about 300MB of RAM to play with

If you are seeing the opposite and feel that the server is performing horribly and want help, have you thought about performance tuning?

  • If MySQL is one of the top offenders, make sure that /etc/my.cnf has sane settings.
  • If Apache is one of the top offenders, view httpd.conf settings and ensure that you're using the EVENT MPM.
  • If PHP is one of the top offenders, make sure that you're using FCGI, PHP-FPM or HHVM.

Step 3: Use sar

The System Activity Reporter, known as sar, collects and reports information about Linux system activity. This information is collected throughout the day and, by default, displayed in 10 minute intervals. This can be used to view historical CPU usage, IO wait, RAM usage and many other metrics. This tool is very useful for determining if server load is consistently an issue, or if it only happens at certain times of the day.

Some red flags to look out for with sar....

  • Consistently high %iowait. If this value is constantly high, this is a sign that the server is IO bound. This will increase the server load because the CPU is spending a lot of time waiting for IO requests to complete. If activity is bursty, then look at crons or traffic patterns to see what is causing the IO. IO wait can significantly increase CPU load, so identifying periods with IO wait is a good start to determining why performance is reduced.
  • Consistently low %idle. If this value is below 5%, then it's safe to assume that CPU utilization is a concern. Keep in mind that high IO wait will cause CPU usage to appear high. If there is low IO wait and high CPU usage, then odds are there are some processes running that are extremely CPU intensive. If this is consistently the case, and IO is not an issue, then a CPU upgrade is worth suggesting.
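
Scanning a full day of samples for those two red flags by eye gets old fast, so here is a sketch of how it could be filtered. busy_intervals is a made-up name. It assumes the CPU report from "sar -u" with %iowait and %idle columns, and the 20% IO wait cutoff is my assumption borrowed from the top section (the 5% idle cutoff is the one from the list above).

```shell
# Hypothetical helper: reads "sar -u" output on stdin and prints the
# intervals where IO wait is above 20% or idle is below 5%.
busy_intervals() {
    awk '
    /%iowait/ {                        # header row: locate columns by name
        for (i = 1; i <= NF; i++) {
            if ($i == "CPU")     cpu = i
            if ($i == "%iowait") io  = i
            if ($i == "%idle")   id  = i
        }
        next
    }
    io && $1 != "Average:" && NF >= id && $id ~ /^[0-9.]+$/ {
        if ($io > 20 || $id < 5) {
            ts = $1                    # timestamp may be "12:20:01 AM" or "00:20:01"
            for (j = 2; j < cpu; j++) ts = ts " " $j
            print ts, "iowait=" $io, "idle=" $id
        }
    }'
}

# Usage on a live system:
#   LC_ALL=C sar -u | busy_intervals
```

No output means no interval crossed either threshold. A handful of lines clustered around the same time of day usually points at a cron job or a traffic spike.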


We know that currently the server is not overloaded, but we don't know if this is always the case. To get a better idea of this I like to use sar, which will display the stats we've looked at previously but for an entire day.

Usually I don't like to use Load Average as a measurement, but it does give you a quick idea if RAM, CPU or DISK have been stressed at any point during the past 24 hours.

sar -q

Here's a sample of the output. The VPS I'm using has 2 vCPUs, so if you see a Load Average that is higher than 2 for consistent periods, note the time and investigate further.

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
12:10:01 AM         2       337      0.00      0.01      0.05         0
12:20:01 AM         0       337      0.02      0.02      0.05         1
12:30:01 AM         4       346      0.00      0.01      0.05         3
12:40:01 AM         1       336      0.00      0.02      0.05         0
12:50:01 AM         2       337      0.00      0.01      0.05         0
01:00:01 AM         1       339      0.00      0.01      0.05         0
01:10:01 AM         2       336      0.00      0.01      0.05         0
01:20:01 AM         0       336      0.02      0.02      0.05         1
01:30:01 AM         2       340      0.00      0.01      0.05         1

If the output looks like my example then there's nothing really going on on the server.
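
To pick out the busy periods automatically, something like the sketch below could work. overloaded_periods is a made-up name, and comparing the 5 minute load average against the CPU count is my choice for illustration. It assumes "sar -q" prints an "ldavg-5" column like the sample above.

```shell
# Hypothetical helper: reads "sar -q" output on stdin and prints every
# sample whose 5 minute load average exceeds the CPU count given as $1.
overloaded_periods() {
    awk -v cpus="$1" '
    /ldavg-5/ {                        # header row: locate the column by name
        for (i = 1; i <= NF; i++) if ($i == "ldavg-5") col = i
        next
    }
    col && $1 != "Average:" && $col ~ /^[0-9.]+$/ && $col + 0 > cpus + 0 { print }'
}

# Usage on a live system (nproc prints the number of available CPUs):
#   LC_ALL=C sar -q | overloaded_periods "$(nproc)"
```

With the sample output above and a CPU count of 2, this prints nothing, which matches the "nothing really going on" conclusion.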

Step 4: Determine what to look at next

  • If you noticed high DISK utilization, please go to the Disk Analysis section below
  • If you noticed high RAM/SWAP utilization, please go to the RAM/SWAP Analysis section below
  • If you noticed high CPU utilization, please go to the CPU Analysis section below


If you are not seeing any evidence of resource issues then congrats! You just determined that there is probably not an issue with resources!

If you are running a website or application and it's running slowly, you should check application configurations and utilize caching wherever possible.

Disk Analysis

Run iotop and identify the processes that are generating the most IO. Is this IO read or write?

iotop

  • If MySQL is one of the top offenders, make sure that /etc/my.cnf has sane settings.
  • If Apache is one of the top offenders, view httpd.conf settings and ensure that you're using the EVENT MPM.
  • If PHP is one of the top offenders, make sure that you're using FCGI, PHP-FPM or HHVM.

If nothing can be tuned, and everything looks good other than a lack of Liquid Web Zone C servers, then it is time to talk about an upgrade.

Please offer performance tuning suggestions before suggesting an upgrade. If you do suggest one, spell it out: What do they need? Why do they need it? What do you recommend to solve the issue? Will the upgrade give the site room to grow?

RAM/SWAP Analysis

Create a file named mem_use.sh, paste the script below into it, save it, and make it executable by running chmod +x mem_use.sh

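#!/bin/bash
# Usage: ./mem_use.sh <process name>. Sums the resident memory (RSS, in KB) of every process with that exact command name.
ps -C "$1" -o rss= | awk '{ count++; sum += $1 } END { if (!count) { print "No matching processes found"; exit 1 }; print "Number of processes =", count; print "Memory usage per process =", sum/1024/count, "MB"; print "Total memory usage =", sum/1024, "MB" }'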

Run the commands below to find out what the most common processes are using in terms of memory.

./mem_use.sh php
./mem_use.sh httpd
./mem_use.sh mysqld

  • If MySQL is one of the top offenders, make sure that /etc/my.cnf has sane settings.
  • If Apache is one of the top offenders, view httpd.conf settings and ensure that you're using the EVENT MPM.
  • If PHP is one of the top offenders, make sure that you're using FCGI, PHP-FPM or HHVM.


If nothing can be tuned, and everything looks good other than a lack of RAM and swapping of epic proportions, then it is time to talk about an upgrade.

Please offer performance tuning suggestions before suggesting an upgrade. If you do suggest one, spell it out: What do they need? Why do they need it? What do you recommend to solve the issue? Will the upgrade give the site room to grow?


CPU Analysis

Use top to identify the processes that are using the most CPU, and vmstat to see overall CPU utilization.

Using ps can help to identify processes that may not show up in top.

ps faux


  • If MySQL is one of the top offenders, make sure that /etc/my.cnf has sane settings.
  • If Apache is one of the top offenders, view httpd.conf settings and ensure that you're using the EVENT MPM.
  • If PHP is one of the top offenders, make sure that you're using FCGI, PHP-FPM or HHVM.

If CPU usage is constantly greater than 80% and the processes are legitimate and properly tuned, with no IO wait or SWAP, then odds are you need a faster CPU!

Please offer performance tuning suggestions before suggesting an upgrade. If you do suggest one, spell it out: What do they need? Why do they need it? What do you recommend to solve the issue? Will the upgrade give the site room to grow?