KVM / Xen

From wiki.mikejung.biz

Libvirt and QEMU Performance Tweaks for KVM Guests

I've found that with SSD backed VMs, the best configuration options to use with the KVM guest are:

  • cache=directsync
  • aio=native
  • x-data-plane=on (now known as iothreads)

In my testing, no other combination of QEMU guest settings provides better random read and write performance. You cannot use writeback caching with x-data-plane (iothreads), and directsync seems to be the safest and fastest cache option to use on a KVM host. With SSD backed KVM guests, you normally want to avoid enabling both the host and guest page cache, and instead write data straight through to the underlying SSD(s). Native IO seems to be the way to go, as threads seems to be slower in terms of pure IOPS, although IOPS isn't always the most important thing to measure. I generally go with native IO if the host kernel supports it.

You will also get a nice performance boost if the KVM host is using a newer Linux Kernel. There is a very large performance gain when you use Linux Kernel 3.16 or later. This is because the newer Linux kernels utilize blk-mq, which allows for multiple software and hardware queues per block device.

virtio Block Device Driver Tweaks

virtio-blk iothreads (x-data-plane)

iothreads overview

Each iothread object creates its own thread outside of the QEMU global mutex, and block devices can be assigned to it. Previously all block devices had to share the same global mutex with everything else in QEMU, so if the guest received a ping while it was reading data, the read picked up extra latency waiting on the lock. Iothreads solve a lot of problems, and significantly speed up guest virtio-blk I/O.

By having their own thread, virtio-blk devices can read and write as much as they want without waiting in a queue behind everything else, like a random ping from a bot. Those small pauses add up when a lot is going on, so sparing your storage from them works wonders for performance. Since disk I/O no longer slows down the rest of the guest, everything else speeds up in return. You don't just get better disk performance, you get better everything performance.

You can also pin specific iothreads to specific CPUs, so if you have already pinned a guest vCPU to a host CPU, you could choose to give the guest block device its own thread on a separate CPU from the guest vCPU. You can also pin the iothread and the guest vCPU to the same host CPU. In addition to pinning, you can change the ratio of threads to devices. You could have 4 block devices share a single thread, or you could give each device its own thread.

  • Libvirt 1.2.8+ supports iothreads
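
As a rough sketch of the pinning and thread-to-device ratio described above, the shell function below prints a libvirt domain XML fragment that defines a number of iothreads and pins each one to its own host CPU. The function name and its two arguments are made up for this example. The <iothreads> and <iothreadpin> elements come from libvirt's domain XML format, but check that your libvirt version accepts them before relying on this.

```shell
# Hypothetical helper: print (not apply) a libvirt XML fragment that defines
# COUNT iothreads and pins iothread N to host CPU FIRST_CPU + N - 1.
iothread_fragment() {
  count="$1"
  first_cpu="$2"
  echo "<iothreads>${count}</iothreads>"
  echo "<cputune>"
  i=1
  while [ "$i" -le "$count" ]; do
    # libvirt numbers iothreads starting from 1
    echo "  <iothreadpin iothread='${i}' cpuset='$((first_cpu + i - 1))'/>"
    i=$((i + 1))
  done
  echo "</cputune>"
}

# Example: two iothreads pinned to host CPUs 4 and 5.
iothread_fragment 2 4
```

To attach a disk to one of these threads, add iothread='1' (or another thread number) to that disk's <driver> element. Pointing several disks at the same number makes them share a thread.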

KVM guests with SSD backed storage were bottlenecked until iothreads came along. The issue is that block IO has to pass through the emulator thread along with everything else, which means there is a lot of waiting around. This really bums out SSDs because they are not being used to their full extent. The option that gets around this was called "x-data-plane" in the beginning, and is now called "iothreads".

I've performed a few tests with and without iothreads, and I saw massive performance gains for random read and write workloads. In some cases I've seen around 10 times the performance, seriously, like going from 30,000 to 300,000 IOPS, simply by enabling iothreads. I did those tests about a year ago when it was still called x-data-plane, so I would imagine that performance will increase as new patches come out. The people working on KVM / QEMU are heroes in my book!

How to configure iothreads for KVM guests

To check if data plane is enabled for a guest, run this command (replace $guest with the actual guest name)

virsh qemu-monitor-command --hmp $guest 'info qtree' | less

Alternatively, you can check with this command:

ps fauxx|grep qemu
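
Paging through the whole qtree by hand gets tedious, so here is a small sketch of a filter that keeps only the lines mentioning x-data-plane or iothread. The function name is made up, and the exact layout of the 'info qtree' output varies between QEMU versions, which is why it matches on the property names only.

```shell
# Hypothetical helper: read `info qtree` text on stdin and print only the
# lines that mention x-data-plane or iothread.
# Returns non-zero (grep's exit status) when no such line is found.
dataplane_props() {
  grep -E 'x-data-plane|iothread'
}

# Usage against a live guest (needs libvirt, so it is left commented out):
#   virsh qemu-monitor-command --hmp "$guest" 'info qtree' | dataplane_props
```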

You can also use qemu-system-x86_64 -device virtio-blk-device,help to view the current settings. If iothreads is an option then you should be able to enable it for block devices.

qemu-system-x86_64 -device virtio-blk-device,help

For QEMU 2.2.1, this is what the default settings and options look like for iothreads (previously called x-data-plane).

virtio-blk-device.x-data-plane=bool (on/off)

You can configure a virtio-blk device / guest drive to use iothreads (or in this case, x-data-plane) by starting the guest with the qemu-kvm options listed below.

qemu-kvm -drive if=none,id=drive0,cache=directsync,aio=native,format=raw,file=<$disk_or_$partition> -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on

The virtio-blk-data-plane feature can be enabled or disabled by the x-data-plane=on|off option on the qemu command line when starting the VM Guest:

qemu [...] -drive if=none,id=drive0,cache=none,aio=native,\
format=raw,file=filename -device virtio-blk-pci,drive=drive0,scsi=off,\
config-wce=off,x-data-plane=on [...]

According to the slides linked on this page, using virtio-blk with x-data-plane=on provides the best possible IO with KVM, and my own testing backs this up. For SSD backed VMs, I've seen anywhere between 10 and 20 times higher random IO when using x-data-plane.

"The virtio-blk-data-plane is a new performance feature for KVM. It enables a high-performance code path for I/O requests coming from VM Guests. More specifically, this feature introduces dedicated threads (one per virtual block device) to process I/O requests going through the virtio-blk driver. It makes use of Linux AIO (asynchronous I/O interface) support in the VM Host Server Kernel directly—without the need to go through the QEMU block layer. Therefore it can sustain very high I/O rates on storage setups."

Awesome stuff about iothreads

  • Workaround for live migrations. Possible to disable iothreads during a migration and re-enable after without disruption to the guest!

iothreads limitations include:

  • RAW images only
  • No hotplug support
  • Unable to set IO limits
  • Resize and Snapshots are not thread safe yet

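You should be able to add these QEMU arguments to the KVM guest XML file, right before the </domain> line. This will enable x-data-plane for the block device. For libvirt to accept them, they need to sit inside a <qemu:commandline> element, and the opening <domain> tag needs the xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' attribute. You can also do this with iothreads, which is the same thing as x-data-plane.

  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.scsi=off'/>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
  </qemu:commandline>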

As of QEMU 2.1 virtio-data-plane is more or less ready to go.

QEMU Git history for dataplane http://git.qemu.org/?p=qemu.git;a=history;f=hw/block/dataplane/virtio-blk.c;hb=466560b9fcada2656b276eb30e25da15a6e706df

Source https://www.suse.com/documentation/sled11/book_kvm/data/cha_qemu_running_devices.html

Virtio-blk cache

There are 4 cache mode options for KVM guests. Red Hat's tuning guide mentions that you usually do not want to use both page caches: if you enable write caching on both the guest and the host, you are caching twice for no good reason. There is a guest option called WCE, or write cache enabled, that can be seen if you use QEMU to inspect a running guest. I found that you can toggle WCE and the virtio driver cache setting independently, but later versions of KVM/QEMU seem to configure WCE automatically based on the cache= option you define for the guest.

From the testing I've done with all the various possibilities, I found that cache=directsync was the fastest option to use if the host has local SSD storage and you enable x-data-plane / iothreads.

Cache Type      Uses Host Page Cache    Guest Disk "WCE" on
WriteBack       YES                     YES
None            NO                      YES
WriteThrough    YES                     NO
DirectSync      NO                      NO

Whatever mode you set, make sure every cache layer below the guest gives you the same guarantee. For instance, if you set the guest to use write through, but the host has a RAID card with write cache enabled and no BBU, and disks that are not power loss protected, then the setting is not going to do you much good, since the final two caches are not protected from power loss.

Virtio-blk IO Modes

SUSE docs explaining the cache modes

IBM best practices for KVM performance, does not cover directsync, but it does cover some additional tweaks

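The IO mode is a separate setting from the cache mode. In the libvirt XML it is the io attribute of the <driver> element, which corresponds to QEMU's aio= option. Options are "threads" or "native". From the testing I have done, "native" seems to be the best option for performance. The XML format is: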

      <driver name='qemu' type='raw' cache='directsync' io='native'/>
  • io=threads

By default QEMU sets this value to threads, but if you want to state it explicitly in the XML you can use the format below.

<driver name='qemu' type='raw' cache='none' io='threads'/>
  • io=native

If you are using x-data-plane or writeback cache, you may want to set io to native as it typically offers better performance than using threads. If you are using x-data-plane then you need to use directsync for the cache mode.

<driver name='qemu' type='raw' cache='directsync' io='native'/>

AIO Modes

IBM Docs on using AIO

There are two main modes for guest AIO:

aio=threads (user space thread pool). Default, performs better on filesystems.

aio=native (Linux AIO). Tends to perform better on block devices. Requires O_DIRECT (cache=none/directsync)

So, if you are using a block device, want to use directsync, and native Linux AIO, you would configure the guest to use these settings. You need to make sure you add the correct settings to the right section of the config file, otherwise the QEMU process will start up, but ignore any invalid settings.

      <driver name='qemu' type='raw' cache='directsync' io='native'/>
      <source dev="/dev/LVM/$vol"/>
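
The O_DIRECT requirement for native AIO noted above is easy to get wrong, so here is a minimal sketch of a check that could run before the XML is written. The function name and its behaviour are made up for this example. It only encodes the rule from this page (native needs cache=none or cache=directsync) and prints the matching <driver> line for a raw disk.

```shell
# Hypothetical helper: print a libvirt <driver> line for a raw disk, refusing
# io=native unless the cache mode bypasses the host page cache (O_DIRECT).
driver_line() {
  cache="$1"
  io="$2"
  if [ "$io" = "native" ]; then
    case "$cache" in
      none|directsync) ;;
      *)
        echo "io=native needs cache=none or cache=directsync, got cache=$cache" >&2
        return 1
        ;;
    esac
  fi
  echo "<driver name='qemu' type='raw' cache='${cache}' io='${io}'/>"
}

# Example: the combination recommended on this page.
driver_line directsync native
```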

Virsh Commands

You can use virsh dumpxml to dump an existing KVM guest configuration to an XML file.

virsh dumpxml $guest > $new_guest_config.xml

To stop a running KVM guest, you can use virsh destroy, which will stop the guest, but won't actually "destroy" it.

virsh destroy $instance

You can also use virsh to list all KVM guests, including ones that are not running:

virsh list --all

To console directly into a KVM guest, use the virsh console command.

virsh console $instance

Virsh emulatorpin

To view information on emulator threads and what CPUs they are pinned to, use virsh emulatorpin, followed by the KVM guest ID.

virsh emulatorpin $guest

You can also pin emulator threads to certain CPUs. This could be useful if you want a few dedicated cores on the host to serve the emulator threads.

virsh emulatorpin $guest $host_core-$host_core

Virsh vcpuinfo

View KVM guest vCPU information and CPU affinity

virsh vcpuinfo $guest

Virsh vcpupin

Pin each guest vCPU to its corresponding host CPU (this can improve performance in some cases). For example, if the host has 12 CPU cores and the guest is using all of them, the commands below change each guest vCPU from floating across ALL host CPUs to running on only one specific host CPU. In bare metal instances, this significantly improves performance.

virsh vcpupin guest 0 0
virsh vcpupin guest 1 1
virsh vcpupin guest 2 2
virsh vcpupin guest 3 3
virsh vcpupin guest 4 4
virsh vcpupin guest 5 5
virsh vcpupin guest 6 6
virsh vcpupin guest 7 7
virsh vcpupin guest 8 8
virsh vcpupin guest 9 9
virsh vcpupin guest 10 10
virsh vcpupin guest 11 11
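
Typing one line per vCPU gets old fast, so here is a short sketch of a loop that prints the same 1:1 pinning commands for any vCPU count. The function name is made up, and it only prints the commands instead of running them, so you can look them over first.

```shell
# Hypothetical helper: print the virsh commands that pin vCPU N to host CPU N.
# Nothing is executed; pipe the output to sh once it looks right.
vcpupin_commands() {
  guest="$1"
  vcpus="$2"
  n=0
  while [ "$n" -lt "$vcpus" ]; do
    echo "virsh vcpupin $guest $n $n"
    n=$((n + 1))
  done
}

# Example: the 12 vCPU guest from above.
vcpupin_commands guest 12
```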

You can also set this in the guest XML file so that changes persist.

  <vcpu placement='static'>12</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
    <vcpupin vcpu='8' cpuset='8'/>
    <vcpupin vcpu='9' cpuset='9'/>
    <vcpupin vcpu='10' cpuset='10'/>
    <vcpupin vcpu='11' cpuset='11'/>
  </cputune>

Virtio Device and PCI Settings

For each type of device, there is a corresponding PCI controller, which has a few options you can change. The two main devices are virtio-blk-device and virtio-scsi-device. Both of these devices have some parameters that you can enable / disable. There is also a corresponding PCI controller for each, they are virtio-blk-pci and virtio-scsi-pci.

## To list all device options
/$bin/qemu-system-x86_64 -device help

## To list specific device settings
/$bin/qemu-system-x86_64 -device virtio-blk-device,help

name "virtio-blk-device", bus virtio-bus

/$bin/qemu-system-x86_64 -device virtio-blk-pci,help

name "virtio-blk-pci", bus PCI, alias "virtio-blk"

/$bin/qemu-system-x86_64 -device virtio-scsi-device,help

name "virtio-scsi-device", bus virtio-bus

/$bin/qemu-system-x86_64 -device virtio-scsi-pci,help

name "virtio-scsi-pci", bus PCI


  • virtio-blk is a para-virtualized IO block driver.
  • To configure a device (block device) to use a special driver, and special driver properties, use the -device option, and identify with drive=$options
  • name "virtio-blk-pci", bus PCI, alias "virtio-blk"



Virtio-scsi is designed to be the next gen replacement for the current virtio-blk driver. It provides direct access to SCSI commands and bypasses the QEMU emulator to talk directly to the target SCSI device loaded in the host’s kernel.

Virtio-blk uses a Ring Buffer, which is basically a bunch of slots in a ring shape that the Guest uses to place in IO commands such as READ, WRITE, etc, etc. The Host grabs commands out of the buffer as fast as possible, then executes them on behalf of the Guest, and once the operation is completed, the Host places a “complete” back into the Ring Buffer, and the Guest grabs the “complete” and assumes the operation is complete.

Commands like TRIM or DISCARD can be added to the virtio-blk driver, however QEMU must be updated, and the Guest drivers must also be updated to include the new commands.

All Virtio-blk devices are presented to the Guest as a PCI device. The problem with this is that there is a 1:1 relationship between devices and PCI devices, which causes PCI buses to fill up if many devices are attached. So scaling can be difficult.

Virtio-scsi aims to access many host storage devices through one Guest device, but still only use one PCI slot, making it easier to scale.

QEMU is a user space target option for block devices, which makes it really flexible, but not the fastest.

Vhost-net uses in kernel devices as well, which bypasses QEMU emulation, this improves performance as well.

To enable multi-queue, add this to the guest XML file. Replace $vcpu with the number of vCPUs that the guest has.

<controller type='scsi' index='0' model='virtio-scsi'>
  <driver queues='$vcpu'/>
</controller>

To tell if a KVM guest is actually using virtio drivers, you can run this command from inside the guest, and it should output something like what is shown below.

dmesg | grep virtio

[    0.527020] virtio-pci 0000:00:04.0: irq 40 for MSI/MSI-X
[    0.527034] virtio-pci 0000:00:04.0: irq 41 for MSI/MSI-X
[    0.557013] virtio-pci 0000:00:03.0: irq 42 for MSI/MSI-X
[    0.557028] virtio-pci 0000:00:03.0: irq 43 for MSI/MSI-X
[    0.557041] virtio-pci 0000:00:03.0: irq 44 for MSI/MSI-X
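
The dmesg buffer can rotate on a guest that has been up for a while, so as an alternative sketch you can look at sysfs instead. The function below lists the virtio drivers registered with the guest kernel's virtio bus. The function name and the optional root argument are made up for this example, and the /sys/bus/virtio/drivers path is assumed to be present on any Linux guest with virtio support.

```shell
# Hypothetical helper: list the virtio drivers registered in sysfs.
# ROOT defaults to the real filesystem root and exists only so the function
# can be pointed at a test directory.
virtio_drivers() {
  root="${1:-}"
  dir="$root/sys/bus/virtio/drivers"
  if [ -d "$dir" ]; then
    ls "$dir"
  else
    echo "no virtio bus found under $dir" >&2
    return 1
  fi
}

# Usage inside a guest (left commented out, it fails on non-virtio hosts):
#   virtio_drivers
```

Names such as virtio_blk or virtio_net in the output suggest the guest is using the para-virtualized drivers for disk and network.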


KVM and QEMU Troubleshooting

How to check if the KVM module is loaded in the kernel

To see if the KVM module is loaded, you can simply run lsmod and grep for "kvm". If you get no output at all then you may need to load the KVM module. If you are using an Intel CPU, you should also see "kvm_intel" as a loaded kernel module along with the plain old "KVM" module.

lsmod | grep kvm

To load the KVM module you can run the command below. If this still is not working then odds are you don't have KVM installed.

modprobe kvm
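
Here is a small sketch that narrows the lsmod output down to just the KVM modules, which makes it easier to see whether the vendor specific module is missing. The function name is made up. On AMD hosts the vendor module is kvm_amd rather than kvm_intel, and either one can be loaded with modprobe in the same way as the plain kvm module.

```shell
# Hypothetical helper: read lsmod output on stdin and print which of the
# KVM modules (kvm, kvm_intel, kvm_amd) are loaded, one per line.
kvm_modules() {
  awk '$1 == "kvm" || $1 == "kvm_intel" || $1 == "kvm_amd" { print $1 }'
}

# Usage on a host (left commented out so the block has no side effects):
#   lsmod | kvm_modules
```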

How to restart libvirt on CentOS 7

If you are not sure if libvirtd is running on a CentOS or RHEL server you can run this command to find out:

ps faux | grep -i libvirt

If you don't see the libvirt daemon running, or if you simply want to restart it, use the command below if you are running CentOS 7

systemctl restart libvirtd

Links to Docs


Link to QEMU GIT document that explains IOthreads / dataplane.

Live Migration Overview for KVM guests

KVM Forum 2014 Slides and Videos

Slides are available at the link below.

Slides linked below cover the current state of virtio-scsi, dataplane, and vhost-scsi

virtio-scsi QEMU Userspace

  • Previously performance was limited due to the QEMU lock. Issues scaling with IO.
  • With Kernel 3.17+ scsi-mq helped, however performance was still limited because of scsi_request_fn() which had a lot of locking overhead.

virtio-blk-dataplane QEMU userspace

  • Multithreaded AIO with O_DIRECT context from host userspace. One POSIX thread per device avoids QEMU locking, which helps to improve performance.
  • virtio-blk-dataplane now supports Live Migration.

vhost-scsi KVM Kernel on Host

  • Is able to bypass the second level AIO and O_DIRECT overhead by using LIO, which helps performance.
  • No changes to guest virtio-scsi LLD
  • Currently does not support Live Migration


blk-mq

  • This is a complete rewrite of the block subsystem, meant to better utilize very fast SSD and PCIe / NVM storage.
  • Per CPU software queues are now mapped to pre-allocated hardware queues, this helps IO scale. NUMA is also taken into consideration and queues are smartly placed based on NUMA layout.
  • Merged into Kernel 3.13

scsi-mq

  • Uses blk-mq to bypass scsi_request_fn(), which previously was a bottleneck for IO performance. Improves IO by 4 times compared to previous performance. Merged into Kernel 3.17

NVMe Host Interface specification

  • Meant to standardize the hardware host interface, should allow for a single OS driver to support all types of hardware.
  • Currently has a lot of backing by large companies like Dell, HGST, Intel, LSI, Micron, PMC-Sierra, Samsung, SanDisk and Seagate.
  • Includes a new NVMe command set. Currently has 3 commands.
  • Down the road EXTEND_COPY will be included.

KVM Performance in 2014 and beyond

  • The IO stack in guest instances is no longer the bottleneck for performance.
  • "blk-mq with scsi-mq is the fastest IO stack around". Improved enough that it exposes bottlenecks elsewhere in the PV IO stack.
  • NVMe Host Interface is ready to scale performance beyond flash, and should remain performant for a long time.
  • Expect more undetectable errors now that the IO stack is much faster, nothing new, but something to keep in mind.
  • virtio-blk-dataplane is still limited per device because of the second level O_DIRECT overheads on the host.
  • virtio-scsi-dataplane is also limited per device because of the second level O_DIRECT overheads on the host.
  • vhost-scsi offers almost 2 times better performance than dataplane for random 4K IOPs. Does not yet support live migration though, on the to do list.

Performance listed from highest to lowest.

NVMe passthrough

Virtio-blk Linux Driver History

  • Linux Kernel 2.6.24 through 3.12 uses the traditional request based approach. There is a single lock for protecting the request queue, this causes a huge performance bottleneck with guests using fast storage (SSDs, NVMe).
  • Linux Kernel 3.7 through 3.12 uses a BIO based approach. This generates more requests to the host, more VMexit and context switching, improved performance, still not great.
  • Linux Kernel 3.13 through 3.16 uses Block multi-queue, single dispatch queue. Request based, but does not have lock for protecting queue requests. Supports IO merging in software queue. Single virtqueue means a single vq lock is needed for both submit and complete paths.
  • Linux Kernel 3.17+ uses Block multi-queue, multiple dispatch queue. Takes advantage of block multi-queue. Is the biz in terms of performance.

Linux Block Multi-queue

  • Created to support high performance SSDs. Removes the coarse request queue lock.

Has two queue levels:

  • Per CPU software queue. This is the staging queue, allows for merging and tagging.
  • Hardware queue (dispatch queue). Submits requests to hardware, but requires hardware support. There is an N:M mapping between the software and hardware queues.

Linux 3.13 allows for a single hardware queue, which improves performance for fast devices like SSDs and does not require QEMU changes. There is still a single virtqueue lock for both submission and completion, so IO does not scale perfectly.

Linux 3.17+ allows for multiple hardware queues. blk-mq's hardware queue can get mapped to virtqueue. This removes a lot of bottlenecks from inside the VM. This does require some changes to QEMU to support new features.
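
To see how many hardware queues a guest disk actually ended up with, here is a rough sketch that counts the per-queue directories that blk-mq exposes in sysfs. The function name and the optional root argument are made up for this example, and it assumes the guest kernel exposes /sys/block/<dev>/mq for blk-mq devices.

```shell
# Hypothetical helper: count the blk-mq hardware queues of a block device by
# counting the directories under /sys/block/<dev>/mq.
# ROOT defaults to the real filesystem root and exists only for testing.
hw_queue_count() {
  dev="$1"
  root="${2:-}"
  mq_dir="$root/sys/block/$dev/mq"
  if [ ! -d "$mq_dir" ]; then
    echo "$dev has no mq directory (not a blk-mq device, or an older kernel)" >&2
    return 1
  fi
  count=0
  for q in "$mq_dir"/*/; do
    if [ -d "$q" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Usage inside a guest (left commented out, vda may not exist here):
#   hw_queue_count vda
```

A result of 1 points to the single dispatch queue case, while a higher number suggests that multiple virtqueues are in use.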

QEMU 2.0 Optimization

Multi virtqueue conversion patch is simple and gets really good performance.

However, once dataplane started using the QEMU block layer for IO, performance began to drop, and an investigation was started to see what the issue was.

QEMU IO Batch Submission

IO batch submission handles more requests in a single system call (io_submit). Handling multiple IO requests at once reduces the number of system calls that need to be made, which improves performance.

The Linux-AIO interface can be used to queue multiple read and write requests at the same time, this helps to improve efficiency.

These methods have been used with dataplane from the beginning to help improve performance.

Random read and write IOPs can be improved by roughly 50% by using batch IO.

QEMU Multi Virtqueue Support

Enabling multi virtqueues can help with VMs using SSD based storage and with applications that need fast and frequent IO.

Can be enabled by using the num_queues parameter. Works with dataplane on, or off.




View Help Options. Replace "/$QEMU_DIR" with the proper path for your environment.

/$QEMU_DIR/bin/qemu-system-x86_64 -machine help
/$QEMU_DIR/bin/qemu-system-x86_64 -device help
/$QEMU_DIR/bin/qemu-system-x86_64 -cpu help

View specific device driver options. You can replace "virtio-blk" with whatever device you want to know more about.

/$QEMU_DIR/bin/qemu-system-x86_64 -device virtio-blk,help 


To convert a KVM guest XML file to an actual QEMU command you can run the following virsh command which converts domxml to native QEMU arguments.

virsh domxml-to-native qemu-argv $guest_xml_file.xml

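To add in extra QEMU command line arguments when creating a guest from an XML file, you can add a block at the end of the file, right before </domain>. The arguments go inside a <qemu:commandline> element, and the opening <domain> tag needs the xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' attribute. This would add the following -device options:

  <qemu:commandline>
    <qemu:arg value='-device'/>
    <qemu:arg value='virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on'/>
  </qemu:commandline>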

Some Terminology

Context Switch: A kernel operation that switches a CPU to operate in a different address space. (switching between user and kernel space for example).

Interrupt: Signal sent from a physical device to the kernel, this is usually done when servicing an IO request.

User programs, also known as processes, run in user mode. They can request privileged operations from the kernel via system calls, for IO requests and so on. For this to happen, execution switches from user to kernel mode, and the request then operates with higher privileges.

Context switches take time (CPU cycles), which adds some overhead to each IO request, so performance will be lower when more context switches are made, and higher when fewer are made. Some programs have been optimized to run in kernel mode as much as possible, which helps to reduce this overhead.
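
If you want a rough idea of how many context switches a process is racking up, Linux keeps per-process counters in /proc. The sketch below prints them from a status file. The function name is made up, and the path defaults to the current shell, so for a guest you would pass the status file of its qemu process instead.

```shell
# Hypothetical helper: print the voluntary and nonvoluntary context switch
# counters from a /proc/<pid>/status file.
# The path argument defaults to the status file of the current shell.
ctxt_switches() {
  status_file="${1:-/proc/$$/status}"
  grep -E '^(non)?voluntary_ctxt_switches:' "$status_file"
}

# Usage for a running guest, where $pid is the PID of its qemu process:
#   ctxt_switches "/proc/$pid/status"
```

Running it twice a few seconds apart while the guest is doing IO shows how quickly the counters grow.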

  • Guest Exits mean that the VM has to stop executing while it waits for the hypervisor to handle its request.