Ceph

Ceph Common Commands

To view the CEPH cluster health status

To get a basic idea of the cluster health, simply use the ceph health command. This will tell you the number of PGs backfilling, the number of degraded PGs and objects, and the percentage of the cluster that is degraded.

ceph health

The command below is similar to the one above, but it provides much more detail about the cluster health. If you want per-PG stats and detailed OSD information, use the ceph health detail command.

ceph health detail

View ceph pool io statistics

If you want IO statistics for a specific Ceph pool, use the ceph osd pool stats command. It outputs the number of degraded objects in the pool (if any), the total recovery IO in KB/s and objects per second, and the client IO broken out into read KB/s, write KB/s, and total operations per second. These stats are per pool, so if you have multiple pools you need to run the command against each pool name (or loop over them as shown below).

ceph osd pool stats $pool_name
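
If you have more than a couple of pools, a quick shell loop over rados lspools saves some typing. This is just a sketch; the pool names come from whatever your cluster actually contains.

for pool in $(rados lspools); do
    ceph osd pool stats "$pool"
done

Running ceph osd pool stats with no pool name at all should also print the stats for every pool in one go.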

View all ceph OSD performance stats

To view the fs_commit_latency and fs_apply_latency for all of the OSDs in the cluster, run the command below. If you notice certain OSDs with very high latency, you may want to investigate further. The osd perf command will usually point you in the right direction when troubleshooting Ceph performance (see the example after the column descriptions below).

ceph osd perf
  • fs_commit_latency: The values here are in milliseconds (ms) and will usually be a lot higher than fs_apply_latency. This is because there is a syscall involved (syncfs). Values of 100ms - 600ms are generally considered acceptable times. It's recommended that you look at the values for fs_apply_latency instead of using commit latency to judge performance. If you notice a few OSDs with larger than normal latency you should certainly investigate further.
  • fs_apply_latency: The values here are the amount of time it takes in milliseconds (ms) to apply updates to the in-memory file system. The latency shown under this column should be a lot lower than with the commit column simply because it's a lot faster to update memory than it is to update a file on disk.
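
Once osd perf points at a suspect OSD ID, ceph osd tree will tell you which host (and CRUSH bucket) that OSD lives on, which is usually the next thing you want to know:

ceph osd perf        # find OSD IDs with unusually high commit/apply latency
ceph osd tree        # map the OSD ID back to its host, rack and CRUSH weight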

Source -- http://comments.gmane.org/gmane.comp.file-systems.ceph.user/6822

Ceph RADOS Architecture

Ceph Storage Structure

The underlying storage for Ceph consists of a file system (btrfs, xfs, ext4) on top of individual disks (HDDs, SSDs). Above the file system sit the OSDs, which take each disk and make it part of the overall storage structure.

Ceph Object Gateway

  • RADOSGW

Ceph Object Gateway is an S3- and Swift-compatible gateway. It is an object store that holds files of any type. You do not mount object storage on your server; you send files to it and get files from it.

RADOSGW speaks REST on one side and talks to the cluster over a socket on the other. The API supports buckets and accounts, which means multiple "top level" buckets and lots of "directories" underneath the buckets. Each user only has access to their own bucket, so files are kept secure from other users unless you misconfigure things and set public permissions.

RADOSGW supports usage accounting, so you can bill your clients for what they use. It's compatible with AWS S3 and any Swift-powered application.
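
If you want to try the S3 API, radosgw-admin can create a user and print the access and secret keys for it. The uid and display name below are just placeholders:

radosgw-admin user create --uid="demo" --display-name="Demo User"
radosgw-admin user info --uid="demo"      # re-print the keys and quota info later

Those keys can then be plugged into any S3 client pointed at the gateway host.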

Ceph Block Device

  • RBD

Ceph Block Device is a distributed virtual block device which can be used to boot VMs, or be mounted inside a VM for more storage. RBD has a Linux client and a QEMU/KVM driver.

LIBRBD handles the communication between the virtual machine and LIBRADOS to provide block storage. There is also a kernel module called KRBD that gives you a block device you can mount on your VM and put a file system on. Basically you can either run a VM on locally attached SSDs and mount an RBD block device inside that VM for extra storage space, or you can put an OS image on an RBD volume and boot the OS off the RBD device.

RBD supports snapshots and CoW clones, and is supported by QEMU/KVM.
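
As a quick sketch of the KRBD path described above (the pool and image names here are made up), creating and mapping an image looks like this:

rbd create vmpool/vm01-disk --size 10240   # size is in MB, so this is a 10GB image
rbd map vmpool/vm01-disk                   # KRBD exposes the image as a /dev/rbd* device
rbd showmapped                             # show which images are mapped to which devices

After that you can put a filesystem on the /dev/rbd* device and mount it like any other disk.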

Ceph Filesystem

  • CEPHFS

A distributed filesystem with POSIX semantics that provides storage for modern and legacy applications. This also has a Linux client.

You must run a metadata server (MDS) if you want to use the Ceph filesystem. The MDS handles the POSIX side of things: permissions, ownership, file locations and all that fun stuff.

The metadata node is a controller of sorts, much like the monitors passively control the cluster. Metadata servers only keep track of files, they do not serve file data. A metadata server is only required if you want to use CephFS as a shared filesystem; if you are not using CephFS you do not need to run one at all.
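
A minimal sketch of bringing up a CephFS client, assuming a monitor reachable at mon1.example.com and an admin keyring already on the client:

ceph mds stat                                   # confirm at least one MDS is up and active
mkdir -p /mnt/cephfs
mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client
# or use the FUSE client instead:
# ceph-fuse /mnt/cephfs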

Ceph Workers and Functions

OSDs (file servers)

There are usually tens to thousands of OSDs in a cluster. There is one OSD per disk, so you can think of an OSD as a single disk, or a RAID group, with some software sprinkled on top. An OSD serves the objects stored on its disk to the clients that request them. OSDs peer with each other to replicate objects or perform recoveries.

You will want to make sure you place OSD nodes in multiple areas around your DC; deploying all of your nodes in a single rack, on the same power and switch, is not a good idea for redundancy. Place nodes in areas that have their own power and networking so that your entire cluster does not go down all at once.

All OSDs in a single pool should have the same hardware and very similar performance. If you mix drives in a pool then eventually the cluster will slow down to the speed of the slowest drive. Basically, you don't want to mix SSDs with spinning HDDs in the same pool. You can mix and match drives on nodes that do not handle data, such as a metadata server or a monitor.

Journal

All writes that enter Ceph are first written to the journal on an OSD. Journal writes are always direct IO and are done sequentially. If there is a replication factor, the data is also written to the journals on the other nodes. After that, the data is eventually written to the data disk using buffered IO. This helps ensure data is safe, since it hits the journal on disk first.
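
The journal location and size are controlled from ceph.conf. A minimal sketch, where the partition path is just an example of pointing the journal at a dedicated SSD partition:

[osd]
    osd journal size = 10240                          # journal size in MB
    osd journal = /dev/disk/by-partlabel/journal-$id  # dedicated (ideally SSD) partition instead of the data disk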

Monitors (cluster masters and control)

Monitors maintain the state of the cluster and its membership, and provide consensus for distributed decisions. Monitors have no part in serving files to clients; they are passive in that sense. Monitors should always be deployed in an odd number, otherwise the nodes may not be able to reach a decision on whether a storage node is down. If 2 monitors think it's up and 2 think it's down, no decision can be made, but with 3 (or 5) monitor nodes a decision can always be reached.

Typically 3 or 5 monitors is a good starting point.
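
To see how many monitors you have and whether they are all in quorum:

ceph mon stat         # one-line summary: monitor names and who is in quorum
ceph quorum_status    # detailed JSON including the current quorum leader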

LIBRADOS

LIBRADOS is a library that allows applications to access RADOS directly. The library supports C, C++, Java, Python, Ruby and PHP. Because the client talks directly to the storage nodes over a socket, there is no HTTP overhead involved, so it's damn fast. Use it as often as possible.
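
The rados command line tool is built on top of LIBRADOS and is an easy way to poke at a pool directly. The pool and object names below are placeholders:

rados -p mypool put hello ./hello.txt   # store a local file as an object named "hello"
rados -p mypool ls                      # list the objects in the pool
rados -p mypool get hello ./copy.txt    # read the object back out to a local file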


CRUSH

CRUSH (Controlled Replication Under Scalable Hashing) is the algorithm Ceph uses to place objects. An object's name is hashed to a placement group, and CRUSH then combines the placement group, the current cluster state and the rule sets to decide where the data lives. Objects end up spread evenly across the cluster in a repeatable, reliable way. It's a fast, pseudo-random placement algorithm that is infrastructure aware and follows the rules you define.

How objects are stored in Ceph

When an object enters Ceph, its name is hashed and the hash maps the object to a placement group. The placement group is then fed through CRUSH, which picks the OSDs responsible for it. Finally, each of those OSDs writes the object down to its hard drive.
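
You can watch this mapping happen with ceph osd map, which shows the placement group and the set of OSDs an object name hashes to (the pool and object names here are placeholders):

ceph osd map mypool hello
# prints something along the lines of:
# osdmap eNN pool 'mypool' (1) object 'hello' -> pg 1.xxxxxxxx -> up [2,5,9] acting [2,5,9]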

Recovery is handled by the storage nodes; their OSDs notice when another node goes down for whatever reason. The OSDs get a new cluster map from the monitors, which have been keeping track of the cluster and have voted that the node really is down. There are also statuses for temporarily down nodes, which aren't treated as harshly as a fully down node.

At this point the cluster re-balances itself based on some mind-blowing calculations. If you have 100 nodes and 1 node goes down, CRUSH only has to move roughly 1/100 of the data, so recovery happens rather quickly.

The client uses the same CRUSH algorithm as the cluster to find data, so the same placement calculation happens on both the client and server side and no central lookup table is needed. This apparently works, so that's cool.

Thin Provisioning

Ceph allows you to boot and store tons of VMs and VM images. RBD takes advantage of CoW (Copy on Write) techniques, so you can take a single image that's, say, 10GB in size, boot 100 VMs off it, and initially only use 10GB of storage. As time goes on and changes are made, the savings shrink, but it's still a very useful feature if you spin up a lot of VMs.

Any reads or writes from the client go to the cloned copy first, before falling back to the master (parent) image. This helps performance: reads are served from the copy if the data is there, otherwise the master image is read.
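
A sketch of the CoW workflow with the rbd tool (the image and snapshot names are made up): snapshot a master image, protect the snapshot, then clone it for each VM.

rbd snap create vmpool/baseimage@golden              # snapshot the master image
rbd snap protect vmpool/baseimage@golden             # clones require a protected snapshot
rbd clone vmpool/baseimage@golden vmpool/vm02-disk   # near-instant CoW clone
rbd flatten vmpool/vm02-disk                         # optional: copy all data and detach the clone from its parent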

Clustered Metadata

Ceph uses multiple metadata servers to form a single tree of authority when it comes to file metadata for the entire cluster.

Metadata consists of things like:

  • File User and Group Permissions
  • File Size
  • File Names
  • File Locations
  • File TimeStamps

Ceph uses dynamic sub-tree partitioning to spread this work across many metadata servers. You can have one metadata server responsible for the entire tree. If you add a second metadata server, about half the load from the first server is handed off to the second node; the hand-off happens very quickly and is not an intensive process. Add a third metadata server and the load gets split three ways, and so on. This happens all the time, so metadata handling is very dynamic.

Ceph Performance

RBD Cache

The RBD cache is specific to the client accessing the device. It behaves like a normal HDD cache and can be configured as writeback or writethrough. With a writethrough RBD cache all writes go directly to the OSDs, while reads can still be served out of the RBD cache.

With a writeback RBD cache, writes return almost immediately and are later sent to the OSDs to be written to disk. Reads still come from the client-side RBD cache.

The default RBD cache size is 32MB; this can be adjusted by modifying the following setting:

rbd_cache_size = 32M

You can also modify the maximum amount of dirty data that can exist before it gets flushed to disk. To modify this limit:

rbd_cache_max_dirty = 24M (writeback)
rbd_cache_max_dirty = 0 (writethrough)

In addition to configuring the maximum amount of dirty data in the RBD cache, you can configure the size at which background flushing starts. At that point the flushing process does not block writes to the cache.

rbd_cache_target_dirty = 16MB (default)

The default age for dirty data in the RBD cache is 1 second. You can raise this value if you want; it will improve performance, but it raises the risk of data loss should something happen to the client.

rbd_cache_max_dirty_age

There is no right or wrong way to configure the RBD cache. How you configure caching depends on the environment and workloads you plan on handling.
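
Putting the settings above together, a [client] section in ceph.conf might look something like the sketch below. The values shown are the defaults discussed above, expressed in bytes:

[client]
    rbd cache = true
    rbd cache size = 33554432            # 32MB total cache
    rbd cache max dirty = 25165824       # 24MB of dirty data allowed; 0 forces writethrough
    rbd cache target dirty = 16777216    # 16MB: start flushing in the background at this point
    rbd cache max dirty age = 1.0        # seconds dirty data may sit before being flushed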

Ceph Performance & Benchmarking by Mark Nelson

Mark Nelson's talk mentions some performance numbers as well as server specs. Below is a summary of the hardware used:

Dual E5-2630L
4 x LSI 9207-8i HBA cards
24 x 1TB spinning HDDs
8 x Intel 530 SSDs
Bonded 10GbE Network
Cost ~ $12,000

If you are using all spinning HDDs on your OSDs and you store data and the journal on each HDD you will encounter reduced performance when it comes to small, random writes. Sequential writes are not as big of a deal. If you use a RAID card and write back caching you can improve random write performance quite a bit because the data that gets written to the journal is done so sequentially from the write cache. Still though, if you put the journal on the drive used to store data you will see a reduction in performance since each write gets written twice, once to the journal and once to the actual disk.

To get around this, the ideal setup is to use SSDs for the journal and spinning HDDs to store the data. Generally speaking, though, you do not want to enable write-back caching with SSDs.

On Cuttlefish, using RADOS bench with 4MB objects, 4 processes and 128 concurrent operations, Ceph was able to achieve:

  • 2000MB/s Write
  • 1700MB/s Read

These results are SEQUENTIAL, not random, and the test system used the hardware mentioned above. Pretty decent results. Performance was very similar between filesystems; XFS, BTRFS, and EXT4 were all very close.

Keep in mind these are not filesystem or client tests; they only measure RADOS speed inside the cluster. These are purely synthetic results, not what you would see if you mounted an RBD volume in a guest and ran tests there.
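
If you want to run the same kind of RADOS-level test yourself, rados bench is built in. The pool name below is a placeholder, and exact option support varies a bit between releases:

rados bench -p testpool 60 write -b 4194304 -t 32 --no-cleanup   # 4MB writes, 32 in flight, keep the objects
rados bench -p testpool 60 seq -t 32                             # sequential reads of the objects written above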

RBD uses 4MB object sizes. If a client is writing 4KB files, much of each object written to Ceph is empty, so you will not see anywhere near the performance mentioned above. The results below are from a single Ceph node and a single client node; they are not from a large cluster, and only a single VM is accessing the cluster.

Cuttlefish KRBD, 4KB FIO writes

  • Write throughput with 1 volume in VM: ~ 5MB/s
  • Write throughput with 8 volumes in VM: ~ 25MB/s
  • Random Write throughput with 1 volume in VM: ~ 8MB/s
  • Random Write throughput with 8 volumes in VM: ~ 7MB/s

Cuttlefish QEMU/KVM (RBD read and write cache on), 4KB FIO writes

  • Write throughput with 1 volume in VM: ~ 400MB/s
  • Write throughput with 8 volumes in VM: ~ 400MB/s
  • Random Write throughput with 1 volume in VM: ~ 22MB/s
  • Random Write throughput with 8 volumes in VM: ~ 25MB/s
  • RBD caching is needed for high performance inside a QEMU/KVM instance
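
A 4KB random write test along the lines of the numbers above could be run with fio against a mapped or attached RBD device. The device path below is a placeholder, and writing to it directly will destroy whatever is on it:

fio --name=rbd-4k-randwrite --filename=/dev/rbd0 --rw=randwrite \
    --bs=4k --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=60 --time_based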

Ceph Replication Performance Cost

Performance can vary wildly among different Ceph clusters, and a lot of it comes down to the replication factor. With a replication factor of 2 you will see roughly half the write performance of a replication factor of 1, and the drop in write performance between replication factor 2 and 3 is also pretty dramatic. This is not surprising, since replication takes time and you must wait for multiple OSDs to complete a write instead of just one.
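
The replication factor is set per pool, so it is easy to check or change (the pool name is a placeholder):

ceph osd pool get rbdpool size     # show the current replication factor
ceph osd pool set rbdpool size 2   # change it; expect writes to slow down as it goes up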

Read performance doesn't really change when the replication factor is raised or lowered. There is a slight drop in read speed as you raise the replication factor, but it is a very small difference and nothing like the halving you see with writes. This is somewhat surprising; I would expect read speed to double or triple with 2 or 3 replicas, as is the case with RAID 10, where an 8-disk RAID 10 is typically much faster than a 4-disk RAID 10, but then again that is a totally different technology.

Using CephFS does not seem to reduce performance much; it scales pretty well compared to the RADOS bench results.

Ceph Hardware Considerations

Ceph metadata servers

These servers are very CPU intensive since they are constantly rebalancing their load and keeping track of metadata changes. They also need a lot of RAM to serve requests quickly to the rest of the cluster. Each metadata server should have at least a 4-core CPU and at least 1GB of RAM per daemon instance. In general, you can't have too much CPU or RAM on these boxes. Consider at least a mid- to high-end Intel E3-1270 or E3-1290 with as much RAM as you can put in the box.

E3-1270v3 or E5-26xx
64GB + RAM
4 x SSD RAID 10 or faster
10Gb NIC

Ceph monitor servers

These servers are not very resource intensive; you can get by with a small amount of CPU and RAM. I would imagine that even an older Intel i5-750 or similar would be enough for the job, and 8GB of RAM should be plenty for these servers.

To me it makes sense to have monitor nodes with similar hardware to:

Intel i5-750 or similar
8GB RAM
2 x 1TB hdd (raid 1)
1Gb NIC

Ceph OSD servers

You should have a decent amount of CPU resources on these servers, but not as much as you would on a metadata node. Dual- or quad-core CPUs should be fine, but if you have tons of disks it wouldn't hurt to go with an Intel E3-1200 model. Ceph recommends at least 1GB of RAM per OSD daemon on each server.

If you have 16 x 1TB drives per server then I would give each server at least 16GB of RAM if not more to make sure that recovery operations are done quickly. If you have 16 x 2TB drives per server then you want to have 2GB per drive, or a total of 32GB RAM per server.

  • In general you want about 1GB of RAM per 1TB of storage space on each OSD server.
  • Do not run more than 1 OSD per drive. There is no performance gain when running multiple OSD instances per drive so don't do it.
  • DO NOT place the journal and data on the same drive. This will slow down performance significantly. Use SSDs for the journal and write the data to a spinning disk, or if you want to use all ssds make sure that the performance for the journal drive and data drive is similar otherwise the slowest SSD will bottleneck the faster ones.
  • Use a separate drive for the OS. Try not to run the OS on an OSD drive. All drives should be dedicated to a specific function and only used / controlled by a single process. Don't run your OS on a drive and then partition off space for the OSD to use.
  • If you want to use SSDs for the journal be sure to pay attention to write speeds of the drives. Not all SSDs are equal and not all SSDs perform consistently when writing often. Personally, I'd go with an Intel DC S3700 drive or similar for journals.

You also want to make sure the host has enough network throughput to handle the IO throughput of the drives in the server. For instance, if each host has a single 1Gb NIC it does not make sense to stick 16 x 2TB drives in that host; the throughput of 16 spinning HDDs will exceed what the network can push, so a lot of performance is wasted. The average HDD provides around 100MB/s - 200MB/s of sequential read and write performance. Multiply the average speed of a drive by the number of drives per host to figure out what makes sense.

This means that if you only have a 1Gb NIC (~111 MB/s) you really don't want to put more than a single HDD in the host. With a single 10Gb NIC you could fit about 10 - 15 drives per host; for example, 10 HDDs at ~125 MB/s each is roughly 1250 MB/s, which is about what a 10Gb NIC can move. More drives per host also means more RAM and CPU, so be sure to keep this in mind.

To me it seems like it makes sense to have OSD hosts with a config similar to:

E3-1220
32GB RAM
8 x 4TB HDD
1 x Intel DC S3700 SSD
10Gb NIC

Ceph Troubleshooting

Ceph Admin Socket

There is an admin socket on every OSD that lets you interact with the OSD and view things like the current IO. There is also a command called dump_historic_ops which gives you a list of the slowest recent operations on that OSD, by default covering roughly the past 10 minutes. You can change these limits to be higher or lower, but that is the default behaviour. It is very useful for performance troubleshooting and tuning, because it shows where a slow operation spent its time, so you can tell whether a slow write was due to network latency, a slow journal, or something else.

You can interact with the admin socket by using the command below. Using "help" will give you a list of all the commands and options you can use.

ceph --admin-daemon /var/run/ceph/{socket-name} help
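
For example, assuming the default socket naming of ceph-osd.<id>.asok, the two most useful calls for performance work are:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops   # slowest recent operations and where they spent their time
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump           # internal performance counters for the OSD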

How to use CEPH admin socket to view configuration settings for OSD

You can view all the configuration settings for a specific OSD by running the command below. If you want very detailed configuration settings for an OSD this command will provide all the information you need.

ceph --admin-daemon /var/run/ceph/ceph-osd.{osd-number}.asok config show


http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

collectl

Collectl is a sort of all-in-one monitoring utility which displays IO stats, CPU usage and other resources within a single tool, which makes it easier to get an idea of what is going on on a single OSD node. You could also use iostat, iotop, and so on; collectl just makes it easy to pick what you want displayed in the output.
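
A couple of invocations that tend to be useful on an OSD node. The subsystem flags here assume collectl's usual switches; check collectl --help on your version:

collectl -scD -oT   # CPU summary plus per-disk detail, with a timestamp on each line
collectl -scn -oT   # CPU plus network summary, handy for spotting a saturated NIC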

Ceph Videos and Links

Ceph Intro and Architectural Overview by Ross Turk

A lot of the information from Ross Turk's talk has been added to this wiki. I suggest you watch the talk and then review the information in the wiki.