Comparing Local SSD and Ceph RBD Storage Performance in KVM Virtual Machines

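In Comparing IOMMU NVMe and Native NVMe Storage Performance, I compared NVMe read/write performance inside an Open Virtual Machine Firmware (OVMF) virtual machine using IOMMU passthrough against NVMe on the bare-metal host. Now that Ceph Atlas storage has been deployed, following the private cloud architecture, to provide virtual machine storage, the performance cost of distributed Ceph storage also has to be considered. This article therefore uses the same fio storage benchmark method to compare the two.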

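Note

I was hoping the tests would show that a KVM virtual machine disk running on distributed Ceph can match a local SSD. The underlying hardware is the faster NVMe, so even with the distributed overhead I hoped it would reach local SSD disk performance.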

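Test Environment

The virtual machines are deployed with Libvirt Ceph RBD integration and run exactly the same Ubuntu Linux 20.04. Each uses a 4c8g configuration (4 vCPUs and 8 GB of memory; vcpu=4 matches the -numjobs=4 concurrency in the fio command).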

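Disk Performance Tests

Test Notes

/dev/vda already holds the guest operating system, so fio runs against a file rather than the block device. This differs from reading and writing the block device directly (the operating system's filesystem and caching layers come into play), but the results are still reasonably comparable between two guests running the same operating system.
The three virtual machines providing Ceph services are already CPU-pinned to cores on socket 0. My HPE ProLiant DL360 Gen9 server uses Intel Xeon E5-2670 v3 processors, so the number of physical cores is limited; not counting hyper-threads, they have all been assigned one-to-one to these three Ceph VMs. I therefore do not know whether the Ceph client VM performs better pinned to hyper-thread sibling cores on the same socket or to cores on socket 1 (this needs to be measured, and I expect it depends on load and software versions).
To keep this test simple, the virtual machines under test have no vCPU pinning.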
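
For a later run that does pin the client VM, the following is a minimal sketch of how the pinning commands could be generated. It only prints `virsh vcpupin` commands instead of running them. The helper name `print_vcpupin` and the host CPU numbers are hypothetical; which host CPUs belong to socket 1 or are hyper-thread siblings has to be checked on the actual host (for example with `lscpu`).

```shell
# Print (do not run) one "virsh vcpupin" command per guest vCPU,
# mapping vCPU i to host CPU first_cpu + i.
# The domain name and host CPU numbers passed in are hypothetical examples.
print_vcpupin() {
    domain=$1      # libvirt domain name
    first_cpu=$2   # first host CPU to pin to
    vcpus=$3       # number of vCPUs in the guest
    i=0
    while [ "$i" -lt "$vcpus" ]; do
        echo "virsh vcpupin $domain --vcpu $i --cpulist $((first_cpu + i))"
        i=$((i + 1))
    done
}

# Example: a 4-vCPU guest mapped to host CPUs 12-15 (assumed numbering)
print_vcpupin z-ubuntu20-rbd 12 4
```

Printing the commands first makes it easy to review the mapping before applying it by hand.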

The results are not precise and are for reference only.

Random Write IOPS (File)

Test command:

fio -direct=1 -iodepth=32 -rw=randwrite -ioengine=libaio -bs=4k -numjobs=4 -time_based=1 -runtime=60 -group_reporting -filename=${HOME}/fio -size=2g -name=test

z-ubuntu20 local SSD result:

Local SSD VM random write IOPS

fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 2048MiB)
Jobs: 4 (f=4): [w(4)][100.0%][w=4932KiB/s][w=1233 IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=505: Wed Dec  8 16:43:57 2021
  write: IOPS=1197, BW=4790KiB/s (4905kB/s)(281MiB/60071msec); 0 zone resets
    slat (usec): min=7, max=64376, avg=77.35, stdev=952.51
    clat (msec): min=13, max=258, avg=106.80, stdev=25.94
     lat (msec): min=13, max=258, avg=106.88, stdev=25.91
    clat percentiles (msec):
     |  1.00th=[   54],  5.00th=[   59], 10.00th=[   64], 20.00th=[   80],
     | 30.00th=[  106], 40.00th=[  111], 50.00th=[  114], 60.00th=[  116],
     | 70.00th=[  118], 80.00th=[  122], 90.00th=[  128], 95.00th=[  136],
     | 99.00th=[  180], 99.50th=[  190], 99.90th=[  203], 99.95th=[  205],
     | 99.99th=[  247]
   bw (  KiB/s): min= 3800, max= 5752, per=99.92%, avg=4786.30, stdev=92.26, samples=480
   iops        : min=  950, max= 1438, avg=1196.50, stdev=23.07, samples=480
  lat (msec)   : 20=0.01%, 50=0.63%, 100=25.21%, 250=74.14%, 500=0.01%
  cpu          : usr=0.23%, sys=0.69%, ctx=25625, majf=0, minf=47
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,71936,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=4790KiB/s (4905kB/s), 4790KiB/s-4790KiB/s (4905kB/s-4905kB/s), io=281MiB (295MB), run=60071-60071msec

Disk stats (read/write):
  vda: ios=0/73364, merge=0/4, ticks=0/7947375, in_queue=7800168, util=98.87%

z-ubuntu20-rbd Ceph RBD result:

Ceph RBD VM random write IOPS

fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 2048MiB)
Jobs: 4 (f=4): [w(4)][100.0%][w=13.8MiB/s][w=3520 IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=487: Wed Dec  8 16:46:50 2021
  write: IOPS=4801, BW=18.8MiB/s (19.7MB/s)(1126MiB/60015msec); 0 zone resets
    slat (usec): min=3, max=578742, avg=230.97, stdev=7263.63
    clat (msec): min=6, max=810, avg=26.42, stdev=46.37
     lat (msec): min=7, max=810, avg=26.65, stdev=47.00
    clat percentiles (msec):
     |  1.00th=[   12],  5.00th=[   14], 10.00th=[   15], 20.00th=[   16],
     | 30.00th=[   17], 40.00th=[   18], 50.00th=[   19], 60.00th=[   21],
     | 70.00th=[   22], 80.00th=[   26], 90.00th=[   32], 95.00th=[   41],
     | 99.00th=[  317], 99.50th=[  443], 99.90th=[  567], 99.95th=[  600],
     | 99.99th=[  701]
   bw (  KiB/s): min= 1496, max=30440, per=99.97%, avg=19199.78, stdev=1933.14, samples=480
   iops        : min=  374, max= 7610, avg=4799.87, stdev=483.29, samples=480
  lat (msec)   : 10=0.11%, 20=59.21%, 50=37.40%, 100=1.66%, 250=0.45%
  lat (msec)   : 500=0.95%, 750=0.22%, 1000=0.01%
  cpu          : usr=1.18%, sys=4.26%, ctx=194166, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,288149,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=18.8MiB/s (19.7MB/s), 18.8MiB/s-18.8MiB/s (19.7MB/s-19.7MB/s), io=1126MiB (1180MB), run=60015-60015msec

Disk stats (read/write):
  vda: ios=0/299107, merge=0/4072, ticks=0/6565335, in_queue=5964988, util=99.65%
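
When comparing several runs, only the summary IOPS line of each report matters. As a small sketch (the function name `fio_iops` is an assumption for illustration, and it only handles fio's normal text output as shown above), the figure can be pulled out with sed:

```shell
# Print the IOPS value from each "read:" / "write:" summary line of a
# fio text report read from stdin. Values with a "k" suffix (e.g. 10.5k)
# are printed exactly as fio wrote them.
fio_iops() {
    sed -nE 's/^ *(read|write): IOPS=([0-9.]+k?),.*/\2/p'
}

# Example using the two summary lines from the reports above
printf '%s\n' \
  '  write: IOPS=1197, BW=4790KiB/s (4905kB/s)(281MiB/60071msec); 0 zone resets' \
  '  write: IOPS=4801, BW=18.8MiB/s (19.7MB/s)(1126MiB/60015msec); 0 zone resets' \
  | fio_iops
```

The periodic status line (`Jobs: 4 (f=4): ...`) is deliberately not matched, because it shows the instantaneous rate rather than the average over the run.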

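This surprised me a little. Random write IOPS against a file on z-ubuntu20-rbd (Ceph RBD) is only 4801. That is much better than the 1197 IOPS of the local SSD guest, about 4 times as fast, but it is far below the 629k measured when an Open Virtual Machine Firmware (OVMF) guest reads and writes a single NVMe device directly: only about 0.8% of the raw figure. Is the test method the problem?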
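
The two ratios quoted here can be rechecked from the measured IOPS figures. This is just a worked calculation; the helper names `ratio` and `percent` are illustrative.

```shell
# Floating-point helpers built on awk (plain shell arithmetic is integer-only).
ratio()   { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }
percent() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / b }'; }

ratio 4801 1197       # Ceph RBD guest vs. local SSD guest      -> 4.01
percent 4801 629000   # Ceph RBD guest as % of raw NVMe (629k)  -> 0.76
```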

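Note

I reran the command above on z-b-data-1, z-b-data-2 and z-b-data-3 (against a file rather than the block device) to reduce the differences between the tests being compared. Note that the three do not sit on equivalent PCIe slots of the physical host: z-b-data-1 and z-b-data-2 share one slot that is split using PCIe bifurcation, while z-b-data-3 has a PCIe x8 slot to itself.

On z-b-data-1, z-b-data-2 and z-b-data-3, create a test partition and a filesystem on it for the fio test. First check the current partition table: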
```
parted /dev/nvme0n1 print
```

Output:

Model: SAMSUNG MZVL21T0HCLR-00B00 (nvme)
Disk /dev/nvme0n1: 1024GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End    Size   File system  Name     Flags
 1      1049kB  500GB  500GB               primary

Create a temporary 6 GB partition for testing (it runs from 500GB to 506GB, hence 6 GB):

parted -s -a optimal /dev/nvme0n1 mkpart primary 500GB 506GB
mkfs.xfs /dev/nvme0n1p2

mount /dev/nvme0n1p2 /mnt
mkdir /mnt/test
chown huatai:huatai /mnt/test

Run the fio test (as user huatai):

fio -direct=1 -iodepth=32 -rw=randwrite -ioengine=libaio -bs=4k -numjobs=4 -time_based=1 -runtime=60 -group_reporting -filename=/mnt/test/fio -size=2g -name=test

IOMMU VM random write IOPS on a 2 GB file in a filesystem

fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 2048MiB)
Jobs: 4 (f=4): [w(4)][100.0%][w=40.0MiB/s][w=10.5k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=14017: Wed Dec  8 21:46:48 2021
  write: IOPS=8180, BW=31.0MiB/s (33.5MB/s)(1923MiB/60187msec); 0 zone resets
    slat (usec): min=2, max=444127, avg=267.96, stdev=7794.04
    clat (nsec): min=1754, max=446785k, avg=15375574.09, stdev=44988666.36
     lat (usec): min=23, max=449272, avg=15644.05, stdev=45610.46
    clat percentiles (usec):
     |  1.00th=[   445],  5.00th=[   971], 10.00th=[  1483], 20.00th=[  2040],
     | 30.00th=[  2933], 40.00th=[  3982], 50.00th=[  4752], 60.00th=[  5997],
     | 70.00th=[ 10814], 80.00th=[ 20579], 90.00th=[ 23987], 95.00th=[ 25822],
     | 99.00th=[308282], 99.50th=[354419], 99.90th=[417334], 99.95th=[425722],
     | 99.99th=[434111]
   bw (  KiB/s): min= 7752, max=240696, per=100.00%, avg=32807.11, stdev=5786.37, samples=480
   iops        : min= 1938, max=60174, avg=8201.66, stdev=1446.59, samples=480
  lat (usec)   : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.04%, 100=0.07%
  lat (usec)   : 250=0.26%, 500=0.90%, 750=1.79%, 1000=2.27%
  lat (msec)   : 2=13.93%, 4=21.07%, 10=28.83%, 20=10.07%, 50=18.12%
  lat (msec)   : 100=0.37%, 250=0.52%, 500=1.75%
  cpu          : usr=0.99%, sys=2.42%, ctx=172912, majf=0, minf=50
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,492339,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=31.0MiB/s (33.5MB/s), 31.0MiB/s-31.0MiB/s (33.5MB/s-33.5MB/s), io=1923MiB (2017MB), run=60187-60187msec

Disk stats (read/write):
  nvme0n1: ios=3/533038, merge=0/7770, ticks=0/3590622, in_queue=2874700, util=99.78%

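Cloud Disk Performance Analysis

When fio runs random read/write load against the 1 TB NVMe device file, the I/O is spread randomly across the whole SSD, so overall performance is excellent. When the random read/write test targets a single file (especially one as small as 2 GB), the I/O is confined to a 2 GB region of the NVMe device, so by this reasoning performance "drops" to 2G/1024G = 1/512, that is, to only 1/512 of what the whole device can deliver.
For the same SSD model, the smaller the capacity, the lower the performance; a fair performance test can only compare SSDs of the same capacity.
Cloud disks show a similar spreading effect: the larger the cloud disk, the more underlying SSD blocks its I/O is spread across, and the higher the performance it can reach. Likewise, for distributed storage such as Ceph Atlas, the more OSDs there are underneath, the better the read/write performance; this is determined by Ceph's PGs (the degree of dispersion). At the same time, distributed replication, network performance, CPU load, cache contention and so on all affect distributed storage performance, so it cannot deliver the stable, consistent performance of direct NVMe device access.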
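
As a rough sanity check on the 1/512 figure, the sketch below scales the whole-device result by the share of the device that the test file covers. The helper name `linear_estimate` and the assumption that IOPS shrink linearly with the covered share are illustrative only, not something the tests establish.

```shell
# Scale a whole-device IOPS figure by the share of the device a test file covers.
# Purely illustrative: it assumes IOPS shrink linearly with the covered share.
linear_estimate() {
    awk -v iops="$1" -v file_gb="$2" -v dev_gb="$3" \
        'BEGIN { printf "%.0f\n", iops * file_gb / dev_gb }'
}

linear_estimate 629000 2 1024   # 629k IOPS, 2 GB file on a 1024 GB device -> 1229
```

The 8180 IOPS actually measured on the 2 GB file is well above this estimate, so the 1/512 ratio is better read as a qualitative illustration of the locality effect than as a prediction.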

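Random write IOPS measured by fio against a file on a disk filesystem is very low. In Comparing IOMMU NVMe and Native NVMe Storage Performance, random write IOPS reached 629k, yet in the same IOMMU VM environment a random write test on a 2 GB file in a 6 GB partition of that device yields only 8180 IOPS. That is only about 70% higher than the 4801 random write IOPS of the Ceph RBD I built, not an order-of-magnitude difference. It also shows that tests on small files are only useful for side-by-side comparison within the same environment and have no absolute meaning. As a rough estimate, moving to distributed Ceph storage costs about 40% in performance. I expect that this gap can be narrowed, or even reversed, by scaling out Ceph and building larger cloud disks.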
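
The "70% higher" and "about 40% lower" figures describe the same pair of measurements from two different baselines. A short worked calculation (helper names are illustrative):

```shell
# Percentage by which a exceeds b, relative to b.
pct_higher() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * (a - b) / b }'; }
# Percentage by which a falls short of b, relative to b.
pct_lower()  { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * (b - a) / b }'; }

pct_higher 8180 4801   # passthrough NVMe file test over Ceph RBD  -> 70
pct_lower  4801 8180   # Ceph RBD below passthrough NVMe file test -> 41
```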

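So I feel the absolute numbers do not mean much, since configuration alone can change them. Only a side-by-side comparison at the same size has some reference value, and even that is affected by low-level parameter tuning, so I will not run any further stress tests (they wear out the SSDs too much). Later on, benchmarking will only be meaningful for comparing parameters while tuning Ceph.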

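What can be tentatively confirmed so far:

Virtualized distributed storage built on Ceph delivers roughly 60% of the performance of directly passed-through local NVMe in the file-based test, but about 4 times the performance of a virtualized local SSD disk (non-pass-through), so it should be able to meet my needs for building a large-scale private cloud architecture.

For now I will concentrate on building the cloud computing platform, and do performance tuning once the deployment is complete.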