---
layout: post
title: "raid5f: the SPDK RAID 5 implementation"
author: Artur Paszkiewicz
categories: news
---

RAID 5 is a popular RAID level which, similarly to RAID 1 (mirroring), protects
the data in the array against the failure of a member drive and can improve
read performance by reading from multiple drives in parallel. It is often
preferred over mirroring because it "wastes" less storage capacity - the array
can have many member drives, but the total available capacity is reduced only
by the size of one member. RAID 1, by contrast, exposes the capacity of only
one member - just 50% in the most common two-way mirror case - while providing
the same level of protection as RAID 5: protection against the failure of one
drive.

The biggest downside of RAID 5 has always been write performance. Every write
to the array causes writes to at least two member drives because, apart from
the actual data, the parity also has to be updated. Parity is the redundant
data that allows recovery after a drive failure, and it must stay in sync with
the data on the other drives. But the problem is not the additional write
itself; after all, RAID 1 also has to duplicate writes. Computing the parity
adds some overhead, but it is still much faster than accessing storage. The
real performance hit comes when a write causes a read-modify-write cycle: to
compute the updated parity, some data first needs to be read from at least one
member drive. This happens when a RAID stripe is partially updated.
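
The post does not spell out the parity math, so here is an illustrative sketch
(not SPDK code): RAID 5 parity is the bitwise XOR of the data strips in a
stripe, so a full stripe write can compute parity from the new data alone,
while a partial update first needs the old data and the old parity.

```c
/*
 * Illustrative sketch, not SPDK code: contrasts parity computation for a
 * full stripe write with the read-modify-write needed for a partial update.
 */
#include <stdint.h>
#include <stdio.h>

#define STRIP_LEN 4 /* toy strip of 4 bytes */

int
main(void)
{
    uint8_t d0[STRIP_LEN] = {0x11, 0x22, 0x33, 0x44};
    uint8_t d1[STRIP_LEN] = {0xa0, 0xb0, 0xc0, 0xd0};
    uint8_t d2[STRIP_LEN] = {0x05, 0x06, 0x07, 0x08};
    uint8_t parity[STRIP_LEN];

    /* Full stripe write: parity = d0 ^ d1 ^ d2, no reads required. */
    for (int i = 0; i < STRIP_LEN; i++) {
        parity[i] = d0[i] ^ d1[i] ^ d2[i];
    }

    /* Partial update of d1: new_parity = old_parity ^ old_d1 ^ new_d1.
     * Both old_d1 and old_parity have to be read from the members first -
     * this is the read-modify-write cycle. */
    uint8_t new_d1[STRIP_LEN] = {0xde, 0xad, 0xbe, 0xef};
    for (int i = 0; i < STRIP_LEN; i++) {
        parity[i] ^= d1[i] ^ new_d1[i];
    }

    printf("updated parity: %02x %02x %02x %02x\n",
           parity[0], parity[1], parity[2], parity[3]);
    return 0;
}
```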

A stripe is the data plus parity interleaved across the member drives. The
size of the part of a stripe that resides on one member is configurable and is
known as the strip (AKA *chunk*) size. The amount of data that a single stripe
can hold is equal to `strip_size * (n - 1)`, where `n` is the number of
members in the array. Here is a diagram showing an example layout of an array
of 4 drives with the strip size set to 4 blocks, so the stripe size (or
*width*) is 12 blocks:

```
block   drive 0        drive 1        drive 2        drive 3
      +--------------+--------------+--------------+--------------+
  0   | data strip 0 | data strip 1 | data strip 2 | parity strip | stripe 0
  1   |              |              |              |              |
  2   | RAID blocks  | RAID blocks  | RAID blocks  |              | RAID blocks
  3   | 0-3          | 4-7          | 8-11         |              | 0-11
      +--------------+--------------+--------------+--------------+
  4   | data strip 0 | data strip 1 | parity strip | data strip 2 | stripe 1
  5   |              |              |              |              |
  6   | RAID blocks  | RAID blocks  |              | RAID blocks  | RAID blocks
  7   | 12-15        | 16-19        |              | 20-23        | 12-23
      +--------------+--------------+--------------+--------------+
  8   | data strip 0 | parity strip | data strip 1 | data strip 2 | stripe 2
  9   |              |              |              |              |
 10   | RAID blocks  |              | RAID blocks  | RAID blocks  | RAID blocks
 11   | 24-27        |              | 28-31        | 32-35        | 24-35
      +--------------+--------------+--------------+--------------+
 12   | parity strip | data strip 0 | data strip 1 | data strip 2 | stripe 3
 13   |              |              |              |              |
 14   |              | RAID blocks  | RAID blocks  | RAID blocks  | RAID blocks
 15   |              | 36-39        | 40-43        | 44-47        | 36-47
      +--------------+--------------+--------------+--------------+
 16   | data strip 0 | data strip 1 | data strip 2 | parity strip | stripe 4
 17   |              |              |              |              |
 18   | RAID blocks  | RAID blocks  | RAID blocks  |              | RAID blocks
 19   | 48-51        | 52-55        | 56-59        |              | 48-59
      +--------------+--------------+--------------+--------------+
 20   | data strip 0 | data strip 1 | parity strip | data strip 2 | stripe 5
 21   |              |              |              |              |
 ...
```
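
To make the layout concrete, here is a small illustrative sketch (not the
raid5f driver code) that maps a logical RAID block to the position shown in
the diagram: the parity strip rotates from the last drive towards the first,
and the data strips occupy the remaining drives in order. The `locate_block`
helper and the `struct block_location` type are made up for this example.

```c
/*
 * Illustrative sketch (not the raid5f driver code): map a logical RAID block
 * to its location in the layout shown in the diagram above.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct block_location {
    uint64_t stripe;       /* stripe index */
    uint64_t parity_drive; /* member holding this stripe's parity strip */
    uint64_t data_drive;   /* member holding the data block */
    uint64_t member_block; /* block offset within that member */
};

static struct block_location
locate_block(uint64_t raid_block, uint64_t strip_size, uint64_t num_drives)
{
    uint64_t stripe_data_blocks = strip_size * (num_drives - 1);
    uint64_t stripe = raid_block / stripe_data_blocks;
    uint64_t off_in_stripe = raid_block % stripe_data_blocks;
    uint64_t data_strip = off_in_stripe / strip_size;

    /* The parity strip rotates from the last drive towards the first... */
    uint64_t parity_drive = (num_drives - 1) - (stripe % num_drives);
    /* ...and the data strips fill the remaining drives in order. */
    uint64_t data_drive = data_strip < parity_drive ? data_strip : data_strip + 1;

    struct block_location loc = {
        .stripe = stripe,
        .parity_drive = parity_drive,
        .data_drive = data_drive,
        .member_block = stripe * strip_size + off_in_stripe % strip_size,
    };
    return loc;
}

int
main(void)
{
    /* RAID block 20 from the diagram: 4 drives, strip size of 4 blocks. */
    struct block_location loc = locate_block(20, 4, 4);

    printf("stripe %" PRIu64 ", parity on drive %" PRIu64
           ", data on drive %" PRIu64 ", member block %" PRIu64 "\n",
           loc.stripe, loc.parity_drive, loc.data_drive, loc.member_block);
    return 0;
}
```

For RAID block 20 this prints stripe 1 with parity on drive 2 and the data on
drive 3 at member block 4, matching stripe 1 in the diagram.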

Writing less than a full stripe causes read-modify-write, hurts performance,
and can even lead to silent data corruption when a dirty shutdown is combined
with a drive failure, a phenomenon known as the RAID Write Hole. For these
reasons, it is recommended to optimize the workload to avoid partial stripe
writes and use *full stripe writes* whenever possible.

With raid5f we went beyond recommending full stripe writes and require them
outright. That is what the "f" at the end of raid5f stands for: full stripe
writes, to differentiate it from standard RAID 5. Without having to support
partial stripe updates, the code can be much simpler and potentially faster.
The stripe size is set as the bdev's *write unit size* and is enforced by the
bdev layer. This value can be retrieved with the API call
`spdk_bdev_get_write_unit_size()`. If a write to a raid5f bdev is not aligned
to a stripe boundary or its length is not a multiple of the stripe size, the
I/O will fail. Reads are handled normally, without such restrictions.
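
As a rough sketch of how an application might respect this restriction (an
illustrative example, not code from the post), it can query the write unit
size through the standard bdev API and only submit stripe-aligned writes. The
`desc` and `ch` arguments are assumed to come from `spdk_bdev_open_ext()` and
`spdk_bdev_get_io_channel()`; completion handling is omitted.

```c
/*
 * Sketch: submit a write to a raid5f bdev only if it starts on a stripe
 * boundary and covers whole stripes.
 */
#include <errno.h>
#include "spdk/bdev.h"

static int
raid5f_write_full_stripes(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                          void *buf, uint64_t offset_blocks, uint64_t num_blocks,
                          spdk_bdev_io_completion_cb cb, void *cb_arg)
{
    struct spdk_bdev *bdev = spdk_bdev_desc_get_bdev(desc);
    /* For raid5f, the write unit size is the stripe size, in blocks. */
    uint32_t write_unit = spdk_bdev_get_write_unit_size(bdev);

    if (offset_blocks % write_unit != 0 || num_blocks % write_unit != 0) {
        /* The bdev layer would reject this I/O anyway. */
        return -EINVAL;
    }

    return spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks,
                                  cb, cb_arg);
}
```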

Obviously, the requirement to use only full stripe writes puts an additional
burden on the application. In some cases it may not be a problem, in others it
will require big changes. An interesting option, and initially the main
motivation behind raid5f, is to combine it with the
[FTL bdev](https://spdk.io/doc/ftl.html). It can work on top of raid5f and,
among other benefits, eliminates the requirement to issue writes in full
stripes, thanks to its logical-to-physical (L2P) block mapping.

If you would like to try raid5f, configure SPDK with the `--with-raid5f`
option. You need at least 3 bdevs of any kind, but with the same block size
and metadata format. Their sizes do not have to be equal; the usable size of
each array member will be limited to that of the smallest base bdev. An
example command to create a raid5f bdev is:

`$ scripts/rpc.py bdev_raid_create -n raid_bdev0 -z 128 -r 5f -b "malloc0 malloc1 malloc2"`

This creates a raid5f array named `raid_bdev0` from the bdevs `malloc0`,
`malloc1` and `malloc2`, with the strip size set to 128 KiB. Since a 3-member
stripe holds two data strips, the resulting write unit size is 256 KiB.
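
You can then inspect the array and its state with, for example,
`$ scripts/rpc.py bdev_raid_get_bdevs all`.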

SPDK RAID modules are still in active development. Some big features, such as
rebuild and superblock support, have recently been merged and are available in
SPDK release 24.01. More are coming in the future, and we encourage you to
review, share feedback, and submit your changes!