---
layout: post
title: "raid5f: the SPDK RAID 5 implementation"
author: Artur Paszkiewicz
categories: news
---

RAID 5 is a popular RAID level which, similarly to RAID 1 (mirroring), protects
data in the array against failure of a member drive and can improve read
performance by reading from multiple drives in parallel. It is often preferred
over mirroring because it "wastes" less storage capacity - the array can have
many member drives, but the total available capacity is always reduced only by
the size of one member. RAID 1, by contrast, exposes the capacity of only one
member - that's 50% in the most common two-way mirror case, which provides the
same level of protection as RAID 5: protection from the failure of one drive.

The biggest downside of RAID 5 has always been write performance. Every write
to the array causes writes to at least two member drives, because apart from
the actual data, the parity also has to be updated. Parity is the redundant data
that allows recovery after a drive failure. It must stay in sync with the data
on the other drives. But the problem is not in the additional write. After all,
RAID 1 also has to duplicate writes. Computing the parity does add some
overhead but is still much faster than accessing storage. The real performance
hit comes when a write causes a read-modify-write cycle - to compute the
updated parity, some data first needs to be read from at least one member
drive. This happens when a RAID stripe is partially updated.
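
As a rough illustration of why partial writes are expensive, here is a minimal
sketch of the parity math (plain XOR, not code from the raid5f sources): a full
stripe write can compute the parity entirely from the data it already has in
hand, while a partial update needs the old data (and old parity) read back from
the drives first.

```c
#include <stddef.h>
#include <stdint.h>

/* Full stripe write: the parity strip is simply the XOR of all data strips,
 * so it can be computed without reading anything from the member drives. */
void
parity_from_full_stripe(uint8_t *parity, uint8_t *const *data_strips,
			size_t n_data, size_t strip_len)
{
	for (size_t i = 0; i < strip_len; i++) {
		uint8_t p = 0;

		for (size_t d = 0; d < n_data; d++) {
			p ^= data_strips[d][i];
		}
		parity[i] = p;
	}
}

/* Partial stripe update: new_parity = old_parity ^ old_data ^ new_data.
 * The old data (and old parity) must be read from the drives first - that
 * extra read is the read-modify-write penalty described above. */
void
parity_update_rmw(uint8_t *parity, const uint8_t *old_data,
		  const uint8_t *new_data, size_t strip_len)
{
	for (size_t i = 0; i < strip_len; i++) {
		parity[i] ^= old_data[i] ^ new_data[i];
	}
}
```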

A stripe is the data plus parity interleaved across the member drives. The size
of the part of a stripe that resides on one member is configurable and is known
as the strip (a.k.a. *chunk*) size. The amount of data that a single stripe can
hold is equal to `strip_size * (n - 1)`, where `n` is the number of members in
the array. Here is a diagram showing an example layout of an array of 4 drives
with the strip size set to 4 blocks, so the stripe size (or *width*) is 12
blocks:

```
block   drive 0        drive 1        drive 2        drive 3
      +--------------+--------------+--------------+--------------+
  0   | data strip 0 | data strip 1 | data strip 2 | parity strip |  stripe 0
  1   |              |              |              |              |
  2   | RAID blocks  | RAID blocks  | RAID blocks  |              |  RAID blocks
  3   | 0-3          | 4-7          | 8-11         |              |  0-11
      +--------------+--------------+--------------+--------------+
  4   | data strip 0 | data strip 1 | parity strip | data strip 2 |  stripe 1
  5   |              |              |              |              |
  6   | RAID blocks  | RAID blocks  |              | RAID blocks  |  RAID blocks
  7   | 12-15        | 16-19        |              | 20-23        |  12-23
      +--------------+--------------+--------------+--------------+
  8   | data strip 0 | parity strip | data strip 1 | data strip 2 |  stripe 2
  9   |              |              |              |              |
 10   | RAID blocks  |              | RAID blocks  | RAID blocks  |  RAID blocks
 11   | 24-27        |              | 28-31        | 32-35        |  24-35
      +--------------+--------------+--------------+--------------+
 12   | parity strip | data strip 0 | data strip 1 | data strip 2 |  stripe 3
 13   |              |              |              |              |
 14   |              | RAID blocks  | RAID blocks  | RAID blocks  |  RAID blocks
 15   |              | 36-39        | 40-43        | 44-47        |  36-47
      +--------------+--------------+--------------+--------------+
 16   | data strip 0 | data strip 1 | data strip 2 | parity strip |  stripe 4
 17   |              |              |              |              |
 18   | RAID blocks  | RAID blocks  | RAID blocks  |              |  RAID blocks
 19   | 48-51        | 52-55        | 56-59        |              |  48-59
      +--------------+--------------+--------------+--------------+
 20   | data strip 0 | data strip 1 | parity strip | data strip 2 |  stripe 5
 21   |              |              |              |              |
...
```
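
To make the layout concrete, below is a small sketch (an illustration of the
diagram above, not code taken from raid5f) that maps a RAID block number to its
stripe, member drive, and block offset, assuming the parity strip rotates one
drive to the left with every stripe, as shown above:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Map a RAID block to (stripe, drive, block on that drive) for an array of
 * 'n' members with the rotating-parity layout from the diagram. */
static void
map_raid_block(uint64_t raid_block, uint32_t strip_size, uint32_t n)
{
	uint64_t stripe = raid_block / (strip_size * (n - 1));
	uint64_t idx_in_stripe = raid_block % (strip_size * (n - 1));
	uint32_t data_strip = idx_in_stripe / strip_size;
	uint32_t block_in_strip = idx_in_stripe % strip_size;
	/* Parity starts on the last drive and moves one drive left per stripe. */
	uint32_t parity_drive = (n - 1) - (stripe % n);
	/* Data strips fill the drives in order, skipping the parity drive. */
	uint32_t drive = data_strip < parity_drive ? data_strip : data_strip + 1;

	printf("RAID block %" PRIu64 ": stripe %" PRIu64 ", drive %u, "
	       "drive block %" PRIu64 "\n",
	       raid_block, stripe, drive,
	       stripe * strip_size + block_in_strip);
}

int
main(void)
{
	/* Same parameters as the diagram: 4 drives, strip size of 4 blocks.
	 * RAID block 20 lands in stripe 1, on drive 3, at drive block 4. */
	map_raid_block(20, 4, 4);
	return 0;
}
```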

Writing less than a full stripe causes a read-modify-write cycle, hurts
performance, and can even lead to silent data corruption when a dirty shutdown
is combined with a drive failure - a phenomenon known as the RAID Write Hole.
For these reasons, it is recommended to optimize the workload to avoid partial
stripe writes and to use *full stripe writes* whenever possible.

With raid5f we went beyond recommending full stripe writes and require them
outright. That's what the "f" at the end of raid5f stands for: full stripe
writes, to differentiate it from standard RAID 5. Without having to support
partial stripe updates, the code can be much simpler and potentially faster.
The stripe size is set as the bdev's *write unit size* and is enforced by the
bdev layer. This value can be retrieved with the API call
`spdk_bdev_get_write_unit_size()`. If a write to a raid5f bdev is not aligned
to a stripe boundary or its length is not a multiple of the stripe size, the
I/O will fail. Reads are handled normally, without such restrictions.
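
For example, an application could use that call to validate an I/O before
submitting it. The helper below is hypothetical (it is not part of SPDK), but
`spdk_bdev_get_write_unit_size()` is the call mentioned above; the write unit
size, offset, and length are all expressed in blocks.

```c
#include "spdk/bdev.h"

/* Hypothetical helper, not part of SPDK: returns true if a write of
 * 'num_blocks' starting at 'offset_blocks' satisfies the raid5f full stripe
 * write requirement, i.e. both the offset and the length are multiples of the
 * bdev's write unit size (which for raid5f equals the stripe size in blocks). */
bool
write_is_full_stripe(struct spdk_bdev *bdev, uint64_t offset_blocks,
		     uint64_t num_blocks)
{
	uint32_t wus = spdk_bdev_get_write_unit_size(bdev);

	return num_blocks != 0 &&
	       offset_blocks % wus == 0 &&
	       num_blocks % wus == 0;
}
```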

Obviously, the requirement to use only full stripe writes puts an additional
burden on the application. In some cases it may not be a problem; in others it
will require big changes. An interesting option, and initially the main
motivation behind raid5f, is to combine it with the
[FTL bdev](https://spdk.io/doc/ftl.html). It can work on top of raid5f and,
among other benefits, eliminates the requirement to issue writes in full
stripes, thanks to its logical-to-physical (L2P) block mapping.

If you would like to try raid5f, configure SPDK using the `--with-raid5f`
option. You need at least 3 bdevs of any kind, but with the same block size and
metadata format. Their sizes do not have to be equal; the usable size of each
array member will be limited to that of the smallest base bdev. An example
command to create a raid5f bdev is:

`$ scripts/rpc.py bdev_raid_create -n raid_bdev0 -z 128 -r 5f -b "malloc0 malloc1 malloc2"`

This creates a raid5f array named `raid_bdev0` from bdevs `malloc0 malloc1
malloc2` with the strip size set to 128 KiB, which results in a write unit
size of 256 KiB (two data strips per stripe).

SPDK RAID modules are still in active development. Some big features, like
rebuild and superblock support, have recently been merged and are available in
SPDK release 24.01. More are coming in the future - we encourage you to review,
share feedback, and submit your changes!
