---
layout: post
title: "raid5f: the SPDK RAID 5 implementation"
author: Artur Paszkiewicz
categories: news
---

RAID 5 is a popular RAID level which, similarly to RAID 1 (mirroring), protects
the data in the array against the failure of a member drive and can improve
read performance by reading from multiple drives in parallel. It is often
preferred over mirroring because it "wastes" less storage capacity - the array
can have many member drives, but the total available capacity is reduced only
by the size of one member. RAID 1, by contrast, exposes the capacity of only
one member - just 50% in the most common two-way mirror case - while providing
the same level of protection as RAID 5: protection against the failure of one
drive.

The biggest downside of RAID 5 has always been write performance. Every write
to the array causes writes to at least two member drives because, apart from
the actual data, the parity also has to be updated. Parity is the redundant
data that allows recovery after a drive failure, and it must stay in sync with
the data on the other drives. But the problem is not the additional write
itself; after all, RAID 1 also has to duplicate writes. Computing the parity
adds some overhead, but it is still much faster than accessing storage. The
real performance hit comes when a write causes a read-modify-write cycle: to
compute the updated parity, some data first needs to be read from at least one
member drive. This happens when a RAID stripe is partially updated.
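
The post does not spell out the parity math, so here is an illustrative sketch
(not SPDK code): RAID 5 parity is the bitwise XOR of the data strips in a
stripe, so a full stripe write can compute parity from the new data alone,
while a partial update first needs the old data and the old parity.

```c
/*
 * Illustrative sketch, not SPDK code: contrasts parity computation for a
 * full stripe write with the read-modify-write needed for a partial update.
 */
#include <stdint.h>
#include <stdio.h>

#define STRIP_LEN 4 /* toy strip of 4 bytes */

int
main(void)
{
    uint8_t d0[STRIP_LEN] = {0x11, 0x22, 0x33, 0x44};
    uint8_t d1[STRIP_LEN] = {0xa0, 0xb0, 0xc0, 0xd0};
    uint8_t d2[STRIP_LEN] = {0x05, 0x06, 0x07, 0x08};
    uint8_t parity[STRIP_LEN];

    /* Full stripe write: parity = d0 ^ d1 ^ d2, no reads required. */
    for (int i = 0; i < STRIP_LEN; i++) {
        parity[i] = d0[i] ^ d1[i] ^ d2[i];
    }

    /* Partial update of d1: new_parity = old_parity ^ old_d1 ^ new_d1.
     * Both old_d1 and old_parity have to be read from the members first -
     * this is the read-modify-write cycle. */
    uint8_t new_d1[STRIP_LEN] = {0xde, 0xad, 0xbe, 0xef};
    for (int i = 0; i < STRIP_LEN; i++) {
        parity[i] ^= d1[i] ^ new_d1[i];
    }

    printf("updated parity: %02x %02x %02x %02x\n",
           parity[0], parity[1], parity[2], parity[3]);
    return 0;
}
```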

A stripe is the data plus parity interleaved across the member drives. The
size of the part of a stripe that resides on one member is configurable and is
known as the strip (AKA *chunk*) size. The amount of data that a single stripe
can hold is equal to `strip_size * (n - 1)`, where `n` is the number of
members in the array. Here is a diagram showing an example layout of an array
of 4 drives with the strip size set to 4 blocks, so the stripe size (or
*width*) is 12 blocks:

```
block   drive 0        drive 1        drive 2        drive 3
      +--------------+--------------+--------------+--------------+
  0   | data strip 0 | data strip 1 | data strip 2 | parity strip | stripe 0
  1   |              |              |              |              |
  2   | RAID blocks  | RAID blocks  | RAID blocks  |              | RAID blocks
  3   | 0-3          | 4-7          | 8-11         |              | 0-11
      +--------------+--------------+--------------+--------------+
  4   | data strip 0 | data strip 1 | parity strip | data strip 2 | stripe 1
  5   |              |              |              |              |
  6   | RAID blocks  | RAID blocks  |              | RAID blocks  | RAID blocks
  7   | 12-15        | 16-19        |              | 20-23        | 12-23
      +--------------+--------------+--------------+--------------+
  8   | data strip 0 | parity strip | data strip 1 | data strip 2 | stripe 2
  9   |              |              |              |              |
 10   | RAID blocks  |              | RAID blocks  | RAID blocks  | RAID blocks
 11   | 24-27        |              | 28-31        | 32-35        | 24-35
      +--------------+--------------+--------------+--------------+
 12   | parity strip | data strip 0 | data strip 1 | data strip 2 | stripe 3
 13   |              |              |              |              |
 14   |              | RAID blocks  | RAID blocks  | RAID blocks  | RAID blocks
 15   |              | 36-39        | 40-43        | 44-47        | 36-47
      +--------------+--------------+--------------+--------------+
 16   | data strip 0 | data strip 1 | data strip 2 | parity strip | stripe 4
 17   |              |              |              |              |
 18   | RAID blocks  | RAID blocks  | RAID blocks  |              | RAID blocks
 19   | 48-51        | 52-55        | 56-59        |              | 48-59
      +--------------+--------------+--------------+--------------+
 20   | data strip 0 | data strip 1 | parity strip | data strip 2 | stripe 5
 21   |              |              |              |              |
 ...
```
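
To make the layout concrete, here is a small illustrative sketch (not the
raid5f driver code) that maps a logical RAID block to the position shown in
the diagram: the parity strip rotates from the last drive towards the first,
and the data strips occupy the remaining drives in order. The `locate_block`
helper and the `struct block_location` type are made up for this example.

```c
/*
 * Illustrative sketch (not the raid5f driver code): map a logical RAID block
 * to its location in the layout shown in the diagram above.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct block_location {
    uint64_t stripe;       /* stripe index */
    uint64_t parity_drive; /* member holding this stripe's parity strip */
    uint64_t data_drive;   /* member holding the data block */
    uint64_t member_block; /* block offset within that member */
};

static struct block_location
locate_block(uint64_t raid_block, uint64_t strip_size, uint64_t num_drives)
{
    uint64_t stripe_data_blocks = strip_size * (num_drives - 1);
    uint64_t stripe = raid_block / stripe_data_blocks;
    uint64_t off_in_stripe = raid_block % stripe_data_blocks;
    uint64_t data_strip = off_in_stripe / strip_size;

    /* The parity strip rotates from the last drive towards the first... */
    uint64_t parity_drive = (num_drives - 1) - (stripe % num_drives);
    /* ...and the data strips fill the remaining drives in order. */
    uint64_t data_drive = data_strip < parity_drive ? data_strip : data_strip + 1;

    struct block_location loc = {
        .stripe = stripe,
        .parity_drive = parity_drive,
        .data_drive = data_drive,
        .member_block = stripe * strip_size + off_in_stripe % strip_size,
    };
    return loc;
}

int
main(void)
{
    /* RAID block 20 from the diagram: 4 drives, strip size of 4 blocks. */
    struct block_location loc = locate_block(20, 4, 4);

    printf("stripe %" PRIu64 ", parity on drive %" PRIu64
           ", data on drive %" PRIu64 ", member block %" PRIu64 "\n",
           loc.stripe, loc.parity_drive, loc.data_drive, loc.member_block);
    return 0;
}
```

For RAID block 20 this prints stripe 1 with parity on drive 2 and the data on
drive 3 at member block 4, matching stripe 1 in the diagram.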

Writing less than a full stripe causes read-modify-write, hurts performance,
and can even lead to silent data corruption when a dirty shutdown is combined
with a drive failure, a phenomenon known as the RAID Write Hole. For these
reasons, it is recommended to optimize the workload to avoid partial stripe
writes and use *full stripe writes* whenever possible.

With raid5f we went beyond recommending full stripe writes and require them
outright. That is what the "f" at the end of raid5f stands for: full stripe
writes, to differentiate it from standard RAID 5. Without having to support
partial stripe updates, the code can be much simpler and potentially faster.
The stripe size is set as the bdev's *write unit size* and is enforced by the
bdev layer. This value can be retrieved with the API call
`spdk_bdev_get_write_unit_size()`. If a write to a raid5f bdev is not aligned
to a stripe boundary or its length is not a multiple of the stripe size, the
I/O will fail. Reads are handled normally, without such restrictions.
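
As a rough sketch of how an application might respect this restriction (an
illustrative example, not code from the post), it can query the write unit
size through the standard bdev API and only submit stripe-aligned writes. The
`desc` and `ch` arguments are assumed to come from `spdk_bdev_open_ext()` and
`spdk_bdev_get_io_channel()`; completion handling is omitted.

```c
/*
 * Sketch: submit a write to a raid5f bdev only if it starts on a stripe
 * boundary and covers whole stripes.
 */
#include <errno.h>
#include "spdk/bdev.h"

static int
raid5f_write_full_stripes(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                          void *buf, uint64_t offset_blocks, uint64_t num_blocks,
                          spdk_bdev_io_completion_cb cb, void *cb_arg)
{
    struct spdk_bdev *bdev = spdk_bdev_desc_get_bdev(desc);
    /* For raid5f, the write unit size is the stripe size, in blocks. */
    uint32_t write_unit = spdk_bdev_get_write_unit_size(bdev);

    if (offset_blocks % write_unit != 0 || num_blocks % write_unit != 0) {
        /* The bdev layer would reject this I/O anyway. */
        return -EINVAL;
    }

    return spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks,
                                  cb, cb_arg);
}
```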

Obviously, the requirement to use only full stripe writes puts an additional
burden on the application. In some cases it may not be a problem, in others it
will require big changes. An interesting option, and initially the main
motivation behind raid5f, is to combine it with the
[FTL bdev](https://spdk.io/doc/ftl.html). It can work on top of raid5f and,
among other benefits, eliminates the requirement to issue writes in full
stripes, thanks to its logical-to-physical (L2P) block mapping.

If you would like to try raid5f, configure SPDK with the `--with-raid5f`
option. You need at least 3 bdevs of any kind, but with the same block size
and metadata format. Their sizes do not have to be equal; the usable size of
each array member will be limited to that of the smallest base bdev. An
example command to create a raid5f bdev is:

`$ scripts/rpc.py bdev_raid_create -n raid_bdev0 -z 128 -r 5f -b "malloc0 malloc1 malloc2"`

This creates a raid5f array named `raid_bdev0` from the bdevs `malloc0`,
`malloc1` and `malloc2`, with the strip size set to 128 KiB. Since a 3-member
stripe holds two data strips, the resulting write unit size is 256 KiB.
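
You can then inspect the array and its state with, for example,
`$ scripts/rpc.py bdev_raid_get_bdevs all`.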

SPDK RAID modules are still in active development. Some big features, such as
rebuild and superblock support, have recently been merged and are available in
SPDK release 24.01. More are coming in the future, and we encourage you to
review, share feedback, and submit your changes!