btrfs-progs/Documentation/dev/design-raid-stripe-tree.txt
Johannes Thumshirn 04d9e13461 btrfs-progs: docs: add document for RAID stripe tree design
Add a document describing the layout and functionality of the newly
introduced RAID Stripe Tree.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 18:04:37 +01:00

BTRFS RAID Stripe Tree Design
=============================
Problem Statement
-----------------
RAID on zoned devices
---------------------
The current implementation of RAID profiles in BTRFS is based on the implicit
assumption that data placement is deterministic in the device chunks used for
mapping block groups.
With deterministic data placement, all physical on-disk extents of one logical
file extent are positioned at the same offset relative to the starting LBA of
a device chunk. This removes the need to read any meta-data to locate an
on-disk file extent. Figure 1 shows an example.
+------------+ +------------+
| | | |
| | | |
| | | |
+------------+ +------------+
| | | |
| D 1 | | D 1 |
| | | |
+------------+ +------------+
| | | |
| D 2 | | D 2 |
| | | |
+------------+ +------------+
| | | |
| | | |
| | | |
| | | |
+------------+ +------------+
Figure 1: Deterministic data placement
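As a minimal sketch of why no meta-data lookup is needed in this case, the
location of a mirrored copy can be computed by arithmetic alone. The function
and parameter names below are hypothetical and not taken from the BTRFS
sources; the sketch only assumes that every copy sits at the same relative
offset inside its device chunk.

```c
#include <stdint.h>

/*
 * Deterministic placement for a mirrored chunk (illustrative only): every
 * copy sits at the same offset from the start of its device chunk, so the
 * physical location follows from the logical address without any lookup.
 */
static uint64_t toy_mirror_physical(uint64_t chunk_logical_start,
				    uint64_t dev_chunk_physical_start,
				    uint64_t logical)
{
	return dev_chunk_physical_start + (logical - chunk_logical_start);
}
```

Calling the helper once per device chunk yields the location of each copy; the
relative offset is identical on all devices.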
With non-deterministic data placement, the on-disk extents of a logical file
extent can be scattered around inside the chunk. To read back the data with
non-deterministic data placement, additional meta-data describing the position
of the extents inside a chunk is needed. Figure 2 shows an example of this
style of data placement.
+------------+ +------------+
| | | |
| | | |
| | | |
+------------+ +------------+
| | | |
| D 1 | | D 2 |
| | | |
+------------+ +------------+
| | | |
| D 2 | | D 1 |
| | | |
+------------+ +------------+
| | | |
| | | |
| | | |
| | | |
+------------+ +------------+
Figure 2: Non-deterministic data placement
As BTRFS support for zoned block devices uses the Zone Append operation for
writing file data extents, there is no guarantee that the written extents have
the same offset within different device chunks. This implies that to be able
to use RAID with zoned devices, non-deterministic data placement must be
supported and additional meta-data describing the location of file extents
within device chunks is needed.
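The effect can be illustrated with a toy model that is not BTRFS code. Each
zone is reduced to a write pointer, and zone_append() is a hypothetical helper
that returns the location the "device" picked. When two mirrored devices
complete the same two appends in a different order, the copies end up at
different offsets, as in Figure 2.

```c
#include <stdint.h>

/* Toy model of a sequential-write zone: only the write pointer is tracked. */
struct toy_zone {
	uint64_t start;	/* first byte of the zone */
	uint64_t wp;	/* write pointer, relative to start */
};

/*
 * Hypothetical stand-in for a Zone Append: the caller names only the zone,
 * the "device" picks the location and reports it back.
 */
static uint64_t zone_append(struct toy_zone *zone, uint64_t len)
{
	uint64_t written_at = zone->start + zone->wp;

	zone->wp += len;
	return written_at;
}

/*
 * Mirror two extents (D1, D2) to two devices. Device 2 completes the
 * appends in the opposite order, so the reported offsets differ.
 * out[dev][extent] receives the reported locations.
 */
static void mirror_two_extents(uint64_t len_d1, uint64_t len_d2,
			       uint64_t out[2][2])
{
	struct toy_zone dev1 = { .start = 0, .wp = 0 };
	struct toy_zone dev2 = { .start = 0, .wp = 0 };

	out[0][0] = zone_append(&dev1, len_d1);	/* dev1: D1 first */
	out[0][1] = zone_append(&dev1, len_d2);	/* dev1: D2 second */
	out[1][1] = zone_append(&dev2, len_d2);	/* dev2: D2 first */
	out[1][0] = zone_append(&dev2, len_d1);	/* dev2: D1 second */
}
```

The only way to find D1 on the second device afterwards is to record the
location reported on completion, which is the role of the additional
meta-data.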
Lessons learned from RAID 5
---------------------------
The BTRFS implementation of RAID levels 5 and 6 suffers from the well-known
RAID write hole problem. This problem exists because sub-stripe write
operations are not done using copy-on-write (COW) but using Read-Modify-Write
(RMW). With out-of-place writing like COW, no blocks are overwritten, so there
is no risk of exposing bad data or corrupting a data stripe's parity if a
sudden power loss or another unexpected event prevents a write from completing
correctly.
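The write hole can be reproduced in a toy model, reduced here to two one-byte
data blocks and an XOR parity block. The structure and function names are
illustrative assumptions and do not correspond to BTRFS code.

```c
#include <stdint.h>

/* Toy RAID 5 stripe: two data blocks of one byte each plus XOR parity. */
struct toy_stripe {
	uint8_t d0;
	uint8_t d1;
	uint8_t parity;	/* d0 ^ d1 when the stripe is consistent */
};

/* Rebuild d1 from the surviving blocks after losing the d1 device. */
static uint8_t rebuild_d1(const struct toy_stripe *s)
{
	return s->d0 ^ s->parity;
}

/*
 * In-place RMW update of d0. If 'crash_before_parity' is set, the new
 * data reaches the disk but the matching parity update is lost.
 */
static void rmw_update_d0(struct toy_stripe *s, uint8_t new_d0,
			  int crash_before_parity)
{
	uint8_t new_parity = s->parity ^ s->d0 ^ new_d0;

	s->d0 = new_d0;
	if (crash_before_parity)
		return;
	s->parity = new_parity;
}
```

After an interrupted update, rebuild_d1() no longer returns the original
content of d1, although d1 itself was never written. With out-of-place
writes, the old d0 and the old parity would both stay intact until the new
stripe is complete.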
RAID Stripe Tree Design overview
--------------------------------
To solve the problems stated above, additional meta-data is introduced: a RAID
Stripe Tree holds the logical to physical translation for the RAID stripes.
For each logical file extent (struct btrfs_file_extent_item) a stripe extent
is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
holds the device ID and the physical start location on that device of the
sub-stripe of a file extent, as well as the stride's length.
Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
btrfs_key and disk_num_bytes as the offset. The key's type is
BTRFS_STRIPE_EXTENT_KEY.
On-disk format
--------------
struct btrfs_file_extent_item {
/* […] */
__le64 disk_bytenr; ------------------------------------+
__le64 disk_num_bytes; --------------------------------|----+
/* […] */ | |
}; | |
| |
struct btrfs_key key = { -------------------------------------|----|--+
.objectid = btrfs_file_extent_item::disk_bytenr, <------+ | |
.type = BTRFS_STRIPE_EXTENT_KEY, | |
.offset = btrfs_file_extent_item::disk_num_bytes, <----------+ |
}; |
|
struct btrfs_raid_stride { <------------------+ |
__le64 devid; | |
__le64 physical; | |
__le64 length; | |
}; | |
| |
struct btrfs_stripe_extent { <---------------|-------------------------+
u8 encoding; |
u8 reserved[7]; |
struct btrfs_raid_stride strides[]; ---+
};
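The size of one item follows directly from this layout: an 8-byte header plus
24 bytes per stride. The sketch below mirrors the two structures in user
space, with uint64_t standing in for __le64 (an assumption that ignores
endianness). The names raid_stride, stripe_extent and
stripe_extent_item_size() are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the layout above; uint64_t stands in for __le64. */
struct raid_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

struct stripe_extent {
	uint8_t encoding;
	uint8_t reserved[7];
	struct raid_stride strides[];	/* flexible array, adds no size */
};

/* Size in bytes of one stripe extent item holding 'nr_strides' strides. */
static size_t stripe_extent_item_size(size_t nr_strides)
{
	return sizeof(struct stripe_extent) +
	       nr_strides * sizeof(struct raid_stride);
}
```

For one, two and four strides this gives 32, 56 and 104 bytes, which matches
the itemsize values in the example tree dumps.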
User-space support
------------------
mkfs
----
Supporting the RAID Stripe Tree in mkfs takes three steps. First, mkfs creates
the root of the RAID Stripe Tree itself. Second, it sets the incompat flag, so
that a kernel without the appropriate support refuses to mount a filesystem
with a RAID Stripe Tree. Lastly, it must allow RAID profiles on zoned devices
once the tree is present.
Check
-----
The btrfs check support for the RAID Stripe Tree is not implemented yet. Its
task will be to read the struct btrfs_stripe_extent entries for each struct
btrfs_file_extent_item and verify that a correct mapping between these two
exists. If data checksum verification is requested as well, the tree must also
be read to perform the logical to physical translation; otherwise the data
cannot be read and the checksums cannot be verified.
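Since the check support does not exist yet, the following is only a sketch of
one plausible consistency rule, inferred from the example tree dumps in this
document and not from any implementation: the stride lengths of an item should
add up to the key offset (disk_num_bytes) once per copy of the data. The
encoding values and all toy_ names are hypothetical, and two-way mirroring is
assumed for RAID1 and RAID10.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding values, chosen for this sketch only. */
enum toy_encoding { TOY_RAID0, TOY_RAID1, TOY_RAID10 };

struct toy_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/* Number of copies kept of every byte (assumes two-way mirroring). */
static unsigned int toy_ncopies(enum toy_encoding enc)
{
	return enc == TOY_RAID0 ? 1 : 2;
}

/*
 * Return 1 if the strides cover 'extent_len' bytes (the key offset,
 * i.e. disk_num_bytes) exactly once per copy, 0 otherwise.
 */
static int toy_check_stripe_extent(enum toy_encoding enc, uint64_t extent_len,
				   const struct toy_stride *strides, size_t nr)
{
	uint64_t total = 0;

	for (size_t i = 0; i < nr; i++) {
		if (strides[i].length == 0)
			return 0;
		total += strides[i].length;
	}
	return total == extent_len * toy_ncopies(enc);
}
```

A real checker would also have to validate device IDs and physical ranges
against the chunk tree, which this sketch leaves out.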
Example tree dumps
------------------
Example 1: Write 128k to an empty FS
RAID0
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
RAID1
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 131072
stripe 1 devid 2 physical XXXXXXXXX length 131072
RAID10
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
stripe 2 devid 3 physical XXXXXXXXX length 65536
stripe 3 devid 4 physical XXXXXXXXX length 65536
Example 2: Pre-fill one 64k stripe, write 4k to 2nd stripe, write 64k then
write 16k.
RAID0
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 2 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID0
stripe 0 devid 2 physical XXXXXXXXX length 61440
stripe 1 devid 1 physical XXXXXXXXX length 4096
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 4096
RAID1
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 4096
stripe 1 devid 2 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 4096
stripe 1 devid 2 physical XXXXXXXXX length 4096
RAID10
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 3 physical XXXXXXXXX length 4096
stripe 1 devid 4 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
encoding: RAID10
stripe 0 devid 3 physical XXXXXXXXX length 61440
stripe 1 devid 4 physical XXXXXXXXX length 61440
stripe 2 devid 1 physical XXXXXXXXX length 4096
stripe 3 devid 2 physical XXXXXXXXX length 4096
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 4096
stripe 1 devid 2 physical XXXXXXXXX length 4096
Example 3: Pre-fill stripe with 32K data, then write 64K of data and then
overwrite 8k in the middle.
RAID0
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 65536
stripe 2 devid 1 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 8192
RAID1
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 131072
stripe 1 devid 2 physical XXXXXXXXX length 131072
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 8192
stripe 1 devid 2 physical XXXXXXXXX length 8192
RAID10
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 32768
stripe 2 devid 3 physical XXXXXXXXX length 65536
stripe 3 devid 4 physical XXXXXXXXX length 65536
stripe 4 devid 1 physical XXXXXXXXX length 32768
stripe 5 devid 2 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 8192
stripe 1 devid 2 physical XXXXXXXXX length 8192
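To show how such an item could be used for the logical to physical
translation, here is a minimal sketch for the RAID0 case. It assumes that the
strides of an item are stored in logical order, which the dumps above suggest
but this document does not state. The toy_ names are hypothetical, and the
physical addresses in any usage are made up because the dumps only show
placeholders.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/*
 * Translate a byte offset inside one RAID0 stripe extent to a device and a
 * physical address. Assumes the strides are stored in logical order.
 * Returns 0 on success, -1 if 'offset' lies beyond the extent.
 */
static int toy_raid0_map(const struct toy_stride *strides, size_t nr,
			 uint64_t offset, uint64_t *devid, uint64_t *physical)
{
	for (size_t i = 0; i < nr; i++) {
		if (offset < strides[i].length) {
			*devid = strides[i].devid;
			*physical = strides[i].physical + offset;
			return 0;
		}
		/* Not in this stride: skip it and keep walking. */
		offset -= strides[i].length;
	}
	return -1;
}
```

Applied to item 2 of the RAID0 dump in Example 2 (61440 bytes on devid 2,
then 4096 bytes on devid 1), an offset of 61440 into the extent resolves to
the first byte of the stride on devid 1.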
Glossary
--------
RAID
Redundant Array of Independent Disks. This is a storage mechanism
where data is not stored on a single disk alone but either mirrored
(in case of RAID 1) or split across multiple disks (RAID 0). Other
RAID levels like RAID 5 and RAID 6 stripe the data across multiple disks
and add parity information to enable data recovery in case of a disk
failure.
LBA
Logical Block Address. LBAs describe the address space of a block
device as a linearly increasing address map. LBAs are internally
mapped to different physical locations by the device firmware.
Zoned Block Device
Zoned Block Devices are a special kind of block device that partitions
its LBA space into so-called zones. These zones can impose write
constraints on the host, e.g., allowing only sequential writes aligned
to a zone write pointer.
Zone Append
A write operation where the start LBA of a Zone is specified instead
of a destination LBA for the data to be written. On completion the
device reports the starting LBA used to write the data back to the
host.
Copy-on-Write
A write technique where the data is not overwritten, but a new version
of it is written out of place.
Read-Modify-Write
A write technique where the data to be written is first read from the
block device, modified in memory and the modified data written back in
place on the block device.