BTRFS RAID Stripe Tree Design
=============================

Problem Statement
-----------------

RAID on zoned devices
---------------------

The current implementation of RAID profiles in BTRFS is based on the
implicit assumption that data placement inside the device chunks used
for mapping block groups is deterministic. With deterministic data
placement, all physical on-disk extents of one logical file extent are
positioned at the same offset relative to the starting LBA of a device
chunk. This removes the need to read any meta-data in order to access
an on-disk file extent. Figure 1 shows an example of this.

    +------------+    +------------+
    |            |    |            |
    |            |    |            |
    |            |    |            |
    +------------+    +------------+
    |            |    |            |
    |    D 1     |    |    D 1     |
    |            |    |            |
    +------------+    +------------+
    |            |    |            |
    |    D 2     |    |    D 2     |
    |            |    |            |
    +------------+    +------------+
    |            |    |            |
    |            |    |            |
    |            |    |            |
    |            |    |            |
    +------------+    +------------+

    Figure 1: Deterministic data placement

With non-deterministic data placement, the on-disk extents of a
logical file extent can be scattered around inside the chunk. To read
the data back, additional meta-data describing the position of the
extents inside a chunk is needed. Figure 2 shows an example of this
style of data placement.

    +------------+    +------------+
    |            |    |            |
    |            |    |            |
    |            |    |            |
    +------------+    +------------+
    |            |    |            |
    |    D 1     |    |    D 2     |
    |            |    |            |
    +------------+    +------------+
    |            |    |            |
    |    D 2     |    |    D 1     |
    |            |    |            |
    +------------+    +------------+
    |            |    |            |
    |            |    |            |
    |            |    |            |
    |            |    |            |
    +------------+    +------------+

    Figure 2: Non-deterministic data placement

As BTRFS support for zoned block devices uses the Zone Append
operation for writing file data extents, there is no guarantee that
the written extents end up at the same offset within different device
chunks. This implies that, to be able to use RAID with zoned devices,
non-deterministic data placement must be supported and additional
meta-data describing the location of file extents within the device
chunks is needed.

Lessons learned from RAID 5
---------------------------

The BTRFS implementation of RAID levels 5 and 6 suffers from the
well-known RAID write hole problem. The problem exists because
sub-stripe write operations are not done using copy-on-write (COW) but
using read-modify-write (RMW). With out-of-place writing such as COW,
no blocks are overwritten, so there is no risk of exposing bad data or
corrupting a data stripe's parity in case of a sudden power loss or
another unexpected event that prevents the write from completing
correctly.

RAID Stripe Tree Design overview
--------------------------------

To solve the problems stated above, additional meta-data is
introduced: a RAID Stripe Tree holds the logical-to-physical
translation for the RAID stripes. For each logical file extent (struct
btrfs_file_extent_item) a stripe extent is created (struct
btrfs_stripe_extent). Each btrfs_stripe_extent entry is a container
for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
holds the device ID and the physical start location on that device of
the sub-stripe of a file extent, as well as the stride's length. Each
struct btrfs_stripe_extent is keyed by the struct
btrfs_file_extent_item's disk_bytenr and disk_num_bytes, with
disk_bytenr as the objectid of the btrfs_key and disk_num_bytes as the
offset. The key's type is BTRFS_RAID_STRIPE_KEY.
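To make the translation concrete, below is a minimal user-space sketch
of a stride lookup, mirroring the on-disk structures defined in the
next section. It assumes a RAID0-style encoding in which the strides
cover consecutive parts of the stripe extent's logical range; the
helper name stride_for_offset and all physical addresses are purely
illustrative, not kernel API.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified user-space mirror of struct btrfs_raid_stride. */
    struct raid_stride {
            uint64_t devid;    /* device this stride lives on */
            uint64_t physical; /* physical start location on that device */
            uint64_t length;   /* length of the stride in bytes */
    };

    /*
     * Translate an offset relative to a stripe extent's disk_bytenr into
     * a (devid, physical) pair, assuming the strides cover consecutive
     * parts of the logical range (RAID0-style encoding).
     */
    static int stride_for_offset(const struct raid_stride *strides,
                                 int nr_strides, uint64_t offset,
                                 uint64_t *devid, uint64_t *physical)
    {
            for (int i = 0; i < nr_strides; i++) {
                    if (offset < strides[i].length) {
                            *devid = strides[i].devid;
                            *physical = strides[i].physical + offset;
                            return 0;
                    }
                    offset -= strides[i].length;
            }
            return -1; /* offset lies past the end of the stripe extent */
    }

    int main(void)
    {
            /* Two 64k strides as in Example 1's RAID0 dump below; the
             * physical addresses are made up. */
            const struct raid_stride strides[] = {
                    { .devid = 1, .physical = 271581184, .length = 65536 },
                    { .devid = 2, .physical = 805306368, .length = 65536 },
            };
            uint64_t devid, physical;

            /* An offset of 68k falls 4k into the second stride. */
            if (!stride_for_offset(strides, 2, 68 * 1024, &devid, &physical))
                    printf("devid %llu physical %llu\n",
                           (unsigned long long)devid,
                           (unsigned long long)physical);
            return 0;
    }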
On-disk format
--------------

struct btrfs_file_extent_item {
        /* […] */
        __le64 disk_bytenr;     /* used as the key's objectid */
        __le64 disk_num_bytes;  /* used as the key's offset */
        /* […] */
};

struct btrfs_key key = {
        .objectid = btrfs_file_extent_item::disk_bytenr,
        .type = BTRFS_RAID_STRIPE_KEY,
        .offset = btrfs_file_extent_item::disk_num_bytes,
};

struct btrfs_raid_stride {
        __le64 devid;           /* device ID of this stride */
        __le64 physical;        /* physical start location on that device */
        __le64 length;          /* length of the stride */
};

/* Stored at 'key' in the RAID Stripe Tree. */
struct btrfs_stripe_extent {
        u8 encoding;            /* RAID encoding of this stripe extent */
        u8 reserved[7];
        struct btrfs_raid_stride strides[];
};

User-space support
------------------

mkfs
----

Supporting the RAID Stripe Tree in user-space consists of three tasks
for mkfs. The first is creating the root of the RAID Stripe Tree
itself. Then mkfs must set the incompat flag, so that a kernel version
without the corresponding support cannot mount a filesystem containing
a RAID Stripe Tree. Lastly, it must allow RAID profiles on zoned
devices once the tree is present.

Check
-----

The 'btrfs check' support for the RAID Stripe Tree is not implemented
yet. Its task is to read the struct btrfs_stripe_extent entries for
each struct btrfs_file_extent_item and verify that a correct mapping
between the two exists. If data checksum verification is requested as
well, the tree must also be read to perform the logical-to-physical
translation; otherwise the data cannot be read and the checksums
cannot be verified.

Example tree dumps
------------------

Example 1: Write 128k to an empty FS

RAID0

item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536

RAID1

item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 131072
        stripe 1 devid 2 physical XXXXXXXXX length 131072

RAID10

item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
        stripe 2 devid 3 physical XXXXXXXXX length 65536
        stripe 3 devid 4 physical XXXXXXXXX length 65536
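The itemsize values in these dumps follow directly from the on-disk
format: 8 bytes of header (encoding plus reserved[7]) and 24 bytes per
struct btrfs_raid_stride. The sketch below just reproduces that
arithmetic; stripe_extent_itemsize is a hypothetical helper, not an
existing function.

    #include <stdint.h>
    #include <stdio.h>

    #define STRIPE_EXTENT_HDR 8u  /* u8 encoding + u8 reserved[7] */
    #define RAID_STRIDE_SIZE  24u /* three __le64 fields per stride */

    static uint32_t stripe_extent_itemsize(uint32_t nr_strides)
    {
            return STRIPE_EXTENT_HDR + nr_strides * RAID_STRIDE_SIZE;
    }

    int main(void)
    {
            /* 1 -> 32, 2 -> 56, 3 -> 80, 4 -> 104, 6 -> 152: exactly
             * the itemsize values seen in the example dumps. */
            const uint32_t counts[] = { 1, 2, 3, 4, 6 };

            for (unsigned int i = 0; i < sizeof(counts) / sizeof(counts[0]); i++)
                    printf("%u strides -> itemsize %u\n", counts[i],
                           stripe_extent_itemsize(counts[i]));
            return 0;
    }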
Example 2: Pre-fill one 64k stripe, write 4k to the 2nd stripe, then
write 64k, then write another 4k.

RAID0

item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 2 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID0
        stripe 0 devid 2 physical XXXXXXXXX length 61440
        stripe 1 devid 1 physical XXXXXXXXX length 4096
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 4096

RAID1

item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 4096
        stripe 1 devid 2 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 4096
        stripe 1 devid 2 physical XXXXXXXXX length 4096

RAID10

item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 3 physical XXXXXXXXX length 4096
        stripe 1 devid 4 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
        encoding: RAID10
        stripe 0 devid 3 physical XXXXXXXXX length 61440
        stripe 1 devid 4 physical XXXXXXXXX length 61440
        stripe 2 devid 1 physical XXXXXXXXX length 4096
        stripe 3 devid 2 physical XXXXXXXXX length 4096
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 4096
        stripe 1 devid 2 physical XXXXXXXXX length 4096
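Item 2 of the RAID0 dump above shows how the 64k write, starting 4k
into the second stripe, is split at the 64k stripe boundary into a
61440-byte stride on devid 2 and a 4096-byte stride on devid 1. The
toy model below reproduces only this splitting arithmetic, assuming a
two-device RAID0 with round-robin 64k stripes; chop_into_strides is
illustrative and does not reflect how the kernel allocator is
actually structured.

    #include <stdint.h>
    #include <stdio.h>

    #define STRIPE_LEN (64 * 1024) /* stripe width assumed in the examples */

    /*
     * Chop a write into per-stripe strides. 'logical' is the write's
     * start offset within the block group, 'ndevs' the number of
     * devices the RAID0 layout rotates over.
     */
    static void chop_into_strides(uint64_t logical, uint64_t length, int ndevs)
    {
            int i = 0;

            while (length) {
                    uint64_t in_stripe = logical % STRIPE_LEN;
                    uint64_t run = STRIPE_LEN - in_stripe; /* room left in stripe */
                    int devid;

                    if (run > length)
                            run = length;
                    devid = (int)((logical / STRIPE_LEN) % ndevs) + 1;
                    printf("stripe %d devid %d length %llu\n", i++, devid,
                           (unsigned long long)run);
                    logical += run;
                    length -= run;
            }
    }

    int main(void)
    {
            /* The 64k write from Example 2: it starts 4k into the second
             * 64k stripe, so it is recorded as 61440 bytes on devid 2
             * followed by 4096 bytes on devid 1. */
            chop_into_strides(64 * 1024 + 4096, 64 * 1024, 2);
            return 0;
    }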
Example 3: Pre-fill a stripe with 32k of data, then write 128k of
data, and then overwrite 8k in the middle.

RAID0

item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 65536
        stripe 2 devid 1 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 8192

RAID1

item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 131072
        stripe 1 devid 2 physical XXXXXXXXX length 131072
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 8192
        stripe 1 devid 2 physical XXXXXXXXX length 8192

RAID10

item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 32768
        stripe 2 devid 3 physical XXXXXXXXX length 65536
        stripe 3 devid 4 physical XXXXXXXXX length 65536
        stripe 4 devid 1 physical XXXXXXXXX length 32768
        stripe 5 devid 2 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 8192
        stripe 1 devid 2 physical XXXXXXXXX length 8192

Glossary
--------

RAID
    Redundant Array of Independent Disks. A storage mechanism where
    data is not stored on a single disk alone but either mirrored (in
    case of RAID 1) or split across multiple disks (RAID 0). Other
    RAID levels like RAID 5 and RAID 6 stripe the data across multiple
    disks and add parity information to enable data recovery in case
    of a disk failure.

LBA
    Logical Block Address. LBAs describe the address space of a block
    device as a linearly increasing address map. LBAs are internally
    mapped to different physical locations by the device firmware.

Zoned Block Device
    A special kind of block device that partitions its LBA space into
    so-called zones. These zones can impose write constraints on the
    host, e.g., allowing only sequential writes aligned to a zone's
    write pointer.

Zone Append
    A write operation where the start LBA of a zone is specified
    instead of a destination LBA for the data to be written. On
    completion, the device reports back to the host the LBA at which
    the data was actually written.

Copy-on-Write
    A write technique where data is not overwritten in place; instead,
    a new version of it is written out of place.

Read-Modify-Write
    A write technique where the data to be written is first read from
    the block device, modified in memory, and then written back in
    place on the block device.