Commit graph

30 commits

Author SHA1 Message Date
Naohiro Aota c821e5545f btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
Wrap pwrite with btrfs_pwrite(). It simply calls pwrite() on non-zoned
btrfs (opened without O_DIRECT). On zoned mode (opened with O_DIRECT),
it allocates an aligned bounce buffer, copies the contents and uses it
for direct-IO writing.

Writes in device_zero_blocks() and btrfs_wipe_existing_sb() are a little
tricky. We don't have fs_info on our hands, so use zinfo to determine it
is a zoned device or not.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
Naohiro Aota 585ac14d1a btrfs-progs: use btrfs_device_size() instead of device_get_partition_size_fd()
device_get_partition_size_fd() fails if we pass a regular file. This can
happen when trying to create an emulated zoned filesystem on a regular file.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:35 +02:00
David Sterba d0ea2b2af4 btrfs-progs: zoned: also exclude raid1c3 and raid1c4 from supported profiles
The enumeration of profiles not available for zoned mode in
btrfs_load_block_group_zone_info was lacking the 3 and 4 copy raid1, add
them.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 18:39:58 +02:00
Johannes Thumshirn c22e9487a7 btrfs-progs: remove max_zone_append_size logic
max_zone_append_size is unused and can as well be removed just like we
did on the kernel side.

Keep one sanity check though, so we're not adding devices to a zoned FS
that aren't supporting zone append.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 16:49:07 +02:00
Naohiro Aota 53ec59ead0 btrfs-progs: do not zone reset on emulated zoned mode
We cannot zone reset a regular file with emulated zones. So, mkfs.btrfs
on such a file causes the following error.

  ERROR: zoned: failed to reset device '/home/naota/tmp/btrfs.img' zones: Inappropriate ioctl for device

Introduce btrfs_zoned_device_info->emulated to distinguish the zones are
emulated or not. And, use it to decide it needs zone reset or not.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 16:48:56 +02:00
David Sterba 96a5cf0719 btrfs-progs: handle EINVAL when reading zone size on older kernels
A combination of new progs and old kernel may lead to problems with
detecting zone size by ioctl. Fixed by #376 but still incomplete because
old kernels may return EINVAL for unsupported ioctl. This should be
ENOTTY but hasn't been like that until kernel 5.11.

As we always pass valid arguments to the ioctl we can't conflate the two
and can EINVAL the same way as ENOTTY.

Issue: #399
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-20 11:31:09 +02:00
David Sterba c3ee6a8a09 btrfs-progs: unify GPL header comments
Add the GPL v2 header to files where it was missing and is not from an
external source, update to the most recent version with the address.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 13:58:44 +02:00
Sidong Yang 94f3b75c00 btrfs-progs: zoned: fix memory leak in btrfs_sb_io()
In btrfs_sb_io(), blk_zone_report is used for getting information about
zones. But it is not freed if code goes in usual path. This patch frees
the variable just after it used.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-02 17:27:53 +02:00
David Sterba 1dc6f33c28 btrfs-progs: zoned: use fixed width type when reading zone size
The ioctl BLKGETZONESZ expects 32bit integer, declare the target
variable as such.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-02 17:27:53 +02:00
David Sterba 6134973527 btrfs-progs: zoned: make it work without kernel support
There's a report that a system with 4.19 kernel fails boot because
device scan exits with error. This is because zoned support is compiled
in btrfs-progs but not in kernel.

To make new progs and old kernels work, do a fallback when the zoned
ioctl is not available, as if it were a non-zoned device. There is no
other option, but this is safe at least for the device scan that would
not error out. Any unaligned writes to a zoned device will fail as
expected.

Issue: #376
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-07 17:38:46 +02:00
David Sterba b19a603d62 btrfs-progs: remove unnecessary linux/*.h includes
Decrease dependency on system headers, remove where they're not needed
or became stale after code moved. The path-utils.h encapsulate path
operations so include linux/limits.h here, that's where PATH_MAX is
defined.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:47 +02:00
David Sterba aa56bf3a31 btrfs-progs: zoned: replace raw ioctl with a helper for device size
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba c7b5f884e0 btrfs-progs: add prefix to zero_blocks
This is a public helper for devices, add the prefix to make it clear.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba 2b5d4f2e6f btrfs-progs: add prefix to discard_blocks
This is a helper for devices, make it clear in the function name.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba bc6864967b btrfs-progs: add prefix to exported queue_param
As this is a public helper, add a prefix that makes it clear what is the
queue related to.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba 38254c4934 btrfs-progs: kerncompat: add const_ilog2
The newly added zoned mode constants can utilize the const ilog2
version. Copy it from kernel include/linux/log2.h.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota 8c2dfa6387 btrfs-progs: zoned: wipe temporary superblocks in superblock log zone
mkfs.btrfs uses a temporary superblock during the initialization process.
The temporary superblock uses BTRFS_MAGIC_TEMPORARY as its magic which is
different from a regular superblock. As a result, libblkid, which only
supports the usual magic, cannot recognize the volume as btrfs. So, let's
wipe the temporary magic before writing out the usual superblock.

Technically, we can add the temporary magic to the libblkid's table. But,
it will result in recognizing a half-baked filesystem as btrfs, which is
not ideal.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota 8bbb0c5744 btrfs-progs: zoned: support zero out on zoned block device
If we zero out a region in a sequential write required zone, we cannot
write to the region until we reset the zone. Thus, we must prohibit zeroing
out to a sequential write required zone.

zero_dev_clamped() is modified to take the zone information and it calls
zero_zone_blocks() if the device is host managed to avoid writing to
sequential write required zones.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota 58ec593892 btrfs-progs: zoned: support resetting zoned device
All zones of zoned block devices should be reset before writing. Support
this by introducing PREP_DEVICE_ZONED.

btrfs_reset_all_zones() walk all the zones on a device, and reset a zone if
it is sequential required zone, or discard the zone range otherwise.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota bfdb3ae237 btrfs-progs: zoned: reset zone of freed block group
When freeing a chunk, we can/should reset the underlying device zones
for the chunk. Introduce btrfs_reset_chunk_zones() and reset the zones.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota bfd34b7876 btrfs-progs: zoned: redirty clean extent buffers
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On ZONED drives, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Check if next dirty extent buffer is continuous to a previously written
one. If not, it redirty extent buffers between the previous one and the
next one, so that all dirty buffers are written sequentially.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota feff533e34 btrfs-progs: zoned: calculate allocation offset for conventional zones
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.

But instead, we can consider the end of the highest addressed extent in
the block group for the allocation offset.

For new block group, we cannot calculate the allocation offset by
consulting the extent tree, because it can cause deadlock by taking
extent buffer lock after chunk mutex, which is already taken in
btrfs_make_block_group(). Since it is a new block group anyways, we can
simply set the allocation offset to 0.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota f08410f078 btrfs-progs: zoned: load zone's allocation offset
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address
within a block group. To facilitate this, add an "alloc_offset" to the
block group to track the logical addresses of the write pointer.

This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.

For now, zoned filesystems the single profile. Supporting non-single
profile with zone append writing is not trivial. For example, in the DUP
profile, we send a zone append writing IO to two zones on a device. The
device reply with written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of the zone are different, then it
results in different logical addresses.

We need fine-grained logical to physical mapping to support such
separated physical address issue. Since it should require additional
metadata type, disable non-single profiles for now.

This commit supports the case all the zones in a block group are
sequential. The next patch will handle the case having a conventional
zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota b031fe84fd btrfs-progs: zoned: implement zoned chunk allocator
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.

To implement the allocator, we need to extend the following functions for
a zoned filesystem:

  - init_alloc_chunk_ctl
  - dev_extent_search_start
  - dev_extent_hole_check
  - decide_stripe_size

Here, dev_extent_hole_check() is newly introduced to check the validity of
a hole found.

init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.

dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.

dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.

With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.

Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota 8ef9313cf2 btrfs-progs: zoned: implement log-structured superblock
Superblock (and its copies) is the only data structure in btrfs which has a
fixed location on a device. Since we cannot overwrite in a sequential write
required zone, we cannot place superblock in the zone.  One easy solution
is limiting superblock and copies to be placed only in conventional zones.

However, this method has two downsides: one is reduced number of superblock
copies. The location of the second copy of superblock is 256GB, which is in
a sequential write required zone on typical devices in the market today.
So, the number of superblock and copies is limited to be two.  Second
downside is that we cannot support devices which have no conventional zones
at all.

To solve these two problems, we employ superblock log writing. It uses two
adjacent zones as a circular buffer to write updated superblocks.  Once the
first zone is filled up, start writing into the second one.  Then, when
both zones are filled up and before starting to write to the first zone
again, reset the first zone.

We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones are
full. For this situation, we read out the last superblock of each zone, and
compare them to determine which zone is older.

The following zones are reserved as the circular buffer on ZONED btrfs.

- primary superblock: offset   0B (and the following zone)
- first copy:         offset 512G (and the following zone)
- Second copy:        offset   4T (4096G, and the following zone)

If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.

Currently, superblock reading/writing is done by pread/pwrite. This
commit replace the call sites with sbread/sbwrite to wrap the functions.
For zoned btrfs, btrfs_sb_io which is called from sbread/sbwrite
reverses the IO position back to a mirror number, maps the mirror number
into the superblock logging position, and do the IO.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota 49d5ce4d0f btrfs-progs: zoned: allow zoned filesystems on non-zoned block devices
Run a zoned filesystem on non-zoned devices. This is done by "slicing
up" the block device into fixed-sized chunks and emulate a conventional
zone on each of them. The emulated zone size is determined from the size
of device extent.

This is mainly aimed at testing of zoned filesystems, i.e. the zoned
chunk allocator, on regular block devices.

Currently, we always use EMULATED_ZONE_SIZE (256MiB) for the emulated
zone size. In the future, this will be customized by mkfs option.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota 707f0716e0 btrfs-progs: zoned: disallow mixed-bg in ZONED mode
Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate a space for it and write it immediately
after the allocation. For metadata, however, we cannot do that, because the
logical addresses are recorded in other metadata buffers to build up the
trees. As a result, a data buffer can be placed after a metadata buffer,
which is not written yet. Writing out the data buffer will break the
sequential write rule.

Check and disallow MIXED_BG with ZONED mode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota 3c0f83e541 btrfs-progs: zoned: introduce max_zone_append_size
The zone append write command has a maximum IO size restriction it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.

Introduce max_zone_append_size to zone_info and fs_info to track the
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limits.

Zone append command is mandatory for zoned btrfs. So, reject a device
with max_zone_append_size == 0.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota 7e520022ff btrfs-progs: zoned: check and enable ZONED mode
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota 384840b9c0 btrfs-progs: zoned: get zone information of zoned block devices
Get the zone information (number of zones and zone size) from all the
devices, if the volume contains a zoned block device. To avoid costly
run-time zone report commands to test the device zones type during block
allocation, it also records all the zone status (zone type, write
pointer position, etc.).

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00