btrfs-progs: docs: add more chapters (part 3)

All main pages have some content and many typos have been fixed.

Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba 2021-12-17 10:49:39 +01:00
parent c6be84840f
commit 208aed2ed4
26 changed files with 554 additions and 410 deletions

View file

@ -1,4 +1,9 @@
Balance
=======
...
.. include:: ch-balance-intro.rst
Filters
-------
.. include:: ch-balance-filters.rst

View file

@ -1,20 +1,44 @@
Common Linux features
=====================
Anything that's standard and also supported
The Linux operating system implements POSIX standard interfaces and APIs, with
additional interfaces. Many of them have become common in other filesystems. The
ones listed below have been added relatively recently and are considered
interesting for users:
- statx
birth/origin inode time
a timestamp associated with an inode recording when it was created, it cannot
be changed and requires the *statx* syscall to be read
- fallocate modes
statx
an extended version of the *stat* syscall that provides an extensible
interface to read more information that is not available in the original
*stat*
- birth/origin inode time
fallocate modes
the *fallocate* syscall allows manipulating file extents, e.g. punching
holes, preallocating or zeroing a range
- filesystem label
FIEMAP
an ioctl that enumerates file extents, the related tool is ``filefrag``
- xattr, acl
filesystem label
another means of filesystem identification, it could be used for mounting or
for easier recognition, and can be set or read by an ioctl or by the command
``btrfs filesystem label``
- FIEMAP
O_TMPFILE
a mode of the open() syscall that creates a file with no associated directory
entry, which makes it invisible to other processes and thus safe to use
as a temporary file
(https://lwn.net/Articles/619146/)
- O_TMPFILE
xattr, acl
extended attributes (xattr) are a list of *key=value* pairs associated
with a file, usually storing additional metadata related to security,
access control lists in particular (ACL) or properties (``btrfs
property``)
- XFLAGS, fileattr
- cross-rename
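A few of these interfaces can be exercised directly from the shell. The
following is a minimal sketch, assuming a mounted btrfs filesystem at */mnt*
and an example file */mnt/foo* (both hypothetical):
.. code-block:: bash
# read the birth/origin time (coreutils stat uses statx when available)
$ stat --format='birth: %w' /mnt/foo
# punch a 4KiB hole at the start of the file with fallocate
$ fallocate --punch-hole --offset 0 --length 4096 /mnt/foo
# enumerate the file extents via FIEMAP
$ filefrag -v /mnt/foo
# set and read the filesystem label
$ btrfs filesystem label /mnt mylabel
$ btrfs filesystem label /mnt
# attach and list an extended attribute
$ setfattr -n user.comment -v 'example' /mnt/foo
$ getfattr -d /mnt/foo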

View file

@ -1,16 +1,21 @@
Custom ioctls
=============
Anything that's not doing the other features and stands on it's own
Filesystems are usually extended by custom ioctls beyond the standard system
call interface to let user applications access the advanced features. They're
low level and the following list gives only an overview of the capabilities,
with the related command where available:
- reverse lookup, from file offset to inode
- reverse lookup, from file offset to inode, ``btrfs inspect-internal
logical-resolve``
- resolve inode number -> name
- resolve inode number to a list of names, ``btrfs inspect-internal inode-resolve``
- file offset -> all inodes that share it
- tree search, given a key range and tree id, look up and return all b-tree items
found in that range, basically all the metadata at hand, though you need to know
what to do with them
- tree search, all the metadata at your hand (if you know what to do with them)
- informative, about devices, space allocation or the whole filesystem, much of
which is also exported in ``/sys/fs/btrfs``
- informative (device, fs, space)
- query/set a subset of features on a mounted fs
- query/set a subset of features on a mounted filesystem
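Where a command exists, the functionality can be tried from the shell. A brief
sketch, assuming a filesystem mounted at */mnt* (the inode number and logical
address below are made-up values):
.. code-block:: bash
# resolve inode number 257 to its path(s)
$ btrfs inspect-internal inode-resolve 257 /mnt
# resolve a logical (virtual) address to the files referencing it
$ btrfs inspect-internal logical-resolve 5349376 /mnt
# informative overviews of devices and space
$ btrfs filesystem show /mnt
$ btrfs filesystem usage /mnt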

View file

@ -18,5 +18,5 @@ happens inside the page cache, that is the central point caching the file data
and takes care of synchronization. Once a filesystem sync or flush is started
(either manually or automatically) all the dirty data get written to the
devices. This however reduces the chances to find optimal layout as the writes
happen together with other data and the result depens on the remaining free
happen together with other data and the result depends on the remaining free
space layout and fragmentation.

View file

@ -14,7 +14,7 @@ also copied, though there are no ready-made tools for that.
cp --reflink=always source target
There are some constaints:
There are some constraints:
- cross-filesystem reflink is not possible, there's nothing in common between
the filesystems so the block sharing can't work

View file

@ -3,8 +3,8 @@ Resize
A BTRFS mounted filesystem can be resized after creation, grown or shrunk. On a
multi device filesystem the space occupied on each device can be resized
independently. Data tha reside in the are that would be out of the new size are
relocated to the remaining space below the limit, so this constrains the
independently. Data that reside in the area that would be out of the new size
are relocated to the remaining space below the limit, so this constrains the
minimum size to which a filesystem can be shrunk.
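For illustration, a hedged sketch of both directions (the mount point, device
id and sizes are arbitrary examples):
.. code-block:: bash
# shrink the filesystem on a single-device setup by 10GiB
$ btrfs filesystem resize -10G /mnt
# grow the device with id 2 to its maximum size
$ btrfs filesystem resize 2:max /mnt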
Growing a filesystem is quick as it only needs to take note of the available

View file

@ -1,4 +1,4 @@
Subvolumes
==========
...
.. include:: ch-subvolume-intro.rst

View file

@ -1,6 +1,53 @@
Tree checker
============
Metadata blocks that have just been read from devices or are just about to be
written are verified and sanity checked by the so-called **tree checker**. The
b-tree nodes contain several items describing the filesystem structure and to
some degree can be verified for consistency or validity. This is an additional
check to the checksums that only verify the overall block status while the tree
checker tries to validate and cross reference the logical structure. This takes
a slight performance hit but is comparable to calculating the checksum and has
no noticeable impact while it does catch all sorts of errors.
There are two occasions when the checks are done:
Pre-write checks
----------------
When metadata blocks are in memory about to be written to the permanent storage,
the checks are performed, before the checksums are calculated. This can catch
random corruptions of the blocks (or pages) either caused by bugs or by other
parts of the system or hardware errors (namely faulty RAM).
Once a block does not pass the checks, the filesystem refuses to write more data
and turns itself read-only to prevent further damage. At this point some of the
recent metadata updates are held *only* in memory so it's best not to panic,
try to remember what files could be affected and copy them elsewhere. Once
the filesystem gets unmounted, the most recent changes are unfortunately lost.
The filesystem that is stored on the device is still consistent and should mount
fine.
Post-read checks
----------------
Metadata blocks get verified right after they're read from devices and the
checksum is found to be valid. This protects against changes to the metadata
that could possibly also update the checksum; this is less likely to happen
accidentally and rather points to intentional corruption or fuzzing.
The checks
----------
As implemented right now, the metadata consistency is limited to one b-tree node
and what items are stored there, ie. there's no extensive or broad check done
eg. against other data structures in other b-tree nodes. This still provides
enough opportunities to verify consistency of individual items, besides verifying
general validity of the items like the length or offset. The b-tree items are
also coupled with a key so proper key ordering is also part of the check and can
reveal random bitflips in the sequence (this has been the most successful
detector of faulty RAM).
The capabilities of the tree checker have been improved over time and it's
possible that a filesystem created on an older kernel may trigger warnings or
fail some checks on a new one.

View file

@ -1,4 +1,41 @@
Trim
====
Trim/discard
============
...
Trim or discard is an operation on a storage device based on flash technology
(SSD, NVMe or similar) or a thin-provisioned device, and can also be emulated on
top of other block device types. On real hardware, the memory cells have a
limited lifetime span and the drive firmware usually tries to optimize for
that. The trim operation issued by the user provides hints about what data are
unused and allows the device to reclaim the memory cells. On thin-provisioned or
emulated devices this could simply free the space.
There are three main uses of trim that BTRFS supports:
synchronous
enabled by mounting the filesystem with ``-o discard`` or ``-o
discard=sync``, the trim is done right after the file extents get freed;
this however could have a severe performance hit and is not recommended
as the ranges to be trimmed could be too fragmented
asynchronous
enabled by mounting the filesystem with ``-o discard=async``, which is an
improved version of the synchronous trim where the freed file extents
are first tracked in memory and, after a period or once enough ranges
accumulate, the trim is started, expecting the ranges to be much larger and
allowing the number of IO requests to be throttled so they do not interfere
with the rest of the filesystem activity
manually by fstrim
the tool ``fstrim`` starts a trim operation on the whole filesystem, no
mount options need to be specified; it's up to the filesystem to
traverse the free space and start the trim, which makes it suitable for
running as a periodic service
The trim is considered only a hint to the device; it could ignore it completely,
start it only on ranges meeting some criteria, or decide not to do it because of
other factors affecting the memory cells. The device itself could internally
relocate the data, however this leads to an unexpected performance drop. Running
trim periodically could prevent that too.
When a filesystem is created by ``mkfs.btrfs`` and the devices are capable of
trim, it is by default performed on all of them.
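The three modes can be illustrated as follows (the device and mount point are
examples only):
.. code-block:: bash
# synchronous discard, trim right after the extents are freed
$ mount -o discard=sync /dev/sdx /mnt
# asynchronous discard, freed ranges are batched in the background
$ mount -o discard=async /dev/sdx /mnt
# manual trim of the whole mounted filesystem
$ fstrim -v /mnt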

View file

@ -1,4 +1,4 @@
Volume management
=================
...
.. include:: ch-volume-management-intro.rst

View file

@ -1,4 +1,4 @@
Zoned mode
==========
...
.. include:: ch-zoned-intro.rst

View file

@ -9,68 +9,7 @@ SYNOPSIS
DESCRIPTION
-----------
The primary purpose of the balance feature is to spread block groups across
all devices so they match constraints defined by the respective profiles. See
``mkfs.btrfs(8)`` section *PROFILES* for more details.
The scope of the balancing process can be further tuned by use of filters that
can select the block groups to process. Balance works only on a mounted
filesystem. Extent sharing is preserved and reflinks are not broken.
Files are not defragmented nor recompressed, file extents are preserved
but the physical location on devices will change.
The balance operation is cancellable by the user. The on-disk state of the
filesystem is always consistent so an unexpected interruption (eg. system crash,
reboot) does not corrupt the filesystem. The progress of the balance operation
is temporarily stored as an internal state and will be resumed upon mount,
unless the mount option *skip_balance* is specified.
.. warning::
Running balance without filters will take a lot of time as it basically move
data/metadata from the whol filesystem and needs to update all block
pointers.
The filters can be used to perform following actions:
- convert block group profiles (filter *convert*)
- make block group usage more compact (filter *usage*)
- perform actions only on a given device (filters *devid*, *drange*)
The filters can be applied to a combination of block group types (data,
metadata, system). Note that changing only the *system* type needs the force
option. Otherwise *system* gets automatically converted whenever *metadata*
profile is converted.
When metadata redundancy is reduced (eg. from RAID1 to single) the force option
is also required and it is noted in system log.
.. note::
The balance operation needs enough work space, ie. space that is completely
unused in the filesystem, otherwise this may lead to ENOSPC reports. See
the section *ENOSPC* for more details.
COMPATIBILITY
-------------
.. note::
The balance subcommand also exists under the **btrfs filesystem** namespace.
This still works for backward compatibility but is deprecated and should not
be used any more.
.. note::
A short syntax **btrfs balance <path>** works due to backward compatibility
but is deprecated and should not be used any more. Use **btrfs balance start**
command instead.
PERFORMANCE IMPLICATIONS
------------------------
Balancing operations are very IO intensive and can also be quite CPU intensive,
impacting other ongoing filesystem operations. Typically large amounts of data
are copied from one location to another, with corresponding metadata updates.
Depending upon the block group layout, it can also be seek heavy. Performance
on rotational devices is noticeably worse compared to SSDs or fast arrays.
.. include:: ch-balance-intro.rst
SUBCOMMAND
----------
@ -148,89 +87,7 @@ status [-v] <path>
FILTERS
-------
From kernel 3.3 onwards, btrfs balance can limit its action to a subset of the
whole filesystem, and can be used to change the replication configuration (e.g.
moving data from single to RAID1). This functionality is accessed through the
*-d*, *-m* or *-s* options to btrfs balance start, which filter on data,
metadata and system blocks respectively.
A filter has the following structure: *type[=params][,type=...]*
The available types are:
profiles=<profiles>
Balances only block groups with the given profiles. Parameters
are a list of profile names separated by "*|*" (pipe).
usage=<percent>, usage=<range>
Balances only block groups with usage under the given percentage. The
value of 0 is allowed and will clean up completely unused block groups, this
should not require any new work space allocated. You may want to use *usage=0*
in case balance is returning ENOSPC and your filesystem is not too full.
The argument may be a single value or a range. The single value *N* means *at
most N percent used*, equivalent to *..N* range syntax. Kernels prior to 4.4
accept only the single value format.
The minimum range boundary is inclusive, maximum is exclusive.
devid=<id>
Balances only block groups which have at least one chunk on the given
device. To list devices with ids use **btrfs filesystem show**.
drange=<range>
Balance only block groups which overlap with the given byte range on any
device. Use in conjunction with *devid* to filter on a specific device. The
parameter is a range specified as *start..end*.
vrange=<range>
Balance only block groups which overlap with the given byte range in the
filesystem's internal virtual address space. This is the address space that
most reports from btrfs in the kernel log use. The parameter is a range
specified as *start..end*.
convert=<profile>
Convert each selected block group to the given profile name identified by
parameters.
.. note::
Starting with kernel 4.5, the *data* chunks can be converted to/from the
*DUP* profile on a single device.
.. note::
Starting with kernel 4.6, all profiles can be converted to/from *DUP* on
multi-device filesystems.
limit=<number>, limit=<range>
Process only given number of chunks, after all filters are applied. This can be
used to specifically target a chunk in connection with other filters (*drange*,
*vrange*) or just simply limit the amount of work done by a single balance run.
The argument may be a single value or a range. The single value *N* means *at
most N chunks*, equivalent to *..N* range syntax. Kernels prior to 4.4 accept
only the single value format. The range minimum and maximum are inclusive.
stripes=<range>
Balance only block groups which have the given number of stripes. The parameter
is a range specified as *start..end*. Makes sense for block group profiles that
utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are
inclusive.
soft
Takes no parameters. Only has meaning when converting between profiles.
When doing convert from one profile to another and soft mode is on,
chunks that already have the target profile are left untouched.
This is useful e.g. when half of the filesystem was converted earlier but got
cancelled.
The soft mode switch is (like every other filter) per-type.
For example, this means that we can convert metadata chunks the "hard" way
while converting data chunks selectively with soft switch.
Profile names, used in *profiles* and *convert* are one of: *raid0*, *raid1*,
*raid1c3*, *raid1c4*, *raid10*, *raid5*, *raid6*, *dup*, *single*. The mixed
data/metadata profiles can be converted in the same way, but it's conversion
between mixed and non-mixed is not implemented. For the constraints of the
profiles please refer to ``mkfs.btrfs(8)``, section *PROFILES*.
.. include:: ch-balance-filters.rst
ENOSPC
------

View file

@ -14,36 +14,7 @@ The **btrfs device** command group is used to manage devices of the btrfs filesy
DEVICE MANAGEMENT
-----------------
Btrfs filesystem can be created on top of single or multiple block devices.
Data and metadata are organized in allocation profiles with various redundancy
policies. There's some similarity with traditional RAID levels, but this could
be confusing to users familiar with the traditional meaning. Due to the
similarity, the RAID terminology is widely used in the documentation. See
``mkfs.btrfs(8)`` for more details and the exact profile capabilities and
constraints.
The device management works on a mounted filesystem. Devices can be added,
removed or replaced, by commands provided by **btrfs device** and **btrfs replace**.
The profiles can be also changed, provided there's enough workspace to do the
conversion, using the **btrfs balance** command and namely the filter *convert*.
Type
The block group profile type is the main distinction of the information stored
on the block device. User data are called *Data*, the internal data structures
managed by filesystem are *Metadata* and *System*.
Profile
A profile describes an allocation policy based on the redundancy/replication
constraints in connection with the number of devices. The profile applies to
data and metadata block groups separately. Eg. *single*, *RAID1*.
RAID level
Where applicable, the level refers to a profile that matches constraints of the
standard RAID levels. At the moment the supported ones are: RAID0, RAID1,
RAID10, RAID5 and RAID6.
See the section *TYPICAL USECASES* for some examples.
.. include ch-volume-management-intro.rst
SUBCOMMAND
----------
@ -76,7 +47,7 @@ remove [options] <device>|<devid> [<device>|<devid>...] <path>
Device removal must satisfy the profile constraints, otherwise the command
fails. The filesystem must be converted to profile(s) that would allow the
removal. This can typically happen when going down from 2 devices to 1 and
using the RAID1 profile. See the *TYPICAL USECASES* section below.
using the RAID1 profile. See the section *TYPICAL USECASES*.
The operation can take long as it needs to move all data from the device.
@ -217,94 +188,6 @@ usage [options] <path> [<path>...]::
If conflicting options are passed, the last one takes precedence.
TYPICAL USECASES
----------------
STARTING WITH A SINGLE-DEVICE FILESYSTEM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume we've created a filesystem on a block device */dev/sda* with profile
*single/single* (data/metadata), the device size is 50GiB and we've used the
whole device for the filesystem. The mount point is */mnt*.
The amount of data stored is 16GiB, metadata have allocated 2GiB.
ADD NEW DEVICE
""""""""""""""
We want to increase the total size of the filesystem and keep the profiles. The
size of the new device */dev/sdb* is 100GiB.
.. code-block:: bash
$ btrfs device add /dev/sdb /mnt
The amount of free data space increases by less than 100GiB, some space is
allocated for metadata.
CONVERT TO RAID1
""""""""""""""""
Now we want to increase the redundancy level of both data and metadata, but
we'll do that in steps. Note, that the device sizes are not equal and we'll use
that to show the capabilities of split data/metadata and independent profiles.
The constraint for RAID1 gives us at most 50GiB of usable space and exactly 2
copies will be stored on the devices.
First we'll convert the metadata. As the metadata occupy less than 50GiB and
there's enough workspace for the conversion process, we can do:
.. code-block:: bash
$ btrfs balance start -mconvert=raid1 /mnt
This operation can take a while, because all metadata have to be moved and all
block pointers updated. Depending on the physical locations of the old and new
blocks, the disk seeking is the key factor affecting performance.
You'll note that the system block group has been also converted to RAID1, this
normally happens as the system block group also holds metadata (the physical to
logical mappings).
What changed:
* available data space decreased by 3GiB, usable roughly (50 - 3) + (100 - 3) = 144 GiB
* metadata redundancy increased
IOW, the unequal device sizes allow for combined space for data yet improved
redundancy for metadata. If we decide to increase redundancy of data as well,
we're going to lose 50GiB of the second device for obvious reasons.
.. code-block:: bash
$ btrfs balance start -dconvert=raid1 /mnt
The balance process needs some workspace (ie. a free device space without any
data or metadata block groups) so the command could fail if there's too much
data or the block groups occupy the whole first device.
The device size of */dev/sdb* as seen by the filesystem remains unchanged, but
the logical space from 50-100GiB will be unused.
REMOVE DEVICE
"""""""""""""
Device removal must satisfy the profile constraints, otherwise the command
fails. For example:
.. code-block:: bash
$ btrfs device remove /dev/sda /mnt
ERROR: error removing device '/dev/sda': unable to go below two devices on raid1
In order to remove a device, you need to convert the profile in this case:
.. code-block:: bash
$ btrfs balance start -mconvert=dup -dconvert=single /mnt
$ btrfs device remove /dev/sda /mnt
DEVICE STATS
------------

View file

@ -739,7 +739,6 @@ CHECKSUM ALGORITHMS
.. include:: ch-checksumming.rst
COMPRESSION
-----------
@ -915,71 +914,7 @@ d
ZONED MODE
----------
Since version 5.12 btrfs supports so called *zoned mode*. This is a special
on-disk format and allocation/write strategy that's friendly to zoned devices.
In short, a device is partitioned into fixed-size zones and each zone can be
updated by append-only manner, or reset. As btrfs has no fixed data structures,
except the super blocks, the zoned mode only requires block placement that
follows the device constraints. You can learn about the whole architecture at
https://zonedstorage.io .
The devices are also called SMR/ZBC/ZNS, in *host-managed* mode. Note that
there are devices that appear as non-zoned but actually are, this is
*drive-managed* and using zoned mode won't help.
The zone size depends on the device, typical sizes are 256MiB or 1GiB. In
general it must be a power of two. Emulated zoned devices like *null_blk* allow
to set various zone sizes.
REQUIREMENTS, LIMITATIONS
^^^^^^^^^^^^^^^^^^^^^^^^^
* all devices must have the same zone size
* maximum zone size is 8GiB
* mixing zoned and non-zoned devices is possible, the zone writes are emulated,
but this is namely for testing
* the super block is handled in a special way and is at different locations
than on a non-zoned filesystem:
* primary: 0B (and the next two zones)
* secondary: 512G (and the next two zones)
* tertiary: 4TiB (4096GiB, and the next two zones)
INCOMPATIBLE FEATURES
^^^^^^^^^^^^^^^^^^^^^
The main constraint of the zoned devices is lack of in-place update of the data.
This is inherently incompatbile with some features:
* nodatacow - overwrite in-place, cannot create such files
* fallocate - preallocating space for in-place first write
* mixed-bg - unordered writes to data and metadata, fixing that means using
separate data and metadata block groups
* booting - the zone at offset 0 contains superblock, resetting the zone would
destroy the bootloader data
Initial support lacks some features but they're planned:
* only single profile is supported
* fstrim - due to dependency on free space cache v1
SUPER BLOCK
~~~~~~~~~~~
As said above, super block is handled in a special way. In order to be crash
safe, at least one zone in a known location must contain a valid superblock.
This is implemented as a ring buffer in two consecutive zones, starting from
known offsets 0, 512G and 4TiB. The values are different than on non-zoned
devices. Each new super block is appended to the end of the zone, once it's
filled, the zone is reset and writes continue to the next one. Looking up the
latest super block needs to read offsets of both zones and determine the last
written version.
The amount of space reserved for super block depends on the zone size. The
secondary and tertiary copies are at distant offsets as the capacity of the
devices is expected to be large, tens of terabytes. Maximum zone size supported
is 8GiB, which would mean that eg. offset 0-16GiB would be reserved just for
the super block on a hypothetical device of that zone size. This is wasteful
but required to guarantee crash safety.
.. include:: ch-zoned-intro.rst
CONTROL DEVICE

View file

@ -12,6 +12,8 @@ DESCRIPTION
**btrfs subvolume** is used to create/delete/list/show btrfs subvolumes and
snapshots.
.. include:: ch-subvolume-intro.rst
SUBVOLUME AND SNAPSHOT
----------------------
@ -241,36 +243,6 @@ sync <path> [subvolid...]
-s <N>
sleep N seconds between checks (default: 1)
SUBVOLUME FLAGS
---------------
The subvolume flag currently implemented is the *ro* property. Read-write
subvolumes have that set to *false*, snapshots as *true*. In addition to that,
a plain snapshot will also have last change generation and creation generation
equal.
Read-only snapshots are building blocks fo incremental send (see
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where the
relative changes are generated from. Thus, changing the subvolume flags from
read-only to read-write will break the assumptions and may lead to unexpected changes
in the resulting incremental stream.
A snapshot that was created by send/receive will be read-only, with different
last change generation, read-only and with set *received_uuid* which identifies
the subvolume on the filesystem that produced the stream. The usecase relies
on matching data on both sides. Changing the subvolume to read-write after it
has been received requires to reset the *received_uuid*. As this is a notable
change and could potentially break the incremental send use case, performing
it by **btrfs property set** requires force if that is really desired by user.
.. note::
The safety checks have been implemented in 5.14.2, any subvolumes previously
received (with a valid *received_uuid*) and read-write status may exist and
could still lead to problems with send/receive. You can use **btrfs subvolume
show** to identify them. Flipping the flags to read-only and back to
read-write will reset the *received_uuid* manually. There may exist a
convenience tool in the future.
EXAMPLES
--------

View file

@ -0,0 +1,83 @@
From kernel 3.3 onwards, btrfs balance can limit its action to a subset of the
whole filesystem, and can be used to change the replication configuration (e.g.
moving data from single to RAID1). This functionality is accessed through the
*-d*, *-m* or *-s* options to btrfs balance start, which filter on data,
metadata and system blocks respectively.
A filter has the following structure: *type[=params][,type=...]*
The available types are:
profiles=<profiles>
Balances only block groups with the given profiles. Parameters
are a list of profile names separated by "*|*" (pipe).
usage=<percent>, usage=<range>
Balances only block groups with usage under the given percentage. The
value of 0 is allowed and will clean up completely unused block groups, this
should not require any new work space allocated. You may want to use *usage=0*
in case balance is returning ENOSPC and your filesystem is not too full.
The argument may be a single value or a range. The single value *N* means *at
most N percent used*, equivalent to *..N* range syntax. Kernels prior to 4.4
accept only the single value format.
The minimum range boundary is inclusive, maximum is exclusive.
devid=<id>
Balances only block groups which have at least one chunk on the given
device. To list devices with ids use **btrfs filesystem show**.
drange=<range>
Balance only block groups which overlap with the given byte range on any
device. Use in conjunction with *devid* to filter on a specific device. The
parameter is a range specified as *start..end*.
vrange=<range>
Balance only block groups which overlap with the given byte range in the
filesystem's internal virtual address space. This is the address space that
most reports from btrfs in the kernel log use. The parameter is a range
specified as *start..end*.
convert=<profile>
Convert each selected block group to the given profile name identified by
parameters.
.. note::
Starting with kernel 4.5, the *data* chunks can be converted to/from the
*DUP* profile on a single device.
.. note::
Starting with kernel 4.6, all profiles can be converted to/from *DUP* on
multi-device filesystems.
limit=<number>, limit=<range>
Process only the given number of chunks, after all filters are applied. This can be
used to specifically target a chunk in connection with other filters (*drange*,
*vrange*) or just simply limit the amount of work done by a single balance run.
The argument may be a single value or a range. The single value *N* means *at
most N chunks*, equivalent to *..N* range syntax. Kernels prior to 4.4 accept
only the single value format. The range minimum and maximum are inclusive.
stripes=<range>
Balance only block groups which have the given number of stripes. The parameter
is a range specified as *start..end*. Makes sense for block group profiles that
utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are
inclusive.
soft
Takes no parameters. Only has meaning when converting between profiles.
When doing convert from one profile to another and soft mode is on,
chunks that already have the target profile are left untouched.
This is useful e.g. when half of the filesystem was converted earlier but got
cancelled.
The soft mode switch is (like every other filter) per-type.
For example, this means that we can convert metadata chunks the "hard" way
while converting data chunks selectively with soft switch.
Profile names, used in *profiles* and *convert*, are one of: *raid0*, *raid1*,
*raid1c3*, *raid1c4*, *raid10*, *raid5*, *raid6*, *dup*, *single*. The mixed
data/metadata profiles can be converted in the same way, but conversion
between mixed and non-mixed profiles is not implemented. For the constraints of
the profiles please refer to ``mkfs.btrfs(8)``, section *PROFILES*.
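A few hedged examples of how the filters can be combined (the mount point and
values are arbitrary):
.. code-block:: bash
# compact data block groups that are less than 50% used
$ btrfs balance start -dusage=50 /mnt
# clean up completely unused block groups, useful when hitting ENOSPC
$ btrfs balance start -dusage=0 /mnt
# convert data to RAID1, skipping chunks that already have that profile
$ btrfs balance start -dconvert=raid1,soft /mnt
# process at most 10 metadata chunks that have a stripe on device 1
$ btrfs balance start -mdevid=1,limit=10 /mnt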

View file

@ -0,0 +1,62 @@
The primary purpose of the balance feature is to spread block groups across
all devices so they match constraints defined by the respective profiles. See
``mkfs.btrfs(8)`` section *PROFILES* for more details.
The scope of the balancing process can be further tuned by use of filters that
can select the block groups to process. Balance works only on a mounted
filesystem. Extent sharing is preserved and reflinks are not broken.
Files are not defragmented nor recompressed, file extents are preserved
but the physical location on devices will change.
The balance operation is cancellable by the user. The on-disk state of the
filesystem is always consistent so an unexpected interruption (eg. system crash,
reboot) does not corrupt the filesystem. The progress of the balance operation
is temporarily stored as an internal state and will be resumed upon mount,
unless the mount option *skip_balance* is specified.
.. warning::
Running balance without filters will take a lot of time as it basically moves
data/metadata from the whole filesystem and needs to update all block
pointers.
The filters can be used to perform the following actions:
- convert block group profiles (filter *convert*)
- make block group usage more compact (filter *usage*)
- perform actions only on a given device (filters *devid*, *drange*)
The filters can be applied to a combination of block group types (data,
metadata, system). Note that changing only the *system* type needs the force
option. Otherwise *system* gets automatically converted whenever *metadata*
profile is converted.
When metadata redundancy is reduced (eg. from RAID1 to single) the force option
is also required and it is noted in the system log.
.. note::
The balance operation needs enough work space, ie. space that is completely
unused in the filesystem, otherwise this may lead to ENOSPC reports. See
the section *ENOSPC* for more details.
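A minimal sketch of starting, watching and controlling a balance run (the
mount point is an example):
.. code-block:: bash
# start a filtered balance in the background
$ btrfs balance start --bg -dusage=20 /mnt
# check the progress
$ btrfs balance status /mnt
# pause, resume or cancel it
$ btrfs balance pause /mnt
$ btrfs balance resume /mnt
$ btrfs balance cancel /mnt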
Compatibility
-------------
.. note::
The balance subcommand also exists under the **btrfs filesystem** namespace.
This still works for backward compatibility but is deprecated and should not
be used any more.
.. note::
A short syntax **btrfs balance <path>** works due to backward compatibility
but is deprecated and should not be used any more. Use **btrfs balance start**
command instead.
Performance implications
------------------------
Balancing operations are very IO intensive and can also be quite CPU intensive,
impacting other ongoing filesystem operations. Typically large amounts of data
are copied from one location to another, with corresponding metadata updates.
Depending upon the block group layout, it can also be seek heavy. Performance
on rotational devices is noticeably worse compared to SSDs or fast arrays.

View file

@ -10,7 +10,7 @@ CRC32C (32bit digest)
instruction-level support, not collision-resistant but still good error
detection capabilities
XXHASH* (64bit digest)
XXHASH (64bit digest)
can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing
instruction pipelining, good collision resistance and error detection
@ -33,7 +33,6 @@ additional overhead of the b-tree leaves.
Approximate relative performance of the algorithms, measured against CRC32C
using reference software implementations on a 3.5GHz intel CPU:
======== ============ ======= ================
Digest Cycles/4KiB Ratio Implementation
======== ============ ======= ================
@ -73,4 +72,3 @@ while accelerated implementation is e.g.
priority : 170
...

View file

@ -56,7 +56,7 @@ cause performance drops.
The command above will start defragmentation of the whole *file* and apply
the compression, regardless of the mount option. (Note: specifying level is not
yet implemented). The compression algorithm is not persisent and applies only
yet implemented). The compression algorithm is not persistent and applies only
to the defragmentation command, for any other writes other compression settings
apply.
@ -114,9 +114,9 @@ There are two ways to detect incompressible data:
* actual compression attempt - data are compressed, if the result is not smaller,
it's discarded, so this depends on the algorithm and level
* pre-compression heuristics - a quick statistical evaluation on the data is
peformed and based on the result either compression is performed or skipped,
performed and based on the result either compression is performed or skipped,
the NOCOMPRESS bit is not set just by the heuristic, only if the compression
algorithm does not make an improvent
algorithm does not make an improvement
.. code-block:: shell
@ -137,7 +137,7 @@ incompressible data too but this leads to more overhead as the compression is
done in another thread and has to write the data anyway. The heuristic is
read-only and can utilize cached memory.
The tests performed based on the following: data sampling, long repated
The tests performed based on the following: data sampling, long repeated
pattern detection, byte frequency, Shannon entropy.
Compatibility

View file

@ -36,7 +36,7 @@ machines).
**BEFORE YOU START**
The source filesystem must be clean, eg. no journal to replay or no repairs
needed. The respective **fsck** utility must be run on the source filesytem prior
needed. The respective **fsck** utility must be run on the source filesystem prior
to conversion. Please refer to the manual pages in case you encounter problems.
For ext2/3/4:

View file

@ -42,7 +42,7 @@ exclusive
is the amount of data where all references to this data can be reached
from within this qgroup.
SUBVOLUME QUOTA GROUPS
Subvolume quota groups
^^^^^^^^^^^^^^^^^^^^^^
The basic notion of the Subvolume Quota feature is the quota group, short
@ -75,7 +75,7 @@ of qgroups. Figure 1 shows an example qgroup tree.
| / \ / \
extents 1 2 3 4
Figure1: Sample qgroup hierarchy
Figure 1: Sample qgroup hierarchy
At the bottom, some extents are depicted showing which qgroups reference which
extents. It is important to understand the notion of *referenced* vs
@ -101,7 +101,7 @@ allocation information are not accounted.
In turn, the referenced count of a qgroup can be limited. All writes beyond
this limit will lead to a 'Quota Exceeded' error.
INHERITANCE
Inheritance
^^^^^^^^^^^
Things get a bit more complicated when new subvolumes or snapshots are created.
@ -133,13 +133,13 @@ exclusive count from the second qgroup needs to be copied to the first qgroup,
as it represents the correct value. The second qgroup is called a tracking
qgroup. It is only there in case a snapshot is needed.
USE CASES
Use cases
^^^^^^^^^
Below are some usecases that do not mean to be extensive. You can find your
Below are some use cases that are not meant to be extensive. You can find your
own way to integrate qgroups.
SINGLE-USER MACHINE
Single-user machine
"""""""""""""""""""
``Replacement for partitions``
@ -156,7 +156,7 @@ the correct values. 'Referenced' will show how much is in it, possibly shared
with other subvolumes. 'Exclusive' will be the amount of space that gets freed
when the subvolume is deleted.
MULTI-USER MACHINE
Multi-user machine
""""""""""""""""""
``Restricting homes``
@ -194,5 +194,3 @@ but some snapshots for backup purposes are being created by the system. The
user's snapshots should be accounted to the user, not the system. The solution
is similar to the one from section 'Accounting snapshots to the user', but do
not assign system snapshots to user's qgroup.

View file

@ -19,7 +19,7 @@ UUID on each mount.
Once the seeding device is mounted, it needs the writable device. After adding
it, something like **remount -o remount,rw /path** makes the filesystem at
*/path* ready for use. The simplest usecase is to throw away all changes by
*/path* ready for use. The simplest use case is to throw away all changes by
unmounting the filesystem when convenient.
Alternatively, deleting the seeding device from the filesystem can turn it into
@ -29,7 +29,7 @@ data from the seeding device.
The seeding device flag can be cleared again by **btrfstune -f -s 0**, eg.
allowing to update with newer data but please note that this will invalidate
all existing filesystems that use this particular seeding device. This works
for some usecases, not for others, and a forcing flag to the command is
for some use cases, not for others, and a forcing flag to the command is
mandatory to avoid accidental mistakes.
Example how to create and use one seeding device:
@ -71,8 +71,6 @@ A few things to note:
* it's recommended to use only single device for the seeding device, it works
for multiple devices but the *single* profile must be used in order to make
the seeding device deletion work
* block group profiles *single* and *dup* support the usecases above
* block group profiles *single* and *dup* support the use cases above
* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
* each new mount of the seeding device gets a new random UUID

View file

@ -0,0 +1,58 @@
A BTRFS subvolume is a part of the filesystem with its own independent
file/directory hierarchy. Subvolumes can share file extents. A snapshot is also
a subvolume, but with a given initial content of the original subvolume.
.. note::
A subvolume in BTRFS is not like an LVM logical volume, which is a block-level
snapshot, while BTRFS subvolumes are file extent-based.
A subvolume looks like a normal directory, with some additional operations
described below. Subvolumes can be renamed or moved, nesting subvolumes is not
restricted but has some implications regarding snapshotting.
A subvolume in BTRFS can be accessed in two ways:
* like any other directory that is accessible to the user
* like a separately mounted filesystem (options *subvol* or *subvolid*)
In the latter case the parent directory is not visible or accessible. This is
similar to a bind mount, and in fact the subvolume mount does exactly that.
A freshly created filesystem is also a subvolume, called *top-level*, which
internally has id 5. This subvolume cannot be removed or replaced by another
subvolume. This is also the subvolume that will be mounted by default, unless
the default subvolume has been changed (see ``btrfs subvolume set-default``).
A snapshot is a subvolume like any other, with given initial content. By
default, snapshots are created read-write. File modifications in a snapshot
do not affect the files in the original subvolume.
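A short sketch of the basic operations (the paths and device are examples):
.. code-block:: bash
# create a subvolume and a writable snapshot of it
$ btrfs subvolume create /mnt/data
$ btrfs subvolume snapshot /mnt/data /mnt/data-snap
# create a read-only snapshot
$ btrfs subvolume snapshot -r /mnt/data /mnt/data-backup
# mount just one subvolume, without the rest of the hierarchy
$ mount -o subvol=data /dev/sdx /srv/data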
Subvolume flags
---------------
The subvolume flag currently implemented is the *ro* property. Read-write
subvolumes have it set to *false*, snapshots to *true*. In addition to that,
a plain snapshot will also have equal last change generation and creation
generation.
Read-only snapshots are building blocks of incremental send (see
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where
the relative changes are generated from. Thus, changing the subvolume flags
from read-only to read-write will break the assumptions and may lead to
unexpected changes in the resulting incremental stream.
A snapshot that was created by send/receive will be read-only, with a different
last change generation and with *received_uuid* set, which identifies
the subvolume on the filesystem that produced the stream. The use case relies
on matching data on both sides. Changing the subvolume to read-write after it
has been received requires resetting the *received_uuid*. As this is a notable
change and could potentially break the incremental send use case, performing
it by **btrfs property set** requires force if that is really desired by the user.
.. note::
The safety checks have been implemented in 5.14.2; any subvolumes previously
received (with a valid *received_uuid*) and with read-write status may exist and
could still lead to problems with send/receive. You can use **btrfs subvolume
show** to identify them. Flipping the flags to read-only and back to
read-write will reset the *received_uuid* manually. There may exist a
convenience tool in the future.
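The *ro* flag can be inspected and changed through properties; a hedged
example (the snapshot path is made up):
.. code-block:: bash
# read the current flag
$ btrfs property get -ts /mnt/data-snap ro
# make the snapshot read-only, then force it back to read-write
$ btrfs property set -ts /mnt/data-snap ro true
$ btrfs property set -f -ts /mnt/data-snap ro false
# check whether a received UUID is set
$ btrfs subvolume show /mnt/data-snap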

View file

@ -0,0 +1,116 @@
A BTRFS filesystem can be created on top of a single or multiple block devices.
Devices can then be added, removed or replaced on demand. Data and metadata are
organized in allocation profiles with various redundancy policies. There's some
similarity with traditional RAID levels, but this could be confusing to users
familiar with the traditional meaning. Due to the similarity, the RAID
terminology is widely used in the documentation. See ``mkfs.btrfs(8)`` for more
details and the exact profile capabilities and constraints.
The device management works on a mounted filesystem. Devices can be added,
removed or replaced, by commands provided by ``btrfs device`` and ``btrfs replace``.
The profiles can also be changed, provided there's enough workspace to do the
conversion, using the ``btrfs balance`` command, namely the filter *convert*.
Type
The block group profile type is the main distinction of the information stored
on the block device. User data are called *Data*, the internal data structures
managed by the filesystem are *Metadata* and *System*.
Profile
A profile describes an allocation policy based on the redundancy/replication
constraints in connection with the number of devices. The profile applies to
data and metadata block groups separately. Eg. *single*, *RAID1*.
RAID level
Where applicable, the level refers to a profile that matches constraints of the
standard RAID levels. At the moment the supported ones are: RAID0, RAID1,
RAID10, RAID5 and RAID6.
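The types and profiles in use can be checked on a mounted filesystem, e.g.:
.. code-block:: bash
# overall view of types, profiles and space usage
$ btrfs filesystem usage /mnt
# space usage split by block group type and profile
$ btrfs filesystem df /mnt
# space usage per device
$ btrfs device usage /mnt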
Typical use cases
-----------------
Starting with a single-device filesystem
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume we've created a filesystem on a block device */dev/sda* with profile
*single/single* (data/metadata), the device size is 50GiB and we've used the
whole device for the filesystem. The mount point is */mnt*.
The amount of data stored is 16GiB, metadata have allocated 2GiB.
Add new device
""""""""""""""
We want to increase the total size of the filesystem and keep the profiles. The
size of the new device */dev/sdb* is 100GiB.
.. code-block:: bash
$ btrfs device add /dev/sdb /mnt
The amount of free data space increases by less than 100GiB, some space is
allocated for metadata.
Convert to RAID1
""""""""""""""""
Now we want to increase the redundancy level of both data and metadata, but
we'll do that in steps. Note that the device sizes are not equal and we'll use
that to show the capabilities of split data/metadata and independent profiles.
The constraint for RAID1 gives us at most 50GiB of usable space and exactly 2
copies will be stored on the devices.
First we'll convert the metadata. As the metadata occupy less than 50GiB and
there's enough workspace for the conversion process, we can do:
.. code-block:: bash
$ btrfs balance start -mconvert=raid1 /mnt
This operation can take a while, because all metadata have to be moved and all
block pointers updated. Depending on the physical locations of the old and new
blocks, the disk seeking is the key factor affecting performance.
You'll note that the system block group has also been converted to RAID1; this
normally happens as the system block group also holds metadata (the physical to
logical mappings).
What changed:
* available data space decreased by 3GiB, usable roughly (50 - 3) + (100 - 3) = 144 GiB
* metadata redundancy increased
IOW, the unequal device sizes allow for combined space for data yet improved
redundancy for metadata. If we decide to increase redundancy of data as well,
we're going to lose 50GiB of the second device for obvious reasons.
.. code-block:: bash
$ btrfs balance start -dconvert=raid1 /mnt
The balance process needs some workspace (ie. a free device space without any
data or metadata block groups) so the command could fail if there's too much
data or the block groups occupy the whole first device.
The device size of */dev/sdb* as seen by the filesystem remains unchanged, but
the logical space from 50-100GiB will be unused.
Remove device
"""""""""""""
Device removal must satisfy the profile constraints, otherwise the command
fails. For example:
.. code-block:: bash
$ btrfs device remove /dev/sda /mnt
ERROR: error removing device '/dev/sda': unable to go below two devices on raid1
In order to remove a device, you need to convert the profile in this case:
.. code-block:: bash
$ btrfs balance start -mconvert=dup -dconvert=single /mnt
$ btrfs device remove /dev/sda /mnt

View file

@ -0,0 +1,66 @@
Since version 5.12 btrfs supports the so-called *zoned mode*. This is a special
on-disk format and allocation/write strategy that's friendly to zoned devices.
In short, a device is partitioned into fixed-size zones and each zone can be
updated in an append-only manner, or reset. As btrfs has no fixed data structures,
except the super blocks, the zoned mode only requires block placement that
follows the device constraints. You can learn about the whole architecture at
https://zonedstorage.io .
The devices are also called SMR/ZBC/ZNS, in *host-managed* mode. Note that
there are devices that appear as non-zoned but actually are zoned; this is
*drive-managed* and using zoned mode won't help.
The zone size depends on the device, typical sizes are 256MiB or 1GiB. In
general it must be a power of two. Emulated zoned devices like *null_blk* allow
setting various zone sizes.
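For experiments, a zoned device can be emulated with the *null_blk* module. A
hedged sketch (the module parameters, device name and sizes are examples, and
the emulated device disappears on module unload):
.. code-block:: bash
# create one emulated host-managed zoned device with 256MiB zones
$ modprobe null_blk nr_devices=1 zoned=1 zone_size=256
# inspect the zone layout
$ blkzone report /dev/nullb0
# zoned mode is detected automatically at mkfs time
$ mkfs.btrfs /dev/nullb0
$ mount /dev/nullb0 /mnt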
Requirements, limitations
^^^^^^^^^^^^^^^^^^^^^^^^^
* all devices must have the same zone size
* maximum zone size is 8GiB
* mixing zoned and non-zoned devices is possible, the zone writes are emulated,
but this is mainly for testing
* the super block is handled in a special way and is at different locations
than on a non-zoned filesystem:
* primary: 0B (and the next two zones)
* secondary: 512GiB (and the next two zones)
* tertiary: 4TiB (4096GiB, and the next two zones)
Incompatible features
^^^^^^^^^^^^^^^^^^^^^
The main constraint of the zoned devices is lack of in-place update of the data.
This is inherently incompatible with some features:
* nodatacow - overwrite in-place, cannot create such files
* fallocate - preallocating space for in-place first write
* mixed-bg - unordered writes to data and metadata, fixing that means using
separate data and metadata block groups
* booting - the zone at offset 0 contains superblock, resetting the zone would
destroy the bootloader data
Initial support lacks some features but they're planned:
* only single profile is supported
* fstrim - due to dependency on free space cache v1
Super block
^^^^^^^^^^^
As said above, the super block is handled in a special way. In order to be crash
safe, at least one zone in a known location must contain a valid superblock.
This is implemented as a ring buffer in two consecutive zones, starting from
known offsets 0B, 512GiB and 4TiB.
The values are different from those on non-zoned devices. Each new super block is
appended to the end of the zone; once it's filled, the zone is reset and writes
continue to the next one. Looking up the latest super block needs to read
offsets of both zones and determine the last written version.
The amount of space reserved for super block depends on the zone size. The
secondary and tertiary copies are at distant offsets as the capacity of the
devices is expected to be large, tens of terabytes. Maximum zone size supported
is 8GiB, which would mean that eg. offset 0-16GiB would be reserved just for
the super block on a hypothetical device of that zone size. This is wasteful
but required to guarantee crash safety.

View file

@ -8,7 +8,6 @@ Welcome to BTRFS documentation!
:caption: Overview
Introduction
Quick-start
man-index
.. toctree::
@ -41,6 +40,7 @@ Welcome to BTRFS documentation!
:maxdepth: 1
:caption: TODO
Quick-start
Interoperability
Glossary
Flexibility