btrfs-progs: docs: add more chapters

The feature pages share their contents with manual page section 5, so
move the contents to separate files.

Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba 2021-12-09 20:46:42 +01:00
parent 7f94ccb20a
commit b871bf49f3
7 changed files with 20 additions and 239 deletions

View file

@ -1,4 +1,4 @@
Checksumming
============
...
.. include:: ch-checksumming.rst

View file

@ -16,3 +16,5 @@ Anything that's standard and also supported
- FIEMAP
- O_TMPFILE
- XFLAGS, fileattr

View file

@ -1,4 +1,4 @@
Compression
===========
...
.. include:: ch-compression.rst

View file

@ -1,4 +1,12 @@
Inline files
============
...
Files up to some size can be stored in the metadata section ("inline" in the
b-tree nodes), i.e. with no separate blocks for the extents. The default limit is
2048 bytes and can be configured by the mount option ``max_inline``. The data of
inlined files can also be compressed as long as they fit into the b-tree nodes.
If the filesystem has been created with different data and metadata profiles,
namely with different levels of integrity, this also affects the inlined files.
Inlining can be completely disabled by mounting with ``max_inline=0``. The upper
limit is either the size of the b-tree node or the page size of the host.
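As a minimal sketch (the device and mount point are placeholders), the limit
can be set at mount time, or inlining disabled completely:

.. code-block:: bash

   # mount -o max_inline=512 /dev/sdx /mnt
   # ... or, to disable inlining entirely
   # mount -o max_inline=0 /dev/sdx /mnt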

View file

@ -26,3 +26,6 @@ overlayfs
SELinux
-------
io_uring
--------

View file

@ -1,4 +1,4 @@
Seeding device
==============
...
.. include:: ch-seeding-device.rst

View file

@ -737,169 +737,13 @@ priority, not the btrfs mount options).
CHECKSUM ALGORITHMS
-------------------
There are several checksum algorithms supported. The default and backward
compatible one is *crc32c*. Since kernel 5.5 there are three more with different
characteristics and trade-offs regarding speed and strength. The following
list may help you to decide which one to select.
CRC32C (32bit digest)
default, best backward compatibility, very fast, modern CPUs have
instruction-level support, not collision-resistant but still good error
detection capabilities
XXHASH (64bit digest)
can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing
instruction pipelining, good collision resistance and error detection
SHA256 (256bit digest)
a cryptographic-strength hash, relatively slow but with possible CPU
instruction acceleration or specialized hardware cards, FIPS certified and
in wide use
BLAKE2b (256bit digest)
a cryptographic-strength hash, relatively fast with possible CPU acceleration
using SIMD extensions, not standardized but based on BLAKE which was a SHA3
finalist, in wide use; the algorithm used is BLAKE2b-256, which is optimized for
64bit platforms
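For illustration, a hedged sketch of selecting the algorithm at filesystem
creation time (the device is a placeholder and a mkfs.btrfs that supports the
*--csum* option is assumed; xxhash is only an example choice):

.. code-block:: bash

   # mkfs.btrfs --csum xxhash /dev/sdx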
The *digest size* affects the overall size of data block checksums stored in the
filesystem. The metadata blocks have a fixed area of up to 256 bits (32 bytes), so
there's no increase there. Each data block has a separate checksum stored, with
additional overhead in the b-tree leaves.
Approximate relative performance of the algorithms, measured against CRC32C
using reference software implementations on a 3.5GHz Intel CPU:
======== ============ ======= ================
Digest   Cycles/4KiB  Ratio   Implementation
======== ============ ======= ================
CRC32C   1700         1.00    CPU instruction
XXHASH   2500         1.44    reference impl.
SHA256   105000       61      reference impl.
SHA256   36000        21      libgcrypt/AVX2
SHA256   63000        37      libsodium/AVX2
BLAKE2b  22000        13      reference impl.
BLAKE2b  19000        11      libgcrypt/AVX2
BLAKE2b  19000        11      libsodium/AVX2
======== ============ ======= ================
Many kernels are configured with SHA256 as built-in and not as a module.
The accelerated versions are, however, provided by the modules and must be loaded
explicitly (**modprobe sha256**) before mounting the filesystem to make use of
them. You can check in */sys/fs/btrfs/FSID/checksum* which one is used. If you
see *sha256-generic*, then you may want to unmount and mount the filesystem
again; changing that on a mounted filesystem is not possible.
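A hedged example of loading the module and checking which implementation a
mounted filesystem uses (FSID and the device are placeholders):

.. code-block:: bash

   # modprobe sha256
   # mount /dev/sdx /mnt
   # cat /sys/fs/btrfs/FSID/checksum
   # ... reports the active driver, e.g. sha256-generic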
Check the file */proc/crypto*; when the implementation is built-in, you'd find:
.. code-block:: none
name : sha256
driver : sha256-generic
module : kernel
priority : 100
...
while an accelerated implementation is e.g.:
.. code-block:: none
name : sha256
driver : sha256-avx2
module : sha256_ssse3
priority : 170
...
.. include:: ch-checksumming.rst
COMPRESSION
-----------
Btrfs supports transparent file compression. There are three algorithms
available: ZLIB, LZO and ZSTD (since v4.14). Compression happens on a file by
file basis: a single btrfs mount point can have some files that are
uncompressed, some compressed with LZO and some with ZLIB, for instance
(though you may not want it that way, it is supported).
To enable compression, mount the filesystem with options *compress* or
*compress-force*. Please refer to section *MOUNT OPTIONS*. Once compression is
enabled, all new writes will be subject to compression. Some files may not
compress very well, and these are typically not recompressed but still written
uncompressed.
Each compression algorithm has different speed/ratio trade-offs. The levels
can be selected by a mount option and affect only the resulting size (i.e.
no compatibility issues).
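A hedged example of enabling compression at mount time, with an explicit ZSTD
level (the device and mount point are placeholders; level support for ZSTD is
noted below):

.. code-block:: bash

   # mount -o compress=zstd:3 /dev/sdx /mnt
   # mount -o remount,compress=lzo /mnt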
Basic characteristics:
ZLIB
* slower, higher compression ratio
* levels: 1 to 9, mapped directly, default level is 3
* good backward compatibility
LZO
* faster compression and decompression than zlib, worse compression ratio, designed to be fast
* no levels
* good backward compatibility
ZSTD
* compression comparable to zlib with higher compression/decompression speeds and different ratio
* levels: 1 to 15
* supported since kernel 4.14, levels selectable since kernel 5.1
The differences depend on the actual data set and cannot be expressed by a
single number or recommendation. Higher levels consume more CPU time and may
not bring a significant improvement; lower levels are close to real time.
The algorithms can be mixed in one file as the compression is stored per extent.
The compression of a file can be changed by the **btrfs filesystem defrag**
command using the *-c* option, or by **btrfs property set** using the
*compression* property. Setting compression by the **chattr +c** utility will
set it to zlib.
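A hedged sketch of these per-file controls (the file name is a placeholder):

.. code-block:: bash

   # btrfs filesystem defragment -czstd file
   # btrfs property set file compression zstd
   # chattr +c file
   # ... chattr +c selects zlib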
INCOMPRESSIBLE DATA
^^^^^^^^^^^^^^^^^^^
Files with already compressed data, or with data that won't compress well given
the CPU and memory constraints of the kernel implementations, are handled by a
simple decision logic: if the first portion of data being compressed is not
smaller than the original, the compression of the file is disabled -- unless the
filesystem is mounted with *compress-force*. In that case compression will
always be attempted on the file, only to be later discarded. This is not optimal
and is subject to optimizations and further development.
If a file is identified as incompressible, a flag is set (NOCOMPRESS) and it's
sticky. Compression won't be performed on that file unless forced. The flag
can also be set by **chattr +m** (since e2fsprogs 1.46.2) or by the *compression*
property with the value *no* or *none*. An empty value will reset it to the
default that's currently applicable on the mounted filesystem.
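A hedged sketch of opting a file out of compression and resetting it to the
default (the file name is a placeholder; passing an empty value on the command
line is an assumption based on the text above):

.. code-block:: bash

   # chattr +m file
   # btrfs property set file compression none
   # btrfs property set file compression ""
   # ... the empty value resets to the current filesystem default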
There are two ways to detect incompressible data:
* actual compression attempt - the data are compressed; if the result is not
  smaller, it's discarded, so this depends on the algorithm and level
* pre-compression heuristics - a quick statistical evaluation of the data is
  performed and, based on the result, compression is either performed or skipped;
  the NOCOMPRESS bit is not set just by the heuristic, only if the compression
  algorithm does not make an improvement
PRE-COMPRESSION HEURISTICS
^^^^^^^^^^^^^^^^^^^^^^^^^^
The heuristics aim to do a few quick statistical tests on the data to be
compressed, in order to avoid potentially costly compression that would turn out to be
inefficient. Compression algorithms could have internal detection of
incompressible data too but this leads to more overhead as the compression is
done in another thread and has to write the data anyway. The heuristic is
read-only and can utilize cached memory.
The tests performed are based on the following: data sampling, long repeated
pattern detection, byte frequency, Shannon entropy.
COMPATIBILITY WITH OTHER FEATURES
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Compression is done using the COW mechanism so it's incompatible with
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
writes. Currently *nodatasum* and compression don't work together.
.. include:: ch-compression.rst
FILESYSTEM EXCLUSIVE OPERATIONS
-------------------------------
@ -1249,83 +1093,7 @@ that report space usage: **filesystem df**, **device usage**. The command
SEEDING DEVICE
--------------
The COW mechanism and multiple devices under one hood enable an interesting
concept, called a seeding device: a read-only filesystem on a single device can
be extended with another device that captures all writes. For example, imagine
an immutable golden image of an operating system enhanced with another device
that allows using the data from the golden image while operating normally.
This idea originated with CD-ROMs holding a base OS that could be used for live
systems, but that became obsolete. There are technologies providing similar
functionality, like *unionmount*, *overlayfs* or *qcow2* image snapshots.
The seeding device starts as a normal filesystem; once the contents are ready,
**btrfstune -S 1** is used to flag it as a seeding device. Mounting such a device
will not allow any writes, except adding a new device by **btrfs device add**.
Then the filesystem can be remounted as read-write.
Given that the filesystem on the seeding device is always recognized as
read-only, it can be used to seed multiple filesystems at the same time. The
UUID that is normally attached to a device is automatically changed to a random
UUID on each mount.
Once the seeding device is mounted, it needs the writable device. After adding
it, something like **mount -o remount,rw /path** makes the filesystem at
*/path* ready for use. The simplest use case is to throw away all changes by
unmounting the filesystem when convenient.
Alternatively, deleting the seeding device from the filesystem can turn it into
a normal filesystem, provided that the writable device can also contain all the
data from the seeding device.
The seeding device flag can be cleared again by **btrfstune -f -S 0**, e.g.
allowing the seeding device to be updated with newer data, but please note that
this will invalidate all existing filesystems that use this particular seeding
device. This works for some use cases and not for others, and a forcing flag to
the command is mandatory to avoid accidental mistakes.
An example of how to create and use a seeding device:
.. code-block:: bash
# mkfs.btrfs /dev/sda
# mount /dev/sda /mnt/mnt1
# ... fill mnt1 with data
# umount /mnt/mnt1
# btrfstune -S 1 /dev/sda
# mount /dev/sda /mnt/mnt1
# btrfs device add /dev/sdb /mnt/mnt1
# mount -o remount,rw /mnt/mnt1
# ... /mnt/mnt1 is now writable
Now */mnt/mnt1* can be used normally. The device */dev/sda* can be mounted
again with another writable device:
.. code-block:: bash
# mount /dev/sda /mnt/mnt2
# btrfs device add /dev/sdc /mnt/mnt2
# mount -o remount,rw /mnt/mnt2
# ... /mnt/mnt2 is now writable
The writable device (*/dev/sdb*) can be decoupled from the seeding device and
used independently:
.. code-block:: bash
# btrfs device delete /dev/sda /mnt/mnt1
As the contents originated in the seeding device, it's possible to turn
*/dev/sdb* into a seeding device again and repeat the whole process.
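Continuing the example, a minimal sketch of turning the decoupled device into a
new seeding device (the mount point matches the example above):

.. code-block:: bash

   # umount /mnt/mnt1
   # btrfstune -S 1 /dev/sdb
   # mount /dev/sdb /mnt/mnt1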
A few things to note:
* it's recommended to use only a single device for the seeding device; it works
  for multiple devices, but the *single* profile must be used in order to make
  the seeding device deletion work
* block group profiles *single* and *dup* support the use cases above
* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
* each new mount of the seeding device gets a new random UUID
.. include:: ch-seeding-device.rst
RAID56 STATUS AND RECOMMENDED PRACTICES
---------------------------------------