btrfs-progs: docs: merge storage model to hardware chapter

The storage model is the intro chapter for the hardware problems.

Signed-off-by: David Sterba <dsterba@suse.com>
This commit is contained in:
David Sterba 2022-05-12 20:50:04 +02:00
parent 4c5554d46d
commit f7456830d1
2 changed files with 94 additions and 95 deletions

View file

@ -20,8 +20,7 @@ tools. Currently covers:
#. filesystems with multiple block group profiles
#. seeding device
#. raid56 status and recommended practices
#. storage model
#. hardware considerations
#. storage model, hardware considerations
MOUNT OPTIONS
@ -397,99 +396,8 @@ some errors might get reported multiple times.
The *write hole* problem.
STORAGE MODEL
-------------
*A storage model is a model that captures key physical aspects of data
structure in a data store. A filesystem is the logical structure organizing
data on top of the storage device.*
The filesystem assumes several features or limitations of the storage device
and utilizes them or applies measures to guarantee reliability. BTRFS in
particular is based on a COW (copy on write) mode of writing, ie. not updating
data in place but rather writing a new copy to a different location and then
atomically switching the pointers.
In an ideal world, the device does what it promises. The filesystem assumes
that this may not be true so additional mechanisms are applied to either detect
misbehaving hardware or get valid data by other means. The devices may (and do)
apply their own detection and repair mechanisms but we won't assume any.
The following assumptions about storage devices are considered (sorted by
importance, numbers are for further reference):
1. atomicity of reads and writes of blocks/sectors (the smallest unit of data
the device presents to the upper layers)
2. there's a flush command that instructs the device to forcibly order writes
before and after the command; alternatively there's a barrier command that
facilitates the ordering but may not flush the data
3. data sent to write to a given device offset will be written without further
changes to the data and to the offset
4. writes can be reordered by the device, unless explicitly serialized by the
flush command
5. reads and writes can be freely reordered and interleaved
The consistency model of BTRFS builds on these assumptions. The logical data
updates are grouped, into a generation, written on the device, serialized by
the flush command and then the super block is written ending the generation.
All logical links among metadata comprising a consistent view of the data may
not cross the generation boundary.
WHEN THINGS GO WRONG
^^^^^^^^^^^^^^^^^^^^
**No or partial atomicity of block reads/writes (1)**
- *Problem*: a partial block contents is written (*torn write*), eg. due to a
power glitch or other electronics failure during the read/write
- *Detection*: checksum mismatch on read
- *Repair*: use another copy or rebuild from multiple blocks using some encoding
scheme
**The flush command does not flush (2)**
This is perhaps the most serious problem and impossible to mitigate by
filesystem without limitations and design restrictions. What could happen in
the worst case is that writes from one generation bleed to another one, while
still letting the filesystem consider the generations isolated. Crash at any
point would leave data on the device in an inconsistent state without any hint
what exactly got written, what is missing and leading to stale metadata link
information.
Devices usually honor the flush command, but for performance reasons may do
internal caching, where the flushed data are not yet persistently stored. A
power failure could lead to a similar scenario as above, although it's less
likely that later writes would be written before the cached ones. This is
beyond what a filesystem can take into account. Devices or controllers are
usually equipped with batteries or capacitors to write the cache contents even
after power is cut. (*Battery backed write cache*)
**Data get silently changed on write (3)**
Such thing should not happen frequently, but still can happen spuriously due
the complex internal workings of devices or physical effects of the storage
media itself.
* *Problem*: while the data are written atomically, the contents get changed
* *Detection*: checksum mismatch on read
* 'Repair*: use another copy or rebuild from multiple blocks using some
encoding scheme
**Data get silently written to another offset (3)**
This would be another serious problem as the filesystem has no information
when it happens. For that reason the measures have to be done ahead of time.
This problem is also commonly called 'ghost write'.
The metadata blocks have the checksum embedded in the blocks, so a correct
atomic write would not corrupt the checksum. It's likely that after reading
such block the data inside would not be consistent with the rest. To rule that
out there's embedded block number in the metadata block. It's the logical
block number because this is what the logical structure expects and verifies.
HARDWARE CONSIDERATIONS
-----------------------
STORAGE MODEL, HARDWARE CONSIDERATIONS
--------------------------------------
.. include:: ch-hardware-considerations.rst

View file

@ -1,3 +1,94 @@
STORAGE MODEL
^^^^^^^^^^^^^
*A storage model is a model that captures key physical aspects of data
structure in a data store. A filesystem is the logical structure organizing
data on top of the storage device.*
The filesystem assumes several features or limitations of the storage device
and utilizes them or applies measures to guarantee reliability. BTRFS in
particular is based on a COW (copy on write) mode of writing, ie. not updating
data in place but rather writing a new copy to a different location and then
atomically switching the pointers.
In an ideal world, the device does what it promises. The filesystem assumes
that this may not be true so additional mechanisms are applied to either detect
misbehaving hardware or get valid data by other means. The devices may (and do)
apply their own detection and repair mechanisms but we won't assume any.
The following assumptions about storage devices are considered (sorted by
importance, numbers are for further reference):
1. atomicity of reads and writes of blocks/sectors (the smallest unit of data
the device presents to the upper layers)
2. there's a flush command that instructs the device to forcibly order writes
before and after the command; alternatively there's a barrier command that
facilitates the ordering but may not flush the data
3. data sent to write to a given device offset will be written without further
changes to the data and to the offset
4. writes can be reordered by the device, unless explicitly serialized by the
flush command
5. reads and writes can be freely reordered and interleaved
The consistency model of BTRFS builds on these assumptions. The logical data
updates are grouped, into a generation, written on the device, serialized by
the flush command and then the super block is written ending the generation.
All logical links among metadata comprising a consistent view of the data may
not cross the generation boundary.
WHEN THINGS GO WRONG
^^^^^^^^^^^^^^^^^^^^
**No or partial atomicity of block reads/writes (1)**
- *Problem*: a partial block contents is written (*torn write*), eg. due to a
power glitch or other electronics failure during the read/write
- *Detection*: checksum mismatch on read
- *Repair*: use another copy or rebuild from multiple blocks using some encoding
scheme
**The flush command does not flush (2)**
This is perhaps the most serious problem and impossible to mitigate by
filesystem without limitations and design restrictions. What could happen in
the worst case is that writes from one generation bleed to another one, while
still letting the filesystem consider the generations isolated. Crash at any
point would leave data on the device in an inconsistent state without any hint
what exactly got written, what is missing and leading to stale metadata link
information.
Devices usually honor the flush command, but for performance reasons may do
internal caching, where the flushed data are not yet persistently stored. A
power failure could lead to a similar scenario as above, although it's less
likely that later writes would be written before the cached ones. This is
beyond what a filesystem can take into account. Devices or controllers are
usually equipped with batteries or capacitors to write the cache contents even
after power is cut. (*Battery backed write cache*)
**Data get silently changed on write (3)**
Such thing should not happen frequently, but still can happen spuriously due
the complex internal workings of devices or physical effects of the storage
media itself.
* *Problem*: while the data are written atomically, the contents get changed
* *Detection*: checksum mismatch on read
* 'Repair*: use another copy or rebuild from multiple blocks using some
encoding scheme
**Data get silently written to another offset (3)**
This would be another serious problem as the filesystem has no information
when it happens. For that reason the measures have to be done ahead of time.
This problem is also commonly called 'ghost write'.
The metadata blocks have the checksum embedded in the blocks, so a correct
atomic write would not corrupt the checksum. It's likely that after reading
such block the data inside would not be consistent with the rest. To rule that
out there's embedded block number in the metadata block. It's the logical
block number because this is what the logical structure expects and verifies.
The following is based on information publicly available, user feedback,
community discussions or bug report analyses. It's not complete and further
research is encouraged when in doubt.