btrfs-progs: docs: separate chapter for hardware considerations

Make it more visible than just in section 5. Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-12 20:39:29 +02:00 · 2022-05-12 20:39:29 +02:00 · 4c5554d46d
parent f1b4ef2f35
commit 4c5554d46d
4 changed files with 221 additions and 216 deletions
--- a/Documentation/Hardware.rst
+++ b/Documentation/Hardware.rst
@ -0,0 +1,4 @@
+Hardware considerations
+=======================
+
+.. include:: ch-hardware-considerations.rst
--- a/Documentation/btrfs-man5.rst
+++ b/Documentation/btrfs-man5.rst
@ -491,222 +491,7 @@ block number because this is what the logical structure expects and verifies.
 HARDWARE CONSIDERATIONS
 -----------------------

-The following is based on information publicly available, user feedback,
-community discussions or bug report analyses. It's not complete and further
-research is encouraged when in doubt.
-
-MAIN MEMORY
-^^^^^^^^^^^
-
-The data structures and raw data blocks are temporarily stored in computer
-memory before they get written to the device. It is critical that memory is
-reliable because even simple bit flips can have vast consequences and lead to
-damaged structures, not only in the filesystem but in the whole operating
-system.
-
-Based on experience in the community, memory bit flips are more common than one
-would think. When it happens, it's reported by the tree-checker or by a checksum
-mismatch after reading blocks. There are some very obvious instances of bit
-flips that happen, e.g. in an ordered sequence of keys in metadata blocks. We can
-easily infer from the other data what values get damaged and how. However, fixing
-that is not straightforward and would require cross-referencing data from the
-entire filesystem to see the scope.
-
-If available, ECC memory should lower the chances of bit flips, but this
-type of memory is not available in all cases. A memory test should be performed
-in case there's a visible bit flip pattern, though this may not detect a faulty
-memory module because the actual load of the system could be the factor making
-the problems appear. In recent years attacks on how the memory modules operate
-have been demonstrated ('rowhammer') achieving specific bits to be flipped.
-While these were targeted, this shows that a series of reads or writes can
-affect unrelated parts of memory.
-
-Further reading:
-
-* https://en.wikipedia.org/wiki/Row_hammer
-
-What to do:
-
-* run *memtest*, note that sometimes memory errors happen only when the system
-  is under heavy load that the default memtest cannot trigger
-* memory errors may appear as filesystem going read-only due to "pre write"
-  check, that verify meta data before they get written but fail some basic
-  consistency checks
-
-DIRECT MEMORY ACCESS (DMA)
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Another class of errors is related to DMA (direct memory access) performed
-by device drivers. While this could be considered a software error, the
-data transfers that happen without CPU assistance may accidentally corrupt
-other pages. Storage devices utilize DMA for performance reasons, the
-filesystem structures and data pages are passed back and forth, making
-errors possible in case page life time is not properly tracked.
-
-There are lots of quirks (device-specific workarounds) in Linux kernel
-drivers (regarding not only DMA) that are added when found. The quirks
-may avoid specific errors or disable some features to avoid worse problems.
-
-What to do:
-
-* use up-to-date kernel (recent releases or maintained long term support versions)
-* as this may be caused by faulty drivers, keep the systems up-to-date
-
-ROTATIONAL DISKS (HDD)
-^^^^^^^^^^^^^^^^^^^^^^
-
-Rotational HDDs typically fail at the level of individual sectors or small clusters.
-Read failures are caught on the levels below the filesystem and are returned to
-the user as *EIO - Input/output error*. Reading the blocks repeatedly may
-return the data eventually, but this is better done by specialized tools and
-filesystem takes the result of the lower layers. Rewriting the sectors may
-trigger internal remapping but this inevitably leads to data loss.
-
-Disk firmware is technically software but from the filesystem perspective is
-part of the hardware. IO requests are processed, and caching or various
-other optimizations are performed, which may lead to bugs under high load or
-unexpected physical conditions or unsupported use cases.
-
-Disks are connected by cables with two ends, both of which can cause problems
-when not attached properly. Data transfers are protected by checksums and the
-lower layers try hard to transfer the data correctly or not at all. The errors
-from badly-connecting cables may manifest as large amount of failed read or
-write requests, or as short error bursts depending on physical conditions.
-
-What to do:
-
-* check **smartctl** for potential issues
-
-SOLID STATE DRIVES (SSD)
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-The mechanism of information storage is different from HDDs and this affects
-the failure mode as well. The data are stored in cells grouped in large blocks
-with limited number of resets and other write constraints. The firmware tries
-to avoid unnecessary resets and performs optimizations to maximize the storage
-media lifetime. The known techniques are deduplication (blocks with same
-fingerprint/hash are mapped to same physical block), compression or internal
-remapping and garbage collection of used memory cells. Due to the additional
-processing there are measures to verity the data e.g. by ECC codes.
-
-The observations of failing SSDs show that the whole electronic fails at once
-or affects a lot of data (eg. stored on one chip). Recovering such data
-may need specialized equipment and reading data repeatedly does not help as
-it's possible with HDDs.
-
-There are several technologies of the memory cells with different
-characteristics and price. The lifetime is directly affected by the type and
-frequency of data written.  Writing "too much" distinct data (e.g. encrypted)
-may render the internal deduplication ineffective and lead to a lot of rewrites
-and increased wear of the memory cells.
-
-There are several technologies and manufacturers so it's hard to describe them
-but there are some that exhibit similar behaviour:
-
-* expensive SSD will use more durable memory cells and is optimized for
-  reliability and high load
-* cheap SSD is projected for a lower load ("desktop user") and is optimized for
-  cost, it may employ the optimizations and/or extended error reporting
-  partially or not at all
-
-It's not possible to reliably determine the expected lifetime of an SSD due to
-lack of information about how it works or due to lack of reliable stats provided
-by the device.
-
-Metadata writes tend to be the biggest component of lifetime writes to a SSD,
-so there is some value in reducing them. Depending on the device class (high
-end/low end) the features like DUP block group profiles may affect the
-reliability in both ways:
-
-* *high end* are typically more reliable and using 'single' for data and
-  metadata could be suitable to reduce device wear
-* *low end* could lack ability to identify errors so an additional redundancy
-  at the filesystem level (checksums, *DUP*) could help
-
-Only users who consume 50 to 100% of the SSD's actual lifetime writes need to be
-concerned by the write amplification of btrfs DUP metadata. Most users will be
-far below 50% of the actual lifetime, or will write the drive to death and
-discover how many writes 100% of the actual lifetime was. SSD firmware often
-adds its own write multipliers that can be arbitrary and unpredictable and
-dependent on application behavior, and these will typically have far greater
-effect on SSD lifespan than DUP metadata. It's more or less impossible to
-predict when a SSD will run out of lifetime writes to within a factor of two, so
-it's hard to justify wear reduction as a benefit.
-
-Further reading:
-
-* https://www.snia.org/educational-library/ssd-and-deduplication-end-spinning-disk-2012
-* https://www.snia.org/educational-library/realities-solid-state-storage-2013-2013
-* https://www.snia.org/educational-library/ssd-performance-primer-2013
-* https://www.snia.org/educational-library/how-controllers-maximize-ssd-life-2013
-
-What to do:
-
-* run **smartctl** or self-tests to look for potential issues
-* keep the firmware up-to-date
-
-NVM EXPRESS, NON-VOLATILE MEMORY (NVMe)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-NVMe is a type of persistent memory usually connected over a system bus (PCIe)
-or similar interface and the speeds are an order of magnitude faster than SSD.
-It is also a non-rotating type of storage, and is not typically connected by a
-cable. It's not a SCSI type device either but rather a complete specification
-for logical device interface.
-
-In a way the errors could be compared to a combination of SSD class and regular
-memory. Errors may exhibit as random bit flips or IO failures. There are tools
-to access the internal log (**nvme log** and **nvme-cli**) for a more detailed
-analysis.
-
-There are separate error detection and correction steps performed e.g. on the
-bus level and in most cases never making in to the filesystem level. Once this
-happens it could mean there's some systematic error like overheating or bad
-physical connection of the device. You may want to run self-tests (using
-**smartctl**).
-
-* https://en.wikipedia.org/wiki/NVM_Express
-* https://www.smartmontools.org/wiki/NVMe_Support
-
-DRIVE FIRMWARE
-^^^^^^^^^^^^^^
-
-Firmware is technically still software but embedded into the hardware. As all
-software has bugs, so does firmware. Storage devices can update the firmware
-and fix known bugs. In some cases the it's possible to avoid certain bugs by
-quirks (device-specific workarounds) in Linux kernel.
-
-A faulty firmware can cause wide range of corruptions from small and localized
-to large affecting lots of data. Self-repair capabilities may not be sufficient.
-
-What to do:
-
-* check for firmware updates in case there are known problems, note that
-  updating firmware can be risky on itself
-* use up-to-date kernel (recent releases or maintained long term support versions)
-
-SD FLASH CARDS
-^^^^^^^^^^^^^^
-
-There are a lot of devices with low power consumption and thus using storage
-media based on low power consumption too, typically flash memory stored on
-a chip enclosed in a detachable card package. An improperly inserted card may be
-damaged by electrical spikes when the device is turned on or off. The chips
-storing data in turn may be damaged permanently. All types of flash memory
-have a limited number of rewrites, so the data are internally translated by FTL
-(flash translation layer). This is implemented in firmware (technically a
-software) and prone to bugs that manifest as hardware errors.
-
-Adding redundancy like using DUP profiles for both data and metadata can help
-in some cases but a full backup might be the best option once problems appear
-and replacing the card could be required as well.
-
-HARDWARE AS THE MAIN SOURCE OF FILESYSTEM CORRUPTIONS
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-**If you use unreliable hardware and don't know about that, don't blame the
-filesystem when it tells you.**
-
+.. include:: ch-hardware-considerations.rst

 SEE ALSO
 --------
--- a/Documentation/ch-hardware-considerations.rst
+++ b/Documentation/ch-hardware-considerations.rst
@ -0,0 +1,215 @@
+The following is based on information publicly available, user feedback,
+community discussions or bug report analyses. It's not complete and further
+research is encouraged when in doubt.
+
+MAIN MEMORY
+^^^^^^^^^^^
+
+The data structures and raw data blocks are temporarily stored in computer
+memory before they get written to the device. It is critical that memory is
+reliable because even simple bit flips can have vast consequences and lead to
+damaged structures, not only in the filesystem but in the whole operating
+system.
+
+Based on experience in the community, memory bit flips are more common than one
+would think. When it happens, it's reported by the tree-checker or by a checksum
+mismatch after reading blocks. There are some very obvious instances of bit
+flips that happen, e.g. in an ordered sequence of keys in metadata blocks. We can
+easily infer from the other data what values get damaged and how. However, fixing
+that is not straightforward and would require cross-referencing data from the
+entire filesystem to see the scope.
+
+If available, ECC memory should lower the chances of bit flips, but this
+type of memory is not available in all cases. A memory test should be performed
+in case there's a visible bit flip pattern, though this may not detect a faulty
+memory module because the actual load of the system could be the factor making
+the problems appear. In recent years attacks on how the memory modules operate
+have been demonstrated ('rowhammer') achieving specific bits to be flipped.
+While these were targeted, this shows that a series of reads or writes can
+affect unrelated parts of memory.
+
+Further reading:
+
+* https://en.wikipedia.org/wiki/Row_hammer
+
+What to do:
+
+* run *memtest*, note that sometimes memory errors happen only when the system
+  is under heavy load that the default memtest cannot trigger
+* memory errors may appear as filesystem going read-only due to "pre write"
+  check, that verify meta data before they get written but fail some basic
+  consistency checks
+
+DIRECT MEMORY ACCESS (DMA)
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Another class of errors is related to DMA (direct memory access) performed
+by device drivers. While this could be considered a software error, the
+data transfers that happen without CPU assistance may accidentally corrupt
+other pages. Storage devices utilize DMA for performance reasons, the
+filesystem structures and data pages are passed back and forth, making
+errors possible in case page life time is not properly tracked.
+
+There are lots of quirks (device-specific workarounds) in Linux kernel
+drivers (regarding not only DMA) that are added when found. The quirks
+may avoid specific errors or disable some features to avoid worse problems.
+
+What to do:
+
+* use up-to-date kernel (recent releases or maintained long term support versions)
+* as this may be caused by faulty drivers, keep the systems up-to-date
+
+ROTATIONAL DISKS (HDD)
+^^^^^^^^^^^^^^^^^^^^^^
+
+Rotational HDDs typically fail at the level of individual sectors or small clusters.
+Read failures are caught on the levels below the filesystem and are returned to
+the user as *EIO - Input/output error*. Reading the blocks repeatedly may
+return the data eventually, but this is better done by specialized tools and
+filesystem takes the result of the lower layers. Rewriting the sectors may
+trigger internal remapping but this inevitably leads to data loss.
+
+Disk firmware is technically software but from the filesystem perspective is
+part of the hardware. IO requests are processed, and caching or various
+other optimizations are performed, which may lead to bugs under high load or
+unexpected physical conditions or unsupported use cases.
+
+Disks are connected by cables with two ends, both of which can cause problems
+when not attached properly. Data transfers are protected by checksums and the
+lower layers try hard to transfer the data correctly or not at all. The errors
+from badly-connecting cables may manifest as large amount of failed read or
+write requests, or as short error bursts depending on physical conditions.
+
+What to do:
+
+* check **smartctl** for potential issues
+
+SOLID STATE DRIVES (SSD)
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The mechanism of information storage is different from HDDs and this affects
+the failure mode as well. The data are stored in cells grouped in large blocks
+with limited number of resets and other write constraints. The firmware tries
+to avoid unnecessary resets and performs optimizations to maximize the storage
+media lifetime. The known techniques are deduplication (blocks with same
+fingerprint/hash are mapped to same physical block), compression or internal
+remapping and garbage collection of used memory cells. Due to the additional
+processing there are measures to verity the data e.g. by ECC codes.
+
+The observations of failing SSDs show that the whole electronic fails at once
+or affects a lot of data (eg. stored on one chip). Recovering such data
+may need specialized equipment and reading data repeatedly does not help as
+it's possible with HDDs.
+
+There are several technologies of the memory cells with different
+characteristics and price. The lifetime is directly affected by the type and
+frequency of data written.  Writing "too much" distinct data (e.g. encrypted)
+may render the internal deduplication ineffective and lead to a lot of rewrites
+and increased wear of the memory cells.
+
+There are several technologies and manufacturers so it's hard to describe them
+but there are some that exhibit similar behaviour:
+
+* expensive SSD will use more durable memory cells and is optimized for
+  reliability and high load
+* cheap SSD is projected for a lower load ("desktop user") and is optimized for
+  cost, it may employ the optimizations and/or extended error reporting
+  partially or not at all
+
+It's not possible to reliably determine the expected lifetime of an SSD due to
+lack of information about how it works or due to lack of reliable stats provided
+by the device.
+
+Metadata writes tend to be the biggest component of lifetime writes to a SSD,
+so there is some value in reducing them. Depending on the device class (high
+end/low end) the features like DUP block group profiles may affect the
+reliability in both ways:
+
+* *high end* are typically more reliable and using 'single' for data and
+  metadata could be suitable to reduce device wear
+* *low end* could lack ability to identify errors so an additional redundancy
+  at the filesystem level (checksums, *DUP*) could help
+
+Only users who consume 50 to 100% of the SSD's actual lifetime writes need to be
+concerned by the write amplification of btrfs DUP metadata. Most users will be
+far below 50% of the actual lifetime, or will write the drive to death and
+discover how many writes 100% of the actual lifetime was. SSD firmware often
+adds its own write multipliers that can be arbitrary and unpredictable and
+dependent on application behavior, and these will typically have far greater
+effect on SSD lifespan than DUP metadata. It's more or less impossible to
+predict when a SSD will run out of lifetime writes to within a factor of two, so
+it's hard to justify wear reduction as a benefit.
+
+Further reading:
+
+* https://www.snia.org/educational-library/ssd-and-deduplication-end-spinning-disk-2012
+* https://www.snia.org/educational-library/realities-solid-state-storage-2013-2013
+* https://www.snia.org/educational-library/ssd-performance-primer-2013
+* https://www.snia.org/educational-library/how-controllers-maximize-ssd-life-2013
+
+What to do:
+
+* run **smartctl** or self-tests to look for potential issues
+* keep the firmware up-to-date
+
+NVM EXPRESS, NON-VOLATILE MEMORY (NVMe)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+NVMe is a type of persistent memory usually connected over a system bus (PCIe)
+or similar interface and the speeds are an order of magnitude faster than SSD.
+It is also a non-rotating type of storage, and is not typically connected by a
+cable. It's not a SCSI type device either but rather a complete specification
+for logical device interface.
+
+In a way the errors could be compared to a combination of SSD class and regular
+memory. Errors may exhibit as random bit flips or IO failures. There are tools
+to access the internal log (**nvme log** and **nvme-cli**) for a more detailed
+analysis.
+
+There are separate error detection and correction steps performed e.g. on the
+bus level and in most cases never making in to the filesystem level. Once this
+happens it could mean there's some systematic error like overheating or bad
+physical connection of the device. You may want to run self-tests (using
+**smartctl**).
+
+* https://en.wikipedia.org/wiki/NVM_Express
+* https://www.smartmontools.org/wiki/NVMe_Support
+
+DRIVE FIRMWARE
+^^^^^^^^^^^^^^
+
+Firmware is technically still software but embedded into the hardware. As all
+software has bugs, so does firmware. Storage devices can update the firmware
+and fix known bugs. In some cases the it's possible to avoid certain bugs by
+quirks (device-specific workarounds) in Linux kernel.
+
+A faulty firmware can cause wide range of corruptions from small and localized
+to large affecting lots of data. Self-repair capabilities may not be sufficient.
+
+What to do:
+
+* check for firmware updates in case there are known problems, note that
+  updating firmware can be risky on itself
+* use up-to-date kernel (recent releases or maintained long term support versions)
+
+SD FLASH CARDS
+^^^^^^^^^^^^^^
+
+There are a lot of devices with low power consumption and thus using storage
+media based on low power consumption too, typically flash memory stored on
+a chip enclosed in a detachable card package. An improperly inserted card may be
+damaged by electrical spikes when the device is turned on or off. The chips
+storing data in turn may be damaged permanently. All types of flash memory
+have a limited number of rewrites, so the data are internally translated by FTL
+(flash translation layer). This is implemented in firmware (technically a
+software) and prone to bugs that manifest as hardware errors.
+
+Adding redundancy like using DUP profiles for both data and metadata can help
+in some cases but a full backup might be the best option once problems appear
+and replacing the card could be required as well.
+
+HARDWARE AS THE MAIN SOURCE OF FILESYSTEM CORRUPTIONS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**If you use unreliable hardware and don't know about that, don't blame the
+filesystem when it tells you.**
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@ -10,6 +10,7 @@ Welcome to BTRFS documentation!
   Introduction
   man-index
   Administration
+   Hardware
   CHANGES
   Glossary
   INSTALL