This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
The new back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer by
searching the tree. The shortcoming of the new back ref is that it only works
for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these fuzzy back
references would be O(number_of_snapshots) and quite slow. The solution used
here is to use the fuzzy back references in the common case where a given tree
block is only referenced by one root, and use the full back references when
multiple roots have a reference
This patch makes btrfsck check more things, including
directory items, file extents, checksumming, inode link
counts etc.
The code for these checks is similar to the code verifies
extent back references. The main difference is that
shared tree blocks are treated specially. The partial
checking results(unresolved references and/or errors)
of shared sub-trees are cached. This avoids scanning
the shared blocks several times. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch updates the ext3 to btrfs converter for the new
disk format. This mainly involves changing the convert's
data relocation and free space management code. This patch
also ports some functions from kernel module to btrfs-progs.
Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This adds a sequence number to the btrfs inode that is increased on
every update. NFS will be able to use that to detect when an inode has
changed, without relying on inaccurate time fields.
While we're here, this also:
Puts reserved space into the super block and inode
Adds a log root transid to the super so we can pick the newest super
based on the fsync log as well as the main transaction ID. For now
the log root transid is always zero, but that'll get fixed.
Adds a starting offset to the dev_item. This will let us do better
alignment calculations if we know the start of a partition on the disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is the btrfs-progs version of the patch to add the ability to have
different csum algorithims. Note I didn't change the image maker since it
seemed a bit more complicated than just changing some stuff around so I will let
Yan take care of that.
Everything else was converted and for now a mkfs just
sets the type to be BTRFS_CSUM_TYPE_CRC32.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch adds btrfs image tool. The image tool is
a debugging tool that creates/restores btrfs metadump
image.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch does the following:
1) Update device management code to match the kernel code.
2) Allocator fixes.
3) Add a program called btrfstune to set/clear the SEEDING
super block flags.
This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The offset field in struct btrfs_extent_ref records the position
inside file that file extent is referenced by. In the new back
reference system, tree leaves holding reference to file extent
are recorded explicitly. We can quickly scan these tree leaves, so the
offset field is not required.
This patch also makes the back reference system check the objectid
when extents are being deleted
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.
This is a utility option for the resizer, it makes sure to allocate
at offset bytes in the disk or higher. It ensures the resizer will have
something to move when testing it.
This patch improves converter's allocator and fixes a bug in data relocation
function. The new allocator caches free blocks as Btrfs's default allocator.
In testing here, the user CPU time reduced to half of the original when
checksum and small file packing was disabled. This patch also enlarges the
size of block groups created by the converter.
The main changes in this patch are adding chunk handing and data relocation
ability. In the last step of conversion, the converter relocates data in system
chunk and move chunk tree into system chunk. In the rollback process, the
converter remove chunk tree from system chunk and copy data back.
Regards
YZ
---
Block headers now store the chunk tree uuid
Chunk items records the device uuid for each stripes
Device extent items record better back refs to the chunk tree
Block groups record better back refs to the chunk tree
The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk. Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.
This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.
The mkfs code bootstraps the filesystem on a single device. Once
the raid block groups are setup, it needs to recow all of the blocks so
that each tree is properly allocated.
The first problem is that these SETGET macros lose typing information,
and therefore can't see the 'packed' attribute and therefore take
unaligned access SIGBUS signals on sparc64 when trying to derefernce
the member.
The next problem is a similar issue in btrfs_name_hash(). This gets
passed things like &key.offset which is a member of a packed
structure, losing this packed'ness information btrfs_name_hash()
performs a potentially unaligned memory access, again resulting in a
SIGBUS.