Btrfs presentation at the LUG is available here.
***************
* btrfs btree *
***************
A btrfs filesystem consists of a collection of btrees:
* Chunk tree (referenced by superblock)
* Root tree (referenced by superblock)
- device tree
- extent tree
- checksum tree
- data relocation tree
- fs tree
contains inode_item inode_ref dir_item dir_index extent_data.
The btrfs-debug-tree tool can be used to dump the btrees. Here is an output example.
btrfs stores all metadata structures (and, to some extent, even data when it fits in a btree leaf) as items inside the btrees.
...
Unlike traditional filesystems, btrfs can store structures of different types inside the same leaf block. The beginning of the block lists all the items (see the btrfs_item structure) found in this block, and btrfs_item::offset and btrfs_item::size tell you where to find the item data inside the leaf block.
-------------------------
| btrfs_item 0          |
| btrfs_item 1          |
| btrfs_item 2          |
| ...                   |
|-----------------------|
|                       |
|                       |
|       data for item 2 |
|                       |
|                       |
|-----------------------|
|       data for item 1 |
|-----------------------|
|                       |
|       data for item 0 |
|                       |
-------------------------
We can then have an inode item sitting by an EA item sitting by an extent and all this inside the same leaf block. That's a very space-efficient approach.
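The leaf layout described above can be sketched in a few lines of Python. This is only an illustration, not the kernel implementation: it assumes a 101-byte btrfs_header with nritems at byte 96, a packed 25-byte btrfs_item (key objectid, key type, key offset, data offset, data size), and data offsets counted from the end of the header. The helper names build_leaf and read_items are hypothetical, and the key type numbers in the usage note are assumptions as well.

```python
import struct

# Assumed on-disk layout (little-endian, packed); not taken from this page.
HEADER_SIZE = 101                      # assumed sizeof(struct btrfs_header)
NRITEMS_OFFSET = 96                    # assumed position of nritems in the header
ITEM_FMT = "<QBQII"                    # key(objectid, type, offset), data offset, data size
ITEM_SIZE = struct.calcsize(ITEM_FMT)  # 25 bytes


def build_leaf(items, leaf_size=4096):
    """Pack (objectid, type, key_offset, payload) tuples into a toy leaf.

    Item headers grow forward from the end of the block header, while
    item data grows backward from the end of the block.
    """
    leaf = bytearray(leaf_size)
    struct.pack_into("<I", leaf, NRITEMS_OFFSET, len(items))
    data_end = leaf_size - HEADER_SIZE  # offsets are relative to the header end
    for slot, (objectid, key_type, key_offset, payload) in enumerate(items):
        data_end -= len(payload)
        struct.pack_into(ITEM_FMT, leaf, HEADER_SIZE + slot * ITEM_SIZE,
                         objectid, key_type, key_offset, data_end, len(payload))
        start = HEADER_SIZE + data_end
        leaf[start:start + len(payload)] = payload
    return bytes(leaf)


def read_items(leaf):
    """Return [(key, data)] by following each item's offset and size."""
    (nritems,) = struct.unpack_from("<I", leaf, NRITEMS_OFFSET)
    result = []
    for slot in range(nritems):
        objectid, key_type, key_offset, offset, size = struct.unpack_from(
            ITEM_FMT, leaf, HEADER_SIZE + slot * ITEM_SIZE)
        start = HEADER_SIZE + offset
        result.append(((objectid, key_type, key_offset), leaf[start:start + size]))
    return result
```

For example, build_leaf([(257, 1, 0, b"inode"), (257, 24, 0, b"xattr"), (257, 108, 0, b"extent")]) places an inode item, an EA item and an extent item for object 257 in one block (1, 24 and 108 being the assumed key type values), and read_items() recovers each payload from its offset and size alone.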
*************
* Directory *
*************
...
******************************************************
* Space Reservation for metadata operations *
******************************************************
...
If space reservation for either data or metadata cannot be satisfied, the write fails with ENOSPC.
Otherwise, the reserved space is released when the new btree root is written to disk (transaction commit) through the following code path:
__extent_writepage()
->run_delalloc_range()
-> cow_file_range()
-> extent_clear_unlock_delalloc()
-> clear_extent_bit(...EXTENT_DELALLOC)
-> btrfs_delalloc_release_metadata()
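The reserve-then-release behaviour can be illustrated with a toy accounting model. This is a minimal sketch and not kernel code: the SpaceInfo class and its reserve() and release() methods are hypothetical names that only mimic the rule that a reservation either succeeds or fails with ENOSPC, and that reserved space is given back later.

```python
import errno
import os


class SpaceInfo:
    """Toy model of space accounting (hypothetical, not the kernel structure)."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.reserved_bytes = 0

    def reserve(self, nbytes):
        """Reserve space up front, or fail with ENOSPC before any write starts."""
        if self.reserved_bytes + nbytes > self.total_bytes:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        self.reserved_bytes += nbytes

    def release(self, nbytes):
        """Give a reservation back, as done once the new tree is on disk."""
        self.reserved_bytes -= min(nbytes, self.reserved_bytes)


def try_write(space, nbytes):
    """Return True if the write could reserve its space, False on ENOSPC."""
    try:
        space.reserve(nbytes)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise
    return True
```

With a 100-byte SpaceInfo, a 60-byte reservation succeeds, a further 50-byte one is refused, and it succeeds once the first reservation has been released.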
btrfs-debug-tree output on:
* an empty filesystem which has just been formatted (a single device, default options)
* a filesystem with one file (with inum/objid 257) created
* the same filesystem with a subvolume (subvolid 256) created
$ btrfs subvolume list /mnt/
ID 256 top level 5 path subvol
************
* Checksum *
************
Btrfs checksums both data and metadata. Data checksums for extents are stored in a dedicated btree (see btrfs_csum_item).
In-line data and metadata are protected by the btree block checksum stored in btrfs_header (256-bit checksum field).
Only one checksum type, crc32c, is supported for now, but new checksum types can easily be added.
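To make the checksum scheme concrete, here is a minimal sketch of CRC-32C (Castagnoli polynomial) and of per-block verification. It is a slow, bitwise, dependency-free implementation for illustration only. The 4096-byte block size and the helper names block_checksums and find_corrupt_blocks are assumptions, and the exact seeding used by btrfs is not covered here.

```python
CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial in bit-reflected form


def crc32c(data, crc=0):
    """Standard CRC-32C computed bit by bit (init and final XOR of 0xFFFFFFFF)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def block_checksums(data, block_size=4096):
    """One checksum per block; the block size is an assumption of this sketch."""
    return [crc32c(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def find_corrupt_blocks(data, expected, block_size=4096):
    """Return the indexes of blocks whose checksum differs from the stored one."""
    actual = block_checksums(data, block_size)
    return [i for i, (got, want) in enumerate(zip(actual, expected)) if got != want]
```

Storing the list returned by block_checksums() at write time and calling find_corrupt_blocks() at read time mirrors the idea of keeping data checksums apart from the data they protect.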
...
Lustre wide-striping requires large EAs, so btrfs would need to be modified to support them.
***********************
* Inode Versioning *
***********************
btrfs_inode_item stores the id of the transaction that last touched this inode. Unfortunately, the transaction id is not suitable for Lustre since it is outside Lustre's control and is updated each time the inode is COWed, including for an atime update, for instance.
That said, btrfs_inode_item has a 'sequence' number that is updated on file write. This field is intended for the NFSv4 change attribute and could also be used by Lustre to store the Lustre version.
*******************
* Transactions *
*******************
Like jbd, btrfs has a dedicated thread (namely btrfs-transaction) in charge of transaction commit.
By default, it commits the new tree every 30s (see transaction_kthread()).
btrfs exports transactions to userspace through 2 ioctls (BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END).
This API is used by Ceph's OSD, but it does not allow ENOSPC to be handled correctly.
A new API was proposed; more information is available here: http://lwn.net/Articles/361457/