btrfs

Btrfs presentation at the LUG is available here.

***************
* btrfs btree *
***************

A btrfs filesystem consists in a collection of btrees:
* Chunk tree (referenced by superblock)
* Root tree (referenced by superblock)
- device tree
- extent tree
- checksum tree
- data relocation tree
- fs tree contains inode_item inode_ref dir_item dir_index extent_data.

The btrfs-debug-tree can be used to dump the btrees. .

btrfs stores all metadata structures (and to some extent even data if it can fit in a btree leaf) are stored as a item inside the btree.

The btrfs btree provides a generic facility to store items of variable size. Each item is associated with an unique key of the following type:

struct btrfs_disk_key {
            __le64 objectid;
            u8 type;
            __le64 offset;
}

The objectid is the most significant bits of the key. Key ordering tries to keep items sharing the same objectid close together in the btree. The objectid is typically the inode number.

Then multiple metadata structures are associated to a given inode: EAs, extents, directory entries, ... To differentiate all those structure, btrfs uses the type field of the key. Type examples are BTRFS_INODE_ITEM_KEY (core inode), BTRFS_DIR_ITEM_KEY (directory entry), BTRFS_XATTR_ITEM_KEY (EA). Please refer to ctree.h for an exhaustive list of item type.

Finally, we can still have mulitiple instances of the same metadata structure for a given inode. Examples are:
* directory entries: a directory can hopefully point to multiple files/dirs
* extents
* EAs
* ...
btrfs uses the offset field of the disk key to look up one specific instances of a given type. The value of this offset depends upon the item type. For extent, the offset is the logical offset of the extent in the file. For directory entry, the offset stores the hash of the file name.

btrees are updated in a Copy-On-Write (COW) manner, this means that btree nodes and leaves are never update in-place (no overwrite) and are thus always written to a new on-disk location.

btree blocks & nodes are also reference counted. This means that nodes & leaf blocks can be shared between different btree (e.g. snapshot) and be referenced multiple time inside the same btree (cp --reflink).

Unlike traditional filesystems, btrfs can store structures of a different type inside the same leaf block. The beginning of the blocks list all the items (see btrfs_item structure) that can be found in this block and btrfs_item::offset & size tells you where to find the item data inside the leaf block.

------------------------
| btrfs_item 0    |
| btrfs_item 1    |
| btrfs_item 2    |
|   ...                |
|-----------------------
|                      |
|                      |
| data for item 2|
|                  |
|                |
|-----------------------
| data for item 1|
|-----------------------
|                  |
| data for item 0|
|                      |
------------------------

We can then have an inode item sitting by an EA item sitting by an extent and all this inside the same leaf block. That's a very space-efficient approach.

*************

* Directory *
*************

Btrfs uses 2 directory indexes:

* BTRFS_DIR_ITEM_KEY is the first index used for lookup. The objectid in the key is the inode number of the directory and the offset is set to the hash of the file name.

* BTRFS_DIR_INDEX_KEY is the 2nd index used for readdir. The purpose of this index is to return inodes in inode number order. The objectid of the key is again the parent directory id and the offset is set to a sequence number incremented each time a new entry is added to the directory. Please refer to btrfs_set_inode_index() to see how sequence number are allocated and btrfs_real_readdir() for how they are parsed.

Both indexes use item of type btrfs_dir_item:

struct btrfs_dir_item {
        struct btrfs_disk_key location;
        __le64 transid;
        __le16 data_len;
        __le16 name_len;
        u8 type;
}

The file name is appended at the end of the structure.

A third item is used for directory. That's the back reference of type BTRFS_INODE_REF_KEY which has a btrfs_inode_ref item also including the file name reference, see btrfs_insert_inode_ref() for more information.

******************************************************
* Space Reservation for metadata operations *
******************************************************

When processing metadata operation, btrfs must make sure that all it will be able to allocate enough space to complete the metadata operation (the new btree root must be consistent). As a consequence, space is reserved when starting a transation (actually, we just get a journal handle) in btrfs_start_transaction() and released once we release the handle in btrfs_end_transaction() since at this point all blocks required to handle the metadata operation have been allocated.

Thus, btrfs_start_transaction() takes the number of items to be modified as a parameter, reserves the corresponding space, creates a handle and stores it in current->journal_info (although ->journla_info is not used anywhere, probably used to detect conflict with jbd). btrfs_trans_reserve_metadata() is in charge of booking metadata for the item modifications. The estimate is based on the worst-case scenario taking for granted that each item is the only one to be updated as part of the transaction.

The actual computation is done in calc_trans_metadata_size() as follows:
(root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) * 3 * num_items
= (4k + 4k * (8 - 1)) * 3 * num_items
= 96k * num_items

With btree node size = btree leave size = page size = 4k which are the default (and also max values) for now.

Creation operations (mknod/creat/mkdir) should require 5 items update (parent inode update, new child inode, 2 directory indexes, inode back reference & selinux EA). That said, btrfs only books space for 5 items today (looks like a bug). Based on the btrfs current implementation (2.6.39), here is a summary of metadata overhead accounted by btrfs on a per-operation basis:

Operation         #items Space booked
mknod/creat/mkdir 5       480k
link               3       288k
unlink            10       ~1m
rename            20       ~1.9m

**********************
* Delayed allocation *
**********************

As a modern filesystem, btrfs supports delayed allocation. In other words, blocks are not allocated at write system call but instead at commit time.
Btrfs maintains a red-black tree in memory for delayed extent allocation, allowing contiguous writes to be merged in the same on-disk extent (see inode_tree_add()). It is worth noting that btrfs does not use VFS's ->write_begin/end, but instead implement its own ->aio_write file operation (i.e. btrfs_file_aio_write()). This allows btrfs to process multiple pages at once and to operate on big extents instead of the usual page-per-page framework of the linux VFS. Delayed allocation can thus be handled at the extent level and be more performant.

Space for both data & metadata is reserved when diryting pages in memory to make sure that enough blocks will be available for the delayed allocation.
Booking is done the following code path btrfs_file_aio_write->__btrfs_buffered_write->btrfs_delalloc_reserve_space:
* btrfs_check_data_free_space() checks that enough data blocks are available and reserve it in the btrfs_space_info structure. If not, it performs chunk allocation.
* btrfs_delalloc_reserve_metadata() is in charge of reserving space for metadata. It uses the per-inode RB tree to estimate the number of extent splits (outstanding_extents which is the number of extent items we think we'll end up using while reserved_extents is the number of extent items we've reserved metadata for) required to handle the write. btrfs_split_extent_hook/btrfs_merge_extent_hook are the places where the number of oustanding extents is adjusted (resp bump/decrease).

If space reservation for either data or metadata cannot be satisfied, the write fails with ENOSPC.
Otherwise, the reserved space is released when the new btree root is written to disk (transaction commit) through the following code path:
__extent_writepage()
->run_delalloc_range()
-> cow_file_range()
   -> extent_clear_unlock_delalloc()
    -> clear_extent_bit(...EXTENT_DELALLOC)
     -> btrfs_delalloc_release_metadata()

************

* Checksum *
************

Btrfs checksums both data and metadata. Data checksums for extents are stored in a dedicated btree (see btrfs_csum_item).
In-line data and metadata are proteted by the btree checksum stored in btrfs_header (256-bit checksum).
Only one checksum type is supported for now, that's crc32c, but new checksum type can be easily added.

Checksums are computed at the last moment, when submitting the bios to disk. The call path is as follows:
btrfs_writepage()
->extent_write_full_page()
->extent_write_full_page()
   ->__extent_writepage()
    ->submit_extent_page()
     ->submit_one_bio()
      ->btrfs_submit_bio_hook()
       ->__btrfs_submit_bio_start()
        ->btrfs_csum_one_bio()
         ->btrfs_csum_data()
          ->crc32c()

In addition to the 256-bit checkout, the btrfs_header also includes:
* the block number where the btree block is supposed to sit. This allows to handle misplaced writes.
* a generation number. All btrfs structure referencing a btree block also has to store the generation field it expects that block to have. This allows to handle phantom writes

***************************
* Multiple device support *
***************************

All devices managed by btrfs are added to a pool of available storage.
Each device is divided into 1GB+ chunks. Then all those chunks are combined to form a virtual address space.
All block pointers used in a btrfs filesystem are relative to this virtual address space handled by a dedicated btree called the chunk tree.
The address space is then divided into block groups with a specific RAID configuration.

***********************
* Write ahead logging *
***********************

COW btree efficient for long running transactions. By default, btrfs commits the new root to disk every 30s whereas jbd2 which commits every 5s.
The COW btree is not suitable for frequent commits which can kill the performance.
To address this problem, a new specialized log for synchronous operations (e.g. fsync or O_SYNC writes) was added.
File or directory items copied to a dedicated btree. Thanks to this new write ahead log tree, btrfs can fsync a single file without syncing the whole filesystem. There is one such log btree per subvolume. The approach is very similar to ZFS’ ZIL
Like ZIL, btrfs log tree breaks lustre recovery since lustre transno can be committed out of order.

*********
* XATTR *
*********

As said before, btrfs stores extended attribute in dedicated btree items. The item type is BTRFS_XATTR_ITEM_KEY and the item content is the same as directory entries (i.e. btrfs_dir_item, see btrfs_insert_xattr_item()). The drawback is that one EA is limited to the size of a leaf block which is 4KB.

As a reminder, Lustre uses extended attributes for various purposes:
* On MDT:
   - LOV EA to store stripping information
     - stripe count/size/pattern + list of object ids
     - large striping (bug 4424) requires big xattr support
   - LMA (Luster Metadata Attributes)
     - store FID of this inode
     - also used by HSM & SOM
* On OST:
   - filter fid EA stores the object id (used when recovering files form lost+found) and back pointer to MDS object
)

Lustre wide-striping requires support for large EA support, so btrfs would need to be modified to support large EA.

***********************
* Inode Versioning *
***********************

btrfs_inode_item stores the transaction id that last touched this inode. Unfortunately, the transaction id is not suitable for Lustre since it is out of control and updated each time the inode is COW, including for atime update for instance.
That said, btrfs_inode_item has a 'sequence' number update on file write. This field is intended to be used for the NFSv4 change attribute and be used by lustre to store the lustre version.

*******************
* Transactions *
*******************

Like jbd, btrfs has a dedicated thread (namely btrfs-transaction) in charge of transaction commit.
By default, it commits the new tree every 30s (see transaction_kthread()).
btrfs exports transaction to userspace through 2 ioctls (BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END).
This API is used by Ceph's OSD but does not allow to handle ENOSPC issue correctly.
A new API was proposed, more information are available here: http://lwn.net/Articles/361457/

Page tree

btrfs