Btrfs presentation at the LUG is available here.

***************
* btrfs btree *
***************

A btrfs filesystem consists of a collection of btrees:
* Chunk tree (referenced by superblock)
* Root tree (referenced by superblock)
  - device tree
  - extent tree
  - checksum tree
  - data relocation tree
  - fs tree
     contains inode_item, inode_ref, dir_item, dir_index and extent_data items. An example dump is available in btrfs_1file_1subvol.txt.

The btrfs-debug-tree tool can be used to dump the btrees.

...

Unlike traditional filesystems, btrfs can store structures of different types inside the same leaf block. The beginning of each leaf block lists all the items (see the btrfs_item structure) stored in that block, and btrfs_item::offset and btrfs_item::size tell you where to find each item's data inside the leaf block.

------------------------
| btrfs_item 0         |
| btrfs_item 1         |
| btrfs_item 2         |
|   ...                |
|----------------------|
|                      |
|                      |
| data for item 2      |
|                      |
|                      |
|----------------------|
| data for item 1      |
|----------------------|
|                      |
| data for item 0      |
|                      |
------------------------

An inode item can therefore sit next to an EA item and an extent item, all inside the same leaf block. That's a very space-efficient approach.
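The leaf layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the kernel's implementation: the 101-byte header size, the packed 25-byte btrfs_item layout (key objectid, key type, key offset, data offset, data size) and the convention that data offsets are relative to the end of the header reflect one reading of the on-disk format and should be checked against ctree.h. build_leaf() and read_items() are hypothetical helpers, and the toy leaf leaves the header zeroed.

```python
import struct

HEADER_SIZE = 101  # assumed sizeof(struct btrfs_header); check ctree.h
# Assumed packed little-endian btrfs_item:
# key.objectid (u64), key.type (u8), key.offset (u64), data offset (u32), data size (u32)
ITEM = struct.Struct("<QBQII")


def build_leaf(items, leaf_size=4096):
    """Build a toy leaf: item headers grow forward, item data grows backward."""
    leaf = bytearray(leaf_size)
    data_end = leaf_size - HEADER_SIZE  # offsets are relative to the end of the header
    for slot, (key, data) in enumerate(items):
        data_end -= len(data)
        ITEM.pack_into(leaf, HEADER_SIZE + slot * ITEM.size,
                       key[0], key[1], key[2], data_end, len(data))
        start = HEADER_SIZE + data_end
        leaf[start:start + len(data)] = data
    return bytes(leaf)


def read_items(leaf, nritems):
    """Return [(key, data)] by following each item's offset and size."""
    out = []
    for slot in range(nritems):
        objectid, ktype, koffset, data_off, data_size = ITEM.unpack_from(
            leaf, HEADER_SIZE + slot * ITEM.size)
        start = HEADER_SIZE + data_off
        out.append(((objectid, ktype, koffset), leaf[start:start + data_size]))
    return out


if __name__ == "__main__":
    # Key types 1, 24 and 108 are assumed to be INODE_ITEM, XATTR_ITEM and EXTENT_DATA.
    leaf = build_leaf([((257, 1, 0), b"inode"),
                       ((257, 24, 0), b"xattr"),
                       ((257, 108, 0), b"extent")])
    for key, data in read_items(leaf, 3):
        print(key, data)
```

Items of three different types share one block here, and a reader needs only the item headers at the front of the block to locate each payload.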

...

A third item is used for directories: the back reference of type BTRFS_INODE_REF_KEY, whose btrfs_inode_ref item also includes the file name. See btrfs_insert_inode_ref() for more information.

*********************************************
* Space Reservation for metadata operations *
*********************************************

When processing a metadata operation, btrfs must make sure that it will be able to allocate enough space to complete the operation (the new btree root must be consistent). As a consequence, space is reserved when starting a transaction (actually, we just get a journal handle) in btrfs_start_transaction() and released when the handle is released in btrfs_end_transaction(), since at this point all blocks required by the metadata operation have been allocated.
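The reserve-on-start, release-on-end pattern can be modelled with a toy allocator. This is a conceptual sketch only: ToyFs and ToyHandle are hypothetical names, block counts stand in for bytes, and the real accounting in btrfs is considerably more involved.

```python
import errno


class ToyFs:
    def __init__(self, free_blocks):
        self.free = free_blocks  # blocks not yet allocated
        self.reserved = 0        # blocks promised to open handles

    def available(self):
        return self.free - self.reserved

    def start_transaction(self, worst_case_blocks):
        """Reserve the worst case up front, or fail before touching anything."""
        if self.available() < worst_case_blocks:
            raise OSError(errno.ENOSPC, "cannot reserve metadata space")
        self.reserved += worst_case_blocks
        return ToyHandle(self, worst_case_blocks)


class ToyHandle:
    def __init__(self, fs, blocks):
        self.fs = fs
        self.remaining = blocks  # unused part of this handle's reservation

    def alloc_block(self):
        """Turn one reserved block into an allocated block."""
        if self.remaining == 0:
            raise RuntimeError("reservation exhausted")
        self.remaining -= 1
        self.fs.reserved -= 1
        self.fs.free -= 1

    def end(self):
        """Give back whatever part of the reservation was not used."""
        self.fs.reserved -= self.remaining
        self.remaining = 0


if __name__ == "__main__":
    fs = ToyFs(free_blocks=10)
    handle = fs.start_transaction(4)
    handle.alloc_block()
    handle.end()
    print(fs.free, fs.available())
```

Because ENOSPC can only be raised in start_transaction(), an operation that obtained its handle never runs out of space halfway through, which is the property the paragraph above is after.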

...

Lustre wide-striping requires large EA support, so btrfs would need to be modified to support large EAs.

********************
* Inode Versioning *
********************

btrfs_inode_item stores the id of the transaction that last touched the inode. Unfortunately, the transaction id is not suitable for Lustre since it is outside Lustre's control and is updated each time the inode is COWed, including for an atime update, for instance.
That said, btrfs_inode_item has a 'sequence' number that is updated on file write. This field is intended to be used for the NFSv4 change attribute and could be used by Lustre to store the Lustre version.
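The difference between the two fields can be illustrated with a toy model. ToyInode and its methods are hypothetical and only mirror the behaviour described above: every COW records the current transaction id, while only a write bumps the sequence number.

```python
from dataclasses import dataclass


@dataclass
class ToyInode:
    transid: int = 0   # id of the transaction that last COWed the inode
    sequence: int = 0  # bumped on file write only

    def cow(self, current_transid):
        self.transid = current_transid

    def touch_atime(self, current_transid):
        # An atime update COWs the inode but does not change file content.
        self.cow(current_transid)

    def write(self, current_transid):
        self.cow(current_transid)
        self.sequence += 1


if __name__ == "__main__":
    inode = ToyInode()
    inode.write(5)
    inode.touch_atime(6)
    print(inode)
```

After the atime update the transaction id has moved on while the sequence number still reflects the single write, which is why the sequence field is the better candidate for a version.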

****************
* Transactions *
****************

Like jbd, btrfs has a dedicated thread (namely btrfs-transaction) in charge of transaction commit.
By default, it commits the new tree every 30s (see transaction_kthread()).
btrfs exports transactions to userspace through two ioctls (BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END).
This API is used by Ceph's OSD but does not allow ENOSPC to be handled correctly.
A new API was proposed; more information is available here: http://lwn.net/Articles/361457/
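For illustration, a userspace caller might bracket its work with the two ioctls roughly as follows. This is a sketch under assumptions: the request numbers are derived from an ioctl magic of 0x94 and command numbers 6 and 7 using the x86-style _IO() encoding, which should be verified against the btrfs ioctl header for the kernel in use, and the ioctls may not be available on every kernel. with_user_transaction() is a hypothetical helper.

```python
import fcntl
import os

BTRFS_IOCTL_MAGIC = 0x94  # assumed value; check the btrfs ioctl header


def _IO(magic, nr):
    # _IO() with no direction and no size bits, as encoded on x86 and ARM.
    return (magic << 8) | nr


BTRFS_IOC_TRANS_START = _IO(BTRFS_IOCTL_MAGIC, 6)
BTRFS_IOC_TRANS_END = _IO(BTRFS_IOCTL_MAGIC, 7)


def with_user_transaction(path, work):
    """Run work(fd) between TRANS_START and TRANS_END on a btrfs path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_TRANS_START)
        try:
            return work(fd)
        finally:
            # Always end the transaction, even if work() raised.
            fcntl.ioctl(fd, BTRFS_IOC_TRANS_END)
    finally:
        os.close(fd)
```

Nothing in this interface reserves space before work() runs, so an ENOSPC raised in the middle leaves the caller with a partially applied update, which matches the limitation noted above.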