Executive Summary

This document proposes a comprehensive approach to implementing metadata redundancy in the Lustre filesystem, addressing the critical need for fault tolerance in metadata services. The solution provides redundancy for filesystem configurations, service data, and file/directory metadata through a phased implementation approach.

Goal

Implement comprehensive redundancy for all three types of Lustre filesystem metadata:
1. Filesystem configurations (currently managed by the MGT, which is hosted on a standalone MGS or combined with MDT0)
2. System service data (FLD, quota, flock, and BFL services - currently on MDT0)
3. File and directory inodes (distributed across MDTs)

Key Design Components

1. Fault-Tolerant MGS (Management Service)

- Disable separate MGS and deploy MGT on MDT
- FID location cannot be used for the MGS, since the MGS maintains system status and all nodes must have consensus on its location to prevent split-brain:
  * Reserve FID_SEQ_MDTXXXX (from 0x10000 to 0x1ffff) for sequences allocated to MDT0 through MDTffff
  * Use FID_SEQ_MDTXXXX for configuration files, reserving OIDs for them (e.g., <FSNAME>-MDT0000 to <FSNAME>-MDTFFFF, <FSNAME>-client, nodemap, etc.)
  * Replicate configuration files from MDT0 to MDT min(7, LARGEST_MDT_INDEX)
  * MDT0 serves as the MGS leader by default. Administrators can promote another MDT to leader via 'lctl --device <MGS_DEV> promote'. However, if MDT0 is still active, the promotion may not take effect unless MDT0 is explicitly demoted via 'lctl --device <MGS_DEV> demote' on MDT0
  * Upon startup, or when the current MGS fails, all nodes iterate through MDT0 to MDT7 to find the leader (sketched after this list). If a non-leader MGS receives a request, it rejects it with -EXDEV
- Position MGT above LOD to enable writing configuration files using distributed transactions
- System recovery is not possible if MDT0 through MDT7 fail simultaneously
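
A minimal sketch of the leader discovery described above, assuming hypothetical names (mgs_query_leader(), struct client_node) that are not existing Lustre interfaces; the only behaviour taken from the design is that nodes probe MDT0 through MDT7 in order and that a non-leader MGS rejects requests with -EXDEV.

    /*
     * Sketch only: probe MDT0..MDT7 in order; -EXDEV means "this copy is
     * not the leader, keep looking"; other errors mean the MDT is down or
     * unreachable, so try the next candidate.
     */
    #include <errno.h>

    #define MGS_COPY_MAX 8              /* MDT0 .. MDT7 may host MGS copies */

    struct client_node;                                 /* opaque, assumed */
    int mgs_query_leader(struct client_node *node, int mdt_idx); /* assumed RPC helper */

    static int mgs_find_leader(struct client_node *node, int *leader_idx)
    {
            int idx;
            int rc = -ENODEV;

            for (idx = 0; idx < MGS_COPY_MAX; idx++) {
                    rc = mgs_query_leader(node, idx);
                    if (rc == 0) {          /* this MDT hosts the leader MGS */
                            *leader_idx = idx;
                            return 0;
                    }
                    if (rc == -EXDEV)       /* MGS copy exists but is not the leader */
                            continue;
                    /* MDT down or unreachable: try the next candidate */
            }
            return rc;
    }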

2. Fault-Tolerant FLD Service

- Since locating replicas by FID requires the FLD itself, FID location cannot be used for the FLD service:
  * Replicate FLDB from MDT0 to MDT min(7, LARGEST_MDT_INDEX)
- Distributed transactions cannot handle objects with local FID:
  * FLDB object FID is [FID_SEQ_MDTX:FID_SEQ_CTL_OID:0]
  * Position FLD above LOD to write FLDB using distributed transactions
- The MDT hosting the leader MGS acts as the FLD service leader

3. Fault-Tolerant Quota/Flock/BFL Service

- Implement the same approach as for the FLD service
- Position QMT above LOD to write quota files using distributed transactions

4. File Metadata Redundancy

- FID Structure Enhancement:
  * Reserve the first 4 bits of the FID sequence for the replica ID (see the sketch after this list)
  * Support up to 16 replicas per file
  * Example FID structure for 3 replicas:
    . R0: [0x200000401:0x1:0x0]
    . R1: [0x1000000200000401:0x1:0x0]
    . R2: [0x2000000200000401:0x1:0x0]
  * Each replica directory tree operates independently
  * Replicas do not need to store FIDs of other replicas
- lu_seq_range Structure Changes:
  * Extend struct lu_seq_range to include 16 replica indices and replica count:
    . struct lu_seq_range {
        __u64 lsr_start;
        __u64 lsr_end;
        __u32 lsr_flags;
        __u32 lsr_count;
        __u32 lsr_index[16];
      };
  * Target MDT allocates sequences for all replicas in a single request
- Rebuild Process:
  * The leader MGS acts as the rebuild coordinator
  * The leader MGS (also FLD service leader) relocates sequences from the failed MDT to other MDTs based on instance and space usage
  * The leader MGS backs up the FLDB during rebuild: relocated sequences are updated in the new FLDB, while the old copy is used to determine whether a sequence was on the failed MDT. The old FLDB is removed after rebuild completes
  * MDTs query FLD for relocated sequences, scan local files for largest OIDs of relocated sequences, and send them to target MDTs. Target MDTs recreate lost objects by reading from their replicas
  * If object recreation fails (from all replicas) with -ENOENT or -ESTALE, it is skipped and rebuild continues. Other errors halt rebuild and notify the leader. If some MDTs complete rebuild with errors, the system enters 'ERROR' status
  * In 'ERROR' status, administrators can reinitiate rebuild via 'lctl --device <MGS_DEV> rebuild' on the leader MGS
  * Multiple MDT removals follow the same process. MDT removal during rebuild won't interrupt the current process but will trigger a new rebuild afterward
  * Adding new MDTs won't trigger rebuild since no objects are lost, but the new MDT will be preferred for new sequence allocation
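
The replica-ID packing under 'FID Structure Enhancement' above can be illustrated with a small self-contained sketch. The struct and helper names (mr_fid, mr_replica_fid(), mr_replica_idx()) are hypothetical; only the bit layout (top 4 bits of the 64-bit sequence select one of up to 16 replicas) comes from the design, and the program reproduces the three example FIDs listed above.

    /*
     * Illustrative sketch: pack the replica ID into the top 4 bits of the
     * 64-bit FID sequence.  Helper names are hypothetical, not Lustre APIs.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define MR_REPLICA_BITS  4
    #define MR_REPLICA_SHIFT (64 - MR_REPLICA_BITS)
    #define MR_REPLICA_MAX   (1 << MR_REPLICA_BITS)   /* up to 16 replicas */

    struct mr_fid {
            uint64_t f_seq;     /* sequence, top 4 bits carry the replica ID */
            uint32_t f_oid;     /* object ID within the sequence */
            uint32_t f_ver;     /* version */
    };

    /* Build the FID of replica 'idx' from the R0 FID. */
    static struct mr_fid mr_replica_fid(struct mr_fid r0, unsigned int idx)
    {
            struct mr_fid fid = r0;

            fid.f_seq |= (uint64_t)idx << MR_REPLICA_SHIFT;
            return fid;
    }

    /* Recover the replica index from any replica FID. */
    static unsigned int mr_replica_idx(struct mr_fid fid)
    {
            return (unsigned int)(fid.f_seq >> MR_REPLICA_SHIFT);
    }

    int main(void)
    {
            struct mr_fid r0 = { 0x200000401ULL, 0x1, 0x0 };
            unsigned int i;

            /* Prints [0x200000401:0x1:0x0], [0x1000000200000401:0x1:0x0]
             * and [0x2000000200000401:0x1:0x0], matching the example above. */
            for (i = 0; i < 3; i++) {
                    struct mr_fid fid = mr_replica_fid(r0, i);

                    printf("R%u: [0x%llx:0x%x:0x%x]\n", mr_replica_idx(fid),
                           (unsigned long long)fid.f_seq, fid.f_oid, fid.f_ver);
            }
            return 0;
    }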

Use Cases

Upgrade MDS

1. Bind MGS with MDT0
2. MDT0 updates the FIDs of the FLDB, configuration, and quota files
3. MDT1 to MDT7 create the FLDB, configuration, and quota files, and update their contents upon connection to MDT0
4. All nodes enqueue a PR lock on the system status to MDT0
5. MDT0 replies with status 'OLD'

Enabling Metadata Redundancy On An Existing System

1. Set the replica count via 'lctl conf_param <fsname>.sys.mr_count=<REPLICATE_COUNT>' on MDT0
2. MDT0 allocates replica sequences for existing sequences by MDT instances
3. MDT0 sets system status to 'NEW', and revokes status locks
4. All MDTs start to rebuild and notify MDT0 when rebuild finishes
5. After all MDTs finish rebuild, MDT0 sets system status to 'NORMAL', and revokes status locks
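
For reference, the system statuses that appear in this and the following use cases ('OLD', 'NEW', 'REBUILD', 'NORMAL', plus the 'ERROR' state described under the rebuild process) could be represented as below. The enum and its comments are a hypothetical sketch; the document only names the states and their transitions, which nodes learn from the reply to the system status PR lock and which are announced by revoking that lock.

    /* Hypothetical sketch of the system statuses named in this document. */
    enum mr_sys_status {
            MR_STATUS_OLD,          /* upgraded system, redundancy not yet enabled */
            MR_STATUS_NEW,          /* mr_count set, initial replication/rebuild pending */
            MR_STATUS_REBUILD,      /* rebuild in progress after MDT removal */
            MR_STATUS_NORMAL,       /* replicas consistent, clients access R0 only */
            MR_STATUS_ERROR,        /* some MDTs finished rebuild with errors; admin may
                                     * rerun 'lctl --device <MGS_DEV> rebuild' */
    };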

Create

1. Target MDT allocates a FID for R0 and derives the other replicas' FIDs by packing the replica ID into it
2. If a new sequence is allocated, replica sequences are allocated as well
3. A distributed transaction ensures atomic creation
4. Transparent to clients: clients know only R0 in normal mode

Access

1. Transparent to clients: clients know only R0 in normal mode

Delete MDT

1. Shut down the failed MDT (e.g. MDT0) if it is still up
2. Admin promotes MDT2 (it could be any MDT from MDT1 to MDT7) to be MGS leader via 'lctl --device <MGS_DEV> promote' on MDT2
3. Admin initiates MDT0 removal via 'lctl --device <MDT0_DEV> delete' on MDT2
4. MDT2 scans FLDB and relocates sequences allocated to MDT0
5. MDT2 sets system status to 'REBUILD', and revokes status locks
6. All MDTs re-enqueue system status locks on MDT2, learn from the reply that MDT0 has been deleted, and start to rebuild the files that were on MDT0
7. System returns to normal mode after rebuild
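
A heavily hedged sketch of the target-MDT side of the rebuild triggered in steps 4-7: for each sequence relocated to this MDT, walk the OIDs up to the largest OID reported by the surviving MDTs and recreate each lost object from one of its replicas, skipping objects that return -ENOENT or -ESTALE from every replica and aborting on any other error, which is then reported to the leader MGS. All names here (relocated_seq, recreate_from_replicas()) are assumptions, not existing interfaces.

    /*
     * Sketch only: rebuild loop on a target MDT.  recreate_from_replicas()
     * is an assumed helper that reads an object from a surviving replica
     * and recreates it locally.
     */
    #include <errno.h>
    #include <stdint.h>

    struct relocated_seq {
            uint64_t rs_seq;        /* sequence relocated to this MDT */
            uint32_t rs_max_oid;    /* largest OID reported for rs_seq */
    };

    int recreate_from_replicas(uint64_t seq, uint32_t oid);     /* assumed */

    static int mdt_rebuild_sequences(const struct relocated_seq *seqs, int nr)
    {
            uint32_t oid;
            int i, rc;

            for (i = 0; i < nr; i++) {
                    for (oid = 1; oid <= seqs[i].rs_max_oid; oid++) {
                            rc = recreate_from_replicas(seqs[i].rs_seq, oid);
                            if (rc == -ENOENT || rc == -ESTALE)
                                    continue;   /* object gone on all replicas: skip */
                            if (rc < 0)
                                    return rc;  /* halt rebuild, notify the leader MGS */
                    }
            }
            return 0;                           /* report completion to the leader */
    }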

Access File During Rebuild

1. As above, MDT0 is removed and MDT2 is promoted to MGS leader
2. Client finds MDT2 is the MGS leader, and learns from the reply that the system status is 'REBUILD'
3. Client revalidates '/' with FID [0x200000007:0x1:0x0] and finds it is located on MDT0, which has been removed; it then looks up the R1 FID [0x1000000200000007:0x1:0x0] of '/' to locate its MDT, and revalidates '/' with the R1 FID there
4. Pathname components are resolved one by one; if an intermediate directory is located on MDT0, the lookup is retried with another replica FID, and the operation is finally sent to the MDT where one of the file's replicas is located (see the sketch below)
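
A sketch of the client-side fallback in steps 3-4, assuming hypothetical helpers mdt_of_fid(), mdt_is_removed() and lookup_on_mdt(); the behaviour taken from the text is that when a component's R0 FID maps to the removed MDT, the lookup is retried with the next replica FID, whose sequence carries the replica ID in its top 4 bits as sketched earlier.

    /* Sketch only: retry a pathname component lookup with replica FIDs. */
    #include <errno.h>
    #include <stdint.h>

    struct mr_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

    int mdt_of_fid(struct mr_fid fid);                  /* FLD lookup, assumed */
    int mdt_is_removed(int mdt_idx);                    /* from system status, assumed */
    int lookup_on_mdt(int mdt_idx, struct mr_fid fid);  /* send lookup RPC, assumed */

    static int revalidate_component(struct mr_fid r0, unsigned int nr_replicas)
    {
            unsigned int i;

            for (i = 0; i < nr_replicas; i++) {
                    struct mr_fid fid = r0;
                    int mdt;

                    fid.f_seq |= (uint64_t)i << 60;     /* FID of replica i */
                    mdt = mdt_of_fid(fid);
                    if (mdt < 0 || mdt_is_removed(mdt))
                            continue;                   /* replica unreachable, try next */
                    return lookup_on_mdt(mdt, fid);     /* operation goes to this MDT */
            }
            return -EIO;                                /* no reachable replica found */
    }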

Access File After Rebuild

1. All MDTs notice the system status is 'NORMAL' and revoke all ldlm locks held by clients
2. Client revalidates '/' with FID [0x200000007:0x1:0x0], which is no longer located on MDT0
3. File pathname resolution and operation handling is the same as before, i.e. client only needs to access R0


Implementation Phases

Phase 1: Fault-Tolerant Services

- Implement fault-tolerant services for MGS, FLD, quota, flock, and BFL
- Support service file upgrades
- Add commands for MGS promotion/demotion

Phase 2: File Replication

- Implement replicated file creation and modifications

Phase 3: Rebuild Framework

- Add commands for failed MDT removal and rebuild initiation
- Add commands to query leader MGS location and system status
- Develop rebuild coordinator in MGS
- Implement file scanning and recreation logic
- Implement FLD rebuild support


Issues & Risks

Compatibility:

- No downgrade path to non-redundant versions

Performance Impact:

- Write operations require distributed transactions
- Rebuild process may be time-intensive

Complexity:

- Fault-tolerant services require extensive code changes in MGS, FLD, and quota modules
- Rebuild complexity is comparable to LFSCK
- DoM file replication and rebuild require special handling


Future Enhancements

- Support per-directory replica count
  * Add 'mr_count' field in default LMV, inherited through 'max-mr-inherit'
  * A default LMV on ROOT is straightforward to handle, but if it is set deeper in the hierarchy, the extra replicas become dangling and will be placed under 'MR_PARENT_DIR'
- Support per-file replica count
  * Consider necessity - files could be placed in directories with appropriate 'mr_count'
- Enhance distributed transaction performance
- Enhance rebuild performance
- Implement monitoring and reporting tools
- Add automatic failure detection
