h1. Immediate Mirroring Design Sketch

h2. Summary
A brief overview of the immediate mirroring design, covering writer lock acquisition, layout transitions, mirror staling and recovery, and client-side duplication logic.

h2. Writer Lock Acquisition
Before any write or write-like operation (mmap write, setattr→truncate, etc.), the client must acquire an active writer lock on a dedicated MDS IBITS bit called ACTIVE_WRITERS.
* The lock is taken in CW (concurrent write) mode, so several clients can hold it on the same file at once.
* The MDS uses it to track which clients are actively writing to a file.

h3. Active Writer Lock Lifetimes {anchor:ActiveWriterLockLifetimes}
* Clients may cache the lock for a short duration (e.g., 1–5 seconds) to batch multiple writes without repeated MDS requests.  
* A lock is released only after the data written under it is durably committed to the OSTs.
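
The two lifetime rules above can be sketched as a small client-side check. This is a minimal illustration, not the actual client code: the class name, its fields, and the 2-second cache window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ActiveWriterLock:
    """Hypothetical client-side view of a cached ACTIVE_WRITERS lock."""

    cache_seconds: float = 2.0  # assumed value inside the 1-5 second range
    last_write: float = 0.0     # time of the most recent write under the lock
    uncommitted: int = 0        # writes not yet durably committed to OSTs

    def write_started(self, now: float) -> None:
        self.last_write = now
        self.uncommitted += 1

    def write_committed(self) -> None:
        self.uncommitted -= 1

    def may_release(self, now: float) -> bool:
        # Release only when all data is committed AND the cache window
        # since the last write has expired.
        idle = now - self.last_write
        return self.uncommitted == 0 and idle >= self.cache_seconds
```

Passing the time in explicitly keeps the sketch deterministic. A real client would tie the commit check to OST commit callbacks rather than a counter.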

h2. Layout Handling
Once the active writer lock is held, layout handling proceeds as follows:

# If the layout is RDONLY:
#* The client sends a write intent to the MDS.
#* The MDS transitions the layout to WRITE_PENDING and stales all but one replica.
#* See [Staling During Writes & Primary Replica Selection|#StalingDuringWrites].
# All other layout types follow normal behavior.

h3. Staling During Writes & Primary Replica Selection {anchor:StalingDuringWrites}
We keep a single “primary” replica readable during writes so that all clients read from the same copy.
* Clients still write to every mirror, which keeps the replicas from diverging without requiring distributed transactions.
* Secondary replicas remain stale until the writes complete.
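
The write-intent transition described above can be sketched as a pure function. This is an illustration under stated assumptions: the function name, the dictionary representation of stale flags, and the rule that the primary is passed in by the caller are all hypothetical, since this page does not specify how the primary is chosen.

```python
from enum import Enum


class LayoutState(Enum):
    RDONLY = "RDONLY"
    WRITE_PENDING = "WRITE_PENDING"


def handle_write_intent(state, stale, primary):
    """Return the (state, stale) pair after the MDS handles a write intent.

    `stale` maps mirror id -> bool. `primary` is the mirror kept readable.
    """
    if state is not LayoutState.RDONLY:
        # Other layout states follow normal behavior: nothing changes here.
        return state, dict(stale)
    if stale[primary]:
        raise ValueError("primary must be a mirror that is not already stale")
    # Stale every mirror except the chosen primary.
    new_stale = {mirror: mirror != primary for mirror in stale}
    return LayoutState.WRITE_PENDING, new_stale
```

For three clean mirrors and `primary=1`, the result is WRITE_PENDING with mirrors 0 and 2 stale and mirror 1 still readable.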

h2. Write Completion and Lock Release
When a write completes (data committed to OSTs):
* The client releases the active writer lock (or holds it briefly for reuse).
* The client reports per-mirror successes and failures in the LDLM_CANCEL reply, aggregating the results of all writes performed under that lock.
* Once the MDS sees no remaining active writer locks, it transitions the layout back to RDONLY and unstales only the mirrors that were written successfully.
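
As a rough sketch of the completion flow above: the client collapses all write results under one lock into a single per-mirror report, and the MDS returns to RDONLY only when the last lock is cancelled. The names `aggregate_results` and `MdsWriterTracker` are hypothetical, and the report format (mirror id mapped to a boolean) is an assumption.

```python
def aggregate_results(writes):
    """Client side: merge per-write results into one per-mirror report.

    `writes` is a list of dicts mapping mirror id -> True (ok) / False (error).
    A mirror is reported ok only if every write under the lock succeeded.
    """
    report = {}
    for result in writes:
        for mirror, ok in result.items():
            report[mirror] = report.get(mirror, True) and ok
    return report


class MdsWriterTracker:
    """Hypothetical MDS-side bookkeeping for one file in WRITE_PENDING."""

    def __init__(self):
        self.layout = "WRITE_PENDING"
        self.active_locks = 0
        self.failed = set()  # mirrors reported as failed by any client

    def grant(self):
        self.active_locks += 1

    def cancel(self, report):
        self.failed |= {m for m, ok in report.items() if not ok}
        self.active_locks -= 1
        if self.active_locks == 0:
            # No writers left: the layout can return to RDONLY, and only
            # mirrors outside `self.failed` are eligible to be unstaled.
            self.layout = "RDONLY"
```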

h2. Error Handling and Mirror State
* If some mirrors fail during writes, MDS unstales only those without errors.  
** E.g., with three mirrors and three writes: if mirrors 0 and 1 error, only mirror 2 is unstaled; 0 and 1 are marked errored.  
* Userspace may attempt resync on errored mirrors or provision new ones.  
* If all mirrors fail, MDS must determine reachable mirrors or take alternate recovery steps.  
* Introduce a new “WRITE_ERRORS” (or “WRITE_INCONSISTENT”) flag to distinguish error-staled mirrors from ongoing-write-staled ones.  
* Notify monitoring tools (e.g., Lamigo) when a mirror becomes stale due to error.
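
The three-mirror example above can be worked through in a short sketch. The flag names mirror the proposal on this page, but the `MirrorFlag` type, the helper functions, and the choice to model an error-staled mirror as STALE combined with WRITE_ERRORS are assumptions made for illustration.

```python
from enum import Flag, auto


class MirrorFlag(Flag):
    NONE = 0
    STALE = auto()         # staled because a write is (or was) in progress
    WRITE_ERRORS = auto()  # proposed flag: staled because a write failed


def settle_mirrors(mirrors, failed):
    """Compute mirror flags once the last active writer lock is cancelled."""
    flags = {}
    for mirror in mirrors:
        if mirror in failed:
            # Stays stale, and is marked so tools can tell why.
            flags[mirror] = MirrorFlag.STALE | MirrorFlag.WRITE_ERRORS
        else:
            flags[mirror] = MirrorFlag.NONE  # unstaled
    return flags


def mirrors_to_report(flags):
    """Mirrors a monitoring tool would need to resync or replace."""
    return sorted(m for m, f in flags.items() if MirrorFlag.WRITE_ERRORS in f)
```

With mirrors 0, 1 and 2, and errors on 0 and 1, only mirror 2 ends up with no flags set, and `mirrors_to_report` returns `[0, 1]`.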

h3. MDS Eviction Handling
If a client holding an active writer lock is evicted:
* The MDS cannot safely unstale the secondary mirrors, because it never receives the evicted client's per-mirror results. It assumes only that writes to the primary succeeded.
* On the next write intent, a new primary is selected if the former primary failed.

h2. Client Implementation Details
* Duplicate all writes and write-like operations to all mirrors (requires CLIO modifications).
** DIO: straightforward; duplicate the IO requests.
** Buffered IO: complex; must avoid redundant page-cache operations.
* Fail write-like operations quickly on mirror errors (similar to read failures).
* Record per-mirror errors so they can be reported in the cancellation reply.
* Ensure the active writer lock is held until all IO is durably committed.
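
A minimal sketch of the duplication and error-recording logic listed above, using plain callables as stand-ins for the real IO path. The function name and the policy of skipping a mirror after its first error are assumptions; the latter is one possible reading of "fail quickly".

```python
def mirrored_write(mirrors, data, errors):
    """Send `data` to every mirror and record per-mirror failures.

    `mirrors` maps mirror id -> callable that performs the write.
    `errors` accumulates mirror id -> error message across writes, so it
    can later be sent in the lock cancellation reply.
    Returns True if at least one mirror has no recorded error.
    """
    for mirror_id, write in mirrors.items():
        if mirror_id in errors:
            continue  # assumed fail-fast policy: skip mirrors that errored
        try:
            write(data)
        except OSError as exc:
            errors[mirror_id] = str(exc)
    return len(errors) < len(mirrors)
```

Keeping `errors` outside the function lets one dictionary span every write issued under the same active writer lock.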

h2. Internal Links
* [Active Writer Lock Lifetimes|#ActiveWriterLockLifetimes]
* [Staling During Writes & Primary Replica Selection|#StalingDuringWrites]
