Immediate Mirror Design Sketch

h1. Immediate Mirroring Design Sketch

h2. Summary
This page gives a brief overview of the immediate mirroring design: writer lock acquisition, layout transitions, mirror staling and recovery, and client-side write duplication.

h2. Writer Lock Acquisition
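Before any write or write-like operation (mmap write, setattr→truncate, etc.), the client must acquire an active writer lock. The lock is taken on a dedicated MDS inodebits (IBITS) lock bit called ACTIVE_WRITERS.
* The lock is taken in CW (concurrent write) mode, so several clients can hold it on the same file at once.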
* The MDS uses it to track active writers on a file.

h3. Active Writer Lock Lifetimes {anchor:ActiveWriterLockLifetimes}
* Clients may cache the lock for a short duration (e.g., 1–5 seconds) to batch multiple writes without repeated MDS requests.
* The lock is released only after all data written under it is durably committed to OSTs.
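
The two lifetime rules above can be combined into a single release check. The following is a minimal sketch, not the actual client code: the class {{CachedWriterLock}}, its fields, and the 2-second hold time are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedWriterLock:
    """Hypothetical client-side record of a cached active writer lock."""

    hold_seconds: float = 2.0  # assumed cache window, inside the 1-5 s range
    last_write: float = 0.0    # time of the most recent write under the lock
    uncommitted: int = 0       # writes not yet durably committed to OSTs

    def note_write(self, now: float) -> None:
        # A new write restarts the cache window and adds pending data.
        self.last_write = now
        self.uncommitted += 1

    def note_commit(self) -> None:
        # One outstanding write has been committed to the OSTs.
        if self.uncommitted > 0:
            self.uncommitted -= 1

    def can_release(self, now: float) -> bool:
        # Release only when nothing is pending AND the cache window has expired.
        idle = now - self.last_write >= self.hold_seconds
        return self.uncommitted == 0 and idle
```

With this shape, a burst of writes inside the hold window reuses one lock, and an uncommitted write blocks release no matter how long the lock has been idle.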

h2. Layout Handling
Once the active writer lock is held, layout handling proceeds as follows:

# If the layout is RDONLY:
#* The client sends a write intent to the MDS.
#* The MDS transitions the layout to WRITE_PENDING and stales all but one replica.
#* See [Staling During Writes & Primary Replica Selection|#StalingDuringWrites].
# All other layout states follow normal behavior.
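
The write-intent transition can be sketched as a small pure function. This is an illustration under stated assumptions: {{LayoutState}}, {{handle_write_intent}}, and the per-mirror boolean stale list are hypothetical names, and the primary is assumed to be chosen by the caller.

```python
from enum import Enum


class LayoutState(Enum):
    RDONLY = "RDONLY"
    WRITE_PENDING = "WRITE_PENDING"


def handle_write_intent(state, stale, primary):
    """Return (new_state, new_stale_flags) after a write intent.

    stale is a list of booleans, one per mirror; primary is the index of
    the replica that stays readable during writes.
    """
    if state is not LayoutState.RDONLY:
        # Other layout states follow normal behavior: nothing changes here.
        return state, list(stale)
    if stale[primary]:
        raise ValueError("primary must be an in-sync mirror")
    # Stale every mirror except the primary.
    new_stale = [index != primary for index in range(len(stale))]
    return LayoutState.WRITE_PENDING, new_stale
```

For three in-sync mirrors with mirror 1 as primary, the result is WRITE_PENDING with mirrors 0 and 2 stale.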

h3. Staling During Writes & Primary Replica Selection {anchor:StalingDuringWrites}
During writes, a single “primary” replica stays readable so that all clients see consistent data.
* Clients write to every mirror, but reads are served only from the primary, so temporary divergence between mirrors is never visible and no distributed transactions are needed.
* Secondary replicas remain stale until writes complete.

h2. Write Completion and Lock Release
When a write completes (data committed to OSTs):
* Client releases the active writer lock (or holds briefly for reuse).
* Clients report successes and failures in the LDLM_CANCEL reply, aggregating the results of all writes performed under that lock.
* Once the MDS sees no remaining active writer locks on the file, it transitions the layout back to RDONLY and unstales only the mirrors with no reported errors.
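
The completion flow above amounts to counting active writer locks and accumulating failure reports until the count reaches zero. Below is a minimal sketch of that bookkeeping; {{FileWriteTracker}} and its methods are hypothetical and do not correspond to existing MDS code.

```python
class FileWriteTracker:
    """Hypothetical MDS-side bookkeeping for one file in WRITE_PENDING."""

    def __init__(self, mirror_count):
        self.mirror_count = mirror_count
        self.active_locks = 0
        self.failed = set()          # mirrors with at least one reported error
        self.layout = "WRITE_PENDING"

    def lock_granted(self):
        self.active_locks += 1

    def lock_cancelled(self, failed_mirrors):
        """Record one cancel report; return mirrors to unstale, or None.

        None means other writers are still active, so nothing changes yet.
        """
        self.failed |= set(failed_mirrors)
        self.active_locks -= 1
        if self.active_locks > 0:
            return None
        # Last writer is gone: return to RDONLY, unstale only clean mirrors.
        self.layout = "RDONLY"
        return sorted(set(range(self.mirror_count)) - self.failed)
```

An empty result list corresponds to the case where every mirror reported an error, which this sketch leaves to separate recovery handling.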

h2. Error Handling and Mirror State
* If some mirrors fail during writes, MDS unstales only those without errors.
** E.g., with three mirrors and three writes: if any write to mirror 0 or mirror 1 fails, only mirror 2 is unstaled; mirrors 0 and 1 are marked errored.
* Userspace may attempt resync on errored mirrors or provision new ones.
* If all mirrors fail, MDS must determine reachable mirrors or take alternate recovery steps.
* Introduce a new “WRITE_ERRORS” (or “WRITE_INCONSISTENT”) flag to distinguish error-staled mirrors from ongoing-write-staled ones.
* Notify monitoring tools (e.g., Lamigo) when a mirror becomes stale due to error.
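
The three-mirror example and the proposed flag can be worked through in a few lines. This is only a sketch: the flag name WRITE_ERRORS comes from the proposal above, while {{MirrorFlag}}, {{flags_after_writes}}, and the flag values are assumptions made for illustration.

```python
from enum import Flag, auto


class MirrorFlag(Flag):
    NONE = 0
    STALE = auto()
    WRITE_ERRORS = auto()  # proposed name; the numeric value is arbitrary


def flags_after_writes(write_results):
    """Compute per-mirror flags once all writes have completed.

    write_results holds one list per write, with True where the write to
    that mirror succeeded.
    """
    mirror_count = len(write_results[0])
    flags = []
    for mirror in range(mirror_count):
        ok = all(result[mirror] for result in write_results)
        if ok:
            flags.append(MirrorFlag.NONE)  # unstaled
        else:
            # Stays stale, and is marked as stale because of an error.
            flags.append(MirrorFlag.STALE | MirrorFlag.WRITE_ERRORS)
    return flags
```

A mirror that is stale only because a write is still in progress would carry STALE without WRITE_ERRORS, which is the distinction the new flag is meant to provide.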

h3. MDS Eviction Handling
If a client holding an active writer lock is evicted:
* The MDS cannot safely unstale any mirrors, because the evicted client never delivers its per-mirror results. It assumes that writes to the primary succeeded.
* On next write intent, a new primary is selected if the former primary failed.
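
One possible shape for the primary selection on the next write intent is sketched below. The text does not specify how a replacement is chosen, so picking the lowest-numbered mirror not known to have failed is purely an assumption, as is the function name {{select_primary}}.

```python
def select_primary(current_primary, failed, mirror_count):
    """Keep the current primary unless it is known to have failed.

    failed is a set of mirror indices known to be bad. The replacement
    policy (lowest index not in failed) is an assumption of this sketch.
    """
    if current_primary not in failed:
        return current_primary
    for mirror in range(mirror_count):
        if mirror not in failed:
            return mirror
    return None  # no usable mirror; recovery is outside this sketch
```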

h2. Client Implementation Details
* Duplicate all writes/write-like operations to all mirrors (CLIO modifications needed).
** DIO: straightforward; duplicate the IO requests.
** Buffered IO: complex; must avoid redundant page-cache operations.
* Fail write-like ops quickly upon mirror errors (similar to read failures).
* Record per-mirror errors to report in cancellation replies.
* Ensure the active writer lock persists until all IO is durably committed.
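
The duplication and error-recording steps can be illustrated with an in-memory model. This is a sketch only and says nothing about the real CLIO changes: {{MemoryMirror}} and {{mirrored_write}} are hypothetical stand-ins, and a simulated {{OSError}} takes the place of an OST failure.

```python
class MemoryMirror:
    """Hypothetical in-memory stand-in for one mirror of a file."""

    def __init__(self, size, fail=False):
        self.buf = bytearray(size)
        self.fail = fail

    def write(self, offset, data):
        if self.fail:
            raise OSError("simulated OST error")
        self.buf[offset:offset + len(data)] = data


def mirrored_write(mirrors, offset, data):
    """Apply one write to every mirror; return {mirror_index: error_text}.

    A failing mirror is not retried (fail fast); the error is recorded so
    it can be reported when the active writer lock is cancelled.
    """
    errors = {}
    for index, mirror in enumerate(mirrors):
        try:
            mirror.write(offset, data)
        except OSError as exc:
            errors[index] = str(exc)
    return errors
```

The returned dictionary plays the role of the per-mirror error record: merging the dictionaries from every write under one lock gives the set of mirrors to report as failed.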

h2. Internal Links
* [Active Writer Lock Lifetimes|#ActiveWriterLockLifetimes]
* [Staling During Writes & Primary Replica Selection|#StalingDuringWrites]
