In the current implementation of FLR (introduced in Lustre 2.11), the system uses a "delayed write" approach for maintaining file mirrors. This means:
Writing to Mirrored Files: When a client writes to a mirrored file, only one primary (preferred) mirror is updated directly during the write operation. The other mirrors are simply marked as "stale" to indicate they're out of sync with the primary mirror.
Today, we have manual synchronization: after a write, the lfs mirror resync command must be run to synchronize the stale mirrors with the primary mirror. This command copies data from the in-sync mirror to the stale mirrors and removes the stale flag from successfully copied mirrors.
Layout state and staleness are managed through a careful series of layout state changes, which are described in the File-level replication state machine section of CoreDesignConcept in this document.
This delayed write approach was implemented in the first phase of FLR to avoid the complexity of maintaining consistency across multiple mirrors during concurrent writes. By updating only one mirror during writes and marking others as stale, the system maintains a consistent view of the file data, at the cost of requiring explicit synchronization after writes complete.
The current implementation does not provide immediate redundancy, since only a single mirror is up to date until an explicit resync operation is performed. This approach enables things like hot pools synchronization, but the lack of immediate write redundancy severely limits the use cases.
We have a requirement to do immediate mirroring on Lustre files, where all writes (and related ops) are replicated to multiple mirrors immediately. The infrastructure created for this will also be used for immediate erasure coding.
The goal is to have redundancy immediately, but it is not required to have all mirrors available during writes - while writes are in flight, it is not possible to guarantee all mirrors will provide a consistent state without MVCC (multi-version concurrency control). With MVCC, it is possible to solve this problem by specifying a particular 'stable' version for reads during a write, but Lustre's storage backends (ldiskfs, ZFS) do not provide this functionality. Instead, during writes, only a single mirror will be available for reads.
The core idea of the design is this:
Clients will inform the MDS when they are writing (take an "active writer" lock), and the MDS will "unseal" the layout, allowing writes (FLR Write Pending state). The clients will send all writes to all online mirrors. A single primary mirror will be selected for reads during this time, since we cannot guarantee all mirrors are identical during writes. This mirror is the "write leader", which will also be used for write ordering by taking locks on this mirror first. Other mirrors are marked stale during writes. (If reads were not restricted to a single mirror this way, different clients could see differing file contents, which is unacceptable.)
Clients will hold the active writer lock until they have written all data to disk and possibly for slightly longer to allow for reuse. If a client experiences a write error, it will finish all writes currently in flight (syncing to disk), then return the active writer lock to the MDS, with information about the write failure (primarily which mirror failed, but also the write extent). On receipt of an error, the MDS will request the active writer lock in EX mode, which forces all clients to flush any existing writes and not start new ones until the lock can be re-acquired. If no error occurs, the clients will return all active writer locks to the MDS shortly after they have completed writing, then the MDS will take the active writer lock to ensure no further writes from clients.
In either case, once the MDS has the active writer lock granted, it will begin transitioning the layout back to RDONLY. Note/TODO: The MDS must do something about evicted clients to ensure they don't write stale data, as they may have writes inflight and will not receive the active writer lock cancellation. This can probably be modeled on the approach taken by mirror resync - if nothing else, the MDS can take data extent locks on the entire file to force flushing. We could probably only do this on eviction, so it would not be too painful. If no write errors were reported, the MDS can simply un-stale the secondary mirrors and transition the layout back to RDONLY. If write errors were reported, the MDS will mark the errored mirrors as INCONSISTENT (a special version of stale, which is only cleared by a full data resync), and will notify userspace (probably via changelog) to attempt repair or replacement. INCONSISTENT mirrors will be treated like STALE mirrors, and not used for anything until they have been repaired.
If the primary write mirror becomes unavailable during writes, the clients will inform the metadata server of write errors as normal. The metadata server will handle this the same as any error - the mirror is marked INCONSISTENT. The MDS will then select an in-sync mirror (where no writes failed) as the new primary for writes. If no mirrors completed all writes without error, there is a policy decision to make - we could either try to determine the "least" damaged mirror, or we could simply default to the previous write primary.
If a client is evicted from the MDS, it will be assumed to have failed writes to all mirrors other than the current write leader. This is discussed further in FailureHandling. MDS failure is also discussed there - certain information will have to be persisted to allow failover/recovery of the MDS.
The last unaddressed issue is write ordering on the OSTs. If there are concurrent overlapping writes (or fallocate or truncate operations), the order in which they complete is indeterminate, and since we want to write all mirrors in parallel, the ordering could be different on different mirrors. This would result in mirrors containing different data after a write phase, which is obviously unacceptable.
Initially, we will solve this by using the write leader mirror for locking, and requiring locks always be taken first on this mirror. TODO: handling a write leader that fails in the middle of a write phase needs consideration; it is hopefully just a matter of failing to acquire the lock, or, if the lock has already been acquired and an error then occurs on write, reporting that error to the MDS, which will wind up the write phase.
One major downside of this approach is that it requires us to use LDLM locks for direct IO, which has a significant cost for shared file writes. In order to avoid this, we will need a way to track the order of write (and other data-affecting) operations on OSTs and communicate this information to the MDT at the end of a write phase.
We have a tentative plan for this described in: TODO PUT AT THE END OF THE DOCUMENT
Our plan is to use 'chained' RPC checksums. We will take care to ensure that all write (and related) RPCs to different mirrors are identical - note this requires the mirrors to have identical layout geometry. Then, when a write phase opens and the OST is notified to update the layout generation on the stripe (which is done as part of FLR today), we will inform it that the stripe object is part of an immediate mirror file. The OST will take the write RPC checksum from each write RPC, and 'chain' them together as they are committed (write commits can be ordered by their journal transaction #, even if they are occurring in parallel). The result of this is a single checksum value which encodes both the writes and their ordering. We can include non-write operations in this by checksumming their arguments (eg, the byte range and fallocate op type for fallocate).
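As a rough illustration of the chaining idea (a minimal userspace sketch, not Lustre code; the structure names and the FNV-style fold are assumptions made for illustration), the OST would fold each committed write's RPC checksum into a running chain value in commit order:

```c
/*
 * Minimal userspace model of 'chained' write checksums (illustrative only;
 * names and the FNV-1a-style fold are assumptions, not Lustre code).
 *
 * Each committed write contributes its per-RPC checksum, folded into the
 * chain in commit (journal transaction #) order.  Two mirrors that applied
 * the same writes in the same order end up with the same chain value.
 */
#include <stdint.h>
#include <stdlib.h>

struct committed_write {
	uint64_t cw_transno;	/* journal transaction # of the commit */
	uint32_t cw_cksum;	/* checksum carried in the write RPC */
};

/* Fold one RPC checksum into the chain (FNV-1a-style mix, chosen arbitrarily). */
static uint64_t chain_fold(uint64_t chain, uint32_t cksum)
{
	for (int i = 0; i < 4; i++) {
		chain ^= (cksum >> (8 * i)) & 0xff;
		chain *= 0x100000001b3ULL;
	}
	return chain;
}

static int cmp_transno(const void *a, const void *b)
{
	const struct committed_write *wa = a, *wb = b;

	return (wa->cw_transno > wb->cw_transno) -
	       (wa->cw_transno < wb->cw_transno);
}

/* Chain all committed writes for one stripe object, in commit order. */
uint64_t chain_checksum(struct committed_write *writes, size_t count)
{
	uint64_t chain = 0xcbf29ce484222325ULL;	/* FNV offset basis */

	qsort(writes, count, sizeof(*writes), cmp_transno);
	for (size_t i = 0; i < count; i++)
		chain = chain_fold(chain, writes[i].cw_cksum);
	return chain;
}
```

In practice the OST would fold each checksum as the corresponding transaction commits rather than sorting after the fact; the sort here only makes the ordering explicit in the sketch.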
The write primary mirror will be the 'correct' ordering, to which the others are compared. If ordering on a secondary mirror disagrees with the primary, that mirror will be marked inconsistent (to be repaired later).
We will need to determine whether, and how, we want to apply this specifically to overlapping operations only - that requires tracking the extent of all data-modifying operations as we proceed through a write phase, but it protects us from the case of non-overlapping writes causing an apparent inconsistency where none exists. Whether or not this is needed is an open design question.
This is described in detail in TODO. <-- Also need to cover recovery.
There are a number of recovery scenarios which are covered in more detail below.
Immediate mirroring is separate from FLR Erasure Coding, and can be done partly in parallel. Erasure coding has no specific implications for immediate mirroring and will be covered in a different document.
Before starting any write/OST modifying operation (mmap write, setattr->truncate, etc.), clients will take an active writer lock on a special MDS IBITs bit called ACTIVE_WRITERS. This lock will be a CW lock so multiple clients can hold it at the same time. The MDS will use this lock to determine if there are any active writers on a file.
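Purely for illustration (the bit name and value are assumptions; a real definition would live alongside the existing MDS_INODELOCK_* bits), the new IBITs bit might look like:

```c
/*
 * Hypothetical inodebits lock bit for the Active Writer lock
 * (name and value are illustrative only, not an actual Lustre define).
 *
 * Clients enqueue this bit in CW mode, so any number of writers can hold
 * it concurrently; the MDS enqueues it in EX mode when it needs all
 * writers to flush and stop so the write phase can be closed.
 */
#define MDS_INODELOCK_ACTIVE_WRITERS	0x00002000
```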
Layout handling then proceeds as currently: If the layout is RDONLY, the client will send a write intent to the MDS, which will transition the layout to WRITE_PENDING and stale all replicas except one. See Staling During Writes & Primary Replica Selection.
Figure: FLR Immediate Mirroring State Machine
As noted above, before starting a write (or other OST modifying operation), the client will request an AW lock from the MDS. It holds this lock throughout the writing process until all data has been committed to OST storage. At this point, it can release the lock. (For direct IO, this is immediately after the write completes; for buffered IO writes, which are async, we will have to detect when the pages are committed.)
This is critical - The client will only release the active writer lock once the data is committed to disk on the OSTs, so it is known durable. The client may also keep the lock briefly (1 or 5 seconds) to allow reuse for subsequent writes. See Active Writer Lock Lifetimes.
The success or failure of writes is recorded and communicated to the MDS in the cancellation reply, combining across all writes which occurred under that lock. (NB: If an error occurs, the client must stop using the lock for new writes, but several writes may be in progress.)
When the MDS has no more active writer locks, it will examine the errors reported, and transition the layout back to RDONLY, un-stale-ing mirrors where there were no write errors.
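A minimal sketch of that decision, assuming per-mirror error reports gathered from the cancelled Active Writer locks (the flag and structure names are hypothetical, and the all-mirrors-failed fallback is just one possible policy, per Open Problems):

```c
/*
 * Sketch of the MDS-side decision when the last Active Writer lock is
 * returned (illustrative only; names are hypothetical).
 */
#include <stdbool.h>
#include <stdint.h>

#define MIRROR_FLAG_STALE		0x01
#define MIRROR_FLAG_INCONSISTENT	0x02	/* new flag; implies STALE */

struct mirror_state {
	uint32_t ms_flags;
	bool	 ms_write_error;	/* any error reported for this mirror */
};

/* Close the write phase: un-stale clean mirrors, quarantine errored ones. */
void close_write_phase(struct mirror_state *m, int count, int prev_primary)
{
	bool all_failed = true;

	for (int i = 0; i < count; i++)
		if (!m[i].ms_write_error)
			all_failed = false;

	for (int i = 0; i < count; i++) {
		if (m[i].ms_write_error &&
		    !(all_failed && i == prev_primary)) {
			/* Stays stale until a full data resync clears it. */
			m[i].ms_flags |= MIRROR_FLAG_STALE |
					 MIRROR_FLAG_INCONSISTENT;
		} else if (!m[i].ms_write_error) {
			/* No errors reported: safe to make readable again. */
			m[i].ms_flags &= ~MIRROR_FLAG_STALE;
		}
		/*
		 * If every mirror failed, the previous primary is left
		 * readable rather than quarantined (see Open Problems).
		 */
	}
	/* ...then transition the layout back to RDONLY and notify changelog. */
}
```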
Example Scenario
With one client, if there are three mirrors and three writes, where:
Write 1 errored to mirror 0
Write 2 errored to mirror 1
The client will report errors on mirror 0 and mirror 1 to the MDS as part of AW lock cancellation (see Client Implementation for more). The MDS would then un-stale only mirror 2 and mark the other two mirrors as inconsistent. (Userspace can try to resync to these mirrors, and if this fails, will need to add new mirrors.)
Note: If all the mirrors fail writes during an IO, we will have to decide how to handle this. See Open Problems.
The client will need to duplicate all writes and write-like operations to all of the mirrors. This will involve substantial work on the CLIO side to do correctly.
Note: Ideally we would not have to compress or encrypt data more than once when sending to multiple servers, but this may not be avoidable in the first version.
Direct I/O: This should be relatively straightforward, as DIO does not interact with the page cache and can easily duplicate IO requests (if we do not mind duplicating some work on the client).
Buffered I/O: This will be much trickier, since just repeating the operation would hit the same pages in the page cache. How to handle this is an open question.
The client must hold the active writer lock until all IO has been committed to the OSTs, so it is known-safe on disk.
This applies to DIO & buffered writes, but also to other modifying operations (eg, setattr).
For DIO this is easy - all DIO is immediately committed
For buffered, this will be more work - we'll need to track pages differently than today, associating them with something that lives after the end of the write() syscall so it can track them to full commit on the OSTs.
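One possible shape for that tracking, sketched as a userspace model (the names are hypothetical; in the real client this state would hang off the Active Writer lock and the CLIO structures): a per-lock count of write RPCs sent but not yet committed on the OSTs, so the lock is only released once the count drains to zero.

```c
/*
 * Userspace model of per-Active-Writer-lock commit tracking
 * (illustrative only; names are hypothetical).
 */
#include <stdatomic.h>
#include <stdbool.h>

struct aw_lock_state {
	atomic_long uncommitted;  /* write RPCs sent but not yet committed */
	atomic_bool write_error;  /* a mirror reported a failure */
};

/* Called when a write RPC (to any mirror) is sent. */
static void aw_write_sent(struct aw_lock_state *aw)
{
	atomic_fetch_add(&aw->uncommitted, 1);
}

/* Called when the OST reports the write as committed (or failed). */
static void aw_write_done(struct aw_lock_state *aw, bool error)
{
	if (error)
		atomic_store(&aw->write_error, true);
	atomic_fetch_sub(&aw->uncommitted, 1);
}

/* The AW lock may only be cancelled once nothing is outstanding. */
static bool aw_lock_can_release(struct aw_lock_state *aw)
{
	return atomic_load(&aw->uncommitted) == 0;
}
```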
The client must fail modifying operations quickly, similarly to how it fails reads when there are other mirrors. Today, the client will wait indefinitely attempting to do a write/modification operation, since the only other option is a hard failure.
The client must record all of the errors as appropriate so it can communicate them back to the server in the cancel reply.
This will require a new type of LDLM_CANCEL reply, one which includes the LVB so the client can communicate this info. This will also require crafting the LVB structure to send over to the server with this information. It will also need CLIO changes to collect errors from all types of operations which use the Active Writer lock, since any modification failure on a mirror renders that mirror inconsistent.
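A possible shape for that LVB, purely illustrative (field and structure names are hypothetical, and a real wire structure would need the usual swabbing and versioning treatment):

```c
/*
 * Hypothetical lock value block returned with the Active Writer lock
 * cancel, carrying per-mirror write results (illustrative only).
 */
#include <stdint.h>

#define AW_LVB_MAX_ERRORS	16	/* arbitrary limit for the sketch */

struct aw_mirror_error {
	uint16_t	me_mirror_id;	/* mirror that saw the failure */
	int16_t		me_rc;		/* first/most severe errno */
	uint64_t	me_start;	/* extent of the failed operation */
	uint64_t	me_end;
};

struct aw_cancel_lvb {
	uint32_t		al_layout_gen;	/* layout generation written under */
	uint32_t		al_error_count;
	struct aw_mirror_error	al_errors[AW_LVB_MAX_ERRORS];
};
```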
Before we discuss error conditions, a note on mirror states.
We will have to distinguish between:
Mirrors which are "stale due to ongoing write", which are just stale
Mirrors which are "stale due to error", which I am tentatively calling inconsistent (an inconsistent mirror is always stale as well).
The inconsistent mirrors are excluded when the MDS unstales the mirrors (transitioning the layout back to RDONLY). We will need to add a new mirror flag to identify this, INCONSISTENT or WRITE_INCONSISTENT. INCONSISTENT mirrors are always stale, to benefit from existing handling for stale mirrors, but inconsistent mirrors are also ignored for new writes. The inconsistent flag is only cleared by a complete data resync to that mirror.
We will also want to notify lamigo or similar when a mirror becomes inconsistent so it can attempt to recover state.
Clients will collect all write (or non-write update, eg, setattr) errors, associating them with the Active Writer lock. Note an error can occur during the initial write or while trying to commit a write on the OST. After an error, the client should stop allowing new writes to start under an existing Active Writer lock and return the current lock immediately (ie, no caching). The errors are communicated to the MDT in the lock value block which is sent to the MDT on the cancel. The MDT then marks all mirrors with errors as INCONSISTENT and does not un-stale them.
A client eviction from an OST will cause a write error (either during the first part of the write or when the client requests commit), which is reported to the MDT by the client when it cancels the Active Write lock and is handled just like other write errors. Like any write error, this indicates this particular mirror is now INCONSISTENT and the server must mark it as such to be resolved by a later RESYNC operation from lamigo or other userspace tool.
Critically, if there is an MDT eviction of a client with an active writer lock, the mirrors cannot be unstaled because we cannot know if that client completed writes successfully. In that case, we will assume that client successfully wrote to the primary mirror and mark the other mirrors as inconsistent.
We could also possibly compare the OST commit vectors to check status - if any mirror has "all" the writes, it could be un-stale-ed. See Open Problems for more on the OST commit vector.
Client loss - where a client crashes or similar - is the same as MDT eviction, since such a client is evicted from the MDT and OSTs, and the MDT eviction takes priority over the OST evictions. All mirrors except the primary will be marked INCONSISTENT (or we could possibly use the write commit vectors from the OSTs to do better than this).
In case of a write failure to the primary mirror, the clients will attempt to complete all other writes and inform the MDS as usual. The MDS will then mark the primary mirror stale as part of the transition to RDONLY, and will select a new primary mirror when the next write intent request arrives. It will do this by examining all the write statuses it received from active writers, and finding a mirror which did not experience any errors. See Open Problems for what to do if all mirrors had write failures, and Client Loss in this section for what to do if a client is lost entirely.
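A small sketch of that re-selection (illustrative only; falling back to the previous primary when every mirror reported errors is just one possible policy, per Open Problems):

```c
#include <stdbool.h>

/*
 * Sketch of selecting a new write primary after the old one failed
 * (illustrative only).  Prefer a mirror that reported no write errors;
 * if every mirror reported errors, fall back to the previous primary.
 */
int select_write_primary(const bool *mirror_had_error, int count,
			 int prev_primary)
{
	for (int i = 0; i < count; i++)
		if (!mirror_had_error[i])
			return i;
	return prev_primary;
}
```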
This should work transparently with minimal changes.
We have chosen to keep only one mirror readable while writing is going on, even though we will write to all of the mirrors. This is because all clients must see the same data when they read the file, and we cannot guarantee this with multiple mirrors without a complex transactional approach*.
A write may have arrived at one mirror but not yet at a different one, and two clients reading could see different data. This cannot be avoided unless we implement distributed transactions and rollback, which is not practical. (We could try locking all OSTs together before writing, but direct IO does not use locks and other problems come up in failure cases.)
So we keep a single primary replica readable during writes, even though we always try to update all replicas.
Note: The 'simple' approach would be to lock all mirrors before doing any writing and unlock them together. However, this is not possible for DIO, which does not use LDLM locks on the client, but instead only uses them locally on each separate server.
The active writer lock will be cached on the client so it can be used by multiple writes (so every write does not generate a separate MDS lock request), but the LRU/cache time will be deliberately kept short (e.g., perhaps 1 second or 5 seconds). This should allow continuous writes from one process to work, but will not hold the lock for any length of time otherwise. This is because the presence of Active Writer locks prevents the MDS from considering the secondary mirrors in sync, so it reduces data reliability to cache them for any longer.
When an Active Writer lock is called back, the client will need to generate sync RPCs for all outstanding data for the file and confirm they have completed before it can release the lock. Note this also has to handle the case where some mirrors can't be written - these errors would also need to be sent back to the server.
If we have two clients writing the same region of the file with multiple mirrors, how can we ensure we get the same ordering in every case? If A and B are both writes to the same section of the file, how can we ensure that if A→B happens on one mirror (so B is the final result), we get the same order (A→B, resulting in B) on the other mirrors?
Buffered IO could do this by taking and holding all of the LDLM locks*, but this can't be done for direct IO because direct IO does not use client-side LDLM locks. (This is an important property of direct IO and not something we want to change.)
*Or by clever use of the primary write mirror's LDLM lock, but I think this has issues with primary mirror failure.
One suggestion is having the OSTs maintain a vector of committed writes & extents for Active Writer files. I believe the OSTs would only need to maintain this across a single layout generation. As part of transitioning an immediate mirror back to RDONLY, the server would request the write vectors from all of the OSTs in each mirror. It would then "collapse" these to include only sets of overlapping writes (every overlap would be tracked separately). It would then compare all mirrors to the current primary mirror, and any mirrors which committed overlapping writes in a different order would be marked INCONSISTENT and not-un-staled. These mirrors could be made consistent only by a full data resync (ie, a lfs mirror resync type operation).
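One possible shape for such a write vector and the comparison, sketched as a userspace model (structure and field names are hypothetical): the OST records the extent and commit order of each modifying operation for the current layout generation, and the MDS checks that every pair of overlapping operations committed in the same relative order on each secondary as on the primary.

```c
/*
 * Userspace model of a per-object committed-write vector and the ordering
 * check (illustrative only; names are hypothetical).  The vector would be
 * kept only for the current layout generation while Active Writers exist.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct write_vec_entry {
	uint64_t we_start;	/* byte extent of the operation */
	uint64_t we_end;
	uint64_t we_op_id;	/* identifies the write/fallocate/truncate */
	uint64_t we_commit_seq;	/* order in which it committed on this OST */
};

static bool extents_overlap(const struct write_vec_entry *a,
			    const struct write_vec_entry *b)
{
	return a->we_start < b->we_end && b->we_start < a->we_end;
}

/*
 * Compare a secondary mirror's vector against the primary's.  Assumes the
 * primary vector is sorted by we_commit_seq and both vectors contain the
 * same operations (same we_op_id set).  Every pair of overlapping
 * operations must have committed in the same relative order.
 */
bool mirror_order_consistent(const struct write_vec_entry *primary,
			     const struct write_vec_entry *secondary,
			     size_t count)
{
	for (size_t i = 0; i < count; i++) {
		for (size_t j = i + 1; j < count; j++) {
			size_t si = 0, sj = 0;

			if (!extents_overlap(&primary[i], &primary[j]))
				continue;
			/* Find the same two operations on the secondary. */
			for (size_t k = 0; k < count; k++) {
				if (secondary[k].we_op_id == primary[i].we_op_id)
					si = k;
				if (secondary[k].we_op_id == primary[j].we_op_id)
					sj = k;
			}
			if (secondary[si].we_commit_seq >
			    secondary[sj].we_commit_seq)
				return false;	/* ordering diverged */
		}
	}
	return true;
}
```

This sketch only compares the ordering of overlapping extents, matching the "collapse to overlapping writes" idea above; comparing non-overlapping operations as well would flag apparent inconsistencies where none exist.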
"If an error occurs, the client must stop using the active writer lock for new writes, but several writes may be in progress."
This is necessary because we need to communicate the error back to the server ASAP, so it can start reacting. This seems straightforward enough - we will start the cancellation process on the client so the lock cannot be matched by new writes. The issue is if writes continue, they will create new AW locks. These would not be blocked until the server asked for the EX lock to do the layout changes. This is probably acceptable, since that would flush any newly created AW locks and would probably prevent any more AW locks from being granted while it was in the waiting queue.
What if write/modify operations fail to all mirrors? This could occur, and the MDS will receive this information. It will then have to decide what to do - marking all mirrors stale renders the file inaccessible, but of course no mirror is entirely "good". This should be rare, so we will probably want to notify the admin, but keep the file accessible (ie, do not mark every mirror stale/inconsistent) - we could take whichever mirror experienced the fewest errors or stick with our existing primary/write target mirror regardless of error counts, and mark all of the others stale + inconsistent.
How do we decide a mirror is permanently failed and what do we do about it? It seems likely to me the answer is something like "report INCONSISTENT mirrors to lamigo, lamigo will try to resync those mirrors, but if this fails, will give up and add a new mirror". This is a bit tricky since adding new mirrors is only possible while the file is not being modified, so files which are written to often/constantly for a long time could lose durability if they lose a mirror, since we can't add a mirror during writes (without blocking for a long time).