Overview & Design Summary
We have a requirement to do immediate mirroring on Lustre files, where all writes (and related ops) are replicated to multiple mirrors immediately (or close to immediately).
The goal is to have redundancy as soon as possible, but it is not required to have all mirrors available during writes - we are allowed a window where there may be limited redundancy.
This is key in making replication tractable without requiring MVCC, which is not provided by the Lustre backends.
The core idea of the design is this:
Clients will inform the MDS when they are writing (take an "active writer" lock), and the MDS will "unseal" the layout, allowing writes (FLR Write Pending state).
The clients will send all writes to all online mirrors. A single primary mirror will be selected for reads during this time, since we cannot guarantee all mirrors are identical during writes; the other mirrors are marked stale during writes. (If we did not do this, different clients could see differing file contents, which is unacceptable.)
Once the writes are complete, the clients will inform the MDS whether their writes were successful, giving information about all the mirrors.
If all writes were successful, the MDS will un-stale the secondary mirrors and re-seal the layout (FLR Read Only state).
If any writes fail, the MDS will mark those mirrors as out of sync and notify userspace to attempt repair or replacement. Those mirrors will not be used until they have been repaired or replaced (by copying from a non-stale mirror).
If the primary write mirror becomes unavailable during writes, the clients will inform the metadata server as normal. The metadata server will handle this the same as any error - the mirror is marked out of sync. The MDS will then select an in-sync mirror (where no writes failed) as the new primary for writes.
There are a number of recovery scenarios which are covered in more detail below.
Relationship to Erasure Coding
Immediate mirroring is separate from erasure coding, and the two efforts can proceed partly in parallel. Erasure coding has no specific implications for immediate mirroring and will be covered in a different document.
Core Design Concept
Before starting any write or other OST-modifying operation (mmap write, setattr->truncate, etc.), clients will take an active writer lock on a special MDS IBITs bit called ACTIVE_WRITERS. This lock will be taken in CW (concurrent write) mode so multiple clients can hold it at the same time. The MDS will use this lock to determine if there are any active writers on a file.
Layout handling then proceeds as currently: If the layout is RDONLY, the client will send a write intent to the MDS, which will transition the layout to WRITE_PENDING and stale all replicas except one. See Staling During Writes & Primary Replica Selection.
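To make the intended flow concrete, here is a minimal, self-contained sketch of the MDS-side handling of a write intent from a client holding the AW lock. It deliberately does not use real Lustre headers: MDS_INODELOCK_ACTIVE_WRITER is a hypothetical new IBITs bit standing in for ACTIVE_WRITERS, and the layout structure is a simplified stand-in for the real FLR layout, so this is a sketch of the idea rather than an implementation.

/* Minimal self-contained sketch (not real Lustre code).
 * MDS_INODELOCK_ACTIVE_WRITER is a hypothetical new IBITs bit for the
 * ACTIVE_WRITERS lock described above; the AW lock itself would be taken
 * in the existing LCK_CW (concurrent write) mode so many clients can hold
 * it at once. */

#include <stdint.h>

#define MDS_INODELOCK_ACTIVE_WRITER	0x00000800	/* hypothetical bit value */

/* Simplified stand-ins for the FLR layout states (RDONLY / WRITE_PENDING). */
enum flr_state { FLR_RDONLY, FLR_WRITE_PENDING };

struct file_layout {
	enum flr_state	fl_state;
	unsigned int	fl_mirror_count;
	uint32_t	fl_stale_mask;	/* bit i set: mirror i is stale        */
	int		fl_primary;	/* mirror kept readable during writes  */
};

/*
 * MDS side: a client holding the AW lock has sent a write intent.  If the
 * layout is still sealed, unseal it and stale every mirror except the
 * chosen primary (see "Staling During Writes & Primary Replica Selection").
 */
static void mds_handle_write_intent(struct file_layout *lo)
{
	unsigned int i;

	if (lo->fl_state == FLR_WRITE_PENDING)
		return;			/* an earlier writer already unsealed it */

	lo->fl_state = FLR_WRITE_PENDING;
	lo->fl_primary = 0;		/* placeholder policy: any in-sync mirror */
	for (i = 0; i < lo->fl_mirror_count; i++)
		if ((int)i != lo->fl_primary)
			lo->fl_stale_mask |= 1u << i;
}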
Write Operation Flow
As noted above, before starting a write (or other OST-modifying operation), the client will request an AW lock from the MDS. It holds this lock throughout the writing process until all data has been committed to OST storage, at which point it can release the lock. (For direct IO, this is immediately after the write completes; for buffered IO writes, which are async, we will have to detect when the pages are committed.)
This is critical - the client will only release the active writer lock once the data is committed to disk on the OSTs, so it is known to be durable. The client may also keep the lock briefly (1 or 5 seconds) to allow reuse for subsequent writes. See Active Writer Locks below.
The success or failure of writes is recorded and communicated to the MDS in the cancellation reply, combined across all writes which occurred under that lock. (NB: If an error occurs, the client must stop using the lock for new writes, but several writes may still be in progress.)
When the MDS has no more active writer locks, it will examine the errors reported and transition the layout back to RDONLY, un-staling the mirrors where there were no write errors.
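A matching sketch of what the MDS might do at that point, again using simplified stand-in types rather than real Lustre structures; err_mask here is assumed to be the OR of the per-mirror errors reported by every writer's cancel reply.

#include <stdint.h>

/* Simplified stand-ins, as in the earlier sketch. */
enum flr_state { FLR_RDONLY, FLR_WRITE_PENDING };

struct file_layout {
	enum flr_state	fl_state;
	unsigned int	fl_mirror_count;
	uint32_t	fl_stale_mask;		/* stale due to ongoing write */
	uint32_t	fl_inconsistent_mask;	/* stale due to write error   */
};

/*
 * Called once no active-writer locks remain on the file.  err_mask is the
 * OR of the per-mirror error masks reported in every writer's cancel reply.
 */
static void mds_close_write_window(struct file_layout *lo, uint32_t err_mask)
{
	unsigned int i;

	for (i = 0; i < lo->fl_mirror_count; i++) {
		uint32_t bit = 1u << i;

		if (err_mask & bit) {
			/* A write failed here: keep (or make) the mirror
			 * stale and remember it needs a full resync. */
			lo->fl_stale_mask |= bit;
			lo->fl_inconsistent_mask |= bit;
		} else if (!(lo->fl_inconsistent_mask & bit)) {
			/* All writes landed and the mirror was not already
			 * inconsistent: safe to make it readable again. */
			lo->fl_stale_mask &= ~bit;
		}
	}
	lo->fl_state = FLR_RDONLY;
}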
Example Scenario
With one client, three mirrors, and three writes, suppose:
Write 1 errored to mirror 0
Write 2 errored to mirror 1
The client will report errors on mirror 0 and mirror 1. The MDS will then un-stale only mirror 2 and mark the other two mirrors as inconsistent. (Userspace can try to resync to these mirrors, and if this fails, will need to add new mirrors.)
Note: If all the mirrors fail writes during an IO, we will have to decide how to handle this. See Open Problems.
This will require a new type of LDLM_CANCEL reply, one which includes an LVB so the client can communicate this information, and a new LVB structure will have to be crafted to carry it to the server. It will also need CLIO changes to collect errors from all types of operations which use the Active Writer lock, since any modification failure on a mirror renders that mirror inconsistent.
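As a starting point, a rough sketch of the kind of payload that cancel-time LVB might carry. The struct and field names are hypothetical, and a real implementation would also need wire swabbing and room for versioning.

#include <stdint.h>

/*
 * Hypothetical LVB payload returned with the new LDLM_CANCEL reply for an
 * active-writer lock.  Field names are illustrative only.
 */
struct aw_cancel_lvb {
	uint32_t	aw_layout_gen;	 /* layout generation the writes ran under */
	uint32_t	aw_mirror_count; /* number of mirrors the client knew of   */
	uint32_t	aw_error_mask;	 /* bit i set: a write/setattr/truncate to
					  * mirror i failed under this lock       */
	uint32_t	aw_flags;	 /* e.g. "client still had I/O in flight"  */
};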
Failure Handling
Mirror State Management
We will have to distinguish between:
Mirrors which are "stale due to ongoing write", which are just stale
Mirrors which are "stale due to error", which I am tentatively calling inconsistent (an inconsistent mirror is always stale as well)
The inconsistent mirrors are excluded when the MDS unstales the mirrors (transitioning the layout back to RDONLY). We will need to add a new mirror flag to identify this, INCONSISTENT or WRITE_INCONSISTENT. INCONSISTENT mirrors are always stale, to benefit from existing handling for stale mirrors, but inconsistent mirrors are also ignored for new writes. The inconsistent flag is only cleared by a complete data resync to that mirror.
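A sketch of the proposed flag and its invariants. LCME_FL_STALE exists in Lustre today, but the bit values shown here are stand-ins only, and LCME_FL_INCONSISTENT is the hypothetical new flag.

#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit values for the sketch; LCME_FL_STALE is an existing mirror
 * component flag, LCME_FL_INCONSISTENT is the proposed new one. */
#define LCME_FL_STALE		0x00000001
#define LCME_FL_INCONSISTENT	0x00000100

/* An inconsistent mirror is always stale as well, so it benefits from the
 * existing handling for stale mirrors. */
static inline void mirror_mark_inconsistent(uint32_t *flags)
{
	*flags |= LCME_FL_INCONSISTENT | LCME_FL_STALE;
}

/* Inconsistent mirrors are also skipped for new writes; only a complete
 * data resync clears the flag. */
static inline bool mirror_usable_for_write(uint32_t flags)
{
	return !(flags & LCME_FL_INCONSISTENT);
}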
Client Eviction from MDS
Critically, if there is an MDS eviction of a client with an active writer lock, the mirrors cannot be unstaled because we cannot know if that client completed writes successfully. In that case, we will assume that client successfully wrote to the primary mirror and mark the other mirrors as inconsistent.
We could also possibly compare the OST commit vectors to check status - if any mirror has "all" the writes, it could be un-staled. See Open Problems for more on the OST commit vector.
Primary Mirror Write Failure
In case of a write failure to the primary mirror, the clients will attempt to complete all other writes and inform the MDS as usual. The MDS will then mark the primary mirror stale as part of the transition to RDONLY, and will select a new primary mirror when the next write intent request arrives. This should work transparently with minimal changes.
We will also want to notify lamigo or similar when a mirror becomes stale due to error so it can attempt to recover state.
Client Implementation
The client will need to duplicate all writes and write-like operations to all of the mirrors. This will involve substantial work on the CLIO side to do correctly.
Note: Ideally we would not have to compress or encrypt data more than once when sending to multiple servers, but this may not be avoidable.
Direct I/O vs Buffered I/O
Direct I/O: This should be relatively straightforward, as DIO does not interact with the page cache and can easily duplicate IO requests (if we do not mind duplicating some work).
Buffered I/O: This will be much trickier, since just repeating the operation would hit the same pages in the page cache. How to handle this is an open question.
Additional Client Requirements
The client will also need to:
Hold the active writer lock until all IO has been committed to the OSTs, so it is known to be safe
For DIO this is easy - all DIO is immediately committed
For buffered, this will be more work - we'll need to track pages differently.
Fail write-like operations quickly, similarly to how it fails reads quickly when there are other mirrors
Record all of the errors as appropriate so it can communicate them back to the server in the cancel reply (see the sketch below)
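A sketch of the client-side bookkeeping for that last point: a hypothetical per-lock record that accumulates errors from every write-like operation done under the AW lock, to be packed into the cancel-reply LVB sketched earlier. The names are illustrative, not existing Lustre structures.

#include <stdint.h>

/*
 * Hypothetical per-lock record kept by the client while it holds an AW
 * lock.  Every write-like operation (write, truncate, fallocate, ...) that
 * fails against a mirror records the failure here; the accumulated mask is
 * packed into the cancel-reply LVB when the lock is released.
 */
struct aw_lock_state {
	uint32_t	aws_error_mask;	/* bit i set: mirror i saw an error  */
	int		aws_first_rc;	/* first errno seen, for diagnostics */
};

static void aw_record_error(struct aw_lock_state *aws, unsigned int mirror,
			    int rc)
{
	if (rc == 0)
		return;

	aws->aws_error_mask |= 1u << mirror;
	if (aws->aws_first_rc == 0)
		aws->aws_first_rc = rc;

	/* After the first error this lock must no longer be matched for new
	 * writes; writes already in flight are allowed to finish and report
	 * their results here as well. */
}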
Notes
Staling During Writes & Primary Replica Selection
We have chosen to keep only one mirror readable while writing is going on, even though we will write to all of the mirrors. This is because all clients must see the same data when they read the file, and we cannot guarantee this with multiple mirrors without a complex transactional approach*.
A write may have arrived at one mirror but not yet at a different one, and two clients reading could see different data. This cannot be avoided unless we implement distributed transactions and rollback, which is not practical. (We could try locking all OSTs together before writing, but direct IO does not use locks and other problems come up in failure cases.)
So we keep a single primary replica readable during writes, even though we always try to update all replicas.
Note: The 'simple' approach would be to lock all mirrors before doing any writing and unlock them together. However, this is not possible for DIO, which does not use LDLM locks on the client, but instead only uses them locally on each separate server.
Active Writer Locks
The active writer lock will be cached on the client so it can be used by multiple writes (so every write does not generate a separate MDS lock request), but the LRU/cache time will be deliberately kept short (e.g., perhaps 1 second or 5 seconds). This should allow continuous writes from one process to work, but will not hold the lock for any length of time otherwise. This is because the presence of Active Writer locks prevents the MDS from considering the secondary mirrors in sync, so it reduces data reliability to cache them for any longer.
When an Active Writer lock is called back, the client will need to generate sync RPCs for all outstanding data for the file and confirm they have completed before it allows the lock to be cancelled. Note this also has to handle the case where some mirrors can't be written - those errors would also need to be sent back to the server.
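A sketch of that callback path. The helpers here are placeholders standing in for the real flush and cancel machinery, not existing Lustre functions.

#include <stdint.h>

struct aw_lock_state {
	uint32_t	aws_error_mask;	/* bit i set: a write to mirror i failed */
};

/* Placeholder: flush and wait for all dirty and in-flight data for the file
 * on every mirror, setting bit i of *err_mask for any mirror i that could
 * not be written. */
static int aw_sync_all_mirrors(void *file, uint32_t *err_mask)
{
	(void)file;
	(void)err_mask;
	return 0;
}

/* Placeholder: cancel the AW lock, carrying the accumulated error mask back
 * to the MDS in the new cancel-reply LVB. */
static int aw_cancel_lock(void *lock, uint32_t err_mask)
{
	(void)lock;
	(void)err_mask;
	return 0;
}

/*
 * Called when the MDS calls back the active-writer lock.  The lock may only
 * be released once all data written under it is committed on the OSTs, so
 * sync first; mirrors that could not be written are reported in the same
 * cancel.
 */
static int aw_lock_callback(void *lock, void *file, struct aw_lock_state *aws)
{
	uint32_t sync_errors = 0;

	(void)aw_sync_all_mirrors(file, &sync_errors);
	aws->aws_error_mask |= sync_errors;

	return aw_cancel_lock(lock, aws->aws_error_mask);
}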
Open Problems
Write Ordering of overlapping writes across multiple mirrors
If we have two clients writing the same region of the file with multiple mirrors, how can we ensure we get the same ordering in every case? If A and B are both writes to the same section of the file, and A→B happens on one mirror (so B is the final result there), how can we ensure we get the same order (A→B, resulting in B) on the other mirrors?
Buffered IO could do this by taking and holding all of the DLM locks*, but this can't be done for direct IO because direct IO does not use client-side DLM locks. (This is an important property of direct IO and not something we want to change.)
*Or by clever use of the primary write mirror's LDLM lock, but I think this has issues with primary mirror failure.
One suggestion is having the OSTs maintain a vector of committed writes & extents for Active Writer files. I believe the OSTs would only need to maintain this across a single layout generation. As part of transitioning an immediately mirrored file back to RDONLY, the server would request the write vectors from all of the OSTs in each mirror. It would then "collapse" these to include only sets of overlapping writes (every overlap would be tracked separately). It would then compare all mirrors to the current primary mirror, and any mirrors which committed overlapping writes in a different order would be marked INCONSISTENT and left stale. These mirrors could be made consistent only by a full data resync (i.e., an lfs mirror resync type operation).
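A sketch of one possible shape for the commit vector and the ordering check. It skips the "collapse" step and simply compares the commit order of each overlapping pair directly; the structures and field names are illustrative only.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: one possible record an OST could keep per committed
 * write on an active-writer file, within a single layout generation.
 */
struct commit_rec {
	uint64_t	cr_start;	/* extent start (bytes)                */
	uint64_t	cr_end;		/* extent end, exclusive               */
	uint64_t	cr_writer;	/* client + write id, to match records
					 * for the same write across mirrors   */
};

static bool overlaps(const struct commit_rec *a, const struct commit_rec *b)
{
	return a->cr_start < b->cr_end && b->cr_start < a->cr_end;
}

/* Position of a given write in a mirror's commit vector, -1 if missing. */
static long find_rec(const struct commit_rec *v, size_t n, uint64_t writer)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (v[i].cr_writer == writer)
			return (long)i;
	return -1;
}

/*
 * Return true if 'mirror' committed every pair of overlapping writes in the
 * same order as 'primary'.  If not, the mirror would be marked INCONSISTENT
 * and left stale until a full resync.
 */
static bool mirror_order_matches(const struct commit_rec *primary, size_t np,
				 const struct commit_rec *mirror, size_t nm)
{
	size_t i, j;

	for (i = 0; i < np; i++) {
		for (j = i + 1; j < np; j++) {
			long mi, mj;

			if (!overlaps(&primary[i], &primary[j]))
				continue;	/* order only matters for overlaps */

			mi = find_rec(mirror, nm, primary[i].cr_writer);
			mj = find_rec(mirror, nm, primary[j].cr_writer);
			if (mi < 0 || mj < 0 || mi > mj)
				return false;	/* missing or reordered write */
		}
	}
	return true;
}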
Active Writer Lock matching - Resolved
"If an error occurs, the client must stop using the active writer lock for new writes, but several writes may be in progress."
This is necessary because we need to communicate the error back to the server ASAP, so it can start reacting. This seems straightforward enough - we will start the cancellation process on the client so the lock cannot be matched by new writes. The issue is that if writes continue, they will create new AW locks. These would not be blocked until the server asked for the EX lock to do the layout changes. This is probably acceptable, since that would flush any newly created AW locks and would probably prevent any more AW locks from being granted while it was in the waiting queue.
What if all mirrors fail?
What if write/modify operations fail to all mirrors? This could occur, and the MDS will receive this information. It will then have to decide what to do - marking all mirrors stale renders the file inaccessible, but of course no mirror is entirely "good". This should be rare, so we will probably want to notify the admin, but keep the file accessible (i.e., not mark every mirror stale/inconsistent) - we could take whichever mirror experienced the fewest errors, or stick with our existing primary/write target mirror regardless of error counts, and mark all of the others stale + inconsistent.
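A sketch of the "fewest errors, falling back to the current primary" option. Note that per-mirror error counts are an assumption here, since the cancel LVB described above only carries an error mask.

#include <stdint.h>

/*
 * Sketch of one policy for the "all mirrors failed" case: pick the mirror
 * with the fewest reported errors as the survivor (keeping the current
 * primary on a tie) and mark every other mirror stale + inconsistent,
 * rather than making the file inaccessible.  Assumes current_primary is a
 * valid mirror index and that per-mirror error counts are available.
 */
static int pick_survivor(const unsigned int *err_count,
			 unsigned int nr_mirrors, int current_primary)
{
	int best = current_primary;
	unsigned int i;

	for (i = 0; i < nr_mirrors; i++)
		if (err_count[i] < err_count[best])
			best = (int)i;

	return best;	/* all other mirrors get STALE + INCONSISTENT */
}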