...
If the primary write mirror becomes unavailable during writes, the clients will inform the metadata server of write errors as normal. The metadata server will handle this the same as any other error: the mirror is marked INCONSISTENT. The MDS will then select an in-sync mirror (one where no writes failed) as the new primary for writes. If no mirror completed all writes without error, there is a policy decision to make: we could either try to determine the "least" damaged mirror, or we could simply default to the previous write primary.
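The selection rule above can be sketched as follows. This is only an illustration: the `Mirror` record, the `failed_writes` counter, and the use of a failed-write count as the measure of "least damaged" are assumptions made for the example, not part of the design.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Mirror:
    mirror_id: int
    failed_writes: int = 0      # write errors reported by clients this phase
    inconsistent: bool = False  # set once any write error is reported


def select_write_primary(mirrors: Dict[int, Mirror], old_primary: int,
                         prefer_least_damaged: bool = True) -> int:
    """Return the mirror id to use as write primary after error reports."""
    # Any mirror with a reported write error is marked INCONSISTENT.
    for m in mirrors.values():
        if m.failed_writes > 0:
            m.inconsistent = True

    # Keep the current primary if it is still in sync.
    if not mirrors[old_primary].inconsistent:
        return old_primary

    # Otherwise prefer any in-sync mirror (lowest id, for determinism).
    in_sync = [m.mirror_id for m in mirrors.values() if not m.inconsistent]
    if in_sync:
        return min(in_sync)

    # No mirror is clean: this is the open policy decision.
    if prefer_least_damaged:
        # Hypothetical metric: fewest failed writes, ties broken by id.
        best = min(mirrors.values(),
                   key=lambda m: (m.failed_writes, m.mirror_id))
        return best.mirror_id
    return old_primary
```

With mirrors 1 (three failed writes), 2 (none), and 3 (one failed write) and mirror 1 as the old primary, the sketch selects mirror 2. If every mirror has failures, the `prefer_least_damaged` flag switches between the two policies named above.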
If a client is evicted from the MDS, it will be assumed to have failed writes to all mirrors other than the current write leader. This is discussed further in FailureHandling. MDS failure is also discussed there; certain information will have to be persisted to allow failover/recovery of the MDS.
The last unaddressed issue is write ordering on the OSTs. If there are concurrent overlapping writes (or fallocate or truncate operations), the order in which they complete is indeterminate, and since we want to write all mirrors in parallel, the ordering could be different on different mirrors. This would result in mirrors containing different data after a write phase, which is obviously unacceptable.
Initially, we will solve this by using the write leader mirror for locking, and requiring that locks always be taken first on this mirror. TODO: how to handle a write leader failing in the middle of this needs consideration; it is hopefully just a matter of failing to acquire the lock or, if the lock has been acquired and an error then occurs on write, of reporting this error to the MDS, which will wind up that write phase.
One major downside of this approach is that it requires us to use LDLM locks for direct IO, which has a significant cost for shared file writes. To avoid this, we will need a way to track the order of write (and other data-affecting) operations on OSTs and communicate this information to the MDT at the end of a write phase.
We have a tentative plan for this described in: TODO PUT AT THE END OF THE DOCUMENT
Our plan is to use 'chained' RPC checksums. We will take care to ensure that all write (and related) RPCs to different mirrors are identical; note that this requires the mirrors to have identical layout geometry. Then, when a write phase opens and the OST is notified to update the layout generation on the stripe (which is done as part of FLR today), we will inform it that the stripe object is part of an immediate mirror file. The OST will take the write RPC checksum from each write RPC and 'chain' them together as they are committed (write commits can be ordered by their journal transaction number, even if they are occurring in parallel). The result is a single checksum value which encodes both the writes and their ordering. We can include non-write operations in this by checksumming their arguments (e.g., the byte range and fallocate op type for fallocate).
The write primary mirror will be the 'correct' ordering, to which the others are compared. If ordering on a secondary mirror disagrees with the primary, that mirror will be marked inconsistent (to be repaired later).
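A minimal sketch of the chaining and comparison described above, assuming SHA-256 as the chaining function and a simple (transaction number, checksum) record per committed RPC. The real checksum type, record format, and all function names here are hypothetical; the design does not specify them.

```python
import hashlib
import struct
from typing import Dict, Iterable, List, Tuple

# (journal transaction number, per-RPC checksum) as seen by one stripe object
Commit = Tuple[int, bytes]


def op_checksum(op_type: str, offset: int, length: int,
                payload: bytes = b"") -> bytes:
    """Checksum one operation: its arguments plus, for writes, the data."""
    h = hashlib.sha256()
    h.update(op_type.encode() + b"\x00")
    h.update(struct.pack("<QQ", offset, length))
    h.update(payload)
    return h.digest()


def chain_checksums(commits: Iterable[Commit]) -> bytes:
    """Fold per-RPC checksums together in journal commit order."""
    chain = b"\x00" * 32  # assumed initial value at the start of a write phase
    for _txn, csum in sorted(commits, key=lambda c: c[0]):
        # Each step depends on the previous chain value, so order matters.
        chain = hashlib.sha256(chain + csum).digest()
    return chain


def find_inconsistent(primary_chain: bytes,
                      secondary_chains: Dict[int, bytes]) -> List[int]:
    """Return ids of secondary mirrors whose chain differs from the primary."""
    return sorted(mid for mid, chain in secondary_chains.items()
                  if chain != primary_chain)
```

Two mirrors that commit the same RPCs in the same relative order produce the same chain value even if their transaction numbers differ, while a mirror that commits the same RPCs in a different order produces a different value and would be reported by `find_inconsistent`.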
We will need to determine whether and how to apply this only to overlapping operations. Doing so requires tracking the extent of every data-modifying operation as we proceed through a write phase, but it protects us from the case of non-overlapping writes causing an apparent inconsistency where none exists. Whether or not this is needed is an open design question.
This is described in detail in TODO <--- Also need to cover recovery.
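To make the overlap-only question concrete, the sketch below compares the commit order on a secondary mirror against the primary and treats the two as equivalent when every pair of overlapping operations commits in the same relative order. It assumes each mirror can report a full list of operation ids and extents, which is more information than a single chained checksum carries; the names and data shapes are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Extent = Tuple[int, int]  # half-open byte range [start, end)


def overlaps(a: Extent, b: Extent) -> bool:
    """True if two half-open byte ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


def orders_equivalent(extents: Dict[str, Extent],
                      primary_order: List[str],
                      secondary_order: List[str]) -> bool:
    """True if all overlapping operation pairs commit in the same order."""
    # A mirror that missed or duplicated an operation is never equivalent.
    if sorted(primary_order) != sorted(secondary_order):
        return False
    p_pos = {op: i for i, op in enumerate(primary_order)}
    s_pos = {op: i for i, op in enumerate(secondary_order)}
    # Quadratic in the number of operations; acceptable for a sketch only.
    for x, y in combinations(primary_order, 2):
        if overlaps(extents[x], extents[y]):
            if (p_pos[x] < p_pos[y]) != (s_pos[x] < s_pos[y]):
                return False
    return True
```

For example, with writes `w1` on bytes 0 to 100, `w2` on bytes 50 to 150, and `w3` on bytes 200 to 300, a secondary that commits `w3` before the other two is still equivalent to the primary, whereas one that swaps `w1` and `w2` is not.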
...