...

We have a requirement to do immediate mirroring on Lustre files, where all writes (and related ops) are replicated to multiple mirrors immediately.  The infrastructure created for this will also be used for immediate erasure coding.

The goal is to have redundancy immediately, but it is not required that all mirrors be available for reads during writes - while writes are in flight, it is not possible to guarantee that all mirrors present a consistent state without MVCC (multi-version concurrency control).  With MVCC, this problem could be solved by specifying a particular 'stable' version for reads during a write, but Lustre's storage backends (ldiskfs, ZFS) do not provide this functionality.  Instead, during writes, only a single mirror will be available for reads.

...

Clients will inform the MDS when they are writing (by taking an "active writer" lock), and the MDS will "unseal" the layout, allowing writes (FLR Write Pending state).  The clients will send all writes to all online mirrors.  A single primary mirror will be selected for reads during this time, since we cannot guarantee all mirrors are identical while writes are in flight.  This mirror is the "write leader", and it is also used for write ordering by taking locks on this mirror first.  The other mirrors are marked stale during writes.  (If we did not mark them stale, different clients could see differing file contents, which is unacceptable.)  Once the writes are complete, the clients will inform the MDS whether their writes were successful (giving information about all of the mirrors).  If all writes were successful, the MDS will un-stale the secondary mirrors and re-seal the layout (FLR Read Only state).  If any writes fail, the MDS will mark the affected mirrors as out of sync and notify userspace to attempt repair or replacement.  Those mirrors will not be used until they have been repaired or replaced (by copying from a non-stale mirror).
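
As a rough illustration of the client side of this flow, the following is a minimal C sketch of fanning a write out to all online mirrors while the layout is in Write Pending; all names here (flr_state, struct mirror, ost_write(), etc.) are hypothetical stand-ins for illustration, not the actual Lustre data structures or OSC interfaces.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    enum flr_state { FLR_RDONLY, FLR_WRITE_PENDING };

    struct mirror {
        int  id;
        bool online;
        bool stale;        /* set by the MDS for secondaries during writes */
        bool write_failed; /* reported back to the MDS with the lock */
    };

    struct flr_file {
        enum flr_state  state;
        int             write_leader; /* index of the read/locking primary */
        struct mirror  *mirrors;
        size_t          nr_mirrors;
    };

    /* Stub standing in for the real client-to-OST write path. */
    static int ost_write(struct mirror *m, const void *buf, size_t len,
                         uint64_t off)
    {
        (void)m; (void)buf; (void)len; (void)off;
        return 0;
    }

    /* Fan one write out to every online mirror.  The MDS has already
     * unsealed the layout (Write Pending) and marked the secondaries
     * stale for readers; only the write leader serves reads. */
    static int client_write(struct flr_file *f, const void *buf, size_t len,
                            uint64_t off)
    {
        int rc = 0;

        for (size_t i = 0; i < f->nr_mirrors; i++) {
            struct mirror *m = &f->mirrors[i];

            if (!m->online)
                continue;
            if (ost_write(m, buf, len, off) != 0) {
                m->write_failed = true;
                rc = -1;
            }
        }
        return rc;
    }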

Clients will hold the active writer lock until they have written all data to disk, and possibly slightly longer to allow for reuse.  If a client experiences a write error, it will finish all writes currently in flight (syncing to disk), then return the active writer lock to the MDS along with information about the write failure (primarily which mirror failed, but also the write extent).  On receipt of an error, the MDS will request the active writer lock in EX mode, which forces all clients to flush any existing writes and not start new ones until the lock can be re-acquired.  If no error occurs, the clients will return their active writer locks to the MDS shortly after they have completed writing, and the MDS will then take the active writer lock to ensure there are no further writes from clients.
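
As a sketch only, the failure information a client might attach when returning the active writer lock could look something like the following; the struct name, fields, and layout are assumptions for illustration, not an actual RPC format.

    #include <stdint.h>

    /* Hypothetical payload returned with the active writer lock when a
     * client hits a write error: which mirror failed and over what extent.
     * On receipt of such a report, the MDS re-takes the lock in EX mode
     * as described above. */
    struct aw_error_report {
        uint32_t mirror_id;  /* mirror that failed the write */
        uint64_t ext_start;  /* byte range of the failed write */
        uint64_t ext_end;
        int32_t  rc;         /* errno-style error from the OST */
    };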

In either case, once the MDS has been granted the active writer lock, it will begin transitioning the layout back to RDONLY.  Note/TODO: The MDS must do something about evicted clients to ensure they do not write stale data, as they may have writes in flight and will not receive the active writer lock cancellation.  This can probably be modeled on the approach taken by mirror resync - if nothing else, the MDS can take data extent locks on the entire file to force flushing.  We could probably do this only on eviction, so it would not be too painful.  If no write errors were reported, the MDS can simply un-stale the secondary mirrors and transition the layout back to RDONLY.  If write errors were reported, the MDS will mark the errored mirrors as INCONSISTENT (a special version of stale, which is only cleared by a full data resync) and will notify userspace (probably via the changelog) to attempt repair or replacement.  INCONSISTENT mirrors will be treated like STALE mirrors and not used for anything until they have been repaired.
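
A minimal C sketch of this MDS-side transition follows, assuming hypothetical per-mirror flags and a stand-in changelog call (changelog_notify_repair() is not the real changelog interface); it only illustrates the branching between the error and no-error cases described above.

    #include <stdbool.h>
    #include <stddef.h>

    enum mirror_flag { M_IN_SYNC, M_STALE, M_INCONSISTENT };

    struct mirror_state {
        int              id;
        enum mirror_flag flag;
        bool             write_failed; /* as reported by the clients */
    };

    /* Stand-in for posting a changelog record asking userspace to
     * repair or replace a mirror. */
    static void changelog_notify_repair(int mirror_id) { (void)mirror_id; }

    /* Called once the MDS holds the active writer lock in EX mode, so no
     * client can have writes in flight.  Errored mirrors become
     * INCONSISTENT and stay unused until fully resynced; clean
     * secondaries are un-staled and the layout is sealed back to RDONLY. */
    static void mds_seal_layout(struct mirror_state *m, size_t n, int leader)
    {
        for (size_t i = 0; i < n; i++) {
            if (m[i].write_failed) {
                m[i].flag = M_INCONSISTENT;
                changelog_notify_repair(m[i].id);
            } else if ((int)i != leader) {
                m[i].flag = M_IN_SYNC;   /* clear STALE */
            }
        }
        /* the layout state would be set back to FLR Read Only here */
    }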

If the primary write mirror becomes unavailable during writes, the clients will inform the metadata server of write errors as normal.  The metadata server will handle this the same as any other error - the mirror is marked out of sync (STALE or INCONSISTENT).  The MDS will then select an in-sync mirror (one where no writes failed) as the new primary for writes.  If no mirror completed all writes without error, there is a policy decision to make - we could either try to determine the "least" damaged mirror, or we could simply default to the previous write primary.
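
To illustrate the policy choice, here is a hedged sketch of one possible selection rule: prefer an in-sync mirror, and fall back to the previous primary.  This is just one of the two options mentioned above expressed as code, not a decided design, and the names are illustrative.

    #include <stdbool.h>
    #include <stddef.h>

    struct mirror_status {
        int  id;
        bool online;
        bool write_failed; /* any failed write in the last write phase */
    };

    /* Pick a new write leader: prefer a mirror that completed every write
     * in the last phase; if none exists, fall back to the previous
     * primary rather than trying to guess which mirror is "least"
     * damaged. */
    static int pick_write_leader(const struct mirror_status *m, size_t n,
                                 int prev_leader)
    {
        for (size_t i = 0; i < n; i++) {
            if (m[i].online && !m[i].write_failed)
                return m[i].id;
        }
        return prev_leader;
    }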

If a client is evicted from the MDS, it will be assumed to have failed writes to all mirrors other than the current write leader.  This is discussed further in FailureHandling.

The last unaddressed issue is write ordering on the OSTs.  If there are concurrent overlapping writes (or fallocate or truncate operations), the order in which they complete is indeterminate, and since we want to write all mirrors in parallel, the ordering could be different on different mirrors.  This would result in mirrors containing different data after a write phase, which is obviously unacceptable.

Initially, we will solve this by using the write leader mirror for locking, and requiring that locks always be taken first on this mirror.  TODO: How to handle the case where the write leader fails mid-write needs consideration; it is hopefully just a matter of failing to acquire the lock, or of handling an error that occurs after the lock has been acquired.
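
A minimal sketch of this lock-ordering rule follows, assuming a per-mirror extent lock primitive (lock_extent() here is a hypothetical stand-in for the real OST extent locking).  Because every writer locks the write leader first, overlapping writes serialize in the same order on every mirror.

    #include <stddef.h>
    #include <stdint.h>

    struct mirror_obj { int id; /* OST object backing this mirror */ };

    /* Hypothetical per-object extent lock; stands in for the real
     * extent locking done on the OSTs. */
    static int lock_extent(struct mirror_obj *o, uint64_t start, uint64_t end)
    {
        (void)o; (void)start; (void)end;
        return 0;
    }

    /* Take the extent lock on the write leader first, then on the other
     * mirrors.  Since all writers order their locks the same way, any
     * overlapping writes are applied in the same order everywhere. */
    static int lock_all_mirrors(struct mirror_obj *mirrors, size_t n,
                                size_t leader, uint64_t start, uint64_t end)
    {
        int rc = lock_extent(&mirrors[leader], start, end);

        if (rc != 0)
            return rc;     /* leader unavailable: report to the MDS */

        for (size_t i = 0; i < n; i++) {
            if (i == leader)
                continue;
            rc = lock_extent(&mirrors[i], start, end);
            if (rc != 0)
                break;     /* secondary failure: reported to the MDS */
        }
        return rc;
    }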

This means we need a way to track write (and other data-affecting) operations on OSTs and communicate this information to the MDT at the end of a write phase.
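
As an illustration of what such tracking might record, the following sketch shows a per-object entry an OST could accumulate and report to the MDT at the end of a write phase; the names and the idea of a simple extent list are assumptions, not a chosen design.

    #include <stdint.h>

    /* Kinds of data-affecting operations the OST would need to record. */
    enum ost_data_op { OP_WRITE, OP_TRUNCATE, OP_FALLOCATE };

    /* One tracked operation on an OST object.  A per-object list of these
     * would be summarised and sent to the MDT when the write phase ends,
     * so the MDT knows which mirrors saw which modifications and which
     * of them failed. */
    struct ost_op_record {
        enum ost_data_op op;
        uint64_t         ext_start; /* byte range affected */
        uint64_t         ext_end;
        int32_t          rc;        /* result, to identify failed mirrors */
    };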

...