
Overview & Design Summary


We have a requirement to do immediate mirroring on Lustre files, where all writes (and related ops) are replicated to multiple mirrors immediately (or close to immediately).

The goal is to have redundancy as soon as possible, but it is not required to have all mirrors available during writes - we are allowed a window where there may be limited redundancy.

This relaxation is key to making replication tractable without requiring MVCC, which is not provided by the Lustre backends.

The core idea of the design is this:

Clients will inform the MDS when they are writing by taking an "active writer" lock, and the MDS will "unseal" the layout to allow writes (the FLR Write Pending state).  The clients will send all writes to all online mirrors.

A single primary mirror will be selected for reads during this time, since we cannot guarantee all mirrors are identical while writes are in flight; the other mirrors are marked stale during writes.  (If we did not do this, different clients could see differing file contents, which is unacceptable.)

Once the writes are complete, the clients will inform the MDS whether their writes were successful, with information about all the mirrors.  If all writes succeeded, the MDS will un-stale the secondary mirrors and re-seal the layout (the FLR Read Only state).  If any writes fail, the MDS will mark those mirrors as out of sync and notify userspace to attempt repair or replacement.  Those mirrors will not be used until they have been repaired or replaced (by copying from a non-stale mirror).
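
To illustrate the state machine described above, here is a minimal, self-contained C sketch. This is not Lustre code; all names (LAYOUT_RDONLY, MIRROR_STALE_WRITING, and so on) are hypothetical stand-ins for the layout and mirror states and the transitions the MDS would perform.

{code:c}
/* Illustrative model only -- not actual Lustre structures or constants. */

enum layout_state {
	LAYOUT_RDONLY,		/* sealed: all in-sync mirrors readable             */
	LAYOUT_WRITE_PENDING,	/* unsealed: writes in flight, one readable primary */
};

enum mirror_state {
	MIRROR_IN_SYNC,		/* readable and up to date                     */
	MIRROR_STALE_WRITING,	/* stale only because a write is in progress   */
	MIRROR_STALE_ERROR,	/* stale because a write to it failed          */
};

/*
 * Transitions (MDS side):
 *   RDONLY -> WRITE_PENDING on the first write intent: the primary stays
 *   MIRROR_IN_SYNC, all other mirrors become MIRROR_STALE_WRITING.
 *
 *   WRITE_PENDING -> RDONLY when the last active writer lock is released:
 *   mirrors with no reported write errors return to MIRROR_IN_SYNC,
 *   mirrors with errors become MIRROR_STALE_ERROR until repaired/replaced.
 */

/* Which mirrors may serve reads in each layout state? */
static int mirror_readable(enum layout_state ls, enum mirror_state ms,
			   int is_primary)
{
	if (ls == LAYOUT_WRITE_PENDING)
		return is_primary;	/* only the primary is readable during writes */
	return ms == MIRROR_IN_SYNC;	/* sealed layout: any in-sync mirror          */
}
{code}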

If the primary write mirror becomes unavailable during writes, the clients will inform the metadata server as normal.  The metadata server will handle this the same as any error: the mirror is marked out of sync.  The MDS will then select an in-sync mirror (one where no writes failed) as the new primary for writes.

There are a number of recovery scenarios which are covered in more detail below.

Relationship to Erasure Coding

Immediate mirroring is separate from erasure coding, and can be done partly in parallel.  Erasure coding has no specific implications for immediate mirroring and will be covered in a different document.

Core Design Concept


Before starting any write/OST modifying operation (mmap write, setattr->truncate, etc.), clients will take an active writer lock on a special MDS IBITs bit called ACTIVE_WRITERS. This lock will be a CW lock so multiple clients can hold it at the same time. The MDS will use this lock to determine if there are any active writers on a file.


Layout handling then proceeds as currently: if the layout is RDONLY, the client will send a write intent to the MDS, which will transition the layout to WRITE_PENDING and stale all replicas except one. See Staling During Writes & Primary Replica Selection.
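
As a rough illustration of the locking concept, the sketch below shows why a CW (concurrent write) lock mode is used for the new ACTIVE_WRITERS inodebits bit: CW locks are mutually compatible, so many clients can write at once, while the MDS can still tell that writers exist. This is not actual Lustre code; the bit value, names, and helper are assumptions.

{code:c}
/* Illustrative only: hypothetical inodebits bit and lock-mode compatibility. */
#include <stdbool.h>

/* Hypothetical new IBITS bit; the value here is just a placeholder. */
#define IBIT_ACTIVE_WRITERS	0x100

enum lock_mode { LCK_MODE_CW, LCK_MODE_PR, LCK_MODE_EX };

/*
 * CW locks are compatible with each other, so any number of clients can
 * hold IBIT_ACTIVE_WRITERS at the same time.  The MDS knows "there are
 * active writers" simply because at least one granted lock covers the bit.
 */
static bool aw_modes_compatible(enum lock_mode a, enum lock_mode b)
{
	return a == LCK_MODE_CW && b == LCK_MODE_CW;
}
{code}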

Write Operation Flow

As noted above, before starting a write (or other OST modifying operation), the client will request an AW lock from the MDS.  It holds this lock throughout the writing process until all data has been committed to OST storage.  At this point, it can release the lock.

This is critical: the client will only release the active writer lock once the data is committed to disk on the OSTs, so it is known to be durable. The client may also keep the lock briefly (e.g., 1 or 5 seconds) to allow reuse by subsequent writes. See Active Writer Lock Lifetimes.


Once all writes have completed (with success or error), all of the clients will release their active writer locks. The clients will communicate the success or failure of their writes to the MDS in the cancellation reply, combining the results of all writes which occurred under that lock. Once the MDS sees there are no more active writer locks, it will transition the layout back to RDONLY, un-staling the mirrors where there were no errors.
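
A minimal sketch of the client-side bookkeeping this implies (all names are hypothetical, and the real implementation would live in the client's lock cancellation path): results are accumulated per active writer lock and handed to the MDS when the lock is cancelled.

{code:c}
/* Illustrative only: per-lock aggregation of write results on the client. */
#include <stdint.h>

struct aw_results {
	uint32_t aw_mirrors_written;	/* bitmap: mirrors we attempted to write */
	uint32_t aw_mirrors_errored;	/* bitmap: mirrors where a write failed  */
};

/* Called for every individual write issued while the AW lock is held. */
static void aw_record_write(struct aw_results *res, unsigned int mirror_idx,
			    int rc)
{
	res->aw_mirrors_written |= 1u << mirror_idx;
	if (rc < 0)
		res->aw_mirrors_errored |= 1u << mirror_idx;
}

/*
 * At cancellation time (after all data is committed on the OSTs), the
 * accumulated bitmaps are sent to the MDS in the cancel reply so it can
 * decide which mirrors to un-stale.
 */
{code}
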
Example Scenario
With one client, three mirrors, and three writes:

Write 1 errors on mirror 0
Write 2 errors on mirror 1

The client will report errors on mirror 0 and mirror 1. The MDS would then un-stale only mirror 2 and mark the other two mirrors as errored. (Userspace can try to resync these mirrors, and if that fails, new mirrors will need to be added.)
{info}
Note: If writes fail on all mirrors during an IO, we will have to decide what to do - probably determine whether any of the mirrors can still be reached.
{info}
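
A small self-contained sketch of the MDS-side decision for this example (names are hypothetical; the real code would operate on the layout's mirror entries): given the per-mirror error bitmap reported by the client, only mirror 2 returns to the in-sync set.

{code:c}
/* Illustrative only: MDS decision when the last active writer lock goes away. */
#include <stdio.h>
#include <stdint.h>

#define NR_MIRRORS 3

enum mirror_state { IN_SYNC, STALE, WRITE_ERRORS };

static void mds_writers_done(enum mirror_state state[NR_MIRRORS],
			     uint32_t errored_bitmap)
{
	for (int i = 0; i < NR_MIRRORS; i++) {
		if (errored_bitmap & (1u << i))
			state[i] = WRITE_ERRORS;	/* needs resync or replacement */
		else
			state[i] = IN_SYNC;		/* un-stale: writes succeeded  */
	}
}

int main(void)
{
	/* During the write, all mirrors except the primary (mirror 0) were stale. */
	enum mirror_state state[NR_MIRRORS] = { IN_SYNC, STALE, STALE };

	/* Example from the text: write 1 errored on mirror 0, write 2 on mirror 1. */
	uint32_t errored = (1u << 0) | (1u << 1);

	mds_writers_done(state, errored);

	for (int i = 0; i < NR_MIRRORS; i++)
		printf("mirror %d: %s\n", i,
		       state[i] == IN_SYNC ? "in sync" :
		       state[i] == WRITE_ERRORS ? "write errors" : "stale");
	return 0;
}
{code}
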
Technical Implementation Requirements
This will require a new type of LDLM_CANCEL reply, one which includes the LVB so the client can communicate the per-mirror write results.
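
The exact wire format is still to be defined; as an assumption-level sketch only, the LVB carried in the cancel reply might look something like the following (field names and sizes are placeholders, not a proposed on-wire layout).

{code:c}
/* Illustrative only: hypothetical LVB-style payload for the new cancel reply. */
#include <stdint.h>

struct aw_cancel_lvb {
	uint32_t acl_mirrors_written;	/* bitmap of mirrors this client wrote to    */
	uint32_t acl_mirrors_errored;	/* bitmap of mirrors with failed writes      */
	uint64_t acl_flags;		/* reserved for future use (e.g. aborted IO) */
};
{code}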


Failure Handling

MDS Eviction

Critically, if a client holding an active writer lock is evicted by the MDS, the secondary mirrors cannot be un-staled, because we cannot know whether that client completed its writes successfully. In that case, we will assume only that the client successfully wrote to the primary mirror.
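
A hedged sketch of that rule (names and the bitmap representation are assumptions): on eviction of a writer, the MDS falls back to trusting only the primary mirror.

{code:c}
/* Illustrative only: MDS handling of an evicted client holding an AW lock. */
#include <stdint.h>
#include <stdbool.h>

/*
 * If a writer is evicted, its cancel reply (and so its per-mirror results)
 * is never received.  The conservative assumption is that only the primary
 * mirror received the writes, so the secondary mirrors must stay stale.
 */
static uint32_t mds_unstale_mask(uint32_t candidate_mask, uint32_t primary_mask,
				 bool writer_evicted)
{
	if (writer_evicted)
		return primary_mask;	/* only the primary can be trusted          */
	return candidate_mask;		/* normal path: un-stale all clean mirrors  */
}
{code}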

Primary Mirror Write Failure

In case of a write failure to the primary mirror, the clients will attempt to complete all other writes and inform the MDS as usual. The MDS will then mark the primary mirror stale as part of the transition to RDONLY, and will select a new primary mirror when the next write intent request arrives. This should work transparently with minimal changes.
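
A minimal sketch of the primary selection step (a hypothetical helper; the real policy may also consider preferred mirrors, OST availability, and so on): choose a mirror that has no recorded write errors.

{code:c}
/* Illustrative only: choosing a new primary mirror for the next write intent. */
#include <stdint.h>

#define NO_MIRROR (-1)

/* Pick the first mirror that is not marked as having write errors. */
static int mds_pick_primary(uint32_t errored_bitmap, int nr_mirrors)
{
	for (int i = 0; i < nr_mirrors; i++)
		if (!(errored_bitmap & (1u << i)))
			return i;
	return NO_MIRROR;	/* all mirrors errored: the open question noted earlier */
}
{code}
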
Mirror State Management
We will have to distinguish between:

Mirrors which are "stale due to error"
Mirrors which are "stale due to ongoing write"

The "stale due to error" mirrors are excluded when the MDS unstales the mirrors (transitioning back to RDONLY). It may be best to add a new WRITE_ERRORS or WRITE_INCONSISTENT or similar flag to identify this.

We will also want to notify lamigo or similar when a mirror becomes stale due to error so it can attempt to recover state.
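
One way to represent the distinction, sketched with hypothetical flag names (the eventual flag may be WRITE_ERRORS, WRITE_INCONSISTENT, or something else entirely), together with the notification hook:

{code:c}
/* Illustrative only: hypothetical per-mirror flags and un-stale eligibility. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MF_STALE	0x1	/* stale due to the ongoing write          */
#define MF_WRITE_ERRORS	0x2	/* stale due to a failed write (sticky)    */

static bool mirror_can_unstale(uint32_t flags)
{
	/* Only "stale because of the write in progress" is cleared when the   */
	/* layout returns to RDONLY; error-stale mirrors wait for resync.      */
	return (flags & MF_STALE) && !(flags & MF_WRITE_ERRORS);
}

/* Hook point: notify a resync agent (e.g. lamigo) about an errored mirror. */
static void notify_resync_agent(int mirror_idx)
{
	fprintf(stderr, "mirror %d needs resync or replacement\n", mirror_idx);
}
{code}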

Client Eviction

From MDS

From OST


Client Implementation


The client will need to duplicate all writes and write-like operations to all of the mirrors. This will involve substantial work on the CLIO side to do correctly.
Note: Ideally we would not have to compress or encrypt data more than once when sending to multiple servers, but this may not be avoidable.


Direct I/O vs Buffered I/O

Direct I/O: This should be relatively straightforward, as DIO does not interact with the page cache and can easily duplicate IO requests (if we do not mind duplicating some work); see the sketch after this list.
Buffered I/O: This will be much trickier, since simply repeating the operation would hit the same pages in the page cache. How to handle this is an open question.
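
For the DIO case, a rough sketch of the fan-out is below. All types and the submit helper are hypothetical stand-ins for the real CLIO machinery; the point is only that one request is duplicated to every mirror and per-mirror failures are remembered.

{code:c}
/* Illustrative only: fanning one DIO request out to every online mirror. */
#include <stdint.h>
#include <stddef.h>

struct dio_req {
	uint64_t    offset;
	size_t      count;
	const void *buf;
};

/* Placeholder for "send this DIO to the objects of one mirror". */
static int submit_dio_to_mirror(const struct dio_req *req, int mirror_idx)
{
	(void)req;
	(void)mirror_idx;
	return 0;	/* pretend the write succeeded */
}

/* Duplicate the request to all mirrors; remember which ones failed. */
static uint32_t dio_write_all_mirrors(const struct dio_req *req, int nr_mirrors)
{
	uint32_t errored = 0;

	for (int i = 0; i < nr_mirrors; i++)
		if (submit_dio_to_mirror(req, i) < 0)
			errored |= 1u << i;

	return errored;	/* reported to the MDS in the cancel reply */
}
{code}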

Additional Client Requirements
The client will also need to:

Hold the active writer lock until all IO has been committed to the OSTs, so it is known to be safe

For DIO this is easy - all DIO is immediately committed
For buffered, this will be more work - we'll need to track pages differently.


Fail write-like operations quickly, similar to how the client already fails reads quickly when other mirrors are available
Record all of the errors as appropriate so it can communicate them back to the server in the cancel reply


Notes

Staling During Writes & Primary Replica Selection


We have chosen to keep only one mirror readable while writing is going on, even though we will write to all of the mirrors. This is because all clients must see the same data when they read the file, and we cannot guarantee this with multiple readable mirrors without a complex transactional approach.
A write may have arrived at one mirror but not yet at another, so two clients reading from different mirrors could see different data. This cannot be avoided unless we implement distributed transactions and rollback, which is not practical. (We could try locking all OSTs together before writing, but direct IO does not use locks and other problems come up in failure cases.)
So we keep a single primary replica readable during writes, even though we always try to update all replicas.
Note: The 'simple' approach would be to lock all mirrors before doing any writing and unlock them together. However, this is not possible for DIO, which does not use LDLM locks on the client, but instead only uses them locally on each separate server.

Active Writer Lock Lifetimes

The active writer lock will be cached on the client so it can be reused by multiple writes (so that every write does not generate a separate MDS lock request), but the cache time will be deliberately kept short (e.g., 1 or 5 seconds).
This is because the presence of Active Writer locks prevents the MDS from considering the secondary mirrors in sync.
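
A sketch of the intended client behaviour (the hold time and all names are assumptions): the lock is reused while writes keep arriving and released shortly after the last one, so the MDS can re-seal the layout promptly.

{code:c}
/* Illustrative only: short-lived caching of the active writer lock. */
#include <stdbool.h>
#include <time.h>

#define AW_LOCK_HOLD_SECONDS 1	/* deliberately short: 1-5 s per the text */

struct aw_lock_cache {
	bool   held;
	time_t last_use;
};

/* Called whenever a write reuses the cached lock. */
static void aw_lock_used(struct aw_lock_cache *c)
{
	c->held = true;
	c->last_use = time(NULL);
}

/* Periodic check: cancel the lock once it has been idle long enough. */
static bool aw_lock_should_release(const struct aw_lock_cache *c, time_t now)
{
	return c->held && now - c->last_use >= AW_LOCK_HOLD_SECONDS;
}
{code}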
