Overview
The core idea of the design is this: (WRITE BRIEF SUMMARY)
Core Design Concept
Before starting any write operation (mmap write, setattr->truncate, etc.), clients will take an active writer lock on a special MDS IBITs bit called ACTIVE_WRITERS. This lock will be a CW lock. The MDS will use this lock to determine if there are any writers on a file.
Layout handling then proceeds normally: If the layout is RDONLY, the client will send a write intent to the MDS, which will transition the layout to WRITE_PENDING and stale all replicas except one. See Staling During Writes & Primary Replica Selection.
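As a rough illustration of the write-intent transition described above, here is a minimal Python model. It is a sketch, not Lustre code: the `Layout` class, the `handle_write_intent` function, and the "first non-stale mirror" selection policy are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class LayoutState(Enum):
    RDONLY = "RDONLY"
    WRITE_PENDING = "WRITE_PENDING"


@dataclass
class Layout:
    """Hypothetical in-memory stand-in for a mirrored file layout."""
    mirror_ids: List[int]
    state: LayoutState = LayoutState.RDONLY
    stale: Set[int] = field(default_factory=set)
    primary: Optional[int] = None


def handle_write_intent(layout: Layout) -> int:
    """Model the MDS handling a write intent; returns the primary mirror id."""
    if layout.state is LayoutState.RDONLY:
        # Assumed policy: pick the first mirror that is not already stale.
        candidates = [m for m in layout.mirror_ids if m not in layout.stale]
        if not candidates:
            raise RuntimeError("no usable mirror to act as primary")
        layout.primary = candidates[0]
        # Stale every replica except the chosen primary.
        layout.stale = {m for m in layout.mirror_ids if m != layout.primary}
        layout.state = LayoutState.WRITE_PENDING
    # If the layout is already WRITE_PENDING, nothing changes.
    return layout.primary
```

A second write intent on a layout that is already WRITE_PENDING leaves the layout untouched and returns the same primary.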
Write Operation Flow
The client releases the active writer lock only once the write operation is complete, meaning the data has been committed to disk on the OSTs and is known to be durable. This is critical: while any active writer lock is held, the MDS cannot consider the secondary mirrors in sync, so the lock must not be dropped before the data is safe. The client may also keep the lock briefly (e.g., 1 or 5 seconds) so that subsequent writes can reuse it. See Active Writer Lock Lifetimes.
Once all writes have completed (with success or error), all clients release their active writer locks. Each client reports the success or failure of its writes to the MDS in the cancellation reply, combining the results of all writes that occurred under that lock. Once the MDS sees that no active writer locks remain, it transitions the layout back to RDONLY and unstales the mirrors that had no errors.
Example Scenario
Consider one client, three mirrors, and three writes, where:

Write 1 fails on mirror 0
Write 2 fails on mirror 1

The client will report errors on mirror 0 and mirror 1. The MDS would then unstale only mirror 2 and mark the other two mirrors as errored. (Userspace can try to resync to these mirrors; if that fails, new mirrors will need to be added.)
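The scenario above can be worked through in a few lines of Python. This is only a sketch of the bookkeeping: the function names and the representation of each write as "the set of mirror ids it failed on" are assumptions made for illustration.

```python
def combine_write_errors(write_results):
    """Client side: union of mirror ids that failed in any write under one lock."""
    failed = set()
    for failed_mirrors in write_results:
        failed |= set(failed_mirrors)
    return failed


def resolve_mirrors(mirror_ids, failed):
    """MDS side: split mirrors into those to unstale and those to mark errored."""
    in_sync = [m for m in mirror_ids if m not in failed]
    errored = [m for m in mirror_ids if m in failed]
    return in_sync, errored


# Write 1 failed on mirror 0, write 2 failed on mirror 1, write 3 had no failures.
writes = [{0}, {1}, set()]
failed = combine_write_errors(writes)
in_sync, errored = resolve_mirrors([0, 1, 2], failed)
print(in_sync, errored)  # prints: [2] [0, 1]
```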
{info}
Note: If writes fail on all of the mirrors during an IO, we will have to decide what to do; probably the first step is to determine whether any of the mirrors can currently be reached.
{info}
Technical Implementation Requirements
This will require a new type of LDLM_CANCEL reply, one which includes the LVB so the client can communicate this info.
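To make the idea concrete, the sketch below packs a list of failed mirror ids into a small byte buffer and unpacks it again. The format (a little-endian 16-bit count followed by 16-bit mirror ids) is purely hypothetical and is not an existing Lustre LVB or wire format; it only shows the kind of information the cancel reply would need to carry.

```python
import struct


def pack_cancel_lvb(failed_mirror_ids):
    """Encode failed mirror ids as: u16 count, then count u16 ids (hypothetical format)."""
    ids = sorted(set(failed_mirror_ids))
    return struct.pack(f"<H{len(ids)}H", len(ids), *ids)


def unpack_cancel_lvb(buf):
    """Decode the hypothetical payload back into a list of failed mirror ids."""
    (count,) = struct.unpack_from("<H", buf, 0)
    return list(struct.unpack_from(f"<{count}H", buf, 2))


payload = pack_cancel_lvb([1, 0, 1])
print(unpack_cancel_lvb(payload))  # prints: [0, 1]
```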
Failure Handling
MDS Eviction
Critically, if the MDS evicts a client that holds an active writer lock, the secondary mirrors cannot be unstaled, because we cannot know whether that client completed its writes successfully. In that case, we will assume that the client successfully wrote to the primary mirror.
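A minimal sketch of that decision, in Python, might look like the following. The function name and arguments are hypothetical, and the case where a lock holder is evicted and another client also reports a failure on the primary is deliberately left out, since it is not addressed here.

```python
def mirrors_to_unstale(mirror_ids, primary, reported_failures, evicted):
    """Return the mirror ids the MDS may treat as in sync once all locks are gone.

    reported_failures: mirror ids reported as failed in cancel replies.
    evicted: True if any active writer lock holder was evicted without reporting.
    """
    if evicted:
        # The evicted client's results are unknown, so only the primary
        # (assumed to have been written successfully) stays usable.
        return [primary]
    return [m for m in mirror_ids if m not in reported_failures]
```

With three mirrors and primary 0, an eviction leaves only mirror 0 in sync, while a clean run with a reported failure on mirror 1 leaves mirrors 0 and 2 in sync.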
Primary Mirror Write Failure
In case of a write failure to the primary mirror, the clients will attempt to complete all other writes and inform the MDS as usual. The MDS will then mark the primary mirror stale as part of the transition to RDONLY, and will select a new primary mirror when the next write intent request arrives. This should work transparently with minimal changes.
Mirror State Management
We will have to distinguish between:

Mirrors which are "stale due to error"
Mirrors which are "stale due to ongoing write"

The "stale due to error" mirrors are excluded when the MDS brings the layout back to normal. It may be best to add a new WRITE_ERRORS or WRITE_INCONSISTENT or similar flag to identify this.
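One way to picture the distinction is with two flag bits per mirror. The Python sketch below uses the WRITE_ERRORS name floated above; the `MirrorFlag` type and the `return_to_rdonly` helper are assumptions for illustration and do not correspond to existing layout flags.

```python
from enum import Flag, auto


class MirrorFlag(Flag):
    NONE = 0
    STALE = auto()         # not readable right now
    WRITE_ERRORS = auto()  # hypothetical: stale because a write to it failed


def return_to_rdonly(flags_by_mirror):
    """Clear STALE on mirrors without write errors; keep errored mirrors stale."""
    result = {}
    for mirror_id, flags in flags_by_mirror.items():
        if flags & MirrorFlag.WRITE_ERRORS:
            result[mirror_id] = flags | MirrorFlag.STALE
        else:
            result[mirror_id] = flags & ~MirrorFlag.STALE
    return result
```

A mirror that was stale only because of the ongoing write comes back with no flags set, while a mirror carrying WRITE_ERRORS stays stale until it is resynced or replaced.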
{warning}
We will also want to notify lamigo or similar when a mirror becomes stale due to error.
{warning}
Client Implementation
The client will need to duplicate all writes and write-like operations to all of the mirrors. This will involve substantial work on the CLIO side to do correctly.
{info}
Note: Ideally we would not have to compress or encrypt data more than once when sending to multiple servers, but this may not be avoidable.
{info}
Direct I/O vs Buffered I/O

Direct I/O: This should be relatively straightforward, as DIO does not interact with the page cache and can easily duplicate IO requests (if we do not mind duplicating some work).
Buffered I/O: This will be much trickier, because simply repeating the operation would hit the same pages in the page cache. How to handle this is an open question.

Additional Client Requirements
The client will also need to:

Hold the active writer lock until all IO has been committed to the OSTs, so it is known-safe

For DIO this is easy - all DIO is immediately committed
For buffered, this will be more work


Fail write-like operations quickly, similarly to how it fails reads when there are other mirrors
Record all of the errors as appropriate so it can communicate them back to the server in the cancel reply
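The error-recording requirement can be sketched as per-lock bookkeeping on the client. Everything here is hypothetical (the class name, the use of plain callables as stand-ins for per-mirror writes, and `OSError` as the failure signal); it only illustrates duplicating a write to every mirror and accumulating the failures for the cancel reply.

```python
class ActiveWriterLockState:
    """Hypothetical per-lock bookkeeping kept by the client."""

    def __init__(self):
        self.failed_mirrors = set()

    def mirrored_write(self, mirrors, data):
        """Send the same data to every mirror and record which ones failed.

        mirrors: dict of mirror_id -> callable(data) that raises OSError on failure.
        Returns the number of mirrors written successfully.
        """
        succeeded = 0
        for mirror_id, write_fn in mirrors.items():
            try:
                write_fn(data)
                succeeded += 1
            except OSError:
                # Do not retry here: fail fast and report at lock cancel.
                self.failed_mirrors.add(mirror_id)
        return succeeded
```

The accumulated `failed_mirrors` set is what the client would report to the MDS when the lock is cancelled.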

Staling During Writes & Primary Replica Selection
We have chosen to keep only one mirror readable while writing is going on, even though we will write to all of the mirrors. This is because all clients must see the same data when they read the file, and we cannot guarantee this with multiple mirrors without a complex transactional approach.
A write may have arrived at one mirror but not yet at a different one, and two clients reading could see different data. This cannot be avoided unless we implement distributed transactions and rollback, which is not practical. (We could try locking all OSTs together before writing, but direct IO does not use locks and other problems come up in failure cases.)
So we keep a single primary replica readable during writes, even though we always try to update all replicas.
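The toy Python simulation below shows the problem: a write has reached mirror 0 but not yet mirror 1. If both mirrors are readable, two readers can see different data; if only the primary is readable, they agree. The `read_any` helper and the way readers pick a mirror are assumptions made purely for illustration.

```python
def read_any(mirrors, readable, reader_choice):
    """Read from whichever readable mirror this reader happens to pick."""
    return mirrors[readable[reader_choice % len(readable)]]


mirrors = {0: b"old", 1: b"old"}
mirrors[0] = b"new"  # the write has reached mirror 0 but not mirror 1 yet

# Both mirrors readable: two readers may observe different contents.
seen_multi = {read_any(mirrors, [0, 1], r) for r in range(2)}

# Only the primary readable: every reader observes the same contents.
seen_primary = {read_any(mirrors, [0], r) for r in range(2)}

print(len(seen_multi), len(seen_primary))  # prints: 2 1
```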
{info}
Note: The 'simple' approach would be to lock all mirrors before doing any writing and unlock them together. However, this is not possible for DIO, which does not use LDLM locks on the client, but instead only uses them locally on each separate server.
{info}
Active Writer Lock Lifetimes
The active writer lock will be cached on the client so that it can be reused by multiple writes (so that not every write generates a separate MDS lock request), but the cache time will be deliberately kept short (e.g., 1 second or 5 seconds).
This is because the presence of Active Writer locks prevents the MDS from considering the secondary mirrors in sync.
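A minimal sketch of such a short-lived cache is shown below, in Python. The class, its method names, and the explicit `now` timestamps are all assumptions for illustration; a real client would tie this to its existing lock cancellation machinery.

```python
class CachedWriterLock:
    """Hypothetical client-side cache for one active writer lock."""

    def __init__(self, hold_seconds=1.0):
        self.hold_seconds = hold_seconds
        self.held = False
        self.active_writes = 0
        self.idle_since = None

    def start_write(self, now):
        """Begin a write; returns True if a new MDS lock request is needed."""
        need_request = not self.held
        self.held = True
        self.active_writes += 1
        self.idle_since = None
        return need_request

    def finish_write(self, now):
        """Call only after the data is committed on the OSTs."""
        self.active_writes -= 1
        if self.active_writes == 0:
            self.idle_since = now

    def maybe_release(self, now):
        """Release the lock if it has been idle for at least hold_seconds."""
        if (self.held and self.active_writes == 0
                and now - self.idle_since >= self.hold_seconds):
            self.held = False
            return True
        return False
```

A write that starts within the hold window reuses the cached lock and resets the idle timer, while a write that starts after release triggers a fresh MDS request.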
