1. Overview & Design Summary
1.1. FLR Today
In the current implementation of FLR (introduced in Lustre 2.11), the system uses a "delayed write" approach for maintaining file mirrors. This means:
Writing to Mirrored Files: When a client writes to a mirrored file, only one primary (preferred) mirror is updated directly during the write operation. The other mirrors are simply marked as "stale" to indicate they're out of sync with the primary mirror.
Today, we have manual synchronization: after a write, the lfs mirror resync command must be run to synchronize the stale mirrors with the primary mirror. This command copies data from the synced mirror to the stale mirrors and removes the stale flag from successfully copied mirrors.
Layout state and staleness are managed through a careful series of layout state changes, which are described in the file-level replication state machine in the Core Design Concept section of this document.
This delayed write approach was implemented in the first phase of FLR to avoid the complexity of maintaining consistency across multiple mirrors during concurrent writes. By updating only one mirror during writes and marking others as stale, the system maintains a consistent view of the file data, at the cost of requiring explicit synchronization after writes complete.
The current implementation does not provide immediate redundancy, since only a single mirror is up to date until an explicit resync operation is performed. This approach enables things like hot pool synchronization, but the lack of immediate write redundancy severely limits the use cases.
1.2. Immediate Mirroring
We have a requirement to do immediate mirroring on Lustre files, where all writes (and related ops) are replicated to multiple mirrors immediately. The infrastructure created for this will also be used for immediate erasure coding.
The goal is to have redundancy immediately, but it is not required to have all mirrors available during writes - while writes are in flight, it is not possible to guarantee all mirrors will provide a consistent state without MVCC (multi-version concurrency control). With MVCC, it is possible to solve this problem by specifying a particular 'stable' version for reads during a write, but Lustre's storage backends (ldiskfs, ZFS) do not provide this functionality. Instead, during writes, only a single mirror will be available for reads.
The core idea of the design is this:
Clients will inform the MDS when they are writing (take an "active writer" lock), the MDS will "unseal" the layout, allowing writes (FLR Write Pending state). The clients will send all writes to all online mirrors. A single primary mirror will be selected for reads during this time, since we cannot guarantee all mirrors are identical during writes. This mirror is the "write leader", which will also be used for write ordering by taking locks in this mirror first. Other mirrors are marked stale during writes. (If we did not, different clients could see differing file contents, which is unacceptable.)
Clients will hold the active writer lock until they have written all data to disk and possibly for slightly longer to allow for reuse. If a client experiences a write error, it will finish all writes currently in flight (syncing to disk), then return the active writer lock to the MDS, with information about the write failure (primarily which mirror failed, but also the write extent). On receipt of an error, the MDS will request the active writer lock in EX mode, which forces all clients to flush any existing writes and not start new ones until the lock can be re-acquired. If no error occurs, the clients will return all active writer locks to the MDS shortly after they have completed writing, then the MDS will take the active writer lock to ensure no further writes from clients.
In either case, once the MDS has the active writer lock granted, it will begin transitioning the layout back to RDONLY. Note/TODO: The MDS must do something about evicted clients to ensure they don't write stale data, as they may have writes inflight and will not receive the active writer lock cancellation. This can probably be modeled on the approach taken by mirror resync - if nothing else, the MDS can take data extent locks on the entire file to force flushing. We could probably only do this on eviction, so it would not be too painful. If no write errors were reported, the MDS can simply un-stale the secondary mirrors and transition the layout back to RDONLY. If write errors were reported, the MDS will mark the errored mirrors as INCONSISTENT (a special version of stale, which is only cleared by a full data resync), and will notify userspace (probably via changelog) to attempt repair or replacement. INCONSISTENT mirrors will be treated like STALE mirrors, and not used for anything until they have been repaired.
If the primary write mirror becomes unavailable during writes, the clients will inform the metadata server of write errors as normal. The metadata server will handle this the same as any error - the mirror is marked INCONSISTENT. The MDS will then select an in-sync mirror (where no writes failed) as the new primary for writes. If no mirrors completed all writes without error, there is a policy decision to make - we could either try to determine the "least" damaged mirror, or we could simply default to the previous write primary.
If a client is evicted from the MDS, it will be assumed to have failed writes to all mirrors other than the current write leader. This is discussed further in FailureHandling. MDS failure is also discussed there - certain information will have to be persisted to allow failover/recovery of the MDS.
The last unaddressed issue is write ordering on the OSTs. If there are concurrent overlapping writes (or fallocate or truncate operations), the order in which they complete is indeterminate, and since we want to write all mirrors in parallel, the ordering could be different on different mirrors. This would result in mirrors containing different data after a write phase, which is obviously unacceptable.
Initially, we will solve this by using the write leader mirror for locking, and requiring that locks always be taken first on this mirror. TODO: handling the case where the write leader fails mid-phase needs consideration; it is hopefully just a matter of failing to acquire the lock, or, if the lock has been acquired and an error then occurs on the write, reporting this error to the MDS, which will wind up that write phase.
One major downside of this approach is that it requires us to use LDLM locks for direct IO, which has a significant cost for shared file writes. In order to avoid this, we will need a way to track the order of write (and other data-affecting) operations on OSTs and communicate this information to the MDT at the end of a write phase.
We have a tentative plan for this described in: TODO PUT AT THE END OF THE DOCUMENT
Our plan is to use 'chained' RPC checksums. We will take care to ensure that all write (and related) RPCs to different mirrors are identical - note this requires the mirrors to have identical layout geometry. Then, when a write phase opens and the OST is notified to update the layout generation on the stripe (which is done as part of FLR today), we will inform it that the stripe object is part of an immediate mirror file. It will take the write RPC checksum from each write RPC and 'chain' them together as they are committed (write commits can be ordered by their journal transaction number, even if they are occurring in parallel). The result of this is a single checksum value which encodes both the writes and their ordering. We can include non-write operations in this by checksumming their arguments (eg, the byte range and fallocate op type for fallocate).
The write primary mirror will be the 'correct' ordering, to which the others are compared. If ordering on a secondary mirror disagrees with the primary, that mirror will be marked inconsistent (to be repaired later).
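As a rough illustration of the chaining, here is a minimal sketch. All names (imw_chain, imw_chain_update, imw_chain_matches) are invented for this sketch; a real implementation would run in the OST commit path, ordered by journal transaction number, and would reuse the existing BRW RPC checksum.

#include <stdint.h>

/* Running chained checksum for one stripe object, one write phase. */
struct imw_chain {
	uint64_t ic_value;
};

/* Fold one committed operation's RPC checksum into the chain
 * (simple 64-bit FNV-1a-style mix; the real hash is an open choice). */
static void imw_chain_update(struct imw_chain *chain, uint32_t rpc_csum)
{
	chain->ic_value = (chain->ic_value ^ rpc_csum) * 0x100000001b3ULL;
}

/* At the end of the write phase, the chain from each secondary mirror
 * is compared against the write leader; a mismatch means the mirror
 * committed a different set or order of operations. */
static int imw_chain_matches(const struct imw_chain *leader,
			     const struct imw_chain *mirror)
{
	return leader->ic_value == mirror->ic_value;
}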
We will need to determine how and if we want to apply this only to overlapping operations - that requires tracking the extent of all data-modifying operations as we proceed through a write phase, but it protects us from the case of non-overlapping writes causing an apparent inconsistency where none exists. Whether or not this is needed is an open design question.
This is described in detail in TODO <-- also need to cover recovery.
There are a number of recovery scenarios which are covered in more detail below.
1.3. Relationship to Erasure Coding
Immediate mirroring is separate from FLR Erasure Coding, and can be done partly in parallel. Erasure coding has no specific implications for immediate mirroring and will be covered in a different document.
2. Core Design Concept
Before starting any write/OST modifying operation (mmap write, setattr->truncate, etc.), clients will take an active writer lock on a special MDS IBITs bit called ACTIVE_WRITERS. This lock will be a CW lock so multiple clients can hold it at the same time. The MDS will use this lock to determine if there are any active writers on a file.
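For illustration, a minimal sketch of the lock request follows. The bit value and all names here are placeholders for this document (the real inodebits and lock modes live in the Lustre DLM headers); this is not existing code.

#include <stdint.h>

#define MDS_INODELOCK_ACTIVE_WRITER	0x800000ULL	/* hypothetical bit */

enum aw_lock_mode {
	AW_LCK_CW,	/* concurrent write: many client writers at once */
	AW_LCK_EX,	/* exclusive: taken by the MDS to fence all writers
			 * before transitioning the layout back to RDONLY */
};

struct aw_lock_request {
	uint64_t		alr_fid_seq;	/* file being modified */
	uint64_t		alr_fid_oid;
	uint64_t		alr_ibits;	/* MDS_INODELOCK_ACTIVE_WRITER */
	enum aw_lock_mode	alr_mode;	/* AW_LCK_CW from clients */
};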
Layout handling then proceeds as currently: If the layout is RDONLY, the client will send a write intent to the MDS, which will transition the layout to WRITE_PENDING and stale all replicas except one. See Staling During Writes & Primary Replica Selection.
FLR Immediate Mirroring State Machine
2.1. Write Operation Flow
As noted above, before starting a write (or other OST modifying operation), the client will request an AW lock from the MDS. It holds this lock throughout the writing process until all data has been committed to OST storage, at which point it can release the lock. (For direct IO, this is immediately after the write completes; for buffered IO writes, which are async, we will have to detect when the pages are committed.)
This is critical - the client will only release the active writer lock once the data is committed to disk on the OSTs, so it is known to be durable. The client may also keep the lock briefly (1 or 5 seconds) to allow reuse for subsequent writes. See Active Writer Lock Lifetimes.
The success or failure of writes is recorded and communicated to the MDS in the cancellation reply, combining across all writes which occurred under that lock. (NB: If an error occurs, the client must stop using the lock for new writes, but several writes may be in progress.)
When the MDS has no more active writer locks, it will examine the errors reported, and transition the layout back to RDONLY, un-stale-ing mirrors where there were no write errors.
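A sketch of that decision, assuming a hypothetical per-write-phase structure on the MDS in which the errors reported by clients are accumulated as a per-mirror bitmap:

#include <stdint.h>
#include <stdbool.h>

struct imw_write_phase {
	unsigned int	iwp_nr_mirrors;
	unsigned int	iwp_primary;	/* write leader index */
	uint32_t	iwp_error_mask;	/* bit i set: some client reported a
					 * write/commit error on mirror i */
};

/* True if mirror i can be un-staled on the transition back to RDONLY. */
static bool imw_mirror_can_unstale(const struct imw_write_phase *wp,
				   unsigned int i)
{
	/* The write leader stays in sync; a secondary is un-staled only
	 * if no client reported an error against it.  Errored mirrors
	 * keep STALE and additionally become INCONSISTENT. */
	return i == wp->iwp_primary || !(wp->iwp_error_mask & (1U << i));
}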
Example Scenario
With one client, if there are three mirrors and three writes:
If write 1 errored to mirror 0
If write 2 errored to mirror 1
The client will report errors on mirror 0 and mirror 1 to the MDS as part of AW lock cancellation (see Client Implementation for more). The MDS would then un-stale only mirror 2 and mark the other two mirrors as inconsistent. (Userspace can try to resync to these mirrors, and if this fails, will need to add new mirrors.)
Note: If all the mirrors fail writes during an IO, we will have to decide how to handle this. See Open Problems.
2.2. Client Implementation
2.2.1. Modifying all mirrors - IO Duplication
The client will need to duplicate all writes and write-like operations to all of the mirrors. This will involve substantial work on the CLIO side to do correctly.
Note: Ideally we would not have to compress or encrypt data more than once when sending to multiple servers, but this may not be avoidable in the first version.
2.2.1.1. Direct I/O vs Buffered I/O
Direct I/O: This should be relatively straightforward, as DIO does not interact with the page cache and can easily duplicate IO requests (if we do not mind duplicating some work on the client).
Buffered I/O: This will be much trickier, since just repeating the operation would hit the same pages in the page cache. How to handle this is an open question.
2.2.1.2. IO Commit & Lock Holding
The client must hold the active writer lock until all IO has been committed to the OSTs, so it is known-safe on disk.
This applies to DIO & buffered writes, but also to other modifying operations (eg, setattr).
For DIO this is easy - all DIO is immediately committed
For buffered, this will be more work - we'll need to track pages differently than today, associating them with something that lives after the end of the write() syscall so it can track them to full commit on the OSTs.
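As a rough sketch of that tracking object (all names are illustrative, not existing code): something allocated per write that holds a reference to the active writer lock and is only torn down once the OSTs have confirmed commit of every page submitted under it.

#include <stdint.h>

struct aw_write_tracker {
	void		*awt_lock;		/* active writer lock handle */
	uint64_t	awt_pages_uncommitted;	/* submitted, not yet known
						 * committed on the OSTs */
	int		awt_error;		/* first error seen, if any */
};

/* The client only allows the active writer lock to be released (or its
 * cancel to be replied to) once awt_pages_uncommitted reaches zero for
 * every tracker associated with that lock. */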
2.2.2. Error Management
2.2.2.1. Fast-fail for Modifying Operations
The client must fail modifying operations quickly, similarly to how it fails reads when there are other mirrors. Today, the client will wait indefinitely attempting to do a write/modification operation, since the only other option is a hard failure.
2.2.2.2. Error Communication to Server
The client must record all of the errors as appropriate so it can communicate them back to the server in the cancel reply.
This will require a new type of LDLM_CANCEL reply, one which includes the LVB so the client can communicate this info. This will also require crafting the LVB structure to send over to the server with this information. It will also need CLIO changes to collect errors from all types of operations which use the Active Writer lock, since any modification failure on a mirror renders that mirror inconsistent.
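A possible shape for that LVB, purely as a sketch; the structure name, fields, and the optional extent fields are assumptions, and the real on-wire format would need the usual swabbing and versioning treatment.

#include <stdint.h>

struct aw_lvb {
	uint32_t	alvb_magic;		/* format/version check */
	uint32_t	alvb_flags;
	uint64_t	alvb_error_mirrors;	/* bitmap: bit i set means a
						 * write or commit to mirror i
						 * failed under this lock */
	uint64_t	alvb_error_start;	/* optional: lowest errored
						 * byte offset */
	uint64_t	alvb_error_end;		/* optional: highest errored
						 * byte offset */
};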
2.3. Failure Handling
2.3.1. Mirror States: Inconsistent vs Stale mirrors
Before we discuss error conditions, a note on mirror states.
We will have to distinguish between:
Mirrors which are "stale due to ongoing write", which are just stale
Mirrors which are "stale due to error", which I am tentatively calling inconsistent (an inconsistent mirror is always stale as well).
The inconsistent mirrors are excluded when the MDS unstales the mirrors (transitioning the layout back to RDONLY). We will need to add a new mirror flag to identify this, INCONSISTENT or WRITE_INCONSISTENT. INCONSISTENT mirrors are always stale, to benefit from existing handling for stale mirrors, but inconsistent mirrors are also ignored for new writes. The inconsistent flag is only cleared by a complete data resync to that mirror.
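Sketched as a component flag (the bit value here is a placeholder, not an allocated flag; the existing stale flag is defined in the Lustre user headers):

#define LCME_FL_INCONSISTENT	0x00000100	/* hypothetical value: stale due
						 * to a write error, cleared only
						 * by a full data resync */

/* An inconsistent mirror is always also marked stale, so code that only
 * understands STALE keeps treating it as unreadable:
 *	lcme_flags |= LCME_FL_STALE | LCME_FL_INCONSISTENT;
 * The MDS skips INCONSISTENT mirrors when it un-stales on the RDONLY
 * transition, and also skips them when selecting mirrors for new writes. */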
We will also want to notify lamigo or similar when a mirror becomes inconsistent so it can attempt to recover state.
2.3.2. Write/Update Error from Client
Clients will collect all write (or non-write update, eg, setattr) errors, associating them with the Active Writer lock. Note an error can be during initial write or while trying to commit a write on the OST. After an error, the client should stop allowing new writes to start under an existing Active Writer lock and return the current lock immediately (ie, no caching). The errors are communicated to the MDT in the lock value block which is sent to the MDT on the cancel. The MDT then marks all mirrors with errors as INCONSISTENT and does not un-stale them.
2.3.2.1. Client Eviction from OST
A client eviction from an OST will cause a write error (either during the first part of the write or when the client requests commit), which is reported to the MDT by the client when it cancels the Active Writer lock and is handled just like other write errors. Like any write error, this indicates this particular mirror is now INCONSISTENT and the server must mark it as such, to be resolved by a later RESYNC operation from lamigo or another userspace tool.
2.3.3. Client Eviction from MDT
Critically, if there is an MDT eviction of a client with an active writer lock, the mirrors cannot be unstaled because we cannot know if that client completed writes successfully. In that case, we will assume that client successfully wrote to the primary mirror and mark the other mirrors as inconsistent.
We could also possibly compare the OST commit vectors to check status - if any mirror has "all" the writes, it could be un-stale-ed. See Open Problems for more on the OST commit vector.
2.3.3.1. Client Loss
Client loss - where a client crashes or similar - is the same as MDT eviction, since such a client is evicted from the MDT and OSTs, and the MDT eviction takes priority over the OST evictions. All mirrors except the primary will be marked INCONSISTENT (or we could possibly use the write commit vectors from the OSTs to do better than this).
2.3.4. Primary Mirror Write Failure
In case of a write failure to the primary mirror, the clients will attempt to complete all other writes and inform the MDS as usual. The MDS will then mark the primary mirror stale as part of the transition to RDONLY, and will select a new primary mirror when the next write intent request arrives. It will do this by examining all the write statuses it received from active writers, and finding a mirror which did not experience any errors. See Open Problems for what to do if all mirrors had write failures, and Client Loss in this section for what to do if a client is lost entirely.
This should work transparently with minimal changes.
3. Notes
3.1. Staling During Writes & Primary Replica Selection
We have chosen to keep only one mirror readable while writing is going on, even though we will write to all of the mirrors. This is because all clients must see the same data when they read the file, and we cannot guarantee this with multiple mirrors without a complex transactional approach*.
A write may have arrived at one mirror but not yet at a different one, and two clients reading could see different data. This cannot be avoided unless we implement distributed transactions and rollback, which is not practical. (We could try locking all OSTs together before writing, but direct IO does not use locks and other problems come up in failure cases.)
So we keep a single primary replica readable during writes, even though we always try to update all replicas.
Note: The 'simple' approach would be to lock all mirrors before doing any writing and unlock them together. However, this is not possible for DIO, which does not use LDLM locks on the client, but instead only uses them locally on each separate server.
3.2. Active Writer Locks
The active writer lock will be cached on the client so it can be used by multiple writes (so every write does not generate a separate MDS lock request), but the LRU/cache time will be deliberately kept short (e.g., perhaps 1 second or 5 seconds). This should allow continuous writes from one process to work, but will not hold the lock for any length of time otherwise. This is because the presence of Active Writer locks prevents the MDS from considering the secondary mirrors in sync, so it reduces data reliability to cache them for any longer.
When an Active Writer lock is called back, the client will need to generate sync RPCs for all outstanding data for the file and confirm they have completed before it releases the lock. Note this also has to handle the case where some mirrors can't be written - these errors would also need to be sent back to the server.
4. Open Problems
4.1. Write Ordering of overlapping writes across multiple mirrors
If we have two clients writing the same region of the file with multiple mirrors, how can we ensure we get the same ordering in every case? If A and B are both writes to the same section of the file, how can we ensure that if A→B happens on one mirror, so B is the final result, we get the same order (A→B, resulting in B) on the other mirrors?
Buffered IO could do this by taking and holding all of the DLM locks*, but this can't be done for direct IO because direct IO does not use client-side DLM locks. (This is an important property of direct IO and not something we want to change.)
*Or by clever use of the primary write mirror's LDLM lock, but I think this has issues with primary mirror failure.
One suggestion is having the OSTs maintain a vector of committed writes & extents for Active Writer files. I believe the OSTs would only need to maintain this across a single layout generation. As part of transitioning an immediate mirror back to RDONLY, the server would request the write vectors from all of the OSTs in each mirror. It would then "collapse" these to include only sets of overlapping writes (every overlap would be tracked separately). It would then compare all mirrors to the current primary mirror, and any mirrors which committed overlapping writes in a different order would be marked INCONSISTENT and not-un-staled. These mirrors could be made consistent only by a full data resync (ie, a lfs mirror resync type operation).
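A sketch of what such a per-object commit vector might look like (names and layout are illustrative; ordering comes from the commit/transaction sequence on that OST):

#include <stdint.h>

struct imw_commit_rec {
	uint64_t	cr_start;	/* byte range modified */
	uint64_t	cr_end;
	uint64_t	cr_commit_seq;	/* order in which the op committed */
	uint32_t	cr_op;		/* write, truncate, fallocate, ... */
	uint32_t	cr_csum;	/* checksum of the operation */
};

struct imw_commit_vector {
	uint32_t		cv_layout_gen;	/* kept for one generation */
	uint32_t		cv_nr;
	struct imw_commit_rec	cv_recs[];	/* appended at commit time */
};

/* On the RDONLY transition, the MDS fetches each mirror's vector,
 * discards records whose extents overlap nothing else, and compares the
 * commit order of the remaining (overlapping) records against the write
 * leader; any mirror that disagrees is marked INCONSISTENT. */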
4.2. Active Writer Lock matching - Resolved
"If an error occurs, the client must stop using the active writer lock for new writes, but several writes may be in progress."
This is necessary because we need to communicate the error back to the server ASAP, so it can start reacting. This seems straightforward enough - we will start the cancellation process on the client so the lock cannot be matched by new writes. The issue is if writes continue, they will create new AW locks. These would not be blocked until the server asked for the EX lock to do the layout changes. This is probably acceptable, since that would flush any newly created AW locks and would probably prevent any more AW locks from being granted while it was in the waiting queue.
4.3. What if all mirrors fail?
What if write/modify operations fail to all mirrors? This could occur, and the MDS will receive this information. It will then have to decide what to do - marking all mirrors stale renders the file inaccessible, but of course no mirror is entirely "good". This should be rare, so we will probably want to notify the admin, but keep the file accessible (ie, do not mark every mirror stale/inconsistent) - we could take whichever mirror experienced the fewest errors or stick with our existing primary/write target mirror regardless of error counts, and mark all of the others stale + inconsistent.
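One possible "least damaged" selection, sketched with a hypothetical helper; the alternative policy is simply to keep the previous write leader regardless of error counts.

/* Pick the mirror to keep readable when every mirror saw errors. */
static unsigned int imw_pick_survivor(const unsigned int *error_counts,
				      unsigned int nr_mirrors,
				      unsigned int prev_leader)
{
	unsigned int best = prev_leader;
	unsigned int i;

	for (i = 0; i < nr_mirrors; i++)
		if (error_counts[i] < error_counts[best])
			best = i;

	return best;	/* every other mirror becomes STALE + INCONSISTENT */
}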
4.4. Replacing Mirrors/Permanent Failure
How do we decide a mirror is permanently failed and what do we do about it? It seems likely to me the answer is something like "report INCONSISTENT mirrors to lamigo, lamigo will try to resync those mirrors, but if this fails, will give up and add a new mirror". This is a bit tricky since adding new mirrors is only possible while the file is not being modified, so files which are written to often/constantly for a long time could lose durability if they lose a mirror, since we can't add a mirror during writes (without blocking for a long time).
Comments
Alex Zhuravlev
> If we have two clients writing the same region of the file with multiple mirrors, how can we ensure we get the same ordering in every case? If A and B are both writes to the same section of file - how can we ensure that if A→B happens on one mirror, so B is the final result, we get the same order (A→B, resulting in B) on other mirrors.
this is why we have primary replica which can order locking and writes.
Patrick Farrell
Yes, but I would like to keep the lockless writes for direct IO, and this would require dropping that entirely. Today, direct IO writes do not require client side locks, the server just requests a dlmlock locally. This is very important for shared file write performance.
We could possibly do always locked first and have lockless + reconciliation as an improvement.
Alex Zhuravlev
depends what exactly you want to save. if it's sync lock enqueue, then OST_WRITE (starting lockless) can return a granted lock to the client so the client can send data to replicas holding that lock and then release that lock asynchronously.
Patrick Farrell
Oh, that's ... interesting. The OST write RPC would return a dlmlock. That is an interesting protocol modification and seems like it would solve this problem.
The issue I wanted to avoid is having to request a dlmlock before every write, which costs a lot of performance, and I also want to avoid expanding dlmlocks, which generates false conflicts. Your suggestion would fix both of them.
Ah, one significant problem:
If the client does not wait for the lock before sending the other writes, it doesn't work. So we cannot send the writes in parallel. That would be painful for standard direct IO, since all writes must be synchronous. (Since we use a userspace buffer, we cannot return from write() until all data is synced because the user might modify the buffer.)
However, if we use the method from unaligned direct IO, that creates a kernel side buffer and is still pretty fast. So we could do only the first write synchronously and the others could be slightly delayed, but asynchronous.
Patrick Farrell
There are some issues in general with primary mirror as "locking leader" when the primary mirror becomes unavailable - it poses some interesting problems we have to think through, but should be doable.
Alex Zhuravlev
upon failure primary can be re-selected via re-enqueue a lock to MDS
Patrick Farrell
Yeah... OK. Interesting. I'll need to give some consideration to the required client changes and implications, but making direct IO asynchronous in some situations has been a tentative goal for some time, so it is not crazy to consider it.
RE: primary re-selected, yes, what I worry about is ordering in that case. I suppose any writes which already have a lock and have written successfully would complete, and any writes which do not have a lock yet would not. An interesting case is where the lock on the primary has been acquired but the write to the primary fails.
Alex Zhuravlev
strictly speaking a lock enqueue must be a small fraction of time you need to commit data to disk.
Patrick Farrell
I don't understand why that's relevant here? (Also it's probably not true for ethernet - SSDs can commit writes in O(100 microseconds) or faster if they have a cache, and network latencies can easily be O(1 millisecond))
Alexey Lyashkov
So do cl_locks with an M:N mapping need to be restored? Each mirror would return locks for its own region, but LOV operates on a file offset range which maps to a set of locks (depending on the mirror/layouts).
So each top-level lock would have to be mapped to a set (1:N) of LDLM locks. But that code was removed long ago as part of the CLIO simplification project.
As for lock enqueue - I think it's a small part of total IO time, so it would only be visible with very short IO requests.
Lock request order is also a question - it's easy to hit a deadlock if BRW RPCs arrive at the OST in random order and return locks, which is why locks are taken in strict order on the client side.
Qian Yingjin
We can choose one mirror as primary and only take the DLM extent lock from the primary mirror. As implemented in my patch, buffered I/O and direct I/O all submit to the primary, and async pages are also managed by the OSC objects of the primary mirror. Only when building the write I/O RPC (osc_build_rpc()) does the client call up into the LOV layer to build the I/Os (maybe with high priority to avoid RPC slot and grant issues), using the IOV pages prepared by the I/O of the primary object. For buffered I/O, only when all I/Os (to all mirrors) are finished can the client release (unpin) the pages in the IOV (clear PG_writeback).
This reduces the complexity of managing locks and asynchronous pages for each mirror, I think.
Alexey Lyashkov
>Only when build the write I/O RPC (osc_build_rpc()), the client up calls into LOV layer to build
This is a layering violation. OSC shouldn't call anything in LOV because of internal locking and LAYOUT lock interaction.
> async pages are also managed by the OSC objects of the primary mirror.
Once the client has lost the connection to the primary mirror, we have no way to cancel the locks until the connection is finished. But the lock might expire on the server and a different client could take it. A different write starts - since the mirror doesn't have the locks, it has no way to verify which write is good, which causes mixed writes and data corruption.
> We can choose one mirror as primary and only take the DLM extent lock from the primary mirror.
It breaks the "prolong lock" case: once IO has been sent to a different server, we have no way to extend the lock cancel timeout on the server, which is otherwise extended naturally by IO under this lock. If you have mirrored and normal IO for the same OST, you will want larger timeouts, which makes the server wait a long time in case the normal IO has an error. I know about green's feature to send a zero-sized RPC for the same purpose - but it makes the OST's life hard, since so many extra RPCs need to be processed.
Qian Yingjin
"
Once client had lost connection to the primary mirror we have no way to cancel a locks until connection finished. But lock might expired on the server and different client will take it. Different write started - once mirror don't have a locks, it have no way to verify which write is good, it caused a mix writes and data corruption.
"
I don't think this is a problem.
When the client loses the connection to the parity mirror and the write does not succeed, then I think the eviction of the client will trigger an inconsistency check and recovery. This recovery will involve the OSTs/MDT and the evicted client. The OST will report the evicted client to the MDT.
And I think we cannot avoid having failure recovery involve the OST/MDT/client in a networked environment in the IMW design.
The MDT will coordinate the whole recovery. It will increase the layout version and send it to all OST objects. All writes to OSTs from clients will carry the layout version. Writes to non-primary mirrors with the old, smaller version will be rejected, and data consistency recovery then starts.
Qian Yingjin
"
>Only when build the write I/O RPC (osc_build_rpc()), the client up calls into LOV layer to build
this is layering violation. OSC should be don't call anything on LOV because of internal locking and LAYOUT lock iteraction.
"
Please think about a layout with parity (such as RAID4 or EC). If we want to support immediate updates from the client for the data and parity, we cannot avoid this kind of up call to the LOV layer, and maybe we take the lock from the primary stripe, aligned to the stripe size?
Alexey Lyashkov
Sure - we can avoid it. The CL page would have a special parity page pointer and a special OSC page for it.
Any operation on a cl_page would also address the 'parity' page modification, and both extents would be submitted in the same stack.
Qian Yingjin
I do not think the parity page should be updated upon each data cl_page write; that would mean all parity updates are performed in RMW (read-modify-write) mode, where each data page write needs to read the old content of the parity and then update the parity with XOR operations.
I think we should be able to do a full-stripe write to update parity without reading the old content of the parity stripe. Maybe when OSC builds a write RPC, it can call up to LOV to try to merge enough contiguous cl_pages from the other OSC objects, accumulate a full IOV in the LOV layer, and then update the parity in full-stripe write mode.
Alexey Lyashkov
Oh... really? What if some stripes in the stride can't be read because we don't have access? Then there is no data for the rebuild. What about LDLM lock alignment? You may have a client with a 64k page size, but the server may be 4k - so locks will be 4k aligned, especially for the first/last page in the extent.
Andreas Dilger
The server should always align locks on the client PAGE_SIZE. It does this by keeping the same alignment on expanded locks as the requested locks on a per-export basis. So if the client requests a lock with 64KiB alignment, then the server will always grant it at least 64KiB aligned locks during lock expansion, and always at least the requested extent.
Alexey Lyashkov
It's not possible, because the server doesn't know the client's page size. It looks like you forget - the server is able to expand a lock to any range, especially when the requested region lies between two other locks, while other clients might have a 4k page size and their locks aligned to 4k.
Andreas Dilger
See https://github.com/lustre/lustre-release/blob/4b0407a48e9e5b7f17877e8868080e9f40c238c6/lustre/ldlm/ldlm_extent.c#L171
After lock expansion, the server will align the lock to be a multiple of the start/end alignment of the requested lock. If the client requested a lock aligned on 64KiB multiples, then the server will grant the lock with start/end aligned on 64KiB multiples. The LDLM server doesn't really know the PAGE_SIZE of the client (not until the LNet sparse read patch lands, at least), but it can make an assumption based on the lock request.
Alexey Lyashkov
As for the "full stripe write" case - LOV is the natural place to generate parity without any upcalls. Besides, the IO loop already splits a range into stripes and takes a lock for each stripe one by one; in this case the LDLM lock for the parity block should be held before each update to avoid deadlocks. So we might add an extra step to obtain a lock for the parity space and read data if needed (Lustre already does RMW when a write isn't page aligned - so there is no problem reusing this code if needed).
The main problem in this case is that we don't hold locks for all stripes during the IO - otherwise it might cause evictions. For example with 10 stripes, the first node writes to stripes 1-7 and the second node writes to stripes 3-4, so the first node would be blocked for an unknown time until the second finishes its own stripes.
Qian Yingjin
I still think we should only take DLM lock on the primary mirror, or the primary stripe in parity case (i.e. EC or RAID4).
The reason is that if we must take DLM locks on all mirrors and then perform I/O, we need to take the DLM lock from each mirror one by one, serially, to avoid possible deadlock (just like we take locks for the truncate operation). The delay to take DLM locks increases linearly with the mirror count (usually 2X or 3X), which will hurt performance a lot, I think.
So I suggest that for immediate mirror writes, we should only take the lock from the primary mirror.
In the case of a layout pattern with parity, we also suggest acquiring the DLM lock only from the primary stripe object, in units of (stripe_size * data_stripe_count).
Patrick Farrell
I agree with Yingjin - it will be a little tricky in CLIO, but I think better than taking locks on all the mirrors.
We will have to be careful with how it handles primary mirror failure, but that should be OK - I think we will have to cancel all of the active writer locks to do the transition to a new primary mirror. But there is some question about active writer lock vs layout lock here, still thinking.
Alexey Lyashkov
>In either case, once the MDS has the active writer lock granted, it will begin transitioning the layout back to RDONLY. Note/TODO: The MDS must do something about evicted clients to ensure they don't write stale data, as they may have writes inflight and will not receive the active writer lock cancellation. This can probably be modeled on the approach taken by mirror resync - if nothing else, the MDS can take data extent locks on the entire file to force flushing. We could probably only do this on eviction, so it would not be too painful.
It's not enough. The client might not know about the eviction because it is not able to connect to the MDT, but the MDT makes the eviction on an inactivity timeout. So a simple lock cancel will just flush dirty data and in-flight IO, but it will not be a barrier for new IO, since the client will be able to do an extent lock enqueue afterward. It looks like an object epoch will be needed, so that a client eviction cancels the client's extent locks and increases the object epoch, and new enqueues are blocked for layouts with the old epoch. The layout generation can't be used as the epoch in this case, since SEL produces many layout generation updates but all of them are compatible with the previous one. So you need to start by describing what the layout epoch is and how it will be updated (likely on destructive updates like component delete, layout swap, and client eviction with a mirror?).
Patrick Farrell
Layout epoch is something that exists today for FLR with data versioning - we will reuse that functionality. I will have to make sure I am 100% on exactly how it works, but the same problem exists today for non-immediate mirroring and is handled.
Andreas Dilger
IMHO, it would be better to mark the temporary STALE mirrors with an additional TEMP_STALE flag and then clear this and STALE if the writes complete successfully. If the writes fail, then the TEMP_STALE flag is removed, and the STALE flag is kept. This is instead of using INCONSISTENT. The benefit of this is that any existing client can clean up a STALE file (it doesn't care why it was marked stale), but older clients would not understand INCONSISTENT, so they could not clear it away afterward.
Alexey Lyashkov
Alex's idea about the epoch is much better. Once objects are marked with the next epoch, no client will be able to access them until the writing client finishes its own writes and the active epoch is increased.
Patrick Farrell
Right, what Andreas is describing is how the epoch will be implemented.
When a write starts, the client takes an active writer lock, which opens an epoch. During the epoch, only the primary mirror (the 'write leader') is readable because mirror states can be inconsistent (we do not know if all the writes have completed to all the mirrors). To do this, we mark the secondary mirrors as stale during the write epoch.
Then when the write epoch closes, the clients report any errors to the MDT (eviction from MDT is considered an error). Any mirrors without errors will have stale removed.
Qian Yingjin
I don't think the epoch should be started/ended by a client, as there is no persistent storage to record the epoch. When the entire system crashes, it is not known which files need data inconsistency recovery. It should be determined by the MDT during the write open().
As we discussed before in the email, the I/O epoch mechanism works as follows:
Alex Zhuravlev
given old clients can't maintain replicas I'd think MDS should "cancel" current IMW epoch immediately (stopping useless writes from new clients) and mark all non-primary replicas stale.
Ronnie Sahlberg
4.1 write ordering.
One way to guarantee this without any explicit locking could be to introduce two new flags for write, WRITE_PRIMARY and WRITE_SECONDARY, and some additional fields.
For each stripe the OST keeps an in-memory wsn (write sequence number) that is initialized to 0.
When a client writes to the primary mirror(/ost/stripe) it sets WRITE_PRIMARY, then OST writes the data, increments the wsn and returns it to the client in the reply.
Client then sends writes to all the non-primary mirrors with WRITE_SECONDARY plus the wsn for the write.
On the OST, when WRITE_SECONDARY is set, it uses the wsn to ensure all writes occur in the same order as they happened on the primary mirror/ost/stripe.
Of course, missing writes would need additions to handle recovery.
Qian Yingjin
This is much like what DAOS does: first write to the primary mirror and then the primary replicates the data to the slaves (which needs OST→OST replication for the data writes). We have discussed this, and prefer that the client writes all mirrors immediately at the same time.
Ronnie Sahlberg
I don't mean OST to OST replication but rather something like this:
1, application performs a write
2, lustre client sends a WRITE_PRIMARY to the primary mirror/ost/stripe
3, OST writes it, increments wsn and returns the new wsn to the client.
4, client creates a work_item to write to all the submirrors and returns to the application.
We return with write being successful to the application as soon as the primary mirror was written to.
This should have no impact on performance as the cost of the write is the same as today. A single write to a single OST.
5, The client has a work-queue and will asynchronously send all the queued writes to the submirrors.
It does so using WRITE_SECONDARY and passing the wsn to the OST.
The wsn allows the submirror OSTs to guarantee that the writes to the stripes happens in the same order as the original writes to the primary in 2,.
6, When the work_queue is empty then we know that all the writes have completed to all the mirrors and the client can return the write lock.
Now, something like this would also open up nicer granularity of detecting staleness, down to the stripe level.
If all writes have completed and if the mirror became stale, we can check the wsn for all stripes in the mirror and compare to the wsn for the stripes on the primary mirror.
Where it differs we know that this stripe is stale. Where they are the same we know that this stripe is in fact up to date and does not need to be re-synced.
That is a nice side effect. There are certain situations where we can use the wsn, if it is surfaced via an RPC, to get better granularity on staleness than "the whole mirror is stale".
Patrick Farrell
I've been giving a lot of thought to write duplication over the break, and I wanted to write it down here briefly as I integrate it into the document.
In brief, we have two practical options for sending IOs to all servers:
For direct IO, we can simply repeat the IO to different mirrors within the same write() call, because there is no page cache to worry about. This duplicates some work for compression and encryption, which do significant work on pages before putting them on the wire, but is otherwise fine.
For buffered IO, this is not an option - we can have only one copy of data in the page cache, and each page cache page must belong to only one OSC. (I think one page belonging to multiple OSCs is a very difficult and problematic approach for a variety of reasons.) So we cannot simply repeat the write operation internally.
Yingjin had proposed that the primary uses buffered IO, and then the secondary mirrors receive the IO via direct IO. This works well with the page cache and is fairly clean to implement, but I think has an unacceptable performance cost. A huge advantage of Lustre is our ability to aggregate small writes before sending them to the server, significantly reducing overhead. In this case, we would send even an 8 byte write directly over the network to the secondary mirrors. We could hide this latency from the user, but we could not hide the cost on the servers and network. (Yes, server side ldiskfs writeback cache could reduce the load on the disks, but not the RPC traffic.)
Instead, I'm thinking we should do RPC duplication. For immediate mirrored files, when an RPC is created, after the BRW page array is ready (post compression, post encryption), we generate a callback up to the LOV level to re-use that BRW array in RPCs to the other immediate mirrors. The BRW array would become a reference counted struct and not fully completed until all RPCs were done. This allows the page cache to work normally and also covers direct IO implicitly.
The RPC duplication approach has one major downside: The mirror layout geometry (component extents, stripe counts, stripe sizes) must be the same across all immediate mirrors, otherwise we cannot send the 'same' RPC to each target. (We could in theory split and re-use the BRW array across multiple non-identical RPCs, but this would notably increase the complexity of doing so. If we decide it's critical to allow different mirror geometries, we could do this later.)
So that's my current thinking - curious to hear feedback.
Patrick Farrell
Particularly interested in hearing from Andreas Dilger Sebastien Buisson Alexey Lyashkov and Qian Yingjin
Qian Yingjin
Please see my patch https://review.whamcloud.com/c/ex/lustre-release/+/59958. It does exactly what you said. For buffered I/O, the primary mirror does the buffered page aggregation. We call back up to the LOV layer (in osc_build_rpc, which has already accumulated enough pages for the primary mirror) to re-use the BRW page array for the other mirrors. Moreover, it can simply support heterogeneous mirror layouts (mirrors with different stripe sizes and stripe counts). And this I/O engine can be used for both buffered I/O and direct I/O.
Andreas Dilger
Qian Yingjin can you push a patch to master, otherwise Alexey and I cannot review it.
Qian Yingjin
FYI, the master branch version:
https://review.whamcloud.com/c/fs/lustre-release/+/63244 LU-13643 lov: new I/O engine for FLR immediate write
The main logic to build I/O from the BRW pages of the primary mirror to the other mirrors is in the LOV layer (lustre/lov/rf_kintf.c); the code can handle mirrors with heterogeneous layout configurations.
Andreas Dilger
Doing the background writes with DIO sounds very prudent. This would be the same way that delayed mirror writes use DIO to resync secondary mirrors today.
Being able to mirror to different file layouts is fairly important, considering that there may be cases where the allocated file layouts may not be identical even if they are intended to be the same. That said, there may be multiple reasons why it is desirable to have different layouts for mirrors, so if we can avoid that restriction it could be very useful.
Alex Zhuravlev
I was trying to implement page-in-multiple-OSCs for a few weeks and decided it's way too tricky with the current design; it's especially tricky to deal with flags like PG_dirty/PG_writeback properly. So finally I gave up and switched to another model where llite tracks changes in the form of extents and then a separate thread just does regular DIO (starting from the LDLM enqueue, which is supposed to be a no-op in the majority of cases given a lock has been granted already).
As for RPC duplication, I think this approach has its own problems - the IO pipe can be in a different state in different OSCs, and then we can't use the same pages in different OSCs simultaneously due to PG_* flags, grants, etc.
Patrick Farrell
So, the idea about RPC duplication is we would do it DIO style for the secondary mirrors. So most of those problems go away - we use the page cache pages as the source for DIO.