Introduction

Currently, the only way to get redundancy for a file (i.e. have it mirrored) is to run the lfs utility. The file is supposed to be closed at that time, otherwise the utility is interrupted and must be restarted from the beginning. If any application opens a file being replicated, the replication is interrupted as well.
Another problem with such offline mirroring is resource consumption: the replication needs to read data from OSTs (disk IOs, then network transfers) and send it back to other OSTs (network transfers, then disk IOs). For every 1MB of data to be replicated we need 1MB of disk/network transfer to write the original data, 1MB of disk/network transfer to read the original data back, and 1MB of disk/network transfer to write the replica: 3x disk/network bandwidth.
In contrast, immediate mirroring should let clients maintain the replica(s) while the file is being used by regular applications, saving one disk/network transfer: 2x disk/network bandwidth instead of 3x.

Functional Specifications

The basic idea of the design is to let clients send modifications to the original objects and their replicas at roughly the same time. Upon file close, the MDS collects the results of all modifications and makes a final decision on whether the file is fully replicated.
When a replicated file is opened for write, we choose one primary and one or more secondary replicas from the in-sync replicas. Depending on the results of the actual operations, the primary and all secondary replicas may become up-to-date replicas again. The primary replica is used for changes and reads. Secondary replicas receive all changes but are not used for reads.

Changes to the disk layout

A layout component (representing a mirror) can be additionally flagged with:
● LCME_FL_PRIMARY: the component is used as the primary, and extent locking goes to this component initially
● LCME_FL_PARTIAL: the component is involved in IMW and clients are supposed to send updates to it; so far all clients reported that all changes succeeded on this component
● LCME_FL_FAILED: at least one client reported a failed operation on this component
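A minimal sketch of how these flags could sit next to the existing LCME_FL_* component flags; the numeric values are placeholders, not assigned bits:

```c
/* Sketch only: proposed IMW additions to the LCME_FL_* component flags.
 * The values are illustrative placeholders; the real ones would be the
 * next free bits in enum lov_comp_md_entry_flags. */
#define LCME_FL_PRIMARY		0x00010000	/* primary replica; extent
						 * locking starts here */
#define LCME_FL_PARTIAL		0x00020000	/* participates in IMW; no
						 * failures reported so far */
#define LCME_FL_FAILED		0x00040000	/* at least one client reported
						 * a failed operation */
```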

Examples of layout transitions


Per-component state: U – up-to-date (not a real flag, just the absence of the Stale flag); S – stale; P – partially mirrored; F – mirroring failed.
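The "after open" rows in the tables below can be summarized as a small piece of role-selection logic. This is a hypothetical sketch (the function and parameter names are not existing code), assuming the first up-to-date mirror is taken as the primary:

```c
/* Hypothetical sketch of role assignment when a mirroring epoch starts:
 * the first up-to-date mirror becomes primary, the remaining up-to-date
 * mirrors become secondaries (marked Stale for the epoch), and mirrors
 * that were already stale are marked Failed for this epoch. */
static void imw_assign_roles(__u32 *comp_flags, int nr_mirrors)
{
	bool have_primary = false;
	int i;

	for (i = 0; i < nr_mirrors; i++) {
		if (comp_flags[i] & LCME_FL_STALE) {
			/* already stale: cannot catch up inside an IMW epoch */
			comp_flags[i] |= LCME_FL_FAILED;
		} else if (!have_primary) {
			comp_flags[i] |= LCME_FL_PRIMARY;
			have_primary = true;
		} else {
			/* secondary: stays Stale until the epoch completes */
			comp_flags[i] |= LCME_FL_STALE;
		}
	}
}
```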

LOVEA        | Component 1        | Component 2       | Component 3       | Component 4
-------------|--------------------|-------------------|-------------------|------------
before open  | Uptodate           | Uptodate          | Uptodate          | Stale
after open   | Uptodate / primary | Stale / secondary | Stale / secondary | Failed
client 1     | Partial            | Partial           | Partial           |
client 2     | Partial            | Failed            | Partial           |
client 3     | Partial            | Partial           | Partial           |
client 4     | Partial            | Partial           | Failed            |
after resync | Uptodate           | Stale             | Stale             | Stale

The flags (as part of the component layout) are accumulated in the LOVEA during a mirroring epoch. Initially the file had three up-to-date mirrors (Components 1-3) and one stale mirror (Component 4). After the initial open for write the MDS updates the layout: Component 1 is now primary (up-to-date by definition), Components 2 and 3 are stale (they will be used for immediate replication), and Component 4 is failed for this epoch because it was already stale. Four clients write to the file and perform replication. At some point each of them reports its own result in the form of a layout with flagged components. Client 1 and client 3 successfully wrote Components 1-3, client 2 failed to write Component 2, and client 4 failed to write Component 3. Thus the new accumulated state is a non-replicated file. Since the flags are distinguishable, the updates can be accumulated in the LOVEA directly.
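As a hedged illustration of how the MDS might fold a client-reported component flag into the accumulated LOVEA state (the helper name is hypothetical): a single Failed report makes the component failed for the whole epoch, otherwise Partial reports simply accumulate.

```c
/* Hypothetical merge of a client-reported component flag into the state
 * stored in the LOVEA for the current mirroring epoch. */
static __u32 imw_merge_comp_flags(__u32 stored, __u32 reported)
{
	__u32 merged = stored | reported;

	/* any Failed report wins over Partial for this epoch */
	if (merged & LCME_FL_FAILED)
		merged &= ~LCME_FL_PARTIAL;
	return merged;
}
```

In the table above this turns Components 2 and 3 into Failed once client 2 and client 4 report their errors, leaving only the primary up to date after resync.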

A luckier case:

LOVEA        | Component 1        | Component 2       | Component 3       | Component 4
-------------|--------------------|-------------------|-------------------|------------
before open  | Uptodate           | Uptodate          | Uptodate          | Stale
after open   | Uptodate / primary | Stale / secondary | Stale / secondary | Failed
client 1     | Partial            | Partial           | Partial           |
client 2     | Partial            | Partial           | Partial           |
client 3     | Partial            | Partial           | Partial           |
client 4     | Partial            | Partial           | Failed            |
after resync | Uptodate           | Uptodate          | Stale             | Stale

This time Component 2 was successfully written by all clients, so the file ends up replicated.

Changes to the network protocol

Inodebit locks get an additional bit, MDS_INODELOCK_IO. This lets the resync agent notify the clients involved in immediate mirroring to stop the corresponding activity and report their status back.
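A sketch of the proposed bit; the value shown is a placeholder, the real one would be the next free bit in the MDS_INODELOCK_* set:

```c
/* Sketch only: proposed new inodebit for IMW (placeholder value). */
#define MDS_INODELOCK_IO	0x000080	/* client actively modifies a
						 * mirrored file in this epoch */
```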
Abbreviations: IL – IO lock, LV – layout version.

Client-MDS interaction

When the first client is going to make changes to a mirrored file, it follows the replication protocol and enqueues a LAYOUT lock with LAYOUT_INTENT_WRITE. This initiates a new mirror epoch for the file:


Another client opening this file finds the layout with the primary replica marked and therefore requests a LAYOUT+IO lock to notify the MDS about a new source of changes. To save an enqueue RPC, the MDS can grant the IO lock to the client at open time when the file is already opened by other client(s), as is done for Data-on-MDT.
A client doesn't cancel its IO lock on its own, only upon the MDS's request. The MDS cancels clients' IO locks in a few cases:


If a client can't complete/commit its IO before the cancel timeout expires, it reports the current replication status as is (incomplete means failed) and the corresponding replicas are marked stale in the end.
When the last client closes a file, the MDS initiates a resync (not really a resync; a better name is needed). The resync doesn't need to be synchronous in the context of close processing.
As with FLR, the resync procedure involves the following steps:

Any change to the layout (outside of the IMW flags) should cause the existing epoch to close and a new one to open: IL locks are cancelled and the current replication status is reported by all involved clients.
[Think about which replica we choose as up-to-date if all replicas, including the primary one, encountered a failure.]
[Think about which component the resync functionality belongs to: 1) it works with LOVEA details – LOD; 2) it works with inodebit/extent locks.]

Changes to CLIO

The basic idea is to push the same page (vmpage) into several OSCs and then let each OSC operate on those pages as usual (mostly).
When CLIO processes system calls (e.g. write), it creates so-called sub-IOs, originally intended to break the original IO into smaller IOs corresponding to stripes. We can reuse this mechanism for IMW, but this time LOV will duplicate the IO for each OSC involved.
The kernel provides CLIO with a set of pages (from the pagecache or userspace). Currently LOV pushes pages into the corresponding OSCs, which are calculated from the striping information (number of stripes, stripe size, objects). To support IMW it will push each page into several OSCs, and a few structures (e.g. struct cl_page, struct osc_page) need to be changed to enable that.
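As a rough illustration (not actual CLIO code; all helper names below are made up for the sketch), the LOV page-initialization path could attach one OSC sub-page per participating mirror to the same cl_page, so each OSC treats the shared vmpage as its own:

```c
/* Conceptual sketch only: set up per-OSC page state for every mirror that
 * takes part in IMW.  imw_mirror_count(), imw_osc_object() and
 * imw_osc_page_init() are placeholders for whatever the real LOV/OSC
 * entry points end up being. */
static int lov_imw_page_init(const struct lu_env *env,
			     struct cl_object *lov_obj,
			     struct cl_page *page, pgoff_t index)
{
	int i, rc = 0;

	for (i = 0; i < imw_mirror_count(lov_obj); i++) {
		struct cl_object *osc_obj = imw_osc_object(lov_obj, i);

		/* same vmpage, one sub-page state per participating OSC */
		rc = imw_osc_page_init(env, osc_obj, page, index);
		if (rc != 0)
			break;
	}
	return rc;
}
```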
LOV will need to recognize IMW-enabled layouts:

The results of all changes (OST_WRITE, OST_SETATTR, OST_PUNCH, OST_FALLOCATE) are tracked in LOV on a per-component basis. Llite is also aware of all changes (mmap?). Later, upon IO lock cancellation, llite fetches the LOVEA from LOV and sends it to the MDS to report the replication status; MDS_REINT_SETXATTR is used for this purpose.
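One possible shape for this per-component bookkeeping in LOV, sketched with hypothetical names: each completed OST operation records success or failure for its component, and the accumulated bitmaps are later turned into PARTIAL/FAILED flags when llite reports via MDS_REINT_SETXATTR.

```c
/* Hypothetical per-file IMW bookkeeping kept by LOV for the duration of a
 * mirroring epoch; names and placement are illustrative only. */
struct lov_imw_results {
	spinlock_t	lir_lock;
	__u32		lir_written;	/* bitmask: components with any IO */
	__u32		lir_failed;	/* bitmask: components with failed IO */
};

/* Called on completion of OST_WRITE/OST_SETATTR/OST_PUNCH/OST_FALLOCATE
 * for a given component. */
static void lov_imw_note_result(struct lov_imw_results *res, int comp, int rc)
{
	spin_lock(&res->lir_lock);
	res->lir_written |= BIT(comp);
	if (rc != 0)
		res->lir_failed |= BIT(comp);
	spin_unlock(&res->lir_lock);
}
```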
[In order to speed up development, add support for a non-striped LOV_PATTERN_RAID1 pattern: basically just a set of objects, with no support for replica flags, etc.]

Compatibility

If an old client opens a mirrored file, the MDS just doesn't start a mirroring epoch. If any client opens a mirrored file with no primary selected (as found in the LOVEA), the MDS doesn't start a mirroring epoch either. If an old client opens a mirrored file while a mirroring epoch is already running (as found in the LOVEA), the MDS should create the mfd and initiate a mirroring epoch abort immediately: cancel the IO bitlocks (it may make sense to notify the clients that this is an abort so they don't flush data?).
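The compatibility rules above can be condensed into a small decision helper; this is only a sketch with invented names and enum values:

```c
/* Hypothetical open-time decision for mirrored files, following the
 * compatibility rules above. */
enum imw_open_action {
	IMW_OPEN_NOOP,		/* do not start a mirroring epoch */
	IMW_OPEN_JOIN_EPOCH,	/* start or join the mirroring epoch */
	IMW_OPEN_ABORT_EPOCH,	/* old client: abort the running epoch */
};

static enum imw_open_action imw_open_action(bool client_supports_imw,
					    bool primary_selected,
					    bool epoch_running)
{
	if (!primary_selected)
		return IMW_OPEN_NOOP;
	if (!client_supports_imw)
		return epoch_running ? IMW_OPEN_ABORT_EPOCH : IMW_OPEN_NOOP;
	return IMW_OPEN_JOIN_EPOCH;
}
```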

Tasks

MDS

Client-MDS interaction:

Resync functionality

llite


[Probably we can send a shrunken version of the LOVEA containing only the updated components?]

CLIO


Utils


Implementation steps


  1. Initial mirroring: single client, no errors, no concurrency
  2. Errors at write, errors at commit
  3. Concurrency: several clients, overlapping changes, open at resync, changes to the layout during IO

Use Cases









Problems: