Instant Mirror Writes high level design

Introduction

Currently the only way to get redundancy for a file (i.e. to have it mirrored) is to run the lfs utility. The file is supposed to be closed at that time, otherwise the utility interrupts and must be restarted from the beginning. If any application opens a file that is being replicated, the replication interrupts as well.
Another problem with such offline mirroring is resource consumption, as the replication needs to read the data from the OSTs (disk IOs, then network transfers) and then send it back to other OSTs (network transfers, then disk IOs). For every 1MB of data to be replicated we need 1MB of disk/network transfer to write the original data, 1MB to read the original data back and 1MB to write the replica - 3x disk/network bandwidth.
In contrast, immediate mirroring should let clients maintain the replica(s) while the file is being used by regular applications and save one disk/network transfer - 2x disk/network bandwidth instead of 3x.

...

A layout component (representing a mirror) can be additionally flagged with the following; a sketch of how these flags might be handled on a client follows the list:
● LCME_FL_PRIMARY: the component will be used as primary and extent locking will go to this component first
● LCME_FL_PARTIAL: the component is involved in IMW and clients are supposed to send updates to it; so far all clients reported that all changes succeeded on this component
● LCME_FL_FAILED: at least one client reported a failed operation on this component
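A minimal sketch, assuming the flags are plain bits in the per-component entry, of how a client could track the outcome of its own changes while the epoch is open. The IMW_FL_* names, bit values and structures are illustrative stand-ins, not the actual LCME_FL_* definitions.

    /* Sketch only: bit values and struct layout are assumptions, not the
     * real lustre_user.h definitions. */
    enum imw_comp_flags {
        IMW_FL_STALE   = 0x01,  /* replica is known to be out of date */
        IMW_FL_PRIMARY = 0x02,  /* extent locking starts with this component */
        IMW_FL_PARTIAL = 0x04,  /* all changes reported so far succeeded here */
        IMW_FL_FAILED  = 0x08,  /* at least one client reported a failure here */
    };

    /* Per-component state a client keeps while the mirroring epoch is open. */
    struct imw_comp_state {
        unsigned int ics_id;    /* component id within the layout */
        unsigned int ics_flags; /* combination of the bits above */
    };

    /* Record the outcome of one change (write/setattr/punch) on a component:
     * success keeps PARTIAL set, any failure switches the component to FAILED. */
    static void imw_comp_record_result(struct imw_comp_state *cs, int rc)
    {
        if (rc == 0) {
            if (!(cs->ics_flags & IMW_FL_FAILED))
                cs->ics_flags |= IMW_FL_PARTIAL;
        } else {
            cs->ics_flags &= ~IMW_FL_PARTIAL;
            cs->ics_flags |= IMW_FL_FAILED;
        }
    }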

Examples of layout transitions


Per-component state: U - up-to-date (not a real flag, just the absence of the Stale flag); S - stale; P - partially mirrored; F - mirroring failed.

LOVEA        | Component 1        | Component 2       | Component 3       | Component 4
-------------|--------------------|-------------------|-------------------|------------
before open  | Uptodate           | Uptodate          | Uptodate          | Stale
after open   | Uptodate / primary | Stale / secondary | Stale / secondary | Failed
client 1     | Partial            | Partial           | Partial           |
client 2     | Partial            | Failed            | Partial           |
client 3     | Partial            | Partial           | Partial           |
client 4     | Partial            | Partial           | Failed            |
after resync | Uptodate           | Stale             | Stale             | Stale


Flags (in the form of per-component layout flags) are accumulated in the LOVEA during the mirroring epoch. Initially the file had 3 up-to-date mirrors (Components 1-3) and 1 stale mirror (Component 4). After the initial open for write the MDS updates the layout, so now Component 1 is primary (up-to-date by definition), Component 2 and Component 3 are stale (they will be used for immediate replication) and Component 4 is failed in this epoch because it was already stale. 4 clients write to the file and do the replication. At some point each of them reports its own result in the form of a layout with flagged components. Client 1 and client 3 successfully wrote Components 1-3, client 2 failed to write Component 2, client 4 failed to write Component 3. Thus the accumulated state is a non-replicated file. Since the flags are distinguishable, the updates can be accumulated in the LOVEA directly.
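A minimal sketch of this accumulation step, reusing the illustrative IMW_FL_* bits from the earlier sketch: the MDS merges each client's reported per-component flags into the stored state, and a component stays a replication candidate only while no client has reported a failure on it.

    /* Sketch only: merge the per-component flags reported by one client into
     * the flags accumulated so far; the IMW_FL_* bits are the illustrative
     * definitions above, not real LOVEA fields. */
    static void imw_lovea_accumulate(unsigned int *stored,
                                     const unsigned int *reported,
                                     int nr_comps)
    {
        int i;

        for (i = 0; i < nr_comps; i++) {
            /* FAILED is sticky: one bad report disqualifies the
             * component for the rest of the epoch. */
            stored[i] |= reported[i] & IMW_FL_FAILED;

            if (stored[i] & IMW_FL_FAILED)
                stored[i] &= ~IMW_FL_PARTIAL;   /* cannot be good anymore */
            else if (reported[i] & IMW_FL_PARTIAL)
                stored[i] |= IMW_FL_PARTIAL;    /* still good so far */
        }
    }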

A luckier case:

LOVEA        | Component 1        | Component 2       | Component 3       | Component 4
-------------|--------------------|-------------------|-------------------|------------
before open  | Uptodate           | Uptodate          | Uptodate          | Stale
after open   | Uptodate / primary | Stale / secondary | Stale / secondary | Failed
client 1     | Partial            | Partial           | Partial           |
client 2     | Partial            | Partial           | Partial           |
client 3     | Partial            | Partial           | Partial           |
client 4     | Partial            | Partial           | Failed            |
after resync | Uptodate           | Uptodate          | Stale             | Stale

This time Component 2 was successfully written by all the clients, so the file becomes a replicated one.

...

  • MDS changes the layout's state from WP to SP (sync pending) to catch new opens
  • MDS enqueues the IO+LAYOUT lock to notify the clients that the epoch is closing
  • The clients send their cached layouts to MDS with LCME_FL_PARTIAL and LCME_FL_FAILED set properly to reflect which replicas are good
  • MDS accumulates these flags in the actual LOV EA
  • The clients cancel their IO locks [probably we can pack the LOVEA into the cancel RPC?]
  • MDS marks replicas with LCME_FL_FAILED as stale, drops LCME_FL_PRIMARY and increments the layout's version - this is the epoch close event
  • MDS does not block new opens during the resync procedure, but new changes cannot start because the IO lock is held during resync
  • MDS can interrupt the resync procedure when the IO lock has been acquired and a new open is found
Any change to the layout (outside of the IMW flags) should cause the existing epoch to close and a new epoch to open: the IL locks are cancelled and the current replication status is reported by all involved clients.
[think about which replica we choose as up-to-date if all replicas, including the primary one, met a failure]
[think about which component the resync functionality belongs to: 1) it works with LOVEA details - LOD; 2) it works with inodebit/extent locks]
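A minimal sketch of the epoch close step, again using the illustrative IMW_FL_* bits and a toy layout structure (not the real LOVEA/LOD objects): components that accumulated a failure become stale, fully partial components catch up, the per-epoch flags are dropped and the layout version is bumped, which is how the "after resync" rows in the tables above could be produced.

    /* Sketch only: a simplified view of what the MDS could do at epoch close. */
    struct imw_layout {
        unsigned int  il_version;   /* layout generation */
        int           il_nr_comps;
        unsigned int *il_flags;     /* per-component IMW_FL_* bits */
    };

    static void imw_epoch_close(struct imw_layout *lo)
    {
        int i;

        for (i = 0; i < lo->il_nr_comps; i++) {
            if (lo->il_flags[i] & IMW_FL_FAILED)
                lo->il_flags[i] |= IMW_FL_STALE;   /* demote the replica */
            else if (lo->il_flags[i] & IMW_FL_PARTIAL)
                lo->il_flags[i] &= ~IMW_FL_STALE;  /* replica caught up */

            /* The IMW flags and the primary role are valid only within
             * one epoch, so clear them for the next one. */
            lo->il_flags[i] &= ~(IMW_FL_PRIMARY | IMW_FL_PARTIAL |
                                 IMW_FL_FAILED);
        }
        lo->il_version++;    /* this is the epoch close event */
    }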

Changes to CLIO

The basic idea is to push the same page (vmpage) into several OSCs and then just let each OSC operate on those pages as usual (mostly).
When CLIO processes system calls (e.g. write) it creates so-called sub-IOs, originally to break the original IO into smaller IOs corresponding to stripes. We can re-use this mechanism for IMW, but this time LOV will be duplicating the IO for each OSC involved.
The kernel provides CLIO with a set of pages (from the pagecache or userspace). Currently LOV pushes pages into the corresponding OSCs, which are calculated using the striping information (number of stripes, stripe size, objects). To support IMW it will be pushing pages into several OSCs, and a few structures (e.g. struct cl_page, osc_page) need to change to enable that.
LOV will need to recognize IMW-enabled layouts:

  • use the primary replica for reads
  • use the primary replica to order extent locking
  • use the primary and secondary replica(s) for all changes
  • find the correct components for given offsets
The results of all changes (OST_WRITE, OST_SETATTR, OST_PUNCH, OST_FALLOCATE) are tracked in LOV on a per-component basis. Llite is also aware of all changes (mmap?). Later, upon IO lock cancellation, llite will fetch the LOVEA from LOV and send it to MDS to report the replication status - MDS_REINT_SETXATTR is used for this purpose.
[in order to speed up development, add support for a non-striped LOV_PATTERN_RAID1 pattern - basically just a set of objects, no support for replica flags, etc.]
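A minimal sketch of the IO duplication described above: instead of mapping a byte range onto the stripes of a single component, LOV fans the same range out to every live mirror component (primary plus secondaries). The types and helper are placeholders reusing the illustrative IMW_FL_* bits; the actual change is proposed for lov_io_iter_init() and related code listed further below.

    /* Sketch only: placeholder types, not the real cl_io/lov_io structures. */
    struct imw_sub_io {
        int                si_comp;   /* component (mirror) index */
        unsigned long long si_start;  /* byte range of this sub-IO */
        unsigned long long si_end;
    };

    /* Fan one IO range out to every component taking part in the epoch:
     * the primary plus all secondaries, skipping components already failed. */
    static int imw_io_duplicate(const unsigned int *comp_flags, int nr_comps,
                                unsigned long long start,
                                unsigned long long end,
                                struct imw_sub_io *subs)
    {
        int i, nr_subs = 0;

        for (i = 0; i < nr_comps; i++) {
            if (comp_flags[i] & IMW_FL_FAILED)
                continue;             /* no point writing a dead replica */
            subs[nr_subs].si_comp = i;
            subs[nr_subs].si_start = start;
            subs[nr_subs].si_end = end;
            nr_subs++;
        }
        return nr_subs;               /* one sub-IO per live replica */
    }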

Compatibility

If an old client opens a mirrored file, then MDS just doesn't start a mirroring epoch. If any client opens a mirrored file with no primary selected (as found in the LOVEA), then MDS doesn't start a mirroring epoch either. If an old client opens a mirrored file with a mirroring epoch already started (as found in the LOVEA), then MDS should create an mfd and initiate a mirroring epoch abort immediately: cancel the IO bitlock (does it make sense to notify the clients that it's an abort so they don't flush data?).
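A minimal sketch of this open-time decision; the enum and the predicate arguments are hypothetical names for illustration, the real checks would live in the MDS open path.

    /* Sketch only: toy decision helper standing in for the real MDS open path. */
    enum imw_open_action {
        IMW_OPEN_PLAIN,        /* no mirroring epoch for this open */
        IMW_OPEN_START_EPOCH,  /* start (or join) a mirroring epoch */
        IMW_OPEN_ABORT_EPOCH,  /* abort the running epoch first */
    };

    static enum imw_open_action
    imw_open_decide(int client_supports_imw, int layout_has_primary,
                    int epoch_already_started)
    {
        if (!client_supports_imw)
            /* old client: never start an epoch; if one is already running
             * it has to be aborted (cancel the IO bitlock). */
            return epoch_already_started ? IMW_OPEN_ABORT_EPOCH
                                         : IMW_OPEN_PLAIN;

        if (!layout_has_primary)
            return IMW_OPEN_PLAIN;     /* no primary selected in the LOVEA */

        return IMW_OPEN_START_EPOCH;
    }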

...

  • Before lock cancellation llite requests the LOVEA from LOV and sends it to MDS in the form of MDS_REINT_SETXATTR. This LOVEA contains per-component flags reflecting the replication status as described above.
[probably we can send a shrunken version of the LOVEA containing only the updated components?]
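A minimal sketch of that "shrunken LOVEA" idea: pack only the components whose flags changed during the epoch into the report. The report format is an assumption for illustration, not an existing wire format.

    /* Sketch only: pack just the components whose IMW flags changed since
     * the epoch opened, instead of the full LOVEA. */
    struct imw_report_entry {
        unsigned int re_comp_id;
        unsigned int re_flags;    /* IMW_FL_PARTIAL / IMW_FL_FAILED */
    };

    static int imw_pack_report(const unsigned int *flags_at_open,
                               const unsigned int *flags_now, int nr_comps,
                               struct imw_report_entry *out)
    {
        int i, nr = 0;

        for (i = 0; i < nr_comps; i++) {
            if (flags_now[i] == flags_at_open[i])
                continue;             /* nothing changed, nothing to report */
            out[nr].re_comp_id = i;
            out[nr].re_flags = flags_now[i];
            nr++;
        }
        return nr;    /* entries actually packed into the setxattr body */
    }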

CLIO


  • support for LOV_PATTERN_RAID1
    • Mostly to speed up development
    • Should be replaced with a composite layout supporting striped mirroring
  • Duplicate PG_dirty and PG_writeback in the form of refcounters in cl_page (see the sketch after this list)
  • lov_io_iter_init() to duplicate the IO for each object
  • lov_io_commit_async() to save the list of pages passed in and repeat the call to cl_io_commit_async() with that list
  • lov_io_commit_async() to use a special callback and use that callback to maintain the per-component flags if the IO fails
  • lov_lock_sub_init() to generate locks
  • lov_lock_enqueue() to order locks (primary first)
  • lov_init_raid1() to initialize the subobjects and initialize co_slice_off properly to support multi-object IO in cl_page_alloc() and osc_page_init()
  • lov_attr_get_raid1()
  • lov_page_init_composite() to initialize the same page in several OSCs
  • osc_extent_make_ready() to recognize the case when the page has already been added into an RPC by another OSC and increment the in-writeback refcounter
  • osc_completion() to respect the in-writeback refcounter
  • osc_io_commit_async() to maintain cl_page's dirty state
  • new set of methods similar to the composite layout methods, with IO/lock duplication
  • brw_commit() to propagate the result to LOV/llite and maintain the per-component flags

This looks good - one comment: remember that direct IO does not use OSC DLM locks on the client. No problem I think, but it needs to be kept in mind.
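A minimal sketch of the refcounting behind the PG_dirty/PG_writeback items in the list above: a page is considered clean or writeback-complete only after every OSC that holds a copy of it has finished its RPC. The structure is a simplified stand-in for the proposed cl_page fields, not the real CLIO code.

    /* Sketch only: a simplified stand-in for the proposed cl_page counters. */
    struct imw_page {
        int ip_dirty_count;      /* how many OSCs still consider it dirty */
        int ip_writeback_count;  /* how many OSCs have it in flight */
    };

    /* Called when one OSC moves its copy of the page into an RPC. */
    static void imw_page_start_writeback(struct imw_page *pg)
    {
        pg->ip_dirty_count--;
        pg->ip_writeback_count++;
    }

    /* Called from each OSC's completion path; the vmpage can be released
     * only after the last OSC has finished with it. */
    static int imw_page_end_writeback(struct imw_page *pg)
    {
        pg->ip_writeback_count--;
        return pg->ip_writeback_count == 0;   /* last completion? */
    }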

...