Instant Mirror Writes high level design

Introduction

Currently the only way to get redundancy for a file (i.e. to have it mirrored) is to run the lfs utility. The file is supposed to be closed at that time, otherwise the utility interrupts and must be restarted from the beginning. If any application opens a file that is being replicated, the replication interrupts as well.
Another problem with such offline mirroring is resource consumption, as the replication needs to read the data from the OSTs (disk IOs, then network transfers) and then send it back to other OSTs (network transfers, then disk IOs). For every 1MB of data to be replicated we need 1MB of disk/network transfer to write the original data, 1MB to read the original data back and 1MB to write the replica - 3x disk/network bandwidth.
In contrast, immediate mirroring should let clients maintain the replica(s) while the file is being used by regular applications and save one disk/network transfer - 2x disk/network bandwidth instead of 3x.

...

A layout component (representing a mirror) can be additionally flagged with the following; a sketch of how these flags might be handled on a client follows the list:
● LCME_FL_PRIMARY: the component will be used as primary and extent locking will go to this component first
● LCME_FL_PARTIAL: the component is involved in IMW and clients are supposed to send updates to it; so far all clients reported that all changes succeeded on this component
● LCME_FL_FAILED: at least one client reported a failed operation on this component
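A minimal sketch, assuming the flags are plain bits in the per-component entry, of how a client could track the outcome of its own changes while the epoch is open. The IMW_FL_* names, bit values and structures are illustrative stand-ins, not the actual LCME_FL_* definitions.

    /* Sketch only: bit values and struct layout are assumptions, not the
     * real lustre_user.h definitions. */
    enum imw_comp_flags {
        IMW_FL_STALE   = 0x01,  /* replica is known to be out of date */
        IMW_FL_PRIMARY = 0x02,  /* extent locking starts with this component */
        IMW_FL_PARTIAL = 0x04,  /* all changes reported so far succeeded here */
        IMW_FL_FAILED  = 0x08,  /* at least one client reported a failure here */
    };

    /* Per-component state a client keeps while the mirroring epoch is open. */
    struct imw_comp_state {
        unsigned int ics_id;    /* component id within the layout */
        unsigned int ics_flags; /* combination of the bits above */
    };

    /* Record the outcome of one change (write/setattr/punch) on a component:
     * success keeps PARTIAL set, any failure switches the component to FAILED. */
    static void imw_comp_record_result(struct imw_comp_state *cs, int rc)
    {
        if (rc == 0) {
            if (!(cs->ics_flags & IMW_FL_FAILED))
                cs->ics_flags |= IMW_FL_PARTIAL;
        } else {
            cs->ics_flags &= ~IMW_FL_PARTIAL;
            cs->ics_flags |= IMW_FL_FAILED;
        }
    }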

Examples of layout transitions


Per-component state: U - up-to-date (not a real flag, just the absence of the Stale flag); S - stale; P - partially mirrored; F - mirroring failed.

LOVEA        | Component 1        | Component 2       | Component 3       | Component 4
-------------|--------------------|-------------------|-------------------|------------
before open  | Uptodate           | Uptodate          | Uptodate          | Stale
after open   | Uptodate / primary | Stale / secondary | Stale / secondary | Failed
client 1     | Partial            | Partial           | Partial           |
client 2     | Partial            | Failed            | Partial           |
client 3     | Partial            | Partial           | Partial           |
client 4     | Partial            | Partial           | Failed            |
after resync | Uptodate           | Stale             | Stale             | Stale


Flags (in the form of per-component layout flags) are accumulated in the LOVEA during the mirroring epoch. Initially the file had 3 up-to-date mirrors (Components 1-3) and 1 stale mirror (Component 4). After the initial open for write the MDS updates the layout, so now Component 1 is primary (up-to-date by definition), Component 2 and Component 3 are stale (they will be used for immediate replication) and Component 4 is failed in this epoch because it was already stale. 4 clients write to the file and do the replication. At some point each of them reports its own result in the form of a layout with flagged components. Client 1 and client 3 successfully wrote Components 1-3, client 2 failed to write Component 2, client 4 failed to write Component 3. Thus the accumulated state is a non-replicated file. Since the flags are distinguishable, the updates can be accumulated in the LOVEA directly.
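A minimal sketch of this accumulation step, reusing the illustrative IMW_FL_* bits from the earlier sketch: the MDS merges each client's reported per-component flags into the stored state, and a component stays a replication candidate only while no client has reported a failure on it.

    /* Sketch only: merge the per-component flags reported by one client into
     * the flags accumulated so far; the IMW_FL_* bits are the illustrative
     * definitions above, not real LOVEA fields. */
    static void imw_lovea_accumulate(unsigned int *stored,
                                     const unsigned int *reported,
                                     int nr_comps)
    {
        int i;

        for (i = 0; i < nr_comps; i++) {
            /* FAILED is sticky: one bad report disqualifies the
             * component for the rest of the epoch. */
            stored[i] |= reported[i] & IMW_FL_FAILED;

            if (stored[i] & IMW_FL_FAILED)
                stored[i] &= ~IMW_FL_PARTIAL;   /* cannot be good anymore */
            else if (reported[i] & IMW_FL_PARTIAL)
                stored[i] |= IMW_FL_PARTIAL;    /* still good so far */
        }
    }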

A luckier case:

LOVEA        | Component 1        | Component 2       | Component 3       | Component 4
-------------|--------------------|-------------------|-------------------|------------
before open  | Uptodate           | Uptodate          | Uptodate          | Stale
after open   | Uptodate / primary | Stale / secondary | Stale / secondary | Failed
client 1     | Partial            | Partial           | Partial           |
client 2     | Partial            | Partial           | Partial           |
client 3     | Partial            | Partial           | Partial           |
client 4     | Partial            | Partial           | Failed            |
after resync | Uptodate           | Uptodate          | Stale             | Stale

This time Component 2 was successfully written by all the clients, so the file becomes a replicated one.

...

  • MDS changes the layout's state from WP to SP (sync pending) to catch new opens
  • MDS enqueues the IO+LAYOUT lock to notify the clients that the epoch is closing
  • The clients send their cached layouts to MDS with LCME_FL_PARTIAL and LCME_FL_FAILED set properly to reflect which replicas are good
  • MDS accumulates these flags in the actual LOV EA
  • The clients cancel their IO locks [probably we can pack the LOVEA into the cancel RPC?]
  • MDS marks replicas with LCME_FL_FAILED as stale, drops LCME_FL_PRIMARY and increments the layout's version - this is the epoch close event
  • MDS does not block new opens during the resync procedure, but new changes cannot start because the IO lock is held during resync
  • MDS can interrupt the resync procedure when the IO lock has been acquired and a new open is found
Any change to the layout (outside of the IMW flags) should cause the existing epoch to close and a new epoch to open: the IL locks are cancelled and the current replication status is reported by all involved clients.
[think about which replica we choose as up-to-date if all replicas, including the primary one, met a failure]
[think about which component the resync functionality belongs to: 1) it works with LOVEA details - LOD; 2) it works with inodebit/extent locks]
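A minimal sketch of the epoch close step, again using the illustrative IMW_FL_* bits and a toy layout structure (not the real LOVEA/LOD objects): components that accumulated a failure become stale, fully partial components catch up, the per-epoch flags are dropped and the layout version is bumped, which is how the "after resync" rows in the tables above could be produced.

    /* Sketch only: a simplified view of what the MDS could do at epoch close. */
    struct imw_layout {
        unsigned int  il_version;   /* layout generation */
        int           il_nr_comps;
        unsigned int *il_flags;     /* per-component IMW_FL_* bits */
    };

    static void imw_epoch_close(struct imw_layout *lo)
    {
        int i;

        for (i = 0; i < lo->il_nr_comps; i++) {
            if (lo->il_flags[i] & IMW_FL_FAILED)
                lo->il_flags[i] |= IMW_FL_STALE;   /* demote the replica */
            else if (lo->il_flags[i] & IMW_FL_PARTIAL)
                lo->il_flags[i] &= ~IMW_FL_STALE;  /* replica caught up */

            /* The IMW flags and the primary role are valid only within
             * one epoch, so clear them for the next one. */
            lo->il_flags[i] &= ~(IMW_FL_PRIMARY | IMW_FL_PARTIAL |
                                 IMW_FL_FAILED);
        }
        lo->il_version++;    /* this is the epoch close event */
    }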

Changes to CLIO

The basic idea is to push the same page (vmpage) into several OSCs and then just let each OSC operate on those pages as usual (mostly).
When CLIO processes system calls (e.g. write) it creates so-called sub-IOs, originally to break the original IO into smaller IOs corresponding to stripes. We can re-use this mechanism for IMW, but this time LOV will be duplicating the IO for each OSC involved.
The kernel provides CLIO with a set of pages (from the pagecache or userspace). Currently LOV pushes pages into the corresponding OSCs, which are calculated using the striping information (number of stripes, stripe size, objects). To support IMW it will be pushing pages into several OSCs, and a few structures (e.g. struct cl_page, osc_page) need to change to enable that.
LOV will need to recognize IMW-enabled layouts:

  • use the primary replica for reads
  • use the primary replica to order extent locking
  • use the primary and secondary replica(s) for all changes
  • find the correct components for given offsets
The results of all changes (OST_WRITE, OST_SETATTR, OST_PUNCH, OST_FALLOCATE) are tracked in LOV on a per-component basis. Llite is also aware of all changes (mmap?). Later, upon IO lock cancellation, llite will fetch the LOVEA from LOV and send it to MDS to report the replication status - MDS_REINT_SETXATTR is used for this purpose.
[in order to speed up development, add support for a non-striped LOV_PATTERN_RAID1 pattern - basically just a set of objects, no support for replica flags, etc.]
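A minimal sketch of the IO duplication described above: instead of mapping a byte range onto the stripes of a single component, LOV fans the same range out to every live mirror component (primary plus secondaries). The types and helper are placeholders reusing the illustrative IMW_FL_* bits; the actual change is proposed for lov_io_iter_init() and related code listed further below.

    /* Sketch only: placeholder types, not the real cl_io/lov_io structures. */
    struct imw_sub_io {
        int                si_comp;   /* component (mirror) index */
        unsigned long long si_start;  /* byte range of this sub-IO */
        unsigned long long si_end;
    };

    /* Fan one IO range out to every component taking part in the epoch:
     * the primary plus all secondaries, skipping components already failed. */
    static int imw_io_duplicate(const unsigned int *comp_flags, int nr_comps,
                                unsigned long long start,
                                unsigned long long end,
                                struct imw_sub_io *subs)
    {
        int i, nr_subs = 0;

        for (i = 0; i < nr_comps; i++) {
            if (comp_flags[i] & IMW_FL_FAILED)
                continue;             /* no point writing a dead replica */
            subs[nr_subs].si_comp = i;
            subs[nr_subs].si_start = start;
            subs[nr_subs].si_end = end;
            nr_subs++;
        }
        return nr_subs;               /* one sub-IO per live replica */
    }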

Compatibility

If an old client opens a mirrored file, then MDS just doesn't start a mirroring epoch. If any client opens a mirrored file with no primary selected (as found in the LOVEA), then MDS doesn't start a mirroring epoch either. If an old client opens a mirrored file with a mirroring epoch already started (as found in the LOVEA), then MDS should create an mfd and initiate a mirroring epoch abort immediately: cancel the IO bitlock (does it make sense to notify the clients that it's an abort so they don't flush data?).
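A minimal sketch of this open-time decision; the enum and the predicate arguments are hypothetical names for illustration, the real checks would live in the MDS open path.

    /* Sketch only: toy decision helper standing in for the real MDS open path. */
    enum imw_open_action {
        IMW_OPEN_PLAIN,        /* no mirroring epoch for this open */
        IMW_OPEN_START_EPOCH,  /* start (or join) a mirroring epoch */
        IMW_OPEN_ABORT_EPOCH,  /* abort the running epoch first */
    };

    static enum imw_open_action
    imw_open_decide(int client_supports_imw, int layout_has_primary,
                    int epoch_already_started)
    {
        if (!client_supports_imw)
            /* old client: never start an epoch; if one is already running
             * it has to be aborted (cancel the IO bitlock). */
            return epoch_already_started ? IMW_OPEN_ABORT_EPOCH
                                         : IMW_OPEN_PLAIN;

        if (!layout_has_primary)
            return IMW_OPEN_PLAIN;     /* no primary selected in the LOVEA */

        return IMW_OPEN_START_EPOCH;
    }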

...

  • Before lock cancellation llite requests the LOVEA from LOV and sends it to MDS in the form of MDS_REINT_SETXATTR. This LOVEA contains per-component flags reflecting the replication status as described above.
[probably we can send a shrunken version of the LOVEA containing only the updated components?]
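A minimal sketch of that "shrunken LOVEA" idea: pack only the components whose flags changed during the epoch into the report. The report format is an assumption for illustration, not an existing wire format.

    /* Sketch only: pack just the components whose IMW flags changed since
     * the epoch opened, instead of the full LOVEA. */
    struct imw_report_entry {
        unsigned int re_comp_id;
        unsigned int re_flags;    /* IMW_FL_PARTIAL / IMW_FL_FAILED */
    };

    static int imw_pack_report(const unsigned int *flags_at_open,
                               const unsigned int *flags_now, int nr_comps,
                               struct imw_report_entry *out)
    {
        int i, nr = 0;

        for (i = 0; i < nr_comps; i++) {
            if (flags_now[i] == flags_at_open[i])
                continue;             /* nothing changed, nothing to report */
            out[nr].re_comp_id = i;
            out[nr].re_flags = flags_now[i];
            nr++;
        }
        return nr;    /* entries actually packed into the setxattr body */
    }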

CLIO


  • support for LOV_PATTERN_RAID1
    • Mostly to speed up development
    • Should be replaced with a composite layout supporting striped mirroring
  • Duplicate PG_dirty and PG_writeback in the form of refcounters in cl_page (see the sketch after this list)
  • lov_io_iter_init() to duplicate the IO for each object
  • lov_io_commit_async() to save the list of pages passed in and repeat the call to cl_io_commit_async() with that list
  • lov_io_commit_async() to use a special callback and use that callback to maintain the per-component flags if the IO fails
  • lov_lock_sub_init() to generate locks
  • lov_lock_enqueue() to order locks (primary first)
  • lov_init_raid1() to initialize the subobjects and initialize co_slice_off properly to support multi-object IO in cl_page_alloc() and osc_page_init()
  • lov_attr_get_raid1()
  • lov_page_init_composite() to initialize the same page in several OSCs
  • osc_extent_make_ready() to recognize the case when the page has already been added into an RPC by another OSC and increment the in-writeback refcounter
  • osc_completion() to respect the in-writeback refcounter
  • osc_io_commit_async() to maintain cl_page's dirty state
  • new set of methods similar to the composite layout methods, with IO/lock duplication
  • brw_commit() to propagate the result to LOV/llite and maintain the per-component flags

This looks good - one comment: remember that direct IO does not use OSC DLM locks on the client. No problem I think, but it needs to be kept in mind.
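A minimal sketch of the refcounting behind the PG_dirty/PG_writeback items in the list above: a page is considered clean or writeback-complete only after every OSC that holds a copy of it has finished its RPC. The structure is a simplified stand-in for the proposed cl_page fields, not the real CLIO code.

    /* Sketch only: a simplified stand-in for the proposed cl_page counters. */
    struct imw_page {
        int ip_dirty_count;      /* how many OSCs still consider it dirty */
        int ip_writeback_count;  /* how many OSCs have it in flight */
    };

    /* Called when one OSC moves its copy of the page into an RPC. */
    static void imw_page_start_writeback(struct imw_page *pg)
    {
        pg->ip_dirty_count--;
        pg->ip_writeback_count++;
    }

    /* Called from each OSC's completion path; the vmpage can be released
     * only after the last OSC has finished with it. */
    static int imw_page_end_writeback(struct imw_page *pg)
    {
        pg->ip_writeback_count--;
        return pg->ip_writeback_count == 0;   /* last completion? */
    }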

...