Introduction

The Erasure Coding High Level Design describes how the EC feature may be implemented, including the user interfaces, how EC files interact with the client IO layer, and the RPC exchange between the client and servers.

Design Overview

There are three main components to the EC design:

The user-space interfaces for Lustre-specific command line tools and user library application programming interfaces (APIs)
Changes to the client IO (CLIO) layer IO engine and RPC interfaces to the MDS for accessing and writing composite file layouts
Changes to the MDS server to create, modify, and delete composite files

The design is structured in a top-down manner, starting with the command line interfaces that users interact with, then the user library APIs, the client-side kernel changes, the RPCs for creating and modifying composite files, and the server-side changes.

User Space Interfaces

The lfs(1) command line interface will be extended to understand and manipulate EC files and their component layouts. Users will use it to create files with an EC layout, show the layout of existing files, and set a default EC layout template on directories, which can be inherited by new files and subdirectories created therein.

lfs setstripe

The lfs setstripe command creates a new file with the specified layout parameters, or sets the default layout template for a directory.

lfs setstripe {--component-end|-E END1} [COMP1_OPTIONS] {-L erasure-code} [-k NUMBER_OF_DATA_STRIPES ] {-m NUMBER_OF_PARITY_STRIPES} [SETSTRIPE_OPTIONS] {--component-end|-E END2 } [COMP2_OPTIONS] [{-L erasure-code} ...] {FILENAME|DIRECTORY}

For ease of use, the EC component can be specified with a shorthand, like "-L ec:D+P" to represent the NUMBER_OF_DATA_STRIPES and NUMBER_OF_PARITY_STRIPES more easily.
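As an illustration of how the "ec:D+P" shorthand could be decoded, here is a minimal sketch in C. The function name ec_parse_shorthand is hypothetical and is not part of lfs; the upper bound of 255 is an assumption drawn from the __u8 stripe count fields proposed later in this design.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of lfs): parse "ec:D+P" into stripe counts.
 * Returns 0 on success, -1 if the string is malformed or out of range. */
static int ec_parse_shorthand(const char *arg, unsigned *data, unsigned *parity)
{
	unsigned d, p;
	char tail;

	assert(arg && data && parity);

	/* The trailing %c only matches if there is garbage after the parity
	 * count, in which case sscanf returns 3 instead of 2. */
	if (sscanf(arg, "ec:%u+%u%c", &d, &p, &tail) != 2)
		return -1;

	/* Assumption: both counts must fit the __u8 layout fields. */
	if (d == 0 || p == 0 || d > 255 || p > 255)
		return -1;

	*data = d;
	*parity = p;
	return 0;
}
```

With this sketch, "ec:8+2" yields 8 data stripes and 2 parity stripes, while "ec:8" or "ec:8+2x" is rejected.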

lfs mirror split -d -L ec FILENAME # delete all EC components

lfs setstripe --comp-del -I EC_COMP_ID FILENAME # delete a single EC component (not recommended)

Creating an EC file with lfs setstripe produces two separate sets of components in the file. The first set holds the standard RAID0 data stripes for each of the components, and the second set holds a matching set of EC parity stripes for each of those components. Each of these component sets has a different MIRROR_ID value, like a regular FLR mirror, except that the EC parity components additionally store the LOV_PATTERN_PARITY pattern.

Since this is the primary command line interface for users to create new files with Lustre-specific layouts, there are a number of existing options that can be used. Adding erasure-coding-specific options allows the same code to create an erasure-coding-enabled composite layout without duplicating a large number of options. The existing --component-end arguments specify the data component, and the -L erasure-code arguments that follow the data component arguments specify the parameters of the code component associated with that data component. -k specifies the number of data stripes used to compute the erasure code; if this option is not set, all stripes of the associated data component are used for the code calculation. -m specifies the number of code stripes. Encoding takes k data devices and calculates m code devices from them. The -i/-o/-p options that follow can be used to specify which OSTs are used as the code devices.

As an example:

$ lfs setstripe -E 4M -c 4 -L erasure-code -m 2 -E eof -c 32 -L ec:8+2 /mnt/lustre/file

This creates a file with two data components and two code components.

Every code component uses the same stripe size as its corresponding data component, and the parity computation only involves that component's data. If the trailing data of the component does not fill a complete group of k stripes, a zero-filled memory buffer is used as padding for the parity computation.
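The zero padding of an incomplete stripe group can be illustrated with a small sketch. Note that it uses plain XOR parity (equivalent to a single code stripe) as a simplified stand-in for the Reed-Solomon code this design intends to use, and the function name and flat buffer layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: XOR parity over k data chunks of chunk_len bytes.
 * `data` is a flat buffer in which chunk j starts at j * chunk_len. Only
 * the first data_len bytes are valid; the rest of the group is treated as
 * zero. XOR stands in for Reed-Solomon and covers only one parity chunk. */
static void ec_xor_parity_padded(const unsigned char *data, size_t data_len,
				 size_t k, size_t chunk_len,
				 unsigned char *parity)
{
	assert(data && parity && data_len <= k * chunk_len);

	memset(parity, 0, chunk_len);
	for (size_t j = 0; j < k; j++) {
		for (size_t b = 0; b < chunk_len; b++) {
			size_t idx = j * chunk_len + b;
			/* Bytes past the end of the data are zero padding. */
			unsigned char byte = idx < data_len ? data[idx] : 0;

			parity[b] ^= byte;
		}
	}
}
```

For example, five data bytes {1, 2, 3, 4, 5} with k = 2 and chunk_len = 4 are treated as the chunks {1, 2, 3, 4} and {5, 0, 0, 0}, giving the parity chunk {4, 2, 3, 4}.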

lfs mirror resync [-y] EC_FILENAME

This command resynchronizes an out-of-sync erasure-coded file. If there are no stale code components in the EC file and the -y argument is not given, this command does nothing. Otherwise, it first takes an exclusive lease on the file and then resynchronizes it. During resynchronization, the file data is read to generate the parity code, and the parity code is then written to the OST objects of the code components. After the parity code is synchronized, the command changes the layout to mark the code components as up to date and releases the lease on the file.

On-disk/wire structure changes for Erasure Coding layout component

layout header

struct lov_comp_md_v1 {
      __u32   lcm_magic;
      __u32   lcm_size;
      __u32   lcm_layout_gen;
      __u16   lcm_flags;
      __u16   lcm_entry_count;
      /* lcm_mirror_count stores the number of actual mirrors minus 1,
       * so that non-flr files will have value 0 meaning 1 mirror.             
       */
      __u16   lcm_mirror_count;
      /* code components count, non-EC file contains 0 ec_count */
+     __u8    lcm_ec_count;
      __u8    lcm_padding3[1];
      __u16   lcm_padding1[2];
      __u64   lcm_padding2;
      struct lov_comp_md_entry_v1 lcm_entries[0];
};

code component entry header

struct lov_comp_md_entry_v1 {
      __u32  lcme_id;
      __u32  lcme_flags;
      struct lu_extent lcme_extent;
      __u32  lcme_offset;
      __u32  lcme_size;
      __u32  lcme_layout_gen;
+     __u8   lcme_dstripe_count; /* data stripe count used in EC, k value */
+     __u8   lcme_cstripe_count; /* code stripe count, m value */
      __u16  lcme_padding_1;
      __u64  lcme_timestamp;
};
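The new __u8 fields in both structures appear to be carved out of existing padding, so the on-disk/wire sizes should not change. The following sketch checks that at compile time using userspace mirrors of the two structures. The _sketch type names are hypothetical, fixed-width types replace the __u8/__u16/__u32/__u64 kernel types, the trailing lcm_entries[] array is omitted, and struct lu_extent is assumed to be two 64-bit values. The expected sizes of 32 and 48 bytes are simply what the fields listed above add up to under natural alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lu_extent_sketch {		/* assumed shape of struct lu_extent */
	uint64_t e_start;
	uint64_t e_end;
};

struct comp_md_v1_sketch {		/* mirrors lov_comp_md_v1 above */
	uint32_t lcm_magic;
	uint32_t lcm_size;
	uint32_t lcm_layout_gen;
	uint16_t lcm_flags;
	uint16_t lcm_entry_count;
	uint16_t lcm_mirror_count;
	uint8_t  lcm_ec_count;		/* new field */
	uint8_t  lcm_padding3[1];
	uint16_t lcm_padding1[2];
	uint64_t lcm_padding2;
};

struct comp_md_entry_v1_sketch {	/* mirrors lov_comp_md_entry_v1 above */
	uint32_t lcme_id;
	uint32_t lcme_flags;
	struct lu_extent_sketch lcme_extent;
	uint32_t lcme_offset;
	uint32_t lcme_size;
	uint32_t lcme_layout_gen;
	uint8_t  lcme_dstripe_count;	/* new field, k value */
	uint8_t  lcme_cstripe_count;	/* new field, m value */
	uint16_t lcme_padding_1;
	uint64_t lcme_timestamp;
};

/* The new fields must fit without growing or reshuffling the structures. */
static_assert(sizeof(struct comp_md_v1_sketch) == 32, "header size");
static_assert(offsetof(struct comp_md_v1_sketch, lcm_ec_count) == 18,
	      "lcm_ec_count follows lcm_mirror_count");
static_assert(sizeof(struct comp_md_entry_v1_sketch) == 48, "entry size");
static_assert(offsetof(struct comp_md_entry_v1_sketch, lcme_dstripe_count) == 36,
	      "lcme_dstripe_count follows lcme_layout_gen");
```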

data component entry blob is the same as a non-EC plain file layout

struct lov_mds_md_v3 {
      __u32         lmm_magic;          /* LOV_MAGIC_V3 */
      __u32         lmm_pattern;        /* LOV_PATTERN_RAID0 */
      struct ost_id lmm_oi;
      __u32         lmm_stripe_size;
      __u16         lmm_stripe_count;   /* ec stripe count */
      __u16         lmm_layout_gen;
      char          lmm_pool_name[LOV_MAXPOOLNAME + 1];
      struct        lov_ost_data_v1 lmm_objects[0];
};

parity code component entry blob

struct lov_mds_md_v3 {
      __u32         lmm_magic;          /* LOV_MAGIC_V3 */
      __u32         lmm_pattern;        /* LOV_PATTERN_PARITY */
      struct ost_id lmm_oi;
      __u32         lmm_stripe_size;
      __u16         lmm_stripe_count;   /* ec stripe count */
      __u16         lmm_layout_gen;
      char          lmm_pool_name[LOV_MAXPOOLNAME + 1];
      struct lov_ost_data_v1 lmm_objects[0];
};

Memory structure changes for Erasure Coding layout

layout header

struct lov_stripe_md {
...
      u32 lsm_magic;    -> LOV_MAGIC_COMP_EC
...
};

code component entry

struct lov_stripe_md_entry {
      ...
      u32 lsme_magic;   -> LOV_MAGIC_EC
+     u8 lsme_dstripe_count;
+     u8 lsme_cstripe_count;
...
};

The code component has the same extent range as its associated data component, and the code component descriptors are positioned after the data components.

Layout Component Creation

Creating layout components on demand still uses the layout intent RPC to ask the MDS to prepare the file's components, in the same way as for PFL/FLR files.

Write to EC File with Delayed Parity

The first phase of this project aims to build the infrastructure to support erasure coding in Lustre. To simplify the implementation, this design only supports delayed parity calculation: a data write does not generate parity code at the same time. It only marks the code components stale, and the write continues on the striped data components. After the file is closed, an administrator can use an external tool to generate and write the parity code. Marking a code component stale is recorded in the Lustre ChangeLog so that the resync tool or an external policy engine can detect and resync the code components.

With delayed write, the write intent RPC handler in the MDS will instantiate the destination data and code components, and mark the code component STALE. The layout generation will be increased to notify other clients to clean up their caches. These code components stay stale, and reads cannot use them to rebuild missing data until they are resynced.
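The handling described above can be sketched with a simplified in-memory model. Everything in this block is hypothetical: the sk_ structures, the flag values, and the function are illustrations of the described behaviour, not the MDS data structures or code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_FL_STALE 0x01u	/* hypothetical flag values for this sketch */
#define SK_FL_INIT  0x10u

struct sk_comp {
	uint64_t start, end;	/* extent covered, [start, end) */
	uint32_t flags;
	int is_code;		/* non-zero for a code (parity) component */
};

struct sk_layout {
	uint32_t gen;		/* layout generation */
	size_t count;
	struct sk_comp *comps;
};

/* Write intent for [start, end): instantiate every overlapping component,
 * mark overlapping code components stale, and bump the generation so that
 * other clients notice the layout change. */
static void sk_write_intent(struct sk_layout *lo, uint64_t start, uint64_t end)
{
	assert(lo && start < end);

	for (size_t i = 0; i < lo->count; i++) {
		struct sk_comp *c = &lo->comps[i];

		if (c->end <= start || c->start >= end)
			continue;	/* no overlap with the write */
		c->flags |= SK_FL_INIT;
		if (c->is_code)
			c->flags |= SK_FL_STALE;
	}
	lo->gen++;
}
```

In this model, a write to the first megabyte of a file with two data and two code components touches only the first data component and its matching code component, and only the latter becomes stale.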

Map file offset to code offset

Each stripe of parity code is calculated by a polynomial taking several (say k) stripes of data as its variables, so that every k bytes of data in the file corresponds to a byte of code in one of the corresponding component code stripes.

Suppose the data at file offset off is located in component m, which starts at comp_m.start, has stripe size s_m, and uses k_m data stripes for code generation. The offset code_off in the code object that corresponds to the data at off can then be calculated as follows, where / is integer division and % is the remainder:

Let off_m = off - comp_m.start

And code_off = off_m / (k_m * s_m) * s_m + off_m % s_m
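The mapping above translates directly into C. This is a minimal sketch; the function name and parameter names are chosen here for illustration and do not come from the Lustre source.

```c
#include <assert.h>
#include <stdint.h>

/* Map a file offset to the offset inside a code object, following
 * code_off = off_m / (k_m * s_m) * s_m + off_m % s_m.
 * `k` is the number of data stripes per code group and `stripe_size` is
 * the stripe size of the component that starts at `comp_start`. */
static uint64_t ec_code_offset(uint64_t off, uint64_t comp_start,
			       uint32_t k, uint32_t stripe_size)
{
	assert(off >= comp_start && k > 0 && stripe_size > 0);

	uint64_t off_m = off - comp_start;	/* offset within component */
	uint64_t row = off_m / ((uint64_t)k * stripe_size); /* full stripe rows before off */

	return row * stripe_size + off_m % stripe_size;
}
```

For example, with comp_start = 4 MiB, k = 8, and a 1 MiB stripe size, file offset 15 MiB + 7 gives off_m = 11 MiB + 7, which lies in the second stripe row, so code_off = 1 MiB + 7.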

Erasure Code Resynchronization

After an EC file is written, its code component cannot be used to reconstruct data if a data OST becomes unavailable. This phase needs an external mechanism to synchronize the code component: compute the parity code from the file data, write it to the code component, and then update the layout to indicate that the parity in the code component is valid, so that reads can use it to recover data from failed OSTs.

Read of Degraded File

When a read from an EC file encounters failed OSTs, the IO framework detects the error. The LOV layer will then read the remaining data pages in the EC chunk and the associated parity from the code component, and regenerate the missing data pages for the failed OSTs, so that the client can read without noticing that the OSTs are unavailable. Since the EC parity is computed on a per-block basis across the striped file, the minimum number of pages needed to reconstruct one missing data page is (d-1)+1, where d is the number of data stripes in the EC group: one data page from each remaining data stripe plus one parity page (though additional parity pages may be read to confirm that the reconstruction is correct).
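To make the page count concrete, the sketch below rebuilds one missing data page from the d-1 surviving data pages and one parity page. It uses XOR parity as a simplified stand-in for the Reed-Solomon decode that the design would get from an erasure code library, and the function name and argument layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rebuild the data page at index `missing` from the other d-1 data pages
 * and one XOR parity page. pages[missing] is never dereferenced.
 * XOR is a stand-in for Reed-Solomon and tolerates one lost page only. */
static void ec_xor_rebuild(const unsigned char *const *pages, size_t d,
			   size_t missing, const unsigned char *parity,
			   size_t page_len, unsigned char *out)
{
	assert(pages && parity && out && missing < d);

	/* Start from the parity, then cancel out each surviving data page. */
	memcpy(out, parity, page_len);
	for (size_t i = 0; i < d; i++) {
		if (i == missing)
			continue;
		for (size_t b = 0; b < page_len; b++)
			out[b] ^= pages[i][b];
	}
}
```

With three data pages {1, 2}, {3, 4}, {5, 6} and parity {7, 0}, losing the middle page and calling the function with the two survivors plus the parity page reproduces {3, 4}.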

For large reads (>= stripe_size), it is most efficient to read the full stripe_size of data and parity and reconstruct a whole stripe at a time. It is likely that the additional data stripes would be needed for the application read() call in any case.

Extent lock expand

Depending on the number of code fragments configured for the erasure coding, calculating the data at a certain file offset requires reading all of the data stripes in the relevant code fragments. This can involve reading data outside the current IO request range, which means a read lock covering that part of the data is also needed. In the case of an OST failure, the read will be restarted and will take an extent lock covering the relevant data and code fragments.
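One conservative way to picture the expanded lock range is to round the requested range out to whole stripe rows of k * stripe_size bytes within the component. The sketch below does only that; the structure and function names are hypothetical, it does not clamp the result to the component end or the file size, and the actual design may lock a narrower range.

```c
#include <assert.h>
#include <stdint.h>

struct ec_extent {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
};

/* Expand [start, end) to whole stripe rows of the component that begins
 * at comp_start, so that every data stripe contributing to the affected
 * parity is covered. */
static struct ec_extent ec_expand_extent(uint64_t start, uint64_t end,
					 uint64_t comp_start, uint32_t k,
					 uint32_t stripe_size)
{
	assert(comp_start <= start && start < end && k > 0 && stripe_size > 0);

	uint64_t row = (uint64_t)k * stripe_size;	/* bytes per stripe row */
	struct ec_extent ext;

	ext.start = comp_start + (start - comp_start) / row * row;
	ext.end = comp_start + (end - comp_start + row - 1) / row * row;
	return ext;
}
```

For k = 4 and a 1 MiB stripe size, a 100-byte read at offset 5 MiB + 100 expands to the row covering 4 MiB to 8 MiB.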

IO framework for read

When a read fails because of an unavailable OST, the IO framework retry mechanism is triggered. The retry creates another EC I/O, which reads the data stripes from the available OSTs together with the parity code from the code OSTs, and calculates the data belonging to the failed OSTs.

Page management in LOV

To shield user space tools and the LLITE layer from direct awareness of the parity code, the LOV layer takes charge of page management for the parity code belonging to the code components. LOV will acquire a CR extent lock on the requested code OST object and allocate the cl_page associated with the parity OST object. Since the code OST object is hidden from the LLITE layer, it is the LOV layer's responsibility to maintain the code pages. These pages can be attached to the code extent lock and/or maintained in an LRU facility. When the lock is canceled, the parity cache can be discarded accordingly.

Requirements

Erasure Code Library

We would choose an existing library that supports erasure code encoding and decoding. The Intel Intelligent Storage Acceleration Library (ISA-L) (https://github.com/01org/isa-l) supports fast block Reed-Solomon type erasure codes for any encode/decode matrix in GF(2⁸), and we can leverage it for generating the parity code and restoring data.

User Space Tools

A Lustre user space tool is needed to define and set parity components for a file. We can reuse lfs mirror create to serve this purpose.

Another tool is needed to generate the erasure code for changed data blocks and update the parity code. We would reuse lfs mirror resync, which would check internally whether the component to resync is a mirror or an EC component.

Erasure Coded File Write

Erasure coded file writes will mark the corresponding parity component stale. After the file is closed, the resync tool can be used to clear the write extent list and generate or update the erasure code for the corresponding parity component.

Erasure Coded File Read

The Lustre client will do normal reads from the RAID-0 data component. If there is an OST failure or another error reading from a data stripe, a read recovery will be started, which reads erasure code data from the parity components and reconstructs the data for the failed OST.

Future Development

Write to EC File with Immediate Parity

In a later development phase, it may be possible to implement Immediate EC Write for the restricted (but fairly common) use case of a single client writing a new file in linear offset order (which is the only IO method supported for most object stores). The EC components would be initialized as "in progress" or "STALE" during initial file writes. The client could compute the code for the EC component in a similar manner as is done for Read of Degraded File, and write it asynchronously to the EC stripes. If the writes of the file's data and code components have completed and synced without errors (the most common case), then the EC component(s) can be marked up to date when the file is closed. This essentially combines the Delayed Write with the Erasure Code Resynchronization step, and could cover a large fraction of the normal use cases.
