Introduction
The Erasure Coding High Level Design describes how the EC feature may be implemented, including the user interfaces, how files with the EC feature interact with the client IO layer, and the RPC interchange between the client and servers.
Design Overview
There are three main components to the EC design:
- The user-space interfaces for Lustre-specific command line tools and user library application programming interfaces (APIs)
- Changes to the client IO (CLIO) layer IO engine and RPC interfaces to the MDS for accessing and writing composite file layouts
- Changes to the MDS server to create, modify, and delete composite files
The design is structured in a top-down manner, starting with the command line interfaces that users interact with, then the user library APIs, the client-side kernel changes, the RPCs for creating and modifying composite files, and the server-side changes.
User Space Interfaces
The lfs(1) command line interface will be extended to understand and manipulate EC files and their component layouts. Users will use it to create files with an EC layout, show the layout of existing files, and set a default EC layout template on directories, which can be inherited by new files and sub-directories created therein.
lfs setstripe
The lfs setstripe command creates a new file with the specified layout parameters, or sets the default layout template for a directory.
lfs setstripe {--component-end|-E END1} [COMP1_OPTIONS] {-L erasure-code} [-k NUMBER_OF_DATA_STRIPES ] {-m NUMBER_OF_PARITY_STRIPES} [SETSTRIPE_OPTIONS] {--component-end|-E END2 } [COMP2_OPTIONS] [{-L erasure-code} ...] {FILENAME|DIRECTORY}
For ease of use, the EC component can be specified with a shorthand, like "-L ec:D+P" to represent the NUMBER_OF_DATA_STRIPES and NUMBER_OF_PARITY_STRIPES more easily.
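As a rough illustration of the "ec:D+P" shorthand, the sketch below parses such a string into the k and m values. The helper name ec_parse_shorthand is hypothetical and is not part of lfs. The range check assumes the counts must fit the one-byte lcme_dstripe_count and lcme_cstripe_count fields, and rejecting zero is also an assumption.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not an lfs function): parse "ec:D+P" into the
 * data stripe count k and the parity stripe count m.
 * Returns 0 on success, -1 if the string is malformed or out of range.
 */
static int ec_parse_shorthand(const char *arg, unsigned int *k,
                              unsigned int *m)
{
        unsigned int d, p;
        char tail;

        /* The trailing %c only converts if junk follows the parity count. */
        if (sscanf(arg, "ec:%u+%u%c", &d, &p, &tail) != 2)
                return -1;

        /* Assumption: both counts are non-zero and fit in one byte. */
        if (d == 0 || p == 0 || d > 255 || p > 255)
                return -1;

        *k = d;
        *m = p;
        return 0;
}
```

With this sketch, "ec:8+2" yields k=8 and m=2, while strings such as "ec:8" or "ec:8+2x" are rejected.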
The lfs setstripe command above will create two separate sets of components in the file. The first set will hold the standard RAID0 data stripes for each of the components, and the second set will hold a matching set of EC parity stripes for each of those components. Each of these component sets will have a different MIRROR_ID value, like a regular FLR mirror, except that the EC parity components will additionally store the LOV_PATTERN_PARITY pattern.
EC components can be removed from a file with the following commands:
lfs mirror split -d -L ec FILENAME # delete all EC components
lfs setstripe --comp-del -I EC_COMP_ID FILENAME # delete a single EC component (not recommended)
Since this is the primary command line interface for users to create new files with Lustre-specific layouts, a number of existing options can be reused. Adding erasure-coding-specific options allows the same code to create erasure-coded composite layouts without duplicating a large number of options. The existing --component-end arguments specify the data component, and the -L erasure-code arguments that follow the data component arguments specify the parameters of the code component associated with that data component. -k specifies the number of data stripes from which the erasure code is computed; if this option is not set, all stripes of the associated data component are used for the code calculation. -m specifies the number of code stripes; encoding takes k data devices and calculates m code devices. The -i/-o/-p options that follow can be used to specify which OSTs are used as the code devices.
As an example:
$ lfs setstripe -E 4M -c 4 -L erasure-code -m 2 -E eof -c 32 -L ec:8+2 /mnt/lustre/file
This creates a file with two data components and two code components.
Every code component uses the same stripe size as its corresponding data component, and the parity computation only involves the data of that component. If the trailing data of the component is not aligned to a full group of k stripes, a zero-filled buffer will be used as padding for the parity computation.
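The zero-padding rule can be illustrated with a small sketch. It uses plain XOR, which is a single-parity code, as a stand-in for the real erasure code, so it only shows how a short trailing chunk is treated, not how parity would actually be generated. The function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * XOR stand-in for the real encode step: fold k data chunks into one
 * parity chunk of chunk_size bytes. lens[i] is the number of bytes that
 * chunk i really holds; anything beyond that is treated as zero padding.
 */
static void xor_parity_padded(const unsigned char *const *data,
                              const size_t *lens, size_t k,
                              size_t chunk_size, unsigned char *parity)
{
        size_t i, j;

        memset(parity, 0, chunk_size);
        for (i = 0; i < k; i++) {
                size_t n = lens[i] < chunk_size ? lens[i] : chunk_size;

                /* XOR with zero is a no-op, so the padded tail is skipped. */
                for (j = 0; j < n; j++)
                        parity[j] ^= data[i][j];
        }
}
```

Skipping the padded bytes gives the same result as XOR-ing against an explicit zero-filled buffer, which is why no extra allocation is needed in this simplified case.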
lfs mirror resync [-y] EC_FILENAME
This command is used to resynchronize an out-of-sync erasure-coded file. If there are no stale code components in the EC file and the -y argument is not used, this command does nothing. Otherwise, this command will first take an exclusive lease on the file, and then resynchronize it. During the resynchronization, the file data will be read to generate the parity codes, which will then be written to the OST objects of the code components. After the parity codes are synchronized, this command will change the layout to mark the code components as uptodate and release the lease on the file.
On-disk/wire structure changes for Erasure Coding layout component
- layout header
struct lov_comp_md_v1 {
        __u32 lcm_magic;
        __u32 lcm_size;
        __u32 lcm_layout_gen;
        __u16 lcm_flags;
        __u16 lcm_entry_count;
        /* lcm_mirror_count stores the number of actual mirrors minus 1,
         * so that non-flr files will have value 0 meaning 1 mirror.
         */
        __u16 lcm_mirror_count;
        /* code components count, non-EC file contains 0 ec_count */
+       __u8 lcm_ec_count;
        __u8 lcm_padding3[1];
        __u16 lcm_padding1[2];
        __u64 lcm_padding2;
        struct lov_comp_md_entry_v1 lcm_entries[0];
};
- code component entry header
struct lov_comp_md_entry_v1 {
        __u32 lcme_id;
        __u32 lcme_flags;
        struct lu_extent lcme_extent;
        __u32 lcme_offset;
        __u32 lcme_size;
        __u32 lcme_layout_gen;
+       __u8 lcme_dstripe_count; /* data stripe count used in EC, k value */
+       __u8 lcme_cstripe_count; /* code stripe count, m value */
        __u16 lcme_padding_1;
        __u64 lcme_timestamp;
};
- data component entry blob is the same as a non-EC plain file layout
struct lov_mds_md_v3 {
        __u32 lmm_magic; /* LOV_MAGIC_V3 */
        __u32 lmm_pattern; /* LOV_PATTERN_RAID0 */
        struct ost_id lmm_oi;
        __u32 lmm_stripe_size;
        __u16 lmm_stripe_count; /* ec stripe count */
        __u16 lmm_layout_gen;
        char lmm_pool_name[LOV_MAXPOOLNAME + 1];
        struct lov_ost_data_v1 lmm_objects[0];
};
- parity code component entry blob
struct lov_mds_md_v3 {
        __u32 lmm_magic; /* LOV_MAGIC_V3 */
        __u32 lmm_pattern; /* LOV_PATTERN_PARITY */
        struct ost_id lmm_oi;
        __u32 lmm_stripe_size;
        __u16 lmm_stripe_count; /* ec stripe count */
        __u16 lmm_layout_gen;
        char lmm_pool_name[LOV_MAXPOOLNAME + 1];
        struct lov_ost_data_v1 lmm_objects[0];
};
Memory structure changes for Erasure Coding layout
- layout header
struct lov_stripe_md {
        ...
        u32 lsm_magic; -> LOV_MAGIC_COMP_EC
        ...
};
- code component entry
struct lov_stripe_md_entry {
        ...
        u32 lsme_magic; -> LOV_MAGIC_EC
+       u8 lsme_dstripe_count;
+       u8 lsme_cstripe_count;
        ...
};
The code component has the same extent range as its associated data component, and the code component descriptor is positioned after the data components.
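The "+" lines in the on-disk structures above suggest that the new one-byte fields sit alongside the existing padding fields. The sketch below restates the layout header and component entry with fixed-width standard types and checks the resulting sizes at compile time. The _sketch type names are hypothetical, struct lu_extent is assumed to be two 64-bit offsets, and the expected sizes of 32 and 48 bytes are derived only from the field lists shown in this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: struct lu_extent is a pair of 64-bit offsets. */
struct lu_extent_sketch {
        uint64_t e_start;
        uint64_t e_end;
};

/* Mirrors the field list of struct lov_comp_md_entry_v1 shown above. */
struct lov_comp_md_entry_sketch {
        uint32_t lcme_id;
        uint32_t lcme_flags;
        struct lu_extent_sketch lcme_extent;
        uint32_t lcme_offset;
        uint32_t lcme_size;
        uint32_t lcme_layout_gen;
        uint8_t  lcme_dstripe_count;    /* k value */
        uint8_t  lcme_cstripe_count;    /* m value */
        uint16_t lcme_padding_1;
        uint64_t lcme_timestamp;
};

/* Mirrors the field list of struct lov_comp_md_v1 shown above. */
struct lov_comp_md_sketch {
        uint32_t lcm_magic;
        uint32_t lcm_size;
        uint32_t lcm_layout_gen;
        uint16_t lcm_flags;
        uint16_t lcm_entry_count;
        uint16_t lcm_mirror_count;
        uint8_t  lcm_ec_count;          /* 0 for non-EC files */
        uint8_t  lcm_padding3[1];
        uint16_t lcm_padding1[2];
        uint64_t lcm_padding2;
        struct lov_comp_md_entry_sketch lcm_entries[];
};

/* The fields pack without compiler-inserted holes. */
_Static_assert(sizeof(struct lov_comp_md_entry_sketch) == 48,
               "component entry is expected to be 48 bytes");
_Static_assert(sizeof(struct lov_comp_md_sketch) == 32,
               "layout header is expected to be 32 bytes");
_Static_assert(offsetof(struct lov_comp_md_entry_sketch, lcme_timestamp) == 40,
               "timestamp is expected at offset 40");
```

A check of this kind can help catch accidental changes to the wire format when fields are carved out of padding.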
Layout Component Creation
Creating layout components on demand still uses the layout intent RPC to notify the MDS to prepare the file's components, as is done for PFL/FLR files.
Write to EC File with Delayed Parity
The first phase of this project aims to build the infrastructure to support erasure coding in Lustre. To simplify the implementation, this design only supports delayed parity calculation, i.e., a data write does not generate parity code at the same time. It only marks the code components stale, and the write continues on the striped data components. After the file is closed, an administrator can use an external tool to generate and write the parity code. Marking a code component stale will be recorded in the Lustre ChangeLog so that the resync tool or an external policy engine can detect and resync the code components.
With delayed write, the write intent RPC handler in the MDS will instantiate the destination data and code components, and mark the code component STALE. The layout generation will be increased to notify the other clients to clean up their caches. These code components will stay stale, and reads cannot use them to rebuild missing data until they are resynced later.
Map file offset to code offset
Each stripe of parity code is calculated by a polynomial taking several (say k) stripes of data as its variables, so that every k bytes of data in the file correspond to one byte of code in each of the corresponding component code stripes.
Suppose the data at file offset $off$ is located in component $m$ (here $m$ is the component index, not the parity stripe count), that $k_m$ data stripes are used for code generation in component $m$, that component $m$ starts at $comp_m.start$, and that the stripe size of this component is $s_m$. The code offset $code\_off$ in the code object corresponding to the data at $off$ can then be calculated as follows, using integer division:
$$off_m = off - comp_m.start$$
$$code\_off = \lfloor off_m / (k_m \cdot s_m) \rfloor \cdot s_m + (off_m \bmod s_m)$$
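The offset mapping above can be written as a short helper. This is a minimal sketch, assuming 64-bit file offsets and that off lies within the component. The function name ec_code_offset and its parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Map a file offset to the offset inside a code object.
 *   off         file offset of the data
 *   comp_start  start offset of the component containing off
 *   k           number of data stripes used for code generation
 *   stripe_size stripe size of the component, in bytes
 */
static uint64_t ec_code_offset(uint64_t off, uint64_t comp_start,
                               uint32_t k, uint32_t stripe_size)
{
        uint64_t off_m = off - comp_start;
        /* Index of the k-stripe group (one full row of data stripes). */
        uint64_t row = off_m / ((uint64_t)k * stripe_size);

        /* Each row contributes one stripe_size chunk to a code object. */
        return row * stripe_size + off_m % stripe_size;
}
```

For example, with k = 4 and a 1 MiB stripe size in a component starting at offset 0, file offsets 0 and 1 MiB both map to code offset 0, because they occupy the same position in two data stripes of the same row, while file offset 4 MiB starts the next row and maps to code offset 1 MiB.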
Erasure Code Resynchronization
After an EC file is written, its code component cannot be used to reconstruct data if a data OST becomes unavailable. This phase needs an external mechanism to synchronize the code component, i.e., compute the parity code from the file data and write it to the code component, then update the layout to indicate that the parity in the code component is valid, so that reads can use it to recover data from failed OSTs.
Read of Degraded File
When a read from an EC file encounters failed OSTs, the IO framework detects the error. The LOV layer will then read the remaining data pages in the EC chunk together with the associated parity from the code component, and regenerate the missing data pages for the failed OSTs, so that the client can complete the read without noticing that the OSTs are unavailable. Since the EC parity is computed on a per-block basis across the striped file, the minimum number of pages needed to reconstruct one missing data page is (d-1)+p: one data page from each of the d-1 remaining stripes plus p parity pages, where a single parity page is sufficient when one stripe has failed (though both parities may be read to allow confirming that the reconstruction is correct).
For large reads (>= stripe_size), it is most efficient to read the full stripe_size of data and parity and reconstruct a whole stripe at a time. It is likely that the additional data stripes would be needed for the application read() call in any case.
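As a simplified illustration of the reconstruction step, the sketch below rebuilds one missing data chunk from the k-1 surviving data chunks and one parity chunk. It uses plain XOR, which corresponds to a single-parity code, and is only a stand-in for the Reed-Solomon decode that an EC library would perform. The function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Rebuild a missing data chunk under XOR parity.
 *   surviving  the k-1 data chunks that could still be read
 *   count      number of surviving chunks (k-1)
 *   parity     parity chunk covering all k data chunks
 *   len        chunk length in bytes
 *   out        receives the reconstructed chunk
 */
static void xor_rebuild(const unsigned char *const *surviving, size_t count,
                        const unsigned char *parity, size_t len,
                        unsigned char *out)
{
        size_t i, j;

        /* parity = d0 ^ d1 ^ ... so removing the survivors leaves the loss. */
        memcpy(out, parity, len);
        for (i = 0; i < count; i++)
                for (j = 0; j < len; j++)
                        out[j] ^= surviving[i][j];
}
```

The inputs are exactly the (d-1) data pages plus one parity page described above. Reading a second parity page would allow the result to be cross-checked in a real multi-parity code.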
Extent lock expand
Depending on the number of code fragments configured for erasure coding, calculating the data value at a certain file offset requires reading all the data stripes in the relevant code fragments. This may involve reading data outside the current IO request range, which means a read lock covering that data must also be taken. In the case of an OST failure, the read will be restarted and will take the extent lock covering the relevant data and code fragments.
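One coarse way to picture the lock expansion is to round the requested range out to whole groups of k stripes within the component. This is a sketch under that assumption only; the actual lock range could be narrower, since parity is computed per block. The names ec_extent and ec_expand_extent are hypothetical, and the result is not clamped to the end of the component.

```c
#include <stdint.h>

/* Byte range [start, end) in file offsets. */
struct ec_extent {
        uint64_t start;
        uint64_t end;
};

/*
 * Expand a read range so that it covers every k-stripe group it touches.
 * Assumes comp_start <= start < end.
 */
static struct ec_extent ec_expand_extent(uint64_t start, uint64_t end,
                                         uint64_t comp_start, uint32_t k,
                                         uint32_t stripe_size)
{
        uint64_t group = (uint64_t)k * stripe_size;
        struct ec_extent ext;

        /* Round the start down and the end up to group boundaries. */
        ext.start = comp_start + (start - comp_start) / group * group;
        ext.end = comp_start +
                  ((end - comp_start) + group - 1) / group * group;
        return ext;
}
```

With k = 4 and a 1 MiB stripe size, a 10-byte read just past the 5 MiB mark would be expanded to cover the range from 4 MiB to 8 MiB.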
IO framework for read
When a read fails because of an unavailable OST, the IO framework retry mechanism is triggered. The retry creates another EC I/O, which reads the data stripes on the available OSTs and the parity code from the code OSTs, and calculates the data belonging to the failed OSTs.
Page management in LOV
To shield user space tools and the LLITE layer from being directly aware of the parity code, the LOV layer will take charge of the page management of the parity code belonging to the code components. LOV will acquire a CR extent lock on the requested code OST object and allocate cl_pages associated with the parity OST object. Since the code OST object is hidden from the LLITE layer, it is the LOV layer's responsibility to maintain the code pages. These pages can be attached to the code extent lock and/or maintained in an LRU facility. When the lock is canceled, the parity cache can be discarded accordingly.
Requirements
Erasure Code Library
We would choose an existing library that supports erasure code encoding and decoding. The Intel Intelligent Storage Acceleration Library (ISA-L) (https://github.com/01org/isa-l) supports fast block Reed-Solomon type erasure codes for any encode/decode matrix in GF(2^8), which we can leverage for parity code generation and data restoration.
User Space Tools
A Lustre user space tool is needed to define and set parity components for a file. We can reuse lfs mirror create to serve this purpose.
Another tool is needed to generate the erasure code for changed data blocks and update the parity code. We would reuse lfs mirror resync, which would check internally whether the component to resync is a mirror or an EC component.
Erasure Coded File Write
Erasure coded file writes will mark the corresponding parity component stale. After the file is closed, the resync tool can be used to clear the write extent list and generate or update the erasure code for the corresponding parity component.
Erasure Coded File Read
The Lustre client will do normal reads from the RAID-0 data component. If there is an OST failure or another error reading from a data stripe, a read recovery will be started, which reads the erasure code data from the parity components and reconstructs the data for the failed OST.
Future Development
Write to EC File with Immediate Parity
In a later development phase, it may be possible to implement Immediate EC Write for the restricted (but fairly common) use case of a single client writing a new file in linear offset order (which is the only IO method supported for most object stores). The EC components would be initialized as "in progress" or "STALE" during initial file writes. The client could compute the code for the EC component in a similar manner as is done for Read of Degraded File, and write it asynchronously to the EC stripes. If the writes of the file's data and code components have completed and been synced without errors (the most common case), then the EC component(s) can be marked uptodate when the file is closed. This essentially combines the Delayed Write with the Erasure Code Resynchronization step, and could cover a large fraction of the normal use cases.

5 Comments
Andreas Dilger
In general this looks quite reasonable. It would be good to describe the page management in some more detail.
Qian Yingjin
For this design, I still have a question about the file size management.
i.e.
Stripe size = 1M, k=2, m=1.
In this example, the file is striped over two data objects, obj1 and obj2. The size of obj1 is 1M and the size of obj2 is 0.5M, so the total file size is 1.5M.
As we know, the file size is obtained via a glimpse lock according to the OST objects' sizes. If obj2 cannot be accessed due to the corresponding OST failure, how do we get the correct file size in this case?
Andreas Dilger
In this case it would be possible to partially recover the file size based on only the EC object, by assuming that non-zero file data would be written to EOF, so the rebuild could truncate the reconstructed obj2' to the last non-zero byte (or block) of data in that object. That wouldn't be consistently accurate, but at least relatively close in most cases. If there is an obj3 then the size of obj2 doesn't affect the file size, so this issue only applies to the last object in the file.
It would instead be possible to also store the actual size of the file (or the end of the component if file size is larger) directly on each EC object (i.e. 1.5M in this example), instead of the end of the EC itself (i.e. 1M in this example), so that each EC object would be sparse at the end (0.5M in this example). That is probably OK since we only care about the EC blocks that overlap with data blocks during reconstruction, but I don't yet know the details of EC reconstruction to confidently say that we don't need the EC size for anything else.
Qian Yingjin
Ah, I just read the FLR code: the MDS can store the correct total file size on the MDT after the erasure code resync. So when reconstructing a read in degraded mode, if the erasure code and the data stripes are consistent, the file size is correct and can be obtained from the MDT. So it may not be a problem in this phase of the Erasure Coding HLD.
Patrick Farrell
Yes, exactly - when FLR is in RDONLY, you have full/strict SOM. But it is a good point in general and something to remember for later phases.