Distributed Namespace (DNE) allows the Lustre namespace to be divided across multiple metadata servers. This enables the size of the namespace and metadata throughput to be scaled with the number of servers. An administrator can allocate specific metadata resources for different sub-trees within the namespace.
This project is split into two phases:
Phase 1: Remote Directories. Lustre sub-directories are distributed over multiple metadata targets (MDTs). Sub-directory distribution is defined by an administrator using a Lustre-specific mkdir command.
Phase 2: Striped Directories. The contents of a given directory are distributed over multiple MDTs.
This document is concerned exclusively with the first phase: Remote Directories.
This document assumes knowledge of the DNE Phase 1 Solution Architecture, which specifies all the requirements and presents the solution proposal.
DNE will be based on the Orion MDS stack. The Orion MDS stack is composed of the following layers:
After the MDS receives a request from a client:
A transaction consists of four stages:
For DNE Phase 1, only create and unlink of a remote directory are supported. For this reason, only create and unlink are included in the OSP API below. API calls are paired to provide synchronous behavior.
| OSP OBJECT API | Description |
|---|---|
| | Create an object remotely on another MDT. |
| | Increase the refcount of a remote object. |
| | Decrease the refcount of a remote object. |
| | Set attribute of a remote object. |
| | Set xattr of a remote object. |
| | Get attr/xattr from a remote object. |
Lustre 2.1 introduced the File ID (FID). The FID uniquely identifies a file or a directory across the entire Lustre filesystem. A FID is a three-field data structure, illustrated in the figure below.
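As a rough illustration of the three-field structure, the sketch below models a FID in Python. The field names and widths (a 64-bit sequence, a 32-bit object ID and a 32-bit version) follow the upstream `struct lu_fid` definition rather than anything stated in this document, and the `Fid` class, its byte order and its methods are hypothetical helpers for illustration only.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Fid:
    """Toy model of a FID (hypothetical class, not Lustre code)."""
    seq: int      # 64-bit sequence number
    oid: int      # 32-bit object ID within the sequence
    ver: int = 0  # 32-bit version

    def pack(self) -> bytes:
        # Big-endian byte order is an arbitrary choice for this sketch.
        return struct.pack(">QII", self.seq, self.oid, self.ver)

    @classmethod
    def unpack(cls, raw: bytes) -> "Fid":
        return cls(*struct.unpack(">QII", raw))

    def __str__(self) -> str:
        return f"[{self.seq:#x}:{self.oid:#x}:{self.ver:#x}]"


fid = Fid(seq=0x200000400, oid=0x1)
print(fid, len(fid.pack()))
```

Packing the three fields gives a fixed 16-byte value, which is what makes it practical to store the same identifier both in a directory entry and in an extended attribute.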
A FID is stored in two places: the directory entry and the inode extended attribute (EA). The FID in the EA will be used by the online checking tool (see the Inode Iterator and OI Scrub Solution Architecture). During lookup, the name entry is located and its FID is returned. The FID location database (FLDB) is then interrogated to provide the MDT index, which identifies the MDT on which the object resides. Finally, the object is identified by looking up the object index table on the appropriate MDT. In the case of a remote directory, the FID identifies an object residing on another MDT. Every remote entry will have a local agent inode, and the FID will be stored in the EA of this local inode. With this design, the FID of a remote directory can be stored and retrieved after the filesystem is restored from backup.
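The lookup path described above (name entry to FID, FID sequence to MDT index via the FLDB, then the object index on that MDT) can be sketched as follows. This is a simplified in-memory model: the `Fldb` class, the `resolve` function, the sequence ranges and the dictionary-based object index are all hypothetical and stand in for on-disk structures.

```python
from bisect import bisect_right


class Fldb:
    """Toy FID location database: maps sequence ranges to MDT indices."""

    def __init__(self):
        self._ranges = []  # sorted list of (start, end_exclusive, mdt_index)

    def add_range(self, start, end, mdt_index):
        # Overlap checking is omitted to keep the sketch short.
        self._ranges.append((start, end, mdt_index))
        self._ranges.sort()

    def lookup(self, seq):
        starts = [r[0] for r in self._ranges]
        pos = bisect_right(starts, seq) - 1
        if pos >= 0:
            start, end, mdt_index = self._ranges[pos]
            if start <= seq < end:
                return mdt_index
        raise KeyError(f"no FLDB range covers sequence {seq:#x}")


def resolve(dir_entries, name, fldb, object_indexes):
    """Follow a name to the object it refers to, possibly on another MDT."""
    fid = dir_entries[name]               # 1. name entry -> FID (seq, oid, ver)
    mdt_index = fldb.lookup(fid[0])       # 2. FID sequence -> MDT index
    obj = object_indexes[mdt_index][fid]  # 3. object index lookup on that MDT
    return mdt_index, obj


fldb = Fldb()
fldb.add_range(0x200000400, 0x240000400, 0)  # example range owned by MDT0
fldb.add_range(0x240000400, 0x280000400, 3)  # example range owned by MDT3
remote_fid = (0x240000401, 0x1, 0x0)
entries = {"projects": remote_fid}
object_indexes = {0: {}, 3: {remote_fid: "inode 1234 on MDT3"}}
print(resolve(entries, "projects", fldb, object_indexes))
```

In this model the parent directory holds only the FID, so nothing in the name entry itself needs to change if the mapping from sequence to MDT is consulted at lookup time.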
The local agent inode will reside in special directories (AGENT DIRs). Each agent inode is placed in the agent directory that corresponds to the MDT on which the remote object is located. For example, if the remote object is located on MDT3, then the local agent inode will be placed in AGENT_DIR.3.
As described in the Architecture section, any operation on an MDT is divided into four transaction steps. For remote directory creation (lfs mkdir) the following will occur:
[...] link xattr, via the Slave OSP, which updates the RPC for this target.
[...] last_rcvd file to record the Client RPC XID.
[...] last_rcvd slot.
Since all server operations are synchronous there is no need for replay. However, the Master MDT may still receive resent mkdir requests from the client and it must handle these as follows:
[...] last_rcvd file. If this matches the XID stored in the last_rcvd file, the entire operation must have completed because the remote directory object was created synchronously on the Slave MDT before the name entry and last_rcvd were updated atomically. In this case, the Master MDT can reconstruct the RPC reply from the last_rcvd entry.
[...] last_rcvd file, then the update was not committed on the Master MDT. The Master MDT will resend the same object creation RPC to the Slave MDT using the FID supplied by the Client.
[...] last_rcvd as in the normal operation.
To remove a directory (lfs rmdir), the following will occur:
[...] last_rcvd file.
In case of failure during the remote unlink operation, recovery proceeds as follows:
[...] last_rcvd file and the reply can be reconstructed.
Changing the mode or attributes of a remote directory does not involve its parent, i.e. it is an operation local to the MDT that holds the remote directory, so it follows the same process as with a single MDT.
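The handling of a resent remote mkdir described earlier in this section can be sketched as a small in-memory model. The `MasterMdt` and `SlaveMdt` classes, the reply tuple and the dictionary standing in for the last_rcvd file are hypothetical. The sketch also assumes that creating an object with an already-used FID on the Slave MDT is harmless, which is what allows the Master MDT to resend the same creation RPC.

```python
class SlaveMdt:
    """Toy Slave MDT holding remote directory objects."""

    def __init__(self):
        self.objects = set()

    def create(self, fid):
        # Assumption: creating the same FID twice is a no-op (idempotent).
        self.objects.add(fid)


class MasterMdt:
    """Toy Master MDT holding the name entry and the last_rcvd records."""

    def __init__(self, slave):
        self.slave = slave
        self.entries = {}    # name -> FID of the remote directory
        self.last_rcvd = {}  # client -> (xid, reply), stands in for the file

    def mkdir(self, client, xid, name, fid):
        slot = self.last_rcvd.get(client)
        if slot is not None and slot[0] == xid:
            # XID matches: the whole operation completed earlier,
            # so the reply is reconstructed instead of re-executing.
            return slot[1]
        # XID not recorded: (re)send the creation using the client's FID.
        self.slave.create(fid)
        # The name entry and last_rcvd are updated together.
        reply = ("created", fid)
        self.entries[name] = fid
        self.last_rcvd[client] = (xid, reply)
        return reply


slave = SlaveMdt()
master = MasterMdt(slave)
first = master.mkdir("client-a", xid=7, name="remote_dir", fid=(0x240000401, 1, 0))
resent = master.mkdir("client-a", xid=7, name="remote_dir", fid=(0x240000401, 1, 0))
print(first == resent, len(slave.objects))
```

Because the remote object is created synchronously before the local update, a matching XID is enough to conclude that both halves of the operation are in place.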
FIDs were designed in a manner that allows existing 1.x filesystems to be upgraded to 2.x while keeping the same inode identifiers. For existing inodes on MDT0 the inode numbers are mapped into the Inode/Generation-in-FID (IGIF) space using the FID sequence range [12-0xffffffff]. Existing OST object IDs are mapped into the ID-in-FID (IDIF) namespace using sequence [0x100000000-0x1ffffffff]. Legacy OST objects are identified by a special surrogate FID called IDIF, which is not unique between OSTs.
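The two legacy ranges above can be expressed as a simple classification on the FID sequence number. The range constants are taken from the text; the function name and the catch-all "other" label are hypothetical, and the sketch deliberately says nothing about how sequences outside these two ranges are used.

```python
IGIF_RANGE = (12, 0xFFFFFFFF)            # inode/generation mapped into a FID
IDIF_RANGE = (0x100000000, 0x1FFFFFFFF)  # legacy OST object IDs mapped into a FID


def classify_seq(seq: int) -> str:
    """Return which legacy space, if any, a FID sequence falls into."""
    if IGIF_RANGE[0] <= seq <= IGIF_RANGE[1]:
        return "IGIF"
    if IDIF_RANGE[0] <= seq <= IDIF_RANGE[1]:
        return "IDIF"
    return "other"


for seq in (12, 0xFFFFFFFF, 0x100000000, 0x200000400):
    print(hex(seq), classify_seq(seq))
```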
In order to maintain compatibility with Lustre 1.x, the current Lustre 2.x implementation limits the filesystem to a maximum of eight MDTs due to the limited space available for the MDT FID numbers that avoid collisions with IGIF and regular FIDs. The FID-on-OST feature needs to be implemented as part of this project in order to allow arbitrary numbers of MDTs. OST FIDs and MDT FIDs will share the same FID space. There is one master FID sequence server in a Lustre filesystem (currently always hosted on MDT0) that will manage the FID allocation and location. The master FID sequence server will sub-assign large ranges of sequence numbers (ranges of 2^30 consecutive sequence numbers called super-sequences) to each server in the filesystem, both MDTs and OSTs, which allows parallel and scalable FID allocations from within the super-sequences.
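A minimal sketch of the super-sequence hand-out follows. Only the width of 2^30 sequence numbers comes from the text; the `SequenceController` class, the starting sequence value and the server names are assumptions made for illustration.

```python
SUPER_SEQ_WIDTH = 1 << 30  # 2^30 consecutive sequence numbers per grant


class SequenceController:
    """Toy master FID sequence server handing out disjoint super-sequences."""

    def __init__(self, first_seq):
        self._next = first_seq
        self.assigned = {}  # server name -> list of (start, end_exclusive)

    def grant(self, server):
        start = self._next
        self._next += SUPER_SEQ_WIDTH
        rng = (start, start + SUPER_SEQ_WIDTH)
        self.assigned.setdefault(server, []).append(rng)
        return rng


# The starting value is arbitrary; it is simply above the legacy ranges.
controller = SequenceController(first_seq=0x200000000)
for server in ("MDT0000", "MDT0001", "OST0000"):
    start, end = controller.grant(server)
    print(server, hex(start), hex(end))
```

Once a server holds a super-sequence it can allocate FIDs from it without contacting the master again, which is where the parallelism comes from.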
For OST object allocations, the OST is the FID server that requests super-sequences from the FID sequence server (MDT0). For OST objects, each MDT behaves as a FID client that requests object FIDs from the OST on which it wants to allocate objects and assigns them to new files as needed. As is currently done for efficiency during MDT file allocation, the MDT pre-allocates multiple OST object FIDs from each OST and caches them locally to reduce the number of RPCs sent to the OSTs.
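The effect of pre-allocating OST object FIDs on the MDT can be illustrated with a small cache model. The class names, the batch size of 32 and the RPC counter are hypothetical; the point is only that one request to the OST serves many file creations.

```python
from collections import deque


class OstFidServer:
    """Toy OST that hands out object FIDs from its own sequence."""

    def __init__(self, seq):
        self.seq = seq
        self.next_oid = 1
        self.rpcs = 0  # number of allocation requests received

    def allocate(self, count):
        self.rpcs += 1
        first = self.next_oid
        self.next_oid += count
        return [(self.seq, oid) for oid in range(first, first + count)]


class MdtFidCache:
    """Toy MDT-side cache of pre-allocated object FIDs for one OST."""

    def __init__(self, ost, batch=32):
        self.ost = ost
        self.batch = batch
        self._cache = deque()

    def next_fid(self):
        if not self._cache:
            # Refill in one request instead of asking the OST per file.
            self._cache.extend(self.ost.allocate(self.batch))
        return self._cache.popleft()


ost = OstFidServer(seq=0x280000400)
cache = MdtFidCache(ost)
fids = [cache.next_fid() for _ in range(64)]
print(len(set(fids)), ost.rpcs)
```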
With the FID-on-OST feature, the original object ID (l_object_id and l_object_seq in lov_mds_md) will be mapped to a real FID. The OST on which the object resides will still be cached in l_ost_idx in the lov_mds_md, but the OST index could also be derived from the FID sequence number alone. To reduce lock contention within each OST, OST objects may be distributed into several directories by the hash of the FID, though this is an internal OST implementation detail. In the current OST implementation, OST objects are stored under the /O/{seq}/[d0-d31]/{object_id} namespace (where currently seq=0 always), and this layout will be maintained with the FID-on-OST feature.
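The sketch below shows one way the on-disk path of an OST object could be derived from its sequence and object ID. It assumes that the object file sits inside one of the 32 subdirectories d0 to d31 and that the subdirectory is chosen as the object ID modulo 32; the text only says that objects may be distributed by a hash of the FID, so both the function and the choice of modulo are illustrative assumptions.

```python
def ost_object_path(seq: int, object_id: int, subdirs: int = 32) -> str:
    """Build a path of the form /O/{seq}/d{N}/{object_id} (hypothetical helper)."""
    # Assumption: the subdirectory index is object_id modulo the number
    # of subdirectories, spreading objects across d0..d31.
    return f"/O/{seq}/d{object_id % subdirs}/{object_id}"


print(ost_object_path(0, 35))
print(ost_object_path(0, 64))
```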
As with OST failover, MDT failover will ensure that the Lustre filesystem remains available in the face of MDS node failure. By allowing multiple MDTs to be exported from one MDS, Lustre can support active-active failover for metadata as it already does for data. This enables all MDS node resources to be exploited during normal operation to share the metadata load and increase throughput.
As with data storage, Lustre delegates durability and resilience to its storage devices. Unrecoverable failures in the storage device will result in permanent loss of affected portions of the filesystem namespace. With DNE, this may also make portions of the namespace accessed via the failed MDT inaccessible. The failure of MDT0 is an extreme case which can make the whole filesystem inaccessible.
If all the subsidiary MDTs are referenced directly from MDT0, then the permanent failure of one of these subsidiary MDTs does not affect the others. If there are multiple levels of remote directories, then the failure of one MDT will isolate any of its subsidiary directory trees, even though the MDTs holding those trees are not themselves damaged. Development planned for LFSCK Phase 3 will allow reattaching orphaned directory trees into the MDT0 namespace under a lost+found-mdtNNNN directory.
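The difference between a flat and a nested arrangement of remote directories can be seen with a small reachability check. The `links` mapping (which MDT holds remote directories located on which other MDTs) and the function name are hypothetical; the sketch simply walks the references from MDT0 while skipping the failed MDT.

```python
def reachable_mdts(links, failed, root=0):
    """Return the set of MDT indices still reachable from the root MDT."""
    if root == failed:
        return set()  # losing MDT0 cuts off the whole namespace
    seen = {root}
    stack = [root]
    while stack:
        mdt = stack.pop()
        for child in links.get(mdt, ()):
            if child != failed and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


flat = {0: [1, 2, 3]}             # every remote directory hangs off MDT0
nested = {0: [1], 1: [2], 2: [3]}  # remote directories nested three deep
print(sorted(reachable_mdts(flat, failed=1)))
print(sorted(reachable_mdts(nested, failed=1)))
```

In the flat layout only the failed MDT drops out, while in the nested layout MDT2 and MDT3 become unreachable as well even though they are intact.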
The on-disk format of current Lustre 2.x filesystems will be forward compatible with DNE on MDT0 and on the OSTs, so existing Lustre filesystems can be upgraded to a DNE-capable version of Lustre without affecting the data therein. Just upgrading to a DNE-capable version of Lustre will not affect the on-disk filesystem format. When a DNE-capable version of Lustre is running on both the clients and servers, new MDTs can be added to the existing filesystem while it is mounted and in use. The newly added MDTs will not be utilized until a remote directory is created there. Once a remote directory is created on a subsidiary MDT using lfs mkdir, it will not be possible to directly downgrade to a version of Lustre that does not support DNE. At this point, clients that do not support DNE will be evicted from the filesystem and will no longer be able to access it.
If the user wants to downgrade Lustre on the MDS from a multi-MDT filesystem to an older version of Lustre that does not support DNE (i.e. a single-MDT filesystem), all files and directories must be copied to MDT0 before removing the other MDTs. The configuration must be rewritten without allowing the subsidiary MDTs to reconnect to the MGS. Without these steps, any directories and files not located on MDT0 will be inaccessible from the old Lustre MDS.
mkfs.lustre --reformat --mgsnode=xxx --mdt --index=1 /dev/{mdtn_devn}
mount -t lustre -o xxxx /dev/{mdtn_devn} /mnt/mdtn
The Implementation phase is divided into the following three milestones, each with an estimated completion date:
Demonstrate working DNE code. The sanity.sh and mdsrate-create tests will pass in a DNE environment. Suitable new regression tests for the remote directory functionality will be added and passed, including functional Use Cases for upgrade and downgrade. (April 2nd 2012)
Demonstrate DNE recovery and failover. Suitable DNE-specific recovery and failover tests will be added and passed. (July 30th 2012)
Performance and scaling testing will be run on available testing resources. The Lustre Manual will be updated to include DNE Documentation. (Sep 7th 2012)