Today's HPC systems routinely configure Lustre with tens of thousands of clients, each of which draws resources from a single metadata server (MDS). The efforts of the Lustre community continue to improve the performance of the MDS; however, as client numbers continue to increase, the single MDS represents a fundamental limit to filesystem scalability.
The goal of the Distributed Namespace (DNE) project is to deliver a documented and tested implementation of Lustre that addresses this scaling limit by distributing the filesystem metadata over multiple metadata servers. This is an ambitious engineering project that will take place over a period of two years. It requires considerable engineering and testing resources and will therefore be performed in the two phases described below.
This phase introduces a useful minimum of distributed metadata functionality. Its primary purpose is to ensure that efforts concentrate on clean code restructuring for DNE. The phase focuses on extensive testing to shake out bugs and oversights not only in the implementation but also in administrative procedures, since DNE brings new usage patterns and those procedures must necessarily adapt to manage multiple metadata servers.
The Lustre namespace will be distributed by allowing directory entries to reference sub-directories on different metadata targets (MDTs). Individual directories will remain bound to a single MDT, so metadata throughput on a single directory will stay limited by single-MDS performance, but metadata throughput aggregated over distributed directories will scale.
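The layout described above can be illustrated with a minimal toy model. This is not Lustre code: the dictionary layout, the `resolve` helper, and the FID/MDT numbering are all invented for illustration. Each MDT holds whole directories, but an entry may point to a child directory held on another MDT.

```python
# Hypothetical toy of a DNE namespace. Each MDT maps FID -> directory,
# and each directory entry is a (mdt_index, fid) pair, so an entry may
# reference a sub-directory held on a different MDT.
mdts = {
    0: {1: {"home": (0, 2)},     # MDT0: root (FID 1) and /home (FID 2)
        2: {"alice": (1, 3)}},   # entry in /home referencing MDT1
    1: {3: {}},                  # MDT1: the remote directory /home/alice
}

def resolve(path):
    """Walk a path from the root, hopping between MDTs as entries dictate."""
    mdt, fid = 0, 1              # the filesystem root always lives on MDT0
    for name in path.strip("/").split("/"):
        mdt, fid = mdts[mdt][fid][name]
    return mdt, fid

assert resolve("/home") == (0, 2)        # stays on MDT0
assert resolve("/home/alice") == (1, 3)  # crosses to MDT1
print("resolved /home/alice on MDT", resolve("/home/alice")[0])
```

Note that each lookup step touches only the one MDT holding that directory, which is why throughput aggregates across MDTs while a single directory remains bound to one server.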
The creation of non-local subdirectories will initially be restricted to administrators with a Lustre-specific mkdir command. This will ensure that administrators retain control over namespace distribution to guarantee performance isolation.
This phase will introduce sharded directories in which directory entries in a single directory are hashed over multiple MDTs. This will allow metadata throughput within a single directory to scale.
This proposal describes the first phase: Remote Directories. A separate document will describe the Striped Directories phase.
DNE must provide the ability to distribute a single Lustre filesystem over multiple MDTs under administrator control. Metadata throughput aggregated over multiple MDSs should scale as MDSs are added until OSSs reach saturation.
The throughput of metadata operations against any single directory will be limited by the performance of a single MDS during Phase 1. This restriction will be addressed during Phase 2 when a single directory may be sharded over multiple MDTs.
Single MDS performance is addressed in a separate OpenSFS project: Single Server Metadata Performance.
DNE must demonstrate performance isolation between metadata workloads on different subtrees of the namespace located on different MDSs, consistent with the extent to which these subtrees contend for the same OSTs.
Rename and hardlink operations on inodes and directory entries that all reside on the same MDT will function as normal. Otherwise these operations will perform no action and return EXDEV, as if the caller attempted to use them across different filesystems. This forces utilities to try alternatives - e.g. "mv" will copy the source to the target and remove the source on success.
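The "mv" fallback can be demonstrated without a Lustre filesystem. In the sketch below, `os.rename` is replaced with a stand-in that always fails with EXDEV (simulating a cross-MDT rename); `shutil.move`, like "mv", then falls back to copy-plus-unlink. The `cross_mdt_rename` helper is invented for this demonstration.

```python
# Sketch: how "mv"-style tools cope with the EXDEV that DNE returns for
# cross-MDT rename. We force os.rename to fail with EXDEV so that
# shutil.move exercises its copy-then-delete fallback path.
import errno
import os
import shutil
import tempfile
from unittest import mock

src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "f")
dst = os.path.join(dst_dir, "f")
with open(src, "w") as fh:
    fh.write("payload")

def cross_mdt_rename(a, b):
    # Simulate a client whose source and target live on different MDTs.
    raise OSError(errno.EXDEV, "Invalid cross-device link")

with mock.patch("os.rename", cross_mdt_rename):
    shutil.move(src, dst)   # falls back to copy + unlink on EXDEV

assert not os.path.exists(src) and os.path.exists(dst)
print("mv fallback ok")
```

This is exactly the behavior the requirement relies on: no kernel-level cross-MDT rename is needed because userspace tools already handle EXDEV.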
The administrative command to create a directory on a non-local MDT (i.e. where the child directory resides on a different MDT from its parent), and rmdir run on such directories, will update state on both the child and parent MDTs. If a node fails while one of these operations is in progress, the child directory must remain accessible and intact for as long as the parent directory's reference exists. This requires sequenced synchronous writes to persistent storage, which will make these operations relatively slow. All other metadata-modifying operations will execute as normal with no performance regressions.
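The sequencing requirement above can be sketched as a two-step protocol: commit the child object on its MDT first, then insert the parent's name entry. Everything here (the `remote_mkdir` helper, the `Crash` exception, the FID value) is hypothetical and only models the ordering invariant, not Lustre's actual implementation.

```python
# Hypothetical sketch of the write ordering: the remote (child) MDT commits
# the new directory object before the parent MDT inserts the name entry.
# A crash between the two steps leaves an orphan object (cleanable by a
# filesystem scan) but never a name entry pointing at a missing object.
class Crash(Exception):
    pass

def remote_mkdir(child_mdt, parent_mdt, name, fid, crash_between=False):
    child_mdt.add(fid)            # step 1: synchronous create on child MDT
    if crash_between:
        raise Crash               # simulated node failure between steps
    parent_mdt[name] = fid        # step 2: insert entry on parent MDT

child, parent = set(), {}
try:
    remote_mkdir(child, parent, "dir_a", fid=100, crash_between=True)
except Crash:
    pass

# Invariant: every name entry references an existing object ("no dangling").
assert all(fid in child for fid in parent.values())
orphans = child - set(parent.values())   # the crash may leave an orphan
print("dangling entries: 0, orphans:", len(orphans))
```

Reversing the two steps would permit the opposite (and forbidden) failure mode: a parent entry referencing a child that was never created.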
All metadata operations on a single-MDT Lustre filesystem must also work in a Remote Directories environment. When adding a new MDT to a single-MDT filesystem, the contents of the existing MDT must not be affected.
A pre-DNE Lustre 2.x filesystem must be upgradable to a Lustre version that supports DNE.
Within the constraints of this initial development, addition of MDTs must be performed off-line. However, the implementation will not preclude online MDT addition once more experience is gained with its usage.
An administrator will be able to disable any MDT. Any file operations that access a disabled MDT will receive an IO error. Disabling should only be performed if an MDT is to be made permanently unavailable. If MDT0 (the primary MDT) is disabled the entire filesystem will be inaccessible.
Removing an MDT will not be supported.
The majority of metadata operations will only affect a single MDT. Their implementation and performance will not be changed. Operations that span multiple MDTs will be subject to additional constraints as described above to prevent "dangling" links (i.e. a directory entry for which there is no inode) under all possible node failures or combinations of node failures. It is permissible for such failures to leave an orphan inode which can be cleaned up by a filesystem scan.
The administrative command to create non-local sub-directories will initially only allow this from parent directories on MDT0. This is to ensure that the failure of any MDT other than MDT0 (the primary MDT) does not disconnect the namespace. For experimental purposes, an administrative override will be available that allows specifying non-MDT0 as the parent directory. Normal Lustre recovery should be unaffected.
Multiple MDTs sharing a single MDS must be supported so that MDSs can be configured for active-active failover just like OSSs.
Operations that require updates on multiple MDTs will be executed using a master/slave model. The client will send the request to the master MDT (e.g. the MDT holding the parent directory for the operation), which will then distribute updates to the slave MDT(s). These updates will be ordered and partially synchronous to ensure that the namespace remains intact in the face of all possible failures. In the case of multi-node failures, at worst a single reference on the directory will be leaked. This may result in an empty directory that is not deleted, but the namespace will remain usable after recovery. The cleanup of such empty directories will occur in Phase III of the Lfsck project. In DNE Phase 1, only remote directory creation and destroy are cross-MDT operations.
During remote directory creation, lfs mkdir must specify which MDT the directory will be created on (this is the slave MDT). Given that the entire process is synchronous, replay will not occur during recovery. However, resend may occur under two conditions:
If the master MDT restarts before it sends the reply to the client, the client will resend the request to the master MDT. Since it is the client that allocated the FID for the remote directory, it is guaranteed that the same FID will be used during the recovery.
After the master MDT receives the request, the existence of the name entry will be verified. If the entry exists, the master MDT will send a reply to the client with the existing directory FID. If the name does not exist the master MDT will redo the entire operation.
During this process, the slave MDT will verify whether the operation has already been executed. This can be resolved by adding the client->master request XID to the last_rcvd file for that client. When the master MDT sends its RPC to the slave MDT, the XID of the client->master request will be included in the slave RPC, and the slave MDT will add the XID to that MDT's export in the last_rcvd file. This will allow the slave MDT to determine whether the request from the master MDT was processed during recovery.
If the slave MDT restarts before it sends its reply to the master MDT, the master MDT will resend the request to the slave MDT when it reconnects. This will recreate the object if it does not exist.
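The resend rules above can be captured in a toy model. The `MasterMDT`/`SlaveMDT` classes, method names, and FID/XID values below are all invented for illustration; the point is only that a client-allocated FID plus XID tracking (standing in for last_rcvd) makes a resent create idempotent.

```python
# Hypothetical toy model of resend handling for remote mkdir: the client
# allocates the FID, so a resent request is idempotent; the slave records
# the original client XID (as last_rcvd does) to detect replayed RPCs.
class SlaveMDT:
    def __init__(self):
        self.objects = {}
        self.seen_xids = set()          # stands in for last_rcvd records

    def create(self, fid, xid):
        if xid in self.seen_xids:       # request was already processed
            return "already done"
        self.seen_xids.add(xid)
        self.objects[fid] = "dir"
        return "created"

class MasterMDT:
    def __init__(self, slave):
        self.entries = {}
        self.slave = slave

    def remote_mkdir(self, name, fid, xid):
        if name in self.entries:        # resend after a lost reply:
            return self.entries[name]   # reply with the existing FID
        self.slave.create(fid, xid)     # synchronous update to the slave
        self.entries[name] = fid
        return fid

slave = SlaveMDT()
master = MasterMDT(slave)
fid = master.remote_mkdir("dir_mdt1", fid=200, xid=7)
# The client never saw the reply and resends with the same FID and XID:
fid2 = master.remote_mkdir("dir_mdt1", fid=200, xid=7)
assert fid == fid2 == 200 and len(slave.objects) == 1
print("resend idempotent:", fid2)
```

Had the FID been allocated server-side, the resent request could have created a second object, which is exactly what the protocol avoids.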
During remote directory unlink (a process similar to OST orphan handling):
If the master MDT restarts before it sends the reply to the client, the client will resend the request to the master MDT. The entire operation will be repeated if the entry still exists; otherwise the master MDT will do nothing but reply to the client as if the operation had been processed correctly.
If the slave MDT restarts during the process, the log sync thread will ensure that the remote directory is eventually destroyed on the slave MDT.
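The log-driven destroy can be sketched as follows. The `Slave` class, the `log_sync` function, and the in-memory list standing in for the unlink log are invented for this sketch; real Lustre uses persistent llog records, but the retry-until-committed behavior is the same idea.

```python
# Hypothetical sketch of log-driven destroy: the master records the unlink
# in a log, and a sync thread replays pending records until the slave
# confirms, so a slave restart cannot lose the destroy (as with OST
# orphan handling).
class Slave:
    def __init__(self):
        self.objects = {300: "dir"}     # the remote directory to destroy
        self.up = False                 # slave is down when destroy arrives

    def destroy(self, fid):
        if not self.up:
            raise ConnectionError       # RPC fails while the slave is down
        self.objects.pop(fid, None)

slave = Slave()
unlink_log = [300]                      # pending destroy records

def log_sync(slave, log):
    for fid in list(log):
        try:
            slave.destroy(fid)
            log.remove(fid)             # cancel the record once committed
        except ConnectionError:
            pass                        # keep the record; retry later

log_sync(slave, unlink_log)             # slave down: record stays pending
assert unlink_log == [300]
slave.up = True                         # slave restarts and reconnects
log_sync(slave, unlink_log)
assert unlink_log == [] and 300 not in slave.objects
print("remote dir destroyed after recovery")
```

The log record is only cancelled after the destroy commits, which is what makes the destroy survive an arbitrarily timed slave restart.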
With DNE, MDS nodes may be configured as active-active failover pairs using shared storage. If one MDS in the pair fails, the other MDS will take over the failed MDS's MDT to maintain the accessibility of the complete namespace. Such an active-active failover configuration is illustrated below:
The notification and failover mechanism is provided by an external High Availability (HA) daemon that monitors the status of the MDS nodes. This HA daemon implements a mechanism to power off (STONITH) a failing MDS, and to import the shared storage to the backup server.
Functional testing will create a directory on every MDT and run the full set of standard non-failover tests within each directory concurrently and from multiple clients. All tests must pass, with the same behavior as observed with a single MDT.
Create two directories on all MDTs except MDT0. Populate the directories with subtrees using depth and file counts randomly chosen from within specified limits. Run multiple threads on multiple clients, each randomly picking a pair of directories from all the subtrees, creating a file in one directory and attempting to rename(2) or link(2) between directories. These operations must fail with EXDEV if the directories chosen were on different MDTs and succeed if they were on the same MDT.
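The expected pass/fail pattern of that test can be modeled with a small harness. The directory names, MDT mapping, and `simulated_rename` helper below are invented; the harness simply checks that an operation fails with EXDEV exactly when the two directories live on different MDTs, mirroring the test plan.

```python
# Hypothetical harness mirroring the rename/link test: directories are
# mapped to MDT indices, and a simulated cross-MDT rename fails with
# EXDEV exactly when the two directories live on different MDTs.
import errno
import random

dir_to_mdt = {"d1": 1, "d2": 1, "d3": 2, "d4": 2}   # sample layout

def simulated_rename(src_dir, dst_dir):
    # Phase 1 semantics: an operation spanning MDTs returns EXDEV.
    if dir_to_mdt[src_dir] != dir_to_mdt[dst_dir]:
        raise OSError(errno.EXDEV, "cross-MDT rename")

random.seed(0)
ok = exdev = 0
for _ in range(100):
    a = random.choice(list(dir_to_mdt))
    b = random.choice(list(dir_to_mdt))
    try:
        simulated_rename(a, b)
        assert dir_to_mdt[a] == dir_to_mdt[b]   # success => same MDT
        ok += 1
    except OSError as e:
        assert e.errno == errno.EXDEV
        assert dir_to_mdt[a] != dir_to_mdt[b]   # EXDEV => different MDTs
        exdev += 1
assert ok + exdev == 100
print("same-MDT renames:", ok, "cross-MDT EXDEV:", exdev)
```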
Create a directory on every MDT and run racer and lfs mkdir under these directories. Run

lfs mkdir -i $((RANDOM % MDTCOUNT)) $((RANDOM % MAX))

under the directory on MDT0; no errors should be observed. A remote directory must inherit the default ACL and other necessary attributes of its parent directory correctly.
Create the parent directory on MDT0:

mkdir /mnt/lustre/parent

Create a cross-ref directory on MDT1:

lfs mkdir -i 1 /mnt/lustre/parent/dir_mdt1

Verify that the default ACL of /mnt/lustre/parent/dir_mdt1 is the same as the one in /mnt/lustre/parent. Then rmdir the remote directory:

rmdir /mnt/lustre/parent/dir_mdt1

Normal recovery is defined as recovering to a functional filesystem after client or server failure.
Run lfs mkdir (used for creating cross-ref directories in DNE) so that each MDT will have two directories, and run lfs mkdir and rmdir during the test.

Restart the slave MDT during cross-MDT updates: use lfs mkdir to create a remote directory. The master MDT will send a synchronous create request to the slave MDT.

Restart the master MDT during cross-MDT updates: use lfs mkdir to create a remote directory. The master MDT will send a synchronous create request to the slave MDT.

Demonstrate active-active failover with separate MDTs.
Upgrade a single-MDT filesystem to DNE: format new MDTs with mkfs.lustre and add them to the filesystem, which will then include 4 MDTs.

Performance tests should show acceptable scaling for workloads operating in different directories hosted on separate MDTs, provided OSSs are not saturated.
Run mdsrate (--lookup --stat) on each directory in parallel (thread_count = n) and measure the time taken for the entire process: t_lookup+stat. Run mdsrate (--lookup --stat) on each directory in parallel (thread_count = n) and record the time taken: t'_lookup+stat. Assert t'_lookup+stat < t_lookup+stat.

Run mdsrate (--unlink) on each directory in parallel (thread_count = n) and measure the time taken for the entire process: t_unlink. Run mdsrate (--unlink) on each directory in parallel (thread_count = n) and record the time taken: t'_unlink. Assert t'_unlink < t_unlink.

MDS - MetaData Server (the node on which the metadata operations are done)
MDT - MetaData Target (the underlying storage and filesystem in which the metadata resides)
MDT0 - The MDT containing the root of the filesystem.
Master MDT - The MDT that coordinates a distributed MDT operation.
Slave MDT - The MDT controlled by the master MDT in a distributed MDT operation.
XID - The unique ID for a ptlrpc request within a ptlrpc connection.