Increasing performance without re-engineering applications is highly desirable for HPC users. OpenSFS and Whamcloud are engaged in removing a known bottleneck, which will deliver this benefit to users of the Lustre filesystem. This document presents the limitation and a proposed solution that will provide a step-wise performance boost to a wide class of HPC applications.
Single directory performance is critical for HPC workloads. In a typical use case an application creates a separate output file for each node and task in a job. As nodes and tasks increase, hundreds of thousands of files may be created in a single directory within a short window of time. In addition, on the Lustre OSS the internal object directories can grow to millions of entries, which can result in significant overhead under some workloads.
Today, both filename lookup and filesystem-modifying operations (such as create and unlink) are protected by a single lock for an entire ldiskfs directory. OpenSFS and Whamcloud will remove this bottleneck by introducing a parallel locking mechanism for ldiskfs directories. This work will enable multiple application threads to perform lookup, create, and unlink operations in parallel: Parallel Directory Operations (PDO).
The characteristics of a successful implementation are described in this section. Parallel Directory Operations (PDO) is concerned with a single component of the Lustre filesystem: ldiskfs. The Solution Requirements ensure that work on a single component can be verified against the strategic goals of the Lustre community.
Ldiskfs is responsible for storing data to disk and is part of the application stack that a user assumes is completely reliable. In addition, improved performance must not come at a cost elsewhere in the system: in particular, performance must be preserved for single-threaded operations. Finally, the new code will be freely licensed under the GNU GPL, and to be of value to the community it must be easy to maintain.
Ldiskfs uses a hashed btree (htree) to organize and locate directory entries, which is protected by a single mutex lock. The single-lock protection strategy is simple to understand and implement, but it is also a performance bottleneck because each directory operation must obtain and hold the lock for its duration. The PDO project implements a new locking mechanism that makes it safe for multiple threads to concurrently search and/or modify directory entries in the htree. With PDO, MDS and OSS service threads can process multiple create, lookup, and unlink requests in parallel for a shared directory. Users will see performance improvement for these commonly performed operations.
It should be noted that the benefit of PDO may not be visible to applications if the files being accessed are striped across many OSTs. In this case, the performance bottleneck may be with the MDS accessing the many OSTs, and not necessarily the MDS directory.
Htree directories with parallel directory operations will provide optimal performance for large directories. However, within a directory the minimum unit of parallelism is a single directory block (on the order of 50-100 files, depending on filename length). Parallel directory operations will not show performance scaling for modifications within a single directory block, but should not degrade in performance.
In order to be practically useful, any new locking mechanism should maintain or reduce resource consumption compared to the previous mechanism. To measure this, performance of PDO with a single application thread should be similar to that of the previous mechanism.
The existing htree implementation is well tested and in common usage. To avoid deviating from this state, it is not desirable to significantly restructure the htree implementation of ldiskfs. Ideally, the new lock implementation would be completely external to ldiskfs. In reality, this is not possible but a maintainable PDO implementation will minimize in-line changes and maximize ease of maintenance.
Alice writes an application for her 100,000-core machine. Each hour her application creates a checkpoint file using the job name, thread rank, and timestep number as components of the checkpoint filename (for example, a name of the form jobname.rank.timestep). Each thread writes its output to the same directory in the Lustre filesystem ‘/parallel/alice/app/checkpoint/’. This can quickly cause a directory to have millions of entries, and can reduce application efficiency.
MDS service threads must all modify the same directory to create the checkpoint files. Without PDO, these threads are serialized on the single directory lock, significantly increasing the time each checkpoint takes: time is lost both to threads waiting to create their checkpoint files and to I/O bandwidth that remains underutilized until enough threads have created new files.
The PDO project will:

- introduce a new htree-lock mechanism so that multiple threads can search and modify an ldiskfs directory in parallel;
- extend the ldiskfs htree implementation to support N levels, allowing even larger directories;
- provide a patch to enlarge the per-CPU buffer LRU used by the Linux buffer cache;
- add a runtime tunable to enable or disable the htree-lock for testing and benchmarking.
There are two types of blocks in an ldiskfs directory:

- Leaf blocks, which store the directory name entries themselves.
- Tree blocks, which store [hash, block-ID] indices pointing to leaf blocks or, in larger directories, to other tree blocks.
An ldiskfs directory stores all name entries in leaf blocks. A leaf block can contain between 15 and 340 entries, depending on the filename length (between 4 and 256 characters), but typically holds around 100 entries. It is not possible to parallelize locking at a smaller granularity than a single leaf block.
For directories with only a single leaf block, there is no tree block. When the directory grows and name entries overflow one leaf block, ldiskfs marks the directory as an “indexed directory”. At this point, name entries are sorted by hash value and the leaf block is split into two leaf blocks at the median hash value. A tree block is allocated to store the [hash, block-ID] indices for these two leaf blocks. The resulting structure is described as an htree.
Leaf blocks are split again and again as the size of the directory grows. Tree blocks can be split as well if the number of leaf blocks increases to the point where their indices overflow one tree block (more than 510 indices).
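For reference, the on-disk entry formats make these limits concrete. The following is a sketch modeled on the corresponding ext4 structures (ldiskfs is derived from ext4), written with fixed-width standard types so the snippet stands alone:

    #include <stdint.h>

    #define LDISKFS_NAME_LEN 255

    /* One name entry in a leaf block: an 8-byte header followed by the
     * name, padded to a 4-byte boundary.  A 4 KB leaf block therefore
     * holds roughly 4096 / (8 + 256) = 15 maximum-length entries, up to
     * about 4096 / (8 + 4) = 341 minimum-length entries. */
    struct ldiskfs_dir_entry_2 {
        uint32_t inode;                  /* inode number */
        uint16_t rec_len;                /* length of this entry record */
        uint8_t  name_len;               /* length of the name */
        uint8_t  file_type;
        char     name[LDISKFS_NAME_LEN]; /* file name, not NUL-terminated */
    };

    /* One [hash, block-ID] index entry in a tree block.  At 8 bytes per
     * index, a 4 KB tree block (minus its header) holds about 510
     * indices, matching the split threshold described above. */
    struct dx_entry {
        uint32_t hash;                   /* htree hash covering the child */
        uint32_t block;                  /* block number of the child */
    };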
Today, ldiskfs locks the whole htree whenever a thread needs to modify any block in the tree. The locking mechanism for PDO is designed to lock at the granularity of a block-ID or hash-value for the htree. This new locking mechanism is called the htree-lock.
The htree lock is characterized as follows:

- It locks at the granularity of a single block-ID or hash-value within the htree, rather than the entire directory.
- The lock handle is supplied by the caller (Lustre) and passed into ldiskfs with each directory operation; when no handle is supplied, ldiskfs falls back to the behavior described below.
If ldiskfs is called directly from the VFS without Lustre, the htree-lock will be NULL and ldiskfs will assume the directory is already protected by the VFS mutex. This behavior lets PDO-enabled ldiskfs degrade gracefully to serialized directory operations when accessed via the VFS interface (e.g. when using ldiskfs to mount the MDT locally).
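As a rough illustration of how the lock handle threads through the directory code, consider the sketch below. The names (htree_lock_leaf and so on) are assumptions made for exposition, not the actual patch API; the essential points are the optional handle and the NULL fallback:

    #include <stdint.h>

    struct inode;                       /* VFS inode (opaque here) */
    struct htree_lock;                  /* PDO lock handle (opaque here) */

    /* Hypothetical helpers, declared only to keep the sketch self-contained. */
    extern uint32_t ldiskfs_name_hash(const char *name);
    extern void htree_lock_leaf(struct htree_lock *lck, uint32_t hash);
    extern void htree_unlock_leaf(struct htree_lock *lck);
    extern int  do_add_entry(struct inode *dir, const char *name);

    /* Directory-modifying entry points accept an optional lock handle.
     * A Lustre service thread passes a real handle, so only the leaf
     * block covering this name's hash range is serialized; a plain VFS
     * caller passes NULL, and ldiskfs relies on the VFS inode mutex to
     * protect the whole directory as before. */
    int ldiskfs_add_entry(struct inode *dir, const char *name,
                          struct htree_lock *lck)
    {
        int rc;

        if (lck != NULL)
            htree_lock_leaf(lck, ldiskfs_name_hash(name));

        rc = do_add_entry(dir, name);   /* existing insertion path */

        if (lck != NULL)
            htree_unlock_leaf(lck);
        return rc;
    }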
Very large directories can contain many millions of files. Today the ldiskfs htree has only two levels and can hold at most about 15 million files. As part of the PDO changes to the htree, the ldiskfs implementation will be extended to support N levels of htree and therefore even larger directories. This is not among the requirements of PDO, but it has been included in scope because OpenSFS and Whamcloud judge it to be a worthwhile addition to the PDO work.
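A minimal sketch of the lookup descent under N-level support, assuming the tree depth is recorded in the root block; the helper names are illustrative, not the actual implementation:

    #include <stdint.h>

    struct inode;                       /* VFS inode (opaque here) */

    /* Hypothetical helpers, declared to keep the sketch self-contained. */
    extern int dx_root_levels(struct inode *dir);   /* depth from root block */
    extern uint32_t dx_find_child(struct inode *dir, uint32_t tree_block,
                                  uint32_t hash);   /* child covering hash */

    /* Return the leaf block whose hash range covers @hash.  With N-level
     * support the walk descends one tree block per level instead of
     * assuming exactly two levels. */
    uint32_t dx_probe(struct inode *dir, uint32_t hash)
    {
        int levels = dx_root_levels(dir);
        uint32_t block = 0;             /* level 0: the root tree block */

        while (levels-- > 0)
            block = dx_find_child(dir, block, hash);

        return block;                   /* leaf block to search or modify */
    }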
The buffer Least Recently Used (LRU) list is a per-CPU cache used in Linux for fast lookup of buffers. The default LRU size is 8 entries. This default is considered too small for Lustre once N-level htrees are supported for very large directories.

PDO on very large directories organized as N-level htrees is expected to keep more buffers in the LRU for quick reference. As a consequence, if the cache is too small, in-use buffers may be purged from the LRU prematurely, leading to repeated lookups of those buffers in the block device cache.
Purging an active buffer will significantly degrade performance because the slow and expensive buffer search path must then be traversed. To avoid this scenario, an additional patch that makes the LRU buffer size configurable will be provided, and a default value of 16 will be set.
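For reference, the per-CPU LRU in the Linux buffer cache (fs/buffer.c) is shaped roughly as below; the sketch simply shows the size raised to the proposed default of 16, while the actual patch is expected to make the value configurable:

    /* One of these arrays exists per CPU (declared with DEFINE_PER_CPU
     * in fs/buffer.c); buffer lookups scan it before falling back to the
     * slower block device cache search. */
    #define BH_LRU_SIZE 16              /* raised from the default of 8 */

    struct buffer_head;                 /* kernel buffer descriptor (opaque) */

    struct bh_lru {
        struct buffer_head *bhs[BH_LRU_SIZE];
    };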
A tunable allowing the user to turn the htree-lock on or off at runtime will greatly simplify testing and benchmarking of this feature. This tunable is intended for testing purposes only; it is not expected to be useful to administrators and so will not be extensively documented. PDO is an enhancement that will be enabled by default, and no use cases have been anticipated that require it to be disabled, though the tunable may serve as a short-term workaround in the unlikely event that the feature causes instability.
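A sketch of what such a tunable could look like as an ldiskfs module parameter; the parameter name is an assumption for illustration, with the feature enabled by default:

    #include <linux/module.h>

    /* Illustrative runtime switch for the htree-lock; 1 = parallel
     * directory operations enabled (the default), 0 = fall back to the
     * single directory lock. */
    static int htree_lock_enable = 1;
    module_param(htree_lock_enable, int, 0644);
    MODULE_PARM_DESC(htree_lock_enable,
                     "Enable parallel directory operations (htree-lock)");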
Verify PDO performance improvement by mdt-survey.
The mdt-survey tool allows testing multi-threaded metadata operations directly on the MDS without the use of Lustre clients so that it is possible to measure performance of the PDO feature even with a single MDS server. Tests should be run that separately create, lookup+stat, and lookup+unlink a fixed total number of files under a single directory with M={1,2,4,...,1024} threads. Between each of the phases, the MDT filesystem should be unmounted to flush the cache so that the performance of the ldiskfs filesystem is being measured (where PDO is implemented) and not the higher-level cache.
Verify PDO performance improvement by mdtest or mdsrate.
The mdtest or mdsrate benchmarks should be able to verify performance improvement with the whole filesystem stack, including clients. Benchmark runs should create/lookup+stat/lookup+unlink a fixed total number of files under a single directory with M={256,512,1024} MDS service threads and N={1,1024} clients (M <= N). These tests can also be repeated with C={0,1,2,4} stripes per file to measure the impact the stripe count has on PDO.
Verify no performance regression for any other cases by mdtest or mdsrate.
We can use mdtest to verify that there is no performance regression for operations in small (non-indexed) directories. We can also verify whether the htree-lock introduces any extra overhead by running mdtest with a single thread in a single directory.
Verify no functionality regression by acc-small.
The PDO patch will be accepted as properly functioning if:

- mdt-survey and mdtest/mdsrate demonstrate the expected performance improvement for parallel create, lookup+stat, and lookup+unlink operations in a shared directory;
- mdtest/mdsrate show no performance regression for small (non-indexed) directories or for single-threaded operation in a single directory;
- the acc-small test suite completes with no functionality regression.
Ldiskfs: ldiskfs is an ext4 filesystem with additional patches applied specifically to enable its use as a backend filesystem for Lustre and to improve performance.