Why Use Lustre™
April 1, 2011
High Performance Computing (HPC) clusters are created to provide extreme computational power to large scale applications. This computational power often results in the creation of very large amounts of data as well as very large individual files. For quite some time now , the speed of processors and memory have risen sharply, but the performance of I/O systems has lagged behind.
While processors and memory have improved in cost/performance exponentially over the last 20 years, disk drives still essentially spin at the same speeds, and drive access times are still measured in numbers of milliseconds. As such, poor I/O performance can severely degrade the overall performance of even the fastest of clusters. This is especially true of today’s multi-petabyte clusters.
The Lustre file system is a parallel file system used in a wide range of HPC environments, small to large, such as AI/ML, oil and gas, seismic processing, the movie industry, and scientific research to address a common problem they all have and that is the ever increasing large amounts of data being created and needing to be processed in a timely manner. In fact it is the most widely used file system by the world’s Top 500 HPC sites.
With Lustre in use it’s common to see end-to-end data throughput over GigE 100GigE networks in excess of 110 MB10 GB/sec , InfiniBand double data rate (DDR) and InfiniBand EDR links reach bandwidths up to 1.5 10 GB/sec, InfiniBand Quad Data Rate (QDR) links reach over 2.5 GB/sec, and 10GigE interfaces provide end-to-end bandwidth of over 1 GB/sec. Lustre can scale to tens of thousands of clients. At Oak Ridge National Laboratory and their production Spider file system, Spider, runs Lustre runs with over 25,000 clients, over 10PB 20PB of storage, and achieves an a peak aggregate IO throughput of 240 GB2TB/sec.
This white paper page addresses the many advantages that the Lustre file system offers to High Performance Computing clusters and how Lustre, a parallel file system, improves the overall scalability and performance of HPC clusters.
When designing a High Performance Computing (HPC) cluster, the HPC architect has three common file system options for providing access to the storage hardware. Perhaps the most common is the Network File System (NFS). NFS is a standard component in cloud computing environments. NFS is commonly included in what is known as a NAS, or Network Attached Storage architectures. A second option available to choose is SAN file systems, or Storage Area Network file systems. And last, but not least, are parallel file systems. Lustre is the most widely used file system on the top 500 fastest computers in the world. Lustre is the file system of choice on 7 out of the top 10 fastest computers in the world today, over 60% 70% of the top 100, and also for over 50% 60% of the top 500. This paper describes why Lustre dominates the top 500 (www.top500.org) and why you would use it for your high performance IO requirements.
SAN file systems are capable of very high performance, but are extremely expensive to scale up since they are implemented using Fibre Channel and therefore, each node that connects to the SAN must have a Fibre Channel card to connect to the Fibre channel switch.
Distributed Parallel File System)
The main advantage of Lustre, a global parallel file system, over NFS and SAN file systems is that it provides; wide scalability, both in performance and storage capacity; a global name space, and the ability to distribute very large files across many nodes. Because large files are shared across many nodes in the typical cluster environment, a parallel file system, such as Lustre, is ideal for high end HPC cluster I/O systems. This is why it is in such widespread use today, and why at at Whamcloud we have over 30 senior we have an organization of Lustre engineers dedicated to its support, service, and continued feature enhancements.
Lustre is implemented using only a handful of nodes connected to the actual storage hardware. These are known as Lustre server nodes, or sometimes as Lustre I/O nodes, because they serve up all the data to the rest of the cluster, which are typically known as compute nodes and often referred to as Lustre clients.
Lustre is used primarily for Linux based HPC clusters. Lustre is an open source file system and is licensed under the GPL GPLv2. There are two main Lustre server components of a Lustre file system; Object Storage Servers (OSS) nodes and Meta Data Servers (MDS) nodes. File system Meta data is stored on the Lustre MDS nodes and file data is stored on the Object Storage Servers. The data for the MDS server is stored on a Meta Data Target (MDT), which essentially corresponds to any LUN being used to store the actual Meta data. The data for the OSS servers are stored on hardware LUNs called Object Storage Targets (OSTs). The OST ldiskfs targets can currently be a maximum size of 16TB1PB. Since we usually configure OSS/OST data LUNs in an 8+2 RAID6 RAID-6 configuration, the a common LUN consists of ten 2TB SATA drives, or in the case of 1TB drives ten 1TB SATA drivesconfiguration is ten 16 TB SATA drives. These are the most common configurations, but some large sites do use SAS or Fibre Channel NVMe drives for their OSTs.
For the most part, most all Meta data information and transactions are maintained by and on the Meta Data Servers. So, all common file system name space changes and operations are handled by the MDS node. This would include common things such as looking up a file, creating a file, most file attribute lookups and changes. When a file is opened, the client contacts the MDS server, which in turn looks up the file and returns to the client the location of that file on any and all OSS servers and their corresponding OSTs. Once the client has the location of the file data the client can do any and all I/O directly between the client and any OSS nodes without having to interact with the MDS node again. This is the primary advantage that Lustre has over a file system such as NFS where all I/O has to go through a single NFS server or head node. Lustre allows multiple clients to access multiple OSS nodes at the same time independent of one another, thereby allowing the aggregate throughput of the file system to scale with simply the addition of more hardware. For example, Lustre holds the world record for file system throughput speed at approximately 240 GBthroughput exceeds 2 TB/sec set at Oakridge Oak Ridge National Labs (ORNL). This performance number is essentially limited only by the amount and characteristics of the storage hardware available. In this case, the file system size was 10.7 Petabytes.
The Lustre architecture is used for many different kinds of clusters, but it is best known for powering seven of the ten largest high-performance computing (HPC) clusters in the world, with some systems supporting over ten thousands clients, many tens or hundreds of petabytes (PB) of storage and many of these systems nearing or over with hundreds of gigabytes per second (GB/sec) or even terabytes per second (TB/sec) of I/O throughput.
A great deal of HPC sites use Lustre as a site-wide global file system, servicing dozens of clusters on an unprecedented scale. IDC shows Lustre as being the file system with the largest market share in HPC. (Source: IDC’s HPC User Forum Survey, 2007 HPC Storage and Data Management: User/Vendor Perspectives and Survey Results)
The scalability offered by the Lustre file system has made it a popular choice in the oil and gas, manufacturing, and finance sectors.
The Lustre file system software is open source software, and because of this it keeps the pricing competitive because no one has a monopoly and its continued existence is not dependent upon the economic success of any single company. Since the file system itself is free and one pays just for support, this keeps the pricing competitive as you would pay only for the support of your Lustre file system. Over 10,000 downloads of Lustre software occurs every month and thousands of Linux clusters are supported.
Lustre Networking (
In a cluster using a Lustre file system, the Lustre network is the network connecting the OSS and MDS servers and the clients. The disk storage backing the MDS and OSS server nodes in a Lustre file system is connected to these Lustre I/O server nodes usually using traditional SAN technologies, however the breakthrough architecture with Lustre is that it does not need a SAN to extend or connect to Lustre clients. LNET LNet is only used over the Lustre system network, wherein it provides the entire communication infrastructure needed by the Lustre file system.
Though TCP/IP over Ethernet is certainly supported, one of the key features of the LNET LNet driver is its support of RDMA when such capability is supported by the underlying networks such as InfiniBand, ElanRDMA over Converged Ethernet (RoCE), and MyrinetOmniPath Architecture (OPA). This provides near full bandwidth utilization of the underlying link versus other architectures. LNET LNet also supports many common network types such as IP and InfiniBand OFED along with simultaneous support of multiple network types connected together for which LNET LNet will route packets between the different types of networks. LNET LNet also provides Multi-Rail for performance and high availability and recovery components that enable transparent recovery when used with Lustre failover servers.
Lustre Networking (LNETLNet) performance is extremely high. As pointed out earlier, it is very common to see end-to-end throughput over 100 GigE networks in excess of 110 MB10 GB/sec, InfiniBand double data rate (DDR) EDR links reach bandwidths up to 1.5 GB/sec, InfiniBand Quad Data Rate (QDR) links reach over 2.5 GB/sec, and 10GigE interfaces provide end-to-end bandwidth of over 1 over 10 GB/sec. These are data rates of the bandwidth delivered to clients from just one HCA interface on a Lustre server, higher performance is available with multiple interfaces!
Servers in a Lustre cluster are often equipped with a very large number of storage devices which Lustre can serve and handle up to tens of thousands of clients. In fact, over 1429,000 clients have been handled in the largest production Lustre system so far. Any cluster file system should be able to handle server reboots or failures transparently through a high-availability mechanism such as failover to remove or reduce any single points of failure in the system. When a server fails for example, applications should merely see a delay in the response to their system calls executed in accessing the file system.
The Lustre parallel file system is well suited for large HPC cluster environments and has capabilities that fulfill important I/O subsystem requirements. The Lustre file system is designed to provide cluster client nodes with shared access to file system data in parallel. Lustre enables high performance by allowing system architects to use any common storage technologies along with high-speed interconnects. Lustre file systems also can scale well as an organization’s storage needs grow. And by providing multiple paths to the physical storage, the Lustre file system can provide high availability for HPC clusters. And maybe best of all, Lustre is now supported by Whamcloud and and its many partners.