Motivation
The current implementation of the Lustre POSIX copytool (lhsmtool_posix) is intended for demonstration purposes rather than production use. It includes several interesting features, but they are not fully realized for production:
Creating a shadow namespace. The copytool saves the name of each archived file in a “shadow tree,” but there is no mechanism to keep these names in sync with changes made in the filesystem. In the case of hard links, only one name is saved.
File striping data is copied and stored with the object in the archive, and is used when the file is restored. However, there is no mechanism to allow the user to change the striping data before the file is restored.
The copytool uses FIDs to identify objects in the archive. FIDs are not globally unique identifiers and, like inode numbers, are not intended to be used to identify files outside of the filesystem.
The Lustre HSM interface and data movement functionality are tightly coupled in the code, and there is no abstraction for different data movers.
The copytool produces overly verbose logging, and does not capture performance metrics.
New Design for HSM
This proposal aims to address all of the above issues of the current copytool with a new design and implementation.
This design divides a "copytool" into two separate components: a low-level Agent and backend-specific data movers. In this approach, the data movers do not interact directly with Lustre. Each mover registers with the locally running agent, and the agent delivers actions for the mover to perform. The agent manages incoming requests from the HSM coordinator and forwards state updates back to the coordinator. The agent also stores the key provided by the mover for each archived file, and provides that key in further actions on the file.
The movers are standalone processes and communicate with the agent using gRPC, an RPC framework built on Google's Protocol Buffers. When the agent starts, it starts the configured movers automatically.
Agent
The Agent interacts with the Lustre coordinator and forwards messages to the data movers. It also provides additional functionality that is needed for all movers:
- Stores the HSM key provided by the mover. Currently the agent stores the key as opaque data in the trusted.hsm_file_id extended attribute (a minimal sketch follows this list). In the future this key might be stored in the layout or elsewhere in the file’s metadata, though this is transparent to the data movers.
- Restores file striping metadata. When a restore operation is received, the agent ensures the striping information from the original file is used when the data is restored. Although Lustre does not support this currently, in the future users (or a policy engine) will be able to update the striping data before restoring the file.
- Captures and logs metric data on the operation of the agent and data movers.
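For the key-storage item above, the following is a minimal sketch in Go of how the agent might write and read the trusted.hsm_file_id attribute. The attribute name comes from this design; the helper names and buffer size are illustrative assumptions, and trusted.* attributes can only be written by a privileged agent.

import "golang.org/x/sys/unix"

// saveFileID stores the mover-supplied key as opaque data on the Lustre file.
func saveFileID(path string, fileID []byte) error {
    return unix.Setxattr(path, "trusted.hsm_file_id", fileID, 0)
}

// loadFileID reads the stored key back so it can be attached to later actions
// (e.g. Restore or Remove) on the same file.
func loadFileID(path string) ([]byte, error) {
    buf := make([]byte, 256) // assumed upper bound on key length
    n, err := unix.Getxattr(path, "trusted.hsm_file_id", buf)
    if err != nil {
        return nil, err
    }
    return buf[:n], nil
}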
RPC Interface
service DataMover {
    rpc Register(Endpoint) returns (Handle);
    rpc GetActions(Handle) returns (stream ActionItem);
    rpc StatusStream(stream ActionStatus) returns (Empty);
}
Register(Endpoint) returns (Handle)
message Endpoint {
    string fs_url = 2;
    uint32 archive = 1;
}
The mover registers a specific archive ID with the agent. The fs_url is the identifier of the filesystem (such as mntent.fsname). It is used to confirm that the agent and mover are referring to the same filesystem, because all paths passed between the agent and movers are relative to the root of the filesystem.
Register returns a handle for a virtual connection with the agent. This handle is used in all subsequent communication with the agent related to this archive ID.
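To make the flow concrete, below is a minimal Go sketch of the registration step from a mover's point of view. It assumes stubs generated by protoc from the service definition above; the generated package name (pb), its import path, and the agent address are illustrative assumptions rather than part of this design.

import (
    "context"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    pb "example.com/lhsm/pb" // assumed import path for the generated stubs
)

// register connects to the agent and registers this mover for one archive ID.
// The returned Handle must accompany all further calls for this archive.
func register(ctx context.Context, agentAddr, fsURL string, archive uint32) (pb.DataMoverClient, *pb.Handle, error) {
    // The connection stays open for the lifetime of the mover.
    conn, err := grpc.Dial(agentAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return nil, nil, err
    }
    client := pb.NewDataMoverClient(conn)

    handle, err := client.Register(ctx, &pb.Endpoint{FsUrl: fsURL, Archive: archive})
    if err != nil {
        conn.Close()
        return nil, nil, err
    }
    return client, handle, nil
}

The returned client and handle would then be used for the GetActions and StatusStream calls described next.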
GetActions(Handle) returns (stream ActionItem)
message ActionItem {
    uint64 id = 1;           // Unique identifier for this action, must be used in status messages
    Command op = 2;
    string primary_path = 3; // Path to primary file (for metadata or reading)
    string write_path = 4;   // Path for writing data (for restore)
    uint64 offset = 5;       // Start IO at offset
    uint64 length = 6;       // Number of bytes to copy
    bytes file_id = 7;       // Archive ID of file (provided with Restore command)
    bytes data = 8;          // Arbitrary data passed to action. Data Mover specific.
}
GetActions() returns a stream that will provide ActionItems. Each action includes a command and parameters to perform on the file referenced by primary_path. If the action requires writing to the Lustre filesystem (i.e. Restore), then the data must be written to write_path. If a file_id is associated with the file, it will be included in the ActionItem.
The op can be one of these Commands.
ARCHIVE
An Archive command indicates the mover must copy length bytes starting at offset from primary_path to the backend. An identifier that refers to the data stored in the backend can be returned as a file_id in the final ActionStatus message.
RESTORE
A Restore command indicates the mover must copy length bytes starting at offset from the backend to write_path. The file_id that the mover provided when the file was archived is included.
REMOVE
A Remove command indicates the file in primary_path is no longer managed by HSM and the copied data should be removed from the backend.
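To illustrate how the GetActions stream and the commands above fit together, here is a rough Go sketch of a mover's main loop against the same assumed generated stubs. The generated command constants and the archive, restore, and remove handlers are illustrative stand-ins for backend-specific transfer code.

import (
    "context"
    "io"

    pb "example.com/lhsm/pb" // assumed import path for the generated stubs
)

// runActions pulls ActionItems from the agent and dispatches them by command.
func runActions(ctx context.Context, client pb.DataMoverClient, handle *pb.Handle) error {
    stream, err := client.GetActions(ctx, handle)
    if err != nil {
        return err
    }
    for {
        action, err := stream.Recv()
        if err == io.EOF {
            return nil // the agent closed the stream
        }
        if err != nil {
            return err
        }
        switch action.Op {
        case pb.Command_ARCHIVE:
            // Copy action.Length bytes starting at action.Offset from
            // action.PrimaryPath into the backend.
            go archive(action)
        case pb.Command_RESTORE:
            // Copy the backend data identified by action.FileId into
            // action.WritePath.
            go restore(action)
        case pb.Command_REMOVE:
            // Delete the backend copy identified by action.FileId.
            go remove(action)
        }
    }
}

// Hypothetical backend-specific handlers; real movers would copy the data and
// report results via StatusStream (described below).
func archive(a *pb.ActionItem) {}
func restore(a *pb.ActionItem) {}
func remove(a *pb.ActionItem)  {}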
StatusStream(stream ActionStatus) returns (Empty)
message ActionStatus {
    uint64 id = 1;        // Unique identifier for action
    bool completed = 2;   // True if this is last update
    int32 error = 3;      // Non-zero indicates an error
    uint64 offset = 4;
    uint64 length = 5;
    Handle handle = 6;
    bytes file_id = 7;    // Included with completion of Archive
    int32 flags = 8;      // Additional flags (used for errors only?)
}
StatusStream() creates a stream that is used to send ActionStatus messages. These messages deliver completion notifications and (for long-running actions) progress updates.
When completed is false, the message is a progress update for the action identified by id, and the length value contains the amount of data copied since the previous update. This is useful for long-running transfers to provide updates to monitoring systems, and it also prevents the action from being timed out and restarted. It is important for movers to send these messages periodically even when no data is being copied (such as while waiting for a tape to be loaded) to ensure the action does not time out.
When completed is true, the message indicates that the action referenced by id has completed. If error is non-zero, the action has failed with the given error code. If error is 0, the action has succeeded, and the range of data specified by offset and length has been copied by the mover. If a file_id is included, the agent will store it and include it with future actions referring to this file.
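As a complement to the action loop sketched earlier, the following Go fragment shows the shape of the status traffic for one action: periodic progress messages while data is moving, then a single completion message. It uses the same assumed generated stubs; the helper names are illustrative.

import pb "example.com/lhsm/pb" // assumed import path for the generated stubs

// reportProgress sends a non-final update; n is the number of bytes copied
// since the previous update. Movers should send these periodically, even when
// idle, so the action is not timed out and restarted.
func reportProgress(status pb.DataMover_StatusStreamClient, handle *pb.Handle, actionID, n uint64) error {
    return status.Send(&pb.ActionStatus{Id: actionID, Handle: handle, Completed: false, Length: n})
}

// reportDone sends the final update. Error is left at zero for success, and
// file_id carries the backend key (if any) for the agent to store.
func reportDone(status pb.DataMover_StatusStreamClient, handle *pb.Handle, actionID, offset, length uint64, fileID []byte) error {
    return status.Send(&pb.ActionStatus{
        Id: actionID, Handle: handle, Completed: true,
        Offset: offset, Length: length, FileId: fileID,
    })
}

The stream itself would be opened once with client.StatusStream(ctx) and shared by all in-flight actions for the archive.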
POSIX Mover
The POSIX mover allows another filesystem to be used as a second storage tier for Lustre. The mover copies data between Lustre and another local filesystem. The POSIX mover configuration includes the archive ID and path for each archive backend managed by the mover.
Each time a file is submitted to the mover for archiving, the mover generates a new UUID, which is used to identify the data object in the archive directory. The path to the object is <archiveDir>/objects/xx/yy/UUID, where:

xx = UUID[0:2]
yy = UUID[2:4]

Once the archive is complete, the mover returns the UUID to the agent. The UUID is used during a restore operation to locate the data object in the archive directory.
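A small Go sketch of that layout follows; the helper name is illustrative, and the two-level xx/yy fan-out simply keeps any single directory from growing too large.

import "path/filepath"

// objectPath maps a UUID to its location in the archive directory, e.g.
// objectPath("/archive", "4f9d2c...") -> "/archive/objects/4f/9d/4f9d2c...".
func objectPath(archiveDir, uuid string) string {
    xx := uuid[0:2]
    yy := uuid[2:4]
    return filepath.Join(archiveDir, "objects", xx, yy, uuid)
}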
S3 Mover
The S3 mover allows an S3-style object storage system to be used as a second tier for Lustre. The mover copies data between Lustre and an S3 bucket. The configuration includes the archive ID and the bucket used to store data. A UUID is generated for each file, and the object is stored in the bucket with a path similar to the POSIX mover's. Once the archive is complete, however, a URL for the object is returned to the Agent:
s3://<bucketName>/<prefix>/o/UUID
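The key construction mirrors the POSIX layout; a hedged Go sketch is below (bucket and prefix come from the mover's configuration, and the helper name is illustrative).

import "fmt"

// objectURL builds the key returned to the agent once the archive completes.
func objectURL(bucket, prefix, uuid string) string {
    return fmt.Sprintf("s3://%s/%s/o/%s", bucket, prefix, uuid)
}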
The S3 mover can also be used to import an existing data set from S3 into Lustre. To do this, the “lhsm import” command is used to replicate the path names from S3 into the Lustre filesystem, and the import also adds the S3 URL to each file so the mover can retrieve the data.
Storing Large Files in S3 (Proposed)
There is a 5TB object size limit in AWS S3 (TODO: check limits for other S3 implementations), so larger files will need to be sharded over multiple S3 objects. The shard UUIDs and their respective ranges would be stored in a metadata object, and the name of this object is returned to the agent as the key. To distinguish "shard metadata" objects from normal objects, the name includes a ".shm" extension after the UUID:
s3://<bucketName>/<prefix>/o/UUID.shm
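One possible shape for the proposed shard-metadata object is sketched below in Go; the field names and the use of JSON encoding are assumptions for illustration, not part of this design.

// shard describes one S3 object holding part of a large file.
type shard struct {
    UUID   string `json:"uuid"`   // object stored under <prefix>/o/
    Offset uint64 `json:"offset"` // starting offset of this shard within the file
    Length uint64 `json:"length"` // number of bytes held by this shard
}

// shardMetadata is the content of the UUID.shm object; that object's name is
// the key returned to the agent.
type shardMetadata struct {
    FileSize uint64  `json:"file_size"` // total size of the archived file
    Shards   []shard `json:"shards"`    // shards in offset order
}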