Introduction

If files are accidentally or maliciously deleted from a file system, the user data may be permanently lost. The Trash Can is a useful feature in file systems that acts as a temporary holding area, allowing users to store deleted files for a short time before they are permanently deleted. It provides a mechanism to restore or retrieve deleted files if needed, and automatically deletes the files once they become too old or the filesystem is too full.

When the Trash Can feature is enabled and a user deletes a file, the file is not removed immediately but is moved to the Trash Can, where deleted files and directories are stored temporarily. Files and directories in the Trash Can may be restored or retrieved individually or in bulk for as long as they remain available. The Trash Can may be emptied manually; in addition, once the filesystem is nearly full, the system automatically purges files from the Trash Can, taking into account which users and projects are consuming the most space.

The Trash Can should include the following functionalities:

Deleted files can no longer be restored from the Trash Can when:

Design and Implementation

The design for the Trash Can feature in Lustre is relatively straightforward.

On the server side, the MDS implements the basic functionality, such as moving "deleted" files into the Trash Can, and provides the interface for traversing them. The client side implements the basic utilities for interacting with the Trash Can ("lfs trash set|clear|list|delete|restore FILE|DIR"), including:

Our mechanism moves only regular files into the Trash Can, and only upon their last unlink.

It borrows many ideas from orphan and volatile files in Lustre (which are stored in the "ROOT/PENDING" directory on each MDT). During format and setup, each MDT creates a "ROOT/TRASH" directory as a Trash Can to store "undeleted" files.

The POSIX API is used to traverse the files in the Trash Can on a given MDT. First, a client gets the FID of the Trash Can directory ROOT/TRASH on the MDT. The client then opens the directory by FID: dir_fd=llapi_open_by_fid(). After that, the "undeleted" files within the Trash Can can be traversed via readdir(dir_fd). Each entry can be opened with openat(dir_fd, ent->d_name), and the "trusted.unrm" XATTR, which contains the information necessary to restore the file, can be read via fgetxattr(fd, "trusted.unrm"). The client can even read the data or swap layouts of an "undeleted" file in the Trash Can in order to restore it. The full call sequence is opendir()/readdir()/openat()/fgetxattr("trusted.unrm")/close()/closedir().

The workflow for the Trash Can is as follows:

     # lfs trash set $file|$dir
     # lfs trash clear $file|$dir
     struct ll_trash_xattr {
             __u32 ltx_flags;
             __u32 ltx_uid;       // UID of the deleted file, used for quota accounting
             __u32 ltx_gid;       // GID of the deleted file, used for quota accounting
             __u32 ltx_projid;    // PROJID of the deleted file, used for quota accounting
             __u64 ltx_timestamp; // Time the file was moved into the Trash Can, maybe we could use ctime here
     };

Here ltx_uid/ltx_gid/ltx_projid are the original UID/GID/PROJID of the deleted file, used mainly for quota accounting during the restore operation, and ltx_timestamp is the time at which the file was moved into the Trash Can. The timestamp is used to determine whether the file has exceeded the specified retention period and should therefore be removed from the Trash Can permanently (the inode ctime could perhaps serve this purpose instead of a separately stored timestamp). When the file is deleted, its full path can be obtained in a way similar to fid2path().
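The retention check on ltx_timestamp can be illustrated with a short C sketch. The struct below is a userspace mirror of the layout above using fixed-width types; the struct name trash_xattr, the function name trash_entry_expired(), and the choice of seconds as the time unit are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace mirror of the xattr layout, using fixed-width types. */
struct trash_xattr {
	uint32_t ltx_flags;
	uint32_t ltx_uid;
	uint32_t ltx_gid;
	uint32_t ltx_projid;
	uint64_t ltx_timestamp; /* seconds since the epoch (assumed unit) */
};

/*
 * Return true once the entry has been in the Trash Can for at least
 * retention_secs. A timestamp in the future (clock skew) never expires.
 */
bool trash_entry_expired(const struct trash_xattr *ltx, uint64_t now,
			 uint64_t retention_secs)
{
	if (now < ltx->ltx_timestamp)
		return false;
	return now - ltx->ltx_timestamp >= retention_secs;
}
```

Comparing the elapsed time rather than computing ltx_timestamp + retention_secs avoids an unsigned overflow for very large retention values.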

     # lfs trash {list|ls} [DIR|FILE]
     MDT index: 1
     uid gid size delete time FID Fullpath
     0 0 4096 Nov 14 08:11 [0x200034021:0x1:0x0]->/mnt/lustre/f1
     0 0 32104 Nov 14 08:07 [0x200034021:0x2:0x0]->/mnt/lustre/dir/f2
     ...

Internally, the lfs trash list command looks up the FID and MDT of the current directory, or of the directory specified by DIR, and then lists the corresponding directory under $MOUNT/.lustre/trash/MDTxxxx/DIRFID/, where any files deleted from this directory would be moved. If the .lustre/trash directory is not available, it uses the directory file descriptor returned via llapi_recycle_fid_get(MNTPT, mdt) instead.


     # lfs trash {delete|rm} [DIR/]FILE ...
     # lfs trash clear DIR ...
     # lfs trash {restore|unrm} [DIR/]FILE ...

     # lfs trash find -ctime +time [DIR]
     # ls /mnt/lustre/.lustre/trash/MDT0002
     0x200034021:0x1:0x0
     0x200034021:0x2:0x0
     ...
     # lfs trash ls /mnt/lustre/.lustre/trash/MDT0002/0x200034021:0x1:0x0
     0 0 4096 Nov 14 08:11 [0x200034021:0x1:0x0]->/mnt/lustre/f1
     # lfs trash list /mnt/lustre/.lustre/trash/MDT0002
     MDT index: 1
     uid gid size delete time FID Fullpath
     0 0 4096 Nov 14 08:11 [0x200034021:0x1:0x0]->/mnt/lustre/f1
     0 0 32104 Nov 14 08:07 [0x200034021:0x2:0x0]->/mnt/lustre/dir/f2
     ...

Clean up files from the trash

The system must automatically clean up files from the Trash Can when the filesystem becomes full. Users must not have to delete every file twice, and the filesystem must not be allowed to become 100% full (or even 90% full) because of files in the trash. An automatic cleanup mechanism is needed to ensure that filesystem performance does not degrade when users believe they have already deleted files.

Files in the trash can be assigned the UID/GID/PROJID of a dedicated trash user so that their space is not accounted against the end user's quota, while the original UID/GID/PROJID is kept in the "trusted.unrm" XATTR.

Our design does not depend on a userspace utility for a function as critical as cleaning up the trash when the filesystem is nearly full, since that utility may never be started, the client may be evicted, or something similar may happen. In that case the filesystem would become full and unusable even though the user had already deleted files from it. The cleanup needs to be bulletproof and run automatically when the OSTs (or MDTs) are getting full.

The MDS already monitors OST fullness every 5s to make object allocation decisions, so it can also decide which files to delete. The MDT can therefore periodically monitor the space usage of the trash user (quota) and of the entire file system and, taking into account the retention period and the deletion timestamp of each file, choose candidates to delete permanently in order to free up space.
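One possible selection policy is sketched below in C. It is a deliberately simplified illustration, not the MDT implementation: entries past the retention period are always selected, and unexpired entries are selected oldest-first only until a space target is met. The struct and function names are hypothetical, and the sketch ignores per-user and per-project weighting.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct trash_candidate {
	uint64_t tc_timestamp; /* time the file entered the Trash Can */
	uint64_t tc_bytes;     /* space released if the file is purged */
	int      tc_purge;     /* set to 1 by the selection pass */
};

/* qsort() comparator: oldest deletion time first. */
static int trash_cmp_age(const void *a, const void *b)
{
	const struct trash_candidate *x = a, *y = b;

	if (x->tc_timestamp < y->tc_timestamp)
		return -1;
	return x->tc_timestamp > y->tc_timestamp;
}

/*
 * Mark entries for permanent deletion. Expired entries are always marked;
 * unexpired ones are marked oldest-first only until bytes_needed is met.
 * Returns the total number of bytes that the marked entries would free.
 */
uint64_t trash_select(struct trash_candidate *tc, size_t n, uint64_t now,
		      uint64_t retention_secs, uint64_t bytes_needed)
{
	uint64_t freed = 0;

	qsort(tc, n, sizeof(*tc), trash_cmp_age);
	for (size_t i = 0; i < n; i++) {
		int expired = now >= tc[i].tc_timestamp &&
			      now - tc[i].tc_timestamp >= retention_secs;

		tc[i].tc_purge = expired || freed < bytes_needed;
		if (tc[i].tc_purge)
			freed += tc[i].tc_bytes;
	}
	return freed;
}
```

Passing bytes_needed = 0 reduces the pass to pure retention-based expiry, which is the behaviour wanted when the filesystem is not under space pressure.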

There also needs to be some accounting of files in the trash, so that "df" does not show the filesystem as 100% or 90% full all the time, but rather shows only the non-trash space usage (= real usage - trash usage).
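The adjustment amounts to a single subtraction. The C sketch below uses a hypothetical helper name and assumes that both counters are in the same unit; it clamps the result at zero in case the two values are sampled at slightly different times.

```c
#include <stdint.h>

/*
 * Space reported as "used" to df: real usage minus the space held by the
 * Trash Can, clamped so the unsigned subtraction can never wrap.
 */
uint64_t trash_reported_used(uint64_t real_used, uint64_t trash_used)
{
	return trash_used > real_used ? 0 : real_used - trash_used;
}
```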

Per-User Trash Can

A per-user Trash/MDTxxxx/UID/ directory that is owned by that UID and has mode 0600 should always be created in the top-level directory to avoid world-readable access to deleted files, and to de-conflict files/directories of the same name created by different users (e.g. tmp/ or data/ or Documents/ or similar). This avoids exposing potentially private files to other users, and also allows space usage to be tracked more clearly for each UID, so that a user's data can be found and purged more quickly if they are exceeding their quotas.

Per-Tenant Trash Can

Files and directories deleted from within a subdirectory mount of a Nodemap should be stored in a Trash/MDTxxxx/NODEMAP/UID/ directory to isolate the files/directories of different tenants. The NODEMAP/ directory name is the configured name of the nodemap for that tenant, and can be found from the client export used to perform the final unlink operation. The UID/ directory name should be the unmapped ID of the user, so that the visible directory name matches the user's expectation. The UID directory ownership should be the mapped ID of the user, so that proper file access controls can be maintained. The multi-level NODEMAP/UID/ naming isolates the UID directory names from those of other tenants that may have the same mapped UID directory name.
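The directory naming can be illustrated with a small C helper. This is a sketch under stated assumptions: the function name trash_user_dir() is hypothetical, and the four-hex-digit MDT index format is inferred from names such as MDT0002 in the examples in this document.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "Trash/MDTxxxx/NODEMAP/UID/" into buf. A NULL or empty nodemap
 * name selects the plain per-user layout "Trash/MDTxxxx/UID/".
 * Returns 0 on success, -1 on an invalid name or if buf is too small.
 */
int trash_user_dir(char *buf, size_t len, unsigned int mdt_idx,
		   const char *nodemap, uint32_t unmapped_uid)
{
	int rc;

	if (nodemap != NULL && nodemap[0] != '\0') {
		/* The nodemap name must stay a single path component. */
		if (strchr(nodemap, '/') != NULL)
			return -1;
		rc = snprintf(buf, len, "Trash/MDT%04x/%s/%u/", mdt_idx,
			      nodemap, (unsigned int)unmapped_uid);
	} else {
		rc = snprintf(buf, len, "Trash/MDT%04x/%u/", mdt_idx,
			      (unsigned int)unmapped_uid);
	}
	return (rc < 0 || (size_t)rc >= len) ? -1 : 0;
}
```

Only the directory name uses the unmapped UID here; ownership of the directory would be set separately to the mapped ID, as described above.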

The Nodemap for a tenant should allow configuring the UID/GID/PROJID to which files in the Trash are assigned.  These IDs should be within the ID offset range of the Tenant (e.g. 99999) so that they can be accessed and mapped correctly, but are unlikely to cause conflicts with other IDs used by the Tenant.  This will also allow the Tenant project quota group to account for all space used by the tenancy, while still separating Trash Can usage for the regular UID/GID/PROJID of the Tenant users.

In a multi-tenant environment, it would be desirable to have a more sophisticated policy engine to manage Garbage Collection of files within the trash, in order to provide maximum utility to each Tenant.  For example, if Tenant 1 has deleted files an hour ago, but Tenant 2 has written and deleted TB of data since that time, the Tenant 1 files may have expired out of the Trash Can.  Developing a complex policy engine to manage GC in an MTFL environment is out of scope for the initial TCU implementation.  We likely want to leverage and enhance the lpurge utility from Hot Pools to actively monitor the space usage of tenants on different OSTs to decide which objects (files) should be removed.

Repeated deletion of same filename

If the same filename is repeatedly created and deleted within the same parent directory, the deleted files will conflict when moved into the pFID directory in the Trash Can. Conflicting filenames should be disambiguated by appending a timestamp to the filename, like filename.2025-04-03-00:11:24, possibly adding .microseconds if there is still a conflict. It is not yet clear whether it would be better to use the time at which the file was deleted or the time at which it was created. Both have some value in helping users distinguish between the different versions.
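A minimal C sketch of the naming scheme follows. The helper name trash_unique_name() is hypothetical, and formatting the timestamp in UTC is an assumption, since the text does not specify a time zone. The caller decides whether to pass the deletion time or the creation time.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Build "name.YYYY-MM-DD-HH:MM:SS" (UTC) in buf, appending ".uuuuuu"
 * only when usec is non-negative. Returns 0 on success, -1 on error.
 */
int trash_unique_name(char *buf, size_t len, const char *name,
		      time_t when, long usec)
{
	struct tm tm;
	char stamp[32];
	int rc;

	/* A filename is a single path component. */
	if (name[0] == '\0' || strchr(name, '/') != NULL)
		return -1;
	if (gmtime_r(&when, &tm) == NULL)
		return -1;
	if (strftime(stamp, sizeof(stamp), "%Y-%m-%d-%H:%M:%S", &tm) == 0)
		return -1;

	if (usec >= 0)
		rc = snprintf(buf, len, "%s.%s.%06ld", name, stamp, usec);
	else
		rc = snprintf(buf, len, "%s.%s", name, stamp);
	return (rc < 0 || (size_t)rc >= len) ? -1 : 0;
}
```

A caller would first try the plain name, then the timestamped name with usec = -1, and only fall back to the microsecond suffix if that name is also taken.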

JobID of process deleting a file

In LU-13031 the JobID of the process that first creates a file is stored in the user.job xattr on the MDT inode, for diagnostic purposes and to allow determining provenance of each file later on.  For the Trash Can, it would be useful to also store the JobID of the process that is deleting the file, for diagnostics such as determining rogue processes that are deleting files in the filesystem.  Something like user.del would be a reasonable default xattr name.  The actual xattr name can be configured with the mdt.*.job_xattr_del parameter.

Trash support for a striped directory

It would be useful to implement a virtual ".Trash" subdirectory accessible in each directory in the filesystem that can be used to browse files/directories in the Trash Can and access them for recovery.

The FID of the ".Trash" directory is derived from the FID of the parent directory (pFID), by looking up the corresponding "stub" directory with the FID-named directory: ".lustre/trash/MDTXXXX/UID/pFID". Essentially, ".Trash" under each normal directory is just a virtual shortcut to the stub directory (if the parent is not a striped directory) that is accessible in each directory when specified by the name ".Trash". The files/directories under the ".Trash" directory can be accessed via the normal POSIX file system API, such as readdir()/stat()/getxattr(), so that normal tools such as "ls -l .Trash/" or "find .Trash" can be used to locate files for restoration or permanent removal. If there are no deleted files under a specific directory, then the virtual .Trash directory will not be accessible, and any lookup will return -ENOENT.

For a striped directory, the ".Trash" directory is also a virtual striped directory, with each stripe in the same location (MDT), where the shard FID is the FID of the corresponding stub directory on that MDT. If the stub directory on a given MDT does not exist (or has not been created yet), a client lookup() or readdir() under the ".Trash" directory should skip that stripe. The master FID of the virtual ".Trash" directory could be the same as the FID of the parent directory, but with f_ver set to 1 (FID_VERSION_TRASH = 1) to distinguish the two.
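The proposed FID derivation can be sketched in C as follows. The struct redeclares the three fields of Lustre's struct lu_fid so that the example is self-contained; the helper names are hypothetical, and FID_VERSION_TRASH takes the value proposed in the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same three fields as Lustre's struct lu_fid, redeclared for the sketch. */
struct fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

#define FID_VERSION_TRASH 1 /* value proposed in the text */

/* Master FID of the virtual ".Trash" directory for a given parent FID. */
struct fid trash_fid_from_parent(struct fid parent)
{
	parent.f_ver = FID_VERSION_TRASH;
	return parent;
}

/* True if the FID names a virtual ".Trash" directory. */
bool fid_is_trash(const struct fid *fid)
{
	return fid->f_ver == FID_VERSION_TRASH;
}
```

Because only f_ver changes, the parent FID can be recovered from a ".Trash" FID by resetting f_ver, assuming regular directory FIDs always carry f_ver = 0.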

To avoid inconsistency, each access to the virtual striped ".Trash" directory should check and revalidate the virtual stripe LMV EA. For example, new shards should be added to the stripe EA after a new stub directory is created on an MDT.

The design must also handle the case where a directory is restriped and its LMV layout changes. In this case, the files under the directory are migrated to another MDT. To simplify the implementation, files in the Trash Can are not migrated according to the new LMV layout. As a result, after files in the trash can are restored, a lookup() operation may be issued to the wrong MDT and incorrectly return -ENOENT. However, readdir() operations will still return all of the directory entries in the striped trash even if the parent LMV layout has been restriped, since the parent directory FID (pFID) remains the same as before restriping. Files restored from the trash can may need to be migrated to the appropriate shard according to their name hash once the LMV layout has changed.

Orphans in Trash Can

An orphan file is a file that is still held open (not yet closed) by some user. Upon its last unlink, it can be moved directly into the trash can and marked with LUSTRE_ORPHAN_FL | LUSTRE_TRASH_FL. An orphan file cannot be permanently deleted from the trash can until its last close().
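The flag handling can be sketched as a small state model in C. The bit values below are placeholders chosen for this example and are not the on-disk values of LUSTRE_ORPHAN_FL or LUSTRE_TRASH_FL. Clearing the orphan bit on the last close() is an assumption about how the rule above might be implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for this sketch only, not the on-disk flags. */
#define EX_ORPHAN_FL 0x1u
#define EX_TRASH_FL  0x2u

/* Flags for a file moved into the trash at its last unlink. */
uint32_t trash_flags_on_unlink(bool still_open)
{
	return EX_TRASH_FL | (still_open ? EX_ORPHAN_FL : 0);
}

/* Assumed behaviour: the last close() clears only the orphan bit. */
uint32_t trash_flags_on_last_close(uint32_t flags)
{
	return flags & ~EX_ORPHAN_FL;
}

/* Only trash entries with no remaining opener may be purged. */
bool trash_can_purge(uint32_t flags)
{
	return (flags & EX_TRASH_FL) && !(flags & EX_ORPHAN_FL);
}
```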