...

On the server side, the MDS implements the basic functionality, such as moving "deleted" files into the Trash Can and providing an interface to traverse them. On the client side, basic utility tools to interact with the Trash Can ("lfs trash set|clear|list|delete|restore FILE|DIR") are available for debugging and low-level operations on the Trash Can, including:

  • Set or clear the TRASH flag on a given file or directory (this is implemented with the FS_UNRM_FL inode attribute so that deleted files can be identified with lsattr in the filesystem).
  • List files in the Trash Can on a given MDT
  • Permanently delete a file or directory within the Trash Can on a given MDT
  • Empty the Trash Can on a given MDT
  • Restore a file within the Trash Can on a given MDT

The POSIX API is used to traverse the files under the Trash Can on a given MDT. First, a client gets the FID of the Trash Can directory ROOT/.lustre/Trash on the MDT. The client then obtains a file handle by opening that FID: dir_fd=llapi_open_by_fid(). After that, the "undeleted" files within the Trash Can can be traversed via readdir(dir_fd); each one can be opened with openat(dir_fd, ent->d_name), and its "trusted.unrm" XATTR, which contains the information necessary to restore the file, can be read via fgetxattr(fd, "trusted.unrm"). The client can even read the data or swap layouts of the "undeleted" file in the Trash Can for restore. The overall sequence is: opendir()/readdir()/openat()/fgetxattr("trusted.unrm")/close()/closedir().

The workflow for the Trash Can is as follows:

  • An administrator can enable/disable Trash Can feature on a specified MDT via: lctl set_param mdd.*.trash_can_enable

  • An administrator can enable/disable the Trash Can feature on a specified directory or file via the file flag FS_UNRM_FL. All files under a directory flagged with FS_UNRM_FL can inherit this flag:

     # lfs trash set $file|$dir
     # lfs trash clear $file|$dir

These low-level commands will generally not be used by the end user, since the virtual .Trash directory will be more readily accessible to users, including those in restricted subdirectory mounts in a tenant namespace.

This implementation only moves regular files or subdirectories into the Trash Can upon their last unlink. This means that deleted hard links will not be preserved by default, in order to minimize the overhead of managing the Trash Can.

Deleted subdirectories are only preserved if they have deleted files currently in the Trash Can. Empty subdirectories are not currently preserved since they can be trivially recreated, though this could potentially be implemented with a tunable parameter if there was a need for it.

The implementation borrows ideas from orphan and volatile files in Lustre, which normally stores deleted files in the "ROOT/PENDING" directory on each MDT. After the initial setup and mount, each MDT creates a "ROOT/.lustre/Trash/MDTxxxx" directory as a Trash Can to store deleted files, if it does not already exist.

Configuration for the Trash Can

  • An administrator can enable/disable Trash Can feature globally on a specified MDT via: lctl set_param mdd.*.trash_can_enable

  • The UID/GID/PROJID of files in the Trash Can are configured globally via mdd.*.trash_can_uid, mdd.*.trash_can_gid, and mdd.*.trash_can_projid; see Space and Quota Accounting below for details.

Delete a file into the Trash Can

When a file or empty subdirectory is deleted (the last link in the namespace is removed), a number of steps are performed for the file. Some of them are "one time only" for the user or directory, while others are done for each file:

  • Create a subdirectory in the .lustre/Trash/MDTxxxx directory for the user to hold all of their deleted files.  This is done only once per UID in the filesystem.  See Per-User Trash Can below for details.
  • Create a stub subdirectory for the parent directory where the file is being deleted.
    • The stub directory is named with the FID of the parent directory (pFID) where the file is being deleted from.  The stub directory will have its own FID (stFID) that is unrelated to pFID, but needs to be unique to ensure this directory can be accessed by the client independently of the parent directory.
    • To reduce repeated lookups of the ROOT/.lustre/Trash/MDTxxxx/UID/pFID directory when deleting many files from a parent directory, the original parent object may cache the stFID object (by FID or pointer?) since the mapping to the stub directory will remain unchanged for the lifetime of the parent.
    • It would be desirable to copy the parent directory name from its trusted.link xattr to the stub, so that it is possible to generate a user-friendly name for the stub directory when it is shown in the .Trash/ subdirectory.  It is likely best if the trusted.link xattr contains the FID of the UID/ directory as the parent FID rather than being an exact copy of trusted.link, so that LFSCK does not try to re-link the stub into the parent directory.
  • The deleted file will be moved into the stub directory.  As a part of this transaction, several changes will be applied to the deleted inode:
    • the inode will be marked with the FS_UNRM_FL attribute to indicate that it was deleted.  This allows it to be identified as "in the Trash" by scanning utilities, regardless of the UID/GID/PROJID assigned.
    • the UID/GID/PROJID will be changed to the trash_can_uid, trash_can_gid, and trash_can_projid to remove it from the quota accounting of the original user.  See Space and Quota Accounting below for details.
    • a "trusted.unrm" XATTR is added on the deleted file. The XATTR contains the following information:
     struct ll_trash_xattr {
             __u32 ltx_flags;     /* for future usage */
             __u32 ltx_uid;       /* original UID of the file, used to restore on unrm */
             __u32 ltx_gid;       /* original GID of the file, used to restore on unrm */
             __u32 ltx_projid;    /* original PROJID of the file, used to restore on unrm */
             __u64 ltx_timestamp; /* time the file moved into the Trash Can, or we could use ctime here? */
     };

Where ltx_uid/ltx_gid/ltx_projid are the original UID/GID/PROJID of the deleted file, mainly used for the restore operation. ltx_timestamp is the time that the file was moved into the Trash Can. It is used to determine whether the file has exceeded the specified retention period and should therefore be purged from the Trash Can.  It may be possible to use the inode ctime for this purpose instead of storing a separate timestamp, which would reduce the size of the xattr.
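As a rough illustration of how ltx_timestamp could drive the retention check, the sketch below mirrors the xattr layout in userspace and compares the time spent in the Trash Can against a retention period. The fixed-width stdint types stand in for the kernel's __u32/__u64, and the function name trash_entry_expired() and the retention_secs parameter are assumptions made for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace mirror of the proposed trusted.unrm layout (illustrative). */
struct ll_trash_xattr {
    uint32_t ltx_flags;     /* for future usage */
    uint32_t ltx_uid;       /* original UID, restored on unrm */
    uint32_t ltx_gid;       /* original GID, restored on unrm */
    uint32_t ltx_projid;    /* original PROJID, restored on unrm */
    uint64_t ltx_timestamp; /* time the file moved into the Trash Can */
};

/*
 * Return true once the entry has been in the Trash Can for at least
 * retention_secs (a hypothetical tunable). A timestamp in the future,
 * e.g. due to clock skew, is treated as not yet expired.
 */
bool trash_entry_expired(const struct ll_trash_xattr *ltx, uint64_t now,
                         uint64_t retention_secs)
{
    if (now < ltx->ltx_timestamp)
        return false;

    return now - ltx->ltx_timestamp >= retention_secs;
}
```

If the inode ctime were used instead of a stored timestamp, only the source of the first operand would change; the comparison itself would stay the same.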

Delete a directory into the Trash Can

When a directory is deleted into the Trash Can, it is desirable to preserve the directory hierarchy of the original directory tree, so that accidental "rm -rf" (or equivalent) does not result in millions of files or directories in the top level of the .lustre/Trash/MDTxxxx/UID directory.  The directory deletion will perform the following actions:

  • As with regular file deletion, a stub directory is created with the name of the FID of the directory's parent (ppFID). 
  • The directory's own FID is looked up in the .lustre/Trash/MDTxxxx/UID/ directory to determine if a stub directory with the pFID name already exists or not.
  • The xattrs on the directory being deleted (e.g. ACLs, SELinux, etc.) are copied over to the pFID stub.
  • If the deleting directory's pFID stub exists then it is renamed to match the name of the directory and moved into the new ppFID stub directory of its parent.

By renaming and moving each pFID stub directory into the corresponding ppFID stub directory, this preserves the hierarchy of a deleted directory tree.  The "rm -rf" process is processing files and subdirectories in a "bottom up" manner, so it needs to build up the deleted directory hierarchy incrementally as files are deleted. 
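The stub directories referred to in the steps above are named by FID under the per-user directory. A minimal sketch of how such a path could be formatted is shown below. The exact name format (four hexadecimal digits for the MDT index, decimal UID, and an unbracketed 0xSEQ:0xOID:0xVER FID string) is inferred from the examples in this document and should be treated as an assumption, as should the helper name trash_stub_path().

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format ".lustre/Trash/MDTxxxx/UID/pFID" into buf, where pFID is the FID
 * of the parent directory the file is being deleted from.
 * Returns the string length, or -1 if buf is too small.
 */
int trash_stub_path(char *buf, size_t size, unsigned int mdt_idx,
                    uint32_t uid, uint64_t seq, uint32_t oid, uint32_t ver)
{
    int len = snprintf(buf, size,
                       ".lustre/Trash/MDT%04x/%" PRIu32
                       "/0x%" PRIx64 ":0x%" PRIx32 ":0x%" PRIx32,
                       mdt_idx, uid, seq, oid, ver);

    if (len < 0 || (size_t)len >= size)
        return -1;

    return len;
}
```

When a deleted directory's own stub is later moved under its parent's ppFID stub, only the final path component changes from the FID string to the directory name, so the same prefix logic would apply.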

There is no single "delete directory tree" command in POSIX, since that may normally take a very long time to complete while processing billions of files.  With Trash Can it may be desirable to offer such an interface, since the whole directory tree could be moved into the Trash Can in a single operation.  It would necessitate background operations to annotate files with the FS_UNRM_FL attribute, store the original UID/GID/PROJID into the trusted.unrm xattr, and change the UID/GID/PROJID into the trash_can_* equivalents.  This would still be more efficient than deleting (renaming) thousands or millions of individual files and subdirectories.

List "undeleted" files within a Trash Can

  • The .lustre/trash/MDTxxxx (where xxxx is the hexadecimal MDT index) directory tree is local to each MDT. In this way, users can access the "undeleted" files in read-only mode under the Trash Can directory on a given MDT via the POSIX file system API. However, these files cannot be accessed from a fileset subdirectory mount. The following commands can be run from a Lustre namespace (mount point "/mnt/lustre") on a client:
     # ls /mnt/lustre/.lustre/Trash/MDT0002
     0x200034021:0x1:0x0
     0x200034021:0x2:0x0
     ...
     # lfs trash ls /mnt/lustre/.lustre/Trash/MDT0002/0x200034021:0x1:0x0
     0 0 4096 Nov 14 08:11 [0x200034021:0x1:0x0]->/mnt/lustre/f1
     # lfs trash list /mnt/lustre/.lustre/Trash/MDT0002
     MDT index: 1
     uid gid size delete time FID Fullpath
     0 0 4096 Nov 14 08:11 [0x200034021:0x1:0x0]->/mnt/lustre/f1
     0 0 32104 Nov 14 08:07 [0x200034021:0x2:0x0]->/mnt/lustre/dir/f2
     ...
  • By default, lfs trash list shows the files/directories deleted relative to the current working directory.  If DIR is provided, it lists the deleted files/directories relative to that directory, in the same format as ls:

     # lfs trash {list|ls} [DIR|FILE]
MDT index: 1
uid gid size delete time FID Fullpath
0 0 4096 Nov 14 08:11 [0x200034021:0x1:0x0]->/mnt/lustre/f1
0 0 32104 Nov 14 08:07 [0x200034021:0x2:0x0]->/mnt/lustre/dir/f2
...

Internally, the lfs trash list command looks up the FID and MDT of the current directory, or of the directory specified by DIR, and then lists the respective directory under $MOUNT/.lustre/trash/MDTxxxx/pFID/, or the directory file descriptor returned via llapi_open_by_fid() if the .lustre/trash directory is not available.  This is mainly for debugging, since users will generally use the virtual .Trash directory to interact with the Trash Can and restore files.



Deleting a file or directory in the Trash Can

  • Deleting a file or directory in the Trash Can removes the temporary file under "ROOT/.lustre/Trash" and permanently frees the data space on the Lustre OSTs:

     # lfs trash {delete|rm} [DIR/]FILE ...

Empty a Trash Can:

     # lfs trash clear DIR ...

Restore a file from the Trash Can

  • Restoring a file from the Trash Can on a given MDT restores the file and its contents according to the saved full path, and then deletes the stub in the Trash Can.

     # lfs trash {restore|unrm} [DIR/]FILE ...

  • A utility periodically scans the files under the Trash Can directory "ROOT/TRASH" and deletes files whose grace time has expired.

  • Provide the functionality to scan files in the trash on all MDTs that exceed the specified age manually:

     # lfs trash find -ctime +time [DIR]
  • Provide the functionality to restore/delete all files within a given directory. This can be achieved by using the command combination of "lfs trash list" and "lfs trash restore" or "lfs trash delete" to filter the files with the full path attribute under a given directory.


Space and Quota Accounting

In order to separate space and quota accounting for a user, group, or project's files, the original UID, GID, and PROJID of the file cannot be used for files in the Trash Can.  Otherwise, there would be confusion on the part of the user when they delete files and their quota usage does not decrease.  Similarly, the free and used space and inodes reported by df should not contain the space consumed by files in the Trash Can, since users would be confused by the fact that deleting files does not reduce the amount of space used in the filesystem.

Instead, when files are moved into the Trash Can, their UID/GID/PROJID should be changed to the defined trash_can_uid, trash_can_gid, and trash_can_projid values, and the original values preserved in the trusted.unrm xattr on the file.  This will "hide" the quota usage from the files in the Trash Can, while still allowing the actual usage of files in Trash to be easily determined via "lfs quota -u/-g/-p" commands.

The df output should add the space used by the trash_can_projid quota to the f_bfree and f_bavail fields, and add the inodes used by trash_can_projid to the f_ffree field, of the struct statfs returned by the statfs() syscall.  This will avoid the issue that "df" always shows the filesystem as 90% full (or whatever the space usage threshold is configured to be).
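The adjustment described above amounts to adding the Trash Can usage back onto the free counters before they are returned to the caller. The following sketch shows that arithmetic on a Linux struct statfs; the function name trash_adjust_statfs() and its parameters are hypothetical, and the clamping to the filesystem totals is an added safety assumption rather than something the design specifies.

```c
#include <stdint.h>
#include <sys/vfs.h>    /* struct statfs (Linux) */

/*
 * Hide Trash Can usage from df: report space and inodes held by
 * trash_can_projid as free. trash_blocks is expressed in f_bsize units.
 */
void trash_adjust_statfs(struct statfs *sfs, uint64_t trash_blocks,
                         uint64_t trash_inodes)
{
    sfs->f_bfree += trash_blocks;
    sfs->f_bavail += trash_blocks;
    sfs->f_ffree += trash_inodes;

    /* Never report more free space or inodes than the filesystem has. */
    if (sfs->f_bfree > sfs->f_blocks)
        sfs->f_bfree = sfs->f_blocks;
    if (sfs->f_bavail > sfs->f_bfree)
        sfs->f_bavail = sfs->f_bfree;
    if (sfs->f_ffree > sfs->f_files)
        sfs->f_ffree = sfs->f_files;
}
```

An "lfs df --trash" style report would simply skip this adjustment and show the raw counters.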

A new option "lfs df --trash" should show the actual space usage for the filesystem, so that it is possible for an administrator to diagnose issues with the Trash Can space usage.

Clean up files from the Trash Can

A mechanism is needed to automatically clean up files from the Trash Can when the filesystem becomes full. It cannot be that the user has to delete every file twice, and it cannot be that the filesystem is allowed to get 100% full (or even 90% full) due to files in the trash. There needs to be an automatic mechanism to clean up the trash to ensure that the filesystem performance does not degrade when users thought they had deleted files.

The design does not depend on a userspace utility for such a critical function as cleaning up files from the trash when the filesystem is nearly full, since that utility may never be started, or the client may be evicted, or similar. If that happens, the filesystem would become full and unusable, even though the user had already deleted files from the filesystem. This needs to be bulletproof and run automatically when the OSTs (or MDTs) are getting full.

The MDS already monitors OST fullness every 5s to make object allocation decisions, so it can also make decisions about which files to delete. The MDT can therefore periodically monitor the space usage of the trash user (quota) and of the entire file system and, taking into account the retention period and the deletion timestamp of each file, choose candidates to delete permanently in order to free up space.
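One possible selection policy consistent with the description above is sketched here: purge every entry whose retention period has expired and, if space is still needed, continue with the oldest remaining entries until the requested amount would be freed. This is only one of several reasonable policies; the struct trash_entry type, the function names, and the bytes_needed parameter are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative in-memory summary of one Trash Can entry. */
struct trash_entry {
    uint64_t te_timestamp; /* deletion time, as stored in ltx_timestamp */
    uint64_t te_bytes;     /* space that purging this entry would free */
    int      te_purge;     /* set to 1 when selected for purging */
};

static int trash_entry_cmp(const void *a, const void *b)
{
    const struct trash_entry *x = a;
    const struct trash_entry *y = b;

    return (x->te_timestamp > y->te_timestamp) -
           (x->te_timestamp < y->te_timestamp);
}

/*
 * Sort entries oldest first, then select all expired entries plus as many
 * of the oldest unexpired ones as needed to free bytes_needed.
 * Returns the number of entries selected.
 */
size_t trash_select_purge(struct trash_entry *ents, size_t n, uint64_t now,
                          uint64_t retention_secs, uint64_t bytes_needed)
{
    uint64_t freed = 0;
    size_t chosen = 0;
    size_t i;

    qsort(ents, n, sizeof(*ents), trash_entry_cmp);

    for (i = 0; i < n; i++) {
        int expired = now >= ents[i].te_timestamp &&
                      now - ents[i].te_timestamp >= retention_secs;

        ents[i].te_purge = expired || freed < bytes_needed;
        if (ents[i].te_purge) {
            freed += ents[i].te_bytes;
            chosen++;
        }
    }

    return chosen;
}
```

With bytes_needed set to zero the function degenerates to a plain retention sweep, which is the behavior expected when the filesystem is not under space pressure.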

Per-User Trash Can

A per-user Trash/MDTxxxx/UID/ directory that is owned by that UID and mode 0700 should always be created in the top-level directory to avoid world readable access to deleted files, and to de-conflict files/directories of the same name created by users (e.g. tmp/ or data/ or Documents/ or similar). That avoids exposing files to other users that may be private, and also allows tracking space usage more clearly for each UID, so that a user's data can be found and purged more quickly if they are exceeding their quotas.

In some uncommon cases, it may be that a parent directory has files owned by multiple different users (different UIDs).  This would likely only happen for top-level directories like scratch/ or home/ that contain directories from multiple users.  When those directories were deleted, they created separate .lustre/Trash/MDTxxxx/UID/pFID/ stub directories.  In this case, the deleted parent directory should only be created in the .../UID/ directory with the inode->i_uid of the parent directory.

Per-Tenant Trash Can

Files and directories deleted from within a subdirectory mount of a Nodemap should be stored in a Trash/MDTxxxx/NODEMAP/UID/ directory to isolate the files/directories from different tenants.  The NODEMAP/ directory name is the configured name of the nodemap for that tenant, and can be found from the client export used to perform the final unlink operation. The UID/ directory name should be the unmapped (client) UID of the user, so that the visible directory name matches the user expectation.  The UID directory ownership should be the mapped (server) UID of the user, so that proper file access controls can be maintained.  By having the multi-level NODEMAP/UID/ naming, it isolates the UID directory names from other tenants that may have the same mapped UID directory name.

The Nodemap for a tenant should allow configuring the UID/GID/PROJID to which files in the Trash are assigned, via per-nodemap trash_can_uid, trash_can_gid, and trash_can_projid parameters.  These IDs should be within the ID offset range of the Tenant (e.g. 99999) so that they can be accessed and mapped correctly, but are unlikely to cause conflicts with other IDs used by the Tenant.  This will also allow the Tenant project quota group to account for all space used by the tenancy, while still separating Trash Can usage for the regular UID/GID/PROJID of the Tenant users.

In a multi-tenant environment, it would be desirable to have a more sophisticated policy engine to manage Garbage Collection of files within the trash, in order to provide maximum utility to each Tenant.  For example, if Tenant 1 has deleted files an hour ago, but Tenant 2 has written and deleted TB of data since that time, the Tenant 1 files may have expired out of the Trash Can.  Developing a complex policy engine to manage GC in an MTFL environment is out of scope for the initial TCU implementation.  We likely want to leverage and enhance the lpurge utility from Hot Pools to actively monitor the space usage of tenants on different OSTs to decide which objects (files) should be removed.

When running the df command, the statfs() output should add in the space used by the trash_can_projid for the nodemap, so that the space and inode usage reported does not reflect the space used by the Trash Can.

Repeated deletion of same filename

If the same filename is repeatedly created and deleted within the same parent directory, then the deleted files will have conflicts when moved into the pFID directory in the Trash Can.  The conflicting filenames should be disambiguated by appending a timestamp to the filename, like filename.2025-04-03-00:11:24, possibly adding .microseconds if there is still a conflict.  It isn't totally clear whether it would be better to use the timestamp from when the file was deleted, or when the file was created.  Both have some value to help users distinguish between the different versions.
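The naming scheme above can be sketched with standard C time formatting. The example below appends a timestamp in the filename.YYYY-MM-DD-HH:MM:SS form shown in the text. It formats the time in UTC, which is an assumption since the design does not state a timezone, and the helper name trash_versioned_name() is hypothetical. The optional .microseconds suffix is left out for brevity.

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/*
 * Build "<name>.<YYYY-MM-DD-HH:MM:SS>" in buf from the given time (UTC).
 * Returns the string length, or -1 on error or truncation.
 * Note: gmtime() is not thread-safe; a reentrant variant would be
 * preferable in multithreaded code.
 */
int trash_versioned_name(char *buf, size_t size, const char *name,
                         time_t when)
{
    struct tm *tm = gmtime(&when);
    char stamp[32];
    int len;

    if (tm == NULL)
        return -1;
    if (strftime(stamp, sizeof(stamp), "%Y-%m-%d-%H:%M:%S", tm) == 0)
        return -1;

    len = snprintf(buf, size, "%s.%s", name, stamp);
    if (len < 0 || (size_t)len >= size)
        return -1;

    return len;
}
```

Whether the deletion time or the creation time is passed as `when` is exactly the open question raised above; the formatting is the same in both cases.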

In order to avoid overwhelming the Trash Can with files that are rapidly created and deleted (e.g. short-lived temporary files), it would be desirable to impose an upper limit on the number of versions that will be saved in the trash can.  Some complexity exists in implementing this, because the MDS shouldn't need to do a full directory listing to determine if there are multiple versions of a file in the trash.

Avoid preserving temporary files

Files that only exist for a very short time (e.g. temporary files) should not necessarily be preserved in the Trash Can, or they can quickly overwhelm the available capacity of the filesystem, and result in important files being purged from trash and/or filling the trash faster than files can be cleaned up. Files marked with the I_LINKABLE flag on the MDS (from O_TMPFILE or Lustre Volatile files, see LU-18844) should not be preserved in the Trash Can.  It would be useful to have a tunable parameter that sets a minimum age for files to be preserved in the Trash Can (e.g. 65 minutes?) so that files that are frequently created and deleted are not preserved since they could consume a considerable amount of space.
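The two exclusion rules above (linkable/volatile files and files younger than a minimum age) could be combined into a single predicate evaluated at unlink time. The sketch below is illustrative only: the function name, the min_age_secs tunable, and the is_linkable argument (standing in for the I_LINKABLE check on the MDS) are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a file being unlinked should be kept in the Trash Can.
 * Files created via O_TMPFILE or as Lustre volatile files are never kept,
 * and files younger than min_age_secs are treated as temporary.
 */
bool trash_should_preserve(uint64_t create_time, uint64_t delete_time,
                           uint64_t min_age_secs, bool is_linkable)
{
    if (is_linkable)
        return false;

    /* Clock skew: a file "deleted before creation" counts as age zero. */
    if (delete_time < create_time)
        return min_age_secs == 0;

    return delete_time - create_time >= min_age_secs;
}
```

With the 65 minute value floated above, min_age_secs would be 3900; setting it to zero disables the age check entirely.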

JobID of process deleting a file

In LU-13031 the JobID of the process that first creates a file is stored in the user.job xattr on the MDT inode, for diagnostic purposes and to allow determining provenance of each file later on.  For the Trash Can, LU-17648 describes storing the JobID of the process that is deleting the file, for diagnostics such as determining rogue processes that are deleting files in the filesystem.  Something like user.del would be a reasonable default xattr name.  The actual xattr name can be configured with the mdt.*.job_xattr_del parameter.

...

The FID of the ".Trash" directory is derived from the FID of the parent directory (pFID), by looking up the corresponding "stub" directory with the FID-named directory: ".lustre/trash/MDTXXXX/UID/pFID". Essentially ".Trash" under each normal directory is just a virtual shortcut to the stub directory (if the parent is not a striped dir) that is accessible in each directory if specified by name ".Trash". The files/directories under the ".Trash" directory can be accessed via the normal POSIX file system API, such as readdir()/stat()/getxattr(), so that normal tools such as "ls -l .Trash/" or "find .Trash" can be used to locate files for restoration or permanent removal.  If there are no deleted files under a specific directory, then the virtual .Trash directory will not be accessible, and will return -ENOENT for any lookup.

.Trash pFID name lookup

The FID-based names of stub directories stored as .lustre/trash/MDTxxxx/UID/pFID directory are needed for efficient lookup of the parent FID during unlink.  However, these directory names are not very user-friendly when browsing the virtual .Trash directory in the filesystem namespace.  Rather than showing the pFID name to users during readdir() calls (ls, find, etc.) it would be better to look up the actual parent directory name via the FID→trusted.link xattr on the parent and return this to clients.  The FID number of the directory entry would be the FID of the stub directory itself, not the pFID that is used internally for identification.  While copying the trusted.link xattr over to the stub directory at creation would simplify this lookup, there is some risk that the name in the trusted.link xattr would become stale if the parent directory is renamed.  On the other hand, this may also be useful to preserve the original name of the directory in case some automated tool is renaming the original directory to a temporary name before deletion?

...