Right now, in Lustre, sparse files are stored without explicitly saving pages filled with zeroes. But when a sparse file is requested by a client, hole pages are transmitted as pages filled with zeroes. Thus, having holes does not save any bandwidth.
With compression, the problem is quite significant. Indeed, when a file is compressed, it is split into blocks (called compression chunks), and each chunk is compressed with holes at the end so that all blocks keep the same layout in the file.
We tackle this issue in LNet: we skip holes and send only useful data.
The code modifications concern the OSS and the client. To do that, we first have to retrieve sparseness information in the LNet layer.
Detection of holes and reporting to LNet
Hole information is read from osd_do_bio and saved in a field of struct niobuf_local. This field is accessible from tgt_brw_read and is transmitted through struct ptlrpc_bulk_desc down to target_bulk_io. In ptlrpc_start_bulk_transfer, when md (struct lnet_md) is built, we set two new fields, nempty and empty_list, to describe the list of holes. In lnet_md_build, these new fields are copied from md into lmd (struct lnet_libmd *).
Thus, from LND code, we are able to get information about holes and run specific code depending on the LND. This information will be retrieved through the msg_md field (struct lnet_libmd *) of lntmsg (struct lnet_msg *).
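As a rough sketch, the new fields could look as follows. The field types and placement are assumptions; the text above only specifies the names, and the bitmap encoding of present/absent pages follows the discussion in the comments below.

    /* Sketch only: types and placement are assumptions. */
    struct lnet_md {
            /* ... existing fields: start, length, threshold, options ... */
            unsigned int   nempty;     /* number of hole pages in the bulk */
            unsigned long *empty_list; /* bitmap: bit i set => page i is a hole */
    };

    struct lnet_libmd {
            /* ... existing fields ... */
            unsigned int   md_nempty;     /* copied from nempty */
            unsigned long *md_empty_list; /* copied from empty_list */
    };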
Modifications in LNet
To handle and skip holes, several modifications are needed in LNet.
Client requests data
If the client supports holes, the server will not send hole pages filled with zeroes. But the client first needs to notify the server that it can handle such a sparse send; this is done when it sends the PUT request to the OSS.
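A minimal sketch of how the client could advertise this capability in the request, assuming a hypothetical per-message flag (MSG_SPARSE_OK and the helper below are not part of the design above):

    /* Hypothetical: set a flag in the request so the OSS knows this
     * client can reconstruct holes on its side. */
    if (client_supports_sparse_bulk())              /* hypothetical helper */
            lustre_msg_add_flags(req->rq_reqmsg, MSG_SPARSE_OK);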
OSS sends data
If the OSS gets such a request, it will transmit the data without holes. For that purpose, some modifications are required in lnet_md_build: we read the nempty and empty_list fields from md (struct lnet_md), and, when lmd (struct lnet_libmd *) is built, hole pages are skipped and the list of holes is saved in two new fields (md_nempty and md_empty_list).
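A sketch of that change, assuming a kiov-backed MD; the loop details and the md_kiov/md_niov handling are assumptions based on stock LNet structures:

    /* Sketch: copy only the non-hole entries into the libmd page vector
     * and keep the hole bitmap around for the LND code. */
    lmd->md_nempty = umd->nempty;
    lmd->md_empty_list = umd->empty_list;

    kiov = umd->start;              /* bio_vec array, one entry per page */
    niov = 0;
    for (i = 0; i < umd->length; i++) {
            if (lmd->md_nempty > 0 && test_bit(i, lmd->md_empty_list))
                    continue;       /* hole page: do not map it */
            lmd->md_kiov[niov++] = kiov[i];
    }
    lmd->md_niov = niov;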
The other modification is in LNetPut, where we use LNET_MSG_PUT_SPARSE as the message type when md_nempty is strictly positive. This new message type is needed for the client to parse the received data correctly, since a new header will carry the list of holes.
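A sketch of the type selection in LNetPut; hooking it into lnet_prep_send is an assumption about where the type is best chosen:

    /* Sketch: pick the sparse message type when the MD contains holes. */
    lnet_prep_send(msg,
                   md->md_nempty > 0 ? LNET_MSG_PUT_SPARSE : LNET_MSG_PUT,
                   target, 0, md->md_length);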
The code differs depending on the LND, so that it can be optimized for the corresponding fabric. There is no particular issue in doing that, since the shared code stays the same and the code that is specific to an LND lives only in the LND code itself.
For socklnd
In ksocknal_send, if the message type is LNET_MSG_PUT_SPARSE, then we first send a header containing the number of hole pages and their list (a bitmap of present/absent pages). The list is retrieved using msg_md->md_empty_list.
Then, we send the data without further modification, since our lmd already skips the holes.
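A sketch of the header and the send-side check; the on-wire layout is an assumption, as the text only says it carries the hole count and the bitmap. With 4 KiB pages, a 1 MiB bulk is 256 pages, so the bitmap fits in 32 bytes, as noted in the comments below.

    /* Hypothetical on-wire header sent before the payload of an
     * LNET_MSG_PUT_SPARSE message. */
    struct ksock_sparse_hdr {
            __u32 ksh_nempty;       /* number of hole pages */
            __u64 ksh_bitmap[4];    /* bit i set => page i is absent */
    };

    /* In ksocknal_send(), before streaming the payload: */
    struct ksock_sparse_hdr hdr;

    if (lntmsg->msg_type == LNET_MSG_PUT_SPARSE) {
            struct lnet_libmd *md = lntmsg->msg_md;

            hdr.ksh_nempty = md->md_nempty;
            memcpy(hdr.ksh_bitmap, md->md_empty_list, sizeof(hdr.ksh_bitmap));
            /* send hdr on the socket, then fall through to the normal
             * payload path: the lmd already skips the hole pages */
    }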
For o2iblnd
RDMA only allows us to write to a remote address range contiguously. Behind that range, the memory can be fragmented, but we cannot change the mapping. So we cannot skip holes in client memory.
One way to deal with this is to write the data to the remote address normally. Client memory will then hold all the data in sequence, without any holes. So the client will have to remap the pages by copying the data to the right places and filling the holes with zeroes.
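A sketch of that client-side remap (all names are assumptions): walking the bitmap from the last page down guarantees that a source page is never overwritten before it has been moved.

    /* Sketch: the server wrote the payload densely, so pages
     * 0..npresent-1 hold the data (npresent = npages minus the number of
     * holes). Spread it out in place, back to front, zeroing the holes
     * along the way. */
    src = npresent;
    for (i = npages - 1; i >= 0; i--) {
            if (test_bit(i, bitmap)) {
                    clear_highpage(pages[i]);       /* hole: fill with zeroes */
            } else {
                    src--;
                    if (src != i)
                            copy_highpage(pages[i], pages[src]);
            }
    }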
The drawback of this method is the extra copy of the data. It will definitely save some bandwidth, but it is not certain that we would get a performance gain, because of the added processing time on the client.
In the case of holes coming from compression in Lustre, this solution could be optimal if the compression chunk size were a multiple of the bulk IO RPC size. In that case, holes will only be at the end of the data to send, not between pages in the middle of an RPC's data. Thus, the client won't have to shift the pages, only fill the holes after the received data with zeroes.
Another solution is to split the send into multiple work requests, each representing a contiguous run of data blocks between holes. In these work requests, the remote address will carry the right offset so the data maps correctly to the right pages on the destination. In the context of compression, the number of work requests will equal the number of holes, which is the bulk IO RPC size divided by the compression chunk size (for example, with a 4 MiB bulk RPC and 64 KiB chunks, 64 work requests). One open question for this approach is the overhead of issuing several work requests.
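A sketch of this second approach; the fragment helper is hypothetical, only the split logic is illustrated:

    /* Sketch: emit one RDMA write per contiguous run of present pages,
     * offsetting the remote address so each run lands on the right
     * destination pages. */
    i = 0;
    while (i < npages) {
            if (test_bit(i, md->md_empty_list)) {
                    i++;                            /* skip a hole */
                    continue;
            }
            start = i;
            while (i < npages && !test_bit(i, md->md_empty_list))
                    i++;
            /* one work request covering pages [start, i) */
            kiblnd_queue_rdma_frag(tx, start, i - start,    /* hypothetical */
                                   remote_addr + (__u64)start * PAGE_SIZE);
    }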
Client receives data
As on the OSS side, the client code modifications depend on the LND.
For socklnd
When an LNET_MSG_PUT_SPARSE message arrives, the client receives the header containing the hole list before receiving the requested data. Then, we modify lmd (struct lnet_libmd *) to fill the holes with zeroes and skip them when receiving the data. After that, ksocknal_recv_kiov can be called normally.
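A sketch of that receive-side preparation; structure names follow the send-side sketch above and are assumptions:

    /* Sketch: zero each hole page and drop it from the lmd kiov, so that
     * ksocknal_recv_kiov() streams the payload into present pages only. */
    unsigned long *bitmap = (unsigned long *)hdr.ksh_bitmap;

    niov = 0;
    for (i = 0; i < lmd->md_niov; i++) {
            if (test_bit(i, bitmap)) {
                    memset(kmap(lmd->md_kiov[i].bv_page) +
                           lmd->md_kiov[i].bv_offset, 0,
                           lmd->md_kiov[i].bv_len);
                    kunmap(lmd->md_kiov[i].bv_page);
                    continue;
            }
            lmd->md_kiov[niov++] = lmd->md_kiov[i];     /* keep this page */
    }
    lmd->md_niov = niov;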
For RDMA
Depending on the implementation chosen, the client may need to copy the data to the right location. In all cases, it will have to fill the hole pages with zeroes.
2 Comments
Andreas Dilger
Why use an array of page numbers at the start of the socklnd header instead of a bitmap of present/absent pages? That is going to pack much more densely, about 32 bytes for a 1MiB transfer.
Cyril Bordage
Sorry for the mistake, I wrote an array of 256 integers, but I don't know why… As discussed previously, it will be a bitmap of present/absent pages. I will fix that.