Queue Depth and Credit Management Logic
- When a connection is established on the passive or active side
conn→ibc_creditsis set to the negotiated queue depth- For the passive side this happens in
kiblnd_passive_connect() - For the active side this happens in
kiblnd_check_connreply()
- For the passive side this happens in
- When posting a transmit,
kiblnd_post_tx_locked()is passed a credit value.- The credit value depends on the message. It's either 0 or 1.
IBLND_MSG_PUT_NAK, IBLND_MSG_PUT_ACK, IBLND_MSG_PUT_DONE, IBLND_MSG_GET_DONE, IBLND_MSG_NOOP(for v2 protocol) require no credits- All other messages consume 1 credit.
- If we fail to post the message we return the credit to the connection
- In
kiblnd_handl_rx()ifmsg→ibm_creditsis not == 0, then we add these credits to the connection and attempt to send queued messages. - After posting the rx we increment
conn->bc_outstanding_creditsby one if need be. This indicates the credits we are returning to the other side. - If we have any messages to transmit over the connection, then we return these credits,
conn->ibc_outstanding_creditsto the peer. - These credits are then added to the peer's
conn→ibc_creditsand used to send further messages.
- It is expected that the credits will remain <=
conn->ibc_queue_depth- Unfortunately, the way it's been implemented the code is not clear about the connection between
conn→ibc_credits, conn→ibc_queue_depth, and conn→ibc_outstanding_credits. - Wouldn't it make more sense to simply make sure that the number of credits on the connection is consistent on both peers, instead of passing back and forth the number of credits to accumulate? (Needs some thinking)
- Unfortunately, the way it's been implemented the code is not clear about the connection between
Comments on Alexy's changes
Recently Alexy submitted the following patch:
https://review.whamcloud.com/#/c/28850
This changes the IBLND_SEND_WRS as follows:
# original code #define IBLND_SEND_WRS(c) \ ((c->ibc_max_frags + 1) * kiblnd_concurrent_sends(c->ibc_version, \ c->ibc_peer->ibp_ni)) # modified code #define IBLND_SEND_WRS(c) \ ((c->ibc_max_frags + 1) * c->ibc_queue_depth)
The assumption that's being made here is that it is enough to set the init_qp_attr→cap.max_send_wr to the negotiated number of fragments * the queue depth. This is the wrong assumption, since the WRS are actually consumed per fragment you're sending over, not per ib_post_send(). This might work for fastreg or FMR because they only have one fragment. And with the latest changes for FMR that is no longer true.
However, we still need to support the global memory region case, where we can have up to 256 fragments to send over and we can send up to kiblnd_concurrent_send() at the same time, therefore if we make the change as noted above, we'll end up overflowing the Work Request Queue pretty quickly.
Furthermore, the change in kiblnd_create_conn() to reduce both the init_qp_attr→cap.max_send_wr and init_qp_attr→cap.max_recv_wr is problematic because we end up reducing a negotiated value. This will cause a disconnect between the negotiated value and what will be used for the queue pair being created. Both peers can use a different value for their side of the queue pair. I'm unsure what impact that'll have, or if we open ourselves to corner cases we have not considered.
I verified this issue on latest master with the following simple patch.
Two OPA nodes with map_on_demand = 32
diff --git a/lnet/klnds/o2iblnd/o2iblnd_cb.c b/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 35502b6..492b1fc 100644
--- a/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1082,11 +1082,11 @@ kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
if (tx->tx_nwrq >= conn->ibc_max_frags) {
CERROR("RDMA has too many fragments for peer_ni %s (%d), "
- "src idx/frags: %d/%d dst idx/frags: %d/%d\n",
+ "src idx/frags: %d/%d dst idx/frags: %d/%d tx_nwrq = %d\n",
libcfs_nid2str(conn->ibc_peer->ibp_nid),
conn->ibc_max_frags,
srcidx, srcrd->rd_nfrags,
- dstidx, dstrd->rd_nfrags);
+ dstidx, dstrd->rd_nfrags, tx->tx_nwrq);
rc = -EMSGSIZE;
break;
}
@@ -1127,6 +1127,8 @@ kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
wrq_sge++;
if (wrq_sge == *kiblnd_tunables.kib_wrq_sge || dstidx != prev) {
tx->tx_nwrq++;
+ CDEBUG(D_NET, "AMIR tx->tx_nwrq = %d, wrq_sge = %d, kib_wrq_sge = %d\n",
+ tx->tx_nwrq, wrq_sge, *kiblnd_tunables.kib_wrq_sge);
wrq->wr.num_sge = wrq_sge;
wrq_sge = 0;
}
you'll see that for each rdma frag you're using a wr.
# This error appears on the console LNetError: 58590:0:(o2iblnd_cb.c:1089:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 172.16.1.3@o2ib (32), src idx/frags: 32/256 dst idx/frags: 32/256 tx_nwrq = 32
The above patches were made only with mlx5 in mind. Both mlx4 and OPA will use FMR. And potentially mlx4 on older kernels, prior to RHEL 7.4, will use global memory regions.
Conclusion
Some of Alexy's patches are useful, but the above patch and https://review.whamcloud.com/#/c/28851/5 are problematic because of the reasons I outlined below
Tasks
- Resolve the following patches:
- Consider simplifying the Credit Management system in a separate patch.