Queue Depth and Credit Management Logic

When a connection is established on the passive or active side conn→ibc_credits is set to the negotiated queue depth
- For the passive side this happens in kiblnd_passive_connect()
- For the active side this happens in kiblnd_check_connreply()
When posting a transmit, kiblnd_post_tx_locked() is passed a credit value.
- The credit value depends on the message. It's either 0 or 1.
- IBLND_MSG_PUT_NAK, IBLND_MSG_PUT_ACK, IBLND_MSG_PUT_DONE, IBLND_MSG_GET_DONE, IBLND_MSG_NOOP (for v2 protocol) require no credits
- All other messages consume 1 credit.
- If we fail to post the message we return the credit to the connection

In kiblnd_handl_rx() if msg→ibm_credits is not == 0, then we add these credits to the connection and attempt to send queued messages.
After posting the rx we increment conn->bc_outstanding_credits by one if need be. This indicates the credits we are returning to the other side.
If we have any messages to transmit over the connection, then we return these credits, conn->ibc_outstanding_credits to the peer.
These credits are then added to the peer's conn→ibc_credits and used to send further messages.

It is expected that the credits will remain <= conn->ibc_queue_depth
- Unfortunately, the way it's been implemented the code is not clear about the connection between conn→ibc_credits, conn→ibc_queue_depth, and conn→ibc_outstanding_credits.
- Wouldn't it make more sense to simply make sure that the number of credits on the connection is consistent on both peers, instead of passing back and forth the number of credits to accumulate? (Needs some thinking)

Comments on Alexy's changes

Recently Alexy submitted the following patch:

https://review.whamcloud.com/#/c/28850

This changes the IBLND_SEND_WRS as follows:

# original code
#define IBLND_SEND_WRS(c)	\
	((c->ibc_max_frags + 1) * kiblnd_concurrent_sends(c->ibc_version, \
							  c->ibc_peer->ibp_ni))
 
# modified code
#define IBLND_SEND_WRS(c)	\
	((c->ibc_max_frags + 1) * c->ibc_queue_depth)

The assumption that's being made here is that it is enough to set the init_qp_attr→cap.max_send_wr to the negotiated number of fragments * the queue depth. This is the wrong assumption, since the WRS are actually consumed per fragment you're sending over, not per ib_post_send(). This might work for fastreg or FMR because they only have one fragment. And with the latest changes for FMR that is no longer true.

However, we still need to support the global memory region case, where we can have up to 256 fragments to send over and we can send up to kiblnd_concurrent_send() at the same time, therefore if we make the change as noted above, we'll end up overflowing the Work Request Queue pretty quickly.

Furthermore, the change in kiblnd_create_conn() to reduce both the init_qp_attr→cap.max_send_wr and init_qp_attr→cap.max_recv_wr is problematic because we end up reducing a negotiated value. This will cause a disconnect between the negotiated value and what will be used for the queue pair being created. Both peers can use a different value for their side of the queue pair. I'm unsure what impact that'll have, or if we open ourselves to corner cases we have not considered.

I verified this issue on latest master with the following simple patch.

Two OPA nodes with map_on_demand = 32

diff --git a/lnet/klnds/o2iblnd/o2iblnd_cb.c b/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 35502b6..492b1fc 100644
--- a/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1082,11 +1082,11 @@ kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
 
                if (tx->tx_nwrq >= conn->ibc_max_frags) {
                        CERROR("RDMA has too many fragments for peer_ni %s (%d), "
-                              "src idx/frags: %d/%d dst idx/frags: %d/%d\n",
+                              "src idx/frags: %d/%d dst idx/frags: %d/%d tx_nwrq = %d\n",
                               libcfs_nid2str(conn->ibc_peer->ibp_nid),
                               conn->ibc_max_frags,
                               srcidx, srcrd->rd_nfrags,
-                              dstidx, dstrd->rd_nfrags);
+                              dstidx, dstrd->rd_nfrags, tx->tx_nwrq);
                        rc = -EMSGSIZE;
                        break;
                }
@@ -1127,6 +1127,8 @@ kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
                wrq_sge++;
                if (wrq_sge == *kiblnd_tunables.kib_wrq_sge || dstidx != prev) {
                        tx->tx_nwrq++;
+                       CDEBUG(D_NET, "AMIR tx->tx_nwrq = %d, wrq_sge = %d, kib_wrq_sge = %d\n",
+                              tx->tx_nwrq, wrq_sge, *kiblnd_tunables.kib_wrq_sge);
                        wrq->wr.num_sge = wrq_sge;
                        wrq_sge = 0;
                }

you'll see that for each rdma frag you're using a wr.

# This error appears on the console
LNetError: 58590:0:(o2iblnd_cb.c:1089:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 172.16.1.3@o2ib (32), src idx/frags: 32/256 dst idx/frags: 32/256 tx_nwrq = 32

The above patches were made only with mlx5 in mind. Both mlx4 and OPA will use FMR. And potentially mlx4 on older kernels, prior to RHEL 7.4, will use global memory regions.

Conclusion

Some of Alexy's patches are useful, but the above patch and https://review.whamcloud.com/#/c/28851/5 are problematic because of the reasons I outlined below

Tasks

Resolve the following patches:
Consider simplifying the Credit Management system in a separate patch.

Space shortcuts

Page tree

Queue Depth and Credit Management Logic

Comments on Alexy's changes

Conclusion

Tasks

Space shortcuts

Page tree

o2iblnd Queue Depth and Credit Management

Queue Depth and Credit Management Logic

Comments on Alexy's changes

Conclusion

Tasks