We've seen IB_WC_WR_FLUSH_ERR errors on some sites. One explanation for these errors is recorded here:

https://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.lnx.70.doc/diag/problem_determination/rdma_jverbs_qp_error.html

The relevant section is pasted below:

-----

If the send and receive buffer sizes do not match, a send or receive request can result in a queue pair error that cannot be recovered. You can add steps to the communication process to avoid this problem.

This problem can occur if the size of the receive buffer does not match the size of the send buffer. The send operation fails with the operation code IBV_WC_LOC_LEN_ERR on the receiver side and IBV_WC_REM_INV_REQ_ERR on the sender side. Such an error causes the queue pair to move to an error state and further send or receive requests result in the operation code IBV_WC_WR_FLUSH_ERR. The queue pair cannot be recovered.

To avoid this problem, ensure that the length of the posted receive buffer is large enough to hold the send request. If the length is not known, you can program your application to communicate the required length before the send. On the other side, a buffer can be prepared with the appropriate size to receive.

-----
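
To make the advice in the quoted section concrete, here is a minimal sketch of "communicate the required length before the send". It is an illustration only: struct len_hdr, announce_len() and recv_buf_fits() are hypothetical names, not part of libibverbs or Lustre, and the actual posting of send and receive work requests is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size header sent ahead of every payload. */
struct len_hdr {
    uint32_t payload_len;   /* bytes the sender is about to send */
};

/* Sender side: fill in the header that announces the payload length. */
void announce_len(struct len_hdr *hdr, size_t payload_len)
{
    assert(hdr != NULL);
    assert(payload_len <= UINT32_MAX);
    hdr->payload_len = (uint32_t)payload_len;
}

/*
 * Receiver side: return 1 if a receive buffer of recv_buf_len bytes can
 * hold the announced payload, 0 if a larger buffer has to be prepared
 * before the receive is posted.
 */
int recv_buf_fits(const struct len_hdr *hdr, size_t recv_buf_len)
{
    assert(hdr != NULL);
    return hdr->payload_len <= recv_buf_len;
}
```

The header itself has a fixed size, so both ends can always post a matching buffer for it. Only the payload that follows needs a buffer sized from the announced length.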

In Lustre 2.10 and earlier, I suspect this could happen because we reduce max_send_wr for the QP during creation without adjusting the queue depth to match. This could lead to the QP being created with different buffer sizes on the two ends of the connection.
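
A rough sketch of the pattern I have in mind is below. It is not the actual Lustre code: fake_create_qp(), create_qp_with_backoff() and clamp_queue_depth() are hypothetical names, the shrink-by-a-quarter step is an assumption, and QP creation is simulated by a simple device limit. It shows a retry loop that lowers max_send_wr until creation succeeds, plus the step I think is missing, which is clamping the queue depth to what was actually granted.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for QP creation: succeeds (returns 0) only if the
 * requested send queue size is within the simulated device limit.
 */
int fake_create_qp(int max_send_wr, int device_limit)
{
    return max_send_wr <= device_limit ? 0 : -1;
}

/*
 * Retry creation with a progressively smaller send queue. Returns the
 * max_send_wr that was actually granted, or -1 if it fell below min_wr.
 */
int create_qp_with_backoff(int wanted_wr, int device_limit, int min_wr)
{
    int wr = wanted_wr;

    assert(wanted_wr > 0 && min_wr > 0);
    while (fake_create_qp(wr, device_limit) != 0) {
        int step = wr / 4;          /* assumed policy: shrink by a quarter */

        if (step < 1)
            step = 1;               /* always make progress */
        wr -= step;
        if (wr < min_wr)
            return -1;
    }
    return wr;
}

/*
 * Clamp the queue depth so that queue_depth * wrs_per_msg never exceeds
 * the number of send work requests the QP was actually created with.
 */
int clamp_queue_depth(int queue_depth, int granted_wr, int wrs_per_msg)
{
    int max_depth;

    assert(wrs_per_msg > 0);
    max_depth = granted_wr / wrs_per_msg;
    return queue_depth < max_depth ? queue_depth : max_depth;
}
```

If only the retry loop runs and the queue depth is left at its original value, the two peers can end up with different ideas of how many messages may be in flight, which is the mismatch suspected above.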