These are notes from a debugging session with Patrick Farrell and Oleg Drokin, looking at evictions that occur when testing bond link failover.
This example is with dlmtrace and rpctrace dk logs on the client & server.
So first off, the client doesn't know why it's evicted - it just gets "hey, you've been evicted" the next time it tries to contact the server.
The server, on the other hand, does know what happened, so we look in the server log for the eviction - searching for "evicting", and specifically "evicting client":
15247049 00010000:00020000:35.1F:1566497870.915689:0:0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.15.157@o2ib10 ns: filter-lustre-OST0000_UUID lock: ffff8889a979b000/0xb105f2d3186c659d lrc: 3/0,0 mode: PW/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000480010020 nid: 10.0.15.157@o2ib10 remote: 0xa1a3a6cd8b8b7474 expref: 5 pid: 29927 timeout: 4448893984 lvb_type: 0
That's the lock that caused the eviction. So we look for earlier instances of it in the server log, searching for either the "remote:" field or the second value in "lock:" ffff8889a979b000/0xb105f2d3186c659d <-- that second value is the lock cookie and is generally not reused (maybe never?), whereas the first is a pointer and so subject to reuse later.
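As a mental model (an illustrative sketch only, not the real Lustre structures), the two values printed after "lock:" are an in-memory pointer and a 64-bit handle cookie, and "remote:" is the peer's cookie for the same lock:

/* Illustrative sketch, not Lustre's actual definitions. */
#include <stdint.h>

struct example_handle {
	uint64_t cookie;	/* the value after the '/' in "lock:" */
};

struct example_lock {
	struct example_handle local;	/* this node's cookie for the lock */
	struct example_handle remote;	/* the peer's cookie ("remote:")   */
	/* ... mode, resource, extent, flags, etc. ... */
};

The pointer (ffff8889a979b000) is just the lock's address and can be handed out again once this lock is freed; the cookie effectively identifies one lock instance for its lifetime, which is why it's the safer thing to search for.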
Here are the last two messages with it on the server before the eviction message:
15241545 00010000:00010000:35.0:1566497770.692303:0:29965:0:(ldlm_lockd.c:919:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lustre-OST0000_UUID lock: ffff8889a979b000/0xb105f2d3186c659d lrc: 3/0,0 mode: PW/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x50000080010020 nid: 10.0.15.157@o2ib10 remote: 0xa1a3a6cd8b8b7474 expref: 5 pid: 29927 timeout: 0 lvb_type: 0
15241546 00010000:00010000:35.0:1566497770.692315:0:29965:0:(ldlm_lockd.c:476:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-lustre-OST0000_UUID lock: ffff8889a979b000/0xb105f2d3186c659d lrc: 4/0,0 mode: PW/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x70000480010020 nid: 10.0.15.157@o2ib10 remote: 0xa1a3a6cd8b8b7474 expref: 5 pid: 29927 timeout: 4448893984 lvb_type: 0
It's prepping to send the bl_ast. So if you look at that thread (:29965:)
and see what it does after that second message, you can find the RPC:
15241547 00000100:00100000:35.0:1566497770.692328:0:29965:0:(client.c:1620:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost00_005:lustre-OST0000_UUID:29965:1642590389685184:10.0.15.157@o2ib10:104
(The :104 at the end is the opcode; 104 is the LDLM blocking callback, i.e. a BL callback RPC.)
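For reference, the LDLM opcodes are declared in lustre_idl.h roughly as below (the enum name and values here are quoted from memory, so verify against your tree):

/* LDLM RPC opcodes, approximately as in lustre_idl.h -- double-check
 * the name and values against your source tree before relying on them. */
enum ldlm_cmd {
	LDLM_ENQUEUE     = 101,
	LDLM_CONVERT     = 102,
	LDLM_CANCEL      = 103,
	LDLM_BL_CALLBACK = 104,	/* the blocking callback seen here */
	LDLM_CP_CALLBACK = 105,
	LDLM_GL_CALLBACK = 106,
	LDLM_SET_INFO    = 107,
};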
In this case, I'm guessing the RPC was received, so I'm going to start by seeing if the client got the BL ast.
The "remote:" field on the server is the client cookie, and on the client, remote: is the server cookie, so we use that to jump between client & server.
So in this case, 0xa1a3a6cd8b8b7474. The client log will have a lot of entries with this lock in it, so it's usually best to start from the end of the log and work your way back; otherwise you'll walk through lock creation and every time the lock was used, which could be a lot of times.
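If you'd rather script that than grep by hand, a minimal stand-alone filter does the job - nothing Lustre-specific, just plain C; call it lockgrep.c and pass it whatever cookie and log file you're chasing:

/* lockgrep.c - print every dk log line containing a given lock cookie,
 * so one lock can be followed through a client or server log.
 * Build: cc -o lockgrep lockgrep.c
 * Use:   ./lockgrep 0xa1a3a6cd8b8b7474 client-dk.log | tail -20
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char line[8192];
	FILE *fp;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <cookie> <dk-log>\n", argv[0]);
		return 1;
	}

	fp = fopen(argv[2], "r");
	if (fp == NULL) {
		perror(argv[2]);
		return 1;
	}

	/* Print only the lines mentioning the cookie we care about. */
	while (fgets(line, sizeof(line), fp) != NULL)
		if (strstr(line, argv[1]) != NULL)
			fputs(line, stdout);

	fclose(fp);
	return 0;
}

Piping through tail matches the advice above (newest matches first, work backwards), and the same filter works for following a thread by its pid field, e.g. ":29965:".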
The first messages you'll see (going backwards) are generally these two, which is the lock being destroyed:
00010000:00010000:34.0:1566497895.562428:0:10111:0:(ldlm_lockd.c:1788:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lustre-OST0000-osc-ffff99a0c2532000 lock: ffff99993c3d5800/0xa1a3a6cd8b8b7474 lrc: 1/0,0 mode: --/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4d6f480000000 nid: local remote: 0xb105f2d3186c659d expref: -99 pid: 10482 timeout: 0 lvb_type: 1
00010000:00010000:34.0:1566497895.562433:0:10111:0:(ldlm_lock.c:196:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0000-osc-ffff99a0c2532000 lock: ffff99993c3d5800/0xa1a3a6cd8b8b7474 lrc: 0/0,0 mode: --/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4d6f480000000 nid: local remote: 0xb105f2d3186c659d expref: -99 pid: 10482 timeout: 0 lvb_type: 1
But if you trace it back a bit, this is happening because of the eviction. So we have to look for when the BL ast was first received & processed...
So if we go back, we pretty quickly find this:
00010000:00010000:34.0F:1566497771.701364:0:10111:0:(ldlm_lockd.c:1766:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-OST0000-osc-ffff99a0c2532000 lock: ffff99993c3d5800/0xa1a3a6cd8b8b7474 lrc: 3/0,0 mode: PW/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x420080000000 nid: local remote: 0xb105f2d3186c659d expref: -99 pid: 10482 timeout: 0 lvb_type: 1
(You'll notice this is 124 seconds before the last messages about the lock - consistent with the server waiting out its 100 second callback timer and the client only noticing the eviction once it reconnects.)
So that's the eviction tracing itself. Now we have to try to figure out what the client did after it got that message.
On the client:
00000008:00010000:12.0:1566497895.350514:0:10070:0:(osc_cache.c:863:osc_extent_finish()) ### extent: ffff99a117b9cbd0 ns: lustre-OST0000-osc-ffff99a0c2532000 lock: ffff99993c3d5800/0xa1a3a6cd8b8b7474 lrc: 4/0,0 mode: PW/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x429480000000 nid: local remote: 0xb105f2d3186c659d expref: -99 pid: 10482 timeout: 0 lvb_type: 1
we see our lock pointer appear as part of cleaning up an OSC extent. Looking at other operations on this OSC object in question, we find:
00000008:00000020:34.0F:1566497895.350825:0:10111:0:(osc_cache.c:3004:osc_cache_wait_range()) obj ffff999fcae4e140 ready 0|-|- wr 0|-|- rd 0|- sync file range.
From the code, it appears it was waiting for part of the cache to be written out, and it didn't finish waiting until after the eviction.
That's a good example of general eviction debug. This next part is more detailed for the specific issue.
In this particular case, the following scenario appears to be occurring:
client --> server: I need to write some dirty pages
server --> client: ok, I set up my buffers, go ahead
client --> server: RDMA write dirty pages
server --> client: RPC write complete
The final "write complete" reply is not received by the client, because the link fails at this point. LNet does not resend this message because it has already been put on the wire.
On the server we see:
15218519 00010000:00000200:5.0:1566497737.519449:0:29948:0:(ldlm_lib.c:2956:target_send_reply_msg()) @@@ sending reply req@ffff8889703c5850 x1642590398998528/t339302430759(0) o4->4f9e2430-faba-42c7-7483-4a9a7ed5444e@10.0.15.157@o2ib10:412/0 lens 608/448 e 0 to 0 dl 1566497817 ref 1 fl Interpret:/0/0 rc 0/0
The calling code, in target_send_reply(), looks like:
	rs = req->rq_reply_state;
	if (rs == NULL || !rs->rs_difficult) {
		/* no notifiers */
		target_send_reply_msg(req, rc, fail_id);
		EXIT;
		return;
	}

and in reply_out_callback():

	if (!rs->rs_difficult) {
		/* 'Easy' replies have no further processing so I drop the
		 * net's ref on 'rs' */
		LASSERT(ev->unlinked);
		ptlrpc_rs_decref(rs);
		EXIT;
		return;
	}
It looks like if !rs->rs_difficult, nothing further is done for the reply once it has been handed to the network. I think this is the case I'm looking at.
Difficult replies are a mechanism where we get extra confirmation of delivery of our message to the other side, even when our message is actually a reply itself.
So the reply to a write is not difficult. If it is lost, the client has no way to know whether the write finished, so it's supposed to retry once the timeout hits on the client side.
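To make that distinction concrete, here's a rough sketch of the two paths - the structure and helper names below are hypothetical stand-ins, not the real Lustre code; the point is just that an easy reply leaves no server-side state behind:

/* Conceptual sketch only; reply_state_sketch and the helpers are
 * hypothetical stand-ins, not Lustre's actual reply path. */
#include <stdlib.h>

struct reply_state_sketch {
	int rs_difficult;	/* does the server track delivery? */
};

static void put_reply_on_wire(struct reply_state_sketch *rs) { (void)rs; }
static void wait_for_client_ack(struct reply_state_sketch *rs) { (void)rs; }

static void send_reply_sketch(struct reply_state_sketch *rs)
{
	put_reply_on_wire(rs);

	if (!rs->rs_difficult) {
		/* "Easy" reply: fire and forget.  Nothing is kept around,
		 * so if the reply is lost the server never notices - only
		 * the client's own RPC timeout can catch it. */
		free(rs);
		return;
	}

	/* "Difficult" reply: keep the state until the client acknowledges
	 * it, so delivery is confirmed even though this message is itself
	 * a reply. */
	wait_for_client_ack(rs);
}

The bulk write reply falls on the easy side, which is why the client's RPC timeout is the only thing standing between a lost reply and an eviction.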
On the client:
00000100:00000400:12.0F:1566497877.442617:0:10070:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1566497737/real 1566497737] req@ffff99985abd2f80 x1642590398998528/t0(0) o4->lustre-OST0000-osc-ffff99a0c2532000@10.0.10.133@o2ib10:6/4 lens 608/448 e 0 to 1 dl 1566497877 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00000200:12.0:1566497877.442626:0:10070:0:(client.c:2625:ptlrpc_unregister_reply()) unregister reply mdh = 1250849
00000400:00000200:12.0:1566497877.442627:0:10070:0:(lib-md.c:544:LNetMDUnlink()) Unlinking md ffff99a117bb7e00
00000400:00000200:12.0:1566497877.442628:0:10070:0:(lib-md.c:550:LNetMDUnlink()) calling event CB for md ffff99a117bb7e00
00000100:00000200:12.0:1566497877.442630:0:10070:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0 req@ffff99985abd2f80 x1642590398998528/t0(0) o4->lustre-OST0000-osc-ffff99a0c2532000@10.0.10.133@o2ib10:6/4 lens 608/448 e 0 to 1 dl 1566497877 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00000200:12.0:1566497877.442633:0:10070:0:(events.c:114:reply_in_callback()) @@@ unlink req@ffff99985abd2f80 x1642590398998528/t0(0) o4->lustre-OST0000-osc-ffff99a0c2532000@10.0.10.133@o2ib10:6/4 lens 608/448 e 0 to 1 dl 1566497877 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000400:00000200:12.0:1566497877.442635:0:10070:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff99a117bb7e00
00000100:02000400:12.0:1566497877.442637:0:10070:0:(import.c:179:ptlrpc_set_import_discon()) lustre-OST0000-osc-ffff99a0c2532000: Connection to lustre-OST0000 (at 10.0.10.133@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
00000100:00080000:12.0:1566497877.442640:0:10070:0:(import.c:188:ptlrpc_set_import_discon()) ffff99b8dbb14800 lustre-OST0000_UUID: changing import state from FULL to DISCONN
00000100:00080000:12.0:1566497877.442642:0:10070:0:(import.c:418:ptlrpc_pinger_force()) lustre-OST0000_UUID: waking up pinger s:DISCONN
and
00000100:00000200:29.0:1566497895.348533:0:10047:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0 req@ffff99b838a86880 x1642590399001200/t0(0) o8->lustre-OST0000-osc-ffff99a0c2532000@10.0.10.133@o2ib10:28/4 lens 520/544 e 0 to 0 dl 1566497942 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
00000100:00000200:29.0:1566497895.348535:0:10047:0:(events.c:114:reply_in_callback()) @@@ unlink req@ffff99b838a86880 x1642590399001200/t0(0) o8->lustre-OST0000-osc-ffff99a0c2532000@10.0.10.133@o2ib10:28/4 lens 520/544 e 0 to 0 dl 1566497942 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
00000400:00000200:29.0:1566497895.348538:0:10047:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff99b915c37700
00000100:00080000:29.0:1566497895.348541:0:10047:0:(import.c:1054:ptlrpc_connect_interpret()) lustre-OST0000-osc-ffff99a0c2532000: connect to target with instance 86
00000008:00000020:29.0:1566497895.348544:0:10047:0:(osc_request.c:1029:osc_init_grant()) lustre-OST0000-osc-ffff99a0c2532000, setting cl_avail_grant: 0 cl_lost_grant: 0.chunk bits: 12 cl_max_extent_pages: 8192
00000080:00080000:29.0:1566497895.348549:0:10047:0:(lcommon_misc.c:71:cl_init_ea_size()) updating def/max_easize: 72/96
00000100:00080000:29.0:1566497895.348551:0:10047:0:(import.c:1254:ptlrpc_connect_interpret()) @@@ lustre-OST0000-osc-ffff99a0c2532000: evicting (reconnect/recover flags not set: 4) req@ffff99b838a86880 x1642590399001200/t0(0) o8->lustre-OST0000-osc-ffff99a0c2532000@10.0.10.133@o2ib10:28/4 lens 520/384 e 0 to 0 dl 1566497942 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:29.0:1566497895.348554:0:10047:0:(import.c:1257:ptlrpc_connect_interpret()) ffff99b8dbb14800 lustre-OST0000_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:29.0:1566497895.348555:0:10047:0:(import.c:1489:ptlrpc_import_recovery_state_machine()) 167-0: lustre-OST0000-osc-ffff99a0c2532000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
The time difference between the client sending the write RPC to the server and the client declaring it timed out is 140 seconds (1566497877 - 1566497737).
So the client only attempts to reconnect 140 seconds later, but the server evicts it after 100s.
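To make the race concrete, here are the numbers pulled straight from the timestamps above (nothing Lustre-specific, just arithmetic):

/* Timeout race, using timestamps from the logs above. */
#include <stdio.h>

int main(void)
{
	long reply_sent      = 1566497737;  /* server: target_send_reply_msg()       */
	long client_timeout  = 1566497877;  /* client: ptlrpc_expire_one_request()   */
	long lock_waiting    = 1566497770;  /* server: ldlm_add_waiting_lock()       */
	long server_eviction = 1566497870;  /* server: lock callback timer expired   */

	printf("client notices the lost reply after %ld s\n",
	       client_timeout - reply_sent);		/* 140 s */
	printf("server evicts the client after %ld s\n",
	       server_eviction - lock_waiting);		/* 100 s */
	printf("client wakes up %ld s after the eviction\n",
	       client_timeout - server_eviction);	/* ~7 s too late */
	return 0;
}

So by the time the client's RPC timeout fires and it tries to reconnect, the server's 100 second lock callback timer has already expired and the eviction has already happened.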
Summary
- The client requests to write dirty pages
- Server sets up buffers to receive the RDMA and sends a GET to the client
- Client RDMAs into the server buffers
- Link fails
- Server sends the write-complete reply, but it is lost on the failed link
- Client waits for the reply to indicate the write is complete
- Server doesn't hear from the client and evicts it after 100s
- Client attempts to reconnect and retry the write after 140s, but on reconnect it is informed that the server has evicted it