These are notes from a debugging session with Patrick Farrell and Oleg Drokin on evictions that occur when testing bond link failover.

This example uses dk logs captured with dlmtrace and rpctrace enabled on both the client and the server.

So first off, the client doesn't know why it's evicted - it just gets "hey, you've been evicted" the next time it tries to contact the server.

The server does know what happened, so we look there for the eviction by searching for "evicting", and specifically "evicting client":

15247049 00010000:00020000:35.1F:1566497870.915689:0:0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.15.157@o2ib10 ns: filter-lustre-OST0000_UUID lock: ffff8889a979b000/0xb105f2d3186c659d lrc: 3/0,0 mode: PW/PW res: [0x6cdc98:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000480010020 nid: 10.0.15.157@o2ib10 remote: 0xa1a3a6cd8b8b7474 expref: 5 pid: 29927 timeout: 4448893984 lvb_type: 0
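
When the server log is large, it can help to pull the interesting fields out of each "evicting client" line mechanically. The following is a minimal Python sketch, not a Lustre tool: the regular expression is inferred only from the one sample line in these notes, `find_evictions` is a hypothetical helper name, and the `SAMPLE` string is a shortened copy of that line. It also subtracts the timer length from the log timestamp, which gives a rough idea of when the callback timer started and so where to look earlier in the log.

```python
import re
from decimal import Decimal

# Pattern inferred from a single sample line; other Lustre versions may word
# the message differently (assumption).
EVICT_RE = re.compile(
    r"(?P<ts>\d+\.\d+):\d+:\d+:\d+:"               # sec.usec plus three numeric fields
    r"\((?P<src>[^()]+\(\))\)\s+"                  # (file:line:function())
    r"### lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s.*?"
    r"lock: (?P<lock>[0-9a-f]+)/(?P<handle>0x[0-9a-f]+).*?"
    r"remote: (?P<remote>0x[0-9a-f]+)"
)


def find_evictions(lines):
    """Return one dict per 'evicting client' line (hypothetical helper)."""
    found = []
    for line in lines:
        m = EVICT_RE.search(line)
        if m is None:
            continue
        info = m.groupdict()
        # Expiry timestamp minus the timer length: roughly when the timer started.
        info["timer_started_near"] = str(Decimal(info["ts"]) - int(info["secs"]))
        found.append(info)
    return found


# Shortened copy of the eviction line from these notes.
SAMPLE = (
    "00010000:00020000:35.1F:1566497870.915689:0:0:0:"
    "(ldlm_lockd.c:334:waiting_locks_callback()) "
    "### lock callback timer expired after 100s: "
    "evicting client at 10.0.15.157@o2ib10 ns: filter-lustre-OST0000_UUID "
    "lock: ffff8889a979b000/0xb105f2d3186c659d lrc: 3/0,0 mode: PW/PW "
    "nid: 10.0.15.157@o2ib10 remote: 0xa1a3a6cd8b8b7474 expref: 5"
)

if __name__ == "__main__":
    for ev in find_evictions([SAMPLE]):
        print(ev["nid"], ev["lock"], ev["remote"], ev["timer_started_near"])
```

The `remote` handle is worth noting because it is the value to search for in the client log, and `timer_started_near` narrows the part of the server log to read next.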

...

15241547 00000100:00100000:35.0:1566497770.692328:0:29965:0:(client.c:1620:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost00_005:lustre-OST0000_UUID:29965:1642590389685184:10.0.15.157@o2ib10:104

(The :104 at the end is the opcode; 104 means this is a BL callback RPC.)
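
The "Sending RPC" message names its own fields (pname:cluuid:pid:xid:nid:opc), so it can be split into a record for filtering by NID or opcode. This is a rough Python sketch under assumptions: `parse_sending_rpc` is a hypothetical helper, the split assumes none of the six fields contains a colon (true for the sample here), and apart from 104, which these notes confirm, the opcode names are my recollection of the LDLM opcodes in lustre_idl.h and should be checked against your source tree.

```python
# Field list as printed by the log message itself.
MARKER = "Sending RPC pname:cluuid:pid:xid:nid:opc "

# 104 is confirmed by these notes; the other names are an assumption recalled
# from lustre_idl.h and should be verified before relying on them.
LDLM_OPCODES = {
    101: "LDLM_ENQUEUE",
    103: "LDLM_CANCEL",
    104: "LDLM_BL_CALLBACK",
    105: "LDLM_CP_CALLBACK",
    106: "LDLM_GL_CALLBACK",
}


def parse_sending_rpc(line):
    """Split a 'Sending RPC' dk line into named fields, or return None."""
    idx = line.find(MARKER)
    if idx < 0:
        return None
    fields = line[idx + len(MARKER):].strip().split(":")
    if len(fields) != 6:
        # A field contained a colon or the line was truncated; give up.
        return None
    pname, cluuid, pid, xid, nid, opc = fields
    return {
        "pname": pname,
        "cluuid": cluuid,
        "pid": int(pid),
        "xid": int(xid),
        "nid": nid,
        "opc": int(opc),
        "opc_name": LDLM_OPCODES.get(int(opc), "unknown"),
    }


# The RPC line from these notes.
SAMPLE = (
    "00000100:00100000:35.0:1566497770.692328:0:29965:0:"
    "(client.c:1620:ptlrpc_send_new_req()) "
    "Sending RPC pname:cluuid:pid:xid:nid:opc "
    "ll_ost00_005:lustre-OST0000_UUID:29965:1642590389685184:10.0.15.157@o2ib10:104"
)

if __name__ == "__main__":
    rpc = parse_sending_rpc(SAMPLE)
    print(rpc["nid"], rpc["xid"], rpc["opc"], rpc["opc_name"])
```

Filtering the parsed records on the evicted client's NID and opcode 104 is one way to find the callback that the client failed to answer in time; the xid can then be searched for in the client log.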

...

we see our lock pointer appear as part of cleaning up an OSC extent. Looking at other operations on the OSC object in question, we find:

00000008:00000020:34.0F:1566497895.350825:0:10111:0:(osc_cache.c:3004:osc_cache_wait_range()) obj ffff999fcae4e140 ready 0|-|- wr 0|-|- rd 0|- sync file range.

From the code, it appears that it was waiting for part of the cache to be written out, and that wait did not finish until after the eviction.
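
To make "after the eviction" concrete, the timestamps of the two log lines can be compared directly. The sketch below is only illustrative: `dk_timestamp` is a hypothetical helper, both sample lines are truncated to their headers, and the comparison assumes the client and server clocks are reasonably well synchronized, which these notes do not state.

```python
import re
from decimal import Decimal

# The dk header carries a sec.usec timestamp as one colon-separated field.
TS_RE = re.compile(r":(\d+\.\d{6}):")


def dk_timestamp(line):
    """Return the sec.usec timestamp of a dk log line as a Decimal."""
    m = TS_RE.search(line)
    if m is None:
        raise ValueError("no dk timestamp found in line")
    return Decimal(m.group(1))


# Headers of the two lines from these notes (messages truncated).
SERVER_EVICTION = (
    "00010000:00020000:35.1F:1566497870.915689:0:0:0:"
    "(ldlm_lockd.c:334:waiting_locks_callback())"
)
CLIENT_WAIT_DONE = (
    "00000008:00000020:34.0F:1566497895.350825:0:10111:0:"
    "(osc_cache.c:3004:osc_cache_wait_range())"
)

if __name__ == "__main__":
    delta = dk_timestamp(CLIENT_WAIT_DONE) - dk_timestamp(SERVER_EVICTION)
    print(f"client wait finished {delta}s after the server-side eviction")
```

`Decimal` is used rather than `float` so that the microsecond digits survive the subtraction exactly.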


That's a good example of general eviction debug.  This next part is more detailed for the specific issue.


In this particular case, the following scenario appears to be occurring:

...