Overview

You need the correct vmlinux and the vmcore. The System.map is not needed if you have the correct vmlinux.

-> System.map depends on how the kernel is compiled.

-> vmlinux must be built with kernel debug info.

Debugging Kernel Crash dump

crash /boot/System.map-2.6.32.lustremaster vmlinux vmcore
--> vmlinux is located in ./BUILD/kernel-2.6.32.lustremaster/vmlinux
--> vmcore is located in /var/crash/*/vmcore

Using Crash

It helps to have the debug info for the modules. This will allow crash to display source code line numbers as well as enable it to know the Lustre/LNet structures, which can then be printed.

This is accomplished by installing the debug info rpm: lustre-debuginfo-*.rpm

Then you load the kernel modules (including Lustre/LNet modules):

Code Block
mod -S /usr/lib/debug/usr/lib/modules/

Once the modules are loaded you can run the following commands:

Code Block
# Display stack trace for crashed task
bt
 
# gives you stack trace for all the CPUs
bt -a
 
# gives you task list in condensed form
ps
 
# gives you more info on each call, including stack addresses
bt -f
 
# print stack traces for all tasks
foreach bt | less
 
# print the stack trace for wanted task
bt [<PID> | <task pointer>]
 
# to examine type definitions
whatis <type name>
 
# EXAMPLE:
crash> whatis the_lnet
lnet_t the_lnet
 
crash> whatis lnet_t
typedef struct {
    ....
} lnet_t;
 
# examining global variables
crash> the_lnet
the_lnet = $1 = {
  ln_cpt_table = 0xffff883cecde6940, 
  ln_cpt_number = 16, 
  ln_cpt_bits = 4, 
  ln_res_lock = 0xffff883ce8a8a1a0,
  ...
}
 
# examining local variables (print a structure at a known address)
<struct name> <address>
# EXAMPLE (more details below)
lnet_peer_ni <address>

Disassembling functions to find structure pointers and print

It is often necessary to print certain structures and their values while debugging. To do that we need to find the pointer to the structure in memory, which requires some understanding of AMD64 assembly and register usage:

Reference: http://www.x86-64.org/documentation/abi.pdf

First, disassemble function

Code Block
dis <function name>

Second, trace down the pointer. The best way to demonstrate this is through an example.

Disassemble and debug

An assert was triggered in lnet_destroy_peer_ni_locked(), whose signature is:

Code Block
lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)

The stack trace

Code Block
PID: 107343  TASK: ffff883cee985c00  CPU: 50  COMMAND: "socknal_sd05_00"
 #0 [ffff883ce36dbb38] machine_kexec at ffffffff81051beb
 #1 [ffff883ce36dbb98] crash_kexec at ffffffff810f2602
 #2 [ffff883ce36dbc68] panic at ffffffff8162eb21
 #3 [ffff883ce36dbce8] lbug_with_loc at ffffffffa0912ddb [libcfs]
 #4 [ffff883ce36dbd08] lnet_destroy_peer_ni_locked at ffffffffa09a2f96 [lnet]
 #5 [ffff883ce36dbd28] lnet_return_tx_credits_locked at ffffffffa0993cec [lnet]
 #6 [ffff883ce36dbd68] lnet_msg_decommit at ffffffffa0987630 [lnet]
 #7 [ffff883ce36dbd98] lnet_finalize at ffffffffa0987e19 [lnet]
 #8 [ffff883ce36dbe00] ksocknal_tx_done at ffffffffa087aed4 [ksocklnd]
 #9 [ffff883ce36dbe30] ksocknal_scheduler at ffffffffa087fc92 [ksocklnd]
#10 [ffff883ce36dbec8] kthread at ffffffff810a5acf
#11 [ffff883ce36dbf50] ret_from_fork at ffffffff81645998
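
When many dumps have to be compared, it can help to turn the frames of a trace like the one above into structured data. The following is a minimal Python sketch, assuming the `bt` output has been saved as text and that every frame line has the shape `#N [stack address] symbol at text address [module]`. The names `Frame`, `FRAME_RE` and `parse_bt` are hypothetical helpers, not part of crash.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    index: int             # frame number printed after '#'
    stack_addr: int        # bracketed stack address of the frame
    symbol: str            # function name
    text_addr: int         # address printed after 'at'
    module: Optional[str]  # module in trailing brackets, None for core kernel


# Assumed line shape (taken from the trace above):
#  #4 [ffff883ce36dbd08] lnet_destroy_peer_ni_locked at ffffffffa09a2f96 [lnet]
FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)"
    r"(?:\s+\[([^\]]+)\])?\s*$"
)


def parse_bt(text):
    """Return a list of Frame tuples for every frame line found in text."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            idx, sp, sym, ip, mod = m.groups()
            frames.append(Frame(int(idx), int(sp, 16), sym, int(ip, 16), mod))
    return frames


# Three frames copied from the trace above.
BT = """\
 #3 [ffff883ce36dbce8] lbug_with_loc at ffffffffa0912ddb [libcfs]
 #4 [ffff883ce36dbd08] lnet_destroy_peer_ni_locked at ffffffffa09a2f96 [lnet]
#10 [ffff883ce36dbec8] kthread at ffffffff810a5acf
"""

if __name__ == "__main__":
    for f in parse_bt(BT):
        print(f.index, f.symbol, f.module or "kernel", hex(f.stack_addr))
```

Lines that do not look like a frame (the PID header, raw stack words from bt -f) are simply skipped.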

We would like to print out the passed in parameter: lpni

According to the reference above (Figure 3.4: Register Usage):

Code Block
%rbx: callee-saved register; optionally used as base pointer
%rdi: used to pass 1st argument to functions

Our task is then to track the usage of %rdi and %rbx through the disassembled code.

First, disassemble the code for lnet_destroy_peer_ni_locked():

Code Block
crash> dis lnet_destroy_peer_ni_locked
0xffffffffa09a2cb0 <lnet_destroy_peer_ni_locked>:       nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa09a2cb5 <lnet_destroy_peer_ni_locked+5>:     push   %rbp
0xffffffffa09a2cb6 <lnet_destroy_peer_ni_locked+6>:     mov    %rsp,%rbp
0xffffffffa09a2cb9 <lnet_destroy_peer_ni_locked+9>:     push   %r12
0xffffffffa09a2cbb <lnet_destroy_peer_ni_locked+11>:    push   %rbx
0xffffffffa09a2cbc <lnet_destroy_peer_ni_locked+12>:    mov    0xb8(%rdi),%edx
0xffffffffa09a2cc2 <lnet_destroy_peer_ni_locked+18>:    mov    %rdi,%rbx
0xffffffffa09a2cc5 <lnet_destroy_peer_ni_locked+21>:    test   %edx,%edx
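
Looking for the register that receives the first argument can be done by eye, but for long functions a small script may help. The following is a minimal Python sketch, assuming the `dis` output is available as text in AT&T syntax as shown above. It only recognizes plain register-to-register `mov` instructions and register pushes, and it does not notice if the argument register is overwritten before the copy. The names `scan_prologue`, `MOV_RE` and `PUSH_RE` are hypothetical.

```python
import re

# Register-to-register move in AT&T syntax, e.g. "mov    %rdi,%rbx".
MOV_RE = re.compile(r"^\s*mov\w*\s+%(\w+),\s*%(\w+)")
# Register push, e.g. "push   %rbx".
PUSH_RE = re.compile(r"^\s*push\w*\s+%(\w+)")


def scan_prologue(dis_text, arg_reg="rdi"):
    """Return (pushed registers in order, registers that arg_reg is copied to)."""
    pushes, copies = [], []
    for line in dis_text.splitlines():
        # Keep only the instruction text after "<symbol+offset>:".
        insn = line.split(">:", 1)[-1]
        m = PUSH_RE.match(insn)
        if m:
            pushes.append(m.group(1))
            continue
        m = MOV_RE.match(insn)
        if m and m.group(1) == arg_reg:
            copies.append(m.group(2))
    return pushes, copies


# Disassembly copied from the crash session above.
DIS = """\
0xffffffffa09a2cb0 <lnet_destroy_peer_ni_locked>:       nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa09a2cb5 <lnet_destroy_peer_ni_locked+5>:     push   %rbp
0xffffffffa09a2cb6 <lnet_destroy_peer_ni_locked+6>:     mov    %rsp,%rbp
0xffffffffa09a2cb9 <lnet_destroy_peer_ni_locked+9>:     push   %r12
0xffffffffa09a2cbb <lnet_destroy_peer_ni_locked+11>:    push   %rbx
0xffffffffa09a2cbc <lnet_destroy_peer_ni_locked+12>:    mov    0xb8(%rdi),%edx
0xffffffffa09a2cc2 <lnet_destroy_peer_ni_locked+18>:    mov    %rdi,%rbx
0xffffffffa09a2cc5 <lnet_destroy_peer_ni_locked+21>:    test   %edx,%edx
"""

if __name__ == "__main__":
    pushes, copies = scan_prologue(DIS)
    print("pushed:", pushes)
    print("first argument copied to:", copies)
```

For this function the sketch reports that %rdi is copied into %rbx, which is the register to follow through the callees.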

We can see the instruction

Code Block
mov %rdi, %rbx

This copies the content of %rdi into %rbx. %rbx will probably be reused further down the call chain, but since it is a callee-saved register, any callee that uses it must first save its contents on the stack.

lbug_with_loc does save %rbx on the stack, so we go there to find the address. Disassemble lbug_with_loc.

...

Code Block
bt -f
 #3 [ffff883ce36dbce8] lbug_with_loc at ffffffffa0912ddb [libcfs]
    ffff883ce36dbcf0: ffff8fbcec316010 ffff8abccf727e00 
    ffff883ce36dbd00: ffff883ce36dbd20 ffffffffa09a2f96 

To interpret the stack: the bottom of the stack (the bottom right corner of the dump) is the first entry pushed. So the order of the items pushed on the stack is:

  1. ffffffffa09a2f96
  2. ffff883ce36dbd20
  3. ffff8abccf727e00
  4. ffff8fbcec316010
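
The ordering above can be reproduced mechanically. The following is a minimal Python sketch, assuming the raw lines of a `bt -f` frame have been saved as text, that each line has the form `address: word word`, and that every word is 8 bytes wide. The names `parse_dump` and `push_order` are hypothetical.

```python
def parse_dump(text):
    """Return {slot address: 64-bit value} from 'address: word word' lines."""
    slots = {}
    for line in text.splitlines():
        addr_part, sep, words = line.partition(":")
        if not sep:
            continue  # not a raw stack line (e.g. the frame header)
        try:
            base = int(addr_part.strip(), 16)
            values = [int(w, 16) for w in words.split()]
        except ValueError:
            continue  # skip anything that is not pure hex
        for i, value in enumerate(values):
            slots[base + 8 * i] = value
    return slots


def push_order(slots):
    """List values from highest to lowest address, i.e. oldest push first."""
    # The x86-64 stack grows towards lower addresses.
    return [slots[addr] for addr in sorted(slots, reverse=True)]


# Frame #3 copied from the bt -f output above.
DUMP = """\
 #3 [ffff883ce36dbce8] lbug_with_loc at ffffffffa0912ddb [libcfs]
    ffff883ce36dbcf0: ffff8fbcec316010 ffff8abccf727e00
    ffff883ce36dbd00: ffff883ce36dbd20 ffffffffa09a2f96
"""

if __name__ == "__main__":
    for n, value in enumerate(push_order(parse_dump(DUMP)), start=1):
        print(f"{n}. {value:016x}")
```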

The first entry is pushed by the call instruction, which pushes the return address onto the stack. In the above example:

Code Block
ffffffffa09a2f96 (the return address, recognizable by its leading ffffffff; sym <return address> shows the location in the calling function to which the callee returns when it is done)

...


0xffffffffa0912d35 <lbug_with_loc+5>:   push   %rbp ---> ffff883ce36dbd20

...


0xffffffffa0912d42 <lbug_with_loc+18>:  push   %rbx ---> ffff8abccf727e00

Then, knowing the type of the structure, we can print it out by providing the address:

Code Block
#> struct lnet_peer_ni ffff8abccf727e00
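
The stack slot that holds a saved register can also be computed rather than read off the dump. The following is a minimal Python sketch, assuming 8-byte stack slots, that the return address of the call into lbug_with_loc sits at ffff883ce36dbd08 (the address where it appears in the bt -f dump above), and that %rbp and %rbx are the first two registers pushed by lbug_with_loc, as the values in the walk-through suggest. The name `saved_register_slots` is hypothetical.

```python
WORD = 8  # bytes per stack slot on x86-64


def saved_register_slots(return_addr_slot, prologue_pushes):
    """Map each pushed register to the stack address holding its saved value.

    return_addr_slot: address of the slot written by the call instruction.
    prologue_pushes:  registers in the order the callee pushes them.
    """
    return {
        reg: return_addr_slot - WORD * (i + 1)
        for i, reg in enumerate(prologue_pushes)
    }


if __name__ == "__main__":
    # Assumed push order for lbug_with_loc: %rbp first, then %rbx.
    slots = saved_register_slots(0xFFFF883CE36DBD08, ["rbp", "rbx"])
    for reg, addr in slots.items():
        # 'rd <address>' is the crash command that reads memory at an address.
        print(f"%{reg} saved at {addr:x}  ->  crash> rd {addr:x}")
```

With these assumptions %rbx lands at ffff883ce36dbcf8, which is the slot containing ffff8abccf727e00 in the dump.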

...

Code Block
#> struct lnet_peer_ni.<fieldname> <address> 

To print all numerical untyped values in hex:

Code Block
#> set radix 16

The crash 'help' command provides further information.