Configuring kdump

If kdump is not configured on your system, the articles below explain how to configure it.

EL 6

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-kdump-configuration-cli.html

EL 7

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Kernel_Crash_Dump_Guide/sect-kdump-config-cli.html

Crash Overview

crash needs the vmlinux that matches the crashed kernel, plus the vmcore. The System.map is not needed if you have the correct vmlinux.

-> The System.map depends on how the kernel was compiled.

-> debug kernel info

Debugging Kernel Crash dump

crash /boot/System.map-2.6.32.lustremaster vmlinux vmcore
--> vmlinux is located in ./BUILD/kernel-2.6.32.lustremaster/vmlinux
--> /var/crash/*/vmcore
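As a minimal sketch of the invocation above, the helper below locates the newest vmcore and assembles the crash argument list. The function names are hypothetical, and the directory layout (/var/crash/*/vmcore) is taken from the note above; adjust the paths to your build tree.

```python
import glob
import os


def newest_vmcore(crash_dir="/var/crash"):
    """Return the most recently modified vmcore under crash_dir, or None."""
    candidates = glob.glob(os.path.join(crash_dir, "*", "vmcore"))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def crash_command(vmlinux, vmcore, system_map=None):
    """Build the argument list for a crash invocation.

    system_map is optional: per the overview, it is not needed when
    vmlinux matches the kernel that produced the vmcore.
    """
    args = ["crash"]
    if system_map:
        args.append(system_map)
    args.extend([vmlinux, vmcore])
    return args
```

The returned list can be passed to subprocess.call() or joined with spaces and pasted into a shell.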

Using Crash

It helps to have the debug info for the modules. This will allow crash to display source code line numbers as well as enable it to know the Lustre/LNet structures, which can then be printed.

This is accomplished by installing the debug info rpm: lustre-debuginfo-*.rpm

Then, inside crash, load the debug symbols for the kernel modules (including the Lustre/LNet modules):

mod -S /usr/lib/debug/usr/lib/modules/

Once the modules are loaded you can run the following commands:

# Display stack trace for crashed task
bt
 
# gives you stack trace for all the CPUs
bt -a
 
# gives you task list in condensed form
ps
 
# gives you more info on each call, including stack addresses
bt -f
 
# print back trace with line numbers
bt -l
 
# print stack traces for all tasks
foreach bt | less
 
# print the stack trace for wanted task
bt [<PID> | <task pointer>]
 
# to examine type definitions
whatis <type name>
 
# EXAMPLE:
crash> whatis the_lnet
lnet_t the_lnet
 
crash> whatis lnet_t
typedef struct {
    ....
} lnet_t;
 
# examining global variables
crash> the_lnet
the_lnet = $1 = {
  ln_cpt_table = 0xffff883cecde6940, 
  ln_cpt_number = 16, 
  ln_cpt_bits = 4, 
  ln_res_lock = 0xffff883ce8a8a1a0,
  ...
}
 
# examining local variables
<struct name> <address>
# EXAMPLE (more details below)
lnet_peer_ni <address>
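For repeated triage it can help to collect the commands above in a file and let crash run them. The sketch below only generates the command text; the function name is hypothetical, and running it with "crash -i <file> vmlinux vmcore" assumes your crash version supports the -i option.

```python
def crash_script(pids=(), structs=()):
    """Return crash commands, one per line, for a first-pass triage.

    pids    -- task PIDs whose full stack traces should be printed
    structs -- (type name, address) pairs to print with 'struct'
    """
    lines = ["mod -S /usr/lib/debug/usr/lib/modules/", "bt", "bt -a", "ps"]
    lines += ["bt -f %d" % pid for pid in pids]
    lines += ["struct %s %s" % (name, addr) for name, addr in structs]
    lines.append("exit")
    return "\n".join(lines) + "\n"
```

Write the returned string to a file, for example commands.txt, and review it before handing it to crash.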

Disassembling functions to find structure pointers and print

It is often necessary to print certain structures and their values when debugging. To do that we need to find the pointer to the structure in memory, which requires some understanding of AMD64 assembly and register usage:

reference http://www.x86-64.org/documentation/abi.pdf

First, disassemble the function:

dis <function name>

Second, trace down the pointer. Best way to demonstrate is through an example.

Disassemble and debug

We have the following assert triggered in lnet_destroy_peer_ni_locked().

lnet_destroy_peer_ni_locked(struct lnet_peer_ni *lpni)

The stack trace

PID: 107343  TASK: ffff883cee985c00  CPU: 50  COMMAND: "socknal_sd05_00"
 #0 [ffff883ce36dbb38] machine_kexec at ffffffff81051beb
 #1 [ffff883ce36dbb98] crash_kexec at ffffffff810f2602
 #2 [ffff883ce36dbc68] panic at ffffffff8162eb21
 #3 [ffff883ce36dbce8] lbug_with_loc at ffffffffa0912ddb [libcfs]
 #4 [ffff883ce36dbd08] lnet_destroy_peer_ni_locked at ffffffffa09a2f96 [lnet]
 #5 [ffff883ce36dbd28] lnet_return_tx_credits_locked at ffffffffa0993cec [lnet]
 #6 [ffff883ce36dbd68] lnet_msg_decommit at ffffffffa0987630 [lnet]
 #7 [ffff883ce36dbd98] lnet_finalize at ffffffffa0987e19 [lnet]
 #8 [ffff883ce36dbe00] ksocknal_tx_done at ffffffffa087aed4 [ksocklnd]
 #9 [ffff883ce36dbe30] ksocknal_scheduler at ffffffffa087fc92 [ksocklnd]
#10 [ffff883ce36dbec8] kthread at ffffffff810a5acf
#11 [ffff883ce36dbf50] ret_from_fork at ffffffff81645998
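When a trace like the one above is saved to a text file, the frames can be pulled apart with a small parser. This is a sketch that assumes the frame line layout shown above (frame number, bracketed stack address, symbol, text address, optional module); the function names are hypothetical.

```python
import re

FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)"
    r"(?:\s+\[([^\]]+)\])?\s*$"
)


def parse_frames(bt_text):
    """Parse 'bt' frame lines into dictionaries; other lines are skipped."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            number, frame_addr, symbol, addr, module = m.groups()
            frames.append({
                "number": int(number),
                "frame": int(frame_addr, 16),
                "symbol": symbol,
                "address": int(addr, 16),
                "module": module,  # None for core kernel symbols
            })
    return frames


def frames_in_module(frames, module):
    """Return the symbols of all frames that belong to the given module."""
    return [f["symbol"] for f in frames if f["module"] == module]
```

For the trace above, frames_in_module(frames, "lnet") lists the four LNet functions between lbug_with_loc and ksocknal_tx_done.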

We would like to print the passed-in parameter, lpni.

According to the reference above (Figure 3.4: Register Usage):

%rbx: callee-saved register; optionally used as base pointer
%rdi: used to pass 1st argument to functions

Our task is therefore to trace how rdi and rbx are used in the disassembled code.
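The register roles relevant here can be written down as a small lookup table. This sketch covers only integer and pointer arguments under the System V AMD64 calling convention; the helper names are hypothetical.

```python
# Integer/pointer argument registers, in order, per the System V AMD64 ABI.
ARG_REGISTERS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

# General-purpose registers a callee must preserve, so it pushes them
# on the stack before reusing them.
CALLEE_SAVED = {"rbx", "rbp", "r12", "r13", "r14", "r15"}


def arg_register(position):
    """Return the register holding the Nth (1-based) integer argument."""
    if not 1 <= position <= len(ARG_REGISTERS):
        raise ValueError("argument %d is passed on the stack" % position)
    return ARG_REGISTERS[position - 1]


def survives_call(register):
    """True if a called function must restore this register before returning."""
    return register in CALLEE_SAVED
```

Since lpni is the first argument it arrives in rdi, which does not survive calls; a copy in rbx does, which is why the compiler moves it there.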

First disassemble the code for lnet_destroy_peer_ni_locked()

crash> dis lnet_destroy_peer_ni_locked
0xffffffffa09a2cb0 <lnet_destroy_peer_ni_locked>:       nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa09a2cb5 <lnet_destroy_peer_ni_locked+5>:     push   %rbp
0xffffffffa09a2cb6 <lnet_destroy_peer_ni_locked+6>:     mov    %rsp,%rbp
0xffffffffa09a2cb9 <lnet_destroy_peer_ni_locked+9>:     push   %r12
0xffffffffa09a2cbb <lnet_destroy_peer_ni_locked+11>:    push   %rbx
0xffffffffa09a2cbc <lnet_destroy_peer_ni_locked+12>:    mov    0xb8(%rdi),%edx
0xffffffffa09a2cc2 <lnet_destroy_peer_ni_locked+18>:    mov    %rdi,%rbx
0xffffffffa09a2cc5 <lnet_destroy_peer_ni_locked+21>:    test   %edx,%edx
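As a sketch of what to look for in a listing like this one, the helper below scans saved 'dis' output for register-to-register moves out of a given source register. It assumes the AT&T operand order shown above (source first) and the "<symbol+offset>:" label format; the function name is hypothetical.

```python
import re

MOV_RE = re.compile(r"\bmov\s+%(\w+),\s*%(\w+)")


def register_copies(dis_text, source):
    """Return (symbol+offset, destination) for each 'mov %source,%dest'."""
    copies = []
    for line in dis_text.splitlines():
        m = MOV_RE.search(line)
        if m and m.group(1) == source:
            # "0x... <func+18>:  mov ..." -> "func+18"
            label = line.split(">:")[0].split("<")[-1]
            copies.append((label, m.group(2)))
    return copies
```

Memory operands such as 0xb8(%rdi) are deliberately ignored, since they read through the pointer rather than copy it.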

We can see the instruction

mov %rdi, %rbx

This copies the contents of %rdi into %rbx. %rbx will probably be reused further down the call stack, but because it is a callee-saved register, any callee that reuses it must first save its contents on the stack.

lbug_with_loc() reuses %rbx, so it saves the caller's value on the stack first; that is where we look for the address. Disassemble lbug_with_loc():

crash> dis lbug_with_loc
0xffffffffa0912d30 <lbug_with_loc>:     nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa0912d35 <lbug_with_loc+5>:   push   %rbp
0xffffffffa0912d36 <lbug_with_loc+6>:   xor    %eax,%eax
0xffffffffa0912d38 <lbug_with_loc+8>:   mov    $0xffffffffa092fe94,%rsi
0xffffffffa0912d3f <lbug_with_loc+15>:  mov    %rsp,%rbp
0xffffffffa0912d42 <lbug_with_loc+18>:  push   %rbx <<<<<<<<< pushes it on the stack
0xffffffffa0912d43 <lbug_with_loc+19>:  mov    %rdi,%rbx
0xffffffffa0912d46 <lbug_with_loc+22>:  sub    $0x8,%rsp
0xffffffffa0912d4a <lbug_with_loc+26>:  movl   $0x1,0x4ca54(%rip)        # 0xffffffffa095f7a8 <libcfs_catastrophe>
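The push order in a prologue like this one determines where each saved register sits relative to the return address. The sketch below extracts that order from saved 'dis' output; it assumes 8-byte pushes and stops at the first call instruction, and the function names are hypothetical.

```python
import re

PUSH_RE = re.compile(r"\bpush\s+%(\w+)")
CALL_RE = re.compile(r"\bcallq?\b")


def prologue_pushes(dis_text):
    """Return registers pushed before the first call, in push order."""
    pushed = []
    for line in dis_text.splitlines():
        # Look only at the instruction, not the "<symbol+offset>:" label.
        instruction = line.split(">:", 1)[-1]
        if CALL_RE.search(instruction):
            break
        m = PUSH_RE.search(instruction)
        if m:
            pushed.append(m.group(1))
    return pushed


def slot_offsets(pushed):
    """Map each pushed register to its byte offset from the return-address slot."""
    return {reg: -8 * (i + 1) for i, reg in enumerate(pushed)}
```

For lbug_with_loc this gives rbp at -8 and rbx at -16 relative to the slot that holds the return address.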

View the stack for lbug_with_loc()

bt -f
 #3 [ffff883ce36dbce8] lbug_with_loc at ffffffffa0912ddb [libcfs]
    ffff883ce36dbcf0: ffff8fbcec316010 ffff8abccf727e00 
    ffff883ce36dbd00: ffff883ce36dbd20 ffffffffa09a2f96 

To interpret the stack: the stack grows toward lower addresses, so the bottom of the dump (bottom right corner) is the first entry pushed. The order in which the items went onto the stack is therefore

  1. ffffffffa09a2f96
  2. ffff883ce36dbd20
  3. ffff8abccf727e00
  4. ffff8fbcec316010

The first entry is pushed by the call instruction, which pushes the return address on the stack. In the above example

ffffffffa09a2f96 (the return address, recognizable by its ffffffff prefix as a kernel or module code address; 'sym ffffffffa09a2f96' shows the location in the caller to which execution would return)
0xffffffffa0912d35 <lbug_with_loc+5>:   push   %rbp ---> ffff883ce36dbd20
0xffffffffa0912d42 <lbug_with_loc+18>:  push   %rbx ---> ffff8abccf727e00

The fourth entry, ffff8fbcec316010, is not a saved register: it occupies the 8 bytes reserved by 'sub $0x8,%rsp'.

Then, knowing the type of the structure, we can print it by providing the address:

#> struct lnet_peer_ni ffff8abccf727e00
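The stack walk that produced this address can be sketched in a few lines. The code below assumes the 'bt -f' layout shown earlier (an address followed by 64-bit words) and that the caller supplies the address of the slot holding the return address, here ffff883ce36dbd08, the last word of the dump; the function names are hypothetical.

```python
import re

WORDS_RE = re.compile(r"^\s*([0-9a-f]{16}):((?:\s+[0-9a-f]{16})+)\s*$")


def stack_words(bt_f_text):
    """Return {address: value} for the raw stack lines printed by 'bt -f'."""
    words = {}
    for line in bt_f_text.splitlines():
        m = WORDS_RE.match(line)
        if not m:
            continue  # frame header or unrelated line
        base = int(m.group(1), 16)
        for i, value in enumerate(m.group(2).split()):
            words[base + 8 * i] = int(value, 16)
    return words


def saved_registers(words, return_slot, pushed):
    """Read saved registers stored below the return-address slot.

    pushed lists the registers in the order the callee's prologue pushes them.
    """
    return {reg: words[return_slot - 8 * (i + 1)]
            for i, reg in enumerate(pushed)}
```

With pushed = ["rbp", "rbx"], the rbx entry is the lpni pointer that is passed to the struct command.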

To print a field in the structure you can:

#> struct lnet_peer_ni.<fieldname> <address> 

To print all numerical untyped values in hex:

#> set radix 16

The crash 'help' command provides further information.

More Crash Commands

# show how kernel memory is allocated across the slab caches
crash> kmem -s
 
...
ffff88007fa809c0 idr_layer_cache          544        294       301     43     4k
ffff88007fa60980 size-4194304(DMA)    4194304          0         0      0  4096k
ffff88007fa50940 size-4194304         4194304          0         0      0  4096k
ffff88007fa40900 size-2097152(DMA)    2097152          0         0      0  2048k
ffff88007fa308c0 size-2097152         2097152          1         1      1  2048k
ffff88007fa20880 size-1048576(DMA)    1048576          0         0      0  1024k
ffff88007fa10840 size-1048576         1048576         64        64     64  1024k
ffff88007fa00800 size-524288(DMA)      524288          0         0      0   512k
ffff88007f9f07c0 size-524288           524288          0         0      0   512k
ffff88007f9e0780 size-262144(DMA)      262144          0         0      0   256k
ffff88007f9d0740 size-262144           262144         64        64     64   256k
ffff88007f9c0700 size-131072(DMA)      131072          0         0      0   128k
ffff88007f9b06c0 size-131072           131072          7         7      7   128k
ffff88007f9a0680 size-65536(DMA)        65536          0         0      0    64k
ffff88007f990640 size-65536             65536          3         3      3    64k
ffff88007f980600 size-32768(DMA)        32768          0         0      0    32k
ffff88007f9705c0 size-32768             32768         26        26     26    32k
ffff88007f960580 size-16384(DMA)        16384          0         0      0    16k
ffff88007f950540 size-16384             16384         24        26     26    16k
ffff88007f940500 size-8192(DMA)          8192          0         0      0     8k
ffff88007f9304c0 size-8192               8192        839       844    844     8k
ffff88007f920480 size-4096(DMA)          4096          0         0      0     4k
ffff88007f910440 size-4096               4096        702       735    735     4k
ffff88007f900400 size-2048(DMA)          2048          0         0      0     4k
ffff88007f8f03c0 size-2048               2048        791       862    431     4k
ffff88007f8e0380 size-1024(DMA)          1024          0         0      0     4k
ffff88007f8d0340 size-1024               1024       1966      2188    547     4k
ffff88007f8c0300 size-512(DMA)            512          0         0      0     4k
ffff88007f8b02c0 size-512                 512    2326695   2326704 290838     4k
ffff88007f8a0280 size-256(DMA)            256          0         0      0     4k
ffff88007f890240 size-256                 256    1162648   1162650  77510     4k
ffff88007f880200 size-192(DMA)            192          0         0      0     4k
ffff88007f8701c0 size-192                 192       3900      6340    317     4k
ffff88007f860180 size-128(DMA)            128          0         0      0     4k
ffff88007f850140 size-64(DMA)              64          0         0      0     4k
ffff88007f840100 size-64                   64      12403     13983    237     4k
ffff88007f8300c0 size-32(DMA)              32          0         0      0     4k
ffff88007f810080 size-128                 128     295891    295920   9864     4k
ffff88007f800040 size-32                   32    1181468   1181488  10549     4k
ffffffff81ad3620 kmem_cache             32896        240       240    240    64k
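A listing this long is easier to read once it is ranked by memory held in allocated objects. The sketch below assumes the seven-column layout above (CACHE, NAME, OBJSIZE, ALLOCATED, TOTAL, SLABS, SSIZE) and cache names without spaces; the function names are hypothetical.

```python
def parse_kmem_s(text):
    """Parse 'kmem -s' rows into dictionaries; header and '...' lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[2].isdigit():
            continue
        cache, name, objsize, allocated, total, slabs, ssize = fields
        rows.append({
            "cache": cache,
            "name": name,
            "objsize": int(objsize),
            "allocated": int(allocated),
            "total": int(total),
            "slabs": int(slabs),
            "ssize": ssize,
        })
    return rows


def largest_consumers(rows, count=3):
    """Return (name, bytes held by allocated objects) for the biggest caches."""
    ranked = sorted(rows, key=lambda r: r["objsize"] * r["allocated"],
                    reverse=True)
    return [(r["name"], r["objsize"] * r["allocated"]) for r in ranked[:count]]
```

In the listing above, size-512 and size-256 stand out this way, which makes them natural candidates for a closer look with kmem -S.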


# Show all the slabs and objects that belong to the cache at the given address
crash> kmem -S <memory address>
 
# example:
crash> kmem -S ffff88007f8d0340
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff88007f8d0340 size-1024               1024       1966      2188    547     4k
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff880037e52e40  ffff880059b90000      4          0     4
FREE / [ALLOCATED]
   ffff880059b90000  (shared cache)
   ffff880059b90400  (shared cache)
   ffff880059b90800  (shared cache)
   ffff880059b90c00
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff880044f9a600  ffff880044f62000      4          1     3
FREE / [ALLOCATED]
   ffff880044f62000  (shared cache)
   ffff880044f62400
  [ffff880044f62800]
   ffff880044f62c00  (shared cache)
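To go from this output to printable objects, the allocated addresses (the ones in square brackets) can be extracted and turned into crash commands. This is a sketch; the structure type is a guess the user has to supply, and the function names are hypothetical.

```python
import re

ALLOCATED_RE = re.compile(r"^\s*\[([0-9a-f]+)\]", re.MULTILINE)


def allocated_objects(kmem_S_text):
    """Return addresses shown in [brackets], i.e. objects that are allocated."""
    return [int(addr, 16) for addr in ALLOCATED_RE.findall(kmem_S_text)]


def print_commands(addresses, type_name):
    """Build one crash command per address, e.g. 'lnet_msg_t ffff...'."""
    return ["%s %x" % (type_name, addr) for addr in addresses]
```

The "FREE / [ALLOCATED]" legend line is not matched because it does not start with a bracketed hexadecimal address.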


# Each address listed is the beginning of an allocation. Potentially you can print the memory at this address to see what it contains.
# example:
crash> lnet_msg_t ffff880044f62800
 
# or
 
# display memory as 64-bit values starting at the given address; here, 23 values are printed
crash> rd -64 ffff88007f82a800 23
 
# You can pipe the output to 'tail' to see the tail end of the output.
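To check how much memory an rd command covers, the arithmetic can be sketched as follows. The helpers are hypothetical and only handle the fixed-width display options (-8, -16, -32, -64).

```python
def rd_range(start, count, bits=64):
    """Return (first address, end address) covered by 'rd -<bits> <start> <count>'.

    The end address is exclusive.
    """
    width = bits // 8  # bytes per displayed value
    return start, start + count * width


def rd_command(start, count, bits=64):
    """Build the rd command that dumps count values of the given width."""
    return "rd -%d %x %d" % (bits, start, count)
```

For the example above, 23 values of 8 bytes each cover 184 bytes, from ffff88007f82a800 up to but not including ffff88007f82a8b8.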
 

Resources

Below are some resources that explain the registers and the architecture.

 

 
