...
- For client-node sourced crashes we typically use distro packaging, so you will just need to obtain the matching kernel-debuginfo rpm (not sure what the story is on deb-based distros like Ubuntu). It's easy to tell it's a non-DDN kernel: run uname -r to get the kernel version - it will look something like 4.18.0-553.40.1.el8_10.x86_64 - and observe that the result does not contain lustre and/or ddn. Provide a dev-accessible link to this kernel-debuginfo rpm (it might not be a bad idea to test it first per the instructions below to make sure it matches).
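A rough sketch of that check and download on a RHEL-family client (assuming dnf-plugins-core is installed for the download subcommand and a debuginfo repo is reachable from the node; exact repo names vary by distro, so treat this as illustrative):
uname -r
# e.g. 4.18.0-553.40.1.el8_10.x86_64 - no "lustre"/"ddn" in it, so it is a stock distro kernel
dnf download --enablerepo='*debug*' kernel-debuginfo-$(uname -r)
If the node has no debuginfo repos configured, grab the same kernel-debuginfo rpm from the distro's debuginfo mirror instead.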
- On the exascaler VMs and on dedicated Lustre servers the kernels are built by DDN. If you do uname -r the result will be something like 5.14.0-427.31.1_lustre.el9.x86_64 - those can be obtained on VPN at https://fse01-co-es.datadirect.com/artifacts/exascaler
- Sometimes the currently running kernel does not match what was running when the crash dump was generated (especially if you are looking for this info a while after the crash), so the most robust way to confirm the correct version is to check inside vmcore-dmesg.txt or inside the vmcore itself.
- In vmcore-dmesg.txt, if you are lucky, the very first line reads something like:
Linux version 5.14.0-427.31.1_lustre.el9.x86_64 (jenkins@onyx-202-el9-x8664-1.onyx.whamcloud.com) (gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), GNU ld version 2.35.2-43.el9) #1 SMP PREEMPT_DYNAMIC Sat Nov 16 02:13:15 UTC 2024
The build time is particularly important because we sometimes build several otherwise identically versioned kernels at different times, so they might not match!
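If the banner is not the first line (it can scroll off in a long dmesg), it can be pulled out of vmcore-dmesg.txt directly, e.g. with GNU grep:
grep -m1 'Linux version' vmcore-dmesg.txt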
- to get the same thing from the vmcore itself you will run (in your linux terminal):
strings vmcore | grep -B 1 SMP
That will output two strings like the ones below; press ^C once you have them (strings scans the whole dump, which can take a while). One is the version and the other is the build date.
4.18.0-553.22.1.el8_lustre.ddn17.x86_64
#1 SMP Sat Oct 5 00:58:15 UTC 2024
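If scanning a huge vmcore with strings is too slow, a quicker partial check (assuming a reasonably recent crash utility) is:
crash --osrelease vmcore
That prints only the release string from the dump header, not the build date, so you may still need the strings/dmesg method above for the timestamp.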
- Select the release you think the node was running at the time (kind of important). After going to that directory you will see a directory named lustre, and inside it there is going to be a repo. It could be that (especially for older releases) the structure is different and you'll have to wander around random directories trying to find where the kernels live. For the exa 6.3.1 matching the above output, the proper kernel-debuginfo is going to be in either http://fse01.co-es.datadirectnet.com/artifacts/exascaler/6.3.1/lustre/lustre_repo/rhel8.10/lustre/ or http://fse01.co-es.datadirectnet.com/artifacts/exascaler/6.3.1/lustre/lustre_repo/rocky8.10/lustre/ - pick the kernel based on the version string and check that the date is roughly the same as what you saw above; if the date is too far off it's likely the wrong kernel (this will save you the effort of discovering that in the crash tool), as in the quick check below.
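To compare version and build date before unpacking anything, one option is to query the downloaded rpm's headers (the filename below is just an example; the rpm build time normally lands in the same build run as the kernel's "#1 SMP" timestamp, so it only needs to be roughly close):
rpm -qp --queryformat '%{VERSION}-%{RELEASE}.%{ARCH} built %{BUILDTIME:date}\n' kernel-debuginfo-4.18.0-553.22.1.el8_lustre.ddn17.x86_64.rpm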
- Once you have that package, unpack it somewhere and grab the vmlinux file from inside (typically the path is usr/lib/debug/lib/modules/<kernelversion>/vmlinux), for example:
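A minimal way to do that on any RHEL-family box (again, the rpm filename is just an example):
mkdir debuginfo && cd debuginfo
rpm2cpio ../kernel-debuginfo-4.18.0-553.22.1.el8_lustre.ddn17.x86_64.rpm | cpio -idm
ls usr/lib/debug/lib/modules/*/vmlinux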
- make sure you have crash installed on the node you are doing this on and run crash vmcore vmlinux
- you should not see any errors or warnings, and the output will look something like this:
For help, type "help".
Type "apropos word" to search for commands related to "word"...KERNEL: vmlinux [TAINTED]
DUMPFILE: nbp27-srv10/crash/127.0.0.1-2025-02-27-09:39:47/vmcore [PARTIAL DUMP]
CPUS: 20
DATE: Thu Feb 27 12:40:03 EST 2025
UPTIME: 00:13:44
LOAD AVERAGE: 13.87, 12.88, 7.60
TASKS: 2453
NODENAME: nbp27-srv10
RELEASE: 4.18.0-553.22.1.el8_lustre.ddn17.x86_64
VERSION: #1 SMP Sat Oct 5 00:58:15 UTC 2024
MACHINE: x86_64 (2099 Mhz)
MEMORY: 150 GB
PANIC: ""
PID: 43038
COMMAND: "mdt_rdpg00_003"
TASK: ffff94fe54834000 [THREAD_INFO: ffff94fe54834000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash>
- if you see a warning about a kernel and vmcore mismatch, be it the date or anything else, even if it ultimately loaded - this is the wrong kernel and you need to keep looking, as otherwise the results might not match up in the end.
- Once you have arrived at the proper kernel-debuginfo, set it aside and remember the link - you'll need to provide it in the ticket, but since not all devs can be on the VPN, they might ask you for the rpm itself.
...