Basics
For a vmcore to be useful it needs to be a full vmcore (incomplete ones are mostly useless) and it needs to come with debuginfo packages both for the kernel the system was running and for the Lustre modules loaded at the time of the crash.
There are several potential sources for the kernel debuginfo.
- For client-node sourced crashes we typically use distro packaging, so you will just need to obtain the matching kernel-debuginfo rpm (not sure what the story is on deb-based distros like Ubuntu). It is easy to tell it is a non-DDN kernel: run uname -r to get the kernel version, which will look something like 4.18.0-553.40.1.el8_10.x86_64, and check that the output does not contain lustre and/or ddn. Provide a link to this kernel-debuginfo rpm file that is accessible to the devs (it might not be a bad idea to test it first per the instructions below to make sure it matches).
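For example, a quick check along these lines (a sketch; the version string will of course differ on your node):
uname -r                          # e.g. 4.18.0-553.40.1.el8_10.x86_64 on a stock distro kernel
uname -r | grep -iE 'lustre|ddn'  # no output here means it is not a DDN/Lustre-patched kernel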
- On the exascaler VMs and on dedicated Lustre servers the kernels are built by DDN. If you do uname -r the result will be something like 5.14.0-427.31.1_lustre.el9.x86_64; those kernels can be obtained on the VPN at https://fse01-co-es.datadirect.com/artifacts/exascaler
- Sometimes the currently running kernel does not match what was running when the crash dump was generated (especially if you are looking for this info a while after the crash), so the most robust way to confirm the correct version is to check inside vmcore-dmesg.txt or inside the vmcore itself.
- In vmcore-dmesg.txt, if you are lucky, the very first line reads something like
Linux version 5.14.0-427.31.1_lustre.el9.x86_64 (jenkins@onyx-202-el9-x8664-1.onyx.whamcloud.com) (gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), GNU ld version 2.35.2-43.el9) #1 SMP PREEMPT_DYNAMIC Sat Nov 16 02:13:15 UTC 2024 - the build time is particularly important because we sometimes build several otherwise identically versioned kernels at different times, so they might not match!
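If it is not on the first line, a plain grep over the file (which normally sits next to the vmcore in the crash directory) will find it:
grep 'Linux version' vmcore-dmesg.txt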
- To get the same information from the vmcore itself, run (in your Linux terminal):
strings vmcore | grep -B 1 SMP
That will output two strings like the ones below; press ^C once you have them. One is the version and the other is the build date.
4.18.0-553.22.1.el8_lustre.ddn17.x86_64
#1 SMP Sat Oct 5 00:58:15 UTC 2024
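If you already have the crash utility installed, it should also be able to report the release directly, which is a handy cross-check (a convenience, not a requirement):
crash --osrelease vmcore    # prints just the kernel release string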
- Select the release you think the node was running at the time (this is kinda important). After going into that release's directory you will see a directory named lustre, and inside it there is going to be a repo. It could be that (especially for older releases) the structure is different and you will have to wander around the directories to find where the kernels live. For exa 6.3.1, matching the output above, the proper kernel-debuginfo is going to be in either http://fse01.co-es.datadirectnet.com/artifacts/exascaler/6.3.1/lustre/lustre_repo/rhel8.10/lustre/ or http://fse01.co-es.datadirectnet.com/artifacts/exascaler/6.3.1/lustre/lustre_repo/rocky8.10/lustre/ - pick the kernel based on the version string and check whether the date is roughly the same as what you see above. If the date is too far off it is likely the wrong kernel (catching that here will save you the effort of discovering it in the crash tool).
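One way to double-check the build date up front, as a sketch using the 6.3.1 path above (the rpm file name is a placeholder; substitute whatever matches your version string):
wget http://fse01.co-es.datadirectnet.com/artifacts/exascaler/6.3.1/lustre/lustre_repo/rhel8.10/lustre/kernel-debuginfo-<version>.rpm
rpm -qpi kernel-debuginfo-<version>.rpm | grep 'Build Date'   # compare against the date from the vmcore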
- Once you have that package, unpack it somewhere and grab the vmlinux file from inside (typically the path is /usr/lib/debug/lib/modules/<kernel version>/vmlinux).
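A common way to unpack it without installing anything, assuming rpm2cpio and cpio are available (the rpm name is again a placeholder):
mkdir kernel-debuginfo-extracted && cd kernel-debuginfo-extracted
rpm2cpio ../kernel-debuginfo-<version>.rpm | cpio -idm
ls usr/lib/debug/lib/modules/*/vmlinux    # this is the file you want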
- Make sure you have crash installed on the node you are doing this on and run: crash vmcore vmlinux
- You should not see any errors or warnings, and the output will look something like this:
For help, type "help".
Type "apropos word" to search for commands related to "word"...
KERNEL: vmlinux [TAINTED]
DUMPFILE: nbp27-srv10/crash/127.0.0.1-2025-02-27-09:39:47/vmcore [PARTIAL DUMP]
CPUS: 20
DATE: Thu Feb 27 12:40:03 EST 2025
UPTIME: 00:13:44
LOAD AVERAGE: 13.87, 12.88, 7.60
TASKS: 2453
NODENAME: nbp27-srv10
RELEASE: 4.18.0-553.22.1.el8_lustre.ddn17.x86_64
VERSION: #1 SMP Sat Oct 5 00:58:15 UTC 2024
MACHINE: x86_64 (2099 Mhz)
MEMORY: 150 GB
PANIC: ""
PID: 43038
COMMAND: "mdt_rdpg00_003"
TASK: ffff94fe54834000 [THREAD_INFO: ffff94fe54834000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash>
- If you see a warning about a kernel and vmcore mismatch, be it the date or anything else, even if it ultimately loaded - this is the wrong kernel and you need to keep looking, as otherwise results might not match up in the end.
- Once you have arrived at the proper kernel-debuginfo, set it aside and remember the link - you will need to provide it in the ticket, but since not all devs can be on the VPN, they might ask you for the rpm itself.
Lustre modules
Now that we have our kernel - hopefully you know exactly which Lustre version was running. It is going to look something like 2.14.0-ddn185, and if you are lucky, for a kernel crash the matching kmod-lustre-debuginfo package (with that version appended) is going to sit right next to the kernel you were looking for. You will also need all the other kmod-lustre-*-debuginfo packages for the same version, with the possible exception of kmod-lustre-tests-debuginfo. You will need to provide the location of these files to the devs as well.
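As an illustration of the set you are after once you are in the right repo directory (hypothetical file names; the exact version/dist/arch suffixes will vary):
ls kmod-lustre*debuginfo*   # grab everything listed, except kmod-lustre-tests-debuginfo which is usually not needed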
For client node crashes - customers typically build these themselves, so unlike the server builds we do not have them on our servers, and you might have to locate them at the customer site. The naming is going to be as described above. If DKMS is in use, the artifacts are likely somewhere under /var/lib/dkms too.
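If DKMS was used on the client, something along these lines may turn up the build tree (the exact layout varies by distro and DKMS version):
find /var/lib/dkms -maxdepth 3 -iname 'lustre*'   # then look under it for the built modules and any debug symbols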
Maloo
For autotest/maloo crashes the client and kernel artifacts are on https://build.whamcloud.com and might also be linked from the maloo test run page itself, though not every lustre build rebuilds the kernel, so some searching might be required. Pay attention to the node name/role to pick the right packages.
Note that customer installs never run maloo builds, so even if you do see a same-versioned package on the regular jenkins that is accessible outside the VPN - it is the wrong package, don't use it.