Customers sometimes report that a Lustre mount fails after a reboot. In many cases the cause turns out to be slow InfiniBand (IB) interface initialization: LNet fails to start properly because its assigned interfaces are still down.

This article, https://support.hpe.com/hpesc/public/docDisplay?docId=a00040272en_us&docLocale=en_US, provides background on at least one possible cause of the issue and recommends setting POST_START_DELAY=90 in /etc/openibd.conf as a workaround.

Another workaround is to use lnet.service. The service looks like this:

lnet.service
/etc/systemd/system/multi-user.target.wants/lnet.service:

[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/usr/bin/sleep 90
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs

Note the line "ExecStartPre=/usr/bin/sleep 90", which has been added to give the network interfaces enough time to come up before LNet tries to use them at start-up. This can be improved by replacing the single long "sleep" with a loop that polls the interfaces at short intervals and gives up after a timeout.
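The polling approach could be sketched roughly as follows. This is a minimal illustration, not a tested replacement for the sleep. It assumes that an interface is ready once /sys/class/net/<interface>/operstate reads "up", which may not be a sufficient condition on every fabric. The function name wait_for_interfaces and the SYSFS_NET override (used only to try the function without real hardware) are hypothetical names introduced here.

```shell
#!/bin/sh
# Sketch: poll the listed interfaces until all report operstate "up",
# or give up after a timeout. Assumption: operstate "up" means the
# interface is usable by LNet.
# Usage: wait_for_interfaces <timeout_sec> <interval_sec> <iface>...

wait_for_interfaces() {
    timeout=$1     # total number of seconds to wait
    interval=$2    # seconds to sleep between polls
    shift 2
    # SYSFS_NET is a hypothetical override for testing; defaults to sysfs.
    sysfs=${SYSFS_NET:-/sys/class/net}
    elapsed=0
    while :; do
        pending=""
        for ifname in "$@"; do
            # A missing interface yields an empty state and counts as not ready.
            state=$(cat "$sysfs/$ifname/operstate" 2>/dev/null || true)
            if [ "$state" != "up" ]; then
                pending="$pending $ifname"
            fi
        done
        if [ -z "$pending" ]; then
            return 0    # every interface is up
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for:$pending" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Example call (interface names taken from the lnet.conf example):
#   wait_for_interfaces 90 2 ib0 ib1
```

To use something like this, the function and the example call could be saved as a script (for instance under a hypothetical path such as /usr/local/sbin/wait-for-ib.sh) and referenced from ExecStartPre in place of the sleep. Because the function returns non-zero on timeout, systemd would then mark the unit as failed instead of letting LNet start against interfaces that are down.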

The workaround relies on "/etc/lnet.conf" containing a valid configuration for the node. The default lnet.conf typically appears to contain the commented-out output of the "lnetctl export" command. Many users prefer a simplified lnet.conf that lists nothing but the interfaces to use for each LNet network, for example:


lnet.conf
net:
    - net type: o2ib1
      local NI(s):
        - interfaces:
              0: ib0
    - net type: o2ib4
      local NI(s):
        - interfaces:
              0: ib1


