...

It has been observed that a mount can hang for a long time if a discovery ping is not responded to. This can happen if an OST is down while a client mounts the file system. In this case it does not make sense to hold up the mount procedure while discovery is taking place. For some cases, such as discovery, the algorithm would therefore specify a timeout different from the configured one.

Other cases where a timeout can be specified that overrides the configured timeout are router ping and manual ping.
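
As an illustration of the override idea, the following is a minimal sketch and not LNet code. The names `MsgKind`, `effective_timeout`, and `CONFIGURED_TIMEOUT_S` are hypothetical and introduced only for this example; it assumes that discovery, router ping, and manual ping may carry their own timeout, while every other message falls back to the configured value.

```python
from enum import Enum
from typing import Optional


class MsgKind(Enum):
    """Hypothetical message categories, used only for this illustration."""
    DATA = "data"
    DISCOVERY = "discovery"
    ROUTER_PING = "router_ping"
    MANUAL_PING = "manual_ping"


# Assumed configured timeout in seconds (matches the 50s default discussed here).
CONFIGURED_TIMEOUT_S = 50

# Message kinds that are allowed to override the configured timeout.
OVERRIDABLE = {MsgKind.DISCOVERY, MsgKind.ROUTER_PING, MsgKind.MANUAL_PING}


def effective_timeout(kind: MsgKind,
                      override_s: Optional[int] = None,
                      configured_s: int = CONFIGURED_TIMEOUT_S) -> int:
    """Return the timeout to apply to a message of the given kind.

    An override is honoured only for the kinds listed in OVERRIDABLE;
    all other messages use the configured timeout.
    """
    if kind in OVERRIDABLE and override_s is not None:
        return override_s
    return configured_s
```

With this sketch, a discovery ping sent with `override_s=5` would time out after 5s, while an ordinary message would still use the configured value even if an override were passed.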

One issue to consider is that the LND transmit timeout currently defaults to 50s. If we retry up to five times, we could be held up for 250s, which would be unacceptable.
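
The arithmetic behind this concern is simple: each attempt can block for the full transmit timeout, so the worst-case hold-up is the timeout multiplied by the number of attempts. The sketch below assumes five attempts in total; the helper name `worst_case_holdup_s` and the 10s comparison value are hypothetical and chosen only for illustration.

```python
def worst_case_holdup_s(transmit_timeout_s: int, attempts: int) -> int:
    """Worst-case blocking time if every attempt runs to its full timeout."""
    if transmit_timeout_s < 0 or attempts < 0:
        raise ValueError("timeout and attempts must be non-negative")
    return transmit_timeout_s * attempts


# Default LND transmit timeout of 50s with five attempts.
print(worst_case_holdup_s(50, 5))  # 250

# A shorter, purely illustrative 10s timeout with the same five attempts.
print(worst_case_holdup_s(10, 5))  # 50
```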

The question to answer is: does it make sense for the LND transmit timeout to be set to 50s? Even though the IB/TCP/GNI timeout can be long, it might make more sense to pre-empt that communication stack and attempt to resend the message from the LNet layer on a different interface, or even reuse the same interface if only one is available.
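
One way to picture the resend choice is a simple rotation over the available interfaces. This is a sketch under stated assumptions, not LNet's actual selection algorithm: the function `next_interface` and the interface names in the usage note are hypothetical, and health or credit considerations are ignored.

```python
from typing import List, Optional


def next_interface(interfaces: List[str], last_used: Optional[str]) -> str:
    """Pick the interface for the next resend attempt.

    Rotates to the interface after the one used last. With a single
    interface the rotation wraps around, so the same interface is reused.
    """
    if not interfaces:
        raise ValueError("no interfaces available")
    if last_used not in interfaces:
        # First attempt, or the previous interface has been removed.
        return interfaces[0]
    idx = interfaces.index(last_used)
    return interfaces[(idx + 1) % len(interfaces)]
```

For example, with the hypothetical list `["ib0", "ib1"]` and `last_used="ib0"`, the next attempt goes out on `"ib1"`; with only `["ib0"]` available, the resend reuses `"ib0"`.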

Resiliency vs. Reliability

...