Item | Notes |
---|---|
Benchmarking | - stabilizing the network takes a lot of effort
- IOR -> poor performance -> disk analysis -> network between server and client -> check each server one by one (to determine the problematic server)
- many clients to one server
- network bandwidth problem. Where's the bottleneck?
- network testing one by one (1:1 client to server) is manageable; with N:N clients to servers it is more difficult to find what the problem is.
- measuring the bandwidth of each server and each client separately can be helpful; it helps pinpoint the problem path.
- traffic is sometimes not distributed well across the ports of the switch. Although this is below LNet, there is usually a lot of back and forth with the network provider to get someone to own the problem. It would be useful to have tools that provide evidence of where the problem is.
- expose performance information per connection
- tools to figure out more general networking issues.
|
After Production | - unexpected client eviction -> timeout between server and client
- collect more info for that specific link on eviction
|
Features | - LNet top (a `top`-like live view) to show time and rate performance
- Ability to capture traffic on a specific connection
|
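The "check each server one by one" approach from the benchmarking notes can be sketched as a small post-processing step. The following is a minimal illustration, not an existing LNet tool: it assumes 1:1 client-to-server bandwidth measurements have already been collected by some means (for example from pairwise benchmark runs) as `(client, server, MB/s)` tuples. The function names `summarize_pairs` and `flag_slow`, the sample data, and the 0.7 threshold are all hypothetical choices made for this example.

```python
from collections import defaultdict
from statistics import median


def summarize_pairs(measurements):
    """Group 1:1 results by server and by client; return the median MB/s of each.

    `measurements` is an iterable of (client, server, mbps) tuples.
    """
    by_server = defaultdict(list)
    by_client = defaultdict(list)
    for client, server, mbps in measurements:
        by_server[server].append(mbps)
        by_client[client].append(mbps)
    server_medians = {s: median(v) for s, v in by_server.items()}
    client_medians = {c: median(v) for c, v in by_client.items()}
    return server_medians, client_medians


def flag_slow(medians, threshold=0.7):
    """Return the nodes whose median is below `threshold` times the overall median."""
    if not medians:
        return []
    baseline = median(medians.values())
    return sorted(node for node, m in medians.items() if m < threshold * baseline)


# Hypothetical sample data: every client sees low bandwidth to server "s2".
sample = [
    ("c1", "s1", 1100), ("c1", "s2", 400), ("c1", "s3", 1080),
    ("c2", "s1", 1090), ("c2", "s2", 390), ("c2", "s3", 1110),
    ("c3", "s1", 1105), ("c3", "s2", 410), ("c3", "s3", 1095),
]

server_medians, client_medians = summarize_pairs(sample)
slow_servers = flag_slow(server_medians)
slow_clients = flag_slow(client_medians)
print("slow servers:", slow_servers)
print("slow clients:", slow_clients)
```

Grouping the same measurements both by server and by client is what separates a bad server from a bad client: a node that is slow with every peer stands out in its own group, while its peers' medians stay close to the baseline. A problem confined to a single path (for example one switch port) would not be caught by medians and would need the per-pair values inspected directly.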