More Details | |
---|---|
LU-12080 lnet: recovery event handling broken Don't increment health on unlink event. If a SEND fails an unlink will follow so no need to do any special processing on SEND event. If SEND succeeds then we wait for the reply. When queuing a message on the NI recovery queue only do so if the MT thread is still running. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I4877caebcac5cdfc35a59a18a3e3451b1f23cb0d | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34477 | |
LU-12080 lnet: clean mt_eqh properly There is a scenario where you have a peer on your recovery queue that's down. So you keep pinging it, but every ping times out after 10 seconds. In the middle of these 10 seconds you perform a shutdown. First you try to do the rsp_tracker_clean. It goes through and calls MDUnlink on the MD related to that ping. But because the message has a ref count on the MD, it doesn't go away. The MD gets zombied and just waits for lnet_md_unlink to be called in lnet_finalize(). Then you hit clean_peer_ni_recovery. We see the peer on the queue, we try to call Unlink on it, but when we look up the MD using lnet_handle2md() we can't find it. Afterwards we try to clean up the EQ and it asserts. Even if we remove the assert we end up with a resource leak, since the EQ is not actually freed because we won't call LNetEQFree() again. The solution is to move the EQ creation into LNetNIInit() and its deletion into lnet_unprepare. By this point all the remaining messages would've been finalized and all references on the EQ are gone, allowing us to clean it up properly. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I7fd6018ee2e57f82c649fc3658352e89a4309986 | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34967 | |
LU-12344 lnet: handle remote health error When a peer is dead set the health status to REMOTE_DROPPED in order to handle health properly for the peer. When dropping a routed message set REMOTE_ERROR. Routed messages are dropped when the routing feature is turned off which could be considered a configuration error if it happens in the middle of traffic. Therefore, it's better to flag this issue at this point without resending the message. Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I131263215a68fc8607582643a47007ce4d04abbc | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34252 | |
LU-11816 lnet: setup health timeout defaults Enable health feature by default. Setup transaction timeout to a default 10 seconds and retry count to 3 when health is enabled. When health is disabled set default transaction timeout to 50. When toggling between health enabled/disabled the defaults will always kick in. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I153c2822898b44e33871ec827de7e61f153bb1db | This patch turns on the health feature since in 2.12 it was off by default. The MR routing feature and related health went through significant testing on Cray HW, thanks to Chris Horn, and some fixes were made to the Health feature in the process. This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34607 | |
LU-12163 lnet: fix cpt locking In lnet_select_pathway() the call to lnet_handle_send_case_locked() can result in sd_cpt being changed. If this function returns REPEAT_SEND, we'll go back to the again label. It is possible at this time to initiate discovery, which will unlock the cpt. If the local cpt isn't updated we could potentially be manipulating the wrong cpt, resulting in some form of corruption or deadlock. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ifd39b0d84f8cce859151f7cc900a082481dd7218 | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34770/ | |
LU-12201 lnet: detach response tracker We need to unlink the response tracker from MDs even if the corresponding message failed to send. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I4f320274576790e3332f66f30aad5c2b3450b955 | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34771 | |
LU-11297 lnet: invalidate recovery ping mdh For cleanliness, ensure that the recovery ping mdh is invalidated when a peer NI or a local NI is allocated. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: If06448b1602b3680831244923b6b982a555159ea | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34778 | |
LU-12249 lnet: fix list corruption In shutdown the resend queues are cleared and freed. The monitor thread state is set to shutdown. It is possible to get lnet_finalize() called after the queues are freed. The code checks for ln_state to see if we're shutting down. But in this case we should really be checking ln_mt_state. The monitor thread is the one that matters in this case, because it's the one which allocates and frees the resend queues. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ia077cec7a52ef5cd2e1b231437c6265ba9416b1b | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34796 | |
LU-12254 lnet: correct discovery LNetEQFree() The EQ needs to be freed after all the queues are cleaned, to avoid having unprocessed events on the event queue at the time of the free. Unprocessed events prevent the memory from being freed. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ie38ec25e09bf6d7cf2aadc30edd91d298897c51b | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34798 | |
LU-12264 lnet: Protect lp_dc_pendq manipulation with lp_lock Protect the peer discovery queue from concurrent manipulation by acquiring the lp_lock. Test-Parameters: forbuildonly Signed-off-by: Chris Horn <hornc@cray.com> Change-Id: If43b877c1c7ea203f346a3d6ea846f00b8f9661f | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34885/ | |
LU-12199 lnet: Ensure md is detached when msg is not committed It's possible for lnet_is_health_check() to return "true" when the message has not hit the network. In this situation the message is freed without detaching the MD. As a result, requests do not receive their unlink events and these requests are stuck forever. A little cleanup is included here: - The value of lnet_is_health_check() is only used in one place, so we don't need to save the result of it in a variable. - We don't need separate logic to detach the md when the send was successful. We'll fall through to the finalizing code after incrementing the health counters Test-Parameters: forbuildonly Cray-bug-id: LUS-7239 Signed-off-by: Chris Horn <hornc@cray.com> Change-Id: I6301d491090b862d016eed3aac8afd7be8685e57 | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34797/ | |
LU-12199 lnet: verify msg is commited for send/recv Before performing a health check make sure the message is committed for either send or receive. Otherwise we can just finalize it. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Id7bd956f8e81e60a2d63059730973f851d4c7abe | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/34957 | |
LU-12339 lnet: select LO interface for sending In the following scenario: (1) Lustre->LNetPrimaryNID with 0@lo; (2) discovery is initiated on 0@lo; (3) the peer is created with 0@lo and <addr>@<net>; (4) the interface health of the peer's <addr>@<net> is decremented; (5) LNetPut() to self on <addr>@<net>; (6) the selection algorithm selects 0@lo to send to. This exposes an issue where we try to go through the peer credit management algorithm, but because there are no credits associated with 0@lo we end up indefinitely queuing the message. ptlrpc will then get stuck waiting for send completion on the message. This was exposed via conf-sanity 32a. Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I98e9d3428b594a0d041d27d8e8d8de7596825edc | This should be ported to b2_12 |
https://review.whamcloud.com/#/c/33447/ | |
LU-10153 lnet: remove route add restriction Remove restriction with adding routes to the same remote network via two different gateways. Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Iefc5aa10f73e9e7bdd283f5e933fbb8ee819df50 | There is no need to restrict the addition of routes to the same remote network via two different gateways. This change is simple. Just remove lnet_check_routes() and its callers. |
https://review.whamcloud.com/#/c/33182 | |
LU-11292 lnet: Discover routers on first use Discover routers on first use. This brings the behavior when interacting with routers in line with the behavior when dealing with normal peers. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I8527e41daf2f5f6ab5f04aac1285aaa6cc4ee594 | |
https://review.whamcloud.com/#/c/33183/ | |
LU-11298 lnet: use peer for gateway The routing code uses peer_ni for a gateway. However, with Multi-Rail a gateway could have multiple interfaces on several different networks. Instead of using a single peer_ni as the gateway we should be using the peer and let the MR selection code select the best peer_ni to send to. This patch moves the gateway from peer_ni to peer. Much of the code needs to be rewritten in the following patches to account for that change. This patch disables the routing features by disabling the code to add/delete routes. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ia7dab552268c4a7fbd7b88122b9a95363d155fd7 | The routing code will change quite a bit, so this patch removes most of the current routing code and then reintroduces it later. This patch concentrates on switching the gateway from using a peer_ni to using a peer. The design decision here is that a gateway is a node where LNet is started with the routing feature enabled. A gateway node can have multiple interfaces. In order to align routing with Multi-Rail, the code should first select a gateway peer, then use Multi-Rail to select the best peer_ni on that gateway. The following functions are removed in this patch and will be reintroduced in later patches: lnet_is_route_alive(), lnet_rtr_addref_locked(), lnet_rtr_decref_locked(), lnet_shuffle_seed(), lnet_add_route_to_rnet(), lnet_add_route() (the bulk of the code is removed), lnet_check_routes() (the bulk of the code is removed), lnet_del_route() (the bulk of the code is removed), lnet_parse_rc_info() (the bulk of the code is removed), lnet_destroy_rc_data(), lnet_update_rc_data_locked(), lnet_router_check_interval(), lnet_ping_router_locked(), lnet_prune_rc_data(), lnet_compare_peers(). Key fields are moved from the peer_ni: lpni_rtrq (moved), lpni_rtr_list (moved), lpni_ping_notsent (deleted), lpni_ping_timestamp (deleted), lpni_ping_deadline (deleted), lpni_rtr_refcount (moved), lpni_healthy (remnant code which is cleaned up), lpni_routes (moved). The lnet_route structure is changed in the following way: struct lnet_peer *lr_gateway (this is now lnet_peer instead of lnet_peer_ni); __u32 lr_lnet. It is no longer possible to determine the local network of the route by simply looking at the gateway peer, since the peer can have multiple interfaces on different networks. Therefore the route must now define both the local network and the remote network. This way we are able to select and compare routes properly. The rest of the changes concentrate on removing the use of ... In lib-move.c there are changes in both ... Routing is disabled with this patch. |
https://review.whamcloud.com/#/c/33184/ | |
LU-11299 lnet: lnet_add/del_route() Reimplemented lnet_add_route() and lnet_del_route() to use the peer instead of the peer_ni. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I3734098a81ab18d1d74220c691d96a9b9817e6da | This patch re-implements the following functions, which now use the peer instead of the peer_ni: lnet_rtr_addref_locked(), lnet_rtr_decref_locked(), lnet_shuffle_seed(), lnet_add_route_to_rnet(), lnet_add_route(), lnet_del_route_from_rnet(), lnet_del_route() |
https://review.whamcloud.com/#/c/33448/ | |
LU-11551 lnet: Do not allow deleting of router nis Check the peer before deleting a peer_ni. If it's a router then do not allow deletion of the peer_ni. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I372052b4e9b5af3a8f18a49676fc60b4c8077cbd | Add a check before deleting the peer_ni in lnet_del_peer_ni() |
https://review.whamcloud.com/#/c/33449/ | |
LU-11300 lnet: router sensitivity Introduce the lnet_router_sensitivity module parameter to control the sensitivity of routers to failures. It defaults to 100% which means a router interface needs to be fully healthy in order to be used. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I3e9333033f049918c1cdca58a72604c71884acbe | This patch introduces the router_sensitivity_percentage module parameter |
https://review.whamcloud.com/#/c/33455/ | |
LU-11300 lnet: configure lnet_router_sensitivity Allow the configuration of lnet_router_sensitivity from the user space utility lnetctl Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I715059580d3d2d443432a8b550d4cdafc9f9f632 | This patch allows setting the router sensitivity from lnetctl |
https://review.whamcloud.com/#/c/33450/ | |
LU-11300 lnet: cache ni status When processing the data in the PUSH or the REPLY make sure to cache the ns_status. This is the status of the peer_ni as reported by the peer itself. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I14de2460f578fb7f47d329a97b8833f49c569b74 | This patch caches the ns_status reported in the PING reply for a GET sent as part of discovery. |
https://review.whamcloud.com/#/c/33451/ | |
LU-11300 lnet: Cache the routing feature When processing a REPLY or a PUSH for a discovery, cache whether the routing feature is enabled or disabled as reported by the peer. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I69bd41fade196773af0e1004c2e7fff2fb91392d | This patch caches the routing feature status (enabled or disabled) in the REPLY or the PUSH processed by lnet_peer_merge_data(). This is important for later patches which check if the peer is a router or not. |
https://review.whamcloud.com/#/c/33186/ | |
LU-11300 lnet: peer aliveness Peer NI aliveness is now solely dependent on the health infrastructure. With the addition of router_sensitivity_percentage, a peer NI is considered dead if its health drops below the specified percentage of the total health. Setting the percentage to 100% means that a peer_ni is considered dead if its interface is less than fully healthy. Removed obsolete code that queries the peer NI every second, since the health infrastructure introduces the recovery mechanism which is designed to recover the health of peer NIs. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I506060fbb66c74295808891b689d7d634dc69284 | Code which calls this function is updated. router_proc.c cleanup |
https://review.whamcloud.com/#/c/33185/ | |
LU-11300 lnet: router aliveness A route is considered alive if the gateway is able to route messages from the local to the remote net. That means that at least one of the network interfaces on the remote net of the gateway is viable. Introduced the concept of sensitivity percentage. This defaults to 100%. It holds a dual meaning: 1. A route is considered alive if the health of at least one of its interfaces is >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage. 100 means at least one interface has to be 100% healthy. 2. On a router, consider a peer_ni dead if its health is not at least LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage. 100% means the interface has to be 100% healthy. Re-implemented lnet_notify() to decrement the health of the peer interface if the LND reports a failure on that peer. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ie97561fb70bf6a558bc90fa9266a6ba38fa3d293 | NOTE: break up this patch into a patch which introduces the router_sensitivity_percentage and a patch which uses the value. The changes which are in lib-msg.c do not belong to this patch. Need a separate patch for it. NOTE: This should be titled "route aliveness". This patch introduces a new concept on how to determine that a route is alive. A route is alive if the following two conditions are met: ... The health value of a router remote interface will always be set to MAX because we do not send to it directly, therefore we never decrement its health value. The way we know if it's up or down is that, when we discover it, the router responds with the status of the interface, which we cache and use to determine the status of the remote interface. |
https://review.whamcloud.com/#/c/33452 | |
LU-11300 lnet: simplify lnet_handle_local_failure() Pass the struct lnet_ni to lnet_handle_local_failure() instead of the message structure, since nothing else from the message is being used. This also makes it symmetrical with lnet_handle_remote_failure(). Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I10146ec5bf5f378e28a7725382f00132ada32c6e | |
https://review.whamcloud.com/#/c/33187/ | |
LU-11299 lnet: Cleanup rcd Cleanup all code pertaining to rcd, as routing code will use discovery going forward and there will be no need to keep its own pinging code. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: If31caa3b5703df40b6ae0f758f2fe764991aa4f3 | Cleans up the legacy code which handled router pinging |
https://review.whamcloud.com/#/c/33453/ | |
LU-11299 lnet: modify lnd notification mechanism LND notifies when a peer is up or down. If it notifies LNet that the peer is up and sets the "reset" flag to true, this indicates to LNet that the LND knows about the health of the peer and is telling LNet that the peer is fully healthy. LNet will set the health value of the peer to maximum; otherwise it will increment the health by one. If the LND notifies LNet that the peer is down, LNet will decrement the health of the peer by the configured sensitivity value. LNet then turns around and rechecks the peer aliveness, and if it's dead it'll notify the LND. This code is only used by the socklnd because it needs to tear down connections. This is in keeping with the original functionality. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ifa614405fb0c2cd4f6bcb1a2a97e856320eb6cbe | Updates lnet_notify(), lnet_set_healthv(), lnet_notify_peer_down(); gni changes; o2iblnd changes; socklnd changes |
https://review.whamcloud.com/#/c/33454/ | |
LU-11299 lnet: use discovery for routing Instead of re-inventing the wheel, routing now uses discovery. Every router interval the router is discovered. This will update the router information locally and will serve to let the router know that the peer is alive. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I211bf15af0b0a5d50f9e2a69a385419a1dd5096 | lnet_consolidate_routes_locked(), lnet_peer_get_ni_locked(), lnet_check_routers(), lnet_router_discovery_complete() |
https://review.whamcloud.com/#/c/33188/ | |
LU-11378 lnet: MR aware gateway selection When selecting a route use the Multi-Rail Selection algorithm to select the best available peer_ni of the best route. The selected peer_ni can then be used to send the message or to discover it if the gateway peer needs discovering. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I376af57611591eed2eb1edb80a1b3a68b5aefd19 | This patch modifies lib-move.c to properly select the gateway and then the gateway peer_ni to send to. |
https://review.whamcloud.com/#/c/33298/ | |
LU-11300 lnet: consider router_check_interval Consider router_check_interval when waking up the monitor thread, to make sure you wake up the monitor thread at the earliest possible time. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ibc4b53886b59a9bc174a29d0da711ac77db3a62c | The monitor thread wakes up at the minimum of ... This patch introduces the router_check_interval for consideration in the monitor thread wake-up algorithm. |
https://review.whamcloud.com/#/c/33300/ | |
LU-11475 lnet: allow deleting router primary_nid Discovery doesn't allow deleting a primary_nid of a peer. This is necessary because upper layers only know to reach the peer by using the primary_nid. For routers this is not the case. So if a router changes its interfaces and comes back up again, the peer_ni should be adjusted. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I9da056172f35a5f15eed5ba0e02fcb37ac414c54 | |
https://review.whamcloud.com/#/c/34539 | |
LU-11475 lnet: transfer routers When a primary NID of a peer is about to be deleted because it's being transferred to another peer, if that peer is a gateway then transfer all gateway properties to the new peer. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ib475c389ca5630906416a5112b3088f6f5d03950 | |
https://review.whamcloud.com/#/c/33301/ | |
LU-11477 lnet: handle health for incoming messages In the case of routers (as well as in the general case) it's important to update the health of the ni/lpni for incoming messages. For an lpni specifically, receiving a message is how we know that the lpni is up. A percentage of router health is required in order to send a message to a gateway. That defaults to 100, meaning that a router interface has to be absolutely healthy in order to send to it. This matches the current behavior. So if a router interface goes down and its health drops significantly, but then it comes back up again (either we receive a message from it or we discover it and get a reply), then in order to start using that router interface again we have to boost its health all the way up to maximum. This behavior is special cased for routers. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ida6c23f95dbef56c2e6ed7b6d03743939d8b30a0 | Most of the modifications are in lnet_health_check() |
https://review.whamcloud.com/#/c/33304/ | |
LU-11478 lnet: misleading discovery seqno. There is a sequence number used when sending discovery messages. This sequence number is intended to detect stale messages. However, it could be misleading if the peer reboots. In this case the peer's sequence number will reset. The node will think that all information being sent to it is stale, while in reality the peer might've changed configuration. There is no reliable way to know whether a peer rebooted, so we'll always assume that the messages we're receiving are valid and operate on a first-come, first-served basis. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I421a00e47bc93ee60fa37c648d6d9a726d9def9c | Need to pass this by Olaf Weber |
https://review.whamcloud.com/#/c/33305/ | |
LU-11470 lnet: drop all rule Add a rule to drop all messages arriving on a specific interface. This is useful for simulating failures on a specific router interface. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ic69f683fb2caf7a69a1d85428878c89b7b1ee3ad | For testing routers we want to be able to add a rule on the router to drop all messages arriving on that router interface from anywhere. This way we can simulate a router interface down scenario. The problem is that the source and destination in the router case are not the router NID, so the rule specifies the local NID of the router. If the local NID is not specified then it defaults to LNET_NID_ANY. Unlike source and destination it is mandatory. Specifying NID any allows the drop rule to match messages in the absence of a specified local_nid. A "drop all" field is added which can be set from the command line. |
https://review.whamcloud.com/#/c/33620/3 | |
LU-11641 lnet: handle discovery off When discovery is turned off locally, or when the peer either has discovery off or doesn't support MR at all, degrade discovery behavior to a standard ping. This will allow routers to continue using the discovery mechanism even if it's turned off. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I7f0829d37cbff2bf9e41de251efa715fc4c97e5d | The original discovery behavior was that if you turn off discovery then it doesn't send a PING or a PUSH. However, this causes a problem for routers. One example is that Cray needs to turn off discovery because it interferes with their routing setup. To handle this we changed the behavior for the discovery-off case. Discovery will always PING when requested. When the PING REPLY comes it'll update the existing peer_nis but not add or move peer_nis among peers. This is needed to update the router peer_ni statuses on the nodes. Most of the changes are spread through peer.c to change what we do when discovery on the peer is off or when it is turned off locally. |
https://review.whamcloud.com/#/c/33634/2 | |
LU-11297 lnet: handle router health off Routing infrastructure depends on health infrastructure to manage route status. However, health can be turned off. Therefore, we need to enable health for gateways in order to monitor them properly. Each peer now has its own health sensitivity. When adding a route the gateway's health sensitivity can be explicitly set from lnetctl or if not specified then it'll default to 1, thereby turning health on for that gateway, allowing peer NI recovery if there is a failure. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ibae33d595e97d0eec432ae8f5d51898ce0776f01 | The way router health is decided is via the health mechanism. This presents a problem since health is turned off by default. This patch enables it for all routers. |
https://review.whamcloud.com/#/c/33635/2 | |
LU-11297 lnet: set gw sensitivity from lnetctl Allow an optional parameter on the lnetctl route add command to set the health sensitivity of the gateway: lnetctl route add --net <net> --gateway <gw> --sensitivity <value> Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Iee120c78a41b79da6ab6bdf1560f558df89233e2 | Allow configuring the router sensitivity value when adding a route. It's an optional parameter and defaults to 1 if not specified. |
https://review.whamcloud.com/#/c/33651/ | |
LU-11664 lnet: push router interface updates A router can bring up/down its interfaces if it hasn't received any messages on that interface for a configurable period (alive_router_ping_timeout). When this event occurs the router can now push its status change to the peers it's talking to in order to inform them of the change in its status. This will allow the router users to handle asym router failures quicker. Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I9530ed7d9bc0a86edc43e3f610cc943f1732dcfd | To allow for faster recovery if the router interfaces go down because it doesn't receive a ping, a push with the updated information is sent to all the peers. This allows peers to stop using these routers if asym router failure is enabled (it is by default). |
https://review.whamcloud.com/#/c/34510 | |
LU-11299 lnet: net aliveness If a router is discovered on any interface on the network, then update the network last alive time and the NI's status to UP. If a router isn't discovered on any interface on a network, then change the status of all the interfaces on that network to down. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I1d67eb4b3284ccb8306ad4c877a2fcbdf4958d8c | |
https://review.whamcloud.com/#/c/34511/ | |
LU-11299 lnet: discover each gateway Net Wake up every gateway aliveness interval / number of local networks. Discover each local gateway network in round robin. This is done to make sure the gateway keeps its networks up. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I4035e39c286cb599d4eb8f9df7ed5d278e6d744a | |
https://review.whamcloud.com/#/c/34625/ | |
LU-12053 lnet: look up MR peers routes An MR peer can have multiple interfaces some of which we might have a route to. The primary NID of the peer might not necessarily specify a NID we have a route to. When looking up a route, we must iterate over all the nets the peer is on and select the one which we can route to. Taking into consideration the peer can exist on multiple routed networks we also have a simple round robin algorithm to iterate over all the networks we can reach the peer on. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I0651dd4f732c8b71872f73cf2512b08f34129bd9 | |
https://review.whamcloud.com/#/c/34772/ | |
LU-12200 lnet: check peer timeout on a router On a router assume that a peer is alive and attempt to send it messages as long as the peer_timeout hasn't expired. Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I0806a52c8ad7acc1c93dcf32353f1c4467c618b1 |
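
The threshold rule described in the LU-11300 "router aliveness" entry (a route is alive if at least one interface health is at or above LNET_MAX_HEALTH_VALUE scaled by router_sensitivity_percentage) can be illustrated with a small standalone sketch. This is not the LNet implementation: the helper names `health_meets_sensitivity()` and `route_is_alive()` are hypothetical, the value 1000 for LNET_MAX_HEALTH_VALUE is an assumption made for the example, and the sketch ignores the cached interface status that the notes say is used for remote interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed value for illustration only; the real constant is defined by LNet. */
#define LNET_MAX_HEALTH_VALUE 1000

/*
 * Hypothetical helper: true if healthv is at least
 * LNET_MAX_HEALTH_VALUE * sensitivity_pct / 100.
 * The comparison is cross-multiplied to stay in integer arithmetic.
 */
static bool health_meets_sensitivity(int healthv, int sensitivity_pct)
{
	return (long)healthv * 100 >=
	       (long)LNET_MAX_HEALTH_VALUE * sensitivity_pct;
}

/*
 * Hypothetical helper: a route is treated as alive if at least one of the
 * gateway interface health values meets the sensitivity threshold.
 */
static bool route_is_alive(const int *iface_health, size_t n,
			   int sensitivity_pct)
{
	for (size_t i = 0; i < n; i++) {
		if (health_meets_sensitivity(iface_health[i], sensitivity_pct))
			return true;
	}
	return false;
}
```

With the default sensitivity of 100, a gateway whose interfaces report health 999 and 500 would not be usable under this sketch, while lowering the sensitivity to 50 would make the same gateway usable.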
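
The health adjustment described in the LU-11299 "modify lnd notification mechanism" entry can be sketched as a pure function: an "up" notification with the reset flag sets health to maximum, an "up" notification without it increments health by one, and a "down" notification decrements health by the configured sensitivity. The function name `notify_update_health()` is hypothetical, the value 1000 for LNET_MAX_HEALTH_VALUE is assumed, and the clamping to the range [0, max] is an assumption of this sketch rather than something stated in the commit message.

```c
#include <stdbool.h>

/* Assumed value for illustration only; the real constant is defined by LNet. */
#define LNET_MAX_HEALTH_VALUE 1000

/*
 * Hypothetical helper: return the new health value of a peer interface
 * after an LND notification.
 *   alive && reset   -> health is set to the maximum
 *   alive && !reset  -> health is incremented by one
 *   !alive           -> health is decremented by the sensitivity value
 * Clamping to [0, LNET_MAX_HEALTH_VALUE] is an assumption of this sketch.
 */
static int notify_update_health(int healthv, bool alive, bool reset,
				int sensitivity)
{
	if (alive) {
		if (reset)
			return LNET_MAX_HEALTH_VALUE;
		healthv += 1;
		return healthv > LNET_MAX_HEALTH_VALUE ?
		       LNET_MAX_HEALTH_VALUE : healthv;
	}

	healthv -= sensitivity;
	return healthv < 0 ? 0 : healthv;
}
```

Keeping the update as a pure function makes the three cases easy to check in isolation; in the kernel the equivalent logic would run under the appropriate locks and be followed by the aliveness recheck that the commit message describes.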