More Details | |
---|---|
LU-11292 lnet: Discover routers on first use Discover routers on first use. This brings the behavior when interacting with routers in line with the handling of normal peers. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I8527e41daf2f5f6ab5f04aac1285aaa6cc4ee594 |
|
https://review.whamcloud.com/#/c/33183/ | |
LU-11298 lnet: use peer for gateway The routing code uses peer_ni for a gateway. However with Multi-Rail a gateway could have multiple interfaces on several different networks. Instead of using a single peer_ni as the gateway we should be using the peer and let the MR selection code select the best peer_ni to send to. This patch moves the gateway from peer_ni to peer. Much of the code needs to be rewritten in the following patches to account for that change. This patch disables the routing features by disabling the code to add/delete routes. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ia7dab552268c4a7fbd7b88122b9a95363d155fd7 | The routing code will change quite a bit, so this patch removes most of the current routing code and then reintroduces it later. This patch concentrates on switching the gateway from using a peer_ni to using a peer. The design decision here is that a gateway is a node where LNet is started with the routing feature enabled. A gateway node can have multiple interfaces. To align routing with Multi-Rail, the code should first select a gateway peer, then use Multi-Rail to select the best peer_ni on that gateway. 
The following functions are removed in this patch and will be introduced in later patches:
lnet_is_route_alive()
lnet_rtr_addref_locked()
lnet_rtr_decref_locked()
lnet_shuffle_seed()
lnet_add_route_to_rnet()
lnet_add_route() # the bulk of the code is removed
lnet_check_routes() # the bulk of the code is removed
lnet_del_route() # the bulk of the code is removed
lnet_parse_rc_info() # the bulk of the code is removed
lnet_destroy_rc_data()
lnet_update_rc_data_locked()
lnet_router_check_interval()
lnet_ping_router_locked()
lnet_prune_rc_data()
lnet_compare_peers()
Key fields are moved from
lpni_rtrq # moved
lpni_rtr_list # moved
lpni_ping_notsent # deleted
lpni_ping_timestamp # deleted
lpni_ping_deadline # deleted
lpni_rtr_refcount # moved
lpni_healthy # this is a remnant code which is cleaned up
lpni_routes # moved
The lnet_route structure is changed in the following way:
struct lnet_peer *lr_gateway # this is now lnet_peer instead of lnet_peer_ni
__u32 lr_lnet # it is no longer possible to determine the local network of the route by simply looking at the gateway peer, since the peer can have multiple interfaces on different networks. Therefore the route now must define the local network and remote network. This way we are able to select and compare routes properly.
The rest of the changes concentrate on removing the use of
In lib-move.c there are changes in both
Routing is disabled with this patch. |
https://review.whamcloud.com/#/c/33184/ | |
LU-11299 lnet: lnet_add/del_route() Reimplemented lnet_add_route() and lnet_del_route() to use the peer instead of the peer_ni. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I3734098a81ab18d1d74220c691d96a9b9817e6da | NOTES: lnet_check_routes() is removed in this patch. We should move it into its own patch against ticket LU-10153, since the previous patch removes a bunch of functions. The reason for removing lnet_check_routes() is that we no longer restrict multiple routes on the same remote network. This patch re-implements the following functions, which now use the peer instead of the peer_ni: lnet_rtr_addref_locked(), lnet_rtr_decref_locked(), lnet_shuffle_seed(), lnet_add_route_to_rnet(), lnet_add_route(), lnet_del_route_from_rnet(), lnet_del_route() |
Prevent peer_ni deletion if it's being used as a router | |
Code split from: https://review.whamcloud.com/#/c/33188/5/lnet/lnet/peer.c | |
Router sensitivity introduction | |
This patch introduces the router_sensitivity_percentage value. | |
Router sensitivity user space setting | |
This patch allows setting the router sensitivity from lnetctl | |
Cache the ni_status reported in the ping REPLY. | |
This patch caches the ns_status reported in the PING reply. This will include the peer.c changes in https://review.whamcloud.com/#/c/33187/5/lnet/lnet/peer.c | |
https://review.whamcloud.com/#/c/33303/ | |
LU-11300 lnet: start with peer down When creating a peer_ni, call lnet_peers_start_down() to check if we should set the peer_ni's status as up or down. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I05005f10ca4b1b11f93e57c304052e155679304a | NOTE: We need to move this patch up in the series. This patch ensures we maintain current behavior. We start with the peer down depending on the tunable checked via lnet_peers_start_down().
|
Cache the routing feature status reported in the ping REPLY | |
This patch caches the routing feature status (enabled or disabled) in the response to the PING | |
https://review.whamcloud.com/#/c/33186/ | |
LU-11300 lnet: peer aliveness Peer NI aliveness is now solely dependent on the health infrastructure. With the addition of router_sensitivity_percentage, a peer NI is considered dead if its health drops below the specified percentage of the total health. Setting the percentage to 100% means that a peer_ni is considered dead if its interface is less than fully healthy. Removed obsolete code that queries the peer NI every second, since the health infrastructure introduces the recovery mechanism which is designed to recover the health of peer NIs. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I506060fbb66c74295808891b689d7d634dc69284 | NOTE: This patch needs cleaning up to make it match the final form.
Code which calls this function is updated. router_proc.c cleanup. Merge the code from https://review.whamcloud.com/#/c/33302/3 into this patch. |
https://review.whamcloud.com/#/c/33185/ | |
LU-11300 lnet: router aliveness A route is considered alive if the gateway is able to route messages from the local to the remote net. That means that at least one of the network interfaces on the remote net of the gateway is viable. Introduced the concept of sensitivity percentage. This defaults to 100%. It holds a dual meaning: 1. A route is considered alive if the health of at least one of its interfaces is >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage. 100 means at least one interface has to be 100% healthy. 2. On a router, consider a peer_ni dead if its health is not at least LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage. 100% means the interface has to be 100% healthy. Re-implemented lnet_notify() to decrement the health of the peer interface if the LND reports a failure on that peer. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ie97561fb70bf6a558bc90fa9266a6ba38fa3d293 | NOTE: Break up this patch into a patch which introduces the router_sensitivity_percentage and a patch which uses the value. The changes in lib-msg.c do not belong to this patch; they need a separate patch. NOTE: This should be titled "route aliveness". This patch introduces a new concept of how to determine that a route is alive. A route is alive if the following two conditions are met:
The health value of a router's remote interface is always set to MAX because we do not send to it directly, and therefore we never decrement its health value. We learn whether it is up or down when we discover the router: the router responds with the status of the interface, which we cache and use to determine the status of the remote interface.
|
Cleanup all rcd code | |
Cleans up the legacy code which handled router pinging | |
Update LND notify mechanism | |
Updates lnet_notify(), lnet_set_healthv(), lnet_notify_peer_down(); gni changes; o2iblnd changes; socklnd changes; https://review.whamcloud.com/#/c/33187/5/lnet/lnet/router.c (end of the file changes) | |
Use discovery for router checking | |
lnet_consolidate_routes_locked(), lnet_peer_get_ni_locked(), lnet_check_routers() | |
https://review.whamcloud.com/#/c/33188/ | |
LU-11378 lnet: MR aware gateway selection When selecting a route use the Multi-Rail Selection algorithm to select the best available peer_ni of the best route. The selected peer_ni can then be used to send the message or to discover it if the gateway peer needs discovering. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I376af57611591eed2eb1edb80a1b3a68b5aefd19 | This patch modifies lib-move.c to properly select the gateway and then the gateway peer_ni to send to. |
https://review.whamcloud.com/#/c/33298/ | |
LU-11300 lnet: consider router_check_interval Consider router_check_interval when waking up the monitor thread, to make sure the monitor thread wakes up at the earliest possible time. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ibc4b53886b59a9bc174a29d0da711ac77db3a62c | The monitor thread wakes up the minimum of
This patch introduces the router_check_interval for consideration in the monitor thread wake-up algorithm. |
https://review.whamcloud.com/#/c/33299/ | |
LU-11299 lnet: router discovery complete callback Added a discovery complete callback which is called when a router has completed its discovery process. If the router failed discovery then the status of each lpni is set to down. This is necessary because lpnis on remote networks are never communicated with, so their health remains at max. However, if we can't discover the router, then we have to assume that the whole router and all its NIs are down. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I8d77b78b20ac555bc3afabd9404ca4b0fd19bd2d | Introduce:
|
https://review.whamcloud.com/#/c/33300/ | |
LU-11475 lnet: allow deleting router primary_nid Discovery doesn't allow deleting a primary_nid of a peer. This is necessary because upper layers only know to reach the peer by using the primary_nid. For routers this is not the case. So if a router changes its interfaces and comes back up again, the peer_ni should be adjusted. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I9da056172f35a5f15eed5ba0e02fcb37ac414c54 |
|
https://review.whamcloud.com/#/c/33301/ | |
LU-11477 lnet: handle health for incoming messages In case of routers (as well as for the general case) it's important to update the health of the ni/lpni for incoming messages. For an lpni specifically, receiving a message is how we know that the lpni is up. A percentage router health is required in order to send a message to a gateway. That defaults to 100, meaning that a router interface has to be absolutely healthy in order to send to it. This matches the current behavior. So if a router interface goes down and its health drops significantly, but then it comes back up again (either we receive a message from it or we discover it and get a reply), then in order to start using that router interface again we have to boost its health all the way up to maximum. This behavior is special cased for routers. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ida6c23f95dbef56c2e6ed7b6d03743939d8b30a0 | Most of the modifications are in lnet_health_check()
|
https://review.whamcloud.com/#/c/33304/ | |
LU-11478 lnet: misleading discovery seqno. There is a sequence number used when sending discovery messages. This sequence number is intended to detect stale messages. However it could be misleading if the peer reboots. In this case the peer's sequence number will reset. The node will think that all information being sent to it is stale, while in reality the peer might've changed configuration. There is no reliable way to know whether a peer rebooted, so we'll always assume that the messages we're receiving are valid. So we'll operate on a first-come, first-served basis. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: I421a00e47bc93ee60fa37c648d6d9a726d9def9c | Need to pass this by Olaf Weber |
https://review.whamcloud.com/#/c/33305/ | |
LU-11470 lnet: drop all rule Add a rule to drop all messages arriving on a specific interface. This is useful for simulating failures on a specific router interface. Test-Parameters: forbuildonly Signed-off-by: Amir Shehata <ashehata@whamcloud.com> Change-Id: Ic69f683fb2caf7a69a1d85428878c89b7b1ee3ad | For testing routers we want to be able to add a rule on the router to drop all messages arriving on that router interface from anywhere. This way we can simulate a router interface down scenario. The problem is that the source and destination in the router case are not the router NID, so the rule specifies the local NID of the router. If the local NID is not specified then it defaults to LNET_NID_ANY. Unlike source and destination, it is not mandatory. Specifying LNET_NID_ANY allows the drop rule to match messages in the absence of a specified local_nid. A drop all field is added, which can be set from the command line. |
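
The sensitivity threshold from the LU-11300 rows and the health boost from the LU-11477 row can be sketched together. This is only an illustration under stated assumptions: the value 1000 for LNET_MAX_HEALTH_VALUE is assumed, every `sketch_*` name is hypothetical and not part of LNet, and the sketch looks at health values only, ignoring the cached interface status reported in the ping reply.

```c
#include <stdbool.h>

/* Assumed value for illustration only; check the LNet headers in your tree. */
#define LNET_MAX_HEALTH_VALUE 1000

/* Hypothetical stand-in for one interface (peer_ni) of a gateway. */
struct sketch_lpni {
	int healthv; /* current health, 0..LNET_MAX_HEALTH_VALUE */
};

/* Minimum health an interface needs for a given sensitivity percentage. */
static int sketch_health_threshold(int sensitivity_pct)
{
	return LNET_MAX_HEALTH_VALUE * sensitivity_pct / 100;
}

/* An interface counts as alive if its health meets the threshold. */
static bool sketch_lpni_alive(const struct sketch_lpni *lpni,
			      int sensitivity_pct)
{
	return lpni->healthv >= sketch_health_threshold(sensitivity_pct);
}

/* A route counts as alive if at least one gateway interface is alive. */
static bool sketch_route_alive(const struct sketch_lpni *nis, int count,
			       int sensitivity_pct)
{
	for (int i = 0; i < count; i++) {
		if (sketch_lpni_alive(&nis[i], sensitivity_pct))
			return true;
	}
	return false;
}

/*
 * Router special case: a message received from a router interface
 * boosts its health all the way back to maximum.
 */
static void sketch_router_msg_received(struct sketch_lpni *lpni)
{
	lpni->healthv = LNET_MAX_HEALTH_VALUE;
}
```

With the default sensitivity of 100, an interface that has lost any health fails the check until a received message or a discovery reply restores it to maximum, which is why the boost goes to the maximum rather than incrementing gradually.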
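
The local NID matching described for the LU-11470 drop rule can be sketched as follows. This is a minimal illustration, not the fault-injection code itself: the `sketch_*` names, the 64-bit NID representation, and the wildcard constant are all hypothetical stand-ins (the wildcard plays the role of LNET_NID_ANY), and treating source and destination as wildcard-capable is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wildcard standing in for LNET_NID_ANY. */
#define SKETCH_NID_ANY UINT64_MAX

/* Hypothetical drop rule: three NIDs, any of which may be the wildcard. */
struct sketch_drop_rule {
	uint64_t src;       /* original sender of the message */
	uint64_t dst;       /* final destination of the message */
	uint64_t local_nid; /* interface the message arrived on */
};

/* A rule NID matches if it is the wildcard or equals the message NID. */
static bool sketch_nid_match(uint64_t rule_nid, uint64_t nid)
{
	return rule_nid == SKETCH_NID_ANY || rule_nid == nid;
}

/*
 * On a router, src and dst are the end points, never the router itself,
 * so the arrival interface is matched through local_nid.
 */
static bool sketch_rule_matches(const struct sketch_drop_rule *rule,
				uint64_t src, uint64_t dst,
				uint64_t local_nid)
{
	return sketch_nid_match(rule->local_nid, local_nid) &&
	       sketch_nid_match(rule->src, src) &&
	       sketch_nid_match(rule->dst, dst);
}
```

A rule with wildcard source and destination and a specific local_nid matches everything arriving on that one router interface, which is the interface-down simulation the patch targets. A rule whose local_nid is left at the wildcard behaves like a rule without the field.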