Overview

NAT Overview

Network address translation (NAT) is a method of remapping an IP address space into another by modifying network address information in the IP header of packets while they are in transit across a traffic routing device.

In a NATed environment, when a TCP connection is established to a host outside the NAT, the socket created on the receiving end sees the external IP address of the machine rather than the internal address of the initiator. For example, if a NATed VM is running on a node, the physical machine can have the address 192.168.0.18 while the VM has the internal address 192.168.122.100. When the VM communicates externally, the physical machine's address is the one that is externally visible. The source port is also remapped into a range enforced by iptables rules.

Problem Overview

When the socklnd of an LNet node behind a NAT establishes a connection to a peer, NAT remaps the source IP address to the external IP address and maps the privileged port the socklnd uses to a non-privileged port. When the LNet peer receives the connection request, it uses the remote IP address of the socket to create the NID of the peer. The relevant code can be found in ksocknal_recv_hello():

        if (!active &&
            conn->ksnc_port > LNET_ACCEPTOR_MAX_RESERVED_PORT) {
                /* Userspace NAL assigns peer_ni process ID from socket */
                recv_id.pid = conn->ksnc_port | LNET_PID_USERFLAG;
                recv_id.nid = LNET_MKNID(LNET_NIDNET(ni->ni_nid), conn->ksnc_ipaddr);
        } else {
                recv_id.nid = hello->kshm_src_nid;
                recv_id.pid = hello->kshm_src_pid;
        }

Because NAT has mapped the port to one outside the reserved port range, the NID is created from the socket's IP address. This is necessary in order to be able to re-establish connections to the initiating node, since the internal IP address is unreachable from outside the NATed environment.

The socklnd then passes this NID to lnet_parse(), which processes the request properly. However, when LNet responds to the node, it uses the NID created from the socket, which contains the external NATed IP address. The node receives the response but promptly drops it because the destination NID doesn't match its configured NID.
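The mismatch can be illustrated with a minimal, self-contained C sketch. It is not the socklnd code: the `sketch_*` names, the network number `SKETCH_NET_TCP0`, and the NAT-assigned port 1030 used in the usage note are assumptions made for illustration, and only the NID selection from the excerpt above is modeled (the PID is ignored).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for LNet definitions; the values are assumptions. */
#define SKETCH_MAX_RESERVED_PORT 1023u       /* highest privileged port */
#define SKETCH_NET_TCP0          0x00020000u /* hypothetical network number */

typedef uint64_t sketch_nid_t;

/* Pack a network number and an IPv4 address (host byte order) into a NID. */
static sketch_nid_t sketch_mknid(uint32_t net, uint32_t addr)
{
        return ((uint64_t)net << 32) | (uint64_t)addr;
}

/* Build an IPv4 address in host byte order from its four octets. */
static uint32_t sketch_ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
        return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
               ((uint32_t)c << 8) | (uint32_t)d;
}

/*
 * Passive-side NID selection, following the excerpt above: a connection
 * arriving from a non-privileged port gets a NID built from the socket's
 * remote address; otherwise the NID from the hello message is used.
 */
static sketch_nid_t sketch_passive_nid(uint32_t net, uint32_t sock_ip,
                                       uint16_t sock_port,
                                       sketch_nid_t hello_src_nid)
{
        assert(net != 0);
        if (sock_port > SKETCH_MAX_RESERVED_PORT)
                return sketch_mknid(net, sock_ip);
        return hello_src_nid;
}

/* The NATed node only accepts messages addressed to its configured NID. */
static bool sketch_reply_accepted(sketch_nid_t configured_nid,
                                  sketch_nid_t reply_dst_nid)
{
        return configured_nid == reply_dst_nid;
}
```

With the addresses from the NAT overview, a VM configured as 192.168.122.100 whose connection arrives from 192.168.0.18 on port 1030 is identified by the passive side as 192.168.0.18, so `sketch_reply_accepted()` returns false for the reply.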

Solution

The solution is to hide the NATed address from the LNet layer. The LNet layer needs to maintain the correct NID for the NATed peer; otherwise, multiple VMs on the same physical machine would all use the same external IP address, and LNet would not be able to differentiate between their NIDs.

Fortunately, the active socklnd sends a hello message as part of connection establishment. This hello message contains the actual configured NID of the active side. The passive side of the connection can then map the active side's configured NID to the NATed IP address and port.

The active side's configured NID is passed up to LNet. LNet can continue its processing and then call the socklnd's send callback to send a message. The socklnd will look up the mapping with the node's configured NID as the key and use the external IP address and port. If the connection already exists, it will use that socket. Otherwise, it will be able to re-establish the connection using the external IP address and port the NATed peer used for the initial connection. This process is subject to the iptables rules described below.
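The mapping described above could look roughly like the following sketch. It is a minimal, fixed-size, open-addressed table written for illustration only: the `sketch_*` names, the table size, and the hash function are assumptions, and a real implementation would also need locking and entry removal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NAT_SLOTS 64u /* hypothetical table size (2^6 slots) */

struct sketch_nat_entry {
        bool     used;
        uint64_t private_nid; /* key: configured NID from the hello message */
        uint32_t ext_ip;      /* value: remote IP address seen on the socket */
        uint16_t ext_port;    /* value: remote port seen on the socket */
};

struct sketch_nat_table {
        struct sketch_nat_entry slots[SKETCH_NAT_SLOTS];
};

/* Multiplicative hash; the top 6 bits select one of the 64 slots. */
static size_t sketch_nat_hash(uint64_t nid)
{
        return (size_t)((nid * 0x9E3779B97F4A7C15ull) >> 58);
}

/* Insert or update a mapping. Returns false if the table is full. */
static bool sketch_nat_map(struct sketch_nat_table *t, uint64_t private_nid,
                           uint32_t ext_ip, uint16_t ext_port)
{
        assert(t != NULL);
        size_t start = sketch_nat_hash(private_nid);

        for (size_t i = 0; i < SKETCH_NAT_SLOTS; i++) {
                struct sketch_nat_entry *e =
                        &t->slots[(start + i) % SKETCH_NAT_SLOTS];

                if (!e->used || e->private_nid == private_nid) {
                        e->used = true;
                        e->private_nid = private_nid;
                        e->ext_ip = ext_ip;
                        e->ext_port = ext_port;
                        return true;
                }
        }
        return false;
}

/* Find the external IP/port for a private NID, or NULL if unknown. */
static const struct sketch_nat_entry *
sketch_nat_lookup(const struct sketch_nat_table *t, uint64_t private_nid)
{
        assert(t != NULL);
        size_t start = sketch_nat_hash(private_nid);

        for (size_t i = 0; i < SKETCH_NAT_SLOTS; i++) {
                const struct sketch_nat_entry *e =
                        &t->slots[(start + i) % SKETCH_NAT_SLOTS];

                if (!e->used)
                        return NULL; /* entries are never removed, so stop */
                if (e->private_nid == private_nid)
                        return e;
        }
        return NULL;
}
```

Keying on the private NID is what lets two VMs behind the same physical machine coexist: both entries carry the same external IP address but differ in key and external port.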

Detailed Design

Design

  • When the socklnd receives a hello message on a non-privileged port, it checks whether NAT support is on. If NAT support is off, the connection is closed and an appropriate failure message is logged.
  • If NAT support is on, then a mapping between the Private NID as KEY and the External IP/Port is maintained in a hash table.
  • The private NID of the NATed client is then passed on through lnet_parse() to LNet. It's important to note that the external (physical machine) IP address of the node is never exposed to the LNet layer.
  • LNet carries out its own processing and then responds or sends new messages, identifying the peer with the private NID.
  • Socklnd looks up the peer with the Private NID as the key and finds the External IP/Port in the mapping table and uses that to send back the message.
    • This step is not strictly necessary, because the socket is already created. However, it becomes important if the socket has been closed.
  • If the socket is closed and the non-NATed side needs to send a message to the same peer, then it can look up the External IP/Port and use those to create a connection.
    • The iptables section below describes the rules needed to make this step possible. In the absence of these rules the remote host will refuse the connection.
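The decisions in the steps above can be summarized in a short sketch. This is only an illustration of the control flow under stated assumptions: the enum and function names are hypothetical, the reserved port limit of 1023 is assumed, and the mapping table and the sockets themselves are reduced to boolean inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_RESERVED_PORT 1023u /* assumed highest privileged port */

enum sketch_hello_action {
        SKETCH_HELLO_ACCEPT,        /* privileged port: no NAT handling needed */
        SKETCH_HELLO_ACCEPT_MAPPED, /* record mapping, pass private NID to LNet */
        SKETCH_HELLO_REJECT         /* NAT support off: close and log failure */
};

enum sketch_send_action {
        SKETCH_SEND_EXISTING_CONN,  /* socket still open: reuse it */
        SKETCH_SEND_CONNECT_MAPPED, /* reconnect to the External IP/Port */
        SKETCH_SEND_CONNECT_NID     /* not a NATed peer: use the NID address */
};

/* Decide what to do with a hello message received on the passive side. */
static enum sketch_hello_action sketch_on_hello(uint16_t sock_port,
                                                bool nat_support)
{
        if (sock_port <= SKETCH_MAX_RESERVED_PORT)
                return SKETCH_HELLO_ACCEPT;
        return nat_support ? SKETCH_HELLO_ACCEPT_MAPPED : SKETCH_HELLO_REJECT;
}

/* Decide how to reach a peer that LNet identified by its private NID. */
static enum sketch_send_action sketch_on_send(bool conn_open, bool has_mapping)
{
        if (conn_open)
                return SKETCH_SEND_EXISTING_CONN;
        return has_mapping ? SKETCH_SEND_CONNECT_MAPPED
                           : SKETCH_SEND_CONNECT_NID;
}
```

The `SKETCH_SEND_CONNECT_MAPPED` case is the one that depends on the iptables rules: without them the reconnect attempt is refused by the remote host.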

iptables rules (or similar)

To allow the non-NATed peer to establish a connection with a NATed node, the following strategy needs to be followed. Note: this strategy only works after the initial connection is established by the NATed side.

  • Set up NAT for a specific VM to use a specific range of masquerade ports, for example 1024-1048.
    • This is needed if there are multiple VMs running on the host or if it's important to keep some ports open for other applications.
  • Set up IP forwarding to forward any messages directed to ports 1024-1048 to the VM's IP address.
    • Same reason as above.

Below is an example set of iptables rules. The relevant rules are marked with an asterisk (*).

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1    2015K 3972M ACCEPT     all  --  *      virbr0  0.0.0.0/0            192.168.122.0/24     ctstate RELATED,ESTABLISHED
2    1462K   91M ACCEPT     all  --  virbr0 *       192.168.122.0/24     0.0.0.0/0           
3        0     0 ACCEPT     all  --  virbr0 virbr0  0.0.0.0/0            0.0.0.0/0           
*4       21  1260 ACCEPT     tcp  --  *      *       0.0.0.0/0            192.168.122.100      tcp dpt:988
5       36  2160 REJECT     all  --  *      virbr0  0.0.0.0/0            0.0.0.0/0            reject-with icmp-port-unreachable
6        0     0 REJECT     all  --  virbr0 *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-port-unreachable

Chain PREROUTING (policy ACCEPT 2376 packets, 281K bytes)
num   pkts bytes target     prot opt in     out     source               destination         
*1       50  3000 DNAT       tcp  --  enp5s0 *       0.0.0.0/0            0.0.0.0/0            tcp dpts:1024:65535 to:192.168.122.100:988

Chain POSTROUTING (policy ACCEPT 4218 packets, 314K bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1    18428 1380K piavpn.anchors  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
2       65  4700 RETURN     all  --  *      *       192.168.122.0/24     224.0.0.0/24        
3        0     0 RETURN     all  --  *      *       192.168.122.0/24     255.255.255.255     
*4      168 10040 MASQUERADE  tcp  --  *      *       192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
5     6469  492K MASQUERADE  udp  --  *      *       192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
6        0     0 MASQUERADE  all  --  *      *       192.168.122.0/24    !192.168.122.0/24    

To explain, we'll start from the bottom up:

  • POSTROUTING rules apply to outgoing packets when a connection is being created. This is what enforces the address translation. The marked rule says:
    • For any TCP packet sourced from an address on 192.168.122.0/24 and destined to an address outside that network, masquerade the source using a port from the range 1024-65535.
    • This rule does not apply to socket traffic
  • PREROUTING rules apply to incoming packets before a connection is created. This is needed in order for nodes outside the NAT to connect to nodes inside the NAT. The marked rule says:
    • All TCP traffic arriving on interface enp5s0 and destined to a port in the range 1024:65535 should be forwarded to 192.168.122.100 (our VM) on port 988 (the port Lustre listens on).
      • This range is large, so it's advisable to set up a narrower range.
  • The marked FORWARD rule ACCEPTs packets destined to 192.168.122.100 port 988. If that rule isn't there, the packets will be rejected with icmp-port-unreachable.
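The effect of the three marked rules on a packet's addresses can be modeled with a small sketch. This is a simplified illustration, not how netfilter is implemented: interface matching and connection tracking are not modeled, the NAT port is passed in rather than chosen by the kernel, and all `sketch_*` and `SKETCH_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP(a, b, c, d) \
        (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
         ((uint32_t)(c) << 8) | (uint32_t)(d))

#define SKETCH_VM_NET    SKETCH_IP(192, 168, 122, 0)   /* 192.168.122.0/24 */
#define SKETCH_VM_MASK   0xFFFFFF00u
#define SKETCH_VM_IP     SKETCH_IP(192, 168, 122, 100) /* the NATed VM */
#define SKETCH_LNET_PORT 988u                          /* port Lustre listens on */
#define SKETCH_MASQ_MIN  1024u                         /* start of NAT port range */

struct sketch_tuple {
        uint32_t src_ip;
        uint16_t src_port;
        uint32_t dst_ip;
        uint16_t dst_port;
};

static bool sketch_in_vm_net(uint32_t ip)
{
        return (ip & SKETCH_VM_MASK) == SKETCH_VM_NET;
}

/* POSTROUTING rule 4: rewrite the source of traffic leaving the VM network. */
static struct sketch_tuple sketch_masquerade(struct sketch_tuple t,
                                             uint32_t host_ip,
                                             uint16_t nat_port)
{
        assert(nat_port >= SKETCH_MASQ_MIN); /* must lie in 1024-65535 */
        if (sketch_in_vm_net(t.src_ip) && !sketch_in_vm_net(t.dst_ip)) {
                t.src_ip = host_ip;
                t.src_port = nat_port;
        }
        return t;
}

/* PREROUTING rule 1: redirect ports 1024:65535 to the VM on port 988. */
static struct sketch_tuple sketch_dnat(struct sketch_tuple t)
{
        if (t.dst_port >= SKETCH_MASQ_MIN) {
                t.dst_ip = SKETCH_VM_IP;
                t.dst_port = SKETCH_LNET_PORT;
        }
        return t;
}

/* FORWARD rule 4: accept new connections to the VM on port 988. */
static bool sketch_forward_accepts_new(struct sketch_tuple t)
{
        return t.dst_ip == SKETCH_VM_IP && t.dst_port == SKETCH_LNET_PORT;
}
```

For example, a connection from the VM to a hypothetical server at 192.168.0.20 leaves with source 192.168.0.18 and the chosen NAT port. When that server later connects back to 192.168.0.18 on the same port, `sketch_dnat()` rewrites the destination to 192.168.122.100:988 and `sketch_forward_accepts_new()` returns true.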

Limitations

The first iteration of the solution has a few limitations:

  1. The NATed peer must always establish the connection. The non-NATed side cannot reach the NATed peer because it's hidden behind the address translation.
    1. From a Lustre perspective, this means it will only work for clients, since clients are the ones that establish the connection.
    2. Some traffic shaping rules need to be applied to allow the servers to re-establish a connection with the NATed client using the external IP and port used by the NATed peer initially.
  2. Multi-Rail is currently not supported with this solution.
    1. The mapping is from one NID to the NATed IP address/port. If MR is invoked the mapping will not work as it's currently implemented.

Support Matrix

It is recommended to apply this patch on the client and the servers.

The current implementation of this feature has the following support matrix.

Client     Server     Supported
NAT        Non-NAT    Supported
Non-NAT    NAT        Not Supported (1)
NAT        NAT        Not Supported

  1. The patch supports connections established from the NATed side to the non-NATed side. However, since Lustre clients always initiate the first connection through the mount command, a non-NATed client cannot initiate a connection to a NATed server, so this particular case is not supported. It is possible to support it with further changes and configuration.

Required Work

The current work is at the proof-of-concept stage. Before landing, the following changes need to be made.

  • Possibly support configuration to allow for support of Non-NAT clients with NAT servers.
    • This will not be implemented at this time as there is no use case, however:
      • it is potentially possible to configure the servers with a port range to use for NATed clients
      • iptables rules similar to the ones above need to be added to allow TCP message forwarding to the NATed clients.
    • A similar solution can be formulated to support NAT ↔ NAT communication. However, I'll leave this to be expanded on as the need arises.
  • Various code cleanups
  • Thorough test plan
  • Automate test plan with LUTF
  • Code reviews.