Overview
Network address translation (NAT) is a method of remapping an IP address space into another by modifying network address information in the IP header of packets while they are in transit across a traffic routing device.
In a NATed environment, when a TCP connection is established to a host outside the NAT, the socket created on the receiving end is bound to the external IP address of the NATing machine. For example, if a NATed VM runs on a node, the physical machine can have the address 192.168.0.18 while the VM has the internal address 192.168.122.100. When the VM communicates externally, the physical machine's address is the one that is externally visible. The source port is also mapped into a range enforced by iptables rules.
Problem Overview
When the socklnd of an LNet node behind a NAT establishes a connection to a peer, NAT remaps the node's IP address to the external IP address and maps the privileged port the socklnd uses to a non-privileged port. When the LNet peer receives the connection request, it uses the remote IP address of the socket to create the NID of the peer. The relevant code can be found in ksocknal_recv_hello():

    if (!active &&
        conn->ksnc_port > LNET_ACCEPTOR_MAX_RESERVED_PORT) {
            /* Userspace NAL assigns peer_ni process ID from socket */
            recv_id.pid = conn->ksnc_port | LNET_PID_USERFLAG;
            recv_id.nid = LNET_MKNID(LNET_NIDNET(ni->ni_nid), conn->ksnc_ipaddr);
    } else {
            recv_id.nid = hello->kshm_src_nid;
            recv_id.pid = hello->kshm_src_pid;
    }

Since NAT has mapped the port to one outside the reserved port range, the NID is created from the socket's IP address. This is necessary in order to re-establish connections to the initiating node, because the internal IP address is unreachable from outside the NATed environment.
The socklnd then forwards this NID to lnet_parse(), which processes the request properly. However, when LNet responds to the node, it uses the NID created from the socket, which contains the external NATed IP address. The node receives the response but promptly drops it because the destination doesn't match its configured NID.
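To make the failure concrete, here is a minimal, self-contained sketch of the ksocknal_recv_hello() branch quoted earlier. It does not use the real LNet headers: nid_t, mk_nid(), nid_net(), nid_addr(), ipv4(), struct hello_info, struct process_id and identify_peer() are hypothetical stand-ins, and the two constants only mirror what LNET_ACCEPTOR_MAX_RESERVED_PORT and LNET_PID_USERFLAG are assumed to be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define MAX_RESERVED_PORT 1023u        /* stand-in for LNET_ACCEPTOR_MAX_RESERVED_PORT */
#define PID_USERFLAG      0x80000000u  /* stand-in for LNET_PID_USERFLAG */

typedef uint64_t nid_t;                /* hypothetical: network in the high 32 bits, IPv4 address in the low 32 */

struct hello_info { nid_t src_nid; uint32_t src_pid; };  /* what the hello message carries */
struct process_id { nid_t nid; uint32_t pid; };          /* how the passive side names the peer */

nid_t mk_nid(uint32_t net, uint32_t addr) { return ((nid_t)net << 32) | addr; }
uint32_t nid_net(nid_t nid)  { return (uint32_t)(nid >> 32); }
uint32_t nid_addr(nid_t nid) { return (uint32_t)nid; }

uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* Passive-side peer identification, following the branch quoted earlier. */
struct process_id identify_peer(nid_t local_nid, uint32_t sock_ip,
				uint16_t sock_port,
				const struct hello_info *hello)
{
	struct process_id id;

	assert(hello != NULL);
	if (sock_port > MAX_RESERVED_PORT) {
		/* Non-privileged source port: trust the socket, not the hello. */
		id.pid = sock_port | PID_USERFLAG;
		id.nid = mk_nid(nid_net(local_nid), sock_ip);
	} else {
		/* Privileged source port: trust the NID the peer reported. */
		id.nid = hello->src_nid;
		id.pid = hello->src_pid;
	}
	return id;
}
```

With the example addresses from the overview, a hello that carries the VM's NID (address 192.168.122.100) but arrives from 192.168.0.18 on a masqueraded port yields a NID built from 192.168.0.18, so replies are addressed to a NID the VM does not own.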
Solution
The solution is to abstract the NATed address from the LNet layer. The LNet layer needs to maintain the correct NID for the NATed peer; otherwise, multiple VMs on the same physical machine would all use the same external IP address, and LNet would not be able to differentiate between their NIDs.
Fortunately, the active socklnd sends a hello message as part of connection establishment. This hello message contains the actual configured NID of the active side. The passive side of the connection can then map the active side's configured NID to the NATed IP address and port.
The active side's configured NID is passed up to LNet. LNet can continue its processing and then call the socklnd's send callback to send a message. The socklnd looks up the mapping with the node's configured NID as the key and uses the external IP address and port. If the connection already exists, it uses that socket. Otherwise, it can re-establish the connection using the external IP address and port the NATed peer used for the initial connection. This process is subject to the iptables rules described below.
Detailed Design
- When the socklnd receives a hello message from a non-privileged port, it checks whether NAT support is on. If NAT support is off, the connection is closed and an appropriate failure message is logged.
- If NAT support is on, a mapping with the private NID as the key and the external IP/port as the value is maintained in a hash table.
- The private NID of the NATed client is then passed on through lnet_parse() to LNet. It's important to note that the physical IP address of the node is never exposed to the LNet layer.
- LNet carries out its own processing and then responds or sends new messages, identifying the peer by the private NID.
- The socklnd looks up the peer with the private NID as the key, finds the external IP/port in the mapping table, and uses that to send the message back.
  - This step is not strictly necessary, because the socket is already created. However, it becomes important if the socket has been closed.
  - If the socket is closed and the non-NATed side needs to send a message to the same peer, it can look up the external IP/port and use those to create a connection.
  - The iptables section below describes the rules needed to make this step possible. In the absence of these rules the remote host will refuse the connection.
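The mapping steps above could be sketched roughly as follows. This is not the patch's code: struct nat_table, nat_table_set(), nat_table_get(), nat_accept_hello(), nat_table_free() and the bucket count are hypothetical names chosen for illustration, locking is omitted, and the reserved-port limit of 1023 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NAT_HASH_BUCKETS      64u
#define NAT_MAX_RESERVED_PORT 1023u  /* assumed privileged-port limit */

/* One mapping: private (configured) NID to the external IP/port seen on the socket. */
struct nat_entry {
	uint64_t private_nid;            /* key */
	uint32_t ext_ip;
	uint16_t ext_port;
	struct nat_entry *next;          /* bucket chain */
};

/* Must be zero-initialized before first use. */
struct nat_table {
	struct nat_entry *buckets[NAT_HASH_BUCKETS];
};

unsigned int nat_hash(uint64_t nid)
{
	return (unsigned int)((nid ^ (nid >> 32)) % NAT_HASH_BUCKETS);
}

/* Look up the external address for a private NID; NULL if none is recorded. */
const struct nat_entry *nat_table_get(const struct nat_table *t, uint64_t nid)
{
	const struct nat_entry *e;

	assert(t != NULL);
	for (e = t->buckets[nat_hash(nid)]; e != NULL; e = e->next)
		if (e->private_nid == nid)
			return e;
	return NULL;
}

/* Insert or update a mapping. Returns 0 on success, -1 if allocation fails. */
int nat_table_set(struct nat_table *t, uint64_t nid, uint32_t ip, uint16_t port)
{
	unsigned int b = nat_hash(nid);
	struct nat_entry *e;

	assert(t != NULL);
	for (e = t->buckets[b]; e != NULL; e = e->next)
		if (e->private_nid == nid)
			break;
	if (e == NULL) {
		e = malloc(sizeof(*e));
		if (e == NULL)
			return -1;
		e->private_nid = nid;
		e->next = t->buckets[b];
		t->buckets[b] = e;
	}
	e->ext_ip = ip;
	e->ext_port = port;
	return 0;
}

/*
 * Passive-side handling of a hello. Returns 0 if the connection may proceed
 * and -1 if the caller should close it (NAT support off, or out of memory).
 */
int nat_accept_hello(struct nat_table *t, int nat_enabled, uint64_t hello_nid,
		     uint32_t sock_ip, uint16_t sock_port)
{
	if (sock_port <= NAT_MAX_RESERVED_PORT)
		return 0;               /* privileged port: nothing to record */
	if (!nat_enabled)
		return -1;
	return nat_table_set(t, hello_nid, sock_ip, sock_port);
}

/* Release every entry and leave the table empty. */
void nat_table_free(struct nat_table *t)
{
	unsigned int b;

	assert(t != NULL);
	for (b = 0; b < NAT_HASH_BUCKETS; b++) {
		while (t->buckets[b] != NULL) {
			struct nat_entry *e = t->buckets[b];

			t->buckets[b] = e->next;
			free(e);
		}
	}
}
```

Keying on the private NID is what lets two VMs behind the same external address stay distinct: each gets its own entry, differing only in the external port. A later hello from the same NID simply overwrites the stored port.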
iptables rules (or similar)
In order to allow the non-NATed peer to establish a connection with a NATed node, the following strategy needs to be followed. Note: this strategy only works after the NATed side has established the initial connection.
- Set up NAT for a specific VM to use a specific range of masquerade ports, e.g. 1024-1048.
  - This is needed if there are multiple VMs running on the host or if it's important to keep some ports open for other applications.
- Set up IP forwarding to forward any messages directed to ports 1024-1048 to the VM's IP address.
  - Same reason as above.
Below is an example set of iptables rules. The relevant rules are marked with an asterisk (*).

    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    num   pkts bytes target          prot opt in      out     source            destination
    1    2015K 3972M ACCEPT          all  --  *       virbr0  0.0.0.0/0         192.168.122.0/24    ctstate RELATED,ESTABLISHED
    2    1462K   91M ACCEPT          all  --  virbr0  *       192.168.122.0/24  0.0.0.0/0
    3        0     0 ACCEPT          all  --  virbr0  virbr0  0.0.0.0/0         0.0.0.0/0
    *4      21  1260 ACCEPT          tcp  --  *       *       0.0.0.0/0         192.168.122.100     tcp dpt:988
    5       36  2160 REJECT          all  --  *       virbr0  0.0.0.0/0         0.0.0.0/0           reject-with icmp-port-unreachable
    6        0     0 REJECT          all  --  virbr0  *       0.0.0.0/0         0.0.0.0/0           reject-with icmp-port-unreachable

    Chain PREROUTING (policy ACCEPT 2376 packets, 281K bytes)
    num   pkts bytes target          prot opt in      out     source            destination
    *1      50  3000 DNAT            tcp  --  enp5s0  *       0.0.0.0/0         0.0.0.0/0           tcp dpts:1024:65535 to:192.168.122.100:988

    Chain POSTROUTING (policy ACCEPT 4218 packets, 314K bytes)
    num   pkts bytes target          prot opt in      out     source            destination
    1    18428 1380K piavpn.anchors  all  --  *       *       0.0.0.0/0         0.0.0.0/0
    2       65  4700 RETURN          all  --  *       *       192.168.122.0/24  224.0.0.0/24
    3        0     0 RETURN          all  --  *       *       192.168.122.0/24  255.255.255.255
    *4     168 10040 MASQUERADE      tcp  --  *       *       192.168.122.0/24  !192.168.122.0/24   masq ports: 1024-65535
    5     6469  492K MASQUERADE      udp  --  *       *       192.168.122.0/24  !192.168.122.0/24   masq ports: 1024-65535
    6        0     0 MASQUERADE      all  --  *       *       192.168.122.0/24  !192.168.122.0/24

To explain, we'll start from the bottom up:
- POSTROUTING rules apply to outgoing packets, when a socket is being created. This is what enforces the address translation. It's saying:
  - Any TCP packet sourced from an address on 192.168.122.0/24 and destined to an address not on that network is masqueraded and given a source port from the range 1024-65535.
  - This rule does not apply to socket traffic
- PREROUTING rules apply to incoming packets before a socket is created. This is needed in order for nodes outside the NAT to connect to nodes inside the NAT. It's saying:
  - All TCP traffic incoming on interface enp5s0 and destined to a port in the range 1024:65535 is forwarded to 192.168.122.100 (our VM) on port 988 (the port Lustre listens on).
    - This range is large, so it's advisable to set up a narrower range.
- The FORWARD rule ACCEPTs packets destined to 192.168.122.100 port 988. If that rule isn't there, the packets will be rejected with icmp-port-unreachable.
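As a rough model of how the three starred rules cooperate, the sketch below applies them to a simplified packet. It is a toy, not netfilter code: struct packet, struct nat_rules, IPV4() and the three rule functions are hypothetical, the interface match on enp5s0 and connection tracking are not modeled, and the masqueraded port is passed in by the caller because the kernel's own port selection is out of scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IPV4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

struct packet {
	uint32_t src_ip;
	uint16_t src_port;
	uint32_t dst_ip;
	uint16_t dst_port;
};

/* Hypothetical container for the values that appear in the starred rules. */
struct nat_rules {
	uint32_t subnet;    /* NATed network, e.g. 192.168.122.0 */
	uint32_t mask;      /* e.g. 255.255.255.0 */
	uint32_t host_ip;   /* externally visible address of the host */
	uint32_t vm_ip;     /* DNAT target */
	uint16_t vm_port;   /* port Lustre listens on (988) */
	uint16_t port_lo;   /* masquerade / forwarding port range */
	uint16_t port_hi;
};

bool in_subnet(const struct nat_rules *r, uint32_t ip)
{
	return (ip & r->mask) == r->subnet;
}

bool in_range(const struct nat_rules *r, uint16_t port)
{
	return port >= r->port_lo && port <= r->port_hi;
}

/* POSTROUTING rule 4: rewrite the source of traffic leaving the NATed network. */
bool masquerade(const struct nat_rules *r, struct packet *p, uint16_t chosen_port)
{
	assert(in_range(r, chosen_port));
	if (!in_subnet(r, p->src_ip) || in_subnet(r, p->dst_ip))
		return false;
	p->src_ip = r->host_ip;
	p->src_port = chosen_port;
	return true;
}

/* PREROUTING rule 1: redirect traffic aimed at the port range to the VM. */
bool dnat(const struct nat_rules *r, struct packet *p)
{
	if (!in_range(r, p->dst_port))
		return false;
	p->dst_ip = r->vm_ip;
	p->dst_port = r->vm_port;
	return true;
}

/* FORWARD rule 4: accept only what is now addressed to the VM's Lustre port. */
bool forward_accept(const struct nat_rules *r, const struct packet *p)
{
	return p->dst_ip == r->vm_ip && p->dst_port == r->vm_port;
}
```

In this model, a connection from the VM leaves with the host's address and a port from the range, and a later connection from the server to that same external address and port is rewritten to the VM on port 988 and accepted. A packet aimed at a port outside the range is left untouched and does not match the FORWARD rule.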
Limitations
The first iteration of the solution has a few limitations:
- The NATed peer must always establish the initial connection. The non-NATed side cannot otherwise reach the NATed peer because it's hidden behind the address translation.
  - From a Lustre perspective, the implication is that it only works for clients, since clients are the ones that establish the connection.
  - Some traffic shaping rules need to be applied to allow the servers to re-establish a connection with the NATed client using the external IP and port the NATed peer used initially.
- Multi-Rail is currently not supported with this solution.
  - The mapping is from one NID to the NATed IP address/port. If MR is invoked, the mapping will not work as it's currently implemented.
Support Matrix
It is recommended to apply this patch on the client and the servers.
The current implementation of this feature has the following support matrix.
Client | Server | Status |
---|---|---|
NAT | Non-NAT | Supported |
Non-NAT | NAT | Not Supported (1) |
NAT | NAT | Not Supported |
1. The patch supports connection establishment from the NATed side to the non-NATed side (NAT → Non-NAT). Since Lustre clients always initiate the first connection through the mount command, a non-NATed client would have to connect to the NATed server first, so this particular case is not supported. However, it is possible to support it with further changes and configuration.
Required Work
The current work is at the proof-of-concept stage. Before landing, the following changes need to be made:
- Possibly support configuration to allow Non-NAT clients with NAT servers.
  - This will not be implemented at this time as there is no use case, however:
    - It is potentially possible to configure the servers with a port range to use for NATed clients.
    - iptables rules similar to those above need to be added to allow TCP message forwarding to the NATed clients.
- A similar solution can be formulated to support NAT ↔ NAT communication. However, I'll leave this to be expanded on as the need arises.
- Various code cleanups
- Thorough test plan
- Automate test plan with LUTF
- Code reviews.