February 25, 2013

Failure of vShield Edge NAT/VPN Traffic Post-5.1 Upgrade

UPDATE: Turns out this is a known issue during the 1.5 > 5.1 VSM upgrade and a fix should be released in an upcoming patch.

That's about the shortest title I could think of to be descriptive of this issue. TLDR is that NAT rules on vShield Edge appliances appear to be causing unexpected behavior on VPN traffic after a vCloud upgrade from 1.5 to 5.1.
Background: We recently upgraded from 1.5 to 5.1. For most of our vDCs, we simply have a single vSE/Routed network that connects a private subnet to a "WAN" network and pulls a public IP from a pool. We forward (NAT) and allow (firewall) selected ports (e.g. 3389 for RDP) to virtual machines. Most of these networks also have a site-to-site VPN tunnel with a physical firewall across the internet. After the upgrade, we went and converted our rules to match on original IP and then enabled "multiple interfaces" - effectively taking them out of compatibility mode. Everything looked good (even for the vSE devices still in compatibility mode)
Issue: We first noticed this when a client reported that they could not access a virtual machine via RDP using it's internal (VSE protected) IP across a VPN tunnel, but could access the VM via RDP using it's public hostname/IP address. We allow all traffic across the VPN (firewall has an any:any rule for VPN traffic). When we logged in to troubleshoot (simply thinking the VPN was down), we found that we could connect to any port on the remote VM across the VPN tunnel except 3389. I could ping from the local subnet to the troubled VM on the vApp network with no problem. I could connect to other ports that were open on the remote VM with no problem. I could not connect to 3389 across the VPN.
We thought it might be isolated, but found the issue on every VSE we have: If there existed a DNAT rule to translate inbound traffic for a particular port, that port would be unresponsive when traffic traversed the VPN tunnel destined for the target of the DNAT rule.

While vCloud Director doesn't show anything strange in the firewall section of vSE configuration, if you log in to vShield Manager and look at the firewall rules there, a "Deny" rule with the private/internal/translated IP is added for any NAT rule that exists:


This, I'm assuming, is for security reasons during the upgrade but it does not show up in vCloud Director (thus our confusion). After taking our appliances out of compatibility mode post-upgrade, the rules were still there.

Solution:  After the vSE is out of compatibility mode (see pg. 49 of the vCD 5.1 Install Guide), re-apply the service configuration (Right-Click vShield Edge Appliance in vCloud Director and select "Re-Apply Service Configuration"). You can also re-deploy the appliance or add an arbitrary rule to the firewall list - both appear to have the same effect.

Red Flags and the Value of Experience

One of the things I hear often said, and something I subscribe to as well, is the idea that a lot of technical knowledge in the world of IT ...