Digging Deeper into VXLAN

http://blogs.cisco.com/datacenter/digging-deeper-into-vxlan/

Yes, I am still talking about VXLAN -- or rather, you folks are still talking about VXLAN -- so I thought it's worthwhile digging deeper into the topic, since there is so much interest out there.  There also still seem to be a fair number of misconceptions around VXLAN, so let's see what we can do to clear things up.

This time around, I have some partners in crime for the discussion:

Larry Kreeger is currently a Principal Engineer at Cisco Systems' SAVTG, working on the Nexus 1000V architecture. Larry has a wide-ranging background in networking accumulated over more than 25 years of experience developing networking products.  His recent focus is data center networking, especially as it relates to data center virtualization.

Ajit Sanzgiri has worked on various networking technologies at Cisco and other Bay Area networking companies over the last 16 years. His interests include hardware-based switching and routing solutions, Ethernet and wireless LANs, and virtual networking. Currently he works on the Nexus 1000V and related network virtualization products.

So, Larry and Ajit have put together this VXLAN primer--it's fairly dense stuff, so we are breaking this into three posts.  In this initial post, we'll cover the basics--why VXLANs and what VXLAN is.  I know I've covered this to some degree already, but Larry and Ajit are going to dig a little deeper, which will hopefully help clarify the lingering questions and misconceptions.  In the next post, we'll discuss how VXLAN compares with the other tools in your networking arsenal, and, in the final post, we'll cover more of the common questions we are seeing.

1 Why VXLANs ?

VLANs have been used in networking infrastructures for many years now to solve different problems. They can be used to enforce L2 isolation, as policy enforcement points, and as routing interface identifiers. Network services like firewalls have used them in novel ways for traffic steering purposes.

Support for VLANs is now available in most operating systems, NICs, network equipment (e.g. switches, routers, firewalls etc.) and also in most virtualization solutions. As virtualized data centers proliferate and grow, some shortcomings of the VLAN technology are beginning to make themselves felt. Cloud providers need some extensions to the basic VLAN mechanism if these are to be overcome.

The first is the VLAN namespace itself. 802.1Q specifies a VLAN ID to be 12 bits, which restricts the number of VLANs in a single switched L2 domain to 4096 at best. (Usually some VLAN IDs are reserved for ‘well-known’ uses, which restricts the range further.) Cloud provider environments require accommodating different tenants in the same underlying physical infrastructure. Each tenant may in turn create multiple L2/L3 networks within their own slice of the virtualized data center. This drives the need for a greater number of L2 networks.

The second issue has to do with the operational model for deploying VLANs. Although VTP exists as a protocol for creating, disseminating and deleting VLANs as well as for pruning them for optimal extent, most networks disable it. That means some sort of manual coordination is required among the network admin, the cloud admin and the tenant admin to transport VLANs over existing switches. Any proposed extension to VLANs must figure out a way to avoid such coordination. To be more precise, adding each new L2 network must not require incremental config changes in the transport infrastructure.

Third, VLANs today are too restrictive for virtual data centers in terms of the physical constraints of distance and deployment. The new standard should ideally be free (at least ‘freer’) of these constraints. This would allow data centers more flexibility in distributing workloads, for instance across L3 boundaries.

Finally, any proposed extension to the VLAN mechanism should not necessarily require a wholesale replacement of existing network gear. The reason for this should be self-evident.

VXLAN is the proposed technology to support these requirements.

2 What are VXLANs ?

2.1  What’s in a name?

As the name VXLANs (Virtual eXtensible LANs) implies, the technology is meant to provide the same services to connected Ethernet end systems that VLANs do today, but in a more extensible manner.  Compared to VLANs, VXLANs are extensible with regard to scale, and extensible with regard to the reach of their deployment.

As mentioned, the 802.1Q VLAN Identifier space is only 12 bits.  The VXLAN Identifier space is 24 bits.  Doubling the number of bits multiplies the ID space by 4,096 (an increase of over 400,000 percent), to over 16 million unique identifiers.  This should provide sufficient room for expansion for years to come.
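A quick arithmetic check of the two identifier spaces (a back-of-the-envelope sketch, not tied to any particular implementation):

```python
# 802.1Q VLAN IDs are 12 bits; VXLAN Network Identifiers (VNIs) are 24 bits.
vlan_ids = 2 ** 12            # 4,096 (a few of these are reserved in practice)
vxlan_ids = 2 ** 24           # 16,777,216
print(vlan_ids, vxlan_ids, vxlan_ids // vlan_ids)   # 4096 16777216 4096
```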

VXLANs use Internet Protocol (both unicast and multicast) as the transport medium.  The ubiquity of IP networks and equipment allows the end to end reach of a VXLAN segment to be extended far beyond the typical reach of VLANs using 802.1Q today.  There is no denying that there are other technologies that can extend the reach of VLANs (Cisco FabricPath/TRILL is just one), but none are as ubiquitously deployed as IP.

2.2  Protocol Design Considerations

When it comes to networking, not every problem can be solved with the same tool.  Specialized tools are optimized for specific environments (e.g. WAN, MAN, Campus, Datacenter). In designing the operation of VXLANs, the following deployment environment characteristics were considered.  These characteristics are based on large datacenters hosting highly virtualized workloads providing Infrastructure as a Service offerings.

  1. Highly distributed systems.  VXLANs should work in an environment where there could be many thousands of networking nodes (and many more end systems connected to them).  The protocol should work without requiring a centralized control point or a hierarchy of protocols.
  2. Many highly distributed segments with sparse connectivity.   Each VXLAN segment could be highly distributed among the networking nodes.  Also, with so many segments, the number of end systems connected to any one segment is expected to be relatively low, and therefore the percentage of networking nodes participating in any one segment would also be low.
  3. Highly dynamic end systems.  End systems connected to VXLANs can be very dynamic, both in terms of creation/deletion/power-on/off and in terms of mobility across the network nodes.
  4. Work with existing, widely deployed network equipment. This translates into Ethernet switches and IP routers.
  5. Network infrastructure administered by a single administrative domain.  This is consistent with operation within a datacenter, and not across the internet.
  6. Low network node overhead / simple implementation.  With the requirement to support very large numbers of network nodes, the resource requirements on each node should not be intensive in terms of either memory footprint or processing cycles.  This also means consideration for hardware offload.

2.3  How does it work?

The VXLAN draft defines the VXLAN Tunnel End Point (VTEP) which contains all the functionality needed to provide Ethernet layer 2 services to connected end systems.  VTEPs are intended to be at the edge of the network, typically connecting an access switch (virtual or physical) to an IP transport network.  It is expected that the VTEP functionality would be built into the access switch, but it is logically separate from the access switch. The figure below depicts the relative placement of the VTEP function.

Each end system connected to the same access switch communicates through the access switch.  The access switch acts as any learning bridge does, by flooding out its ports when it doesn't know the destination MAC, or sending out a single port when it has learned which direction leads to the end station, as determined by source MAC learning.  Broadcast traffic is sent out all ports.  Further, the access switch can support multiple “bridge domains”, which are typically identified as VLANs with an associated VLAN ID that is carried in the 802.1Q header on trunk ports.  In the case of a VXLAN enabled switch, the bridge domain would instead be associated with a VXLAN ID.

Each VTEP function has two interfaces.  One is a bridge domain trunk port to the access switch, and the other is an IP interface to the IP network.  The VTEP behaves as an IP host to the IP network.  It is configured with an IP address based on the subnet its IP interface is connected to.  The VTEP uses this IP interface to exchange IP packets carrying the encapsulated Ethernet frames with other VTEPs.  A VTEP also acts as an IP host by using the Internet Group Management Protocol (IGMP) to join IP multicast groups.
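To make the encapsulation concrete, here is a minimal sketch of how a VTEP might build the VXLAN portion of an outgoing packet. The 8-byte header layout (a flags byte with the I bit set, reserved fields, and the 24-bit VXLAN ID) follows the draft; the UDP destination port, the addresses, and the use of a plain UDP socket are illustrative assumptions, since the well-known port had not yet been assigned when this was written.

```python
import socket
import struct

def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header to an inner Ethernet frame.

    Header layout per the draft:
      byte 0    : flags (0x08 = 'I' bit set, meaning the VNI field is valid)
      bytes 1-3 : reserved
      bytes 4-6 : 24-bit VXLAN Network Identifier (VNI)
      byte 7    : reserved
    """
    flags = 0x08
    header = struct.pack("!B3xI", flags, vni << 8)  # VNI in the top 24 bits
    return header + inner_frame

# Illustrative use: ship the encapsulated frame to a remote VTEP over UDP.
VXLAN_PORT = 4789               # assumption; the draft's port was still unassigned
REMOTE_VTEP = "192.0.2.10"      # example address of the destination VTEP
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(vxlan_encapsulate(5000, b"\x00" * 64), (REMOTE_VTEP, VXLAN_PORT))
```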

In addition to the VXLAN ID carried over the IP interface between VTEPs, each VXLAN is associated with an IP multicast group.  The IP multicast group is used as a communication bus between the VTEPs to carry broadcast, multicast and unknown unicast frames to every VTEP participating in the VXLAN at a given moment in time.  This is illustrated in the figure below.

The VTEP function also works the same way as a learning bridge, in that if it doesn’t know where a given destination MAC is, it floods the frame, but it performs this flooding function by sending the frame to the VXLAN’s associated multicast group.  Learning is similar, except instead of learning the source interface associated with a frame’s source MAC, it learns the encapsulating source IP address.  Once it has learned this MAC to remote IP association, frames can be encapsulated within a unicast IP packet directly to the destination VTEP.
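A minimal sketch of that learning and flooding behavior (the class and method names are illustrative, not from the draft): on decapsulation the VTEP learns the inner source MAC against the outer source IP, and on encapsulation it unicasts to a learned VTEP address or falls back to the VXLAN's multicast group.

```python
class VtepForwarder:
    """Sketch of per-VXLAN MAC learning at a VTEP (illustrative only)."""

    def __init__(self, vni: int, mcast_group: str):
        self.vni = vni
        self.mcast_group = mcast_group
        self.mac_to_vtep = {}              # inner source MAC -> remote VTEP IP

    def learn(self, inner_src_mac: str, outer_src_ip: str) -> None:
        # Called when decapsulating a frame received from the IP network:
        # instead of learning a source *interface*, learn the encapsulating
        # source IP address of the remote VTEP.
        self.mac_to_vtep[inner_src_mac] = outer_src_ip

    def next_hop(self, inner_dst_mac: str) -> str:
        # Known unicast goes straight to the learned VTEP; unknown unicast,
        # broadcast and multicast are flooded via the VXLAN's multicast group.
        return self.mac_to_vtep.get(inner_dst_mac, self.mcast_group)

# After a frame from aa:bb:cc:dd:ee:01 arrives from VTEP 10.1.1.1, traffic
# toward that MAC is unicast-encapsulated instead of flooded.
fwd = VtepForwarder(vni=5000, mcast_group="239.1.1.1")
fwd.learn("aa:bb:cc:dd:ee:01", "10.1.1.1")
print(fwd.next_hop("aa:bb:cc:dd:ee:01"))   # 10.1.1.1  (known unicast)
print(fwd.next_hop("aa:bb:cc:dd:ee:99"))   # 239.1.1.1 (flood via multicast)
```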

The initial use case for VXLAN enabled access switches is access switches connected to end systems that are Virtual Machines (VMs).  These switches are typically tightly integrated with the hypervisor.  One benefit of this tight integration is that the virtual access switch knows exactly when a VM connects to or disconnects from the switch, and what VXLAN the VM is connected to.  Using this information, the VTEP can decide when to join or leave a VXLAN's multicast group.  When the first VM connects to a given VXLAN, the VTEP can join the multicast group and start receiving broadcasts/multicasts/floods over that group.  Similarly, when the last VM connected to a VXLAN disconnects, the VTEP can use IGMP to leave the multicast group and stop receiving traffic for the VXLAN, which now has no local receivers.
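A sketch of that join/leave bookkeeping, assuming the virtual access switch notifies the VTEP whenever a VM attaches to or detaches from a VXLAN (the class and callback names are hypothetical):

```python
from collections import defaultdict

class VxlanMembership:
    """Track local VMs per VXLAN; join/leave the multicast group at the edges."""

    def __init__(self, igmp_join, igmp_leave):
        self.vm_count = defaultdict(int)
        self.igmp_join = igmp_join         # callable(group): issue an IGMP join
        self.igmp_leave = igmp_leave       # callable(group): issue an IGMP leave

    def vm_connected(self, vni: int, group: str) -> None:
        self.vm_count[vni] += 1
        if self.vm_count[vni] == 1:        # first local VM on this VXLAN
            self.igmp_join(group)

    def vm_disconnected(self, vni: int, group: str) -> None:
        self.vm_count[vni] -= 1
        if self.vm_count[vni] == 0:        # last local VM has left this VXLAN
            self.igmp_leave(group)

# Usage sketch: printing stands in for sending real IGMP messages.
m = VxlanMembership(igmp_join=print, igmp_leave=print)
m.vm_connected(5000, "239.1.1.1")      # joins 239.1.1.1
m.vm_disconnected(5000, "239.1.1.1")   # leaves 239.1.1.1
```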

Note that because the potential number of VXLANs (16M!) could exceed the amount of multicast state supported by the IP network, multiple VXLANs could potentially map to the same IP multicast group.  While this could result in VXLAN traffic being sent needlessly to a VTEP that has no end systems connected to that VXLAN, inter VXLAN traffic isolation is still maintained.  The same VXLAN Id is carried in multicast encapsulated packets as is carried in unicast encapsulated packets.  It is not the IP network’s job to keep the traffic to the end systems isolated, but the VTEP’s.  Only the VTEP inserts and interprets/removes the VXLAN header within the IP/UDP payload. The IP network simply sees IP packets carrying UDP traffic with a well-known destination UDP port.
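A sketch of that many-to-one mapping: if the operator provisions only a small pool of multicast groups, several VXLAN IDs hash onto the same group, yet the receiving VTEP still enforces isolation by checking the VXLAN ID carried in the header. The pool size and mapping function below are illustrative assumptions.

```python
# Assume the operator provisions far fewer multicast groups than VXLAN IDs.
MCAST_POOL = ["239.1.1.%d" % i for i in range(1, 17)]   # 16 groups, illustrative

def group_for_vni(vni: int) -> str:
    return MCAST_POOL[vni % len(MCAST_POOL)]

# VNIs 5000 and 5016 land on the same multicast group...
assert group_for_vni(5000) == group_for_vni(5016)

# ...but isolation is enforced at the VTEP: a decapsulated frame is delivered
# only if its VNI matches a VXLAN with locally connected end systems.
local_vnis = {5000}

def deliver(vni_in_header: int) -> bool:
    return vni_in_header in local_vnis

print(deliver(5000))   # True  - local receivers exist
print(deliver(5016))   # False - received needlessly, dropped by the VTEP
```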

So, that was the first installment--if you have questions, post them as comments and we’ll get back to you.

Hey folks--this is the second of three posts looking a little more closely at VXLAN.  If you missed the first post, you can find it here.  In this installment we are going to look at some of the other options out there.  Two of the most common questions we see are “why do I need yet another protocol?” and “can I now get rid of X?”  This should help you answer these questions.  So, let's dig in…

3 Comparison with other technologies

3.1 Overlay Transport Virtualization (OTV)

If one were to look carefully at the encapsulation format of VXLAN, one might notice that it is actually a subset of the IPv4 OTV encapsulation in draft-hasmit-otv-03, except that the Overlay ID field is not used (and made reserved) and the well-known destination UDP port is not yet allocated by IANA (but will be different).

If one were to look even closer, they would notice that OTV is actually a subset of the IPv4 LISP encapsulation, but carrying an Ethernet payload instead of an IP payload.

Using a common (overlapping) encapsulation for all these technologies simplifies the design of hardware forwarding devices and prevents reinvention for its own sake.

Given that the packet on the wire is very similar between VXLAN and OTV, what is different?  OTV was designed to solve a different problem.  OTV is meant to be deployed on aggregation devices (the ones at the top of a structured hierarchy of 802.1Q switches) to interconnect all (up to 4094) VLANs in one hierarchy with others, either in the same or in another datacenter, creating a single stretched 4K VLAN domain.  It is optimized to operate over the capital-I Internet as a Data Center Interconnect.  Cisco's latest version is able to interconnect datacenters without relying on IP multicast, which is not always available across the Internet.  It prevents flooding of unknown destinations across the Internet by advertising MAC address reachability using routing protocol extensions (namely IS-IS).  Each OTV device peers with the others using IS-IS.  There is expected to be a limited number of these OTV devices peering with each other over IS-IS (because of where they are placed -- at a layer 2 aggregation point).  Within a given layer 2 domain below this aggregation point, there are still only 4K VLANs available, so OTV does not create more layer 2 network segments.  Instead it extends the existing ones over the Internet.

Since VXLAN is designed to be run within a single administrative domain (e.g. a datacenter), and not across the Internet, it is free to use Any Source Multicast (ASM) (a.k.a. (*,G) forwarding) to flood unknown unicasts.  Since a VXLAN VTEP may be running in every host in a datacenter, it must scale to numbers far beyond what IS-IS was designed to scale to.

Note that OTV can be complementary to VXLANs as a Data Center Interconnect.  This is helpful in two ways.  For one, the entire world is not poised to replace VLANs with VXLANs any time soon.  All physical networking equipment supports VLANs.  The first implementations of VXLANs will be only in virtual access switches (the ones Virtual Machines connect to), so this means that only VMs can connect to VXLANs.  If a VM wants to talk with a physical device such as a physical server, layer 3 switch, router, physical network appliance, or even a VM running on a hypervisor that does not support a VXLAN enabled access switch -- then it must use a VLAN.  So, if you have a VM that wants to talk with something out on the Internet…it must go through a router, and that router will communicate with the VM over a VLAN.  Given that some VMs will still need to connect to VLANs, VLANs will still exist, and if layer 2 adjacency is desired across datacenters, then OTV works well to interconnect them.  The layer 2 extension provided by OTV can be used not just to interconnect VLANs with VMs and physical devices connected to them, but also by VTEPs.  Since VTEPs require the use of ASM forwarding, and this may not be available across the Internet, OTV can be used to extend the transport VLAN(s) used by the VTEPs across the Internet between multiple datacenters.

3.2 MAC-in-GRE

Why did VXLANs use a MAC-in-UDP encapsulation instead of MAC-in-GRE?  The easy answer is to say, for the same reasons OTV and LISP use UDP instead of GRE.  The reality of the world is that the vast majority (if not all) switches and routers do not parse deeply into GRE packets for applying policies related to load distribution (Port Channel and ECMP load spreading) and security (ACLs).

Let's start with load distribution.  Port Channels (or Cisco's Virtual Port Channels) are used to aggregate the bandwidth of multiple physical links into one logical link.  This technology is used both at access ports and on inter-switch trunks.  Switches using Cisco's FabricPath can get even greater cross-sectional bandwidth by combining Port Channels with ECMP forwarding -- but only if the switches can identify flows (this is to prevent out-of-order delivery, which can kill L4 performance).  If one of today's switches were to try to distribute traffic between two VTEPs that used a GRE encapsulation, all the traffic would be polarized to use only one link within these Port Channels.  Why?  Because the physical switches only see two IP endpoints communicating, and cannot parse the GRE header to identify the individual flows from each VM.  Fortunately, these same switches all support parsing of UDP all the way to the UDP source and destination port numbers.  By configuring the switches to use the hash of source IP/dest IP/L4 protocol/source L4 port/dest L4 port (typically referred to as a 5-tuple), they can spread each UDP flow out to a different link of a Port Channel or ECMP route.  While VXLAN does use a well-known destination UDP port, the source UDP port can be any value.  A smart VTEP can spread all the VMs' 5-tuple flows over many source UDP ports.  This allows the intermediate switches to spread the multiple flows (even between the same two VMs!) out over all the available links in the physical network.  This is an important feature for data center network design.  Note that this does not apply just to layer 2 switches; since VXLAN traffic is IP and can cross routers as well, it applies to ECMP IP routing in the core too.
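A sketch of the source-port trick described above: hash the inner flow's identifiers into the outer UDP source port so that switches and routers hashing on the outer 5-tuple spread the flows across Port Channel and ECMP links. The hash function and the ephemeral port range below are illustrative choices, not something the draft mandates.

```python
import zlib

def outer_udp_source_port(inner_flow: tuple) -> int:
    """Derive the outer UDP source port from the inner flow's identifiers.

    inner_flow could be the inner 5-tuple for IP traffic or (src MAC, dst MAC)
    for non-IP traffic; any stable per-flow value works. Staying in the
    ephemeral range avoids colliding with well-known ports.
    """
    h = zlib.crc32(repr(inner_flow).encode())
    return 49152 + (h % 16384)             # 49152-65535

# Two different flows between the same pair of VMs get different outer source
# ports, so switches hashing on the outer 5-tuple can place them on different
# links of a Port Channel or ECMP route.
flow_a = ("10.0.0.1", "10.0.0.2", 6, 33000, 80)
flow_b = ("10.0.0.1", "10.0.0.2", 6, 33001, 443)
print(outer_udp_source_port(flow_a), outer_udp_source_port(flow_b))
```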

Note that MAC-in-GRE based schemes can perform a similar trick to the one mentioned above by creating flow-based entropy within a sub-portion of the GRE Key (as opposed to the source UDP port), but it is a moot point unless all the switches and routers along the path can parse the GRE Key field and use it to generate a hash for Port Channel / ECMP load distribution.

Next comes security.  As soon as you start carrying your layer 2 traffic over IP routers, you open yourself up to packet injection onto a layer 2 segment from anywhere there is IP access…unless you use firewalls and/or ACLs to protect the VXLAN traffic.  Similar to the load balancing issue above, if GRE is used, firewalls and layer 3 switches and routers with ACLs will typically not parse deeply enough into the GRE header to differentiate one type of tunneled traffic from another.  This means all GRE would need to be blocked indiscriminately.  Since VXLAN uses UDP with a well-known destination port, firewalls and switch/router ACLs can be tailored to block only VXLAN traffic.
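The difference for filtering can be sketched in a few lines: a simple classifier can single out VXLAN by its UDP destination port, while GRE offers no equivalent field at that depth, so it can only be permitted or denied wholesale. The port value is an assumption for illustration.

```python
VXLAN_PORT = 4789   # assumption; use whatever well-known port the deployment adopts

def classify(ip_proto, udp_dst=None):
    if ip_proto == 47:                        # GRE: no port to tell tunnels apart,
        return "gre (block or allow it all)"  # so ACLs act on all GRE traffic
    if ip_proto == 17 and udp_dst == VXLAN_PORT:
        return "vxlan (can be matched precisely)"
    return "other"

print(classify(47))          # gre (block or allow it all)
print(classify(17, 4789))    # vxlan (can be matched precisely)
print(classify(17, 53))      # other (e.g. DNS, left untouched)
```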

Note that one downside to any encapsulation approach, whether it is based on UDP or GRE, is that when the hypervisor software adds an encapsulation, today's NICs and/or NIC drivers do not have a mechanism to be informed about the presence of the encapsulation for performing NIC hardware offloads.  Either encapsulation method would benefit from NIC vendors updating their NICs and/or NIC drivers, and from hypervisor vendors allowing access to these capabilities.  Given that NIC vendors (Intel, Broadcom and Emulex) have given public support to both VXLAN and GRE based encapsulations, I can only guess that support for both schemes will be forthcoming.

3.3 LISP

Locator/ID Separation Protocol (LISP) is a technology that allows end systems to keep their IP address (ID) even as they move to a different subnet within the Internet (Location).  It breaks the ID/Location dependency that exists in the Internet today by creating dynamic tunnels between routers (Ingress and Egress Tunnel Routers).  Ingress Tunnel Routers (ITRs) tunnel packets to Egress Tunnel Routers (ETRs) by looking up the mapping of an end system’s IP address (ID) to its adjacent ETR IP address (Locator) in the LISP mapping system.

LISP provides true end system mobility while maintaining shortest path routing of packets to the end system.  With traditional IP routing, an end station's IP address must match the subnet it is connected to.  While VXLAN can extend a layer 2 segment (and therefore the subnet it is congruent with) across hosts which are physically connected to different subnets, when a VM on a particular host needs to communicate out through a physical router via a VLAN, the VM's IP address must match the subnet of that VLAN -- unless the router supports LISP.

If a VXLAN is extended across a router boundary, and the IP Gateway for the VXLAN's congruent subnet is a VM on the other side of the router, this means traffic will flow from the originating VM's server, across the IP network to the IP Gateway VM residing on another host, and then back up into the physical IP network via a VLAN.  This phenomenon is sometimes referred to as “traffic tromboning” (alluding to the curved shape of a trombone).  Thus, while VXLANs support VMs moving across hosts connected to different layer 2 domains (and therefore subnets), they don't provide the direct path routing of traffic that LISP does.

3.4 MAC-in-MAC

VMware has an existing proprietary equivalent of VXLAN which is deployed today with vCloud Director, called vCloud Director Network Isolation (vCDNI).  vCDNI uses a MAC-in-MAC encapsulation.  Cisco and VMware, along with others in the hypervisor and networking industry have worked together on a common industry standard to replace vCDNI -- namely VXLAN.  VXLAN has been designed to overcome the shortcomings of the vCDNI MAC-in-MAC encapsulation -- namely load distribution, and limited span of a layer 2 segment.

The first one is the same issue that GRE has with load distribution across layer 2 switch Port Channels (and ECMP for FabricPath). The second is that, because the outer encapsulation is a layer 2 frame (not an IP packet), all network nodes (hypervisors in the case of vCDNI) MUST be connected to the same VLAN.  This limits the flexibility in placing VMs within your datacenter if you have any routers interconnecting your server pods, unless you use a layer 2 extension technology such as OTV to do it.

So, a couple of points to wrap things up. Hopefully, this gives you a better understanding of why VXLAN instead of some of the existing options. Beyond that, I hope it becomes clear that while VXLAN is immensely useful, it is not magical--it relies on a well-built, well-operating L2/L3 infrastructure, so other technologies and protocols are going to come into play in the real world.  As usual, post questions to either of these blog entries and we will get them answered as best we can.

So, here is our final installment--we are wrapping up with some of the more common questions we are seeing.  If you missed the earlier posts, be sure to check out Part 1 and Part 2. I also have a couple of earlier posts introducing VXLAN and answering some of the initial questions.

So, onto the FAQs…

How do VXLANs relate to VN-tags ?

Some people have asked what the relationship is between VN-tags (aka 802.1Qbh) and VXLANs. Does one rule out the other ? The answer is a definite ‘no’. The VN-tag exists on the link between a VM and the virtual switch (or ‘VM-switch’). If such a switch has support for VXLANs, then the VMs in question can get connectivity through VXLANs. A packet will never need to have both at the same time. In the context of Cisco products, VM-FEX technology remains complementary to VXLANs.

So now we can migrate VMs across subnets ?

There is also some confusion over what implications VXLANs have for VM mobility. Claims such as “VXLANs permit mobility across L3 boundaries” have been taken to mean different things.

We want to make clear that VMs connected to VXLANs do not need to change their own IP addresses as a consequence of this technology. Essentially, VMs connected to a VXLAN remain unaware that they are on a VXLAN -- just as they are usually unaware of running on VLANs. It is up to the underlying network to ensure connectivity between such VMs and provide layer-2 semantics such as mac-layer broadcast and unicast traffic. As a consequence, any mobility event -- live or otherwise -- has no effect on the internals of the VM.

At the same time, since the native Ethernet frames are carried over an IP encapsulation, the tunnel endpoints themselves do not need to be on the same VLAN or IP subnet in order to ensure connectivity of the VXLAN. This creates the potential for a VM on a certain VXLAN to move between hosts which are themselves on different subnets. It is important, however, not to interpret this to mean that live VM migration is now immediately possible across subnets, since other factors can get in the way. As an example, live VM migration itself requires transfer of VM data between two hosts. This transfer may not be possible or officially supported across subnets. All that VXLANs ensure is connectivity to the same perceived layer 2 broadcast network regardless of which host the VM is on (assuming of course that the network supports VXLANs) and regardless of which subnet the host connects to. However, VXLANs do not, by themselves, circumvent other impediments to live VM migration -- such as the transfer issue mentioned above.

What about routing across VXLANs ?

So, you are thinking “This is all well and good for interconnecting VMs at layer-2 in a scalable way, but how do they talk to the real world of corporate networks and the internet?” In other words, who routes between VXLANs ? Or between VXLANs and VLANs, or VXLANs and the global internet ?

The answer to this will evolve over time just as it did with VLAN technology. If a router is ignorant of 802.1Q tagging it cannot route across VLANs unless someone else terminates VLAN tagging on its behalf. For instance an 802.1Q-capable L2 switch can strip the tag and forward native Ethernet frames to/from the router. The router would then only support one “VLAN interface” on each physical interface.

With VXLANs, too, VXLAN capable devices could take the responsibility for stripping the encapsulation before forwarding the packet to the router for routing. If VXLAN functionality remains confined to virtual switches, the router, too, will need to be a virtual router i.e. routing software running inside a VM. As and when non-virtual physical switches support the VXLAN format, real physical router boxes can connect to them. Of course, as in the VLAN case, this will limit the number of routable VXLAN interfaces on the router. A better solution would be for the router itself to encap/decap the VXLAN tunneled packets so it can support a large number of VXLAN interfaces on the same physical interface.

One intermediate step between using purely virtual routers and having physical routers support the VXLAN encapsulation would be for the L2 network to permit bridging between VXLANs and VLANs. This will allow both physical and virtual devices -- whether routers or other nodes -- to connect into VXLANs without requiring an upgrade to their software. Such a bridging functionality is defined in the proposed draft standard for just such purposes.
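A sketch of that bridging function: a gateway maps a VLAN to a VXLAN segment, encapsulating in one direction and stripping the encapsulation in the other, so unmodified routers and other physical devices on the VLAN can reach VMs on the VXLAN. The mapping table and method names are illustrative, not from the draft.

```python
class VxlanVlanGateway:
    """Illustrative sketch of a VXLAN<->VLAN bridge."""

    def __init__(self, vlan_to_vni: dict):
        self.vlan_to_vni = vlan_to_vni
        self.vni_to_vlan = {v: k for k, v in vlan_to_vni.items()}

    def from_vlan(self, vlan_id: int, frame: bytes):
        """Frame arrives on an 802.1Q trunk; encapsulate it onto the VXLAN."""
        vni = self.vlan_to_vni.get(vlan_id)
        if vni is None:
            return None                    # VLAN not bridged; forward normally
        return ("encapsulate", vni, frame)

    def from_vxlan(self, vni: int, inner_frame: bytes):
        """Frame arrives VXLAN-encapsulated; strip it and tag it for the VLAN."""
        vlan_id = self.vni_to_vlan.get(vni)
        if vlan_id is None:
            return None                    # no local VLAN mapped to this VXLAN
        return ("send_tagged", vlan_id, inner_frame)

# Example: VLAN 100 is bridged to VXLAN segment 5000, so a physical router on
# VLAN 100 can exchange frames with VMs on VXLAN 5000 without VXLAN support.
gw = VxlanVlanGateway({100: 5000})
print(gw.from_vlan(100, b"frame"))     # ('encapsulate', 5000, b'frame')
print(gw.from_vxlan(5000, b"frame"))   # ('send_tagged', 100, b'frame')
```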

In many cloud provider environments, tenants may be able to directly create VXLANs within their portion of the public data center (sometimes called an Organization’s Virtual Data Center). The IP addressing of the tenant-administered VMs on these VXLANs will in general not be coordinated across different tenants. This makes NAT a very desirable feature on the router or routers directly attached to such client administered VXLANs. Ditto for VPN connectivity to the tenant’s own network.

Are VXLANs secure ?

With proper precautions they can be just as secure as regular networks. It is true however, that there is a greater risk of attacks and users must understand this and take measures to guard against it. In the worst case, an attacker can inject himself into a VXLAN by sending IP-encapsulated packets from anywhere. Of course this requires access to the IP network. A first line of defense is to have a perimeter firewall that denies IP traffic with the VXLAN encapsulation from the outside.

This does not prevent attacks from the inside. For that users would need to control access at internal routers to ensure that only authorized tunnel endpoints can inject packets into VXLAN tunnels. This can be done statically (knowing the physical topology of the network) or by employing additional IP security mechanisms that guarantee encryption and/or authentication.
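A sketch of the static approach mentioned above: the VTEP, or an internal router or firewall in front of it, accepts VXLAN-encapsulated packets only when the outer source IP belongs to a known, authorized tunnel endpoint. The address list and function name are assumptions for illustration.

```python
# Statically configured set of authorized VTEP addresses (assumes the operator
# knows the physical topology and which hosts run VTEPs).
AUTHORIZED_VTEPS = {"10.1.1.1", "10.1.1.2", "10.1.2.1"}

def accept_encapsulated(outer_src_ip: str) -> bool:
    """Only decapsulate VXLAN packets arriving from an authorized endpoint."""
    return outer_src_ip in AUTHORIZED_VTEPS

print(accept_encapsulated("10.1.1.2"))      # True
print(accept_encapsulated("198.51.100.7"))  # False - dropped, possible injection
```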

Rather than re-invent this particular wheel, the VXLAN draft lets users make use of existing methods to secure VXLAN tunneled traffic, while pointing out where the risks lie.

What about network services ?

Since VXLANs are expected to be deployed in hosted environments, people naturally want to know how to enable network services (firewalls, IPS, load balancing, WAN optimization) for VXLANs. The answer to this is pretty much the same as for routers. Either the services need to be enabled in endpoints that can be attached to VXLANs (i.e. virtual ones in the immediate future), or these services need to become VXLAN aware, or someone needs to perform a bridging function between VXLANs and whatever it is that the services understand (physical interfaces, VLANs, etc.).

So, if you are intrigued by VXLANs, you may want to ping your Cisco account team--we will be kicking off a closed beta soon.
