  1. SDN

    1. Core ideas of SDN




  1. Traffic-forwarding control and communication between the controller and the workers
  2. Workers perform simple, "dumb" forwarding driven by flow tables
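
The split between a controller that decides and a worker that only looks things up can be sketched as follows. This is a minimal illustration, not OpenFlow itself: the `FlowEntry` and `Worker` names, the match fields, and the action strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FlowEntry:
    """One row of a hypothetical flow table: match fields plus an action."""
    priority: int
    match: dict = field(default_factory=dict)  # missing keys act as wildcards
    action: str = "drop"                       # e.g. "output:1" or "drop"


class Worker:
    """A 'dumb' switch: it only looks packets up in the table it was given."""

    def __init__(self):
        self.table = []

    def install(self, entry):
        # Called by the controller; keep the highest priority first.
        self.table.append(entry)
        self.table.sort(key=lambda e: e.priority, reverse=True)

    def forward(self, packet):
        # Return the action of the first entry whose fields all match.
        for entry in self.table:
            if all(packet.get(k) == v for k, v in entry.match.items()):
                return entry.action
        return "send_to_controller"  # table miss: ask the controller
```

All policy lives in what the controller installs; the worker never makes a decision beyond the table lookup.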

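Although SDN does not strictly require OpenFlow (other protocols can achieve the same goals), the open nature of networking calls for a standard protocol, and OpenFlow is the protocol that emerged together with the SDN concept. The figure below, taken from the official OpenFlow white paper, shows where OpenFlow sits in the SDN framework and what role it plays: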











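Network virtualization is further divided into "network resource virtualization" and "network function virtualization". The latter is NFV (Network Function Virtualization), which this article discusses in detail.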


Network virtualization has been described as follows: Similar to computer virtualization, network virtualization promises to improve resource allocation, permits operators to checkpoint their network before changes, and allows competing customers to share the same equipment in a controlled and isolated fashion. Critically, virtual networks also promise to provide a safe and realistic environment to deploy and evaluate experimental "clean slate" protocols in production networks.

Thus, by analogy, the network itself should have a hardware abstraction layer. This layer should be easy to slice so that multiple wildly different networks can run simultaneously on top without interfering with each other, on a variety of different hardware, including switches, routers, access points, and so on.

Above the hardware abstraction layer, we want new protocols and addressing formats to run independently in their own isolated slice of the same physical network, enabling networks optimized for the applications running on them, or customized for the operator who owns them. Below the virtualization layer, new hardware can be developed for different environments with different speed, media (wireline and wireless), power or fanout requirements.

While individual technologies can slice particular hardware resources (e.g., MPLS can virtualize forwarding tables) and layers (e.g., WDM slices the physical layer, VLANs slice the link layer), there is currently no one single technology or clean abstraction that will virtualize the network as a whole.






  1. Network virtualization




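The figure below explains network virtualization by analogy with compute virtualization: OpenFlow plays the role of the x86 instruction set, FlowVisor plays the role of a hypervisor such as KVM and virtualizes the network resources, and NOX corresponds to a guest operating system such as Windows.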



  1. OpenFlow technology, championed by academia (Stanford and others)




  1. Host overlay technology, championed by VMware



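Host overlay tunnels mostly use VXLAN, although NVGRE, MPLS over GRE, MPLS over UDP, and others are also used. STT is an overlay encapsulation proposed by Nicira (later acquired by VMware).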


  1. Network overlay technology, championed by Cisco and Huawei





  1. NFV

    1. OPNFV






  1. Approaches to improving VNF performance









  1. Improving virtual channel performance

    1. virtio-net



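Virtual NICs come in two kinds: fully virtualized and paravirtualized. With full virtualization, the VM cannot tell that its NIC is emulated by the hypervisor; the e1000 NIC, for example, is emulated entirely in software by a hypervisor such as KVM/QEMU. With paravirtualization, the hypervisor does part of the work and the guest OS must also be modified to cooperate.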


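The guest signals the device, which causes an exit from guest mode into KVM, and KVM in turn returns to the QEMU process in user space. QEMU then reads or writes the tap device. The path therefore enters the kernel from user space and then switches from the kernel back to user space, two switches in total.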


  1. vhost-net


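The guest signals the device, which causes an exit into KVM; KVM talks directly to vhost-net.ko, and vhost-net.ko accesses the tap device. Network data therefore needs only one switch, from user space into the kernel, to complete the transfer, which greatly improves virtual NIC performance.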



  1. vhost-user


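Compared with vhost-net, vhost-user improves on one point: packets no longer enter the host kernel and are transferred entirely in user space, which saves one data copy and related overhead.

Because vhost-user provides an efficient network communication mechanism between user-space processes, it is often combined with other optimization technologies such as DPDK, ODP, and Snabb Switch.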


  1. ivshmem


QEMU supports the ivshmem mechanism from version 1.5.x onward.


【ivshmem pci BARs】

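The BARs (Base Address Registers) are the six registers at offsets 0x10 to 0x24 in PCI configuration space. They define how much address space a PCI device needs and where that space is placed. On x86 the address space is divided into memory (MEM) and I/O, so bit 0 of a BAR indicates whether the region is mapped into memory space or I/O space; bit 0 is read-only. For a memory BAR, bits 2:1 encode the type (00 for a 32-bit address, 10 for a 64-bit address) and bit 3 marks the region as prefetchable. The remaining bits express the size of the region the device needs and its base address.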
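
The bit layout described above can be decoded with a few masks. The following is a minimal sketch, assuming the standard PCI BAR layout for 32-bit register values; the function names `decode_bar` and `bar_size` are made up for this illustration.

```python
def decode_bar(value: int) -> dict:
    """Decode the flag bits of one 32-bit PCI Base Address Register value."""
    if value & 0x1:
        # I/O space BAR: bit 0 is set, bit 1 is reserved, address in bits 31:2.
        return {"space": "io", "base": value & 0xFFFFFFFC}
    mem_type = (value >> 1) & 0x3  # bits 2:1 of a memory BAR
    return {
        "space": "memory",
        "width": {0: 32, 2: 64}.get(mem_type),  # None for other encodings
        "prefetchable": bool(value & 0x8),      # bit 3
        "base": value & 0xFFFFFFF0,             # low 32 bits of the address
    }


def bar_size(readback: int) -> int:
    """Size of a 32-bit memory BAR, from the value read back after writing all ones."""
    mask = readback & 0xFFFFFFF0          # drop the flag bits
    return ((~mask) & 0xFFFFFFFF) + 1     # lowest writable address bit gives the size
```

For a 64-bit BAR the upper half of the address lives in the next BAR slot, which this sketch does not handle.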







enum ivshmem_registers {






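  • Mask register

  • Status register

  • IVPosition register

IVPosition is read-only and reports the guest ID. A guest ID is a non-negative integer and is only set once the device is ready. If the device is not ready, IVPosition returns -1. Applications must make sure they hold a valid ID before they start using the shared memory.

  • Doorbell register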
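
How a guest would compose a doorbell write can be sketched as below. The register offsets and the doorbell layout (peer ID in the upper 16 bits, interrupt vector in the lower 16 bits) follow my reading of the QEMU ivshmem device specification and are not stated in this article, so treat them as assumptions; the `bytearray` merely stands in for a memory-mapped BAR0.

```python
import struct

# Assumed BAR0 register offsets (per the ivshmem device specification).
INTR_MASK, INTR_STATUS, IV_POSITION, DOORBELL = 0x0, 0x4, 0x8, 0xC


def doorbell_value(peer_id: int, vector: int) -> int:
    """Pack the target peer ID (high 16 bits) and vector (low 16 bits)."""
    if not (0 <= peer_id <= 0xFFFF and 0 <= vector <= 0xFFFF):
        raise ValueError("peer_id and vector must each fit in 16 bits")
    return (peer_id << 16) | vector


def ring_doorbell(bar0, peer_id: int, vector: int) -> None:
    """Write the doorbell into a writable buffer standing in for mmap'ed BAR0."""
    struct.pack_into("<I", bar0, DOORBELL, doorbell_value(peer_id, vector))
```

In a real guest the buffer would come from mapping the device's BAR0, and the write would trigger an interrupt in the peer identified by `peer_id`.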









ivshmem supports two interrupt modes: pin-based interrupts and MSI interrupts. For the difference between them, see the appendix "MSI与MSI-X" (MSI and MSI-X).





mount -t tmpfs -o size=32m tmpfs /dev/shm



2. Starting the ivshmem server


./ivshmem_server -m 64 -p /tmp/nahanni &






On the QEMU side, -chardev socket sets up the socket connection and -device ivshmem creates the shared-memory device.

./qemu-system-x86_64 -hda mg -L /pc-bios/ --smp 4 \
    -chardev socket,path=/tmp/nahanni,id=nahanni \
    -device ivshmem,chardev=nahanni,size=32m,msi=off \
    -serial telnet:,server,nowait,nodelay -enable-kvm &

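The server uses select to listen for socket connections from QEMU. QEMU must be started with -chardev socket,path=/tmp/nahanni,id=nahanni. With this option QEMU looks up the registered chardev type (register_types) and calls qemu_chr_open_socket -> unix_connect_opts to establish the socket connection to the server. The server's add_new_guest assigns an ID to each VM and sends that ID, together with a set of eventfd file descriptors, to the QEMU process.

In non-interrupt mode there is no need for the -chardev socket connection, but -device ivshmem is still required to create the shared memory:

./qemu-system-x86_64-dyn -hda Img -L /pc-bios/ --smp 4 \
    -device ivshmem,shm=nahanni,size=32m \
    -serial telnet:,server,nowait,nodelay &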




4. The ivshmem driver in the guest OS


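/* Register the character device; major number 0 asks the kernel to pick one. */
register_chrdev(0, "kvm_ivshmem", &kvm_ivshmem_ops);

/* File operations that the guest driver exposes to user space. */
static const struct file_operations kvm_ivshmem_ops = {
    .owner   = THIS_MODULE,
    .open    = kvm_ivshmem_open,
    .mmap    = kvm_ivshmem_mmap,
    .read    = kvm_ivshmem_read,
    .ioctl   = kvm_ivshmem_ioctl,   /* legacy hook; kernels since 2.6.36 use .unlocked_ioctl */
    .write   = kvm_ivshmem_write,
    .llseek  = kvm_ivshmem_lseek,
    .release = kvm_ivshmem_release,
};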





5. How a guest OS uses ivshmem to communicate with the host OS and with other guests














  1. macvtap






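  • TUN device

A TUN device is a virtual network device that lets a program conveniently emulate network behavior.


All packets received by a physical NIC are handed to the kernel network stack and then delivered to user programs through the socket API. A TUN device works as follows:

An ordinary NIC sends and receives packets over a cable, whereas a TUN device sends and receives packets through a file. Every write to this file is turned by the TUN device into a packet delivered to the kernel; when the kernel sends a packet to the TUN device, the packet contents can be obtained by reading the file.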

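  • TAP device

A TAP device works the same way as a TUN device; the difference is the layer at which each operates:

The /dev/tunX file of a TUN device carries IP-layer packets. A TUN device works only at the IP layer and cannot be bridged with a physical NIC, but it can be connected to a physical NIC through layer-3 forwarding (for example ip_forward).

The /dev/tapX file of a TAP device carries MAC-layer frames. A TAP device has MAC-layer functionality, can be bridged with a physical NIC, and supports MAC-layer broadcast.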
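
The practical consequence of the layer difference is what a program sees when it reads the file. The sketch below parses hand-built bytes rather than a real device; it assumes IPv4 and assumes no extra packet-information header in front of the data. The function names are made up for this illustration.

```python
import struct


def _mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def parse_tap_frame(frame: bytes) -> dict:
    """A TAP read yields an Ethernet frame: dst MAC, src MAC, EtherType, payload."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst_mac": _mac(dst),
        "src_mac": _mac(src),
        "ethertype": ethertype,
        "payload": frame[14:],
    }


def parse_tun_packet(packet: bytes) -> dict:
    """A TUN read yields a bare IP packet (IPv4 assumed), with no MAC header."""
    return {
        "version": packet[0] >> 4,
        "src_ip": ".".join(str(b) for b in packet[12:16]),
        "dst_ip": ".".join(str(b) for b in packet[16:20]),
    }
```

The payload of a TAP frame with EtherType 0x0800 is exactly what a TUN device would have delivered, which is why only TAP can take part in a layer-2 bridge.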


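Sometimes a single physical NIC needs to carry several IP addresses and several MAC addresses. Binding several IP addresses is easy, but they all share the MAC address of the physical NIC, which may not meet the design requirements. MACVLAN devices address this, and they work as follows:

MACVLAN uses the destination MAC address of each received packet to decide which virtual NIC should get it. Used alone, MACVLAN seems to have little value, but combined with the network namespaces introduced earlier it can build a network like this:

Because the macvlan device and eth0 live in different namespaces and have separate network stacks, the virtual namespace can use the network without a bridge being set up.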
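
The demultiplexing rule (pick the virtual NIC by destination MAC) can be modeled in a few lines. This is a simplified sketch: the class name `MacvlanParent`, the interface names, and the rule that frames for unknown addresses stay with the parent are assumptions of the example, and real MACVLAN modes differ in the details.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class MacvlanParent:
    """Hypothetical model of a physical NIC with MACVLAN children."""

    def __init__(self, name):
        self.name = name
        self.children = {}  # MAC address -> child interface name

    def add_child(self, mac, child_name):
        self.children[mac.lower()] = child_name

    def deliver(self, dst_mac):
        """Return the interfaces that would receive a frame sent to dst_mac."""
        dst_mac = dst_mac.lower()
        if dst_mac == BROADCAST:
            # Broadcast frames are seen by the parent and by every child.
            return [self.name] + sorted(self.children.values())
        if dst_mac in self.children:
            return [self.children[dst_mac]]
        return [self.name]  # not a child address: left to the parent
```

Each child can then be moved into its own namespace, which is what gives it an independent network stack.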


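MACVTAP is an improvement on MACVLAN that combines the characteristics of MACVLAN and TAP devices. It sends and receives packets the MACVLAN way, but instead of handing received packets to the network stack it creates a /dev/tapX file and delivers them to that file:

Because MACVLAN works at the MAC layer, MACVTAP can also only work at the MAC layer; there is no such thing as a MACVTUN device.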



  1. SR-IOV


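Memory virtualization: hardware performs the two address translations GVA (Guest Virtual Address) → GPA (Guest Physical Address) → HPA (Host Physical Address). A traditional, non-virtualized operating system needs only a single translation from virtual address to physical address, performed by the hardware MMU.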
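
The two-stage translation can be illustrated with two dictionaries standing in for the guest page table and the host's second-level table. This is a toy model with single-level tables and a 4 KiB page size; the names `translate` and `gva_to_hpa` are made up for the sketch.

```python
PAGE_SIZE = 4096


def translate(addr: int, page_table: dict) -> int:
    """Map an address through one page table (page number -> frame number)."""
    page, offset = divmod(addr, PAGE_SIZE)
    # A missing entry raises KeyError, which stands in for a page fault.
    return page_table[page] * PAGE_SIZE + offset


def gva_to_hpa(gva: int, guest_page_table: dict, host_table: dict) -> int:
    """GVA -> GPA using the guest's table, then GPA -> HPA using the host's."""
    gpa = translate(gva, guest_page_table)
    return translate(gpa, host_table)
```

The page offset is carried through both stages unchanged; only the page number is remapped twice.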


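Software I/O virtualization has three models: Split I/O, Direct I/O, and Passthrough I/O.

Hardware-assisted I/O virtualization is implemented by Intel VT-d (Virtualization Technology for Directed I/O) and by the AMD IOMMU (I/O Memory Management Unit). At the level of specifications and frameworks there is SR-IOV, a specification defined by the PCI-SIG specifically for PCIe devices.

The core idea of the Split I/O model is that all I/O from guests is proxied by a single I/O driver (located in Dom0, which has I/O privileges) that interacts with the physical I/O device. It requires changes to the guest OS and is the approach taken by the Xen hypervisor; it is known as the paravirtualization solution, as shown in the figure below:

The core idea of the Direct I/O model is that the guest OS needs no changes and uses the device driver directly, while the VMM proxies all guest I/O to the physical device. It is known as the full virtualization solution, as shown in the figure below:

The Passthrough I/O model lets a guest operate the physical I/O device directly, with little or no VMM intervention. The advantage is high performance; the drawback is that the physical device can only be used by the designated guest and cannot be shared. This is shown in the figure below: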



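SR-IOV is a specification from the PCI-SIG that aims to reduce VMM intervention in virtual I/O and so improve I/O virtualization performance. It builds on Passthrough I/O and uses the IOMMU to reduce the cost of address translation and address-space protection.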


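One PF (Physical Function) manages multiple VFs (Virtual Functions). A PF is a physical device with full PCIe functionality; it has a unique VID and includes I/O space, memory space, and configuration space. Each VF has its own configuration space but shares the I/O space and memory space of the PF, which makes it a lightweight PCIe device. Each VF has its own unique RID, which identifies the source of a transaction and is also used to look up the IOMMU page tables, so that each VM can use a separate IOMMU page table.

Several SR-IOV capable PCIe devices can sit below one PCIe switch. Virtualizing the PCIe switch itself requires MR-IOV, which no vendor had implemented at the time of writing.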



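The PF driver is a dedicated driver that manages the global functions of an SR-IOV device and configures the shared resources. The PF driver differs from one VMM to another and usually needs higher privileges than an ordinary virtual machine to operate. It contains all the functionality of a traditional driver, so the VMM can access the device's I/O resources, and operations invoked through the PF driver can affect the whole device. The PF driver must be loaded before any VF driver and can only be unloaded after all VF drivers have been unloaded.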



The VF driver runs in the guest OS just like an ordinary PCIe device driver.






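I/O devices generate a large number of interrupts, and these must be routed correctly to the guest. There are generally two methods: routing through an interrupt routing controller, or sending an MSI (Message Signaled Interrupt) message as a DMA write request. A DMA write request carries a destination address, so DMA would need access to the entire memory space, which makes interrupt isolation impossible.


Intel VT-d reworks the IOMMU to provide multiple DMA protection domains and thereby virtualizes DMA; this is also called DMA remapping. It redefines the MSI message format so that the message no longer embeds a destination address and instead carries a message ID that distinguishes the DMA protection domains.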







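Steps 1 and 2: a packet arrives and is sent to the L2 sorter/switch for classification and switching;

Step 3: the packet is classified by destination MAC address; in this example it matches buffer pool 1;


Step 5: the DMA operation reaches the Intel chipset, where VT-d (configured by the hypervisor) performs the address translation required for the DMA operation; as a result the packet is delivered directly into the buffers of the VF driver in the virtual machine;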














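The Intel SR-IOV driver (that is, the PF driver) contains all the functionality of an Intel Ethernet NIC driver, plus the following functions specific to SR-IOV:

· Generates a MAC address for each VF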







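The Intel VF driver sample code is a modified version of the standard Intel ixgbe 10 Gigabit Ethernet driver. It is loaded by device ID: an Intel VF exposes a device ID that marks it as a VF, so the VF driver can be loaded for it.

The Intel VF driver can be divided into three parts:


· I/O operations: use the SR-IOV capability to perform I/O while avoiding hypervisor intervention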



















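The VF driver depends on the guest OS rather than on the hypervisor, so no special VF driver needs to be supplied for virtual machines running on VMware.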


  1. VMDq











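VMDq requires support in the NIC and also changes to the VMM or vSwitch. The current version of Linux KVM does not appear to support VMDq and would need to be modified.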


[RFC] Virtual Machine Device Queues (VMDq) support on KVM

Network adapter with VMDq technology presents multiple pairs of tx/rx queues, and renders network L2 sorting mechanism based on MAC addresses and VLAN tags for each tx/rx queue pair. Here we present a generic framework, in which network traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without any software copy.

Actually this framework can apply to traditional network adapters which have just one tx/rx queue pair. And applications using the same user/kernel interface can utilize this framework to send/receive network traffic directly through a tx/rx queue pair in a network adapter.

We use virtio-net architecture to illustrate the framework.

The basic idea is to utilize the kernel Asynchronous I/O combined with Direct I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to the kernel; we can still see them in the SCSI tape driver.

With traditional file operations, a copy of the payload contents from/to the kernel DMA address to/from a user buffer is needed. That is the copy we want to save.

The proposed framework is like this:

A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair on the host side. The KVM virtio-net backend service, a user space program, submits asynchronous read/write I/O requests to the host kernel through the TUN/TAP device. The requests correspond to the vqueue elements and include both transmission and receive. They can be queued in one AIO request and later, the completion will be notified through the underlying packets tx/rx processing of the rx/tx queue


Detailed path:

To the guest Virtio-net driver, packet receive corresponds to asynchronous read I/O requests of the backend service.

1) Guest Virtio-net driver provides header and payload address through the receive vqueue to Virtio-net backend service.

2) Virtio-net backend service encapsulates multiple vqueue elements into multiple AIO control blocks and composes them into one AIO read request.

3) Virtio-net backend service uses io_submit() syscall to pass the request to the TUN/TAP device.

4) Virtio-net backend service uses io_getevents() syscall to check the completion of the request.

5) The TUN/TAP driver receives packets from the queue pair of NIC, and prepares for Direct I/O.

A modified NIC driver may render a skb whose header is allocated in the host kernel, but whose payload buffer is directly mapped from the user space buffer rendered through the AIO request by the backend service. get_user_pages() may do this. For one AIO read request, the TUN/TAP driver maintains a list of the directly mapped buffers, and a NIC driver tries to get the buffers as payload buffers to compose the new skbs. Of course, if getting the buffers fails, then kernel allocated buffers are used.

6) Modern NIC cards now mostly have the header split feature. The NIC queue pair then may directly DMA the payload into the user spaces mapped payload

Thus a zero-copy for payload is implemented in packet receiving.

7) The TUN/TAP driver manually copies the host header to the user-mapped space.

8) aio_complete() to notify the Virtio-net backend service for io_getevents().

To the guest Virtio-net driver, packet send corresponds to asynchronous write I/O requests of the backend. The path is similar to packet receive.

1) Guest Virtio-net driver provides header and payload address filled with contents through the transmit vqueue to Virtio-net backend service.

2) Virtio-net backend service encapsulates the vqueue elements into multiple AIO control blocks and composes them into one AIO write request.

3) Virtio-net backend service uses the io_submit() syscall to pass the requests to the TUN/TAP device.

4) Virtio-net backend service uses io_getevents() syscall to check the request

5) The TUN/TAP driver gets the write requests and allocates skbs for them. The header contents are copied into the skb header. The directly mapped user space buffer is easily hooked into the skb. Thus a zero copy for payload is implemented in packet sending.

6) aio_complete() to notify the Virtio-net backend service for io_getevents().

The proposed framework is described as above.

Consider the modifications to the kernel and qemu:

To kernel:

1) The TUN/TAP driver may be modified a lot to implement AIO device operations and to implement directly user space mapping into kernel. Code to maintain the directly mapped user buffers should be in. It's just a modification for driver.

2) The NIC driver may be modified to compose skb differently and slightly data structure change to add user directly mapped buffer pointer.

Here, maybe it's better for a NIC driver to present an interface for an rx/tx queue pair instance which will also apply to traditional hardware, the kernel interface should not be changed to make the other components happy.

The abstraction is useful, though it is not needed immediately here.

3) The skb shared info structure may be modified a little to contain the user directly mapped info.

To Qemu:

1) The Virtio-net backend service may be modified to handle AIO read/write requests from the vqueues.

2) Maybe a separate pthread to handle the AIO request triggering is needed.

  1. multiqueue virtio-net


To make sure the whole stack can work in parallel, the parallelism of not only the front-end (guest driver) but also the back-end (vhost and tap/macvtap) must be exploited. This is done by:

  • Allowing multiple sockets to be attached to tap/macvtap
  • Using multi-threaded vhost to serve as the backend of a multiqueue-capable virtio-net adapter
  • Using a multiqueue-aware virtio-net driver to send and receive packets to/from each queue

The main goal of multiqueue is to exploit the parallelism of each module involved in packet transmission and reception:

  • macvtap/tap: For single-queue virtio-net, one socket of macvtap/tap was abstracted as a queue for both tx and rx. We can reuse and extend this abstraction to allow macvtap/tap to dequeue and enqueue packets from multiple sockets. Each socket can then be treated as a tx and rx queue, and macvtap/tap is in fact a multiqueue device in the host. The host network code can then transmit and receive packets in parallel.
  • vhost (note: this means vhost-net, not vhost-user): The parallelism can be achieved by using multiple vhost threads to handle multiple sockets. Currently, there are two design choices.
    • 1:1 mapping between vhost threads and sockets. This method does not need vhost changes and just launches the same number of vhost threads as queues. Each vhost thread handles one tx ring and one rx ring, just as for single-queue virtio-net.
    • M:N mapping between vhost threads and sockets. This method allows a single vhost thread to poll more than one tx/rx ring and socket, and uses separate threads to handle tx and rx requests.
  • qemu: qemu is in charge of the following things
    • allow multiple tap file descriptors to be used for a single emulated nic
    • userspace multiqueue virtio-net implementation, which is used for maintaining compatibility, doing management and migration
    • control the vhost based on the userspace multiqueue virtio-net
  • guest driver
    • Allocate multiple rx/tx queues
    • Assign each queue an MSI-X vector in order to parallelize the packet processing in the guest stack
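
The two vhost design choices above differ only in how queues are distributed over threads. The following is a minimal sketch of that assignment, assuming a simple round-robin policy; the function name and the policy are illustrative assumptions, not how vhost actually schedules its work.

```python
def map_queues_to_threads(num_queues: int, num_threads: int) -> dict:
    """Assign queue indices to thread indices round-robin.

    With num_threads == num_queues this is the 1:1 mapping; with fewer
    threads than queues, each thread polls several queues (M:N mapping).
    """
    if num_queues < 0 or num_threads <= 0:
        raise ValueError("need at least one thread and a non-negative queue count")
    mapping = {thread: [] for thread in range(num_threads)}
    for queue in range(num_queues):
        mapping[queue % num_threads].append(queue)
    return mapping
```

For example, four queues on four threads gives each thread exactly one queue, while four queues on two threads gives each thread two.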


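【Guest virtio-net driver code analysis】


The NIC reports interrupts in either MSI-X mode or non-MSI-X mode. Intel VT-d requires MSI-X mode.

1. If MSI-X mode is not enabled, the driver calls vp_request_intx to request a single interrupt whose handler is vp_interrupt. vp_interrupt in turn calls vp_vring_interrupt (except for configuration-change interrupts).


2. If MSI-X mode is enabled, there are two further cases: either all virtqueues share one interrupt, or each virtqueue has its own interrupt. In both cases the driver calls vp_request_msix_vectors to request the IRQ resources, and it calls setup_vq for each virtqueue to complete initialization.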
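
The choice between the modes above can be summarized as a fallback order. The sketch below is a simplified model, not the driver code: the function name, the mode labels, and the vector accounting (one extra vector for configuration changes) are assumptions made for illustration.

```python
def choose_irq_mode(num_vqs: int, msix_vectors_available: int) -> tuple:
    """Pick an interrupt mode and the number of MSI-X vectors it would use.

    Preference order: one MSI-X vector per virtqueue (plus one for
    configuration changes), then one shared virtqueue vector (plus one
    for configuration changes), then a legacy INTx interrupt.
    """
    if msix_vectors_available >= num_vqs + 1:
        return ("msix_per_vq", num_vqs + 1)
    if msix_vectors_available >= 2:
        return ("msix_shared", 2)
    return ("intx", 0)
```

A device with two virtqueues and three available vectors would get a dedicated vector per queue; with only two vectors the queues would share one.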






  1. Improving virtual switch performance


  1. OVS



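By default the host OS uses the Linux bridge for network forwarding. Because the Linux bridge has limited functionality and is hard to extend, Open vSwitch (OVS) emerged as its replacement, as shown in the figure below:

OVS has the concepts of br and port: a br corresponds to a Linux bridge logical interface such as br.101 or br.102, and a port corresponds to a Linux physical NIC interface such as eth0 or eth1.

In practice, however, the layer-2 forwarding performance of the open-source community OVS may turn out to be lower than that of the Linux bridge. The main reason to use OVS anyway is its extensibility.


One approach is to move OVS from kernel space to user space, saving one copy of each packet. OVS already supports a user-space mode, although it is still experimental. There is also a native user-space virtual switch, Snabb Switch.

Another approach is to modify both the guest OS and OVS so that data passes between user space and kernel space with zero copies, as in the Intel-led DPDK.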


  1. snabb

Snabb currently has two main open-source projects: Snabb Switch and Snabb NFV.

Snabb is written using these main techniques:

  • Lua, a high-level programming language that is easy to learn.
  • LuaJIT, a just-in-time compiler that is competitive with C.
  • Ethernet I/O with no kernel overhead ("kernel bypass" mode).


Snabb Switch is a user-space virtual switch.

Problem: Virtual machines need networking that is both fast and flexible. Hardware NIC virtualization is fast, software networking is flexible, but neither is both.

Solution: Snabb NFV provides both performance and flexibility. The secret sauce is a best-of-both-worlds design that combines SR-IOV hardware virtualization with a feature-rich software layer based on Virtio.

Snabb NFV is deployed as an OpenStack ML2 mechanism driver.

The operator configures Neutron using the standard commands and API. Snabb NFV then implements the Neutron configuration using its own fast data plane and robust control plane.

Snabb NFV supports these Neutron extensions: Provider Networks, Security Groups, Port Filtering, QoS, and L2-over-L3 tunnels (aka softwires).

Snabb NFV is distributed based on OpenStack Icehouse. The distribution includes NFV-oriented updates to QEMU, Libvirt, Nova, and Neutron.

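Snabb is an application of the userspace virtio app mechanism.

The userspace virtio app mechanism is a client/server communication mechanism, namely vhost-user.

How Snabb Switch works:

  1. Snabb uses vhost-user (a new feature in QEMU 2.1) to communicate with the VM, so neither a tap device nor the kernel is needed.

Enable the vhost-user feature in QEMU as follows: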

qemu -m 1024 -mem-path /bak/shared,prealloc=on,share=on -netdev type=vhost-user,id=net0,file=/path/to/socket -device virtio-net-pci,netdev=net0

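The -mem-path option lets a VM allocate its vring in guest memory that is shared with other processes; the vring is the buffer for the VM's NIC data. The vring's file descriptor, interrupt number, I/O events, and so on are then passed over a Unix socket to the Snabb Switch process, which also runs in user space. The Snabb Switch process can then fetch data from the vring directly through the file descriptor.

  2. Snabb talks to the physical NIC through its intel10g.lua driver or through a driver provided by OpenOnload.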


  1. Lagopus

Lagopus is a vSwitch that provides high-performance packet processing.

The Lagopus software switch is yet another OpenFlow 1.3 software switch implementation. It is designed to leverage multi-core CPUs for high-performance packet processing and forwarding with DPDK. Many network protocol formats are supported, such as Ethernet, VLAN, QinQ, MAC-in-MAC, MPLS and PBB. In addition, tunnel protocol processing is supported for overlay-type networking with GRE, VxLAN and GTP.

The Lagopus switch comes in two versions: raw-socket and DPDK-enabled.







  1. Improving VNF network protocol stack performance










  1. Snabb NFV




  1. libuinet


This is a user-space port of the FreeBSD TCP/IP stack, begun with the FreeBSD 9.1-RELEASE sources and many pieces of Kip Macy's user-space port of an earlier version of the FreeBSD stack, libplebnet.

Unlike the stock FreeBSD TCP/IP stack, this stack can initiate and terminate arbitrary TCP/IP connections, including those on arbitrarily-nested VLANs. Listen sockets can be bound to a wildcard IP address (across everything on the wire, not just local interfaces), wildcard port, and specific VLAN tag stacks. L2 information for accepted connections is available to the application. Outbound connections can be bound to any IP and port, as well as any MAC address and VLAN tag stack.

This stack can also passively reconstruct TCP streams using a copy of those streams' bidirectional packet flow. Reconstruction can continue even in the face of packet loss (in which case zero-filled holes in the affected streams are reported to the application).

Packet I/O is currently accomplished via netmap or libpcap (although the latter interface is relatively new and untested).


  1. mTCP

mTCP is a highly scalable user-level TCP stack for multicore systems.

mTCP source code is distributed under the Modified BSD License.



  1. NUSE

Linux has also been ported to itself. You can now run the kernel as a userspace application; this is called User-mode Linux (UML).

This is a library operating system (LibOS) version of the Linux kernel, which is beneficial in a couple of situations such as:

  • operating system personalization
  • full featured network stack for kernel-bypass technology (a.k.a. a high-speed packet I/O mechanism) like Intel DPDK, netmap, etc
  • testing network stack in a complex scenario.

Right now, we have 2 sub-projects of this LibOS.

  • Network Stack in Userspace (NUSE): NUSE allows us to use the Linux network stack as a library which any application can use directly by linking against it. Each application has its own network stack, so it provides an instant virtualized environment separate from the host operating system.
  • Direct Code Execution (DCE): DCE provides network simulator integration with the Linux kernel, so that any network protocol implemented in Linux can be the protocol under investigation.


  1. OpenDP

Open data plane on DPDK TCP/IP stack for DPDK

ANS (accelerated network stack) provides a userspace TCP/IP stack for use with DPDK. ANS is a static library which can be compiled with your App. You can add or delete ether interface, IP address and static routing from ANS. Your App can forward traffic to ANS.


  1. OpenOnLoad

OpenOnload runs on Linux and supports TCP/UDP/IP network protocols with the standard BSD sockets API, and requires no modifications to applications to use.

It achieves performance improvements in part by performing network processing at user-level, bypassing the OS kernel entirely on the data path.

Networking performance is improved without sacrificing the security and multiplexing functions that the OS kernel normally provides.

OpenOnload comprises a user-level shared library that intercepts network-related system calls and implements the protocol stack, and supporting kernel modules.

To accelerate an application with the Onload user-level transport, simply invoke the application on the command line as normal, prepended with "onload".


ie. Instead of:

netperf -t TCP_RR -H myserver

do this:

onload netperf -t TCP_RR -H myserver

and tuned for best latency:

onload -p latency netperf -t TCP_RR -H myserver






  1. Rump kernel

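NUSE compiles the entire Linux kernel into a user-space dynamic library for applications to use; OpenDP gives user-space applications an accelerated TCP/IP stack; and OpenOnload lets unmodified applications use its TCP/IP stack to speed up their network processing.

The idea of the rump kernel is to let kernel drivers run in user space without modification. Drivers here include device drivers, file systems, the TCP/IP stack, and so on.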

The NetBSD rump kernel is the first implementation of the "anykernel" concept where drivers either can be compiled into and/or run in the monolithic kernel or in user space on top of a light-weight rump kernel.

The NetBSD drivers can be used on top of the rump kernel on a wide range of POSIX operating systems, such as Linux and NetBSD.

The rump kernels can also run without POSIX directly on top of the Xen hypervisor, the L4 microkernel in Genode OS or even on "OS-less" bare metal.

The layering of a rump kernel combined with DPDK is as follows:


An anykernel is different in concept from microkernels, exokernels, partitioned kernels or hybrid kernels in that it tries to preserve the advantages of a monolithic kernel, while still enabling the faster driver development and added security in user space.

The "anykernel" concept refers to an architecture-agnostic approach to drivers where drivers can either be compiled into the monolithic kernel or be run as a userspace process, microkernel-style, without code changes.

With drivers, a wider concept is considered where not only device drivers are included but also file systems and the networking stack.







  1. KVMforNFV






  1. VPP




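  1. Intel hardware-assisted virtualization

    1. Processor-assisted virtualization: VT-x

Better virtualization support inside Intel processors. Intel VT-x helps make software-based virtualization solutions more flexible and more stable. Pure software virtualization requires the virtual machine monitor (VMM) to intercept, interrupt, and execute certain instructions on behalf of the guest operating system. By removing that need, VT-x reduces VMM intervention and provides hardware support for handing platform control between the VMM and the guest operating system, so that when VMM intervention is required the switch is faster, more reliable, and more secure. In addition, the VM migration features of Intel VT-x help protect IT investment and improve the flexibility of failover, load balancing, disaster recovery, and maintenance:

--Intel VT FlexPriority: While the processor is executing a task, it often receives requests, or "interrupts", from other devices or applications that need attention. To minimize the performance impact, a dedicated register in the processor (the APIC Task Priority Register, or TPR) tracks task priority, so that only interrupts with a higher priority than the currently running task are attended to immediately.

Intel FlexPriority creates a virtual copy of the TPR that the guest operating system can read and, in some cases where no intervention is needed, modify. This gives a significant performance gain to 32-bit operating systems that use the TPR frequently (for example, a 35% performance improvement for applications running on Windows Server 2000).

--Intel VT FlexMigration: An important advantage of virtualization is the ability to migrate running applications between physical servers without downtime. Intel VT FlexMigration is designed to enable seamless migration between current and future servers based on Intel processors, even when the newer systems include enhanced instruction sets. With this technology, the hypervisor can establish a consistent set of instructions across all servers in the migration pool, so that workloads migrate seamlessly. The result is a more flexible, unified pool of server resources that works across multiple hardware generations.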



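  1. Chipset-assisted virtualization: VT-d

Better virtualization support inside Intel chipsets. As more guest operating systems are consolidated onto each server, the volume of data moving in and out of the system (I/O traffic) grows and becomes more complex. Without hardware assistance, the virtual machine monitor (VMM) must take part directly in every I/O transaction. This slows data transfer and also increases the load on the server processors because of the more frequent VMM activity. It is like a busy shopping mall in which every customer must enter and leave through a single door and can only get directions from the mall manager: customers lose time, and the manager cannot deal with other urgent matters.

Intel VT-d reduces the need for the VMM to manage I/O traffic, which speeds up data transfer and removes most of the performance overhead. It does so by letting the VMM securely assign specific I/O devices to specific guest operating systems. Each device has a dedicated region in system memory that only the device and its assigned guest operating system can access.

After the initial assignment, data flows directly between the guest operating system and its assigned device. I/O traffic moves faster, and the reduced VMM activity further lowers the load on the server processors. In addition, because I/O data for a given device or guest operating system cannot be accessed by any other hardware or guest software component, the security and availability of the system are also improved.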


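  1. NIC-assisted virtualization: VT-c

Better virtualization support inside Intel I/O devices. As enterprises deploy more and more applications in virtualized environments and use live migration to save power or improve availability, the demands on virtualized I/O devices rise significantly. By integrating extensive hardware assistance into the I/O devices (which connect the server to the data center network, the storage infrastructure, and other external devices), Intel VT-c further optimizes the network for virtualization. In essence this set of technologies works much like a post office: it sorts incoming letters, parcels, and envelopes and delivers them to their destinations. By performing these functions on dedicated network silicon, Intel VT-c greatly speeds up delivery and reduces the load on the VMM and the server processors. Intel VT-c includes the following two key technologies (supported by all current Intel 10 Gigabit server adapters and selected Intel Gigabit server adapters):

--Maximizing I/O throughput with Virtual Machine Device Queues (VMDq): In a traditional server virtualization environment, the VMM must classify every individual packet and send it to its assigned virtual machine, which consumes a large number of processor cycles. With VMDq, the classification is performed by dedicated hardware in the Intel server adapter, and the VMM only has to forward the pre-sorted groups of packets to the appropriate guest operating system. This reduces I/O latency and frees processor cycles for business applications. Intel VT-c can more than double I/O throughput, bringing virtualized applications close to native throughput. Each server can consolidate more applications with fewer I/O bottlenecks.

--Greatly improving virtualization performance with Virtual Machine Direct Connect (VMDc): Using the PCI-SIG Single Root I/O Virtualization (SR-IOV) standard, VMDc lets virtual machines access the network I/O hardware directly, which significantly improves virtualized performance. As described earlier, Intel VT-d supports a direct communication channel between a guest operating system and a device I/O port. SR-IOV extends this by supporting multiple direct communication channels per I/O port. For example, a single Intel 10 Gigabit server adapter can give each of 10 guest operating systems a protected, dedicated 1 Gb/s link. These direct links bypass the VMM switch, which further improves I/O performance and reduces the load on the server processors.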




  1. Intel ONP

    1. ONP®-onp

Intel® Open Network Platform (Intel ONP) is a reference architecture that provides engineering guidance and ecosystem enablement support to encourage widespread adoption of SDN and NFV solutions in Telco, Enterprise and Cloud. It is not a commercial product, but a pre-production reference that drives development and showcase SDN/NFV solution capabilities. Intel ONP reference architecture brings together Industry Standard High Volume Servers (SHVS) based on Intel® Architecture (IA) and a software stack composed of open source, open standard software ingredients.


One of the key objectives of Intel ONP is to align and optimize key Open Community software ingredients for architects and engineers targeting high performing SDN and NFV open source based solutions. Primary Intel ONP Software Ingredients included are: DPDK for accelerated packet processing; Open vSwitch* (OVS) including support for OVS with DPDK, which enables much better performance of the data plane when using DPDK libraries; OpenDaylight* (ODL) controller; and OpenStack orchestrator.





  1. DPDK

Intel applies DPDK to NFV in two parts: DPDK Switch and DPDK NFV. DPDK NFV includes protocol-stack optimization and an optimized IVSHMEM channel.

DPDK is a set of software libraries and Ethernet drivers (native and virtualized) that run in Linux user space to boost packet processing throughput on Intel® architecture.

DPDK library components include:

  • Environment Abstraction Layer - abstracts huge-page file system, provides multi-thread and multi-process support.
  • Memory Manager - allocates pools of objects in memory. A pool is created in huge page memory space and uses a ring to store free objects. It also provides an alignment helper to make sure that objects are padded, to spread them equally on all DRAM channels.
  • Buffer Manager – reduces, by a significant amount, the time the operating system spends allocating and de-allocating buffers. The DPDK library pre-allocates fixed size buffers which are stored in memory pools.
  • Queue Manager - implements safe lockless queues, instead of using spinlocks, that allow different software components to process packets, while avoiding unnecessary wait times.
  • Flow Classification - provides an efficient mechanism which incorporates Intel® Streaming SIMD Extensions (Intel® SSE) to produce a hash based on tuple information, so that packets may be placed into flows quickly for processing, greatly improving throughput.
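
The flow classification idea in the last item (hash the tuple, then group packets by flow) can be sketched in plain Python. This is only an illustration of the concept: `zlib.crc32` stands in for the SSE-accelerated hash, packets are modeled as tuples whose first five fields are the 5-tuple, and none of the names below are DPDK APIs.

```python
import zlib
from collections import defaultdict


def flow_hash(five_tuple: tuple, num_buckets: int = 1024) -> int:
    """Hash (src_ip, dst_ip, src_port, dst_port, proto) into a bucket index."""
    key = "|".join(str(part) for part in five_tuple).encode()
    return zlib.crc32(key) % num_buckets


def classify(packets, num_buckets: int = 1024) -> dict:
    """Group packets into flows keyed by (bucket, 5-tuple)."""
    flows = defaultdict(list)
    for pkt in packets:
        five_tuple = tuple(pkt[:5])
        # Keeping the full tuple in the key separates flows that collide
        # in the same hash bucket.
        flows[(flow_hash(five_tuple, num_buckets), five_tuple)].append(pkt)
    return dict(flows)
```

Packets of the same connection always land in the same flow, so later stages can process a flow without re-parsing every header.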



  • PMD affinitization
  • Multiple poll mode driver threads
  • DPDK port Rx Queues
  • Exact Match Cache
  • Compiler options
  • Simultaneous Multithreading (SMT)
  • The isolcpus kernel boot parameter

    isolcpus can be used on the kernel bootline to isolate cores from the kernel scheduler and hence dedicate them to OVS or other packet forwarding related workloads.

  • NUMA/Cluster On Die
  • Rx Mergeable buffers
  • Packet processing in the guest
  • DPDK virtio pmd in the guest






  1. DPDK with IVSHMEM


As the earlier analysis shows, the ivshmem mechanism requires support in QEMU itself and an ivshmem driver in the guest OS. What, then, does DPDK improve on top of ivshmem?


The DPDK IVSHMEM library facilitates fast zero-copy data sharing among virtual machines (host-to-guest or guest-to-guest) by means of QEMU's IVSHMEM mechanism.

  • The library works by providing a command line for QEMU to map several hugepages into a single IVSHMEM device.
  • For the guest to know what is inside any given IVSHMEM device (and to distinguish between DPDK and non-DPDK IVSHMEM devices), a metadata file is also mapped into the IVSHMEM segment.
  • No work needs to be done by the guest application to map IVSHMEM devices into memory; they are automatically recognized by the DPDK Environment Abstraction Layer (EAL).

A typical DPDK IVSHMEM use case looks like the following:

The metadata file therefore records what is inside the ivshmem device.

How is the metadata file created? DPDK provides the API rte_ivshmem_metadata_create() to create a new metadata file


  • rte_ivshmem_metadata_add_memzone() to add rte_memzone to metadata file
  • rte_ivshmem_metadata_add_ring() to add rte_ring to metadata file
  • rte_ivshmem_metadata_add_mempool() to add rte_mempool to metadata file

Applications on the guest or the host can then read the metadata file to obtain information about the memzones, rings, mempools, and so on.


rte_ivshmem_metadata_cmdline_generate() to generate the command line for QEMU

【IVSHMEM Environment Configuration】

  • Compile a special version of QEMU from sources.

The source code can be found on the QEMU website (currently, version 1.4.x is supported, but version 1.5.x is known to work also), however, the source code will need to be patched to support using regular files as the IVSHMEM memory backend. The patch is not included in the DPDK package, but is available on the Intel®DPDK-vswitch project webpage (either separately or in a DPDK vSwitch package).

  • Enable IVSHMEM library in the DPDK build configuration.

In the default configuration, IVSHMEM library is not compiled. To compile the IVSHMEM library, one has to either use one of the provided IVSHMEM targets (for example, x86_64-ivshmem-linuxapp-gcc), or set CONFIG_RTE_LIBRTE_IVSHMEM to "y" in the build configuration.

  • Set up hugepage memory on the virtual machine.

The guest applications run as regular DPDK (primary) processes and thus need their own hugepage memory set up inside the VM.



How does one write an ivshmem-based application on top of DPDK? No ivshmem sample application was found in the DPDK 2.0 package, but the official documentation gives some best practices for writing ivshmem applications:

When considering the use of IVSHMEM for sharing memory, security implications need to be carefully evaluated. IVSHMEM is not suitable for untrusted guests, as IVSHMEM is essentially a window into the host process memory. This also has implications for the multiple VM scenarios. While the IVSHMEM library tries to share as little memory as possible, it is quite probable that data designated for one VM might also be present in an IVSHMEM device designated for another VM. Consequently, any shared memory corruption will affect both host and all VMs sharing that particular memory.

IVSHMEM applications essentially behave like multi-process applications, so it is important to implement access serialization to data and thread safety. DPDK ring structures are already thread-safe, however, any custom data structures that the user might need would have to be thread-safe also.

Similar to regular DPDK multi-process applications, it is not recommended to use function pointers as functions might have different memory addresses in different processes.

It is best to avoid freeing the rte_mbuf structure on a different machine from where it was allocated, that is, if the mbuf was allocated on the host, the host should free it. Consequently, any packet transmission and reception should also happen on the same machine (whether virtual or physical). Failing to do so may lead to data corruption in the mempool cache.






  1. DPDK with ring




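DPDK uses rings mainly to emulate an Ethernet NIC in software, enabling VM-to-VM or VM-to-host packet communication. In other words, VMs, or a VM and the host, can send and receive packets through this emulated Ethernet NIC to exchange information and data.


DPDK provides a PMD on ring, but no PMD on ivshmem.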










  1. DPDK with vhost-user


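DPDK uses vhost-net-user to interact with the vhost module in QEMU and obtain the virtio vring information; the actual operations on the virtio NIC are implemented by virtio-net-user.

DPDK vhost also implements a second mechanism called vhost-cuse. However, Intel has stated that it will focus on supporting vhost-user going forward.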




For QEMU, this is done by using the -object memory-backend-file,share=on,... option, which means QEMU will create a file to serve as the guest RAM. The share=on option allows another process to map that file and thereby access the guest RAM.








rte_vhost_driver_register(path, flags)

This function registers a vhost driver into the system. For vhost-user server mode, a Unix domain socket file path will be created.

Currently two flags are supported (these are valid for vhost-user only):


DPDK vhost-user will act as the client when this flag is given.


When DPDK vhost-user acts as the client it will keep trying to reconnect to the server (QEMU) until it succeeds. This is useful in two cases:

  • When QEMU is not started yet.
  • When QEMU restarts (for example due to a guest OS reboot).

This reconnect option is enabled by default. However, it can be turned off by setting this flag.


rte_vhost_driver_session_start() is provided to dispatch vhost messages.



  • new_device(int vid)

This callback is invoked when a virtio net device becomes ready. vid is the virtio net device ID.

  • destroy_device(int vid)

This callback is invoked when a virtio net device shuts down (or when the vhost connection is broken).

  • vring_state_changed(int vid, uint16_t queue_id, int enable)

This callback is invoked when a specific queue's state is changed, for example to enabled or disabled.






  1. DPDK with PMD

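This section discusses DPDK's packet receive and transmit mechanism: the PMD (Poll-Mode Driver) polling model.


DPDK NFV NICs fall into five types: virtio, IVSHMEM (a PCI device emulated on a memory file), VF (SR-IOV), ring (an Ethernet NIC emulated on in-memory queues), and pcap (an Ethernet NIC emulated on disk files). For the DPDK vSwitch, besides physical 1G, 10G, and 40G NICs, there are virtual NIC devices such as tap, ivshmem, VF, ring, and pcap.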





DPDK's uio driver uses the so-called PMD (Poll-Mode Driver) polling model for packet reception and transmission.






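  • In run-to-completion mode, an API polls the RX descriptor queue of a given NIC. The packet is then processed on that same core and placed on the NIC's TX descriptor queue through the transmit API.
  • In pipeline mode, one core polls the RX descriptor queues of one or more ports through the API. Received packets are passed to other cores via a ring. Those cores process the packets and then place them on the port's TX descriptor ring through the transmit API.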
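The two models can be contrasted with a small single-threaded simulation. This is only a sketch: the queues are plain Python deques rather than NIC descriptor rings, `process` is a hypothetical stand-in for real packet handling, and the two pipeline stages run one after the other instead of on separate cores.

```python
from collections import deque


def process(pkt):
    # Hypothetical stand-in for real packet processing: mark the packet as handled.
    return pkt.upper()


def run_to_completion(rx_queue, tx_queue, burst=4):
    """One core polls RX, processes each burst itself, and places it on TX."""
    while rx_queue:
        batch = [rx_queue.popleft() for _ in range(min(burst, len(rx_queue)))]
        for pkt in batch:
            tx_queue.append(process(pkt))


def pipeline(rx_queue, tx_queue, burst=4):
    """An RX core only polls and hands packets over a ring; a worker core does the rest."""
    ring = deque()
    # Stage 1 (RX core): poll the RX queue in bursts and enqueue onto the inter-core ring.
    while rx_queue:
        for _ in range(min(burst, len(rx_queue))):
            ring.append(rx_queue.popleft())
    # Stage 2 (worker core): dequeue from the ring, process, and place on TX.
    while ring:
        tx_queue.append(process(ring.popleft()))


rx_a, tx_a = deque(["p1", "p2", "p3", "p4", "p5"]), deque()
rx_b, tx_b = deque(["p1", "p2", "p3", "p4", "p5"]), deque()
run_to_completion(rx_a, tx_a)
pipeline(rx_b, tx_b)
```

Both functions deliver the same packets in the same order; the difference is only where the work happens, which is what determines cache behaviour and core utilization in a real deployment.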


DPDK PMD supports hardware offload.








  • Round-Robin(Mode 0)
  • Active Backup(Mode 1)
  • Balance XOR(Mode 2)


  • Broadcast(Mode 3)
  • Link Aggregation 802.3AD(Mode 4)
  • Transmit Load Balancing(Mode 5)




librte_pmd_af_packet: receives and sends packets through a packet socket


Packet sockets are used to receive or send raw packets at the device driver (OSI Layer 2) level. They allow the user to implement protocol modules in user space on top of the physical layer.





【PMD on virtio】

This corresponds to the driver in the guest OS.






virtio_rxtx.c: the key implementation of PMD on virtio




2. virtio PMD driver registration





3. Initialization of virtio as a PCI device







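In addition, the RTE_PCI_DRV_INTR_LSC flag means that the device's link status is reported through an interrupt. As shown below, the link status of a virtio device is reported through an interrupt.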



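4. virtio interrupt handling


Below is the packet receive function of the virtio driver (understanding it requires knowing how virtio sends and receives packets; see the earlier section on multiqueue virtio-net).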



【PMD on VF】



【PMD on ring】

















【PMD on pNIC】




  1. OVS with DPDK

    1. DPDK improvements to OVS

DPDK Switch uses vhost-user to rework OVS into a high-performance user-space switch similar to Snabb Switch.


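  1. Reworks OVS into a user-space vSwitch
  2. Supports the vhost-user mechanism for VM-to-OVS and VM-to-VM communication; the guest OS only needs to support virtio
  3. Supports the IVSHMEM mechanism for VM-to-OVS and VM-to-VM shared-memory communication; the guest OS, however, must be reworked on top of DPDK to support shared memory
  4. Supports direct operation of physical NICs, with two driver modes: uio and vfio
  5. Optimizes overlay tunnels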


  1. Performance testing



According to an IBM experiment (source: Tang Gang, SDNLAB):

  • CPU: 2 sockets, Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 10 cores per socket
  • RAM: 64 Gbytes
  • NIC cards are Intel 10-Gigabit X540-AT2
  • Kernel used is Linux 3.13.0-53-generic


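Physical NIC -- OVS with DPDK -- VM -- OVS with DPDK -- physical NIC.

The channel between OVS with DPDK and the VM uses IVSHMEM. Strictly speaking it is the ring mechanism; the two are similar in that both are zero-copy shared-memory mechanisms, and no IVSHMEM sample application could be found in the DPDK 2.0 package. The forwarding program inside the VM uses the ring client, which can be found under the OVS installation directory.

In this mode, throughput can reach 10 Mpps (at 10 Gbps, the line rate for 64-byte frames is 14.88 Mpps).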
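The 14.88 Mpps line-rate figure for 64-byte frames at 10 Gbps can be checked with a little arithmetic. The sketch below assumes standard Ethernet framing, where every frame on the wire carries 20 extra bytes (preamble, start-of-frame delimiter, and inter-frame gap); the function name is illustrative.

```python
def line_rate_pps(link_bps, frame_bytes):
    """Maximum packets per second for a given link speed and frame size."""
    # Per-frame wire overhead: 7 B preamble + 1 B start-of-frame delimiter
    # + 12 B inter-frame gap = 20 B.
    wire_bytes = frame_bytes + 20
    return link_bps / (wire_bytes * 8)


max_pps = line_rate_pps(10e9, 64)
measured_pps = 10e6  # the 10 Mpps result quoted in the text

print(f"line rate: {max_pps / 1e6:.2f} Mpps")
print(f"measured share of line rate: {measured_pps / max_pps:.0%}")
```

Under these assumptions, the measured 10 Mpps corresponds to roughly two thirds of the theoretical maximum for 64-byte frames.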









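  1. Kernel-mode OVS with vhost optimization performs far worse than OVS with DPDK: 64-byte throughput is under 2 Gbps for the former and about 9 Gbps for the latter.
  2. OVS with DPDK over the IVSHMEM channel also outperforms OVS with DPDK over vhost-user in application-layer new-connection rate: about 13,000 for the former versus about 5,000 for the latter.
  3. In Layer 3 throughput, OVS with DPDK over the IVSHMEM channel is on par with OVS with DPDK over vhost-user, at about 9 Gbps.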





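Network topology 2: testing the virtual switch itself, OVS Native versus OVS with DPDK

As shown, the 64-byte throughput of OVS with DPDK is 10 times that of OVS Native, reaching about 7 Gbps.

Network topology 3: testing traffic through one VM behind OVS Native and behind OVS with DPDK

As shown, with one VM attached to the virtual switch, the 64-byte throughput of OVS with DPDK is 6 to 7 times that of OVS Native, reaching about 2 Gbps. Once traffic passes through a VM, performance also drops considerably compared with passing through the virtual switch alone (7 Gbps).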





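As shown, the forwarding performance of the virtual switch scales roughly linearly with the number of CPU cores; with 8 cores, 256-byte throughput can reach 40 Gbps. Different forwarding flows cause slight differences, but they are small.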


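As shown, with VXLAN encapsulation and decapsulation at 114-byte frames (64 + 50), OVS with DPDK approaches 4 Gbps. Throughput also scales roughly linearly with the number of CPU cores, whereas OVS Native barely improves as cores are added.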
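The "64 + 50" in the 114-byte figure is the VXLAN encapsulation overhead added to a 64-byte inner frame. A minimal sketch of that sum, assuming an IPv4 outer header and no outer VLAN tag (the dictionary and function names are illustrative):

```python
# Outer headers added by VXLAN encapsulation, in bytes.
# Assumption: IPv4 underlay without an outer 802.1Q tag.
VXLAN_OVERHEAD = {
    "outer_ethernet": 14,
    "outer_ipv4": 20,
    "outer_udp": 8,
    "vxlan_header": 8,
}


def encapsulated_size(inner_frame_bytes):
    """Size of the frame after VXLAN encapsulation."""
    return inner_frame_bytes + sum(VXLAN_OVERHEAD.values())


outer_size = encapsulated_size(64)
```

With an IPv6 underlay or an outer VLAN tag the overhead would be larger, so the 50-byte figure applies only to the configuration assumed here.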


  1. Run OVS with DPDK



Open vSwitch can use Intel(R) DPDK lib to operate entirely in userspace.

The DPDK support of Open vSwitch is considered experimental. It has not been thoroughly tested.


DPDK: >= 2.2

OVS: >= 2.5

Linux kernel: >= 2.6.34


Step 1: Configure, build, and install DPDK


export DPDK_DIR=/usr/src/dpdk-2.2
cd $DPDK_DIR


2. Update config/common_linuxapp so that DPDK generates a single lib file (this modification is also required for the IVSHMEM build).


Then run make install to build and install the library.

For default install without IVSHMEM:

make install T=x86_64-native-linuxapp-gcc

To include IVSHMEM (shared memory):

make install T=x86_64-ivshmem-linuxapp-gcc



Step 2: Configure and build the Linux kernel

DPDK's requirements on the Linux kernel are as follows:

  • Kernel version >= 2.6.34
  • glibc >= 2.7 (for features related to cpuset)
  • UIO support
  • HPET and HPET_MMAP configuration options should also be enabled if HPET(High Precision Event Timer) support is required.



Step 3: Configure and build OVS

  • Non IVSHMEM:

export DPDK_BUILD=$DPDK_DIR/x86_64-native-linuxapp-gcc/


  • IVSHMEM:

export DPDK_BUILD=$DPDK_DIR/x86_64-ivshmem-linuxapp-gcc/


cd (OVSDIR)/./ [CFLAGS="-g -O2 -Wno-cast-align"] make



Step 1: Set up the Linux kernel

1. Set up system boot. Add the following options to the kernel bootline:

default_hugepagesz=1GB hugepagesz=1G hugepages=1

2. Mount the hugetlbfs filesystem

mount -t hugetlbfs -o pagesize=1G none /dev/hugepages


DPDK devices can be setup using either the VFIO (for DPDK 1.7+) or UIO modules.



UIO requires inserting an out of tree driver igb_uio.ko that is available in DPDK:

  • insert uio.ko: modprobe uio
  • insert igb_uio.ko: insmod $DPDK_BUILD/kmod/igb_uio.ko
  • Bind network device to igb_uio: $DPDK_DIR/tools/ --bind=igb_uio eth1



VFIO needs to be supported in the kernel and the BIOS.

  • Insert vfio-pci.ko: modprobe vfio-pci
  • Set correct permissions on vfio device:

sudo /usr/bin/chmod a+x /dev/vfio

sudo /usr/bin/chmod 0666 /dev/vfio/*

  • Bind network device to vfio-pci:

$DPDK_DIR/tools/ --bind=vfio-pci eth1


  • First time only db creation (or clearing):

mkdir -p /usr/local/etc/openvswitch
mkdir -p /usr/local/var/run/openvswitch
rm /usr/local/etc/openvswitch/conf.db
ovsdb-tool create /usr/local/etc/openvswitch/conf.db \
    /usr/local/share/openvswitch/vswitch.ovsschema


  • Start ovsdb-server

ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \
    --remote=db:Open_vSwitch,Open_vSwitch,manager_options \
    --private-key=db:Open_vSwitch,SSL,private_key \
    --certificate=db:Open_vSwitch,SSL,certificate \
    --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach


  • First time after db creation, initialize:

ovs-vsctl --no-wait init


Step 4: Start vswitchd

export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach


If allocated more than one GB hugepage (as for IVSHMEM), set amount and use NUMA node 0 memory:

ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 -- unix:$DB_SOCK --pidfile --detach



To use ovs-vswitchd with DPDK, create a bridge with datapath_type "netdev" in the configuration database. For example:

ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev


Now you can add dpdk devices. OVS expects DPDK device names to start with "dpdk" and end with a portid. vswitchd should print (in the log file) the number of dpdk devices found.

ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk

Once the first DPDK port is added to vswitchd, it creates a polling thread that polls the DPDK device in a continuous loop. CPU utilization for that thread is therefore always 100%.


Note: creating bonds of DPDK interfaces is slightly different to creating bonds of system interfaces. For DPDK, the interface type must be explicitly set, for example:

ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 -- set Interface dpdk0 type=dpdk -- set Interface dpdk1 type=dpdk



cd /usr/src/ovs/utilities/

./ovs-ofctl del-flows br0


Add flows between port 1 (dpdk0) and port 2 (dpdk1):

./ovs-ofctl add-flow br0 in_port=1,action=output:2
./ovs-ofctl add-flow br0 in_port=2,action=output:1


【Using DPDK rings】

OVS runs on the host OS, and OVS with DPDK can use the DPDK ring mechanism.

The OVS with DPDK suite provides a ring client, which can run in two scenarios: on the host OS, or inside a VM.

Running on the host OS:

Following the steps above to create a bridge, you can now add dpdk rings as a port to the vswitch. OVS will expect the DPDK ring device name to start with dpdkr and end with a portid.

ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr

DPDK rings client test application

Included in the test directory is a sample DPDK application for testing the rings. This is from the base dpdk directory and modified to work with the ring naming used within ovs.

location tests/ovs_client

To run the client :

cd /usr/src/ovs/tests/
ovsclient -c 1 -n 4 --proc-type=secondary -- -n "port id you gave dpdkr"

In the case of the dpdkr example above the "port id you gave dpdkr" is 0.

The application simply receives an mbuf on the receive queue of the ethernet ring and then places that same mbuf on the transmit ring of the ethernet ring.


In addition to executing the client in the host, you can execute it within a guest VM. To do so you will need a patched QEMU (with IVSHMEM support).



【Using DPDK vhost-user】

Step 1: Set up OVS with DPDK

Following the steps above to create a bridge, you can now add DPDK vhost-user as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names, except that forward and backward slashes are prohibited in the names.

For vhost-user, the name of the port type is dpdkvhostuser

ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 type=dpdkvhostuser

This action creates a socket located at /usr/local/var/run/openvswitch/vhost-user-1, which you must provide to your VM on the QEMU command line.


  1. Configure sockets.

Pass the following parameters to QEMU to attach a vhost-user device:

-chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1

where vhost-user-1 is the name of the vhost-user port added to the switch. Repeat the above parameters for multiple devices, changing the chardev path and id as necessary. Note that a separate and different chardev path needs to be specified for each vhost-user device. For example you have a second vhost-user port named 'vhost-user-2', you append your QEMU command line with an additional set of parameters:

-chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
-netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
-device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2

2. Configure huge pages.

QEMU must allocate the VM's memory on hugetlbfs. Vhost-user ports access a virtio-net device's virtual rings and packet buffers by mapping the VM's physical memory on hugetlbfs. To enable vhost-user ports to map the VM's memory into their process address space, pass the following parameters to QEMU:

-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on
-numa node,memdev=mem -mem-prealloc

3. Optional: Enable multiqueue support. QEMU needs to be configured with multiple queues, and the number of queues must be less than or equal to Open vSwitch other_config:n-dpdk-rxqs. The $q below is the number of queues. The $v is the number of vectors, which is '$q x 2 + 2'.

-chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
-netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q
-device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v
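The relationship between the queue count and the vector count ('$q x 2 + 2') is easy to get wrong by hand, so it can help to generate the two values together. The following is a minimal sketch; the helper names are hypothetical, and the option strings simply mirror the QEMU parameters quoted above rather than a complete command line.

```python
def virtio_vectors(queues):
    """Number of MSI-X vectors for a multiqueue virtio-net device: q * 2 + 2."""
    if queues < 1:
        raise ValueError("at least one queue is required")
    return queues * 2 + 2


def multiqueue_args(queues, netdev_id, chardev_id, mac):
    """Build the -netdev/-device pair for a multiqueue vhost-user port (illustrative helper)."""
    netdev = (f"type=vhost-user,id={netdev_id},chardev={chardev_id},"
              f"vhostforce,queues={queues}")
    device = (f"virtio-net-pci,mac={mac},netdev={netdev_id},"
              f"mq=on,vectors={virtio_vectors(queues)}")
    return ["-netdev", netdev, "-device", device]


args = multiqueue_args(4, "mynet2", "char2", "00:00:00:00:00:02")
print(" ".join(args))
```

Deriving `vectors` from `queues` in one place keeps the two values consistent when the queue count is changed later.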

Step 3: Configure the NIC driver in the guest OS

If one wishes to use multiple queues for an interface in the guest, the driver in the guest operating system must be configured to do so. It is recommended that the number of queues configured be equal to '$q'.

For example, this can be done for the Linux kernel virtio-net driver with:

ethtool -L <DEV> combined <$q>

-L: Changes the numbers of channels of the specified network device

combined: Changes the number of multi-purpose channels.





  1. Source code analysis of the improvements





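With DPDK, OVS moves into user space. The interfaces facing the VMs and the physical NICs are all taken over by DPDK, while the interfaces to the kernel are TAP devices and sockets. The core implementation is in openvswitch-2.5.0/lib/netdev-dpdk.c.

OVS with DPDK supports three types of network interfaces: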





As for the optimizations Intel mentions for data-plane forwarding, tunnels, and so on, no concrete implementation was found in OVS 2.5.



    1. FD.io (Fast data - Input/Output) is a collection of several projects and libraries to amplify the transformation to support flexible, programmable and composable services on a generic hardware platform. FD.io offers the Software Defined Infrastructure developer community a landing site with multiple projects fostering innovations in software-based packet processing towards the creation of high-throughput, low-latency and resource-efficient IO services suitable to many architectures (x86, ARM, and PowerPC) and deployment environments (bare metal, VM, container).

At the heart of FD.io is Vector Packet Processing (VPP) technology.



  1. VPP

    1. VPP overview

In development since 2002, VPP is production code currently running in shipping products.

It runs in user space on multiple architectures including x86, ARM, and Power architectures on both x86 servers and embedded devices. The design of VPP is hardware, kernel, and deployment (bare metal, VM, container) agnostic.

It runs completely in userspace.

VPP-powered FD.io is two orders of magnitude faster than currently available technologies.

The primary problem Cisco set out to solve with the development of vector packet processing (VPP) in 2002 was "Accelerating the NFV Data Plane".


The VPP Technology also provides a very high performance low level API. The API works via a shared memory message bus. The messages passed along the bus are specified in a simple IDL (Interface Definition Language) which is used to create C client libraries and Java client libraries. Support for generating bindings in additional languages could be added. These client libraries make it very easy to write external applications that programmatically control VPP. The shared memory message bus approach is very high performance.

Remote programmability can be achieved by having as your external app a Management Agent that exposes your SDN protocol of choice.

There is a 'Honeycomb' agent available at launch which exposes yang models for VPP functionality via netconf and restconf available currently. A controller which supports netconf/yang, such as OpenDaylight can 'mount' the Honeycomb Management Agent.


Primary Characteristics Of VPP

-Improved fault-tolerance and ISSU when compared to running similar packet processing in the kernel:

  • crashes seldom require more than a process restart
  • software updates do not require system reboots
  • development environment is easier to use and perform debug than similar kernel code
  • user-space debug tools (gdb, valgrind, wireshark)
  • leverages widely-available kernel modules (uio, igb_uio): DMA-safe memory

-Runs as a Linux user-space process:

  • same image works in a VM, in a Linux container, or over a host kernel
  • KVM and ESXi: NICs via PCI direct-map
  • Vhost-user, netmap, virtio paravirtualized NICs
  • Tun/tap drivers
  • DPDK poll-mode device drivers

-Integrated with the Intel DPDK, VPP supports existing NIC devices including:

  • Intel i40e, Intel ixgbe physical and virtual functions, Intel e1000, virtio, vhost-user, Linux TAP
  • HP rebranded Intel Niantic MAC/PHY
  • Cisco VIC

-Security issues considered:

  • Extensive white-box testing by Cisco's security team
  • Image segment base address randomization
  • Shared-memory segment base address randomization
  • Stack bounds checking
  • Debug CLI "chroot"
  • The vector method of packet processing has been proven as the primary punt/inject path on major architectures.

-Supported Architectures

  • x86/64
  • Supported Packaging Models

-The VPP platform supports package installation on the following operating systems:

  • Debian
  • Ubuntu 14.04



  1. How VPP works

【cache thrashing】

Supported by the data plane development kit's poll mode drivers (PMD) and ring buffer libraries, VPP aims to increase forwarding plane throughput by reducing the number of misses in flow/forwarding table caches while replacing standard serial packet lookups with a parallel approach.

Short-lived flows or high-entropy packet fields -- those likely to have differing values from packet to packet -- kill caches, hence the introduction of the megaflow (aggregate) cache into OVS in an attempt to turn mice (flows) into elephants.

Let's start from scratch: An empty cache is "cold," resulting in misses on all queries, of course. The cache "warms up" as those misses are subsequently used to populate it, at which point the cache is said to be "warm."

A warm cache should be resulting in an appropriate number of query "hits" with either simple first-in, first-out (FIFO) methodologies or, more appropriately, least recently used (LRU) or least frequently used (LFU) algorithms deciding how old entries should be replaced by new ones. This replacement policy is actually more critical than I'm giving it credit for, here, as a high cache churn rate could be due to poor replacement algorithms. A more likely reason, though, is those pesky short-lived flows, with each new addition resulting in a miss and the possible replacement of a long-term flow entry in the cache with one that is never seen again. An incessant eviction of useful data is (superbly) referred to as "cache thrashing."
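The effect of short-lived flows on a warm cache can be reproduced with a toy LRU flow cache. This is only a sketch under simple assumptions: a four-entry cache, flow keys as strings, and a synthetic trace in which every long-lived flow is followed by a flow that is never seen again; none of the names correspond to OVS or VPP data structures.

```python
from collections import OrderedDict


class LRUFlowCache:
    """Toy flow cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, flow):
        if flow in self.entries:
            self.entries.move_to_end(flow)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.entries[flow] = True       # a miss populates the cache
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry


def hit_rate(trace, capacity=4):
    cache = LRUFlowCache(capacity)
    for flow in trace:
        cache.lookup(flow)
    return cache.hits / len(trace)


# Four long-lived flows only: the cache warms up after four misses.
elephants = ["A", "B", "C", "D"] * 25

# The same flows interleaved with "mice" that appear exactly once.
mixed = []
for i in range(100):
    mixed.append("ABCD"[i % 4])
    mixed.append(f"mouse-{i}")

warm_rate = hit_rate(elephants)
thrash_rate = hit_rate(mixed)
```

In the mixed trace each long-lived flow is evicted by one-off entries before it is looked up again, so the cache never stays warm even though the long-lived flows alone would fit in it.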

In CPU parlance, this functionality would leverage the instruction cache (i-cache). There is also a supporting data cache (d-cache), used to store pre-fetched data needed to support the i-cache. VPP primarily favors the i-cache, although there are some advantages obtained with increased d-cache efficiencies as well.


【forwarding graph and graph node】

The forwarding graph, which essentially defines the forwarding operations for each given packet, comprises a number of "graph nodes," each with a different role to play in processing or directing the packet.

VPP technology is highly modular, allowing for new graph nodes to be easily "plugged in" without changes to the underlying code base or kernel. This gives developers the potential to easily build any number of packet processing devices with varying forwarding graphs, including not just those supporting switches and routers but intrusion detection and prevention, firewalls or load balancers.




【scalar processing and vector processing】

With the bottleneck being the cache, even in the most highly tuned, all-user-space, DPDK accelerated environments, the switch pipeline operates in a serial mode, handling one packet at a time. Even if there is a nice big DPDK FIFO chock full o' packets, they are sent through the "forwarding graph" individually. In computing parlance, this is called scalar processing.



Rather than working on single packets in a serial manner, VPP operates simultaneously on an array, or collection, of packets. This is called vector processing.

Rather than just grabbing the packet at the front of the line, however, the VPP engine takes a chunk of packets up to a predetermined maximum of, let's say, 256. Naturally, the vector itself doesn't contain the actual packets but pointers to their locations in a buffer.

The "superframe" of N x packets, as it has been referred to, proceeds to the first graph node, where the Ethernet header is decoded and the EtherType is identified, as previously. While our theory of temporal locality suggests that the EtherType will be identical across the vector (i.e. IPv4), naturally there is a chance a group of diverse packets (i.e. IPv6) made it into the superframe. If this is the case, the forwarding graph forks and the superframe is partitioned into two "subnets," each with a distinct next-hop graph node.
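The split of a vector at the first graph node can be sketched in a few lines. This is a simplified illustration, not VPP code: the vector is modelled as a list of (buffer index, EtherType) pairs, and the next-node table is an assumption made for the example.

```python
from collections import defaultdict

# Illustrative EtherType-to-next-node table (an assumption for this sketch).
NEXT_NODE = {
    0x0800: "ip4-input",
    0x86DD: "ip6-input",
}


def ethernet_input(vector):
    """Partition a vector of (buffer_index, ethertype) pairs by next graph node.

    The vector carries indices that point at packet buffers, not the packets
    themselves; RX order is preserved inside every sub-vector.
    """
    sub_vectors = defaultdict(list)
    for buffer_index, ethertype in vector:
        next_node = NEXT_NODE.get(ethertype, "error-drop")
        sub_vectors[next_node].append(buffer_index)
    return dict(sub_vectors)


superframe = [(0, 0x0800), (1, 0x0800), (2, 0x86DD), (3, 0x0800)]
dispatch = ethernet_input(superframe)
```

When temporal locality holds, almost every packet lands in the same sub-vector, so the next node still processes a large batch with a warm instruction cache.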

The problems with traditional scalar packet processing are:

  • thrashing occurs in the I-cache
  • each packet incurs an identical set of I-cache misses
  • no workaround to the above except to provide larger caches

By contrast, vector processing processes more than one packet at a time.

One of the benefits of the vector approach is that it fixes the I-cache thrashing problem. It also mitigates the dependent read latency problem (pre-fetching eliminates the latency).

This approach fixes the issues related to stack depth / D-cache misses on stack addresses. It improves "circuit time". The "circuit" is the cycle of grabbing all available packets from the device RX ring, forming a "frame" (vector) that consists of packet indices in RX order, running the packets through a directed graph of nodes, and returning to the RX ring. As processing of packets continues, the circuit time reaches a stable equilibrium based on the offered load.

As the vector size increases, processing cost per packet decreases because you are amortizing the I-cache misses over a larger N.
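The amortization argument can be written as a one-line cost model. The cycle counts below are hypothetical placeholders chosen only to show the shape of the curve; they are not measurements of VPP.

```python
def cycles_per_packet(vector_size, icache_miss_cycles=1000.0, work_cycles=200.0):
    """Per-packet cost when one set of I-cache misses is shared by the whole vector.

    icache_miss_cycles: cost of warming the instruction cache once per vector (hypothetical).
    work_cycles: per-packet processing cost once the code is cached (hypothetical).
    """
    return work_cycles + icache_miss_cycles / vector_size


costs = {n: cycles_per_packet(n) for n in (1, 16, 256)}
for n, cost in costs.items():
    print(f"vector size {n:>3}: {cost:.1f} cycles/packet")
```

With a vector size of 1 the model reduces to scalar processing, where every packet pays the full miss penalty; as N grows, the per-packet cost approaches the pure processing cost.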



【temporal locality】

VPP operates on a simple principle with a (typically) scientific name: temporal locality -- or locality in time. In terms of application flows, this phenomenon notes the relationship between packets sampled within a short period of time and the strong likelihood that they are similar, if not identical, in nature. Packets with such attributes would reuse the same resources and will be accessing the same (cache) memory locations.


  1. VPP performance


One of the benefits of this implementation of VPP is its high performance on relatively low-power computing. This high level of performance is based on the following highlights:

  • High-performance user-space network stack for commodity hardware
  • The same code for host, VMs, Linux containers
  • Integrated vhost-user virtio backend for high speed VM-to-VM connectivity
  • L2 and L3 functionality, multiple encapsulations
  • Leverages best-of-breed open source driver technology: Intel DPDK
  • Extensible by use of plugins
  • Control-plane / orchestration-plane via standards-based APIs


The rates reflect VPP and OVS-DPDK performance tested on a Haswell x86 platform with E5-2698v3 2x16C 2.3GHz. The graphs show NDR rates for 12-port 10GE, 16 cores, IPv4.





  1. VPP source code analysis





vlib_main_t: VPP's control data structure, which includes the graph node data structures;

vlib_node_runtime_t: describes the runtime information of a graph node;



static uword ip4_input (vlib_main_t * vm,
                        vlib_node_runtime_t * node,
                        vlib_frame_t * frame)














int vnet_hw_interface_rx_redirect_to_node (vnet_main_t * vnm, u32 hw_if_index,
                                           u32 node_index)



  1. SDN, NFV, and OpenStack







  1. OVS
  2. OpenContrail's vRouter
  3. OpenStack's DVR




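Essentially, they all rely on the plugin mechanism of OpenStack Neutron:

Each third-party networking project registers an ML2 (Modular Layer 2) plugin with Neutron, which acts as a proxy for interacting with the Neutron API service.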



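  1. VNF performance improvement in practice


The figure below shows that by optimizing the channel between the VM and the host OS with SR-IOV, we improved the I/O throughput of the vFW by tens of times compared with virtio: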




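  1. Summary

    1. OpenStack is the orchestration system for virtual network resources; OpenDaylight, OpenContrail, and similar projects are the scheduling and control systems for virtual network resources; DVR or OVS is the carrier of virtual network resources and the executor of virtual network functions.
    2. NFV describes network function virtualization from the implementation perspective, and VNFs are the key output of NFV. OpenContrail can be regarded as a system that combines VNF orchestration and control.
    3. DPDK was not invented from scratch by Intel; its framework builds on earlier practical experience with NPs (network processors) and multicore platforms. As the earlier analysis shows, the DPDK framework does not go beyond the frameworks proposed by RMI and Cavium at the time: RMI's data plane was implemented in kernel space, while Cavium's was implemented in user space. Influenced by the fact that many vendors used Cavium chips, Intel adopted the user-space approach. So far, the user-space approach appears much better than the kernel-space approach in debuggability and portability.
    4. VNFs are realized through NFV, and NFV currently has two main implementation forms: virtual machines and containers.
    5. Because a VNF is ultimately implemented in virtualized software, the impact on network I/O throughput is significant. How can the network I/O performance of a VNF be improved? From the VM perspective, there are three main points to optimize: the VNF's datapath, the virtual channel between the VM and the host OS, and the channel between the host OS and the physical NIC.
    6. Datapath optimization options for VNFs include OpenDP, Snabb, libuinet, mTCP, NUSE, OpenOnload, and VPP. Options for the virtual channel between the VM and the host OS include virtio-net, vhost-net, vhost-user, ivshmem, macvtap, multiqueue virtio-net, SR-IOV, and VMDq. Options for the channel between the host OS and the physical NIC include Snabb, Lagopus, OVS, Rump Kernel, and KVM4NFV.
    7. Intel's DPDK has its own optimizations in all three areas: datapath optimization is mainly reflected in the PMD drivers; VM-to-host-OS channel optimization in ivshmem, vhost-user, SR-IOV, and so on; and host-OS-to-physical-NIC channel optimization in the rework of OVS.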


  1. Appendix

    1. PCIe specification


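●VID: Vendor Identification, also called Vendor ID. It identifies the patent owner (technology vendor) that invented the device, that is, the commonly mentioned vendor ID. This ID is assigned centrally by the PCI-SIG and is a unique vendor identifier; duplicates are not allowed. For example, ATI's VID is 0x1002 and NVIDIA's VID is 0x10DE.

●DID: Device Identification, also called Device ID. It is the code that identifies the device itself, that is, the commonly mentioned device ID. It mainly distinguishes different models of the same kind of device and is generally assigned by the technology vendor according to the PCI specification. Devices from different vendors may share the same DID; since every vendor has a unique VID, this causes no confusion. For example, the ATI 9800 and ATI 9800XT, both developed under the code name R350, have different device IDs: 0x4E48 and 0x4E4A respectively.

●SID: Subsystem Identification, also called Sub-ID, the subsystem (device) identifier. It is a secondary code for a device manufactured under license. Unlike the DID, it is not the code of the original technology vendor's device but the code of the device as built by the second-tier manufacturer. When the technology vendor manufactures the device itself, this code may be the same as the DID.

●SVID: Subsystem Vendor Identification, also called Sub-Vendor ID, the subsystem vendor identifier. It identifies the second-tier manufacturer as certified by the PCI-SIG. It is likewise a unique vendor identifier and duplicates are not allowed, although the VID and SVID of the same technology vendor may be identical. The SID and SVID are usually presented together as the device's secondary IDs. For example, the Subsystem IDs of the 8139 NIC are 0x813910EC: the leading 8139 is the SID and the trailing 10EC is the SVID.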
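The combined Subsystem IDs value in the 8139 example can be split with two bit operations. A minimal sketch, assuming the layout described above where the SID occupies the upper 16 bits and the SVID the lower 16 bits (the function name is illustrative):

```python
def split_subsystem_ids(value):
    """Split a 32-bit Subsystem IDs value into (SID, SVID)."""
    sid = (value >> 16) & 0xFFFF   # upper 16 bits: subsystem (device) ID
    svid = value & 0xFFFF          # lower 16 bits: subsystem vendor ID
    return sid, svid


sid, svid = split_subsystem_ids(0x813910EC)
print(f"SID=0x{sid:04X} SVID=0x{svid:04X}")
```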

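●RID: Revision ID, also called Rev ID. This is the commonly mentioned revision number: REV.X.

●CC: Class Code. It distinguishes different classes of device, or different specifications within the same class. Each kind of device, such as display adapters, sound cards, SCSI, and USB devices, has its own standard classification. For example, Class Code 000c0300 denotes a UHCI USB device, while Class Code 000c0310 denotes an OHCI USB device.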



  1. pf_ring and vring





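  1. Introduces a new socket type based on a ring buffer: the pf_ring socket
  2. Every pf_ring socket that is created is allocated its own ring buffer
  3. When the socket is bound to a NIC, packets received by the NIC are written by DMA directly into the socket's ring buffer
  4. The application layer can read data directly from the ring buffer
  5. When new packets arrive, they can overwrite buffer slots that the application layer has already read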
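The ring buffer behaviour in the steps above can be sketched as a fixed-size circular buffer in which only slots already read by the application are reused. This is a simplified single-producer, single-consumer model in plain Python; the class and method names are hypothetical and do not reflect the actual pf_ring implementation.

```python
class PacketRing:
    """Fixed-size ring: the producer reuses only slots the consumer has already read."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0    # next slot the producer (NIC side) writes
        self.tail = 0    # next slot the consumer (application) reads
        self.count = 0   # slots currently holding unread packets

    def write(self, pkt):
        if self.count == len(self.slots):
            return False  # ring full: unread packets are never overwritten
        self.slots[self.head] = pkt
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def read(self):
        if self.count == 0:
            return None   # nothing to read
        pkt = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return pkt


ring = PacketRing(2)
accepted = [ring.write("a"), ring.write("b"), ring.write("c")]  # third write finds the ring full
first = ring.read()              # frees the slot that held "a"
reused = ring.write("c")         # the freed slot is overwritten with the new packet
remaining = [ring.read(), ring.read(), ring.read()]
```

Dropping on a full ring is one possible policy chosen for this sketch; the source text only states that slots already read by the application may be overwritten.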

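Note: Unix-like systems have a packet receive mechanism called device polling; on Linux the corresponding API is usually called NAPI. It works as follows:

  1. When the NIC receives a packet, it raises an interrupt to the system
  2. On receiving this interrupt, the system disables the NIC interrupt, activates the polling routine to poll the NIC for packets, and then re-enables the NIC interrupt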



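  1. Packet descriptors
  2. The used vring and the available vring; each vring element is a packet descriptor
  3. The virtqueue, which is the queue used to transfer data between the guest virtio driver and the virtio PCI device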






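  1. MSI and MSI-X

MSI stands for Message Signaled Interrupt.

MSI appears in the PCI 2.2 and PCIe specifications and is an in-band interrupt signaling mechanism. Traditional interrupts use dedicated interrupt pins: when an interrupt is raised, the level on the pin changes (usually it is pulled low). INTx is this traditional out-of-band mechanism, which uses a dedicated line to carry control information. PCIe, however, has no separate interrupt pins, so it uses special messages to emulate asserting and deasserting the interrupt pin. MSI lets a device write a small amount of data to a designated MMIO address, and the chipset then delivers the corresponding interrupt to the CPU.

From the electrical and mechanical perspective, MSI reduces the number of interrupt pins needed and increases the number of available interrupts. Traditional PCI allows only 4 interrupts per device, and because these are shared, most devices use only one. MSI allows each device to have 1, 2, 4, 8, 16, or even 32 interrupts.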




  1. References

  1. Software package: Intel DPDK 2.0
  2. Software package: OVS 2.5.0
  3. Software package: VPP 10.06
  4. Intel® Open Network Platform Release 2.1 Reference Architecture Guide (PDF)
