Why Kubernetes doesn't use libnetwork
Historical background
https://kubernetes.io/zh-cn/blog/2016/01/14/why-kubernetes-doesnt-use-libnetwork/
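Back when Kubernetes 1.0 was released, at about the same time that Docker introduced the Container Network Model (CNM) based on libnetwork, Kubernetes already had its own basic network plugin mechanism, and it differed from libnetwork.
Kubernetes adopted CNI (Container Network Interface) instead, an alternative container network model proposed by CoreOS as part of the App Container (appc) specification.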
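Why not CNM
Reason 1
Docker has a concept of “local” and “global” drivers. Local drivers (such as “bridge”) are machine-centric and do no cross-node coordination. Global drivers (such as “overlay”) rely on libkv (a key-value store abstraction library) to coordinate across machines. This key-value store is yet another plugin interface, and a very low-level one (keys and values, with no semantic meaning). To run something like Docker’s overlay driver in a Kubernetes cluster, cluster admins would have to run a whole separate instance of consul, etcd or zookeeper (see multi-host networking), or else Kubernetes would have to provide its own libkv implementation.
Kubernetes initially tried the latter approach, but found the libkv interface very low-level, with a schema defined internally by Docker. Kubernetes would also have had to either expose its underlying key-value store directly or offer key-value semantics on top of its structured API (which is itself implemented on a key-value system). Neither option is attractive for performance, scalability and security reasons. The net result is that the overall Kubernetes architecture would become more complicated just to support Docker's networking model, which runs against its design philosophy.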
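Reason 2
CNI is more philosophically aligned with Kubernetes. It is far simpler than CNM, requires no daemons, and is at least plausibly cross-platform (CoreOS’s rkt container runtime supports it). Being cross-platform means the same network configuration can work across multiple container runtimes (for example Docker, Rocket, Hyper). It follows the UNIX philosophy of doing one thing well.
In addition, wrapping a CNI plugin to produce a more customized CNI plugin is cheap (a simple shell script is enough), whereas CNM is much more complex in this respect. This makes CNI an attractive option for rapid development and iteration. Early prototypes have shown that almost 100% of the network logic currently hard-coded in kubelet can be moved out into a plugin.
Kubernetes investigated writing a “bridge” CNM driver for Docker that would run CNI drivers. This turned out to be very complicated. First, the CNM and CNI models are very different, so none of the “methods” line up. The global versus local and key-value store issues discussed above also remain. Assuming such a driver declared itself local, it would have to get information about logical networks from Kubernetes. Unfortunately, Docker drivers are hard to map to other control planes such as Kubernetes. Specifically, a driver is not told the name of the network a container is being attached to, only an ID that Docker allocates internally. This makes it hard for a driver to map back to any network concept that exists in another system.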
What is CNM
CNM is implemented by libnetwork and is built from the following three main components:
- Sandbox
A Sandbox contains the configuration of a container’s network stack. This includes management of the container’s interfaces, routing table and DNS settings. An implementation of a Sandbox could be a Linux Network Namespace, a FreeBSD Jail or other similar concept. A Sandbox may contain many endpoints from multiple networks.
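In short, a network sandbox holds a container's network stack: its interfaces (NICs), routing table and DNS configuration. It can be implemented with a Linux network namespace, a FreeBSD jail, or a similar isolation technique, and one Sandbox can contain multiple endpoints from multiple networks.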
- Endpoint
An Endpoint joins a Sandbox to a Network. An implementation of an Endpoint could be a `veth` pair, an Open vSwitch internal port or similar. An Endpoint can belong to only one network and it can belong to only one Sandbox, if connected.
In short, an Endpoint is the link that attaches a Sandbox to a Network. It can be implemented with a `veth` pair or an Open vSwitch internal port; one side of the Endpoint connects to the Sandbox and the other side connects to the Network.
- Network
A Network is a group of Endpoints that are able to communicate with each-other directly. An implementation of a Network could be a Linux bridge, a VLAN, etc. Networks consist of many endpoints.
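In short, a Network is a group of endpoints that can communicate with one another. Typical implementations are a Linux bridge or a VLAN, and a Network can contain many endpoints.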
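The relationships among the three components can be sketched in a few lines of Python. This is only an illustrative model of the constraints quoted above, not libnetwork's actual (Go) API: the class names mirror the CNM terms, while `create_endpoint` and `join` are hypothetical helper names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Network:
    """A group of endpoints that can talk to each other directly."""
    name: str
    endpoints: List["Endpoint"] = field(default_factory=list)


@dataclass(eq=False)
class Sandbox:
    """A container's network stack; may hold endpoints from many networks."""
    container: str
    endpoints: List["Endpoint"] = field(default_factory=list)


@dataclass(eq=False)
class Endpoint:
    """Joins one Sandbox to exactly one Network."""
    name: str
    network: Network
    sandbox: Optional[Sandbox] = None

    def join(self, sandbox: Sandbox) -> None:
        # An endpoint can be attached to at most one sandbox.
        if self.sandbox is not None:
            raise RuntimeError(f"{self.name} is already attached")
        self.sandbox = sandbox
        sandbox.endpoints.append(self)


def create_endpoint(network: Network, name: str) -> Endpoint:
    """Create an endpoint that belongs to the given network."""
    endpoint = Endpoint(name=name, network=network)
    network.endpoints.append(endpoint)
    return endpoint


# One sandbox attached to two networks through two endpoints.
frontend, backend = Network("frontend"), Network("backend")
box = Sandbox("web-1")
create_endpoint(frontend, "ep0").join(box)
create_endpoint(backend, "ep1").join(box)
print([ep.network.name for ep in box.endpoints])  # ['frontend', 'backend']
```

The endpoint, not the sandbox, carries the "exactly one network" constraint, which is why a sandbox needs one endpoint per network it joins.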
In addition, CNM relies on two key objects to carry out Docker's network management:
- Network Controller
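The NetworkController is the entry point into libnetwork. It exposes APIs that users (such as Docker Engine) call to manage network configuration. libnetwork supports multiple drivers, and the controller allows a specific driver to be bound to a given network.
- Driver
The Driver is not visible to users. It is plugged in to provide the actual network implementation and is responsible for managing the network (including IPAM), which covers allocating and reclaiming resources.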
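What is CNI
CNI is the interface through which Kubernetes supports container network plugins. CNI requires a network plugin to be an executable file that the container management platform above it can invoke. A network plugin does only two things:
- Add a container to a network
- Remove a container from a network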
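To make the "executable that adds and removes" idea concrete, here is a minimal sketch of a plugin's dispatch logic. It assumes the operation arrives in the `CNI_COMMAND` environment variable and the configuration arrives as JSON on standard input; `run_plugin` and `main` are hypothetical names, the ADD result is a placeholder, and the error code 4 reflects my reading of the spec's "invalid environment variables" code rather than anything stated in this article.

```python
import json
import os
import sys


def run_plugin(env, stdin_text):
    """Handle one invocation; return (exit_code, text_for_stdout)."""
    conf = json.loads(stdin_text)
    command = env.get("CNI_COMMAND", "")
    if command == "ADD":
        # A real plugin would create env["CNI_IFNAME"] inside env["CNI_NETNS"] here.
        result = {"cniVersion": conf["cniVersion"], "interfaces": [], "ips": []}
        return 0, json.dumps(result)
    if command == "DEL":
        # A real plugin would remove the interface; nothing is printed on success.
        return 0, ""
    error = {
        "cniVersion": conf.get("cniVersion", ""),
        "code": 4,  # assumed: invalid necessary environment variables
        "msg": f"unsupported CNI_COMMAND {command!r}",
    }
    return 1, json.dumps(error)


def main():
    """Entry point a real executable would call at the bottom of the file."""
    code, out = run_plugin(dict(os.environ), sys.stdin.read())
    sys.stdout.write(out)
    sys.exit(code)
```

Keeping the logic in `run_plugin` and the process plumbing in `main` makes the behaviour testable without spawning a process.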
What the CNI specification defines
- A format for administrators to define network configuration.
- A protocol for container runtimes to make requests to network plugins.
- A procedure for executing plugins based on a supplied configuration.
- A procedure for plugins to delegate functionality to other plugins.
- Data types for plugins to return their results to the runtime.
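Network configuration format
A CNI network configuration contains directives for both the container runtime and the network plugins. At plugin execution time, the runtime interprets this configuration format and transforms it into the form that is passed to the plugins.
Configuration file
The configuration file is a JSON object containing the following keys (the exact fields depend on the plugin; not all plugins use the same ones):
- cniVersion: the version of the CNI specification the configuration conforms to
- name: the name of the network
- type: the plugin type, which is the name of the plugin binary (set on each entry under plugins)
- plugins: the list of plugin configuration objects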
Plugin configuration objects:
Plugin configuration objects may contain additional fields than the ones defined here. The runtime MUST pass through these fields, unchanged, to the plugin, as defined in section 3.
Required keys:
- `type` (string): Matches the name of the CNI plugin binary on disk. Must not contain characters disallowed in file paths for the system (e.g. / or \).
Optional keys, used by the protocol:
- `capabilities` (dictionary): Defined in section 3
Reserved keys, used by the protocol: These keys are generated by the runtime at execution time, and thus should not be used in configuration.
- `runtimeConfig`
- `args`
- Any keys starting with `cni.dev/`
Optional keys, well-known: These keys are not used by the protocol, but have a standard meaning to plugins. Plugins that consume any of these configuration keys should respect their intended semantics.
- `ipMasq` (boolean): If supported by the plugin, sets up an IP masquerade on the host for this network. This is necessary if the host will act as a gateway to subnets that are not able to route to the IP assigned to the container.
- `ipam` (dictionary): Dictionary with IPAM (IP Address Management) specific values:
  - `type` (string): Refers to the filename of the IPAM plugin executable. Must not contain characters disallowed in file paths for the system (e.g. / or \).
- `dns` (dictionary, optional): Dictionary with DNS specific values:
  - `nameservers` (list of strings, optional): list of a priority-ordered list of DNS nameservers that this network is aware of. Each entry in the list is a string containing either an IPv4 or an IPv6 address.
  - `domain` (string, optional): the local domain used for short hostname lookups.
  - `search` (list of strings, optional): list of priority ordered search domains for short hostname lookups. Will be preferred over `domain` by most resolvers.
  - `options` (list of strings, optional): list of options that can be passed to the resolver
- Example configuration file
{
  "cniVersion": "1.1.0",
  "name": "dbnet",
  "plugins": [
    {
      "type": "bridge",
      // plugin specific parameters
      "bridge": "cni0",
      "keyA": ["some more", "plugin specific", "configuration"],
      "ipam": {
        "type": "host-local",
        // ipam specific
        "subnet": "10.1.0.0/16",
        "gateway": "10.1.0.1",
        "routes": [
          {"dst": "0.0.0.0/0"}
        ]
      },
      "dns": {
        "nameservers": [ "10.1.0.1" ]
      }
    },
    {
      "type": "tuning",
      "capabilities": {
        "mac": true
      },
      "sysctl": {
        "net.core.somaxconn": "500"
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}
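As a rough illustration of how a runtime might read such a file, the sketch below parses a trimmed-down copy of the example and lists the plugin chain in order. The `//` comments in the example above are explanatory only and are not valid JSON, so they are left out here. `plugin_types` is a hypothetical helper; the only rule it enforces is the one quoted earlier, that `type` must not contain path separators.

```python
import json

# Trimmed copy of the example configuration, without the // comments.
CONFLIST = """
{
  "cniVersion": "1.1.0",
  "name": "dbnet",
  "plugins": [
    {"type": "bridge", "bridge": "cni0",
     "ipam": {"type": "host-local", "subnet": "10.1.0.0/16"}},
    {"type": "tuning", "capabilities": {"mac": true}},
    {"type": "portmap", "capabilities": {"portMappings": true}}
  ]
}
"""


def plugin_types(conflist_text):
    """Return the plugin binary names in the order they will be executed."""
    conf = json.loads(conflist_text)
    types = []
    for plugin in conf["plugins"]:
        name = plugin["type"]
        # "type" names a binary on disk, so path separators are rejected.
        if "/" in name or "\\" in name:
            raise ValueError(f"invalid plugin type {name!r}")
        types.append(name)
    return types


print(plugin_types(CONFLIST))  # ['bridge', 'tuning', 'portmap']
```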
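Execution protocol
The CNI protocol is based on the execution of binaries invoked by the container runtime. CNI defines the protocol between the plugin binary and the runtime (put simply, it specifies how the parameters for setting up a container's network interface are handed to the network plugin).
Protocol parameters are passed to the plugin mainly through OS environment variables, as follows: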
- `CNI_COMMAND`: indicates the desired operation; `ADD`, `DEL`, `CHECK`, `GC`, or `VERSION`.
- `CNI_CONTAINERID`: Container ID. A unique plaintext identifier for a container, allocated by the runtime. Must not be empty. Must start with an alphanumeric character, optionally followed by any combination of one or more alphanumeric characters, underscore (_), dot (.) or hyphen (-).
- `CNI_NETNS`: A reference to the container’s “isolation domain”. If using network namespaces, then a path to the network namespace (e.g. `/run/netns/[nsname]`)
- `CNI_IFNAME`: Name of the interface to create inside the container; if the plugin is unable to use this interface name it must return an error.
- `CNI_ARGS`: Extra arguments passed in by the user at invocation time. Alphanumeric key-value pairs separated by semicolons; for example, “FOO=BAR;ABC=123”
- `CNI_PATH`: List of paths to search for CNI plugin executables. Paths are separated by an OS-specific list separator; for example ‘:’ on Linux and ‘;’ on Windows
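A plugin typically starts by reading and validating these variables. The following is a minimal sketch of that step under the rules listed above; `read_params` and `parse_cni_args` are hypothetical helper names, and the regular expression is my own encoding of the container ID rule.

```python
import os
import re

# Alphanumeric first character, then alphanumerics, underscore, dot or hyphen.
CONTAINER_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.\-]*")


def parse_cni_args(value):
    """Turn 'FOO=BAR;ABC=123' into {'FOO': 'BAR', 'ABC': '123'}."""
    pairs = {}
    for item in filter(None, value.split(";")):
        key, _, val = item.partition("=")
        pairs[key] = val
    return pairs


def read_params(env):
    """Collect the CNI_* parameters from an environment mapping."""
    container_id = env.get("CNI_CONTAINERID", "")
    if not CONTAINER_ID_RE.fullmatch(container_id):
        raise ValueError(f"invalid CNI_CONTAINERID {container_id!r}")
    return {
        "command": env["CNI_COMMAND"],
        "container_id": container_id,
        "netns": env.get("CNI_NETNS", ""),
        "ifname": env.get("CNI_IFNAME", ""),
        "args": parse_cni_args(env.get("CNI_ARGS", "")),
        # os.pathsep is ':' on Linux and ';' on Windows.
        "path": [p for p in env.get("CNI_PATH", "").split(os.pathsep) if p],
    }
```

In a real plugin the mapping passed in would be `os.environ`; taking it as an argument keeps the function easy to exercise with sample values. Note that this sketch validates the container ID unconditionally, which is stricter than needed for `VERSION` or `GC` calls.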
CNI operations
CNI defines 5 operations: `ADD`, `DEL`, `CHECK`, `GC`, and `VERSION`. These are passed to the plugin via the `CNI_COMMAND` environment variable.
`ADD`: Add container to network, or apply modifications
A CNI plugin, upon receiving an `ADD` command, should either
- create the interface defined by `CNI_IFNAME` inside the container at `CNI_NETNS`, or
- adjust the configuration of the interface defined by `CNI_IFNAME` inside the container at `CNI_NETNS`.
If the CNI plugin is successful, it must output a result structure (see below) on standard out. If the plugin was supplied a `prevResult` as part of its input configuration, it MUST handle `prevResult` by either passing it through, or modifying it appropriately.
If an interface of the requested name already exists in the container, the CNI plugin MUST return with an error.
A runtime should not call `ADD` twice (without an intervening DEL) for the same `(CNI_CONTAINERID, CNI_IFNAME)` tuple. This implies that a given container ID may be added to a specific network more than once only if each addition is done with a different interface name.
Input:
The runtime will provide a JSON-serialized plugin configuration object (defined below) on standard in.
Required environment parameters:
- `CNI_COMMAND`
- `CNI_CONTAINERID`
- `CNI_NETNS`
- `CNI_IFNAME`
Optional environment parameters:
- `CNI_ARGS`
- `CNI_PATH`
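Seen from the runtime's side, an `ADD` amounts to: locate the plugin binary, set the environment, write the configuration to standard input, and read the result from standard output. The sketch below shows one way this could look; the function names are hypothetical, and real runtimes normally rely on the libcni Go library rather than code like this.

```python
import json
import os
import subprocess


def find_plugin(plugin_type, search_paths):
    """Return the first executable named plugin_type found in search_paths."""
    for directory in search_paths:
        candidate = os.path.join(directory, plugin_type)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(f"CNI plugin {plugin_type!r} not found")


def build_add_env(container_id, netns, ifname, search_paths, args=""):
    """Environment for an ADD call: required parameters plus optional ones."""
    env = {
        "CNI_COMMAND": "ADD",
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns,
        "CNI_IFNAME": ifname,
        "CNI_PATH": os.pathsep.join(search_paths),
    }
    if args:
        env["CNI_ARGS"] = args
    return env


def invoke_add(plugin_conf, container_id, netns, ifname, search_paths):
    """Run the plugin with the config on stdin and parse the result it prints."""
    binary = find_plugin(plugin_conf["type"], search_paths)
    proc = subprocess.run(
        [binary],
        input=json.dumps(plugin_conf),
        env=build_add_env(container_id, netns, ifname, search_paths),
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # On failure the plugin prints an error structure on standard out.
        raise RuntimeError(proc.stdout or proc.stderr)
    return json.loads(proc.stdout)
```

A call might look like `invoke_add(conf, "abc123", "/run/netns/blue", "eth0", ["/opt/cni/bin"])`, where the directory is only a commonly used example location and must contain the named plugin binary.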
`DEL`: Remove container from network, or un-apply modifications
A CNI plugin, upon receiving a `DEL` command, should either
- delete the interface defined by `CNI_IFNAME` inside the container at `CNI_NETNS`, or
- undo any modifications applied in the plugin’s `ADD` functionality
Plugins should generally complete a `DEL` action without error even if some resources are missing. For example, an IPAM plugin should generally release an IP allocation and return success even if the container network namespace no longer exists, unless that network namespace is critical for IPAM management. While DHCP may usually send a ‘release’ message on the container network interface, since DHCP leases have a lifetime this release action would not be considered critical and no error should be returned if this action fails. For another example, the `bridge` plugin should delegate the DEL action to the IPAM plugin and clean up its own resources even if the container network namespace and/or container network interface no longer exist.
Plugins MUST accept multiple `DEL` calls for the same (`CNI_CONTAINERID`, `CNI_IFNAME`) pair, and return success if the interface in question, or any modifications added, are missing.
Input:
The runtime will provide a JSON-serialized plugin configuration object (defined below) on standard in.
Required environment parameters:
- `CNI_COMMAND`
- `CNI_CONTAINERID`
- `CNI_IFNAME`
Optional environment parameters:
- `CNI_NETNS`
- `CNI_ARGS`
- `CNI_PATH`
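The asymmetry between `ADD` (an existing interface is an error) and `DEL` (missing resources are not an error) can be illustrated with a toy in-memory model. `InMemoryAttachments` is a hypothetical stand-in for whatever state a real plugin keeps; it touches no actual network devices.

```python
class InMemoryAttachments:
    """Toy state store keyed by the (container ID, interface name) pair."""

    def __init__(self):
        self.addresses = {}

    def add(self, container_id, ifname, address):
        key = (container_id, ifname)
        # ADD must fail if the requested interface already exists.
        if key in self.addresses:
            raise RuntimeError(f"{ifname} already exists in {container_id}")
        self.addresses[key] = address

    def delete(self, container_id, ifname):
        # DEL is idempotent: a missing entry is simply ignored.
        self.addresses.pop((container_id, ifname), None)
        return True


state = InMemoryAttachments()
state.add("abc123", "eth0", "10.1.0.5/16")
print(state.delete("abc123", "eth0"))  # True
print(state.delete("abc123", "eth0"))  # True, second DEL still succeeds
```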
`CHECK`: Check container’s networking is as expected
`CHECK` is a way for a runtime to probe the status of an existing container.
Plugin considerations:
- The plugin must consult the `prevResult` to determine the expected interfaces and addresses.
- The plugin must allow for a later chained plugin to have modified networking resources, e.g. routes, on `ADD`.
- The plugin should return an error if a resource included in the CNI Result type (interface, address or route) was created by the plugin, and is listed in `prevResult`, but is missing or in an invalid state.
- The plugin should return an error if other resources not tracked in the Result type such as the following are missing or are in an invalid state:
  - Firewall rules
  - Traffic shaping controls
  - IP reservations
  - External dependencies such as a daemon required for connectivity
  - etc.
- The plugin should return an error if it is aware of a condition where the container is generally unreachable.
- The plugin must handle `CHECK` being called immediately after an `ADD`, and therefore should allow a reasonable convergence delay for any asynchronous resources.
- The plugin should call `CHECK` on any delegated (e.g. IPAM) plugins and pass any errors on to its caller.
Runtime considerations:
- A runtime must not call `CHECK` for a container that has not been `ADD`ed, or has been `DEL`eted after its last `ADD`.
- A runtime must not call `CHECK` if `disableCheck` is set to `true` in the configuration.
- A runtime must include a `prevResult` field in the network configuration containing the `Result` of the immediately preceding `ADD` for the container. The runtime may wish to use libcni’s support for caching `Result`s.
- A runtime may choose to stop executing `CHECK` for a chain when a plugin returns an error.
- A runtime may execute `CHECK` from immediately after a successful `ADD`, up until the container is `DEL`eted from the network.
- A runtime may assume that a failed `CHECK` means the container is permanently in a misconfigured state.
Input:
The runtime will provide a json-serialized plugin configuration object (defined below) on standard in.
Required environment parameters:
- `CNI_COMMAND`
- `CNI_CONTAINERID`
- `CNI_NETNS`
- `CNI_IFNAME`
Optional environment parameters:
- `CNI_ARGS`
- `CNI_PATH`
All parameters, with the exception of `CNI_PATH`, must be the same as the corresponding `ADD` for this container.
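The first plugin consideration, consulting `prevResult` for the expected interfaces, might be sketched as follows. `check_interfaces` is a hypothetical helper, and `observed_names` stands in for whatever the plugin actually finds inside the container's namespace; a real plugin would also compare addresses and routes.

```python
def check_interfaces(conf, observed_names):
    """Return error messages for container interfaces that prevResult expects but are missing."""
    prev = conf.get("prevResult")
    if prev is None:
        return ["prevResult is required for CHECK"]
    problems = []
    for iface in prev.get("interfaces", []):
        # Only entries with a "sandbox" value live inside the container.
        if iface.get("sandbox") and iface["name"] not in observed_names:
            problems.append(f"interface {iface['name']} is missing")
    return problems


conf = {
    "cniVersion": "1.1.0",
    "prevResult": {
        "interfaces": [
            {"name": "cni0", "mac": "00:11:22:33:44:55"},
            {"name": "eth0", "mac": "99:88:77:66:55:44",
             "sandbox": "/var/run/netns/blue"},
        ]
    },
}
print(check_interfaces(conf, {"lo", "eth0"}))  # []
print(check_interfaces(conf, {"lo"}))          # ['interface eth0 is missing']
```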
`VERSION`: probe plugin version support
The plugin should output via standard-out a json-serialized version result object (see below).
Input:
A json-serialized object, with the following key:
- `cniVersion`: The version of the protocol in use.
Required environment parameters:
- `CNI_COMMAND`
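The version result object is not shown in this article. As I understand the specification, it echoes the requested `cniVersion` and adds a `supportedVersions` list; the sketch below assumes that shape, and the list of versions is purely illustrative.

```python
import json

# Illustrative list; a real plugin advertises the versions it implements.
SUPPORTED_VERSIONS = ["0.4.0", "1.0.0", "1.1.0"]


def version_result(stdin_text):
    """Build the JSON a plugin would print for CNI_COMMAND=VERSION."""
    request = json.loads(stdin_text)
    return json.dumps({
        "cniVersion": request["cniVersion"],
        "supportedVersions": SUPPORTED_VERSIONS,
    })


print(version_result('{"cniVersion": "1.1.0"}'))
```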
`GC`: Clean up any stale resources
The GC command provides a way for runtimes to specify the expected set of attachments to a network. The network plugin may then remove any resources related to attachments that do not exist in this set.
Resources may, for example, include:
- IPAM reservations
- Firewall rules
A plugin SHOULD remove as many stale resources as possible. For example, a plugin should remove any IPAM reservations associated with attachments not in the provided list. The plugin MAY assume that the isolation domain (e.g. network namespace) has been deleted, and thus any resources (e.g. network interfaces) therein have been removed.
Plugins should generally complete a `GC` action without error. If an error is encountered, a plugin should continue; removing as many resources as possible, and report the errors back to the runtime.
Plugins MUST, additionally, forward any GC calls to delegated plugins they are configured to use (see section 4).
The runtime MUST NOT use GC as a substitute for DEL. Plugins may be unable to clean up some resources from GC that they would have been able to clean up from DEL.
Input:
The runtime must provide a JSON-serialized plugin configuration object (defined below) on standard in. It contains an additional key;
- `cni.dev/attachments` (array of objects): The list of still valid attachments to this network:
  - `containerID` (string): the value of CNI_CONTAINERID as provided during the CNI ADD operation
  - `ifname` (string): the value of CNI_IFNAME as provided during the CNI ADD operation
Required environment parameters:
- `CNI_COMMAND`
- `CNI_PATH`
Output: No output on success, “error” result structure on error.
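The core of a `GC` implementation is a set difference between what the plugin has stored and what the runtime says is still valid. A minimal sketch, assuming the plugin tracks its reservations keyed by (container ID, interface name); `stale_reservations` is a hypothetical helper name:

```python
def stale_reservations(conf, reservations):
    """Return the reservation keys that are absent from cni.dev/attachments."""
    valid = {
        (item["containerID"], item["ifname"])
        for item in conf.get("cni.dev/attachments", [])
    }
    return [key for key in reservations if key not in valid]


conf = {
    "cniVersion": "1.1.0",
    "name": "dbnet",
    "type": "host-local",
    "cni.dev/attachments": [
        {"containerID": "abc123", "ifname": "eth0"},
    ],
}
reservations = {
    ("abc123", "eth0"): "10.1.0.5",
    ("dead99", "eth0"): "10.1.0.6",  # container no longer exists
}
print(stale_reservations(conf, reservations))  # [('dead99', 'eth0')]
```

A real plugin would then release each stale entry, keep going when one release fails, and report the collected errors at the end.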
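Executing a network configuration
This part covers how a network configuration is transformed and used to interact with plugins (to set up or tear down a container's network).
The operation to perform is passed in the CNI_COMMAND environment variable, whose value is one of ADD, DEL, CHECK or GC.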
The operation of a network configuration on a container is called an attachment. An attachment may be uniquely identified by the `(CNI_CONTAINERID, CNI_IFNAME)` tuple.
Attachment Parameters
While a network configuration should not change between attachments, there are certain parameters supplied by the container runtime that are per-attachment. They are:
- Container ID: A unique plaintext identifier for a container, allocated by the runtime. Must not be empty. Must start with an alphanumeric character, optionally followed by any combination of one or more alphanumeric characters, underscore (_), dot (.) or hyphen (-). During execution, always set as the `CNI_CONTAINERID` parameter.
- Namespace: A reference to the container’s “isolation domain”. If using network namespaces, then a path to the network namespace (e.g. `/run/netns/[nsname]`). During execution, always set as the `CNI_NETNS` parameter.
- Container interface name: Name of the interface to create inside the container. During execution, always set as the `CNI_IFNAME` parameter.
- Generic Arguments: Extra arguments, in the form of key-value string pairs, that are relevant to a specific attachment. During execution, always set as the `CNI_ARGS` parameter.
- Capability Arguments: These are also key-value pairs. The key is a string, whereas the value is any JSON-serializable type. The keys and values are defined by convention.
Furthermore, the runtime must be provided a list of paths to search for CNI plugins. This must also be provided to plugins during execution via the `CNI_PATH` environment variable.
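To show how these per-attachment parameters meet the static configuration, the sketch below derives the JSON that a single plugin in the chain receives. It follows my understanding of the specification: `cniVersion` and `name` are copied from the top level, `capabilities` is replaced by a `runtimeConfig` holding only the capability arguments the plugin asked for, and the previous plugin's result is attached as `prevResult`. `derive_request` is a hypothetical name.

```python
def derive_request(conflist, plugin, capability_args, prev_result=None):
    """Build the configuration passed on stdin to one plugin of the chain."""
    # Plugin-specific fields pass through unchanged, except "capabilities".
    request = {k: v for k, v in plugin.items() if k != "capabilities"}
    request["cniVersion"] = conflist["cniVersion"]
    request["name"] = conflist["name"]
    wanted = plugin.get("capabilities", {})
    runtime_config = {k: v for k, v in capability_args.items() if wanted.get(k)}
    if runtime_config:
        request["runtimeConfig"] = runtime_config
    if prev_result is not None:
        request["prevResult"] = prev_result
    return request


conflist = {"cniVersion": "1.1.0", "name": "dbnet"}
tuning = {
    "type": "tuning",
    "capabilities": {"mac": True},
    "sysctl": {"net.core.somaxconn": "500"},
}
request = derive_request(
    conflist,
    tuning,
    capability_args={"mac": "00:11:22:33:44:66", "portMappings": []},
    prev_result={"ips": [{"address": "10.1.0.5/16"}]},
)
print(sorted(request))
# ['cniVersion', 'name', 'prevResult', 'runtimeConfig', 'sysctl', 'type']
```

The `portMappings` argument is dropped for this plugin because `tuning` only declared the `mac` capability.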
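Plugin delegation
When CNI is used as the container network configuration interface, some operations cannot reasonably be implemented as a separate chained plugin. Instead, a CNI plugin may delegate that functionality to another plugin; IP address management is a common example.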
As part of its operation, a CNI plugin is expected to assign (and maintain) an IP address to the interface and install any necessary routes relevant for that interface. This gives the CNI plugin great flexibility but also places a large burden on it. Many CNI plugins would need to have the same code to support several IP management schemes that users may desire (e.g. dhcp, host-local). A CNI plugin may choose to delegate IP management to another plugin.
To lessen the burden and make IP management strategy be orthogonal to the type of CNI plugin, we define a third type of plugin — IP Address Management Plugin (IPAM plugin), as well as a protocol for plugins to delegate functionality to other plugins.
It is however the responsibility of the CNI plugin, rather than the runtime, to invoke the IPAM plugin at the proper moment in its execution. The IPAM plugin must determine the interface IP/subnet, Gateway and Routes and return this information to the “main” plugin to apply. The IPAM plugin may obtain the information via a protocol (e.g. dhcp), data stored on a local filesystem, the “ipam” section of the Network Configuration file, etc.
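The delegation flow described above can be sketched as follows. The "main" plugin calls the IPAM plugin named in its `ipam` section, passing along the same environment and configuration it received, and then applies what comes back. `add_with_ipam` is a hypothetical name and `invoke` is an injected callable standing in for the actual binary execution, so this is a model of the control flow rather than a working plugin.

```python
def add_with_ipam(conf, env, invoke):
    """ADD in a main plugin that delegates address management to an IPAM plugin."""
    ipam_type = conf["ipam"]["type"]
    # The delegate receives the same environment and configuration as the caller.
    ipam_result = invoke(ipam_type, env, conf)
    interfaces = [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}]
    return {
        "cniVersion": conf["cniVersion"],
        "interfaces": interfaces,
        # "interface" is the index of the owning entry in the interfaces list.
        "ips": [dict(ip, interface=0) for ip in ipam_result.get("ips", [])],
        "routes": ipam_result.get("routes", []),
    }


def fake_ipam(plugin_type, env, conf):
    """Stand-in for executing the IPAM binary found via CNI_PATH."""
    return {
        "ips": [{"address": "10.1.0.5/16", "gateway": "10.1.0.1"}],
        "routes": [{"dst": "0.0.0.0/0"}],
    }


conf = {"cniVersion": "1.1.0", "name": "dbnet", "type": "bridge",
        "ipam": {"type": "host-local", "subnet": "10.1.0.0/16"}}
env = {"CNI_COMMAND": "ADD", "CNI_IFNAME": "eth0", "CNI_NETNS": "/run/netns/blue"}
print(add_with_ipam(conf, env, fake_ipam)["ips"])
# [{'address': '10.1.0.5/16', 'gateway': '10.1.0.1', 'interface': 0}]
```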
Result data types
A plugin returns the final result of its execution as JSON. The two most common results are:
- ADD success
- VERSION success
{
  "cniVersion": "1.1.0",
  "name": "dbnet",
  "type": "tuning",
  "sysctl": {
    "net.core.somaxconn": "500"
  },
  "runtimeConfig": {
    "mac": "00:11:22:33:44:66"
  },
  "prevResult": {
    "ips": [
      {
        "address": "10.1.0.5/16",
        "gateway": "10.1.0.1",
        "interface": 2
      }
    ],
    "routes": [
      {
        "dst": "0.0.0.0/0"
      }
    ],
    "interfaces": [
      {
        "name": "cni0",
        "mac": "00:11:22:33:44:55"
      },
      {
        "name": "veth3243",
        "mac": "55:44:33:22:11:11"
      },
      {
        "name": "eth0",
        "mac": "99:88:77:66:55:44",
        "sandbox": "/var/run/netns/blue"
      }
    ],
    "dns": {
      "nameservers": [ "10.1.0.1" ]
    }
  }
}