Cilium Native Routing with eBPF 模式
Cilium Native Routing with eBPF 模式
一、环境信息
主机 | IP |
---|---|
ubuntu | 172.16.94.141 |
软件 | 版本 |
---|---|
docker | 26.1.4 |
helm | v3.15.0-rc.2 |
kind | 0.18.0 |
kubernetes | 1.23.4 |
ubuntu os | Ubuntu 20.04.6 LTS |
kernel | 5.11.5 内核升级文档 |
二、安装服务
kind
配置文件信息
root@kind:~# cat install.sh
#!/bin/bash
date
set -v
# 1.prep noCNI env
cat <<EOF | kind create cluster --name=cilium-kubeproxy-replacement --image=kindest/node:v1.23.4 --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
# kind 默认使用 rancher cni,cni 我们需要自己创建
disableDefaultCNI: true
# kind 安装 k8s 集群需要禁用 kube-proxy 安装,是 cilium 代替 kube-proxy 功能
kubeProxyMode: "none"
nodes:
- role: control-plane
- role: worker
- role: worker
containerdConfigPatches:
- |-
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor.evescn.com"]
endpoint = ["https://harbor.evescn.com"]
EOF
# 2.remove taints
controller_node_ip=`kubectl get node -o wide --no-headers | grep -E "control-plane|bpf1" | awk -F " " '{print $6}'`
kubectl taint nodes $(kubectl get nodes -o name | grep control-plane) node-role.kubernetes.io/master:NoSchedule-
kubectl get nodes -o wide
# 3.install cni
helm repo add cilium https://helm.cilium.io > /dev/null 2>&1
helm repo update > /dev/null 2>&1
# Direct Routing Options(--set kubeProxyReplacement=strict --set tunnel=disabled --set autoDirectNodeRoutes=true --set ipv4NativeRoutingCIDR="10.0.0.0/8")
# Host Routing[EBPF](--set bpf.masquerade=true)
helm install cilium cilium/cilium \
--set k8sServiceHost=$controller_node_ip \
--set k8sServicePort=6443 \
--version 1.13.0-rc5 \
--namespace kube-system \
--set debug.enabled=true \
--set debug.verbose=datapath \
--set monitorAggregation=none \
--set ipam.mode=cluster-pool \
--set cluster.name=cilium-kubeproxy-replacement-ebpf \
--set kubeProxyReplacement=strict \
--set tunnel=disabled \
--set autoDirectNodeRoutes=true \
--set ipv4NativeRoutingCIDR="10.0.0.0/8" \
--set bpf.masquerade=true
# 4.install necessary tools
for i in $(docker ps -a --format "table {{.Names}}" | grep cilium)
do
echo $i
docker cp /usr/bin/ping $i:/usr/bin/ping
docker exec -it $i bash -c "sed -i -e 's/jp.archive.ubuntu.com\|archive.ubuntu.com\|security.ubuntu.com/old-releases.ubuntu.com/g' /etc/apt/sources.list"
docker exec -it $i bash -c "apt-get -y update >/dev/null && apt-get -y install net-tools tcpdump lrzsz bridge-utils >/dev/null 2>&1"
done
--set
参数解释
-
--set kubeProxyReplacement=strict
- 含义: 启用 kube-proxy 替代功能,并以严格模式运行。
- 用途: Cilium 将完全替代 kube-proxy 实现服务负载均衡,提供更高效的流量转发和网络策略管理。
-
--set tunnel=disabled
- 含义: 禁用隧道模式。
- 用途: 禁用后,Cilium 将不使用 vxlan 技术,直接在主机之间路由数据包,即 direct-routing 模式。
-
--set autoDirectNodeRoutes=true
- 含义: 启用自动直接节点路由。
- 用途: 使 Cilium 自动设置直接节点路由,优化网络流量。
-
--set ipv4NativeRoutingCIDR="10.0.0.0/8"
- 含义: 指定用于 IPv4 本地路由的 CIDR 范围,这里是
10.0.0.0/8
。 - 用途: 配置 Cilium 使其知道哪些 IP 地址范围应该通过本地路由进行处理,不做 snat , Cilium 默认会对所用地址做 snat。
- 含义: 指定用于 IPv4 本地路由的 CIDR 范围,这里是
-
--set bpf.masquerade
- 含义: 启用 eBPF 功能。
- 用途: 使用 eBPF 实现数据路由,提供更高效和灵活的网络地址转换功能。
- 安装
k8s
集群和cilium
服务
root@kind:~# ./install.sh
Creating cluster "cilium-kubeproxy-replacement-ebpf" ...
✓ Ensuring node image (kindest/node:v1.23.4) 🖼
✓ Preparing nodes 📦 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing StorageClass 💾
✓ Joining worker nodes 🚜
Set kubectl context to "kind-cilium-kubeproxy-replacement-ebpf"
You can now use your cluster with:
kubectl cluster-info --context kind-cilium-kubeproxy-replacement-ebpf
Have a nice day! 👋
- 查看安装的服务
root@kind:~# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-496vl 1/1 Running 0 8m54s
kube-system cilium-6bt7g 1/1 Running 0 8m54s
kube-system cilium-mfc5w 1/1 Running 0 8m54s
kube-system cilium-operator-dd757785c-82wnx 1/1 Running 0 8m54s
kube-system cilium-operator-dd757785c-v8j7r 1/1 Running 0 8m54s
kube-system coredns-64897985d-66c4c 1/1 Running 0 10m
kube-system coredns-64897985d-st6l6 1/1 Running 0 10m
kube-system etcd-cilium-kubeproxy-replacement-ebpf-control-plane 1/1 Running 0 10m
kube-system kube-apiserver-cilium-kubeproxy-replacement-ebpf-control-plane 1/1 Running 0 10m
kube-system kube-controller-manager-cilium-kubeproxy-replacement-ebpf-control-plane 1/1 Running 0 10m
kube-system kube-scheduler-cilium-kubeproxy-replacement-ebpf-control-plane 1/1 Running 0 10m
local-path-storage local-path-provisioner-5ddd94ff66-s6cjg 1/1 Running 0 10m
查看
Pod
服务信息,发现没有kube-proxy
服务,因为我们设置了kubeProxyReplacement=strict
,那么cilium
将完全替代kube-proxy
实现服务负载均衡。并且在kind
安装k8s
集群的时候也需要设置禁用kube-proxy
安装kubeProxyMode: "none"
cilium
配置信息
# kubectl -n kube-system exec -it ds/cilium -- cilium status
KVStore: Ok Disabled
Kubernetes: Ok 1.23 (v1.23.4) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 172.18.0.3 (Direct Routing)]
Host firewall: Disabled
CNI Chaining: none
CNI Config file: CNI configuration file management disabled
Cilium: Ok 1.13.0-rc5 (v1.13.0-rc5-dc22a46f)
NodeMonitor: Listening for events on 128 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 6/254 allocated from 10.0.0.0/24,
IPv6 BIG TCP: Disabled
BandwidthManager: Disabled
Host Routing: BPF
Masquerading: BPF [eth0] 10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
Controller Status: 35/35 healthy
Proxy Status: OK, ip 10.0.0.231, 0 redirects active on ports 10000-20000
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 4095/4095 (100.00%), Flows/s: 6.85 Metrics: Disabled
Encryption: Disabled
Cluster health: 3/3 reachable (2024-06-27T09:06:44Z)
KubeProxyReplacement: Strict [eth0 172.18.0.3 (Direct Routing)]
- Cilium 完全接管所有 kube-proxy 功能,包括服务负载均衡、NodePort 和其他网络策略管理。这种配置适用于你希望最大限度利用 Cilium 的高级网络功能,并完全替代 kube-proxy 的场景。此模式提供更高效的流量转发和更强大的网络策略管理。
Host Routing: BPF
- 使用 BPF 进行主机路由。
Masquerading: BPF [eth0] 10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
- 使用 BPF 进行 IP 伪装(NAT),接口 eth0,IP 范围 10.0.0.0/8 不回进行 NAT。IPv4 伪装启用,IPv6 伪装禁用。
k8s
集群安装 Pod
测试网络
# cat cni.yaml
apiVersion: apps/v1
kind: DaemonSet
#kind: Deployment
metadata:
labels:
app: evescn
name: evescn
spec:
#replicas: 1
selector:
matchLabels:
app: evescn
template:
metadata:
labels:
app: evescn
spec:
containers:
- image: harbor.dayuan1997.com/devops/nettool:0.9
name: nettoolbox
securityContext:
privileged: true
---
apiVersion: v1
kind: Service
metadata:
name: serversvc
spec:
type: NodePort
selector:
app: evescn
ports:
- name: cni
port: 8080
targetPort: 80
nodePort: 32000
root@kind:~# kubectl apply -f cni.yaml
daemonset.apps/cilium-with-replacement created
service/serversvc created
root@kind:~# kubectl run net --image=harbor.dayuan1997.com/devops/nettool:0.9
pod/net created
- 查看安装服务信息
root@kind:~# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
evescn-jzms5 1/1 Running 0 7m51s 10.0.2.202 cilium-kubeproxy-replacement-ebpf-worker2 <none> <none>
evescn-q7xhn 1/1 Running 0 7m51s 10.0.0.72 cilium-kubeproxy-replacement-ebpf-worker <none> <none>
evescn-zx9kk 1/1 Running 0 3m23s 10.0.1.63 cilium-kubeproxy-replacement-ebpf-control-plane <none> <none>
net 1/1 Running 0 4s 10.0.2.40 cilium-kubeproxy-replacement-ebpf-worker2 <none> <none>
root@kind:~# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 9m
serversvc NodePort 10.96.93.201 <none> 8080:32000/TCP 76s
三、测试网络
同节点 Pod
网络通讯
Pod
节点信息
## ip 信息
root@kind:~# kubectl exec -it net -- ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
8: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ae:32:09:5a:08:5c brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.2.40/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::ac32:9ff:fe5a:85c/64 scope link
valid_lft forever preferred_lft forever
## 路由信息
root@kind:~# kubectl exec -it net -- ip r s
default via 10.0.2.34 dev eth0 mtu 1500
10.0.2.34 dev eth0 scope link
查看 Pod
信息发现在 cilium
中主机的 IP
地址为 32
位掩码,意味着该 IP
地址是单个主机的唯一标识,而不是一个子网。这个主机访问其他 IP
均会走路由到达
Pod
节点所在Node
节点信息
root@kind:~# docker exec -it cilium-kubeproxy-replacement-ebpf-worker2 bash
## ip 信息
root@cilium-kubeproxy-replacement-ebpf-worker2:/# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 06:d8:6d:5e:2f:fe brd ff:ff:ff:ff:ff:ff
inet6 fe80::4d8:6dff:fe5e:2ffe/64 scope link
valid_lft forever preferred_lft forever
3: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 1a:5e:33:bb:28:40 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.34/32 scope link cilium_host
valid_lft forever preferred_lft forever
inet6 fe80::185e:33ff:febb:2840/64 scope link
valid_lft forever preferred_lft forever
5: lxc_health@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 4e:0e:82:d4:e2:eb brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::4c0e:82ff:fed4:e2eb/64 scope link
valid_lft forever preferred_lft forever
7: lxccb7c1b078d26@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a6:1d:2b:bd:af:60 brd ff:ff:ff:ff:ff:ff link-netns cni-0f4755bb-f628-7d32-54d6-2fdad2ee5fc3
inet6 fe80::a41d:2bff:febd:af60/64 scope link
valid_lft forever preferred_lft forever
9: lxcbf5cb1e7ceb0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether be:06:1e:82:86:b6 brd ff:ff:ff:ff:ff:ff link-netns cni-13e7f3e3-70a3-9965-de9a-bfdaa0569170
inet6 fe80::bc06:1eff:fe82:86b6/64 scope link
valid_lft forever preferred_lft forever
23: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fc00:f853:ccd:e793::2/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:2/64 scope link
valid_lft forever preferred_lft forever
## 路由信息
root@cilium-kubeproxy-replacement-ebpf-worker2:/# ip r s
default via 172.18.0.1 dev eth0
10.0.0.0/24 via 172.18.0.3 dev eth0
10.0.1.0/24 via 172.18.0.4 dev eth0
10.0.2.0/24 via 10.0.2.34 dev cilium_host src 10.0.2.34
10.0.2.34 dev cilium_host scope link
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.2
Pod
节点进行ping
包测试,查看宿主机路由信息,发现并在数据包会在通过10.0.2.0/24 via 10.0.2.34 dev cilium_host src 10.0.2.34
路由信息转发
root@kind:~# kubectl exec -it net -- ping 10.0.2.202 -c 1
PING 10.0.2.202 (10.0.2.202): 56 data bytes
64 bytes from 10.0.2.202: seq=0 ttl=63 time=0.707 ms
--- 10.0.2.202 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.707/0.707/0.707 ms
Pod
节点eth0
网卡抓包
net~$ tcpdump -pne -i eth0
09:19:00.081174 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.2.202: ICMP echo request, id 55, seq 0, length 64
09:19:00.081472 be:06:1e:82:86:b6 > ae:32:09:5a:08:5c, ethertype IPv4 (0x0800), length 98: 10.0.2.202 > 10.0.2.40: ICMP echo reply, id 55, seq 0, length 64
Node
节点Pod
的veth pair
网卡lxcbf5cb1e7ceb0
抓包
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tcpdump -pne -i lxcbf5cb1e7ceb0
09:19:00.081177 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.2.202: ICMP echo request, id 55, seq 0, length 64
通过对比抓包信息,我们发现 Pod
节点 eth0
网卡有 icmp
的 request
和 reply
数据包,但是在 lxcbf5cb1e7ceb0
网卡只有 request 数据包
这是因为在内核 5.11.0
以后,具备了 bpf_redict_peer()
和 bpf_redict_neigh()
这两个非常重要的 helper
函数的能力。 cilium
在高版本中开始支持的功能
bpf_redict_peer()
bpf_redirect_peer()
在穿越网络命名空间时,可以从网卡入口到Pod
入口切换网络命名空间,而无需重新安排软件中断点。这样,物理网卡就能一次性将数据包从堆栈推送到驻留在不同Pod
命名空间的应用程序套接字中。这样还能更快地唤醒应用程序,以接收接收到的数据。同样,本地Pod
到Pod
通信的重新安排点从2
个减少到1
个,从而改善了延迟。
如下图所示, bpf_redict_peer()
函数可以将同一个节点的pod 通信的步骤进行省略,报文在从源 pod
出来以后,可以直接 redict (直连)
到目的 pod
的 内部网卡
,而不会经过宿主机上与其对应的 veth pair
网卡,这样在报文的路径就少了一步转发。
同理,我们可以推测出对于目的 Pod 而言,他的 veth pair
网卡也只能抓到 request
数据包,抓不到 reply
数据包,2 个 Pod
直接的通讯流程如下图:
- 查看
10.0.2.202
Pod
主机信息
root@kind:~# kubectl exec -it evescn-jzms5 -- ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 7e:55:97:9f:d9:bf brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.2.202/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::7c55:97ff:fe9f:d9bf/64 scope link
valid_lft forever preferred_lft forever
查看 eth0
信息,可以确定在 Pod
的 eth0
网卡在宿主机的 veth pair
网卡位 id = 7
的网卡,查看前面 Node
节点信息发现为: 7: lxccb7c1b078d26
Node
节点lxccb7c1b078d26
网卡抓包
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tcpdump -pne -i lxccb7c1b078d26
09:41:17.997320 7e:55:97:9f:d9:bf > a6:1d:2b:bd:af:60, ethertype IPv4 (0x0800), length 98: 10.0.2.202 > 10.0.2.40: ICMP echo reply, id 61, seq 0, length 64
- 查看
10.0.2.202
Pod
主机eht0
抓包
evescn-jzms5~$ tcpdump -pne -i eth0
09:41:17.997003 a6:1d:2b:bd:af:60 > 7e:55:97:9f:d9:bf, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.2.202: ICMP echo request, id 61, seq 0, length 64
09:41:17.997319 7e:55:97:9f:d9:bf > a6:1d:2b:bd:af:60, ethertype IPv4 (0x0800), length 98: 10.0.2.202 > 10.0.2.40: ICMP echo reply, id 61, seq 0, length 64
启用了
ebpf
后,并且kernel
内核 大于5.10
后,pod
数据包流向和传统理解模式有区别,和eth0
互为veth pair
的网卡,不一定能接受到所有的数据包信息,在生产问题排查中需要注意
不同节点 Pod
网络通讯
-
Pod
节点信息,查看前面: 同节点Pod
网络通讯 -
Pod
节点所在Node
节点信息,查看前面: 同节点Pod
网络通讯 -
Pod
节点进行ping
包测试,查看宿主机路由信息,发现并在数据包会在通过10.0.0.0/24 via 172.18.0.3 dev eth0
路由信息转发
root@kind:~# kubectl exec -it net -- ping 10.0.0.72 -c 1
PING 10.0.0.72 (10.0.0.72): 56 data bytes
64 bytes from 10.0.0.72: seq=0 ttl=62 time=1.916 ms
--- 10.0.0.72 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.916/1.916/1.916 ms
Pod
节点eth0
网卡抓包
net~$ tcpdump -pne -i eth0
09:58:03.642047 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.0.72: ICMP echo request, id 84, seq 0, length 64
09:58:03.642594 be:06:1e:82:86:b6 > ae:32:09:5a:08:5c, ethertype IPv4 (0x0800), length 98: 10.0.0.72 > 10.0.2.40: ICMP echo reply, id 84, seq 0, length 64
Node
节点cilium-kubeproxy-replacement-ebpf-worker2
Pod
的veth pair
网卡lxcbf5cb1e7ceb0
抓包
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tcpdump -pne -i lxcbf5cb1e7ceb0
09:58:03.642050 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.0.72: ICMP echo request, id 84, seq 0, length 64
通过抓包我们发现,在跨节点的通讯中 veth pair
网卡 lxcbf5cb1e7ceb0
也只有 request 数据包,这是因为 bpf_redict_peer()
还在发挥作用,从宿主机节点来的数据包,只要是到达 Pod 节点的,依旧会跳过 lxc
网卡。
Node
节点cilium-kubeproxy-replacement-ebpf-worker2
eth0
网卡抓包
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tcpdump -pne -i lxc89eb07782005
10:07:48.268822 02:42:ac:12:00:02 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.0.72: ICMP echo request, id 92, seq 0, length 64
10:07:48.269136 02:42:ac:12:00:03 > 02:42:ac:12:00:02, ethertype IPv4 (0x0800), length 98: 10.0.0.72 > 10.0.2.40: ICMP echo reply, id 92, seq 0, length 64
查看 eth0
网卡抓包信息,发现数据包下一跳 mac 02:42:ac:12:00:03
,按照之前的路由信息,分析此地址应该为 172.18.0.3
节点 eth0
mac
地址
查看 cilium-kubeproxy-replacement-ebpf-worker2
节点 mac
信息
root@cilium-kubeproxy-replacement-ebpf-worker2:/#arp -n
Address HWtype HWaddress Flags Mask Iface
172.18.0.3 ether 02:42:ac:12:00:03 C eth0
172.18.0.4 ether 02:42:ac:12:00:04 C eth0
172.18.0.1 ether 02:42:2f:fe:43:35 C eth0
Node
节点cilium-kubeproxy-replacement-ebpf-worker
eth0
网卡抓包
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tcpdump -pne -i eth0 icmp
10:10:17.229336 02:42:ac:12:00:02 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.0.72: ICMP echo request, id 99, seq 0, length 64
10:10:17.229622 02:42:ac:12:00:03 > 02:42:ac:12:00:02, ethertype IPv4 (0x0800), length 98: 10.0.0.72 > 10.0.2.40: ICMP echo reply, id 99, seq 0, length 64
查看 eth0
网卡抓包信息,数据包源 mac 02:42:ac:12:00:02
为 172.18.0.2
eth0
mac 地址,目的 mac 02:42:ac:12:00:03
,为本机 eth0
mac
地址
查看 cilium-kubeproxy-replacement-worker
节点 ip
信息
root@cilium-kubeproxy-replacement-worker:/# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ba:d4:d5:86:e4:c4 brd ff:ff:ff:ff:ff:ff
inet6 fe80::b8d4:d5ff:fe86:e4c4/64 scope link
valid_lft forever preferred_lft forever
3: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a2:5e:1d:03:8c:f6 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.231/32 scope link cilium_host
valid_lft forever preferred_lft forever
inet6 fe80::a05e:1dff:fe03:8cf6/64 scope link
valid_lft forever preferred_lft forever
5: lxc_health@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ee:cc:02:10:9d:55 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::eccc:2ff:fe10:9d55/64 scope link
valid_lft forever preferred_lft forever
7: lxcbf1d9227d372@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 0e:ed:9c:12:3a:08 brd ff:ff:ff:ff:ff:ff link-netns cni-db59179b-ceed-fde3-c1c7-e6d3f83506d5
inet6 fe80::ced:9cff:fe12:3a08/64 scope link
valid_lft forever preferred_lft forever
9: lxc7f220bbf2953@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 1e:be:cd:8f:63:93 brd ff:ff:ff:ff:ff:ff link-netns cni-cbfca9ee-4591-ba92-8c08-de2aa5d0d090
inet6 fe80::1cbe:cdff:fe8f:6393/64 scope link
valid_lft forever preferred_lft forever
11: lxc6df5bdb74383@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether e2:6f:f4:74:3d:fd brd ff:ff:ff:ff:ff:ff link-netns cni-4f819e17-233a-2637-281c-ce0f9c505fa9
inet6 fe80::e06f:f4ff:fe74:3dfd/64 scope link
valid_lft forever preferred_lft forever
13: lxc56a9c4b40736@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether f6:1c:ec:b2:0f:ba brd ff:ff:ff:ff:ff:ff link-netns cni-c56449c3-5791-8823-56f9-44fa8ab5cfd6
inet6 fe80::f41c:ecff:feb2:fba/64 scope link
valid_lft forever preferred_lft forever
25: eth0@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fc00:f853:ccd:e793::3/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:3/64 scope link
valid_lft forever preferred_lft forever
- 目标
Pod
ip
信息
evescn-q7xhn~$ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
12: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ca:8f:99:91:17:f2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.72/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::c88f:99ff:fe91:17f2/64 scope link
valid_lft forever preferred_lft forever
目标 Pod
eth0
网卡的 veth pair
是 13: lxc56a9c4b40736
,分别在 cilium-kubeproxy-replacement-worker
节点 lxc56a9c4b40736
网卡和 Pod
eth0
网卡抓包
Node
节点cilium-kubeproxy-replacement-ebpf-worker
目标Pod
的veth pair
网卡lxc56a9c4b40736
抓包
root@cilium-kubeproxy-replacement-ebpf-worker:/# tcpdump -pne -i lxc56a9c4b40736
10:10:17.229512 ca:8f:99:91:17:f2 > f6:1c:ec:b2:0f:ba, ethertype IPv4 (0x0800), length 98: 10.0.0.72 > 10.0.2.40: ICMP echo reply, id 99, seq 0, length 64
- 目标
Pod
节点eth0
网卡抓包
evescn-q7xhn~$ tcpdump -pne -i eth0
10:10:17.229336 f6:1c:ec:b2:0f:ba > ca:8f:99:91:17:f2, ethertype IPv4 (0x0800), length 98: 10.0.2.40 > 10.0.0.72: ICMP echo request, id 99, seq 0, length 64
10:10:17.229511 ca:8f:99:91:17:f2 > f6:1c:ec:b2:0f:ba, ethertype IPv4 (0x0800), length 98: 10.0.0.72 > 10.0.2.40: ICMP echo reply, id 99, seq 0, length 64
因为 bpf_redict_peer()
的原因,veth pair
网卡 lxc56a9c4b40736
上依旧只有 reply
数据包信息,并且在垮节点通讯中还使用了到了 bpf_redict_neigh()
功能, bpf_redict_neigh()
会跳过宿主机的 iptables
,只会查询路由表信息,而不会进入 iptables
规则判断中。这样可以减少数据包的上下文切换等,提高网络数据转发效率。如下图:
bpf_redirect_neigh()
通过将流量注入Linux
内核的邻接子系统来处理Pod
的出口流量,从而找到下一跳并解析网络数据包的第2
层地址。只在tc eBPF
层进行转发,而不将数据包进一步推向网络堆栈,还能为TCP
堆栈提供适当的反向压力,并为TCP
的TSQ
(TCP
小队列)机制提供反馈,以减少TCP
数据包可能出现的过度排队现象。也就是说,TCP
协议栈会得到数据包已离开节点的反馈,而不是在数据包被推送到主机协议栈进行路由选择时过早地提供不准确的反馈。之所以能做到这一点,是因为数据包的套接字关联在向下传递到NIC
驱动程序时可以保持不变。
数据会从 veth pair
网卡,直接进入 ebpf
中,然后经由 ebpf
查询 linux routing
后,直接使用 eth0
发送出去,从而不在走 linux
系统主机的 iptables
规则
Service
网络通讯
- 查看
Service
信息
root@kind:~# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 9m
serversvc NodePort 10.96.93.201 <none> 8080:32000/TCP 76s
net
服务上请求Pod
所在Node
节点32000
端口
root@kind:~# kubectl exec -ti net -- curl 172.18.0.2:32000
PodName: evescn-jzms5 | PodIP: eth0 10.0.2.202/32
并在 net
服务 eth0
网卡 抓包查看
net~$ tcpdump -pne -i eth0
02:27:19.933659 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 74: 10.0.2.40.32794 > 10.0.2.202.80: Flags [S], seq 1073282361, win 64240, options [mss 1460,sackOK,TS val 3661176389 ecr 0,nop,wscale 7], length 0
02:27:19.933882 be:06:1e:82:86:b6 > ae:32:09:5a:08:5c, ethertype IPv4 (0x0800), length 74: 10.0.2.202.80 > 10.0.2.40.32794: Flags [S.], seq 901038452, ack 1073282362, win 65160, options [mss 1460,sackOK,TS val 3465621389 ecr 3661176389,nop,wscale 7], length 0
02:27:19.934061 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 66: 10.0.2.40.32794 > 10.0.2.202.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3661176390 ecr 3465621389], length 0
02:27:19.934440 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 146: 10.0.2.40.32794 > 10.0.2.202.80: Flags [P.], seq 1:81, ack 1, win 502, options [nop,nop,TS val 3661176390 ecr 3465621389], length 80: HTTP: GET / HTTP/1.1
02:27:19.934623 be:06:1e:82:86:b6 > ae:32:09:5a:08:5c, ethertype IPv4 (0x0800), length 66: 10.0.2.202.80 > 10.0.2.40.32794: Flags [.], ack 81, win 509, options [nop,nop,TS val 3465621389 ecr 3661176390], length 0
02:27:19.936129 be:06:1e:82:86:b6 > ae:32:09:5a:08:5c, ethertype IPv4 (0x0800), length 302: 10.0.2.202.80 > 10.0.2.40.32794: Flags [P.], seq 1:237, ack 81, win 509, options [nop,nop,TS val 3465621391 ecr 3661176390], length 236: HTTP: HTTP/1.1 200 OK
02:27:19.936308 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 66: 10.0.2.40.32794 > 10.0.2.202.80: Flags [.], ack 237, win 501, options [nop,nop,TS val 3661176392 ecr 3465621391], length 0
02:27:19.936496 be:06:1e:82:86:b6 > ae:32:09:5a:08:5c, ethertype IPv4 (0x0800), length 116: 10.0.2.202.80 > 10.0.2.40.32794: Flags [P.], seq 237:287, ack 81, win 509, options [nop,nop,TS val 3465621391 ecr 3661176392], length 50: HTTP
02:27:19.936687 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 66: 10.0.2.40.32794 > 10.0.2.202.80: Flags [.], ack 287, win 501, options [nop,nop,TS val 3661176392 ecr 3465621391], length 0
02:27:19.938630 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 66: 10.0.2.40.32794 > 10.0.2.202.80: Flags [F.], seq 81, ack 287, win 501, options [nop,nop,TS val 3661176394 ecr 3465621391], length 0
02:27:19.938901 be:06:1e:82:86:b6 > ae:32:09:5a:08:5c, ethertype IPv4 (0x0800), length 66: 10.0.2.202.80 > 10.0.2.40.32794: Flags [F.], seq 287, ack 82, win 509, options [nop,nop,TS val 3465621394 ecr 3661176394], length 0
02:27:19.939118 ae:32:09:5a:08:5c > be:06:1e:82:86:b6, ethertype IPv4 (0x0800), length 66: 10.0.2.40.32794 > 10.0.2.202.80: Flags [.], ack 288, win 501, options [nop,nop,TS val 3661176395 ecr 3465621394], length 0
抓包数据显示, net
服务使用随机端口和 10.0.2.202
80
端口进行 tcp
通讯。
* `KubeProxyReplacement: Strict [eth0 172.18.0.3 (Direct Routing)]`
* Cilium 完全接管所有 kube-proxy 功能,包括服务负载均衡、NodePort 和其他网络策略管理。这种配置适用于你希望最大限度利用 Cilium 的高级网络功能,并完全替代 kube-proxy 的场景。此模式提供更高效的流量转发和更强大的网络策略管理。
cilium
配置 KubeProxyReplacement: Strict [eth0 172.18.0.3 (Direct Routing)]
,通过配置信息确定 cilium
接管 kube-proxy
的功能,使用 cilium
实现 service
转发。我们可以先检查下默认 kube-proxy
的 conntrack
信息和 iptables
信息
- 先检查下
conntrack
信息,发现没有链路信息
root@cilium-kubeproxy-replacement-ebpf-worker2:/# conntrack -L | grep 32000
## 没有数据信息
iptables
信息,也没有iptables
规则信息
root@cilium-kubeproxy-replacement-ebpf-worker2:/# iptables-save | grep 32000
## 没有数据信息
那么 cilium
是如何查询 service
信息,并返回后端 Pod
ip 地址给请求方的?其实 cilium
把数据保存在自身内部,使用 cilium
子命令可以查询到 service
信息
cilium
查询service
信息
root@kind:~# kubectl -n kube-system exec cilium-2xvsw -- cilium service list
ID Frontend Service Type Backend
1 10.96.0.1:443 ClusterIP 1 => 172.18.0.4:6443 (active)
2 10.96.0.10:53 ClusterIP 1 => 10.0.0.228:53 (active)
2 => 10.0.0.103:53 (active)
3 10.96.0.10:9153 ClusterIP 1 => 10.0.0.228:9153 (active)
2 => 10.0.0.103:9153 (active)
4 10.96.102.191:443 ClusterIP 1 => 172.18.0.2:4244 (active)
5 10.96.93.201:8080 ClusterIP 1 => 10.0.0.72:80 (active)
2 => 10.0.2.202:80 (active)
3 => 10.0.1.63:80 (active)
6 172.18.0.2:32000 NodePort 1 => 10.0.0.72:80 (active)
2 => 10.0.2.202:80 (active)
3 => 10.0.1.63:80 (active)
7 0.0.0.0:32000 NodePort 1 => 10.0.0.72:80 (active)
2 => 10.0.2.202:80 (active)
3 => 10.0.1.63:80 (active)
查看上面的 service
信息得到, 172.18.0.2:32000
后端有 3
个 ip
地址信息,并且后端端口为 80
, cilium
劫持到 Pod
需要访问 service
信息,即会查询该 service
对应的后端 Pod
地址和端口返回给客户端,让客户端使用此地址发起 http
请求
四、TC Hook
TC from-container
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tc filter show dev lxcbf5cb1e7ceb0 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cilium-lxcbf5cb1e7ceb0 direct-action not_in_hw id 7232 tag 343a0e7d38824f0e jited
此 TC Hook
挂载在 Pod 主机侧的虚拟网络设备上,也就是 lxc-xxx
这个网络设备,方向为 ingress
- 1.当
packet
要从Pod
出去,首先packet
会经过eth0
到达Pod
对应的lxc-xxx
的TC ingress
,这里的eBPF
程序的Section
是from-container
。 - 2.数据包是从
Pod
发出去的,如果是original
类型的数据包,会在这里完成数据包的DNAT
操作,管理连接跟踪;如果是reply
类型的数据包,会完成reverse DNAT
的操作。 - 3.对于有
Network Policy
验证需要的数据包,会经过tproxy
的方式,将数据包发往cilium proxy
去验证是不是可以access
,对于正常的从服务端返回的reply
数据包,会标记成skip policy
,也就是不需要做验证。具体实现参考policy_can_egress4
方法。
从 Pod
里出来的数据包的目的方向大致可以分为几个路径
- 1.第一个是
Pod
访问本地的Pod
,经过ipv4_local_delivery
来完成数据包的转发 - 2.第二个是
Pod
访问外网,经过Kernel Stack
完成解析后,经过物理网卡的to-netdev
出去 - 3.第三个是
Pod
访问本地的非 Pod 的,也是经过Kernel Stack
完成 - 4.第四个是
Pod
访问非本机的集群其它机器中的Pod
,经过Kernel Stack
完成解析后,经过物理网卡to-netdev
出去 - 5.第五个是
Pod
经过overlay
网络访问非本机的集群其它机器中的Pod
,经过cilium_vxlan
,然后经过Kernel Stack
完成解析后经过物理网卡出去
TC to-container
这是一个 eBPF Section
,这是一个比较特别的 eBPF
的程序。它和 Pod
相关,但是却不是挂载在 TC egress
上的,而是以 eBPF
程序,保存在 eBPF
中的 map
中的, eBPF
中有一种 map
类型,是可以保存 eBPF
程序的 (BPF_MAP_TYPE_PROG_ARRAY
) ,同时 map
中的程序都有自己的 id
编号,可以用来在 tail call
中直接调用,to-container
就是其中之一,而且是每一个 Pod
都有自己独立的 to-container
,和其它和 Pod
相关的 eBPF
程序一样。
它的主要作用就是,在数据包进入 Pod
进入之前,完成 Policy
的验证,如果验证通过了会使用 Linux
的 redirect
方法,将数据包转发到 Pod
的虚拟网络设备上。至于是 lxc-xxx
还是 eth0
,看内核的版本支持不支持 redirect peer
,如果支持 redirect peer
,就会直接转发到 Pod
的 eth0
,如果不支持就会转发到 Pod
的主机侧的 lxc-xxx
上。
TC to-netdev
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tc filter show dev eth0 egress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cilium-eth0 direct-action not_in_hw id 7186 tag d1f9b589b59b2d62 jited
此TC Hook
挂载在物理网卡的 TC egress
上的 eBPF
程序。主要的作用是: 完成主机数据包出主机,对数据包进行处理。
这里主要有两个场景会涉及到 to-netdev:
- 1.第一个就是开启了主机的防火墙,这里的防火墙不是
iptables
实现的,而是基于eBPF
实现的Host Network Policy
,用于处理什么样的数据包是可以出主机的,通过handle_to_netdev_ipv4
方法完成; - 2.第二个就是开启了
NodePort
之后,在需要访问的Backend Pod
不在主机的时候,会在这里完成SNAT
操作,通过handle_nat_fwd
方法和nodeport_nat_ipv4_fwd
方法完成SNAT
。 - 3.和
from-overlay
关系:数据包都是从物理网卡到达主机的,那是怎样到达cilium_vxlan
的?这个是由挂载在物理网卡的from-netdev
,根据隧道类型,将数据包通过redirect
的方式,传递到cilium_vxlan
的。参见:from-netdev
的encap_and_redirect_with_nodeid
方法,其中主要有两个方法,一是 __encap_with_nodeid(ctx, tunnel_endpoint, seclabel, monitor) 方法完成encapsulating
,二是 ctx_redirect(ctx, ENCAP_IFINDEX, 0) 方法完成将数据包redirect
到cilium_vxlan
对应的ENCAP_IFINDEX
这个ifindex
。
TC from-netdev
root@cilium-kubeproxy-replacement-ebpf-worker2:/# tc filter show dev eth0 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cilium-eth0 direct-action not_in_hw id 7177 tag 94772f5b3d31e1ac jited
此TC Hook
挂载在物理网卡的 TC ingress
上的 eBPF
程序。主要的作用是,完成主机数据包到达主机,对数据包进行处理。
这里主要有两个场景会涉及到 from-netdev
- 1.第一个就是开启了主机的防火墙,这里的防火墙不是
iptables
实现的,而是基于eBPF
实现的Host Network Policy
,用于处理什么样的数据包是可以访问主机的 - 2.第二个就是开启了
NodePort
,可以处理外部通过K8S NodePort
的服务访问方式,访问服务,包括DNAT、LB、CT
等操作,将外部的访问流量打到本地的Pod
,或者通过TC
的redirect
的方式,将数据流量转发到to-netdev
进行SNAT
之后,再转发到提供服务的Pod
所在的主机