Cilium VxLAN with eBPF 模式
查看此文档前需要熟悉 VxLAN 相关知识,可以参考此文档:Linux 虚拟网络 VXLAN
一、环境信息
| 主机 | IP |
| --- | --- |
| ubuntu | 172.16.94.141 |

| 软件 | 版本 |
| --- | --- |
| docker | 26.1.4 |
| helm | v3.15.0-rc.2 |
| kind | 0.18.0 |
| kubernetes | 1.23.4 |
| ubuntu os | Ubuntu 20.04.6 LTS |
| kernel | 5.11.5(内核升级文档) |
二、安装服务
kind 配置文件信息
root@kind:~# cat install.sh
#!/bin/bash
date
set -v
# 1.prep noCNI env
cat <<EOF | kind create cluster --name=cilium-kubeproxy-replacement-ebpf-vxlan --image=kindest/node:v1.23.4 --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # kind 默认使用 rancher cni,cni 我们需要自己创建
  disableDefaultCNI: true
  # kind 安装 k8s 集群需要禁用 kube-proxy 安装,由 cilium 代替 kube-proxy 功能
  kubeProxyMode: "none"
nodes:
  - role: control-plane
  - role: worker
  - role: worker
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor.evescn.com"]
      endpoint = ["https://harbor.evescn.com"]
EOF
# 2.get controller_node_ip
controller_node_ip=`kubectl get node -o wide --no-headers | grep -E "control-plane|bpf1" | awk -F " " '{print $6}'`
# 3.install cni
helm repo add cilium https://helm.cilium.io > /dev/null 2>&1
helm repo update > /dev/null 2>&1
# Direct Routing Options(--set kubeProxyReplacement=strict --set tunnel=disabled --set autoDirectNodeRoutes=true --set ipv4NativeRoutingCIDR="10.0.0.0/8")
# Host Routing[EBPF](--set bpf.masquerade=true)
helm install cilium cilium/cilium \
--set k8sServiceHost=$controller_node_ip \
--set k8sServicePort=6443 \
--version 1.13.0-rc5 \
--namespace kube-system \
--set debug.enabled=true \
--set debug.verbose=datapath \
--set monitorAggregation=none \
--set ipam.mode=cluster-pool \
--set cluster.name=cilium-kubeproxy-replacement-ebpf-vxlan \
--set kubeProxyReplacement=strict \
--set bpf.masquerade=true
# 4.install necessary tools
for i in $(docker ps -a --format "table {{.Names}}" | grep cilium)
do
echo $i
docker cp /usr/bin/ping $i:/usr/bin/ping
docker exec -it $i bash -c "sed -i -e 's/jp.archive.ubuntu.com\|archive.ubuntu.com\|security.ubuntu.com/old-releases.ubuntu.com/g' /etc/apt/sources.list"
docker exec -it $i bash -c "apt-get -y update >/dev/null && apt-get -y install net-tools tcpdump lrzsz bridge-utils >/dev/null 2>&1"
done
`--set` 参数解释

- `--set kubeProxyReplacement=strict`
  - 含义: 启用 kube-proxy 替代功能,并以严格模式运行。
  - 用途: Cilium 将完全替代 kube-proxy 实现服务负载均衡,提供更高效的流量转发和网络策略管理。
- `--set bpf.masquerade=true`
  - 含义: 启用 eBPF 伪装(masquerade)功能。
  - 用途: 使用 eBPF 完成数据包的 SNAT(IP 伪装),提供更高效和灵活的网络地址转换功能。(可用下方命令确认配置是否下发)
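想进一步确认这两个参数是否生效,可以直接查看 Helm 渲染出来的 cilium-config ConfigMap。下面只是一个检查思路示例,键名 kube-proxy-replacement、enable-bpf-masquerade 以实际 Cilium 版本为准:

# 确认 kube-proxy 替代模式与 eBPF masquerade 配置是否已下发到 cilium-config
kubectl -n kube-system get configmap cilium-config -o yaml | grep -E "kube-proxy-replacement|enable-bpf-masquerade"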
- 安装 `k8s` 集群和 `cilium` 服务
root@kind:~# ./install.sh
Creating cluster "cilium-kubeproxy-replacement-ebpf-vxlan" ...
✓ Ensuring node image (kindest/node:v1.23.4) 🖼
✓ Preparing nodes 📦 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing StorageClass 💾
✓ Joining worker nodes 🚜
Set kubectl context to "kind-cilium-kubeproxy-replacement-ebpf-vxlan"
You can now use your cluster with:
kubectl cluster-info --context kind-cilium-kubeproxy-replacement-ebpf-vxlan
Not sure what to do next? 😅 Check out https://kind.sigs.k8s.io/docs/user/quick-start/
- 查看安装的服务
root@kind:~# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-5zqzb 1/1 Running 0 9m24s
kube-system cilium-mlmk5 1/1 Running 0 9m24s
kube-system cilium-operator-dd757785c-phdrj 1/1 Running 0 9m24s
kube-system cilium-operator-dd757785c-rn9dh 1/1 Running 0 9m24s
kube-system cilium-xktlw 1/1 Running 0 9m24s
kube-system coredns-64897985d-cjbv8 1/1 Running 0 11m
kube-system coredns-64897985d-hpxvj 1/1 Running 0 11m
kube-system etcd-cilium-kubeproxy-replacement-ebpf-vxlan-control-plane 1/1 Running 0 11m
kube-system kube-apiserver-cilium-kubeproxy-replacement-ebpf-vxlan-control-plane 1/1 Running 0 11m
kube-system kube-controller-manager-cilium-kubeproxy-replacement-ebpf-vxlan-control-plane 1/1 Running 0 11m
kube-system kube-scheduler-cilium-kubeproxy-replacement-ebpf-vxlan-control-plane 1/1 Running 0 11m
local-path-storage local-path-provisioner-5ddd94ff66-gthlr 1/1 Running 0 11m
没有 `kube-proxy` 服务,因为设置了 `kubeProxyReplacement=strict`,`cilium` 将完全替代 `kube-proxy` 实现服务负载均衡。并且在 `kind` 安装 `k8s` 集群的时候也需要禁用 `kube-proxy` 的安装:`kubeProxyMode: "none"`。
`cilium` 配置信息
# kubectl -n kube-system exec -it ds/cilium -- cilium status
KVStore: Ok Disabled
Kubernetes: Ok 1.23 (v1.23.4) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 172.18.0.2]
Host firewall: Disabled
CNI Chaining: none
CNI Config file: CNI configuration file management disabled
Cilium: Ok 1.13.0-rc5 (v1.13.0-rc5-dc22a46f)
NodeMonitor: Listening for events on 128 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 3/254 allocated from 10.0.0.0/24,
IPv6 BIG TCP: Disabled
BandwidthManager: Disabled
Host Routing: BPF
Masquerading: BPF [eth0] 10.0.0.0/24 [IPv4: Enabled, IPv6: Disabled]
Controller Status: 22/22 healthy
Proxy Status: OK, ip 10.0.0.109, 0 redirects active on ports 10000-20000
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 4095/4095 (100.00%), Flows/s: 1.06 Metrics: Disabled
Encryption: Disabled
Cluster health: 3/3 reachable (2024-06-28T07:39:34Z)
- 1.`KubeProxyReplacement: Strict [eth0 172.18.0.2]`
  - Cilium 完全接管所有 kube-proxy 功能,包括服务负载均衡、NodePort 和其他网络策略管理。这种配置适用于希望最大限度利用 Cilium 的高级网络功能,并完全替代 kube-proxy 的场景。此模式提供更高效的流量转发和更强大的网络策略管理。
  - 并且不同于之前的 `KubeProxyReplacement: Strict [eth0 172.18.0.3 (Direct Routing)]`,这里已经没有 `Direct Routing`,表示不再使用 Direct Routing 进行跨节点通讯,此处使用 VxLAN 模式通讯。
- 2.`Host Routing: BPF`
  - 使用 BPF 进行主机路由。
- 3.`Masquerading: BPF [eth0] 10.0.0.0/24 [IPv4: Enabled, IPv6: Disabled]`
  - 使用 BPF 进行 IP 伪装(NAT),接口为 eth0,目的地址在 10.0.0.0/24 范围内的流量不会进行 NAT。IPv4 伪装启用,IPv6 伪装禁用。(可以用下方命令快速核对这些字段)
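也可以在 cilium 容器内用一条命令核对上面关注的几个字段,示例命令如下(沿用前文 `ds/cilium` 的用法):

# 核对 KubeProxyReplacement / Host Routing / Masquerading 字段
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose | grep -E "KubeProxyReplacement|Host Routing|Masquerading"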
`k8s` 集群安装 `Pod` 测试网络
# cat cni.yaml
apiVersion: apps/v1
kind: DaemonSet
#kind: Deployment
metadata:
  labels:
    app: evescn
  name: evescn
spec:
  #replicas: 1
  selector:
    matchLabels:
      app: evescn
  template:
    metadata:
      labels:
        app: evescn
    spec:
      containers:
      - image: harbor.dayuan1997.com/devops/nettool:0.9
        name: nettoolbox
        securityContext:
          privileged: true
---
apiVersion: v1
kind: Service
metadata:
  name: serversvc
spec:
  type: NodePort
  selector:
    app: evescn
  ports:
  - name: cni
    port: 8080
    targetPort: 80
    nodePort: 32000
root@kind:~# kubectl apply -f cni.yaml
daemonset.apps/evescn created
service/serversvc created
root@kind:~# kubectl run net --image=harbor.dayuan1997.com/devops/nettool:0.9
pod/net created
- 查看安装服务信息
root@kind:~# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
evescn-2454l 1/1 Running 0 3m30s 10.0.0.24 cilium-kubeproxy-replacement-ebpf-vxlan-worker2 <none> <none>
evescn-qff5p 1/1 Running 0 3m30s 10.0.2.210 cilium-kubeproxy-replacement-ebpf-vxlan-worker <none> <none>
net 1/1 Running 0 10s 10.0.2.119 cilium-kubeproxy-replacement-ebpf-vxlan-worker <none> <none>
root@kind:~# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 15m
serversvc NodePort 10.96.158.92 <none> 8080:32000/TCP 3m40s
三、测试网络
同节点 Pod 网络通讯
可以查看此文档 Cilium Native Routing with eBPF 模式 中的同节点网络通讯部分,数据包转发流程一致
不同节点 Pod 网络通讯

`Pod` 节点信息
## ip 信息
root@kind:~# kubectl exec -it net -- ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
8: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ae:32:09:5a:08:5c brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.2.40/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::ac32:9ff:fe5a:85c/64 scope link
valid_lft forever preferred_lft forever
## 路由信息
root@kind:~# kubectl exec -it net -- ip r s
default via 10.0.2.194 dev eth0 mtu 1450
10.0.2.194 dev eth0 scope link
查看 `Pod` 信息发现,在 `cilium` 中 Pod 的 `IP` 地址为 32 位掩码,意味着该 `IP` 地址是单个主机的唯一标识,而不是一个子网。该 Pod 访问其他 `IP` 均会走路由到达。(可用下方命令验证网关的邻居表项)
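这里可以顺带验证一下网关 10.0.2.194 的邻居表项:在 cilium 中,Pod 侧看到的网关 MAC 通常就是宿主机侧 lxc 网卡(veth pair)的 MAC(ARP 请求由 eBPF 程序应答),即上文抓包中看到的 ee:7b:eb:76:25:37。示例命令如下,假设 nettool 镜像内带有 iproute2:

# 在 net Pod 内查看邻居表,预期网关 10.0.2.194 对应的 MAC 即宿主机侧 lxc 网卡的 MAC
kubectl exec -it net -- ip neigh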
`Pod` 所在 `Node` 节点信息
root@kind:~# docker exec -it cilium-kubeproxy-replacement-ebpf-vxlan-worker bash
## ip 信息
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 4e:cc:d4:f9:0e:52 brd ff:ff:ff:ff:ff:ff
inet6 fe80::4ccc:d4ff:fef9:e52/64 scope link
valid_lft forever preferred_lft forever
3: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ea:0c:d0:41:5c:59 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.194/32 scope link cilium_host
valid_lft forever preferred_lft forever
inet6 fe80::e80c:d0ff:fe41:5c59/64 scope link
valid_lft forever preferred_lft forever
4: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether a6:cc:fc:d1:d9:4f brd ff:ff:ff:ff:ff:ff
inet6 fe80::a4cc:fcff:fed1:d94f/64 scope link
valid_lft forever preferred_lft forever
6: lxc_health@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 22:11:17:5f:1f:15 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::2011:17ff:fe5f:1f15/64 scope link
valid_lft forever preferred_lft forever
8: lxc2e4ac370ba86@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 5a:34:6e:44:ca:47 brd ff:ff:ff:ff:ff:ff link-netns cni-bd3fd18b-78ca-2fd1-3cc6-05a99a711b83
inet6 fe80::5834:6eff:fe44:ca47/64 scope link
valid_lft forever preferred_lft forever
10: lxcf46963b8bfc3@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ee:7b:eb:76:25:37 brd ff:ff:ff:ff:ff:ff link-netns cni-5af07f3c-65fe-1453-606f-d38b4a698803
inet6 fe80::ec7b:ebff:fe76:2537/64 scope link
valid_lft forever preferred_lft forever
31: eth0@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fc00:f853:ccd:e793::3/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:3/64 scope link
valid_lft forever preferred_lft forever
## 路由信息
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# ip r s
default via 172.18.0.1 dev eth0
10.0.0.0/24 via 10.0.2.194 dev cilium_host src 10.0.2.194 mtu 1450
10.0.1.0/24 via 10.0.2.194 dev cilium_host src 10.0.2.194 mtu 1450
10.0.2.0/24 via 10.0.2.194 dev cilium_host src 10.0.2.194
10.0.2.194 dev cilium_host scope link
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
## ARP 信息
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# arp -n
Address HWtype HWaddress Flags Mask Iface
172.18.0.4 ether 02:42:ac:12:00:04 C eth0
172.18.0.1 ether 02:42:2f:fe:43:35 C eth0
172.18.0.2 ether 02:42:ac:12:00:02 C eth0
## VxLAN 信息
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# ip -d link show | grep -A 3 vxlan
4: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether a6:cc:fc:d1:d9:4f brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
vxlan external addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
- 1.查看 `Cilium VxLAN` 信息发现,`Cilium VxLAN` 的模式和传统的 `VxLAN` 模式区别很大:
  - 没有 VxLAN 封装数据需要使用的信息(`ip -d link` 中只显示 `vxlan external`,没有 remote、id、dstport 等配置);
  - 并且 `cilium_vxlan` 网卡没有 `mac` 地址,那么在封装包的时候使用什么地址信息?
  - 传统模式下数据包到达宿主机后通过 `route` 路由表信息发送到 `vxlan` 接口,但是现在宿主机路由表中对端网段都走 `cilium_host` 网卡。转发跨主机数据包到 `cilium_vxlan` 网卡的功能在 `cilium` 中由底层 `tc hook` 实现(见下方示例),类似 同节点通讯 也不会走到 `cilium_host` 网卡一样。
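可以在宿主机上直接确认这条 tc hook 链路:Pod 对应的 lxc 网卡 ingress 上挂着 from-container 程序,cilium_vxlan 的 egress 上挂着 to-overlay 程序。示例命令如下,lxc 网卡名取自上文 ip a 输出,实际环境会不同:

# 查看 Pod 对应 veth(lxc 网卡)ingress 方向的 eBPF 程序(from-container)
tc filter show dev lxcf46963b8bfc3 ingress
# 查看 cilium_vxlan egress 方向的 eBPF 程序(to-overlay)
tc filter show dev cilium_vxlan egress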
- 2.`VxLAN` 封装数据需要使用的对端地址信息,其实存放在 `cilium` 中:
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/home/cilium# cilium bpf tunnel list
TUNNEL VALUE
10.0.1.0:0 172.18.0.4:0
10.0.0.0:0 172.18.0.2:0
- 3.`VxLAN` 封装数据还需要使用 `vni` 信息,此信息也存放在 `cilium` 中(identity 列表如下,列表后附有按 Pod 反查 identity 的示例命令):
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/home/cilium# cilium identity list
ID LABELS
1 reserved:host
2 reserved:world
3 reserved:unmanaged
4 reserved:health
5 reserved:init
6 reserved:remote-node
7 reserved:kube-apiserver
reserved:remote-node
8 reserved:ingress
3178 k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system
k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy-replacement-ebpf-vxlan
k8s:io.cilium.k8s.policy.serviceaccount=coredns
k8s:io.kubernetes.pod.namespace=kube-system
k8s:k8s-app=kube-dns
32359 k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy-replacement-ebpf-vxlan
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=default
k8s:run=net
42512 k8s:app=local-path-provisioner
k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=local-path-storage
k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy-replacement-ebpf-vxlan
k8s:io.cilium.k8s.policy.serviceaccount=local-path-provisioner-service-account
k8s:io.kubernetes.pod.namespace=local-path-storage
46188 k8s:app=evescn
k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy-replacement-ebpf-vxlan
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=default
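如果只想确认某个 Pod 对应的 identity id,也可以直接查看 CiliumEndpoint 对象,或在 cilium 容器内按 id 反查标签。示例命令如下,输出列名以实际版本为准:

# CiliumEndpoint CRD 中带有每个 Pod 的 identity id
kubectl get ciliumendpoints -A
# 按 id 反查标签,例如上文 net Pod 对应的 32359
kubectl -n kube-system exec -it ds/cilium -- cilium identity get 32359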
`Pod` 节点进行 `ping` 包测试
root@kind:~# kubectl exec -it net -- ping 10.0.0.24 -c 1
PING 10.0.0.24 (10.0.0.24): 56 data bytes
64 bytes from 10.0.0.24: seq=0 ttl=62 time=1.916 ms
--- 10.0.0.24 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.916/1.916/1.916 ms
`Pod` 节点 `eth0` 网卡抓包
net~$ tcpdump -pne -i eth0
08:36:54.353393 72:4a:62:7e:62:6e > ee:7b:eb:76:25:37, ethertype IPv4 (0x0800), length 98: 10.0.2.119 > 10.0.0.24: ICMP echo request, id 64, seq 0, length 64
08:36:54.354259 ee:7b:eb:76:25:37 > 72:4a:62:7e:62:6e, ethertype IPv4 (0x0800), length 98: 10.0.0.24 > 10.0.2.119: ICMP echo reply, id 64, seq 0, length 64
`Node` 节点 `cilium-kubeproxy-replacement-ebpf-vxlan-worker` 的 `eth0` 网卡抓包,并使用 wireshark 工具分析
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# tcpdump -pne -i eth0 -w /tmp/vxlan.cap
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# sz /tmp/vxlan.cap
- 1.`request` 数据包信息
  - `icmp` 包中,外层 `mac` 信息:源 `mac: 02:42:ac:12:00:03` 为 `cilium-kubeproxy-replacement-ebpf-vxlan-worker` 的 `eth0` 网卡 `mac`,目的 `mac: 02:42:ac:12:00:02` 为对端 `Pod` 宿主机 `cilium-kubeproxy-replacement-ebpf-vxlan-worker2` 的 `eth0` 网卡 `mac`。依旧使用 `udp` 协议 `8472` 端口进行数据传输,`vxlan` 信息中的 `vni` 为 `cilium identity list` 命令中的 `id` 信息。(若只想抓取封装流量,可参考下方按 UDP 8472 过滤的抓包示例)
  - 需要注意的是此处内层数据包的 `mac` 信息:在传统的 `vxlan` 模式中,内层源 `mac` 应该为本机 `vxlan` 网卡的 `mac` 地址,内层目的 `mac` 应该为对端 `vxlan` 网卡的 `mac` 地址;但是在 `cilium vxlan` 模式中,`vxlan` 网卡没有 `mac` 地址,那么此处使用的是什么地址信息?查看前文源 `pod` 和源 `node` 节点的 `mac` 信息,发现这个内层数据包的信息同 `Pod` 节点 `eth0` 网卡抓包的数据包信息一致。可以得出结论:在 `cilium vxlan` 模式中,相比传统 `vxlan` 模式需要替换内层数据包中的 `mac` 信息,`cilium vxlan` 直接使用源 `pod eth0` 网卡 `mac` 和 `eth0` 网卡对应的 `veth pair` 网卡 `mac` 作为内层数据包的 `mac` 地址,减少一次修改 `mac` 地址的操作。
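如果只关心 VxLAN 封装流量,抓包时可以直接按 UDP 8472 端口过滤,减少其它流量干扰,示例命令如下:

# 只抓取 VxLAN(UDP 8472)封装的跨节点流量
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# tcpdump -pne -i eth0 udp port 8472 -w /tmp/vxlan-only.cap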
- 2.`reply` 数据包信息
  - 查看 `reply` 数据包的 `vni` 信息,发现在 `cilium vxlan` 模式中,一对数据包(request/reply)使用的 `vni` 不一致,这也不同于传统 `vxlan` 模式。数据包中的 `vni` 信息,均是通过 `cilium identity list` 查询到发包 `pod` 对应的 `id` 后,使用此 `id` 作为 `vni` 进行发包:`reply` 数据包中 `vni = 46188`,`46188 k8s:app=evescn` 即是发送 `reply` 的 `pod` `evescn-2454l` 自身的 `identity id`;`request` 数据包中 `vni = 32359`,`32359 k8s:run=net` 也是发送 `request` 的 `pod` `net` 自身的 `identity id`。
- `Node` 节点 `cilium-kubeproxy-replacement-ebpf-vxlan-worker2` 的 `eth0` 网卡抓包分析
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker2:/# tcpdump -pne -i eth0 -w /tmp/vxlan2.cap
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker2:/# sz /tmp/vxlan2.cap
- 1.`request` 数据包信息同 `cilium-kubeproxy-replacement-ebpf-vxlan-worker` 的 `eth0` 网卡抓包一致
- 2.`reply` 数据包信息中,内层 mac 信息为目的 pod 的 eth0 网卡和对应的 veth pair 网卡 mac 地址
- `Node` 节点 `cilium-kubeproxy-replacement-ebpf-vxlan-worker2` 的 `cilium_vxlan` 网卡抓包
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker2:/# tcpdump -pne -i cilium_vxlan
09:24:28.995715 72:4a:62:7e:62:6e > ee:7b:eb:76:25:37, ethertype IPv4 (0x0800), length 98: 10.0.2.119 > 10.0.0.24: ICMP echo request, id 123, seq 0, length 64
09:24:28.996094 b6:f7:9a:c4:59:73 > 36:07:df:de:f9:af, ethertype IPv4 (0x0800), length 98: 10.0.0.24 > 10.0.2.119: ICMP echo reply, id 123, seq 0, length 64
- 1.`request` 数据包中的 mac 信息为源 `pod eth0` 网卡 `mac` 和 `eth0` 网卡对应的 `veth pair` 网卡 `mac`
- 2.`reply` 数据包中的 mac 信息为目的 `pod eth0` 网卡 `mac` 和 `eth0` 网卡对应的 `veth pair` 网卡 `mac`
查看 `cilium-kubeproxy-replacement-ebpf-vxlan-worker2` 节点 `ip` 信息

root@cilium-kubeproxy-replacement-ebpf-vxlan-worker2:/# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether c2:a6:57:6f:a9:3c brd ff:ff:ff:ff:ff:ff
inet6 fe80::c0a6:57ff:fe6f:a93c/64 scope link
valid_lft forever preferred_lft forever
3: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 32:b1:f8:2d:ab:7e brd ff:ff:ff:ff:ff:ff
inet 10.0.0.109/32 scope link cilium_host
valid_lft forever preferred_lft forever
inet6 fe80::30b1:f8ff:fe2d:ab7e/64 scope link
valid_lft forever preferred_lft forever
4: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether e6:ec:04:8e:b8:86 brd ff:ff:ff:ff:ff:ff
inet6 fe80::e4ec:4ff:fe8e:b886/64 scope link
valid_lft forever preferred_lft forever
6: lxc_health@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 02:30:97:b1:cc:17 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::30:97ff:feb1:cc17/64 scope link
valid_lft forever preferred_lft forever
8: lxc7c697c641cac@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 36:07:df:de:f9:af brd ff:ff:ff:ff:ff:ff link-netns cni-1a2f2717-a83f-1aea-102d-e9c4ae83cc51
inet6 fe80::3407:dfff:fede:f9af/64 scope link
valid_lft forever preferred_lft forever
29: eth0@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fc00:f853:ccd:e793::2/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:2/64 scope link
valid_lft forever preferred_lft forever
- 目标 `Pod` `ip` 信息
root@kind:~# kubectl exec -it evescn-2454l -- ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
7: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether b6:f7:9a:c4:59:73 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.24/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::b4f7:9aff:fec4:5973/64 scope link
valid_lft forever preferred_lft forever
- 数据从 `net` 服务发出,通过查看本机路由表,送往 `node` 节点。路由:`default via 10.0.2.194 dev eth0 mtu 1450`
- `node` 节点获取到数据包后,查询路由表发现不是发往本节点的数据包,会被 `node` 节点上的 `cilium` 服务接管,`cilium` 的 `tc hook` 会把数据包送往 `cilium_vxlan` 接口
- `cilium_vxlan` 接口收到数据包后,基于 `cilium vxlan` 工作模式,不会再修改数据包的 `mac` 地址信息(因为接口也没有 `mac` 地址)。在 `cilium` 系统中查询 vxlan 信息 [`cilium bpf tunnel list`] 后,对数据进行封装。
- 数据封装完成后,会送往 `eth0` 网卡,并发往对端 `node` 节点。
- 对端 `node` 节点接收到数据包后,发现这是一个发往 `UDP 8472` 端口的 `vxlan` 数据包,将数据包交给监听 `UDP 8472` 端口的程序(内核模块/eBPF 程序)处理。
- 解封装后发现内层数据包的目的地址为 `10.0.0.24`,是本机 `Pod` 地址段,会直接送往目标 `Pod` [`bpf_redirect_peer()`],相关的 endpoint 查询示例见下方。
- 最终把数据包送到目的 `Pod`。
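目的端的本地投递依赖 cilium 维护的 endpoint 映射,可以在目的端节点对应的 cilium Pod 内查看,示例命令如下,输出格式以实际版本为准:

# 查看本机 endpoint 映射(IP -> 本地 endpoint 信息)
kubectl -n kube-system exec -it ds/cilium -- cilium bpf endpoint list
# 查看某个 endpoint 的详细信息,例如后文 monitor 输出中的 endpoint 2546
kubectl -n kube-system exec -it ds/cilium -- cilium endpoint get 2546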
Service 网络通讯

可以查看此文档 Cilium Native Routing with eBPF 模式 中的 `Service` 网络通讯部分:之所以会直接返回 pod ip 地址进行通讯,是因为设置了 `cilium` 代替 `kube-proxy` 功能(`--set kubeProxyReplacement=strict`)。

- 查看 `Service` 信息
root@kind:~# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 9m
serversvc NodePort 10.96.93.201 <none> 8080:32000/TCP 76s
`net` 服务上请求 `Pod` 所在 `Node` 节点 `32000` 端口
root@kind:~# k exec -it net -- curl 172.18.0.3:32000
PodName: evescn-qff5p | PodIP: eth0 10.0.2.210/32
root@kind:~# k exec -it net -- curl 172.18.0.2:32000
PodName: evescn-2454l | PodIP: eth0 10.0.0.24/32
并在 `node` 节点 `eth0` 网卡抓包查看

在 `node` 节点 `eth0` 网卡只能抓取到跨节点的请求,并且此请求依旧会使用 `vxlan` 模式进行通讯;而同节点的数据包在 eth0 网卡无法抓到。那么后续 `cilium vxlan with ebpf` 模式是否会优化?比如查询到 `service` 后端中有和客户端位于同一节点的 Pod 时,优先使用这个 `pod ip` 响应客户端的请求?cilium 当前维护的 Service 后端映射可以用下面的命令查看。
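cilium 接管 kube-proxy 后,ClusterIP / NodePort 到后端 Pod 的映射保存在 eBPF map 中,可以在 cilium 容器内查看,示例命令如下:

# 查看 cilium 维护的 Service 负载均衡表(ClusterIP / NodePort -> 后端 Pod)
kubectl -n kube-system exec -it ds/cilium -- cilium service list
# 查看底层 BPF lb map
kubectl -n kube-system exec -it ds/cilium -- cilium bpf lb list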
四、TC Hook
TC to-overlay
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker2:/# tc filter show dev cilium_vxlan egress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_overlay.o:[to-overlay] direct-action not_in_hw id 8264 tag 5f26d5ede04f0820 jited
- 1.这是一个 `eBPF Section`,在选择使用 `Cilium` 的 `overlay` 网络模型时,会使用到和 `overlay` 相关的虚拟网络设备,这些设备是 `Cilium Agent` 在启动的时候在主机空间创建出来的,基于 `VxLAN` 技术完成 `overlay` 网络。
- 2.`to-overlay` 的挂载点就是挂载在 `cilium_vxlan` 这个网络设备的 `Egress` 上。主要完成的工作包括:通过 `nodeport_nat_ipv4_fwd` 和 `snat_v4_process` 完成 `Egress` 方向的 `SNAT` 操作,`SNAT` 的作用是跨主机访问的时候,需要将源地址改成主机的地址。
- 3.`to-overlay` 处理的数据包是通过 `lxc-xxx` 的 `from-container`(CTX Redirect)程序 `redirect` 过来的数据包,这说明 `to-overlay` 处理的是 `Pod` 通过 `vxlan` 出去的数据包。该程序的详细信息可以用下方的 bpftool 示例进一步查看。
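上面输出中的 id 8264 是该 eBPF 程序在内核中的 id(每个环境不同),如果节点上装有 bpftool,可以进一步查看程序的元信息。示例命令如下,id 需替换为本机实际值:

# 查看 to-overlay 程序的元信息(加载时间、关联的 map id 等)
bpftool prog show id 8264
# 反汇编查看该程序的指令(输出较长,这里只看前 20 行)
bpftool prog dump xlated id 8264 | head -n 20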
TC from-overlay
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker2:/# tc filter show dev cilium_vxlan ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_overlay.o:[from-overlay] direct-action not_in_hw id 8251 tag 10c7559775622132 jited
- 1.这是一个 `eBPF Section`,在选择使用 `Cilium` 的 `overlay` 网络模型时,会使用到和 `overlay` 相关的虚拟网络设备,这些设备是 `Agent` 在启动的时候在主机空间创建出来的,基于 `VxLAN` 技术完成 `overlay` 网络。
- 2.`from-overlay` 的挂载点挂载在 `cilium_vxlan` 这个网络设备的 `TC ingress` 上,主要处理的是进入 `Pod` 的数据包,数据包在进入 `Pod` 之前会先经过 `cilium_vxlan` 的 `from-overlay` 处理。

主要完成的工作包括以下几点:

- 1.包括 `IPsec` 安全相关的处理;
- 2.包括将数据包通过 `ipv4_local_delivery` 方法,传递到后端服务在本机的 `Pod` 中的 `case`,其中也会经过 `to-container`;
- 3.通过 `NodePort` 访问服务时,服务不在本地的时候,使用和 `from-netdev` 类似的方式去处理 `NodePort` 的请求,其中包括 `DNAT、CT` 等操作,最后通过 `ep_tail_call(ctx, CILIUM_CALL_IPV4_NODEPORT_NAT)` 或者 `ep_tail_call(ctx, CILIUM_CALL_IPV4_NODEPORT_DSR)` 方法完成数据包的处理,具体是哪种,取决于 `NodePort` 的 `LB` 策略是 `SNAT` 还是 `DSR`;
- 4.数据包都是从物理网卡到达主机的,那是怎样到达 `cilium_vxlan` 的?这个是由挂载在物理网卡的 `from-netdev`,根据隧道类型,将数据包通过 `redirect` 的方式传递到 `cilium_vxlan` 的。参见:`from-netdev` 的 `encap_and_redirect_with_nodeid` 方法,其中主要有两个方法,一是 `__encap_with_nodeid(ctx, tunnel_endpoint, seclabel, monitor)` 方法完成 encapsulating,二是 `ctx_redirect(ctx, ENCAP_IFINDEX, 0)` 方法完成将数据包 `redirect` 到 `cilium_vxlan` 对应的 `ENCAP_IFINDEX` 这个 `ifindex`。(本机各网卡上挂载的 tc 程序可用下方脚本统一查看)
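如果想一次性看清本机各网卡上挂载了哪些 cilium 的 tc eBPF 程序(from/to-netdev、from/to-overlay、from-container 等),可以用一个简单的循环,示例脚本如下:

# 遍历本机网卡,列出 tc ingress/egress 上挂载的 eBPF 程序
for dev in $(ls /sys/class/net); do
    echo "== $dev =="
    tc filter show dev "$dev" ingress
    tc filter show dev "$dev" egress
done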
TC to-netdev
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# tc filter show dev eth0 egress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cilium-eth0 direct-action not_in_hw id 7186 tag d1f9b589b59b2d62 jited
此 `TC Hook` 是挂载在物理网卡 `TC egress` 上的 `eBPF` 程序。主要的作用是:在数据包离开主机时对其进行处理。

这里主要有以下场景会涉及到 to-netdev:

- 1.第一个是开启了主机防火墙,这里的防火墙不是 `iptables` 实现的,而是基于 `eBPF` 实现的 `Host Network Policy`,用于决定什么样的数据包可以出主机,通过 `handle_to_netdev_ipv4` 方法完成;
- 2.第二个是开启了 `NodePort` 之后,在需要访问的 `Backend Pod` 不在本主机的时候,会在这里完成 `SNAT` 操作,通过 `handle_nat_fwd` 方法和 `nodeport_nat_ipv4_fwd` 方法完成 `SNAT`;
- 3.和 `from-overlay` 的关系:数据包都是从物理网卡到达主机的,那是怎样到达 `cilium_vxlan` 的?这个是由挂载在物理网卡的 `from-netdev`,根据隧道类型,将数据包通过 `redirect` 的方式传递到 `cilium_vxlan` 的。参见:`from-netdev` 的 `encap_and_redirect_with_nodeid` 方法,其中主要有两个方法,一是 `__encap_with_nodeid(ctx, tunnel_endpoint, seclabel, monitor)` 方法完成 `encapsulating`,二是 `ctx_redirect(ctx, ENCAP_IFINDEX, 0)` 方法完成将数据包 `redirect` 到 `cilium_vxlan` 对应的 `ENCAP_IFINDEX` 这个 `ifindex`。
TC from-netdev
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/# tc filter show dev eth0 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cilium-eth0 direct-action not_in_hw id 7177 tag 94772f5b3d31e1ac jited
此 `TC Hook` 是挂载在物理网卡 `TC ingress` 上的 `eBPF` 程序。主要的作用是:在数据包到达主机时对其进行处理。

这里主要有两个场景会涉及到 from-netdev:

- 1.第一个是开启了主机防火墙,这里的防火墙不是 `iptables` 实现的,而是基于 `eBPF` 实现的 `Host Network Policy`,用于决定什么样的数据包可以访问主机;
- 2.第二个是开启了 `NodePort`,可以处理外部通过 `K8S NodePort` 方式访问服务的流量,包括 `DNAT、LB、CT` 等操作,将外部的访问流量转发到本地的 `Pod`;或者通过 `TC` 的 `redirect` 方式,将数据流量转发到 `to-netdev` 进行 `SNAT` 之后,再转发到提供服务的 `Pod` 所在的主机。相关的 NAT / 连接跟踪表可以用下方命令查看。
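NodePort 场景下的 SNAT 与连接跟踪条目都保存在 eBPF map 中,可以在 cilium 容器内查看,示例命令如下:

# 查看 BPF SNAT 表
kubectl -n kube-system exec -it ds/cilium -- cilium bpf nat list
# 查看全局连接跟踪表
kubectl -n kube-system exec -it ds/cilium -- cilium bpf ct list global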
to-overlay / from-overlay 流程

源端 cilium monitor 信息
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker:/home/cilium# cilium monitor -vv
CPU 01: MARK 0x0 FROM 189 DEBUG: Conntrack lookup 1/2: src=10.0.2.119:278 dst=10.0.0.24:0 // 1.Pod 内部堆栈处理
CPU 01: MARK 0x0 FROM 189 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 01: MARK 0x0 FROM 189 DEBUG: CT verdict: New, revnat=0
CPU 01: MARK 0x0 FROM 189 DEBUG: Successfully mapped addr=10.0.0.24 to identity=46188
CPU 01: MARK 0x0 FROM 189 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=32359 lb=0.0.0.0
CPU 01: MARK 0x0 FROM 189 DEBUG: Encapsulating to node 2886860802 (0xac120002) from seclabel 32359 // 2. 封装到 172.18.0.2 节点的 vxlan 包,---- 下面是包信息
------------------------------------------------------------------------------
Ethernet {Contents=[..14..] Payload=[..86..] SrcMAC=72:4a:62:7e:62:6e DstMAC=ee:7b:eb:76:25:37 EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=64237 Flags=DF FragOffset=0 TTL=64 Protocol=ICMPv4 Checksum=10541 SrcIP=10.0.2.119 DstIP=10.0.0.24 Options=[] Padding=[]}
ICMPv4 {Contents=[..8..] Payload=[..56..] TypeCode=EchoRequest Checksum=5865 Id=278 Seq=0}
Failed to decode layer: No decoder for layer type Payload
CPU 01: MARK 0x0 FROM 189 to-overlay: 98 bytes (98 captured), state new, interface cilium_vxlan, , identity 32359->46188, orig-ip 0.0.0.0 // to-overlay 完成包封装,随后发送到对端节点网卡
CPU 01: MARK 0x0 FROM 3386 DEBUG: Conntrack lookup 1/2: src=172.18.0.3:49017 dst=172.18.0.2:8472 // 3. 对端返回的数据包信息 After cilium_vxlan, the encapsulated packet need deal with host tcp/stack.
CPU 01: MARK 0x0 FROM 3386 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 01: MARK 0x0 FROM 3386 DEBUG: CT entry found lifetime=342890, revnat=0
CPU 01: MARK 0x0 FROM 3386 DEBUG: CT verdict: Established, revnat=0
CPU 01: MARK 0x0 FROM 3386 DEBUG: Successfully mapped addr=172.18.0.2 to identity=6
CPU 01: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=46188 flowlabel=0
CPU 01: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 189 from seclabel 46188
CPU 01: MARK 0x0 FROM 189 DEBUG: Conntrack lookup 1/2: src=10.0.0.24:0 dst=10.0.2.119:278 // 5. ICMP EchoReply ( After eth0 from-netdev and cilium_vxlan from-overlay ) , 本地 Pod 内部堆栈处理
CPU 01: MARK 0x0 FROM 189 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 01: MARK 0x0 FROM 189 DEBUG: CT entry found lifetime=342891, revnat=0
CPU 01: MARK 0x0 FROM 189 DEBUG: CT verdict: Reply, revnat=0
------------------------------------------------------------------------------ // 4. 对端返回数据包信息详情
Ethernet {Contents=[..14..] Payload=[..86..] SrcMAC=ee:7b:eb:76:25:37 DstMAC=72:4a:62:7e:62:6e EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=53736 Flags= FragOffset=0 TTL=63 Protocol=ICMPv4 Checksum=37682 SrcIP=10.0.0.24 DstIP=10.0.2.119 Options=[] Padding=[]}
ICMPv4 {Contents=[..8..] Payload=[..56..] TypeCode=EchoReply Checksum=7913 Id=278 Seq=0}
Failed to decode layer: No decoder for layer type Payload
CPU 01: MARK 0x0 FROM 189 to-endpoint: 98 bytes (98 captured), state reply, interface lxcf46963b8bfc3, , identity 46188->32359, orig-ip 10.0.0.24, to endpoint 189
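cilium monitor 的输出量较大,排查单个 Pod 时可以按 endpoint 或事件类型过滤,示例命令如下(endpoint id 189 取自上文输出):

# 只看与 net Pod 对应 endpoint(189)相关的事件
cilium monitor --related-to 189
# 只看 trace / drop 类型的事件
cilium monitor --type trace
cilium monitor --type drop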
目的端 cilium monitor 信息
root@cilium-kubeproxy-replacement-ebpf-vxlan-worker2:/home/cilium# cilium monitor -vv
CPU 01: MARK 0x0 FROM 630 DEBUG: Conntrack lookup 1/2: src=172.18.0.2:49017 dst=172.18.0.3:8472 // 1. 获取到对端发送过来的数据包信息
CPU 01: MARK 0x0 FROM 630 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 01: MARK 0x0 FROM 630 DEBUG: CT entry found lifetime=342875, revnat=0
CPU 01: MARK 0x0 FROM 630 DEBUG: CT verdict: Established, revnat=0
CPU 01: MARK 0x0 FROM 630 DEBUG: Successfully mapped addr=172.18.0.3 to identity=6
CPU 01: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=32359 flowlabel=0
CPU 01: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 2546 from seclabel 32359
CPU 01: MARK 0x0 FROM 2546 DEBUG: Conntrack lookup 1/2: src=10.0.2.119:278 dst=10.0.0.24:0 // 2. ICMP EchoRequest ( After eth0 from-netdev and cilium_vxlan from-overlay ) , 经过 eth0 的 from-netdev 和 cilium_vxlan 的 from-overlay 接包后,替换内层包 mac 地址信息,并发送到本地 Pod 内部堆栈处理
CPU 01: MARK 0x0 FROM 2546 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 01: MARK 0x0 FROM 2546 DEBUG: CT verdict: New, revnat=0
CPU 01: MARK 0x0 FROM 2546 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=32359 lb=0.0.0.0
------------------------------------------------------------------------------ // 3. 内部包信息,此时 mac 信息已经替换为 本地 Pod eth0 网卡和 对应的 veth pair 网卡 mac 地址
Ethernet {Contents=[..14..] Payload=[..86..] SrcMAC=36:07:df:de:f9:af DstMAC=b6:f7:9a:c4:59:73 EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=64237 Flags=DF FragOffset=0 TTL=63 Protocol=ICMPv4 Checksum=10797 SrcIP=10.0.2.119 DstIP=10.0.0.24 Options=[] Padding=[]}
ICMPv4 {Contents=[..8..] Payload=[..56..] TypeCode=EchoRequest Checksum=5865 Id=278 Seq=0}
Failed to decode layer: No decoder for layer type Payload
CPU 01: MARK 0x0 FROM 2546 to-endpoint: 98 bytes (98 captured), state new, interface lxc7c697c641cac, , identity 32359->46188, orig-ip 10.0.2.119, to endpoint 2546
CPU 01: MARK 0x0 FROM 2546 DEBUG: Conntrack lookup 1/2: src=10.0.0.24:0 dst=10.0.2.119:278 // 4. ICMP EchoReply 返回数据包
CPU 01: MARK 0x0 FROM 2546 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 01: MARK 0x0 FROM 2546 DEBUG: CT entry found lifetime=342891, revnat=0
CPU 01: MARK 0x0 FROM 2546 DEBUG: CT verdict: Reply, revnat=0
CPU 01: MARK 0x0 FROM 2546 DEBUG: Successfully mapped addr=10.0.2.119 to identity=32359
CPU 01: MARK 0x0 FROM 2546 DEBUG: Encapsulating to node 2886860803 (0xac120003) from seclabel 46188 // 5. 封装到 172.18.0.3 节点的 vxlan 包,并发送给对端 ---- 下面是包信息
------------------------------------------------------------------------------
Ethernet {Contents=[..14..] Payload=[..86..] SrcMAC=b6:f7:9a:c4:59:73 DstMAC=36:07:df:de:f9:af EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=53736 Flags= FragOffset=0 TTL=64 Protocol=ICMPv4 Checksum=37426 SrcIP=10.0.0.24 DstIP=10.0.2.119 Options=[] Padding=[]}
ICMPv4 {Contents=[..8..] Payload=[..56..] TypeCode=EchoReply Checksum=7913 Id=278 Seq=0}
Failed to decode layer: No decoder for layer type Payload
CPU 01: MARK 0x0 FROM 2546 to-overlay: 98 bytes (98 captured), state reply, interface cilium_vxlan, , identity 46188->32359, orig-ip 0.0.0.0 // to-overlay 完成包封装,随后发送到对端节点网卡
CPU 01: MARK 0x0 FROM 630 DEBUG: Conntrack lookup 1/2: src=172.18.0.2:49017 dst=172.18.0.3:8472
CPU 01: MARK 0x0 FROM 630 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 01: MARK 0x0 FROM 630 DEBUG: CT entry found lifetime=342890, revnat=0
CPU 01: MARK 0x0 FROM 630 DEBUG: CT verdict: Established, revnat=0