The k8s Network Model in Practice
I. Setup
First, we create a set of resources: a Deployment and a Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  selector:
    matchLabels:
      name: nginx1
  replicas: 1
  template:
    metadata:
      labels:
        name: nginx1
    spec:
      nodeName: meizu
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    name: nginx1
spec:
  ports:
  - port: 4432
    targetPort: 80
  selector:
    name: nginx1
As you can see, an nginx Deployment is created on a specific node (meizu), along with a Service pointing at that Pod. We then start a Pod on another node and access the Service from inside it.
src pod -> service -> backend pod
172.30.83.9 -> 10.254.40.119:4432 -> 172.30.20.2:80
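The three addresses in this path can be read straight from the cluster. A minimal sketch for verifying them, assuming kubectl is pointed at this cluster:

kubectl get pods -o wide    # pod IPs and the nodes they were scheduled on
kubectl get svc nginx       # the cluster IP (10.254.40.119) and port (4432) of the service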
All of the operations below are performed on the node that runs the nginx Pod behind the Service.
II. The physical NIC
1. Capturing on the nginx Pod / src Pod addresses
sudo tcpdump -i enp4s0 'dst 172.30.20.2'
sudo tcpdump -i enp4s0 'src 172.30.83.9'
Neither command produces any output.
2. Capturing on the Service address
sudo tcpdump -i enp4s0 'dst 10.254.40.119'
No output either.
3. The node running the src Pod has physical address 10.167.226.38. Now capture all UDP packets sent from that node to local port 8472; note that 8472 is the port flannel listens on.
sudo tcpdump -i enp4s0 'src 10.167.226.38 and port 8472 and udp'
We see the following output:
11:25:22.220286 IP xiaomi.49008 > meizu.otv: OTV, flags [I] (0x08), overlay 0, instance 1
IP 172.30.83.0.38200 > 172.30.20.2.http: Flags [S], seq 154323928, win 29200, options [mss 1460,sackOK,TS val 3546750064 ecr 0,nop,wscale 7], length 0
11:25:22.221179 IP xiaomi.49008 > meizu.otv: OTV, flags [I] (0x08), overlay 0, instance 1
IP 172.30.83.0.38200 > 172.30.20.2.http: Flags [.], ack 4141357270, win 229, options [nop,nop,TS val 3546750065 ecr 248682180], length 0
11:25:22.221383 IP xiaomi.49008 > meizu.otv: OTV, flags [I] (0x08), overlay 0, instance 1
IP 172.30.83.0.38200 > 172.30.20.2.http: Flags [P.], seq 0:81, ack 1, win 229, options [nop,nop,TS val 3546750065 ecr 248682180], length 81: HTTP: GET / HTTP/1.1
11:25:22.221933 IP xiaomi.49008 > meizu.otv: OTV, flags [I] (0x08), overlay 0, instance 1
IP 172.30.83.0.38200 > 172.30.20.2.http: Flags [.], ack 234, win 237, options [nop,nop,TS val 3546750066 ecr 248682181], length 0
11:25:22.221949 IP xiaomi.49008 > meizu.otv: OTV, flags [I] (0x08), overlay 0, instance 1
IP 172.30.83.0.38200 > 172.30.20.2.http: Flags [.], ack 847, win 247, options [nop,nop,TS val 3546750066 ecr 248682181], length 0
11:25:22.222347 IP xiaomi.49008 > meizu.otv: OTV, flags [I] (0x08), overlay 0, instance 1
IP 172.30.83.0.38200 > 172.30.20.2.http: Flags [F.], seq 81, ack 847, win 247, options [nop,nop,TS val 3546750067 ecr 248682181], length 0
We can see that on the physical NIC the packets' source address is the IP of the source Pod's host and the destination address is the IP of the host running the target Pod.
The encapsulated payload, however, carries source address 172.30.83.0, which is the address of the flannel.1 interface on the node running the source Pod, and destination address 172.30.20.2, which is the address of the nginx Pod.
4. Now look for TCP with the same filter
sudo tcpdump -i enp4s0 'src 10.167.226.38 and port 8472 and tcp'
No output. This shows that flannel nodes talk to each other over UDP, not TCP.
Summary: on the physical NIC, everything travels between the physical addresses of the hosts running the source and destination Pods; those packets encapsulate the flannel traffic going from the source node's flannel.1 address to the destination nginx Pod's address.
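The port 8472 and the "instance 1" field in the capture come from the VXLAN parameters of the flannel.1 device, which can be checked with iproute2; a minimal sketch:

ip -d link show flannel.1   # the vxlan line shows the VNI, the local VTEP address and dstport 8472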
III. The flannel.1 interface
Run the following command, where 172.30.83.0 is the flannel.1 address of the node running the source Pod:
[wlh@meizu ~]$ sudo tcpdump -i flannel.1 'host 172.30.83.0 and tcp'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
11:42:43.362239 IP 172.30.83.0.50350 > 172.30.20.2.http: Flags [S], seq 3941617519, win 29200, options [mss 1460,sackOK,TS val 3547791182 ecr 0,nop,wscale 7], length 0
11:42:43.363702 IP 172.30.20.2.http > 172.30.83.0.50350: Flags [S.], seq 3977445704, ack 3941617520, win 27960, options [mss 1410,sackOK,TS val 249723323 ecr 3547791182,nop,wscale 7], length 0
11:42:43.364106 IP 172.30.83.0.50350 > 172.30.20.2.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 3547791184 ecr 249723323], length 0
11:42:43.364180 IP 172.30.83.0.50350 > 172.30.20.2.http: Flags [P.], seq 1:82, ack 1, win 229, options [nop,nop,TS val 3547791184 ecr 249723323], length 81: HTTP: GET / HTTP/1.1
11:42:43.364218 IP 172.30.20.2.http > 172.30.83.0.50350: Flags [.], ack 82, win 219, options [nop,nop,TS val 249723324 ecr 3547791184], length 0
11:42:43.364482 IP 172.30.20.2.http > 172.30.83.0.50350: Flags [P.], seq 1:234, ack 82, win 219, options [nop,nop,TS val 249723324 ecr 3547791184], length 233: HTTP: HTTP/1.1 200 OK
11:42:43.364608 IP 172.30.20.2.http > 172.30.83.0.50350: Flags [FP.], seq 234:846, ack 82, win 219, options [nop,nop,TS val 249723324 ecr 3547791184], length 612: HTTP
11:42:43.364868 IP 172.30.83.0.50350 > 172.30.20.2.http: Flags [.], ack 234, win 237, options [nop,nop,TS val 3547791185 ecr 249723324], length 0
11:42:43.364888 IP 172.30.83.0.50350 > 172.30.20.2.http: Flags [.], ack 847, win 247, options [nop,nop,TS val 3547791185 ecr 249723324], length 0
11:42:43.365226 IP 172.30.83.0.50350 > 172.30.20.2.http: Flags [F.], seq 82, ack 847, win 247, options [nop,nop,TS val 3547791185 ecr 249723324], length 0
11:42:43.365271 IP 172.30.20.2.http > 172.30.83.0.50350: Flags [.], ack 83, win 219, options [nop,nop,TS val 249723325 ecr 3547791185], length 0
We can see that once the packets on the physical NIC are decapsulated they are handed to the flannel.1 interface; the traffic it handles flows from the flannel.1 address of the source Pod's node to the address of the target Pod.
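How flannel.1 knows which physical node is behind a given overlay address is determined by the neighbor and forwarding entries associated with the device; a sketch for inspecting them (depending on the flannel version these may be programmed up front by flanneld or filled in on demand):

ip neigh show dev flannel.1      # overlay IP -> VTEP MAC entries
bridge fdb show dev flannel.1    # VTEP MAC -> remote node IP entries used when encapsulating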
IV. The docker0 bridge
Again, 172.30.83.0 is the flannel.1 address of the host running the source Pod.
[wlh@meizu ~]$ sudo tcpdump -i docker0 'host 172.30.83.0 and tcp'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:00.681066 IP 172.30.83.0.56252 > 172.30.20.2.http: Flags [S], seq 2690808127, win 29200, options [mss 1460,sackOK,TS val 3548288489 ecr 0,nop,wscale 7], length 0
11:51:00.681110 IP 172.30.20.2.http > 172.30.83.0.56252: Flags [S.], seq 115108410, ack 2690808128, win 27960, options [mss 1410,sackOK,TS val 250220641 ecr 3548288489,nop,wscale 7], length 0
11:51:00.681548 IP 172.30.83.0.56252 > 172.30.20.2.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 3548288490 ecr 250220641], length 0
11:51:00.681560 IP 172.30.83.0.56252 > 172.30.20.2.http: Flags [P.], seq 1:82, ack 1, win 229, options [nop,nop,TS val 3548288490 ecr 250220641], length 81: HTTP: GET / HTTP/1.1
11:51:00.681608 IP 172.30.20.2.http > 172.30.83.0.56252: Flags [.], ack 82, win 219, options [nop,nop,TS val 250220641 ecr 3548288490], length 0
11:51:00.681773 IP 172.30.20.2.http > 172.30.83.0.56252: Flags [P.], seq 1:234, ack 82, win 219, options [nop,nop,TS val 250220642 ecr 3548288490], length 233: HTTP: HTTP/1.1 200 OK
11:51:00.681853 IP 172.30.20.2.http > 172.30.83.0.56252: Flags [FP.], seq 234:846, ack 82, win 219, options [nop,nop,TS val 250220642 ecr 3548288490], length 612: HTTP
11:51:00.682018 IP 172.30.83.0.56252 > 172.30.20.2.http: Flags [.], ack 234, win 237, options [nop,nop,TS val 3548288490 ecr 250220642], length 0
11:51:00.682031 IP 172.30.83.0.56252 > 172.30.20.2.http: Flags [.], ack 847, win 247, options [nop,nop,TS val 3548288490 ecr 250220642], length 0
11:51:00.682504 IP 172.30.83.0.56252 > 172.30.20.2.http: Flags [F.], seq 82, ack 847, win 247, options [nop,nop,TS val 3548288491 ecr 250220642], length 0
11:51:00.682523 IP 172.30.20.2.http > 172.30.83.0.56252: Flags [.], ack 83, win 219, options [nop,nop,TS val 250220642 ecr 3548288491], length 0
The traffic here is plain TCP and, just as on flannel.1, it flows between the source host's flannel.1 address and the target Pod's address.
V. The container interface
The capture on the container's own interface looks almost identical to the above, so it is not repeated here.
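To reproduce it anyway, the capture has to be run inside the Pod's network namespace, because the container's eth0 is not visible in the host's interface list. A sketch assuming the Docker runtime, where <container-id> is a placeholder for the nginx container's ID:

PID=$(docker inspect -f '{{.State.Pid}}' <container-id>)    # PID of the container's main process
sudo nsenter -t "$PID" -n tcpdump -i eth0 'tcp port 80'     # capture on the container's eth0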
VI. Summary
The whole path can be summarized as follows:
- First, a process inside the container sends a request to the Service address. The request is handled by the container's own interface, which passes it to the other end of its veth pair. At this point the request is src pod ip -> service ip.
- Next, according to the NAT rules on the node (see the sketch after this list), the destination address is rewritten to the backend Pod's IP and the request is handed to the docker0 bridge. At this point the request is src pod ip -> backend pod ip.
- docker0 forwards the request, and according to the routing table (route -n) it is sent to the flannel.1 interface. At this point the request is flannel.1 ip -> backend pod ip.
- The flannel.1 device then encapsulates the request in a UDP (VXLAN) header addressed to the destination node, using the subnet-to-node mapping that flanneld maintains from etcd, and sends it out. The outer packet is source node ip -> dst node ip.
- Finally, the physical NIC puts the packet on the physical network, and it reaches the remote host.
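The two lookups mentioned in the list, the NAT rewrite and the route that points to flannel.1, can be inspected on the source node; a sketch (KUBE-SERVICES is the chain kube-proxy maintains in iptables mode, and the /24 per-node subnets are what this cluster happens to use):

sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.254.40.119   # jumps into the KUBE-SVC-* chain that DNATs to the backend pod
route -n                                                        # remote pod subnets (e.g. 172.30.20.0/24) via flannel.1, the local subnet on docker0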