Packet captures at each stage when Calico uses IPIP encapsulation
Cluster information
Calico configuration
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  blockSize: 26
  cidr: 10.10.0.0/16
  ipipMode: Always
  natOutgoing: false
  nodeSelector: all()
  vxlanMode: Never
Calico version: v3.19.1
Test environment
k8s-node-4 192.168.99.204 podA 10.10.55.134
k8s-node-5 192.168.99.205 podB 10.10.86.131
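The pool above uses blockSize: 26, so each node is handed /26 blocks (64 addresses) carved out of 10.10.0.0/16. A minimal sketch, using only Python's standard ipaddress module, shows which block each test pod falls into; the function name calico_block is a hypothetical helper for illustration, not part of Calico.

```python
import ipaddress

def calico_block(pod_ip: str, pool_cidr: str = "10.10.0.0/16",
                 block_size: int = 26) -> ipaddress.IPv4Network:
    """Return the /block_size block that contains pod_ip (illustrative helper)."""
    pool = ipaddress.ip_network(pool_cidr)
    if ipaddress.ip_address(pod_ip) not in pool:
        raise ValueError(f"{pod_ip} is not inside pool {pool_cidr}")
    # strict=False masks off the host bits, leaving the block's network address.
    return ipaddress.ip_network(f"{pod_ip}/{block_size}", strict=False)

for name, ip in (("podA", "10.10.55.134"), ("podB", "10.10.86.131")):
    block = calico_block(ip)
    print(name, ip, "->", block, f"({block.num_addresses} addresses)")
```

These blocks, 10.10.55.128/26 and 10.10.86.128/26, are the same prefixes that appear in the node routing table later in this post.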
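Packet captures along the path
When podA accesses podB, the captures at each stage are as follows.
Capture on calic211b2bb019 on node k8s-node-4
Packets sent from the pod travel over a veth pair directly to the corresponding cali* interface on the host (the pairing can be checked from inside the pod with cat /sys/class/net/eth0/iflink, which prints the interface index of the host-side peer).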
root@k8s-node-4:~# tcpdump -i calic211b2bb019 tcp and port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on calic211b2bb019, link-type EN10MB (Ethernet), capture size 262144 bytes
12:04:02.681304 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [S], seq 2695070920, win 64860, options [mss 1410,sackOK,TS val 3501275550 ecr 0,nop,wscale 7], length 0
12:04:02.681779 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [S.], seq 2150062280, ack 2695070921, win 65160, options [mss 1460,sackOK,TS val 2555844366 ecr 3501275550,nop,wscale 7], length 0
12:04:02.681789 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [.], ack 1, win 507, options [nop,nop,TS val 3501275550 ecr 2555844366], length 0
12:04:02.683390 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3501275552 ecr 2555844366], length 76: HTTP: GET / HTTP/1.1
12:04:02.683711 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [.], ack 77, win 509, options [nop,nop,TS val 2555844368 ecr 3501275552], length 0
12:04:02.683948 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 2555844368 ecr 3501275552], length 142: HTTP: HTTP/1.1 200 OK
12:04:02.683952 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [.], ack 143, win 506, options [nop,nop,TS val 3501275553 ecr 2555844368], length 0
12:04:02.684653 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3501275553 ecr 2555844368], length 0
12:04:02.684896 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 2555844369 ecr 3501275553], length 0
12:04:02.684901 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [.], ack 144, win 506, options [nop,nop,TS val 3501275554 ecr 2555844369], length 0
Here the source IP is podA's IP and the destination IP is podB's IP.
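The iflink check mentioned earlier works because, for a veth pair, the iflink value of one end equals the ifindex of its peer. The sketch below is one way to resolve that number to an interface name on the host by scanning sysfs. The function name find_host_veth and its sysfs_net parameter are assumptions made for illustration (the parameter exists only so the lookup can be pointed at a test directory).

```python
import os
from typing import Optional

def find_host_veth(peer_ifindex: int,
                   sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the host interface whose ifindex equals peer_ifindex, or None.

    peer_ifindex is the number printed inside the pod by
    `cat /sys/class/net/eth0/iflink`.
    """
    for name in sorted(os.listdir(sysfs_net)):
        try:
            with open(os.path.join(sysfs_net, name, "ifindex")) as f:
                if int(f.read().strip()) == peer_ifindex:
                    return name
        except (OSError, ValueError):
            # Entries without a readable ifindex file are skipped.
            continue
    return None
```

On the host, calling find_host_veth(N) with the value N read in the pod should return the matching cali* interface name, which is then the interface to pass to tcpdump -i.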
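Capture on tunl0 on node k8s-node-4
After a packet from the pod appears on the cali* interface and has passed through the PREROUTING and other rules installed by kube-proxy, the node's routing table is consulted using the destination IP (podB's IP) to decide how to forward it.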
root@k8s-node-4:~# ip r
default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100
blackhole 10.10.55.128/26 proto bird
10.10.55.129 dev cali280cc1befad scope link
10.10.55.131 dev calie5904e003ea scope link
10.10.55.132 dev calida94a24526a scope link
10.10.55.133 dev cali37c7a3c7cb7 scope link
10.10.55.134 dev calic211b2bb019 scope link
10.10.55.135 dev cali2255478075d scope link
10.10.55.136 dev calie8f72551915 scope link
10.10.55.138 dev cali466ba5a1a55 scope link
10.10.55.139 dev cali8d25c92c70a scope link
10.10.76.128/26 via 192.168.99.203 dev tunl0 proto bird onlink
10.10.86.128/26 via 192.168.99.205 dev tunl0 proto bird onlink ### this route is matched
10.10.140.64/26 via 192.168.99.202 dev tunl0 proto bird onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.99.0/24 dev enp0s8 proto kernel scope link src 192.168.99.204
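As shown above, the matched route says that the next hop (via) is 192.168.99.205, the IP of the node hosting podB, and that the packet is sent out through the tunl0 device. tunl0 is a tunnel device (not to be confused with the tun/tap device used by flannel): it prepends an additional IP header to the original packet, and the destination IP of that outer header is the next-hop address of the matched route.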
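The route selection described here is a longest-prefix match. The sketch below replays it over a subset of the routes from the ip r output, using Python's ipaddress module. It is a simplified model, not the kernel's algorithm: metrics, the blackhole route, and policy routing are ignored, and the ROUTES list and lookup function are names introduced only for this illustration.

```python
import ipaddress

# (prefix, next hop or None, device): a subset of the k8s-node-4 routing table.
ROUTES = [
    ("0.0.0.0/0", "10.0.2.2", "enp0s3"),
    ("10.0.2.0/24", None, "enp0s3"),
    ("10.10.55.134/32", None, "calic211b2bb019"),
    ("10.10.76.128/26", "192.168.99.203", "tunl0"),
    ("10.10.86.128/26", "192.168.99.205", "tunl0"),
    ("10.10.140.64/26", "192.168.99.202", "tunl0"),
    ("192.168.99.0/24", None, "enp0s8"),
]

def lookup(dst: str):
    """Return (network, via, dev) of the most specific route containing dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for cidr, via, dev in ROUTES:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, via, dev)
    return best

net, via, dev = lookup("10.10.86.131")  # podB
print(f"matched {net} via {via} dev {dev}")
```

For podB's address the result is the 10.10.86.128/26 route via 192.168.99.205 on tunl0, so 192.168.99.205 becomes the destination of the outer IP header.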
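Note that if Calico runs as a pure layer-3 network, i.e. without IPIP or VXLAN encapsulation, via instead tells the kernel to set the destination MAC address of the layer-2 frame to the MAC address of the external interface on podB's node. The destination node then acts as a gateway, and the IP packet sent by the pod is delivered to it directly at layer 2.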
root@k8s-node-4:~# tcpdump -i tunl0 tcp and port 80 and host 10.10.55.134
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
12:45:14.378775 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [S], seq 486181493, win 64860, options [mss 1410,sackOK,TS val 3503747247 ecr 0,nop,wscale 7], length 0
12:45:14.379210 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [S.], seq 348894761, ack 486181494, win 65160, options [mss 1460,sackOK,TS val 2558316064 ecr 3503747247,nop,wscale 7], length 0
12:45:14.379234 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [.], ack 1, win 507, options [nop,nop,TS val 3503747248 ecr 2558316064], length 0
12:45:14.380742 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3503747249 ecr 2558316064], length 76: HTTP: GET / HTTP/1.1
12:45:14.381162 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [.], ack 77, win 509, options [nop,nop,TS val 2558316065 ecr 3503747249], length 0
12:45:14.381298 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 2558316066 ecr 3503747249], length 142: HTTP: HTTP/1.1 200 OK
12:45:14.381363 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [.], ack 143, win 506, options [nop,nop,TS val 3503747250 ecr 2558316066], length 0
12:45:14.382243 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3503747251 ecr 2558316066], length 0
12:45:14.382705 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 2558316067 ecr 3503747251], length 0
12:45:14.382735 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [.], ack 144, win 506, options [nop,nop,TS val 3503747251 ecr 2558316067], length 0
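Again the source IP is podA's IP and the destination IP is podB's IP; on tunl0 the packet is still seen without the outer header.
Capture on enp0s8 on node k8s-node-4
The packets on enp0s8, the interface used for node-to-node traffic, have already been encapsulated by tunl0 (an extra IP header has been added). When capturing with tcpdump, filter on ip rather than tcp: with a tcp filter, tcpdump interprets the raw IP packet as ip[tcp], but after encapsulation by the IPIP module the packet is laid out as ip[ip[tcp]], so a tcp filter captures nothing. TCP-specific filter arguments such as port 80 fail for the same reason. One workaround is to filter on ip and pipe the output through grep (tcpdump also prints the inner IP header), as follows: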
root@k8s-node-4:~# tcpdump -i enp0s8 ip and host 192.168.99.205 | grep 10.10.86.131.http
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp0s8, link-type EN10MB (Ethernet), capture size 262144 bytes
12:51:52.288259 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [S], seq 1172730419, win 64860, options [mss 1410,sackOK,TS val 3504145157 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
12:51:52.288717 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [S.], seq 3623141981, ack 1172730420, win 65160, options [mss 1460,sackOK,TS val 2558713973 ecr 3504145157,nop,wscale 7], length 0 (ipip-proto-4)
12:51:52.288761 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [.], ack 1, win 507, options [nop,nop,TS val 3504145157 ecr 2558713973], length 0 (ipip-proto-4)
12:51:52.288806 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3504145157 ecr 2558713973], length 76: HTTP: GET / HTTP/1.1 (ipip-proto-4)
12:51:52.289089 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [.], ack 77, win 509, options [nop,nop,TS val 2558713973 ecr 3504145157], length 0 (ipip-proto-4)
12:51:52.289277 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 2558713974 ecr 3504145157], length 142: HTTP: HTTP/1.1 200 OK (ipip-proto-4)
12:51:52.289318 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [.], ack 143, win 506, options [nop,nop,TS val 3504145158 ecr 2558713974], length 0 (ipip-proto-4)
12:51:52.289576 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3504145158 ecr 2558713974], length 0 (ipip-proto-4)
12:51:52.289856 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 2558713974 ecr 3504145158], length 0 (ipip-proto-4)
12:51:52.289891 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [.], ack 144, win 506, options [nop,nop,TS val 3504145159 ecr 2558713974], length 0 (ipip-proto-4)
In the outer IP header, the source IP is k8s-node-4's node IP and the destination IP is k8s-node-5's node IP.
In the inner IP header, the source IP is podA's IP and the destination IP is podB's IP.
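The ip[ip[tcp]] nesting can be made concrete with a small sketch that builds both headers by hand and reads back their protocol fields. It uses only the standard struct and socket modules. The helpers ipv4_header, parse_ipv4, and tcp_filter_matches are hypothetical names for this illustration; checksums are left at zero and the TCP segment is a 20-byte placeholder, so this models the layout only and is not a packet to send on a real network. tcp_filter_matches is a simplified stand-in for what a tcp capture filter tests, namely the protocol field of the first IP header.

```python
import socket
import struct

IPPROTO_IPIP = 4  # IP-in-IP, shown by tcpdump as "ipip-proto-4"
IPPROTO_TCP = 6

def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    """Pack a 20-byte IPv4 header (no options, checksum left at 0)."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + payload_len, 0, 0,
                       64, proto, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))

def parse_ipv4(packet: bytes):
    """Return (protocol, src, dst, payload) of the first IPv4 header."""
    fields = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    header_len = (fields[0] & 0x0F) * 4
    return (fields[6], socket.inet_ntoa(fields[8]),
            socket.inet_ntoa(fields[9]), packet[header_len:])

def tcp_filter_matches(packet: bytes) -> bool:
    """Simplified model of a `tcp` filter: only the first header is checked."""
    return parse_ipv4(packet)[0] == IPPROTO_TCP

# Placeholder TCP segment: source port 48402, destination port 80, rest zeroed.
tcp = struct.pack("!HH", 48402, 80) + bytes(16)
inner = ipv4_header("10.10.55.134", "10.10.86.131", IPPROTO_TCP, len(tcp)) + tcp
outer = ipv4_header("192.168.99.204", "192.168.99.205",
                    IPPROTO_IPIP, len(inner)) + inner

print("tunl0 view matches tcp filter: ", tcp_filter_matches(inner))
print("enp0s8 view matches tcp filter:", tcp_filter_matches(outer))
proto, src, dst, payload = parse_ipv4(outer)
print("outer:", src, ">", dst, "proto", proto)
print("inner:", parse_ipv4(payload)[1], ">", parse_ipv4(payload)[2])
```

Because the outer header carries protocol number 4, a capture filter such as ip proto 4 selects the encapsulated traffic on enp0s8 directly, as an alternative to piping through grep.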
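When the existing layer-3 network between the nodes delivers the packet to the destination node k8s-node-5, the kernel recognizes that it was encapsulated by the IPIP driver. The driver decapsulates it to recover the original IP packet, which is then forwarded to the cali* interface by a route like the following on the node, and finally reaches the pod.
### Felix in calico-node creates a route like the following for every pod on the node, to deliver incoming IP packets to that pod
10.10.86.131 dev cali2be2e0f309a scope link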
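Notes
The above describes access between pods on two different nodes. Accessing a pod on another node directly from a node is slightly different.
Access podB directly from k8s-node-4 and capture on enp0s8 of k8s-node-4: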
root@k8s-node-4:~# tcpdump -i enp0s8 ip and host 192.168.99.205 -nn | grep 10.10.86.131.80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp0s8, link-type EN10MB (Ethernet), capture size 262144 bytes
13:21:03.055260 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [S], seq 1659225601, win 64800, options [mss 1440,sackOK,TS val 3518023510 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
13:21:03.055706 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [S.], seq 3905354649, ack 1659225602, win 65160, options [mss 1460,sackOK,TS val 3788728882 ecr 3518023510,nop,wscale 7], length 0 (ipip-proto-4)
13:21:03.055755 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [.], ack 1, win 507, options [nop,nop,TS val 3518023510 ecr 3788728882], length 0 (ipip-proto-4)
13:21:03.055877 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3518023510 ecr 3788728882], length 76: HTTP: GET / HTTP/1.1 (ipip-proto-4)
13:21:03.056176 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [.], ack 77, win 509, options [nop,nop,TS val 3788728883 ecr 3518023510], length 0 (ipip-proto-4)
13:21:03.056406 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 3788728883 ecr 3518023510], length 142: HTTP: HTTP/1.1 200 OK (ipip-proto-4)
13:21:03.056432 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [.], ack 143, win 506, options [nop,nop,TS val 3518023511 ecr 3788728883], length 0 (ipip-proto-4)
13:21:03.056824 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3518023511 ecr 3788728883], length 0 (ipip-proto-4)
13:21:03.057111 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 3788728883 ecr 3518023511], length 0 (ipip-proto-4)
13:21:03.057136 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [.], ack 144, win 506, options [nop,nop,TS val 3518023512 ecr 3788728883], length 0 (ipip-proto-4)
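In the outer IP header, the source IP is k8s-node-4's node IP and the destination IP is k8s-node-5's node IP.
The difference is in the inner IP header: the source IP is the tunl0 IP of k8s-node-4 (10.10.55.128 in the capture) and the destination IP is podB's IP.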
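In other words, when a pod is accessed directly from a node, the tunl0 IP is used as the source IP of the original packet. This makes the destination node apply IPIP encapsulation to the reply as well: the source IP (the tunl0 IP) belongs to the source node's pod subnet (which Calico calls an IP block), so the reply is routed through tunl0. If the source IP were still 192.168.99.204, the reply would bypass tunl0 on the destination node, and the source node would see inconsistent traffic and drop it. Put differently: if an IP packet goes through the IPIP module on the source node, the reply must also go through IPIP on the destination node.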
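Looking back at the natOutgoing setting in the Calico configuration at the top: it controls whether SNAT is applied when a pod accesses an IP that is not a pod IP. When it is set to true, Calico (through Felix) adds the iptables rules on the node that perform the SNAT.
If natOutgoing is false, pods cannot reach the IPs of other nodes in the cluster. The reason is the reverse of the case above: when a pod accesses another node's IP and SNAT does not rewrite the source IP to its own node's IP, the destination node sees a pod IP as the source when it replies, so its routing table hands the reply to tunl0 for encapsulation and the traffic becomes inconsistent. Put differently: if an IP packet does not go through the IPIP module on the source node, the reply must not go through IPIP on the destination node either.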
For this reason, Calico recommends setting natOutgoing to true when using IPIP (or VXLAN) mode. See:
issue: https://github.com/projectcalico/calicoctl/issues/1296
doc: https://docs.projectcalico.org/archive/v3.19/reference/resources/ippool
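The two symmetry rules from the notes above can be checked with a toy model. It assumes, as this post does, that a node encapsulates a packet exactly when the destination falls in another node's pod block (the tunl0 routes in the routing table). POD_BLOCKS, encapsulated, and round_trip are names invented for this sketch; it mirrors the reasoning in the text and is not derived from Calico's source.

```python
import ipaddress

# Pod blocks per node, taken from the routing table and test environment above.
POD_BLOCKS = {
    "k8s-node-4": ipaddress.ip_network("10.10.55.128/26"),
    "k8s-node-5": ipaddress.ip_network("10.10.86.128/26"),
}

def encapsulated(sender_node: str, dst_ip: str) -> bool:
    """True if sender_node would route dst_ip through tunl0."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in block
               for node, block in POD_BLOCKS.items() if node != sender_node)

def round_trip(src_node: str, dst_node: str, src_ip: str, dst_ip: str):
    """Return (forward path uses IPIP, reply path uses IPIP)."""
    return encapsulated(src_node, dst_ip), encapsulated(dst_node, src_ip)

cases = {
    "podA -> podB": ("10.10.55.134", "10.10.86.131"),
    "node4 -> podB, src = node IP": ("192.168.99.204", "10.10.86.131"),
    "node4 -> podB, src = tunl0 IP": ("10.10.55.128", "10.10.86.131"),
    "podA -> node5 IP, no SNAT": ("10.10.55.134", "192.168.99.205"),
    "podA -> node5 IP, SNAT to node IP": ("192.168.99.204", "192.168.99.205"),
}
for label, (src, dst) in cases.items():
    fwd, rep = round_trip("k8s-node-4", "k8s-node-5", src, dst)
    print(f"{label}: forward={fwd} reply={rep} symmetric={fwd == rep}")
```

In this model the two asymmetric cases are exactly the ones the notes describe as broken: a node-sourced packet that keeps the node IP as its source, and a pod-sourced packet to a node IP without SNAT.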