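Subnet Conflict Between Worker Nodes on a Flannel Network
I recently ran into a problem in a cluster of more than 700 worker nodes that uses the flannel network plugin: two worker nodes kept taking the same subnet from each other. The sequence of events was as follows:
- Host A joined the cluster, was assigned a subnet, and saved it locally in /run/flannel/subnet.env;
- Host A went offline (the flanneld service stopped, so its lease was no longer renewed automatically). Under normal conditions flanneld renews the lease one hour before it expires, i.e. every TTL - 1 hours (23 hours with the default TTL of 24 hours), as shown below: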
Apr 8 03:10:15 host-a flanneld: I0408 03:10:15.140781 9477 main.go:387] Lease renewed. new expiration: 2020-04-08 19:10:15.048434562 +0000 UTC
Apr 9 02:10:15 host-a flanneld: I0409 02:10:15.557775 9477 main.go:387] Lease renewed. new expiration: 2020-04-09 18:10:15.459505825 +0000 UTC
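- flannel in this cluster stores its data directly in etcd v2. Once the entry for Host A's subnet had gone a full TTL without being renewed, etcd deleted it;
- Host B joined the cluster and obtained the subnet that had previously been assigned to Host A, as shown below: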
May 01 15:15:30 host-b flanneld[30914]: I0501 15:15:30.010385 30914 vxlan_network.go:56] watching for new subnet lease
May 01 15:15:30 host-b flanneld[30914]: I0501 15:15:30.011516 30914 main.go:395] Waiting for 22h59m59.987893912s to renew lease
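- When flanneld on Host A was restarted, it found the subnet saved in /run/flannel/subnet.env, reused it directly, and renewed the lease in etcd. Renewing the lease also rewrites the binding between the subnet and the host: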
May 02 13:18:14 host-a flanneld: I0502 13:18:14.339563 192194 main.go:234] Created subnet manager: Etcd Local Manager with Previous Subnet: 192.1.22.0/24
May 02 13:18:14 host-a flanneld: I0502 13:18:14.374957 192194 localmanager.go:177] Found lease (192.1.22.0/24) matching previously leased subnet, reusing
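- At this point Host A and Host B were using the same subnet. Whichever node renewed the lease most recently took over the subnet, and the other node could no longer communicate with the rest of the cluster.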
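Looking at the flannel source, the following code in flannel/subnet/etcdv2/local_manager.go checks whether /run/flannel/subnet.env records a previously assigned subnet and, if so, reuses that subnet and renews the lease. The problem is that it never checks whether the subnet has since been assigned to another worker node.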
    // no existing match, check if there was a previous subnet to use
    var sn ip.IP4Net
    if !m.previousSubnet.Empty() {
        // use previous subnet
        if l := findLeaseBySubnet(leases, m.previousSubnet); l != nil {
            // Make sure the existing subnet is still within the configured network
            if isSubnetConfigCompat(config, l.Subnet) {
                log.Infof("Found lease (%v) matching previously leased subnet, reusing", l.Subnet)

                ttl := time.Duration(0)
                if !l.Expiration.IsZero() {
                    // Not a reservation
                    ttl = subnetTTL
                }
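One possible form of the missing check is sketched below. It is not flannel's actual code or its eventual fix: the `lease` type, `findBySubnet`, and `canReusePrevious` are hypothetical names, and the sketch assumes a lease's owner can be identified by the node's public IP. The idea is simply that a cached subnet is safe to reuse only when no lease exists for it or the existing lease already belongs to the restarting node.

```go
package main

import "fmt"

// lease is a hypothetical, simplified record; only the subnet and the
// owner's public IP matter for this illustration.
type lease struct {
	subnet   string
	publicIP string
}

// findBySubnet returns the lease for subnet, or nil if none exists.
func findBySubnet(leases []lease, subnet string) *lease {
	for i := range leases {
		if leases[i].subnet == subnet {
			return &leases[i]
		}
	}
	return nil
}

// canReusePrevious reports whether a node with address myIP may safely reuse
// its cached subnet: either nobody holds it, or the holder is this node.
func canReusePrevious(leases []lease, previous, myIP string) bool {
	l := findBySubnet(leases, previous)
	return l == nil || l.publicIP == myIP
}

func main() {
	// Illustrative addresses: host-b (10.0.0.2) now holds host-a's old subnet.
	leases := []lease{{subnet: "192.1.22.0/24", publicIP: "10.0.0.2"}}

	// host-a (10.0.0.1) should be refused and pick a new subnet instead.
	fmt.Println(canReusePrevious(leases, "192.1.22.0/24", "10.0.0.1"))
	// host-b is simply renewing its own lease.
	fmt.Println(canReusePrevious(leases, "192.1.22.0/24", "10.0.0.2"))
}
```

With a guard of this kind, a restarting node whose cached subnet is now held by someone else would fall through to normal allocation instead of overwriting the other node's lease.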
This appears to be a bug in flannel. I have filed an issue on GitHub; there is no fix yet.
https://github.com/coreos/flannel/issues/1289
A temporary workaround is to delete /run/flannel/subnet.env and restart the flanneld service, so that the node is assigned a new subnet.
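Before deleting the file on a node, it can be useful to confirm that the cached subnet really is the contested one. The following is a minimal sketch, assuming subnet.env consists of KEY=VALUE lines and that FLANNEL_SUBNET holds the gateway address in CIDR form (for example x.x.x.1/24). `savedSubnet` is a hypothetical helper, and the sample values other than the subnet taken from the logs are illustrative.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// savedSubnet is a hypothetical helper: it extracts FLANNEL_SUBNET from the
// contents of subnet.env and returns the enclosing network in CIDR form.
func savedSubnet(env string) (string, error) {
	const key = "FLANNEL_SUBNET="
	sc := bufio.NewScanner(strings.NewReader(env))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, key) {
			continue
		}
		value := strings.TrimPrefix(line, key)
		// ParseCIDR returns both the address and the network it belongs to;
		// only the network is needed to compare against lease entries.
		_, network, err := net.ParseCIDR(value)
		if err != nil {
			return "", fmt.Errorf("parse FLANNEL_SUBNET %q: %w", value, err)
		}
		return network.String(), nil
	}
	return "", fmt.Errorf("FLANNEL_SUBNET not found")
}

func main() {
	// Sample file contents; in practice they would be read from
	// /run/flannel/subnet.env.
	env := "FLANNEL_NETWORK=192.1.0.0/16\n" +
		"FLANNEL_SUBNET=192.1.22.1/24\n" +
		"FLANNEL_MTU=1450\n" +
		"FLANNEL_IPMASQ=true\n"

	subnet, err := savedSubnet(env)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cached subnet:", subnet)
}
```

If the printed subnet matches the one another node is logging as its lease, removing the file and restarting flanneld as described above should let this node obtain a different subnet.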