etcd error: failed to send out heartbeat on time

Error output:

2019-06-05 02:09:03.008888 W | rafthttp: health check for peer 8816eaa680e63c73 could not connect: dial tcp 192.168.49.138:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-06-05 02:09:03.010827 W | rafthttp: health check for peer 8816eaa680e63c73 could not connect: dial tcp 192.168.49.138:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05 02:09:04.631367 I | rafthttp: peer 8816eaa680e63c73 became active
2019-06-05 02:09:04.631405 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream MsgApp v2 reader)
2019-06-05 02:09:04.632227 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream Message reader)
2019-06-05 02:09:04.634697 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream MsgApp v2 writer)
2019-06-05 02:09:04.635154 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream Message writer)
2019-06-05 02:09:04.961320 I | etcdserver: updating the cluster version from 3.0 to 3.3
2019-06-05 02:09:04.965052 N | etcdserver/membership: updated the cluster version from 3.0 to 3.3
2019-06-05 02:09:04.965231 I | etcdserver/api: enabled capabilities for version 3.3

2019-06-05 02:20:39.344648 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 237.022208ms, to a3d1fb0d28ed2953)
2019-06-05 02:20:39.344676 W | etcdserver: server is likely overloaded
2019-06-05 02:20:39.344685 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 237.127928ms, to 8816eaa680e63c73)
2019-06-05 02:20:39.344689 W | etcdserver: server is likely overloaded

The key error message is: failed to send out heartbeat on time (exceeded the 100ms timeout for 401.80886ms)
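
To see how often and by how much the heartbeat deadline is missed, the warning lines can be pulled out of the log. The following is a minimal sketch, assuming both durations are printed in milliseconds as in the excerpt above (etcd may print other units for longer delays); the function name and the returned dictionary keys are made up for illustration.

```python
import re

# Matches the warning text shown in the log excerpt; the ", to <peer>" part is optional.
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>[\d.]+)ms timeout for (?P<over>[\d.]+)ms"
    r"(?:, to (?P<peer>[0-9a-f]+))?\)"
)


def parse_heartbeat_warning(line):
    """Return timeout, overshoot and peer ID from one log line, or None if it is not a heartbeat warning."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    return {
        "timeout_ms": float(m["timeout"]),
        "exceeded_ms": float(m["over"]),
        "peer": m["peer"],  # None when the line carries no peer ID
    }
```

Feeding every log line through parse_heartbeat_warning and grouping by peer shows whether all followers are affected (pointing at the leader's disk or CPU) or only one (pointing at the network path to that peer).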

Heartbeat warnings like this are mainly related to three factors: disk speed, CPU performance, and network instability.

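etcd uses the Raft algorithm, in which the leader periodically sends a heartbeat to every follower. If the leader skips two heartbeat intervals without sending a heartbeat to a follower, etcd prints this log line as a warning. The most common cause is a slow disk: the leader usually attaches some metadata to its heartbeats and may need to persist that metadata to disk before sending. The disk write may be competing with other applications, or the disk may simply be slow because it is a virtual disk or a SATA drive; in that case only better, faster disk hardware will fix the problem. The metric etcd exposes to Prometheus, etcd_disk_wal_fsync_duration_seconds (wal_fsync_duration_seconds), reports how long WAL fsync calls take; its 99th percentile should normally stay below 10ms.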
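
When Prometheus is not available, a rough probe can give a first impression of how fast the disk syncs small writes. The sketch below is not etcd's own metric: it appends a small payload to a temporary file and times os.fsync, then takes a nearest-rank percentile. The function names, the sample count, and the payload size are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def measure_fsync_ms(directory, samples=200, payload=b"x" * 2048):
    """Append `payload` and fsync it `samples` times; return each fsync latency in ms."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceil(len * pct / 100) in integer math
    return ordered[rank - 1]
```

Calling percentile(measure_fsync_ms(d), 99) with d set to a directory on the same disk as the etcd data directory gives a number to compare against the 10ms guideline. The probe only approximates etcd's write pattern, so treat the result as a hint.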
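The second cause is insufficient CPU capacity. If the monitoring system shows that CPU utilization really is high, move etcd to a more capable machine, then use cgroups to guarantee the etcd process exclusive use of certain cores, or raise the priority of the etcd process.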
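
As a lightweight stand-in for a full cgroups setup, the idea can be sketched with CPU affinity and a nice value. This is a different technique from cgroups: pinning restricts etcd to the chosen cores but does not keep other processes off them. The sketch is Linux-only, the function name is hypothetical, and lowering the nice value (raising priority) normally requires root.

```python
import os


def pin_and_prioritize(pid, cpus, nice_value):
    """Pin `pid` to the CPU set `cpus` and set its nice value (Linux only).

    Returns the affinity and nice value read back from the kernel.
    """
    os.sched_setaffinity(pid, cpus)                    # restrict the process to these cores
    os.setpriority(os.PRIO_PROCESS, pid, nice_value)   # lower nice value = higher priority
    return os.sched_getaffinity(pid), os.getpriority(os.PRIO_PROCESS, pid)
```

For example, pin_and_prioritize(etcd_pid, {2, 3}, -5) would keep the process on cores 2 and 3 at a raised priority, where etcd_pid and the core numbers are placeholders for the real values on the host.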
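The third possible cause is a slow network. If Prometheus shows poor network quality, such as high latency or a high packet loss rate, moving etcd to a less congested network solves the problem. If etcd is deployed across data centers, however, long latency is unavoidable; in that case tune heartbeat-interval according to the RTT between the data centers, and set election-timeout to at least 5 times heartbeat-interval.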
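
The arithmetic behind that rule can be written down as a small helper. This is a sketch under stated assumptions: it takes the heartbeat interval to be the measured RTT rounded up to whole milliseconds, never below the 100ms default seen in the log above, and returns the smallest election timeout the 5x rule allows. The function name and the floor are illustrative choices.

```python
import math


def suggest_timeouts(rtt_ms, min_heartbeat_ms=100, election_factor=5):
    """Return (heartbeat_interval_ms, minimum_election_timeout_ms) for a given RTT."""
    heartbeat = max(min_heartbeat_ms, math.ceil(rtt_ms))  # roughly match the RTT
    election = election_factor * heartbeat                # lower bound from the 5x rule
    return heartbeat, election


def as_flags(heartbeat_ms, election_ms):
    """Format the values as etcd command-line flags (both are in milliseconds)."""
    return f"--heartbeat-interval={heartbeat_ms} --election-timeout={election_ms}"
```

With a 250ms RTT between data centers, suggest_timeouts(250) yields a 250ms heartbeat and an election timeout of at least 1250ms. Every member of the cluster should be started with the same pair of values.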

Reference
https://blog.csdn.net/linux_player_c/article/details/79875806

posted @ 2019-06-05 17:39 漂泊的蒲公英
