K&

Recovering from lost data files:

etcdctl ${ep} endpoint health status
{"level":"warn","ts":"2021-05-20T13:58:58.712+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-3f460aae-8ddc-4996-a726-a4aae4691573/172.21.130.169:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.21.130.169:2379: connect: connection refused\""}
https://172.21.130.168:2379 is healthy: successfully committed proposal: took = 13.430258ms
https://172.28.17.85:2379 is healthy: successfully committed proposal: took = 15.020918ms
https://172.21.130.169:2379 is unhealthy: failed to commit proposal: context deadline exceeded

 

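Restarting the service does not help, however. With ETCD_INITIAL_CLUSTER_STATE set to "new", etcd tries to bootstrap a brand-new member. The computed member ID is the same as before, but the cluster still lists that member as started, so the start is refused with "member 4c978cbca553cd70 has already been bootstrapped" (see the log after the member list).
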
[root@master ~]# etcdctl ${ep}  member list -w table
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| 4c978cbca553cd70 | started | etcd-1 | https://172.21.130.169:2380 | https://172.21.130.169:2379 |      false |
| 568fd04cf936e056 | started | etcd-3 |   https://172.28.17.85:2380 |   https://172.28.17.85:2379 |      false |
| cc0bba643b3d8ce1 | started | etcd-2 | https://172.21.130.168:2380 | https://172.21.130.168:2379 |      false |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+

May 20 14:00:28 master etcd: {"level":"fatal","ts":"2021-05-20T14:00:28.992+0800","caller":"etcdmain/etcd.go:271","msg":"discovery failed","error":"member 4c978cbca553cd70 has already been bootstrapped","stacktrace":"go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdmain/etcd.go:271\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.16/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}

 

 

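Solutions:

1. Set ETCD_INITIAL_CLUSTER_STATE="existing" and restart the service.

2. Delete the data on every node and restart all of them. This amounts to rebuilding the cluster, so it is not realistic in production unless you have an etcd snapshot, and even then any data written after the snapshot was taken is lost. Not recommended.

3. Remove the member and add it back. This does recover the node, but it is just as much work (a matter of personal preference).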
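
Option 3 can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact procedure used here: rejoin_member is a hypothetical helper name, ${ep} is assumed to hold the same --endpoints and TLS flags as in the commands above, and the member ID, name, and peer URL come from the member list output.

```shell
# Hypothetical helper: drop the broken member from the cluster and register it
# again so it can rejoin with an empty data directory.
# Usage: rejoin_member <member-id> <member-name> <peer-url>
rejoin_member() {
  member_id="$1"; member_name="$2"; peer_url="$3"
  # ${ep} is assumed to hold --endpoints and TLS flags; left unquoted on
  # purpose so that it expands into separate arguments.
  etcdctl ${ep} member remove "$member_id" || return 1
  etcdctl ${ep} member add "$member_name" --peer-urls="$peer_url"
}

# Example, using the values from the member list above:
# rejoin_member 4c978cbca553cd70 etcd-1 https://172.21.130.169:2380
```

After the member has been re-added, the node should start with an empty data directory and ETCD_INITIAL_CLUSTER_STATE="existing", since it is joining a cluster that already exists.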

 

 

ETCD disaster recovery issues

This is the directory layout; data is the data directory:

[root@master etcd]# pwd && tree
/var/local/etcd
.
├── bin
├── cfg
│   └── etcd.conf
├── data
└── ssl
    ├── ca-key.pem
    ├── ca.pem
    ├── client-key.pem
    ├── client.pem
    ├── member-key.pem
    ├── member.pem
    ├── server-key.pem
    └── server.pem

4 directories, 9 files

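The restore writes to the absolute path given by --data-dir. If that directory does not exist, it is created automatically; if it already exists, the command refuses to run, even when the directory is empty: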

[root@master etcd]# etcdctl snapshot restore 2021-05-20.db --data-dir="/var/local/etcd/data"
Error: data-dir "/var/local/etcd/data" exists

 

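If you still want to restore into the original location, remember to delete the folder that held the data first, that is, the data directory. This is what the phrase "the restore creates a new etcd data directory" in the earlier disaster recovery article means: the data directory must be created fresh.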
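
A minimal sketch of clearing the way for the restore. Note the swap: instead of deleting the old data directory outright, this moves it aside so the old files stay available. prepare_data_dir is a hypothetical helper name and the .bak.<timestamp> suffix is an arbitrary choice.

```shell
# Hypothetical helper: move an existing data directory out of the way so that
# `etcdctl snapshot restore --data-dir=...` can create it fresh.
# Prints the backup path if something was moved, nothing otherwise.
prepare_data_dir() {
  dir="$1"
  if [ -e "$dir" ]; then
    backup="${dir}.bak.$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" || return 1
    echo "$backup"
  fi
}

# Example, using the paths from this post:
# prepare_data_dir /var/local/etcd/data
# etcdctl snapshot restore 2021-05-20.db --data-dir="/var/local/etcd/data"
```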

 

[root@master ~]# tree /var/local/etcd/ && etcdctl snapshot restore 2021-05-20.db --data-dir="/var/local/etcd/data"
/var/local/etcd/
├── bin
├── cfg
│   └── etcd.conf
└── ssl
    ├── ca-key.pem
    ├── ca.pem
    ├── client-key.pem
    ├── client.pem
    ├── member-key.pem
    ├── member.pem
    ├── server-key.pem
    └── server.pem

3 directories, 9 files
{"level":"info","ts":1621503919.9106126,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"2021-05-20.db","wal-dir":"/var/local/etcd/data/member/wal","data-dir":"/var/local/etcd/data","snap-dir":"/var/local/etcd/data/member/snap"}
{"level":"info","ts":1621503919.9167044,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1621503919.9224617,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"2021-05-20.db","wal-dir":"/var/local/etcd/data/member/wal","data-dir":"/var/local/etcd/data","snap-dir":"/var/local/etcd/data/member/snap"}
[root@master ~]# 
[root@master ~]# tree /var/local/etcd
/var/local/etcd
├── bin
├── cfg
│   └── etcd.conf
├── data
│   └── member
│       ├── snap
│       │   ├── 0000000000000001-0000000000000001.snap
│       │   └── db
│       └── wal
│           └── 0000000000000000-0000000000000000.wal
└── ssl
    ├── ca-key.pem
    ├── ca.pem
    ├── client-key.pem
    ├── client.pem
    ├── member-key.pem
    ├── member.pem
    ├── server-key.pem
    └── server.pem

7 directories, 12 files

Parameters that do not match can also cause etcd to fail on startup.
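
One possible mismatch is visible in the restore log above: the member was added with the peer URL http://localhost:2380, the etcdctl default, rather than this node's real peer address. The sketch below passes the member identity explicitly. It assumes these values should match what etcd.conf declares (that file is not shown here), and restore_member is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: restore a snapshot with the member identity passed
# explicitly instead of relying on the etcdctl defaults.
# Usage: restore_member <snapshot> <data-dir> <name> <peer-url> <initial-cluster>
restore_member() {
  etcdctl snapshot restore "$1" \
    --data-dir="$2" \
    --name="$3" \
    --initial-advertise-peer-urls="$4" \
    --initial-cluster="$5"
}

# Example for etcd-1, with addresses taken from the member list in this post:
# restore_member 2021-05-20.db /var/local/etcd/data etcd-1 \
#   https://172.21.130.169:2380 \
#   "etcd-1=https://172.21.130.169:2380,etcd-2=https://172.21.130.168:2380,etcd-3=https://172.28.17.85:2380"
```

If etcd.conf also sets ETCD_INITIAL_CLUSTER_TOKEN, the matching --initial-cluster-token flag would presumably need to be passed as well.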

 

 
