Python使用Etcd集群基础教程

1、背景介绍

最近接手了一个项目，项目是使用Python开发的，其中使用到了Etcd，但是项目之前开发的方式，只能够支持单节点连接Etcd，不能够在Etcd节点发生故障时，自动转移。因此需要实现基于现有etcd sdk 开发一个能够实现故障转移的功能，或者更换etcd sdk来实现故障转移等功能。

先来看看项目之前使用到的 etcd 库，即 python-etcd3，通过给出的示例，没有看到可以连接多节点的方式，深入到源码后，也没有发现可以连接多节点的方式，基本上可以断定之前使用到的 etcd sdk 不支持集群方式了。因为项目中不仅仅是使用到了简单的 get、put、delete 等功能，还用到了 watch、lock等功能，所以最好是找到一个可以替换的 sdk，这样开发周期可以缩短，并且也可以减少工作量。

2、寻找可替换的SDK

网上搜了下，发现用的比较多的几个库都不支持集群方式连接，而且也蛮久没有更新了。比如： etcd3-py、 python-etcd3。

那重新找一个 etcd 的sdk 吧，然后在 github 上面搜索，最开始按照默认推荐顺序看了好几个源代码，都是不支持集群方式连接的。

都有点心灰意冷了，突然想到可以换一下 github 的推荐顺序，换成最近有更新的，然后我换成了 Recently updated 搜索，然后从前往后看，在第二页看到了一个库，点击去看了下源代码，发现是通过 grpc 方式调用的 etcd server，点进去看 client.py 文件，看到有一个类是： MultiEndpointEtcd3Client，突然眼前一亮，难道可以，然后更加文档安装了对于的 sdk ,测试发现可以集群连接。

发现可以集群连接后，接下来就是看看项目中用到的其他功能，可以正常使用不，比如： watch、lock 。测试发现都可以正常使用。

接下来就是集成到项目中了，这里就不仔细介绍，大家根据自己实际情况自行调整。

3、etcd-sdk-python 连接集群

官方教程

后面看了 python-etcd3 代码，和这个代码是一样的

etcd-sdk-python 连接集群方式比较简单，需要先创建 Endpoint，然后作为参数，传给 MultiEndpointEtcd3Client。

from pyetcd import MultiEndpointEtcd3Client, Endpoint
from pyetcd.exceptions import ConnectionFailedError

# time_retry 的意思是，当这个节点连接失败后，多少秒后再次去尝试连接
e1 = Endpoint(host="192.168.91.66", port=12379, secure=False, time_retry=30)
e2 = Endpoint(host="192.168.91.66", port=22379, secure=False, time_retry=30)
e3 = Endpoint(host="192.168.91.66", port=32379, secure=False, time_retry=30)

# failover 的意思是，当节点发生故障时，是否进行故障转移，这个参数一定要设置为True，否则当一个节点发生故障时，会报错
c = MultiEndpointEtcd3Client([e1, e2, e3], failover=True)

l = c.lease(10)

data = {"data": 8000}
c.put("/test_ttl", json.dumps(data).encode("utf-8"), lease=l)

time.sleep(5)
b = c.get("/test_ttl")
print(dir(b))

print(dir(b[0]))
print(dir(b[1]))
print(b[1].lease_id)

4、实现一个简约的自动续约的分布式锁

class EtcdGlobalMutex(object):

    def __init__(self, etcd_client, lock_key, ttl=5, acquire_timeout=2):
        self.etcd_client = etcd_client
        self.lock_key = lock_key if not lock_key.startswith("/") else lock_key.lstrip("/")
        self.ttl = ttl if ttl else 5
        self.acquire_timeout = acquire_timeout if acquire_timeout else 2
        self.locker = etcd_client.lock(self.lock_key, ttl)
        self.data = {
            "lock_key": self.lock_key,
            "create_time": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%fZ"),
            "uuid": str(uuid.uuid4()),
            "pid": os.getpid()
        }

    def acquire(self):
        self.data.update({"update_time": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%fZ")})
        self.locker.uuid = json.dumps(self.data).encode("utf-8")  # 覆盖掉，方便查看
        logger.info("EtcdGlobalMutex, acquire data: {}".format(self.data))

        self.locker.acquire(timeout=self.acquire_timeout)

    def refresh_lock(self):
        seconds = math.ceil(self.ttl / 2)
        if seconds == 1 and self.ttl == 1:
            seconds = 0.5

        while True:
            try:
                if not self.locker.is_acquired():  # 测试发现有莫名其妙的 DeleteEvent 事件
                    raise Exception("locker is not valid")
                self.locker.refresh()
            except ConnectionFailedError as e:
                logger.warning(f"refresh_lock. lock_key:{self.lock_key}. ConnectionFailedError, err:{e}")
                time.sleep(0.1)
                continue

            except Exception as e1:
                # 非期望错误，退出
                logger.warning(f"refresh_lock. lock_key:{self.lock_key}. unexpected error. err:{e1}")

                os._exit(1)  # kill 整个进程

            time.sleep(seconds)

    def try_lock(self):
        while True:
            try:
                self.acquire()
                break
            except ConnectionFailedError as e:
                logger.warning(f"try_lock. lock_key:{self.lock_key}. ConnectionFailedError. err:{e}")
                time.sleep(1)
            except Exception as e:
                logger.warning(f"try_lock. lock_key:{self.lock_key}. unexpected error. err:{e}")
                os._exit(1)

        if self.locker.is_acquired():
            logger.info(f"try_lock. lock_key:{self.lock_key}. Lock acquired successfully")
            # 启动刷新锁的线程
            t1 = Thread(target=self.refresh_lock, daemon=True)
            t1.start()
        else:
            logger.info(f"try_lock. lock_key:{self.lock_key}. Failed to acquire lock")
            self._watch_key()

    def _watch_key(self):
        real_key = f"/locks/{self.lock_key}"
        cancel = None
        try:
            logger.info(f"watch_key. lock_key:{self.lock_key}")
            events_iterator, cancel = self.etcd_client.watch(real_key)
            for event in events_iterator:
                logger.info(f"watch_key. lock_key:{self.lock_key}. event: {event}")
                cancel()
                break
        except ConnectionFailedError as e:
            logger.info(f"watch_key. lock_key:{self.lock_key}, ConnectionFailedError err:{e}")
            if cancel:
                cancel()
            time.sleep(1)

            self.etcd_client._clear_old_stubs()

        self.try_lock()



def main():
    name = 'lock_name'
    e = EtcdGlobalMutex(c, name, ttl=10)
    e.try_lock()

    while True:
        print("Main thread sleeping")
        time.sleep(2)


if __name__ == "__main__":
    main()

5、watch key 如何实现？

如果只是单纯的实现一个 watch key 功能，没啥好说的，看看官方给的 api 就可以，因为测试的时候，发现如果一个 etcd 节点挂掉，而这个节点有正好是连接的节点，会出现报错，这个时候需要做一些异常捕获处理。

import math
from threading import Thread
import time

from pyetcd import MultiEndpointEtcd3Client, Endpoint
from pyetcd.exceptions import ConnectionFailedError
from pyetcd.events import PutEvent

e1 = Endpoint(host="192.168.91.66", port=12379, secure=False, time_retry=2)
e2 = Endpoint(host="192.168.91.66", port=22379, secure=False, time_retry=2)
e3 = Endpoint(host="192.168.91.66", port=32379, secure=False, time_retry=2)

c = MultiEndpointEtcd3Client([e1, e2, e3], failover=True)

look_key = "look_key"

def watch(self):
    print('xxx is watching')
    cancel = None
    try:
        events_iterator, cancel = c.watch_prefix(look_key)

        self.watch_key(events_iterator)
    except  ConnectionFailedError as e:
        #　重点就是这里的异常处理
        print(f"xxx. ConnectionFailedError, err:{e}")
        if cancel:
            cancel()
        time.sleep(1)

        c._clear_old_stubs()
        watch()
    except Exception as e1:
        # 非期望错误，退出，防止线程不能退出
        print(f"xxx.  unexpected error. err:{e1}")
        if cancel:
            cancel()

        return


def watch_key(self, events_iterator):
    print("coming watch_key")
    for watch_msg in events_iterator:
        print(watch_msg)
        if type(watch_msg) != PutEvent:
            # 如果不是watch响应的Put信息, 忽略
            continue

        # xxx 处理监听到的信息

通过上面的学习，对　etcd-sdk-python 有一个基础的认识。

虽然上面的代码是实现了对应的功能，但是由于 etcd-sdk-python 支持的并不是很好，所以代码需要做很多特殊处理。后面有时间深入源代码层面，优化下相关的实现。

哈哈，再次感谢大佬的贡献！

6、部署 etcd 集群

集群部署可以看我之前写的文章 02、etcd单机部署和集群部署。

文章中是使用了 goreman 来管理多进程的，如何模拟一个节点挂掉了，其实为了方便测试，我们可以将使用到的 profile 文件，分成三个。profile_etcd1、profile_etcd2、profile_etcd3 。通过使用 goreman 来管理。

goreman -f profile_etcd1 start

posted @ 2024-06-22 12:42 画个一样的我阅读(418) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

相关博文：

· 02、etcd单机部署和集群部署

· 记一次etcd全局锁使用不当导致的事故

· etcd框架实践【Go版】

· 【微服务】Etcd实现服务器注册和发现|Etcd、Eureka、Consul、Zookeeper 比较

· etcd v3版本生产级集群搭建以及实现一键启动脚本

阅读排行：
· [翻译] 为什么 Tracebit 用 C# 开发
· Deepseek官网太卡，教你白嫖阿里云的Deepseek-R1满血版
· 2分钟学会 DeepSeek API，竟然比官方更好用！
· .NET 使用 DeepSeek R1 开发智能 AI 客户端
· 刚刚！百度搜索“换脑”引爆AI圈，正式接入DeepSeek R1满血版

公告

昵称：画个一样的我
园龄： 4年1个月
粉丝： 31
关注： 3

+加关注

2025年2月

日

一

二

三

四

五

六

huageyiyangdewo

Python使用Etcd集群基础教程

1、背景介绍

2、寻找可替换的SDK

3、etcd-sdk-python 连接集群

4、实现一个简约的自动续约的分布式锁

5、watch key 如何实现？

6、部署 etcd 集群

公告

搜索

常用链接

最新随笔

我的标签

随笔分类

随笔档案

阅读排行榜

评论排行榜

推荐排行榜

最新评论