卿哥聊技术

导航

 

我的博客均是原创,转载请注名出处

前言

zookeeper 在分布式系统中主要提供了consensus support, 即一致性协议实现。提起一致性协议,首先想到底是大名顶顶的paxos(E.g. google's chubby), 更新的是raft(e.g. etcd), zookeeper是在paxos 之后,raft之前,它的协议名为ZAB(zookeeper atomic broadcast),笔者觉得它是汲取了paxos之精华。当然17年好像出了几篇更新的paper,因为笔者还没读,不敢妄加揣测consensus protocol的future trend,不过zookeeper因为在hadoop eco-system里的广泛应用,作为一个经典的存在,还是值得聊一聊

安装

wget http://www.gtlib.gatech.edu/pub/apache/zookeeper/stable/zookeeper-3.4.10.tar.gz

export ZK_HOME=$YOUR_FAV_DIR

cp conf/zoo_sample.cfg conf/zoo.cfg

修改一下dataDir即可,tickTime is in milliseconds, minimum session timeout is 2 * tickTime, client port 默认2181

make sure java installed and JAVA_HOME env variable exists. following command could use to find java home:

$(dirname $(dirname $(readlink -f $(which javac))))

运行

zkServer.sh start

(output looks like Using config: .../zoo.cfg Starting zookeeper ... STARTED)

zookeeper pid could be got from following command:

ps -ef | grep zookeeper | grep -v grep | awk '{print $2}'

Or run:

jps

listed as QuorumPeerMain

用java-based shell 连接:

zkCli.sh -server localhost:2181

停止运行

zkServer.sh stop

设置多点cluster, called zookeeper ensemble, the minimum recommended number is 3, 5 is most common in production env.

server.1=zoo1:3000:4000
server.2=zoo2:3000:4000
server.3=zoo3:3000:4000

为了测试的需要,也可以单机上设置多点(好多分布式数据库都是这么玩的,不然你让开发者怎么搞):

server.1=localhost:2345:3345
server.2=localhost:2346:3346
server.3=localhost:2347:3347

当然单机上略微麻烦点,需要搞3个cfg, 比如zoo1.cfg, zoo2.cfg, zoo3.cfg, 同时3个myid,  zoo1/myid, zoo2/myid, zoo3/myid,在每个cfg中dataDir都指向相应的zoo1, zoo2, zoo3,同时clientPort为2181,2182,2183, 启动的时候:

zkCli.sh -server localhost:2181, localhost:2182, localhost:2183

 Zookeeper 总揽

Detail: http://zookeeper.apache.org/doc/r3.4.10/zookeeperOver.html

Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.

Zookeeper 数据模型

如果把zookeeper的数据模型看作是文件系统,那么zode就是其中的一个节点,有四种znode:

  • persistent
  • ephemeral
  • persistent_sequential
  • ephemeral_sequential

persistent znode 的lifetime是知道它们被显示delete, 任何有权限删除的client都可以删除,它主要适用于存储highly available and accessible数据,比如configuration data

ephemeral zonde visible by all clients(受ACL限制), 一旦客户session退出,就会被删除, 当然delete也可以删除,它可以适用于membership service,节点加入,删除,一目了然。

Zookeeper watch

zookeeper client could choose to watch a znode and get notification when there is change, E.g. clientA sets watch on /group, now if clientA does getChildren on /group, [clientA]. Now if clientB joins /group at this time, then clientA will be notified, and client A could getChildren again, now [clientA, clientB] will be returned.

  • the watch is ordered in FIFO and dispatched in order
  • watch notification delivered to client before other changes happen, although once client received, there could be other things happen there
  • order of the watch events is ordered in order seen by zookeeper

 另外session timeout是有zookeeper服务器端检测的,所以一个当一个client和一台服务器断开时,watch会被trigger,但是client如果重新加入了另一台同cluster的服务器,由于session ID,所以client的信息会重新被找回(只要不超过expire time)

Kazoo

zookeeper 支持java,C,python(社区支持)还有其它语言(社区支持),这里以Kazoo这个python client binding举例:

watcher:

import signal
from kazoo.client import KazooClient
from kazoo.recipe.watchers import ChildrenWatch

zoo_path = '/MyPath'
zk = KazooClient(hosts='localhost:2181')
zk.start()
zk.ensure_path(zoo_path)
@zk.ChildrenWatch(zoo_path)
def child_watch_func(children):
    print "List of Children %s" % children
while True:
    signal.pause()

output looks like

/usr/bin/python child_watch.py
List of Children []
List of Children [u'child1']
List of Children [u'child1', u'child2']
List of Children [u'child1', u'child3', u'child2']
List of Children [u'child1', u'child2']
List of Children [u'child1']
List of Children []
On another terminal: WatchedEvent state:SyncConnected type:None path:null [zk: localhost:2181(CONNECTED) 0] create /MyPath/child1 "" Created /MyPath/child1 [zk: localhost:2181(CONNECTED) 1] create /MyPath/child2 "" Created /MyPath/child2 [zk: localhost:2181(CONNECTED) 2] create /MyPath/child3 "" Created /MyPath/child3 [zk: localhost:2181(CONNECTED) 3] delete /MyPath/child3 [zk: localhost:2181(CONNECTED) 4] delete /MyPath/child2 [zk: localhost:2181(CONNECTED) 5] delete /MyPath/child1

Leader election:

from kazoo.client import KazooClient
import time
import uuid

import logging
logging.basicConfig()

my_id = uuid.uuid4()

def leader_func():
    print "I am leader : %s" % my_id
    while True:
        print "%s is still working" % my_id
        time.sleep(3)

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

election = zk.Election("/electionpath")

election.run(leader_func)

zk.stop()

output looks like:

$ /usr/bin/python leader_election.py 
I am leader : 961165dc-c6f7-4858-a315-0be40fca1a47
961165dc-c6f7-4858-a315-0be40fca1a47 is still working
961165dc-c6f7-4858-a315-0be40fca1a47 is still working
961165dc-c6f7-4858-a315-0be40fca1a47 is still working
961165dc-c6f7-4858-a315-0be40fca1a47 is still working
961165dc-c6f7-4858-a315-0be40fca1a47 is still working

ctrl-C

Then another work will become leader

Zookeeper 应用

Barrier 

single barrier: to start with, a znode named say "/zk_barrier" is created, clients all watch it, once it not exists, clients do something.

double barrier: block until N number of clients join, then do something, block until N number of clients leave, then do something

Queue

A znode say "queue-znode" is designated to hold a queue, producer adds item to the queue by creating child znode under queue-znode with sequential flag, consumer retrieve the item by getting and deleting a child from queue-znode

Lock

distributed lock, assuming there is a "/lock_node", then each client try creat EPHEMERAL_SEQUENTIAL node, and call get children, if the node someone created is the lowest number, then it's the owner of lock, the next number 需要watch 它之前紧挨着的number,如果那个number突然消失,get_all,重复这个过程

Leader election

和Lock道理差不多,最小的当leader

Group membership

/membership node, getChildren("/membership", true), get with watch, client 刚来时会注册一个ephemeral znode,这样大家都会得到通知,同时getChildren可以告诉当前的membership情况。

2 Phase commit

Service Discovery

比如hosts在/services下面注册服务通过创立ephemeral node,如果是caching可以是/services/caching,如果是storage可以是/services/storage,当clients访问系统的时候,register watch在/services path上,这样新的host加入或离开就会被client知道

posted on 2018-01-22 11:33  卿哥聊技术  阅读(271)  评论(0编辑  收藏  举报