Zookeeper Study Notes
Zookeeper is a general-purpose coordination service.
The ZooKeeper service comprises an ensemble of servers that use replication to achieve high availability and performance.
What do we mean by coordination as a service?
Example: VMware-FT's test-and-set server
If one replica can't talk to the other, grabs t-a-s lock, becomes sole server
Must be exclusive to avoid two primaries (e.g. if network partition)
Must be fault-tolerant
Example: GFS (more speculative)
Perhaps agreement on which meta-data replica should be master
Perhaps recording list of chunk servers, which chunks, who is primary
Other examples: MapReduce, YMB (Yahoo! Message Broker), Crawler, etc.
Who is the master; list of workers (Group Membership)
Master failover (Failure Detection & Leader Election)
Division of work; status of tasks (Configuration Management)
A general-purpose service would save much effort!
Could we use a linearizable key/value store as a generic coordination service?
For example, to choose new GFS master if multiple replicas want to take over?
perhaps
Put("master", my IP address)
if Get("master") == my IP address:
act as master
problem: a racing Put() may execute after the Get()
2nd Put() overwrites first, so two masters, oops
Put() and Get() are not a good API for mutual exclusion!
problem: what to do if master fails?
perhaps master repeatedly Put()s a fresh timestamp?
lots of polling...
problem: clients need to know when master changes
periodic Get()s?
lots of polling...
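As an illustration of the race above, a minimal sketch against a hypothetical linearizable key/value client (the KVStore interface, put, and get are made-up names, not a real library):

    // Hypothetical interface: a linearizable key/value store client.
    // "KVStore", "put", and "get" are illustrative names only.
    interface KVStore {
        void put(String key, String value);
        String get(String key);
    }

    class RacyElection {
        // Racy: another replica's put() can land between our put() and get(),
        // so both replicas may see their own address and both act as master.
        static boolean tryBecomeMaster(KVStore kv, String myAddr) {
            kv.put("master", myAddr);                 // write my address
            return myAddr.equals(kv.get("master"));   // check-then-act: not atomic!
        }
    }

The write and the check are separate operations, so two replicas that interleave can each see their own address and both decide to act as master.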
Zookeeper API overview
data model: a file-system-like tree of znodes
file names, file content, directories, path names
typical use: configuration info in znodes
set of machines that participate in the application
which machine is the primary
each znode has a version number
types of znodes:
regular (a.k.a. persistent)
ephemeral
sequential: name + seqno
watches
allow clients to receive timely notifications of changes w/o requiring polling
one-time trigger associated with a session
client lib will re-establish watches on a new server if the original connection is lost
sessions
have an associated timeout, for detecting faulty clients
persist across Zookeeper servers (a connection loss does not end a session)
Operations on znodes
create(path, data, flags)
exclusive -- only first create indicates success
delete(path, version)
if znode.version = version, then delete
exists(path, watch)
watch=true means also send notification if path is later created/deleted
getData(path, watch)
setData(path, data, version)
if znode.version = version, then update
getChildren(path, watch)
sync()
sync then read ensures writes before sync are visible to same client's read
client could instead submit a write
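A minimal sketch of these operations with the official Java client (the connect string, paths, and values are illustrative, and error handling is omitted):

    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;
    import java.util.List;

    public class ZnodeOpsDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

            // create: throws NodeExistsException if /config already exists
            zk.create("/config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // getData with watch=true: notified once if /config later changes
            Stat stat = new Stat();
            byte[] data = zk.getData("/config", true, stat);

            // setData conditioned on the version we just read
            zk.setData("/config", "v2".getBytes(), stat.getVersion());

            // exists with a watch: notified if /config/ready is created or deleted
            Stat ready = zk.exists("/config/ready", true);

            // getChildren with a watch on membership changes
            List<String> kids = zk.getChildren("/config", true);

            // delete conditioned on version (-1 would mean "any version")
            Stat cur = zk.exists("/config", false);
            zk.delete("/config", cur.getVersion());

            zk.close();
        }
    }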
The ZooKeeper API is well tuned to synchronization:
+ exclusive file creation; exactly one concurrent create returns success
+ getData()/setData(x, version) supports mini-transactions
+ sessions automate actions when clients fail (e.g. release lock on failure)
+ sequential files create order among multiple clients
+ watches -- avoid polling
Ordering guarantees
* Linearizable writes (use ZAB to totally order writes)
clients send writes to the leader
the leader chooses an order, numbered by "zxid"
the leader sends writes to replicas, which all execute in zxid order
* FIFO client order (Session consistency)
each client specifies an order for its operations (reads AND writes)
writes:
writes appear in the global write order in client-specified order
reads:
each read executes at a particular point in the write order
a client's successive reads execute at non-decreasing points in the order
a client's read executes after all previous writes by that client
a server may block a client's read to wait for previous write, or sync()
Q: Why does this make sense?
I.e. why OK for reads to return stale data?
why OK for client 1 to see new data, then client 2 sees older data?
Note that the staleness of reads is bounded
syncLimit:
Amount of time, in ticks (see tickTime), to allow followers to sync with ZooKeeper
If followers fall too far behind a leader, they will be dropped
A:
At a high level:
not as painful for programmers as it may seem
very helpful for read performance!
ZooKeeper processes reads locally at each server
read capacity scales linearly with the number of ZooKeeper servers
Why is ZooKeeper useful despite loose consistency (compared to linearizability)?
sync() causes subsequent client reads to see preceding writes.
useful when a read must see latest data
sync() before a read approximates a linearizable read (see the caveat near the end of these notes), but it hurts performance; a sketch follows this list
Writes are well-behaved, e.g. exclusive test-and-set operations
writes really do execute in order, on latest data.
Read order rules ensure "read your own writes". (Read-your-write Consistency)
Read order rules help reasoning.
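A minimal sketch of sync-then-read with the Java client, assuming an already-connected ZooKeeper handle; in the Java API, sync() completes asynchronously, so the sketch waits on its callback before reading:

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;
    import java.util.concurrent.CountDownLatch;

    public class SyncThenRead {
        // sync() then read: the read sees all writes that completed before the
        // sync (the path is illustrative).
        static byte[] readFresh(ZooKeeper zk, String path) throws Exception {
            CountDownLatch done = new CountDownLatch(1);
            // wait for the sync callback before issuing the read
            zk.sync(path, (rc, p, ctx) -> done.countDown(), null);
            done.await();
            return zk.getData(path, false, new Stat());
        }
    }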
A few consequences of the ordering guarantees:
Leader must preserve client write order across leader failure.
Replicas must enforce "a client's reads never go backwards in zxid order" (Monotonic Reads)
despite replica failure.
Client must track highest zxid it has read
to help ensure next read doesn't go backwards
even if sent to a different replica
Example 1: Add one to a number stored in a ZooKeeper znode
what if the read returns stale data?
write will write the wrong value!
what if another client concurrently updates?
will one of the increments be lost?
while true:
x, v := getData("f")
if setData("f", x + 1, version=v):
break
this is a "mini-transaction", effect is atomic read-modify-write
Example 2: Simple Locks
acquire():
while true:
if create("lf", ephemeral=true), success
if exists("lf", watch=true)
wait for notification
release(): (voluntarily or session timeout)
delete("lf")
Example 3: Locks without Herd Effect
1. create a "sequential" file
2. list files
3. if no lower-numbered, lock is acquired!
4. if exists(next-lower-numbered, watch=true)
5. wait for event...
6. goto 2
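A sketch of the herd-free recipe with the Java client; the parent znode "/lock" and child prefix "req-" are illustrative, and the parent is assumed to already exist:

    import org.apache.zookeeper.*;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    public class OrderedLock {
        // Each client creates an ephemeral sequential znode under /lock and
        // watches only the znode just below its own in sequence order.
        static String acquire(ZooKeeper zk) throws Exception {
            String me = zk.create("/lock/req-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String myName = me.substring("/lock/".length());
            while (true) {
                List<String> kids = zk.getChildren("/lock", false);
                Collections.sort(kids);
                int idx = kids.indexOf(myName);
                if (idx == 0) return me;           // lowest sequence number: lock held
                String prev = "/lock/" + kids.get(idx - 1);
                CountDownLatch gone = new CountDownLatch(1);
                // watch only the next-lower znode to avoid waking every waiter
                if (zk.exists(prev, event -> gone.countDown()) != null) {
                    gone.await();
                }
                // prev vanished (or was already gone): re-list and re-check
            }
        }

        static void release(ZooKeeper zk, String me) throws Exception {
            zk.delete(me, -1);
        }
    }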
Note on using ZK locks
Different from single-machine thread locks!
If lock holder fails, system automatically releases locks.
So locks are not really enforcing atomicity of other activities.
To make writes atomic, use "ready" trick or mini-transactions.
Useful for master/leader election.
New leader must inspect state and clean up.
Or soft locks, for performance but not correctness
e.g. only one worker does each Map or Reduce task (but OK if done twice)
e.g. a URL crawled by only one worker (but OK if done twice)
Zookeeper Performance Optimizations
Reads are performed on a local replica of the database
Clients can send async writes to leader (async = don't have to wait).
Leader batches up many requests to reduce net and disk-write overhead.
Assumes lots of active clients.
Fuzzy snapshots (and idempotent updates) so snapshot doesn't stop writes.
Is the resulting performance good?
Table 1 in the paper
High read throughput -- and goes up with number of servers!
Lower write throughput -- and goes down with number of servers!
21,000 writes/second is pretty good!
Maybe limited by time to persist log to hard drives.
But still MUCH higher than 10 milliseconds per disk write -- batching.
ZooKeeper is a successful design
see ZooKeeper's Wikipedia page for a list of projects that use it
Rarely eliminates all the complexity from distribution.
e.g. GFS master still needs to replicate file meta-data.
e.g. GFS primary has its own plan for replicating chunks.
But does bite off a bunch of common cases:
Master election.
Persistent master state (if state is small).
Who is the current master? (name service).
Worker registration.
Work queues.
----------------------------------------------------------------------------
Persistence
write-ahead log of committed operations
periodic snapshots of the in-memory database
Idempotent Operations
operation: <txnType, path, value, newVersionNumber>
e.g. <SetDataTXN, /foo, f3, 2>
the leader transforms a write request into a txn, filling in the updated state of the znode
Fuzzy Snapshots
ZooKeeper creates the snapshot from its in-memory database while allowing writes to the database
depth-first scan of the whole tree
atomically read the metadata and data of each znode, write them to disk
a snapshot may not correspond to the state of ZooKeeper at any single point in time
but that's OK:
after reboot, ZooKeeper applies the commit log from the point at which the snapshot started
The replay fixes the fuzzy snapshot to be a consistent snapshot of the application state
Details of batching and pipelining for performance
There are two things going on here. First, the ZooKeeper leader (really the leader's Zab layer) batches together multiple client operations in order to send them efficiently over the network, and in order to efficiently write them to disk. For both network and disk, it's often far more efficient to send a batch of N small items all at once than it is to send or write them one at a time. This kind of batching is only effective if the leader sees many client requests at the same time; so it depends on there being lots of active clients.
The second aspect of pipelining is that ZooKeeper makes it easy for each client to keep many write requests outstanding at a time, by supporting asynchronous operations. From the client's point of view, it can send lots of write requests without having to wait for the responses (which arrive later, as notifications after the writes commit). From the leader's point of view, that client behavior gives the leader lots of requests to accumulate into big efficient batches.
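A sketch of the client side of this pipelining, using the Java client's asynchronous setData (paths and values are illustrative; the znodes are assumed to exist):

    import org.apache.zookeeper.ZooKeeper;

    public class PipelinedWrites {
        // Issue many writes without waiting for each response, so the leader
        // can accumulate them into batches.
        static void writeMany(ZooKeeper zk, int n) {
            for (int i = 0; i < n; i++) {
                zk.setData("/log/e-" + i, ("v" + i).getBytes(), -1,
                        (rc, path, ctx, stat) -> {
                            // completion callback runs later, after the write commits
                            if (rc != 0) System.err.println("write failed: " + path);
                        }, null);
            }
            // all n requests are now in flight; responses arrive asynchronously
        }
    }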
Notification Corner Case
There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.
book p103 [TBD]
Q: Why do the authors of the paper say ZooKeeper is wait-free?
A:
The precise definition of wait-free: A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy:
https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf
Definition of wait-freedom from Wikipedia: Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
Zookeeper is wait-free because it processes one client's requests without needing to wait for other clients to take action. This is partially a consequence of the API: despite being designed to support client/client coordination and synchronization, no ZooKeeper API call is defined in a way that would require one client to wait for another. In contrast, a system that supported a lock acquire operation that waited for the current lock holder to release the lock would not be wait-free.
Ultimately, however, ZooKeeper clients often need to wait for each other, and ZooKeeper does provide a waiting mechanism -- watches. The main effect of wait-freedom on the API is that watches are factored out from other operations. The combination of atomic test-and-set updates (e.g. file creation and writes condition on version) with watches allows clients to synthesize more complex blocking abstractions (e.g. locks and barriers).
Q: ZooKeeper session timeout vs. Chubby lease timeout?
A:
[TBD]
Q: How to do leader election with Zookeeper
A:
[TBD]
See https://zookeeper.apache.org/doc/current/recipes.html
Q: Any ordering guarantees for ZooKeeper notifications?
A:
If a client is watching for a change, the client will see the notification event before it sees the new state of the system after the change is made.
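This guarantee is what makes the "ready" znode trick (mentioned in the lock notes above) work. A sketch of both sides with the Java client, with /ready and the config paths as in the paper's configuration-change example:

    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;
    import java.util.List;

    public class ReadyPattern {
        // Writer side (e.g. a new master): delete /ready, rewrite the config
        // znodes, then recreate /ready.
        static void publishConfig(ZooKeeper zk, List<String> paths, byte[][] vals)
                throws Exception {
            try { zk.delete("/ready", -1); } catch (KeeperException.NoNodeException e) {}
            for (int i = 0; i < paths.size(); i++) {
                zk.setData(paths.get(i), vals[i], -1);
            }
            zk.create("/ready", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Reader side: check /ready with a watch before reading the config.
        // If /ready is deleted mid-update, the notification is delivered before
        // any read could observe the partially updated config, so the reader
        // knows to start over.
        static byte[] readConfig(ZooKeeper zk, String path) throws Exception {
            if (zk.exists("/ready", true) == null) {
                return null; // config is being updated; retry later
            }
            return zk.getData(path, false, new Stat());
        }
    }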
Q: Zab vs. Raft/Paxos?
A:
[TBD]
See https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos
Q: Can the ZooKeeper ensemble be configured such that the leader does not accept connections from clients?
A:
Yes. leaderServes (Cluster Option):
Leader accepts client connections. Default value is "yes". The leader machine coordinates updates. For higher update throughput at the slight expense of read throughput, the leader can be configured to not accept clients and focus on coordination.
See https://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html
Q: Is a ZooKeeper slow read (sync + read) linearizable?
A:
No. The use of the sync operation before performing a read does not guarantee linearizable reads, as the following snippet taken from the zookeeper book states:
"There is a caveat to the use of sync, which is fairly technical and deeply entwined with ZooKeeper internals. (Feel free to skip it.) Because ZooKeeper is supposed to serve reads fast and scale for read-dominated workloads, the implementation of sync has been simplified and it doesn't really traverse the execution pipeline as a regular update operation, like create, setData, or delete. It simply reaches the leader, and the leader queues a response back to the follower that sent it. There is a small chance that the leader thinks that it is the leader l, but doesn't have support from a quorum any longer because the quorum now supports a different leader, lʹ . In this case, the leader l might not have all updates that have been processed, and the sync call might not be able to honor its guarantee."
Q: Zookeeper v.s. Chubby
A:
Chubby is a lock service; ZooKeeper is not a lock service but a coordination service, and clients can use its API to implement locks
ZooKeeper's consistency model (FIFO client order / sequential consistency for reads, linearizable writes) is more relaxed than Chubby's (Chubby uses Paxos; its exact consistency model is [TBD] here)
ZooKeeper provides one-shot watches for efficient waiting; Chubby offers a comparable event-subscription mechanism on handles
References
Paper: "ZooKeeper: wait-free coordination for internet-scale systems" (USENIX ATC 2010)
Zookeeper programmer's guide: https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html
https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf (wait free, universal objects, etc.)
MIT 6.824 Zookeeper case study: https://pdos.csail.mit.edu/6.824/notes/l-zookeeper.txt
The Zookeeper book: https://t.hao0.me/files/zookeeper.pdf