Zookeeper Study Notes

 

Zookeeper is a general-purpose coordination service.
The ZooKeeper service comprises an ensemble of servers that use replication to achieve high availability and performance.

 

What do we mean by coordination as a service?
  Example: VMware FT's test-and-set (t-a-s) server
    If one replica can't talk to the other, it grabs the t-a-s lock and becomes the sole server
    Must be exclusive to avoid two primaries (e.g. after a network partition)
    Must be fault-tolerant
  Example: GFS (more speculative)
    Perhaps agreement on which meta-data replica should be master
    Perhaps recording list of chunk servers, which chunks, who is primary
  Other examples: MapReduce, YMB, Crawler, etc.
    Who is the master; list of workers (Group Membership)
    Master failover (Failure Detection & Leader Election)
    Division of work; status of tasks (Configuration Management)
  A general-purpose service would save much effort!

 

Could we use a linearizable key/value store as a generic coordination service?
  For example, to choose new GFS master if multiple replicas want to take over?
  perhaps
    Put("master", my IP address)
    if Get("master") == my IP address:
      act as master
  problem: a racing Put() may execute after the Get() (see the timeline after this list)
    the 2nd Put() overwrites the first, so two masters -- oops
    Put() and Get() are not a good API for mutual exclusion!
  problem: what to do if master fails?
    perhaps master repeatedly Put()s a fresh timestamp?
    lots of polling...
  problem: clients need to know when master changes
    periodic Get()s?
    lots of polling...
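
  For example, with two candidate masters M1 and M2, this interleaving (each line a
  complete operation on the linearizable store) leaves both acting as master:
    M1: Put("master", "M1")
    M1: Get("master") -> "M1"   => M1 acts as master
    M2: Put("master", "M2")     -- overwrites M1's Put
    M2: Get("master") -> "M2"   => M2 also acts as master: two masters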

 

Zookeeper API overview
  data model: a file-system-like tree of znodes
    file names, file content, directories, path names
  typical use: configuration info in znodes
    set of machines that participate in the application
    which machine is the primary
  each znode has a version number
  types of znodes:
    regular (a.k.a. persistent)
    ephemeral
    sequential: name + seqno
  watches
    allow clients to receive timely notifications of changes w/o requiring polling
    one-time trigger associated with a session
    client lib will re-establish watches on a new server if the original connection is lost
  sessions
    have an associated timeout, for detecting faulty clients
    persist across Zookeeper servers (a connection loss does not end a session)
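
  A minimal sketch of these pieces (znode types, watches, sessions) using the official
  Java client; the host list, timeout, and paths below are made up for illustration, and
  "/workers" and "/config/primary" are assumed to already exist:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    public class ApiSketch {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Opening a connection starts a session; the 5s timeout is how long
            // the ensemble waits before declaring this client faulty.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected)
                    connected.countDown();
            });
            connected.await();

            // Ephemeral znode: removed automatically when this session ends.
            zk.create("/workers/w-1", "host:port".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // One-time watch: fires once when /config/primary changes or is deleted,
            // then must be re-registered by the client.
            Stat stat = new Stat();
            byte[] primary = zk.getData("/config/primary",
                    event -> System.out.println("primary changed: " + event), stat);
            System.out.println("primary=" + new String(primary) + " version=" + stat.getVersion());

            zk.close();
        }
    }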

 

Operations on znodes
  create(path, data, flags)
    exclusive -- only first create indicates success
  delete(path, version)
    if znode.version = version, then delete
  exists(path, watch)
    watch=true means also send notification if path is later created/deleted
  getData(path, watch)
  setData(path, data, version)
    if znode.version = version, then update
  getChildren(path, watch)
  sync()
    sync then read ensures writes before sync are visible to same client's read
    the client could instead submit a (dummy) write to get the same flushing effect
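
  In the Java client, sync() is asynchronous and takes a callback, so the "sync then read"
  pattern looks roughly like the sketch below (the class and path names are just for
  illustration):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class SlowRead {
        // "slow read": make sure this client's replica has caught up with what the
        // leader has committed before reading, so the read observes preceding writes.
        static byte[] syncThenRead(ZooKeeper zk, String path) throws Exception {
            CountDownLatch done = new CountDownLatch(1);
            zk.sync(path, (rc, p, ctx) -> done.countDown(), null);  // async flush
            done.await();
            return zk.getData(path, false, new Stat());
        }
    }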


The ZooKeeper API is well tuned to synchronization:
  + exclusive file creation; exactly one concurrent create returns success
  + getData()/setData(x, version) supports mini-transactions
  + sessions automate actions when clients fail (e.g. release lock on failure)
  + sequential files create order among multiple clients
  + watches -- avoid polling


Ordering guarantees
  * Linearizable writes (use ZAB to totally order writes)
    clients send writes to the leader
    the leader chooses an order, numbered by "zxid"
    the leader sends writes to replicas, which all execute in zxid order
  * FIFO client order (Session consistency)
    each client specifies an order for its operations (reads AND writes)
    writes:
      writes appear in the global write order in client-specified order
    reads:
      each read executes at a particular point in the write order
      a client's successive reads execute at non-decreasing points in the order
      a client's read executes after all previous writes by that client
        a server may block a client's read to wait for previous write, or sync()

Q: Why does this make sense?
  I.e. why OK for reads to return stale data?
    why OK for client 1 to see new data, then client 2 sees older data?
  Note that the staleness of reads is bounded
    syncLimit:
      Amount of time, in ticks (see tickTime), to allow followers to sync with ZooKeeper
      If followers fall too far behind the leader, they will be dropped

A:

  At a high level:
    not as painful for programmers as it may seem
    very helpful for read performance!
      ZooKeeper processes reads locally at each server
      read capacity scales linearly with the number of ZooKeeper servers

  Why is ZooKeeper useful despite loose consistency (compared to linearizability)?
    sync() causes subsequent client reads to see preceding writes.
      useful when a read must see latest data
      sync() followed by a read approximates a linearizable read, but it hurts performance
        (see the slow-read Q&A below for a caveat)
    Writes are well-behaved, e.g. exclusive test-and-set operations
      writes really do execute in order, on latest data.
    Read order rules ensure "read your own writes". (Read-your-write Consistency)
    Read order rules help reasoning.

A few consequences of the ordering guarantees:
  Leader must preserve client write order across leader failure.
  Replicas must enforce "a client's reads never go backwards in zxid order" (Monotonic Reads)
    despite replica failure.
  Client must track highest zxid it has read
    to help ensure next read doesn't go backwards
    even if sent to a different replica


Example 1: Add one to a number stored in a ZooKeeper znode
  what if the read returns stale data?
    write will write the wrong value!
  what if another client concurrently updates?
    will one of the increments be lost?

  while true:
    x, v := getData("f")
    if setData("f", x + 1, version=v):
      break
  this is a "mini-transaction", effect is atomic read-modify-write
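
  A sketch of the same loop against the real API, in Java (assuming the znode "/f"
  exists and holds a decimal integer):

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class Increment {
        // Atomic read-modify-write: retry until our conditional setData() lands on
        // the exact version we read, i.e. nobody else updated the znode in between.
        static void increment(ZooKeeper zk, String path) throws Exception {
            while (true) {
                Stat stat = new Stat();
                byte[] data = zk.getData(path, false, stat);
                long x = Long.parseLong(new String(data, StandardCharsets.UTF_8));
                byte[] next = Long.toString(x + 1).getBytes(StandardCharsets.UTF_8);
                try {
                    zk.setData(path, next, stat.getVersion());
                    return;  // conditional write succeeded
                } catch (KeeperException.BadVersionException e) {
                    // someone else wrote first: loop and retry on fresh data
                }
            }
        }
    }

  Even if the getData() returned stale data, the stale version number makes the
  conditional setData() fail, so no increment is lost.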

 

Example 2: Simple Locks
  acquire():
    while true:
      if create("lf", ephemeral=true): success, return -- lock acquired
      if exists("lf", watch=true):
        wait for notification, then retry

  release(): (voluntarily or session timeout)
    delete("lf")
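
  A rough Java version of this lock (znode name "/lf" as in the pseudocode; error
  handling omitted):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.*;

    class SimpleLock {
        private final ZooKeeper zk;
        SimpleLock(ZooKeeper zk) { this.zk = zk; }

        void acquire() throws Exception {
            while (true) {
                try {
                    // Ephemeral: the lock is released automatically if our session dies.
                    zk.create("/lf", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                            CreateMode.EPHEMERAL);
                    return;  // exclusive create succeeded: we hold the lock
                } catch (KeeperException.NodeExistsException e) {
                    // someone else holds it: watch for a change, then retry
                    CountDownLatch changed = new CountDownLatch(1);
                    if (zk.exists("/lf", event -> changed.countDown()) != null) {
                        changed.await();
                    }
                    // if exists() returned null, the holder vanished between the
                    // failed create() and the exists(): just retry immediately
                }
            }
        }

        void release() throws Exception {
            zk.delete("/lf", -1);  // version -1 matches any version
        }
    }

  Every waiter watches the same znode, so when the lock is released all of them wake
  up and race to re-create it -- the herd effect that Example 3 avoids.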

 

Example 3: Locks without Herd Effect

  1. create a "sequential" file
  2. list files
  3. if no lower-numbered, lock is acquired!
  4. if exists(next-lower-numbered, watch=true)
  5.     wait for event...
  6. goto 2
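
  A sketch of the same recipe in Java (assuming a pre-existing "/lock" parent znode;
  names are illustrative):

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    class SeqLock {
        private final ZooKeeper zk;
        private String me;  // full path of our sequential znode, e.g. /lock/lf-0000000042

        SeqLock(ZooKeeper zk) { this.zk = zk; }

        void acquire() throws Exception {
            // 1. create an ephemeral "sequential" file
            me = zk.create("/lock/lf-", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);
            while (true) {
                // 2. list files
                List<String> children = zk.getChildren("/lock", false);
                Collections.sort(children);
                int i = children.indexOf(me.substring("/lock/".length()));
                // 3. no lower-numbered file: lock acquired
                if (i == 0) return;
                // 4. watch only the next-lower-numbered file (no herd effect)
                CountDownLatch gone = new CountDownLatch(1);
                Stat s = zk.exists("/lock/" + children.get(i - 1),
                        event -> gone.countDown());
                // 5. wait for the event, then 6. go back to step 2
                if (s != null) gone.await();
            }
        }

        void release() throws Exception {
            zk.delete(me, -1);
        }
    }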

 

Note on using ZK locks
  Different from single-machine thread locks!
  If lock holder fails, system automatically releases locks.
  So locks are not really enforcing atomicity of other activities.
  To make writes atomic, use the "ready" trick (sketched after this list) or mini-transactions.
  Useful for master/leader election.
    New leader must inspect state and clean up.
  Or soft locks, for performance but not correctness
    e.g. only one worker does each Map or Reduce task (but OK if done twice)
    e.g. a URL crawled by only one worker (but OK if done twice)
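
  A sketch of the "ready" trick mentioned above, in Java (the "/app/ready" and
  "/app/config/*" paths are hypothetical): the writer hides the config while changing it,
  and readers only trust the config while "ready" exists, re-reading when it changes.

    import org.apache.zookeeper.*;

    class ReadyTrick {
        // Writer: delete "ready", update the config znodes, then recreate "ready".
        static void updateConfig(ZooKeeper zk, byte[] part1, byte[] part2) throws Exception {
            zk.delete("/app/ready", -1);
            zk.setData("/app/config/part1", part1, -1);
            zk.setData("/app/config/part2", part2, -1);
            zk.create("/app/ready", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Reader: if "ready" is missing, the config is mid-update; otherwise read it,
        // leaving watches so a later change triggers a re-read.
        static byte[][] readConfig(ZooKeeper zk, Watcher onChange) throws Exception {
            if (zk.exists("/app/ready", onChange) == null)
                return null;  // try again after the notification
            byte[] p1 = zk.getData("/app/config/part1", onChange, null);
            byte[] p2 = zk.getData("/app/config/part2", onChange, null);
            return new byte[][] { p1, p2 };
        }
    }

  The notification ordering guarantee (see the Q&A on notification order below) is what
  makes this safe: a reader that saw "ready" receives the deletion notification before it
  can observe a half-updated config.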


Zookeeper Performance Optimizations
  Reads are performed on a local replica of the database
  Clients can send async writes to leader (async = don't have to wait).
  Leader batches up many requests to reduce net and disk-write overhead.
  Assumes lots of active clients.
  Fuzzy snapshots (and idempotent updates) so snapshot doesn't stop writes.

 

Is the resulting performance good?
  Table 1 in the paper
  High read throughput -- and goes up with number of servers!
  Lower write throughput -- and goes down with number of servers!
  21,000 writes/second is pretty good!
    Maybe limited by time to persist log to hard drives.
    But still MUCH higher than 10 milliseconds per disk write -- batching.


ZooKeeper is a successful design
  see ZooKeeper's Wikipedia page for a list of projects that use it
  Rarely eliminates all the complexity from distribution.
    e.g. GFS master still needs to replicate file meta-data.
    e.g. GFS primary has its own plan for replicating chunks.
  But does bite off a bunch of common cases:
    Master election.
    Persistent master state (if state is small).
    Who is the current master? (name service).
    Worker registration.
    Work queues.


----------------------------------------------------------------------------

 

Persistence

  write-ahead log of committed operations
  periodic snapshots of the in-memory database

 

Idempotent Operations
  operation: <txnType, path, value, newVersionNumber>
    e.g. <SetDataTXN, /foo, f3, 2>
  the leader transforms a write request into a txn and fills the updated state of the znode into the txn
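
  A toy sketch of why such txns are idempotent (the class names below are made up, not
  ZooKeeper's internal ones): the txn records the complete new data and new version, so
  applying it a second time leaves the znode unchanged.

    class SetDataTxn {
        final String path; final byte[] data; final int newVersion;
        SetDataTxn(String path, byte[] data, int newVersion) {
            this.path = path; this.data = data; this.newVersion = newVersion;
        }
    }

    class ZnodeState {
        byte[] data; int version;

        // The result depends only on the txn, not on the current state,
        // so apply(txn); apply(txn) leaves the same state as apply(txn).
        void apply(SetDataTxn txn) {
            this.data = txn.data;
            this.version = txn.newVersion;
        }
    }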

 

Fuzzy Snapshots
  ZooKeeper creates the snapshot from its in-memory database while allowing writes to the database
  depth-first scan of the whole tree
  atomically reads the metadata and data of each znode and writes them to disk
  snapshots may not correspond to the state of ZooKeeper at any point in time
    but it's ok:
      after a reboot, ZooKeeper replays the committed txn log from the point at which the snapshot started
      because the txns are idempotent, the replay turns the fuzzy snapshot into a consistent snapshot of the application state


Details of batching and pipelining for performance  

There are two things going on here. First, the ZooKeeper leader (really the leader's Zab layer) batches together multiple client operations in order to send them efficiently over the network, and in order to efficiently write them to disk. For both network and disk, it's often far more efficient to send a batch of N small items all at once than it is to send or write them one at a time. This kind of batching is only effective if the leader sees many client requests at the same time; so it depends on there being lots of active clients. 

The second aspect of pipelining is that ZooKeeper makes it easy for each client to keep many write requests outstanding at a time, by supporting asynchronous operations. From the client's point of view, it can send lots of write requests without having to wait for the responses (which arrive later, as notifications after the writes commit). From the leader's point of view, that client behavior gives the leader lots of requests to accumulate into big efficient batches. 
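
A client-side sketch of what those asynchronous writes look like with the Java API (the "/queue/item-N" znodes are hypothetical and assumed to exist): the client fires off many setData() calls without waiting, and each result arrives later in a callback, which keeps the leader supplied with requests to batch.

    import org.apache.zookeeper.ZooKeeper;

    class AsyncWrites {
        // Issue n writes back-to-back without waiting for any of them.
        static void writeMany(ZooKeeper zk, int n) {
            for (int i = 0; i < n; i++) {
                zk.setData("/queue/item-" + i, ("v" + i).getBytes(), -1,
                        (rc, path, ctx, stat) -> {
                            if (rc != 0) System.err.println("write failed: " + path + " rc=" + rc);
                        }, null);
            }
            // no waiting here: results stream back via the callbacks, and the
            // leader can batch these requests for the network and the disk
        }
    }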

 

Notification Corner Case

There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.

book p103 [TBD]


Q: Why do the authors of the paper say ZooKeeper is wait-free?

A:

The precise definition of wait-free: A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy:

https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf 

Definition of wait-freedom from Wikipedia: Wait-freedom is the strongest  non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes. 

Zookeeper is wait-free because it processes one client's requests without needing to wait for other clients to take action. This is partially a consequence of the API: despite being designed to support client/client coordination and synchronization, no ZooKeeper API call is defined in a way that would require one client to wait for another. In contrast, a system that supported a lock acquire operation that waited for the current lock holder to release the lock would not be wait-free.

Ultimately, however, ZooKeeper clients often need to wait for each other, and ZooKeeper does provide a waiting mechanism -- watches. The main effect of wait-freedom on the API is that watches are factored out from other operations. The combination of atomic test-and-set updates (e.g. file creation and writes condition on version) with watches allows clients to synthesize more complex blocking abstractions (e.g. locks and barriers).

 

Q: Zookeeper session timeout v.s. Chubby lease timeout ?
A:
  [TBD]

 

Q: How to do leader election with Zookeeper?
A:

  Essentially Example 3 (locks without herd effect) without the release: each candidate
  creates an ephemeral sequential znode under an election parent; the candidate with the
  lowest sequence number is the leader, and the others each watch the znode just below
  their own so one of them can take over (after inspecting and cleaning up state) if the
  leader's session dies.

  See https://zookeeper.apache.org/doc/current/recipes.html

 

Q: Any order guarantees for Zookeeper notifications ?

A:

If a client is watching for a change, the client will see the notification event before it sees the new state of the system after the change is made.


Q. Zab v.s. Raft/Paxos ?
A:
  [TBD]

  See https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos


Q: Can the ZooKeeper ensemble be configured so that the leader does not accept client connections?
A:
  Yes. leaderServes (cluster option): "Leader accepts client connections." The default is
  "yes": the leader machine coordinates updates and also serves clients. For higher update
  throughput, at the slight expense of read throughput, the leader can be configured not to
  accept clients and to focus on coordination.
  See https://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html


Q: Is a Zookeeper slow read (sync + read) linearizable?
A:
  No. Using the sync operation before a read does not guarantee linearizable reads, as the following snippet from the ZooKeeper book explains:

  "There is a caveat to the use of sync, which is fairly technical and deeply entwined with ZooKeeper internals. (Feel free to skip it.) Because ZooKeeper is supposed to serve reads fast and scale for read-dominated workloads, the implementation of sync has been simplified and it doesn't really traverse the execution pipeline as a regular update operation, like create, setData, or delete. It simply reaches the leader, and the leader queues a response back to the follower that sent it. There is a small chance that the leader thinks that it is the leader l, but doesn't have support from a quorum any longer because the quorum now supports a different leader, lʹ . In this case, the leader l might not have all updates that have been processed, and the sync call might not be able to honor its guarantee."

 

Q: Zookeeper v.s. Chubby
A:
  Chubby is a lock service; ZooKeeper is not a lock service but a coordination service whose API clients can use to implement locks
  Zookeeper's consistency model (sequential consistency, linearizable writes) is more relaxed than Chubby's (Chubby uses Paxos; consistency model TBD)
  Zookeeper provides watches to enable efficient waiting; Chubby does not have the same watch mechanism


References

  Paper: "ZooKeeper: Wait-free Coordination for Internet-scale Systems" (USENIX ATC 2010)
  ZooKeeper programmer's guide: https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html
  Herlihy, "Wait-Free Synchronization" (wait-freedom, universal objects, etc.): https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf
  MIT 6.824 ZooKeeper case study: https://pdos.csail.mit.edu/6.824/notes/l-zookeeper.txt
  The ZooKeeper book: https://t.hao0.me/files/zookeeper.pdf
