Distributed Systems Knowledge Summary
I should have written this up right after 5105 ended, but I was too lazy.
(Mostly based on my 5105 notes, with a few rarely-used topics omitted; I also added some important material from 6.824.)
(Bookmarking two sets of DDIA reading notes: https://www.cnblogs.com/happenlee/category/1124283.html and https://github.com/Little-Wallace/ddia-note; reading the notes is probably enough...)
1. RPC
Persistent communication: messages are stored until the receiver is ready. Sender and receiver don't have to be up at the same time.
Transient communication: a message is stored only as long as both the sending and receiving applications are executing. The message is discarded if it can't be delivered to the receiver.
Synchronous communication: the sender blocks until the message is delivered to the receiver. Variant: block until the receiver has processed the message.
Asynchronous communication: the sender continues immediately after it has submitted the message.
RPC: the client remotely calls a function on the server.
Synchronous RPC vs. asynchronous RPC:
Stub: hides the communication details (comparable to the contents of the .java files generated by Thrift)
- Client stub: Converts function call to remote communication | Passes parameters to server machine | Receives results
- Server stub: Receives parameters and request from client | Calls the desired server function | Returns results to client
Binding: set up communication between the client and the server.
To deal with different data formats on different machines, the stub performs parameter marshalling (packing the parameters into a message using a standard data format).
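A minimal sketch of what a client stub does for a hypothetical remote add(a, b) over a plain TCP socket, using JSON as the standard wire format (real frameworks like Thrift generate this code and use a more compact binary protocol):

```python
import json
import socket

def add(a, b, server=("127.0.0.1", 9090)):
    """Client stub for a hypothetical remote add(a, b)."""
    # Parameter marshalling: pack the arguments into a standard data format.
    request = json.dumps({"func": "add", "args": [a, b]}).encode()
    with socket.create_connection(server) as s:
        s.sendall(request)
        s.shutdown(socket.SHUT_WR)            # tell the server the request is complete
        reply = s.makefile("rb").read()       # block until the reply arrives (synchronous RPC)
    return json.loads(reply)["result"]        # unmarshal the result
```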
Parameter passing (omitted)
2. Message Oriented Communication, multicast
In RPC, if there is no guarantee that the server is running, a synchronous RPC will block the client. Hence the need for message-passing systems.
Message Oriented Transient Communication
1. Socket: when an application sends data over the network, it writes the data into a socket; the receiving side then reads the data back out of its socket.
2. MPI
3. ZeroMQ: provides enhancements to sockets for a few simple communication patterns (a small example follows this list).
- Connections in ZeroMQ are asynchronous: connection request messages are queued at the sender's side.
- Built on top of TCP/IP -> ZeroMQ automatically sets up the connection before message transmission.
- In ZeroMQ, a socket can be bound to multiple addresses -> supports many-to-one / one-to-many (multicasting).
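A minimal request/reply sketch with pyzmq (the port and message contents are arbitrary); the client can connect and send even before the server is ready, because ZeroMQ queues the message at the sender and sets up the TCP connection automatically:

```python
import threading
import zmq

ctx = zmq.Context()

def server():
    rep = ctx.socket(zmq.REP)              # reply socket
    rep.bind("tcp://127.0.0.1:5555")
    msg = rep.recv()
    rep.send(b"echo: " + msg)

threading.Thread(target=server, daemon=True).start()

# The client may connect (and even send) before the server has bound:
# the connection request / message is queued at the sender's side and
# delivered once ZeroMQ has set up the underlying TCP connection.
req = ctx.socket(zmq.REQ)                  # request socket
req.connect("tcp://127.0.0.1:5555")
req.send(b"hello")
print(req.recv())                          # b'echo: hello'
```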
Message Oriented Persistent Communication
Message-Queuing Model: supports asynchronous persistent communication -> intermediate storage for messages while the sender or receiver is inactive.
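A toy in-process "broker" illustrating the idea: the sender returns immediately and the message sits in intermediate storage until the receiver asks for it, so the two sides never need to be up at the same time (all names here are made up):

```python
from collections import defaultdict, deque

class Broker:
    """Intermediate storage for messages (a stand-in for a real message-queuing system)."""
    def __init__(self):
        self.queues = defaultdict(deque)

    def put(self, queue_name, msg):
        # Asynchronous: the sender continues immediately after handing the message over.
        self.queues[queue_name].append(msg)

    def get(self, queue_name):
        # Persistent: the message waits here even if the receiver was down when it was sent.
        q = self.queues[queue_name]
        return q.popleft() if q else None

broker = Broker()
broker.put("orders", {"id": 1, "item": "book"})   # receiver need not be running now
print(broker.get("orders"))                        # later: {'id': 1, 'item': 'book'}
```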
Multicast
When disseminating information, the nodes are organized into an overlay network (a layer built on top of the physical network, roughly comparable to a VPN), which is then used to propagate the information to the members.
Multicast Tree (omitted)
Flooding
In this case, each node simply forwards a message m to each of its neighbors, except to the one from which it received m. Furthermore, if a node keeps track of the messages it received and forwarded, it can simply ignore duplicates.
- - Naïve flooding: number of messages = number of edges, which can be O(N^2) in a dense graph
- - Probabilistic flooding: forward the message with probability p_flood
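A rough sketch of flooding over an adjacency-list graph, with duplicate suppression and an optional p_flood for the probabilistic variant (the graph representation is an assumption for illustration):

```python
import random

def flood(graph, start, p_flood=1.0):
    """graph: dict node -> list of neighbors. Returns the set of nodes the message reached."""
    seen = {start}                              # track received messages -> ignore duplicates
    frontier = [(start, None)]                  # (node, neighbor it received the message from)
    while frontier:
        node, came_from = frontier.pop()
        for nb in graph[node]:
            if nb == came_from:                 # never send back to the node it came from
                continue
            if random.random() > p_flood:       # probabilistic flooding: forward with prob p_flood
                continue
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, node))
    return seen

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(flood(graph, "a"))                        # {'a', 'b', 'c', 'd'}
```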
Gossiping
Mimics how an epidemic spreads. Used for large networks.
Ref:http://www.10tiao.com/html/151/201903/2665515873/1.html
1. Anti-Entropy
2. Rumor Spreading
3. Directional Gossiping
(omitted)
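A toy push-style rumor-spreading loop (anti-entropy would instead periodically reconcile the full state of two random nodes); node count and fan-out are arbitrary:

```python
import random

def gossip_round(nodes, infected, fanout=2):
    """One push round: every node that already knows the rumor tells `fanout` random peers."""
    newly_told = set()
    for node in infected:
        peers = random.sample([n for n in nodes if n != node], fanout)
        newly_told.update(peers)
    return infected | newly_told

nodes = list(range(100))
infected = {0}                        # node 0 starts the rumor
rounds = 0
while len(infected) < len(nodes):     # spreads in O(log N) rounds, like an epidemic
    infected = gossip_round(nodes, infected)
    rounds += 1
print(f"everyone infected after {rounds} rounds")
```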
3. Naming
Naming system: resolves a name to its address. Flat naming / structured naming / attribute-based naming.
Broadcasting && Multicasting
Broadcast the name -> the named entity (e.g. a machine) responds with its address.
E.g. ARP: machine A broadcasts on the LAN that it wants to reach machine B, whose IP is 172.18.72.5. Every machine on the LAN receives the broadcast, but only B responds, returning its MAC address. After A receives this reply, it records B's MAC address in its local ARP cache.
Distributed Hash Table
See 5105 pa2: https://www.cnblogs.com/pdev/p/10621547.html
and the hashing notes: https://www.cnblogs.com/pdev/p/11332264.html
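A minimal consistent-hashing lookup in the spirit of the DHT built in pa2 (MD5, a 16-bit id space, and the node names are arbitrary choices):

```python
import bisect
import hashlib

def h(key, space=2**16):
    """Hash a string onto the identifier ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % space

class HashRing:
    def __init__(self, nodes):
        # Place every node on the ring at position h(node).
        self.ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        """A key belongs to the first node clockwise from h(key) (wrapping around)."""
        ids = [node_id for node_id, _ in self.ring]
        idx = bisect.bisect_right(ids, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("some-file.txt"))
```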
Linux NFS: used for accessing files on a remote host.
(omitted)
DNS
Namespace: a path name from the root: <cn, edu, hfut, ci> -> ci.hfut.edu.cn
(omitted)
4. MapReduce
See the MapReduce notes: https://www.cnblogs.com/pdev/p/11087826.html
and 5105 pa1: https://www.cnblogs.com/pdev/p/11331792.html
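The classic word-count example written as plain functions, to make the map / shuffle / reduce phases concrete (single process, no framework):

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    """Reduce: combine all counts for one word."""
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group the intermediate pairs by key before reducing.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)

print(dict(reduce_fn(w, c) for w, c in groups.items()))   # {'the': 3, 'fox': 2, ...}
```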
5. Mutex & Election
Mutex algorithm: used for accessing mutually exclusive (shared) resources.
(omitted)
Election algorithm: used to elect a coordinator in a distributed system.
Every node knows all node ids in advance but not which nodes are still alive, so an election is needed instead of just sorting the ids locally and taking the largest...
(omitted)
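A rough sketch of the bully-algorithm idea: a node that suspects the coordinator failed asks every node with a higher id; if none answers, it declares itself coordinator. The `is_alive` callback is an assumption of this sketch, standing in for sending ELECTION messages and waiting for a timeout:

```python
def bully_election(my_id, all_ids, is_alive):
    """Return my_id if this node becomes coordinator, else None (a higher node takes over)."""
    higher = [i for i in all_ids if i > my_id]
    if not any(is_alive(i) for i in higher):
        return my_id          # nobody with a higher id answered -> I am the new coordinator
    # Some higher node is alive; it will run its own election and eventually
    # broadcast a COORDINATOR message (omitted here).
    return None

alive = {1, 2, 5}             # pretend nodes 3 and 4 have crashed
print(bully_election(2, [1, 2, 3, 4, 5], lambda i: i in alive))   # None: node 5 will win
print(bully_election(5, [1, 2, 3, 4, 5], lambda i: i in alive))   # 5
```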
6. Consistency
Data replication: keeping multiple copies (replicas). Data consistency: guaranteeing that what is observed at any time and any place is exactly the same.
Data-centric consistency models:
Deal with consistency of read/write operations on shared data, from the server's perspective.
- 1.1 Strict consistency: requires global synchronization; far too expensive, exists only in theory.
- 1.2 Sequential consistency: every process sees the same sequence (order) of operations. (Textbook p. 204)
- Any valid interleaving of read/write operations is acceptable (the concrete result does not matter, and there is no notion of absolute time), but all processes must see the same order of reads and writes (only consistency matters). Within one process, reads and writes to the same variable keep that process's own program order.
- 1.3 Causal consistency: causally related writes must be seen in the same order by all processes. (Textbook p. 206)
- Different machines must see causally related writes (e.g. writing a resource and then reading it creates a dependency) in the same order. Concurrent (causally unrelated) writes may be seen in different orders [this is where it differs from sequential consistency]. (R/W operations within one process are always causally related.)
- 1.4 FIFO consistency: all writes done by one process are seen by all processes in the same order; no ordering requirement across writes from different processes.
- 1.5 Synchronization-based consistency: all local writes are flushed out and all remote writes are gathered in (at a sync barrier).
- 1.6 Weak consistency:
Client-centric consistency models: concerned with a single client operating on different replicas of the server. Most operations are reads and write-write conflicts are rare, so only eventual consistency (what's visible to the client) is required; compared to data-centric models the cost is lower (weaker consistency).
Read set: the writes the client has read. Write set: the writes the client has issued.
- Eventual consistency: if nothing is written for a long time, the whole system gradually converges to a consistent state. Users can tolerate a period of inconsistency (updates propagate slowly).
- An update should eventually propagate to all replicas. But nothing is assumed about the timeliness of update propagation.
- 3.2 Monotonic reads: if a process reads a value of x, any successive read of x by that process will return the same or a more recent value. (E.g. a client reading its mailbox from different locations; replica 3 fetches from the fresher replica 1 all the writes in the client's read set. See the sketch after this list.)
- 3.3 Monotonic writes: if a process writes to x, that write will be completed before any successive write to x by the same process. Concerned with the order of writes. (E.g. all outgoing posts from different locations; replica 3 fetches from the fresher replica 1 all the operations in the client's write set.)
- 3.4 Read your writes: a write to x by a process will always be seen by a successive read of x by the same process. (E.g. see my earlier posts.)
- 3.5 Writes follow reads: if a process reads a value of x, any successive write to x by that process will take place on the same or a more recent value. (E.g. your post will reflect any postings you've read earlier.)
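A toy illustration of monotonic reads using an explicit read set (see 3.2 above): before answering, a replica first pulls in any writes from the client's read set that it has not applied yet. The data structures here are invented for illustration:

```python
class Replica:
    """Toy replica: `applied` maps write_id -> (key, value), in the order applied here."""
    def __init__(self):
        self.applied = {}

def monotonic_read(replica, key, read_set, fetch_write):
    # Before answering, pull in every write the client has already observed
    # (its read set) that this replica is missing, e.g. from a fresher replica.
    for wid in read_set - replica.applied.keys():
        replica.applied[wid] = fetch_write(wid)
    read_set.update(replica.applied)           # the client has now observed these writes too
    values = [v for k, v in replica.applied.values() if k == key]
    return values[-1] if values else None
```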
Continuous consistency: a model for measuring the degree of data inconsistency inside a system (between replicas) and for expressing which inconsistencies the system can tolerate.
Ref: https://blog.xiaohansong.com/Continuous-Consistency.html
Consistent unit (conit): the unit of data over which consistency is measured in this model (e.g. one variable on one replica).
For each conit, continuous consistency is defined as a three-dimensional vector: consistency = (numerical deviation, order deviation, staleness). When all three deviations are 0, the requirements of linearizability (strong consistency) are met.
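A small sketch of a conit as that three-deviation vector (field meanings follow the post linked above; the representation itself is only illustrative):

```python
from dataclasses import dataclass

@dataclass
class Conit:
    numerical_dev: int     # e.g. number (or weight) of remote writes not yet seen locally
    order_dev: int         # number of local tentative writes whose final order is not fixed
    staleness: float       # age (in seconds) of the oldest write not yet propagated here

def is_strongly_consistent(c: Conit) -> bool:
    # All deviations being zero corresponds to linearizability.
    return c.numerical_dev == 0 and c.order_dev == 0 and c.staleness == 0

print(is_strongly_consistent(Conit(numerical_dev=2, order_dev=1, staleness=30.0)))   # False
```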
Ordering-based consistency protocols (for the sequential consistency model)
Primary-backup protocols / local-write protocols
(omitted)
Replicated-write protocols: writes can be performed at multiple replicas (in primary-backup protocols, writes happen at only one replica).
Quorum protocol
Operations are sent (from one replica) to a subset of replicas; both reads and writes are performed on a group of replicas.
For N replicas, a read quorum needs NR replicas to agree and a write quorum needs NW replicas to agree. The following must hold:
- NR + NW > N (avoids read-write conflicts: otherwise a group of NR readers and a group of NW writers might not overlap, and the readers could miss the latest version)
- NW > N/2 (avoids write-write conflicts: otherwise two groups of NW writers might not overlap and could end up with different versions)
Quorum implementation:
see 5105 pa3: https://www.cnblogs.com/pdev/p/11331871.html
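A minimal sketch of versioned quorum reads/writes over in-memory replicas (the caller is assumed to have chosen NR and NW so that NR + NW > N and NW > N/2):

```python
import random

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

def quorum_write(replicas, nw, value):
    group = random.sample(replicas, nw)              # any write quorum of size NW
    new_version = max(r.version for r in group) + 1  # NW > N/2: write quorums overlap, versions grow
    for r in group:
        r.version, r.value = new_version, value
    return new_version

def quorum_read(replicas, nr):
    group = random.sample(replicas, nr)              # any read quorum of size NR
    newest = max(group, key=lambda r: r.version)     # NR + NW > N: some member holds the latest write
    return newest.value

N, NR, NW = 5, 3, 3                                  # satisfies NR + NW > N and NW > N/2
replicas = [Replica() for _ in range(N)]
quorum_write(replicas, NW, "v1")
print(quorum_read(replicas, NR))                     # 'v1'
```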
7. Fault Tolerance
Crash failure: the server simply halts; all results returned before the halt are correct. Such failures can be eventually detected.
Byzantine (arbitrary) failure: incorrect but undetectable. The server may not have halted, yet it returns wrong results; this kind of failure is hard to detect.
CAP Theorem
Any networked shared-data system can satisfy at most two of the following three properties: consistency, availability (the service stays up and responds within a normal time), and partition tolerance (when the service is spread over multiple replicas/network partitions, it can still provide a consistent and available service even if a node or a partition fails).
The way to improve partition tolerance is to replicate a data item onto multiple nodes, so that after a partition the item may still be present in each part. However, replicating data across nodes introduces a consistency problem: the copies on different nodes may diverge. To keep them consistent, every write has to wait until all nodes have applied it, and that waiting in turn hurts availability.
Consensus Problem
How multiple servers reach agreement on a common value over an unreliable communication channel.
Consensus: the only requirement is that all replicas eventually agree; which concrete value is chosen is not fixed (it depends).
1. Paxos
Assume there are only crash failures. A proposer sends a proposal (a value) to all acceptors; once more than half of the acceptors approve it, the proposer gets the proposal written at the acceptors, and in the end all acceptors hold the same decided value, which may not be modified afterwards.
1. Prepare phase (claiming a slot)
Phase 1A: the proposer chooses a value v and broadcasts a Prepare(value=v, timestamp=t) request to all acceptors.
Phase 1B: when an acceptor receives Prepare(v, t): if t is larger than that of every Prepare request it has received so far, it 1) promises not to accept any proposal older than t, 2) records (v, t) locally, and 3) returns the value of the previously accepted proposal with the highest timestamp (that value is older than (v, t); this guarantees the proposer has seen the newest value accepted so far); otherwise it ignores the request.
2. Accept phase (commit phase)
Phase 2A (the most critical point of the whole protocol): the proposer collects the acceptors' responses.
If no more than half of the acceptors respond, the proposal simply fails;
If promises are received from a majority of the acceptors, there are two cases (both with respect to the acceptors in that majority):
If none of the acceptors in the majority has accepted a value yet (all are null), the proposer proposes its own value: propose(v, t).
If at least one acceptor in the majority has accepted a value, the proposer picks, among all the accepted values, the one with the largest timestamp, (vo, to), as the value for the accept request, while keeping its own timestamp t: propose(vo, t). In this case the proposer may not propose its own value; it must adopt the value the acceptors have already accepted (because once a value has been decided it can never be changed).
Phase 2B: when an acceptor receives the propose request: if its timestamp is not the timestamp t recorded in phase 1 (accepting it would violate the promise), the acceptor rejects the request; if it is, the acceptor writes the value locally.
1. Understanding the acceptor's phase-1 behaviour: if a value has already been written locally, the acceptor no longer accepts or agrees to any later request and returns the locally written value; if nothing has been written locally, it records the request's timestamp and stops accepting requests with other timestamps. In short, it only trusts the most recently promised timestamp and invalidates writes under other timestamps.
2. Understanding the proposer's phase-2 behaviour: if no more than half of the acceptors respond, the proposal fails; the proposer submits its own value only if more than half of the acceptors report an empty value, otherwise it submits the non-empty value with the largest timestamp. The essential difference is whether the submitted value is its own or one submitted earlier.
In some cases Paxos has two possible consensus outcomes, but that is acceptable (all replicas only need to eventually agree; the concrete value is not fixed).
Suppose there are 2 proposers (P1, P2) and 3 acceptors (A1, A2, A3). All 3 acceptors hold null at first.
STEP 1: P1 propagates Prepare(v1, t1) to the 3 acceptors, but only A1 receives it (due to a network failure), writes (v1, t1) locally, and returns a promise. A2 and A3 remain null.
STEP 2: P1 received only 1 response, so this preparation fails.
STEP 3: P2 propagates Prepare(v2, t2) to the 3 acceptors, but only A2 receives it (due to a network failure), writes (v2, t2) locally, and returns a promise. A3 remains null and A1 remains (v1, t1).
STEP 4: P2 received only 1 response, so this preparation fails.
STEP 5: P1 propagates Prepare(v3, t3) to the 3 acceptors again, but only 2 of them receive it (due to a network failure). There are then 3 possible situations:
1) A1 (v1, t1) and A2 (v2, t2) receive it: (v2, t2) will be chosen in the second phase, and the final value is v2.
2) A1 (v1, t1) and A3 (null) receive it: (v1, t1) will be chosen in the second phase, and the final value is v1.
3) A2 (v2, t2) and A3 (null) receive it: (v2, t2) will be chosen in the second phase, and the final value is v2.
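A sketch of the acceptor side of single-decree Paxos plus the proposer's phase-2 value choice; unlike the simplified description above, the value here is only recorded when an accept (propose) message is processed, while phase 1 just records the promise and reports any previously accepted value (message passing and retries omitted):

```python
class Acceptor:
    def __init__(self):
        self.promised_t = -1           # highest timestamp promised in phase 1
        self.accepted = (None, -1)     # (value, timestamp) accepted in phase 2, if any

    def on_prepare(self, t):
        # Phase 1B: promise to ignore anything older than t, and report what was
        # already accepted so the proposer is forced to reuse it.
        if t > self.promised_t:
            self.promised_t = t
            return ("promise", self.accepted)
        return ("reject", None)

    def on_propose(self, v, t):
        # Phase 2B: accept only if it does not violate the phase-1 promise.
        if t >= self.promised_t:
            self.promised_t = t
            self.accepted = (v, t)
            return "accepted"
        return "rejected"

def choose_value(promises, my_value):
    # Phase 2A: if any acceptor in the majority already accepted a value, the proposer
    # must re-propose the one with the highest timestamp; otherwise it may use its own.
    already_accepted = [acc for acc in promises if acc[0] is not None]
    return max(already_accepted, key=lambda acc: acc[1])[0] if already_accepted else my_value
```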
2. Byzantine Fault Tolerance
For Byzantine failures: guarantee that the whole system stays consistent even when a few nodes are traitors.
Theorem OM(m): a system with a total of 3m+1 nodes can tolerate at most m traitors.
Loyal nodes strictly forward the value they received. A traitor misbehaves (e.g. it receives a but forwards b, c, d to different peers).
Byzantine agreement algorithm: how can reliable transmission be guaranteed in a network that contains traitors?
3. Raft
8. Reliable Communication
Reliable RPC
(omitted)
Reliable Multicast
(omitted)
9. Recovery
checkpoint
logging
(omitted)
10. Coordination
Clock Sync
Logical Clock -- Timestamps
(omitted)
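A minimal Lamport logical clock sketch: increment on every local event, attach the timestamp to outgoing messages, and on receive jump to max(local, received) + 1:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event (including sending a message): advance the clock."""
        self.time += 1
        return self.time

    def on_receive(self, msg_time):
        """Receiving a message: order the receive after the send that produced msg_time."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # a sends a message stamped with t_send
print(b.on_receive(t_send))  # b's clock jumps past the sender's timestamp
```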