[Partial repost] OSD Source Code Reading 1 -- Communication Flow
[Partially reposted from https://hustcat.github.io/ceph-internal-network/ ]
Because Ceph has such a long history, its network layer does not use the now-common event-driven (epoll) model; instead it uses a multi-threaded model similar to MySQL's: each connection (socket) gets a reader thread that continuously reads from the socket, and a writer thread that writes data to it. Multi-threading is simple to implement, but its concurrency leaves much to be desired.
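To make that model concrete, here is a minimal, self-contained sketch (not Ceph code; all names are hypothetical) of one connection with a blocking reader thread and a condition-variable-driven writer thread, e.g. started with std::thread r(&Conn::reader, &c), w(&Conn::writer, &c):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unistd.h>

struct Conn {
  int sd;                          // connected socket
  std::mutex lock;
  std::condition_variable cond;
  std::queue<std::string> out_q;   // outgoing messages
  bool closed = false;

  void reader() {
    char buf[4096];
    ssize_t n;
    while ((n = ::read(sd, buf, sizeof(buf))) > 0) {
      // ... decode a message, hand it to a dispatcher, queue a reply ...
      send(std::string(buf, n));   // echo, purely for illustration
    }
    std::lock_guard<std::mutex> l(lock);
    closed = true;
    cond.notify_all();             // let the writer exit
  }

  void writer() {
    std::unique_lock<std::mutex> l(lock);
    while (!closed) {
      if (out_q.empty()) { cond.wait(l); continue; }  // sleep until woken
      std::string m = std::move(out_q.front());
      out_q.pop();
      l.unlock();
      ::write(sd, m.data(), m.size());  // write outside the lock
      l.lock();
    }
  }

  void send(std::string m) {
    std::lock_guard<std::mutex> l(lock);
    out_q.push(std::move(m));
    cond.notify_one();             // wake the writer
  }
};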
Messenger is the core data structure of the network module and is responsible for receiving and sending messages. An OSD mainly has two Messengers: ms_public handles messages exchanged with clients, and ms_cluster handles messages exchanged with other OSDs.
Initialization
Initialization happens in ceph-osd.cc:
// Create a Messenger object. Messenger is an abstract class that cannot be
// instantiated directly; the ::create factory method constructs a concrete
// subclass. (Presumably this selects SimpleMessenger, which owns an Accepter
// member that is allocated and initialized along with it.)
Messenger *ms_public = Messenger::create(g_ceph_context, public_msg_type,
    entity_name_t::OSD(whoami), "client", getpid(),
    Messenger::HAS_HEAVY_TRAFFIC | Messenger::HAS_MANY_CONNECTIONS); // handles client messages
Messenger *ms_cluster = Messenger::create(g_ceph_context, cluster_msg_type,
    entity_name_t::OSD(whoami), "cluster", getpid(),
    Messenger::HAS_HEAVY_TRAFFIC | Messenger::HAS_MANY_CONNECTIONS);
// The following Messengers handle heartbeats
Messenger *ms_hb_back_client = Messenger::create(·····);
Messenger *ms_hb_front_client = Messenger::create(·····);
Messenger *ms_hb_back_server = Messenger::create(······);
Messenger *ms_hb_front_server = Messenger::create(·····);
Messenger *ms_objecter = Messenger::create(······);
··········
// Bind to a fixed IP. The address ends up in the Accepter: Accepter::bind()
// creates a socket for it and saves it as listen_sd. Accepter::start() later
// launches the accept thread, whose work is done in Accepter::entry().
/*
  Messenger::bindv() --> Messenger::bind()
    --> SimpleMessenger::bind()
      --> Accepter::bind()
        --> create socket --> socket bind() --> socket listen()
*/
if (ms_public->bindv(public_addrs) < 0)
  forker.exit(1);
if (ms_cluster->bindv(cluster_addrs) < 0)
  forker.exit(1);
···········
// Create the Dispatcher subclass object
osd = new OSD(g_ceph_context, store, whoami, ms_cluster, ms_public,
              ms_hb_front_client, ms_hb_back_client,
              ms_hb_front_server, ms_hb_back_server,
              ms_objecter, &mc, data_path, journal_path);
·········
// Start the reaper threads
ms_public->start();
ms_hb_front_client->start();
ms_hb_back_client->start();
ms_hb_front_server->start();
ms_hb_back_server->start();
ms_cluster->start();
ms_objecter->start();

// Initialize the OSD module:
//  a) initialize the OSD module itself;
//  b) register it with SimpleMessenger::dispatchers via
//     SimpleMessenger::add_dispatcher_head(); the flow is:
//       Messenger::add_dispatcher_head() --> ready()
//         --> dispatch_queue.start()  (new DispatchQueue thread)
//         --> Accepter::start()       (start the accept thread)
//               --> accept --> SimpleMessenger::add_accept_pipe
//                 --> Pipe::start_reader --> Pipe::reader()
//     In ready():
//       1) the DispatchQueue thread is started, to buffer received messages;
//       2) the Accepter thread is started, to listen for new connection requests.
// start osd
err = osd->init();
············
// Enter the main loop and wait for exit
ms_public->wait(); // SimpleMessenger::wait()
ms_hb_front_client->wait();
ms_hb_back_client->wait();
ms_hb_front_server->wait();
ms_hb_back_server->wait();
ms_cluster->wait();
ms_objecter->wait();
Message handling
Related data structures
The core of the network module is SimpleMessenger:
(1) It contains an Accepter object, which runs a dedicated thread that accepts new connections (Pipes).
void *Accepter::entry()
{
  ...
  int sd = ::accept(listen_sd, (sockaddr*)&addr.ss_addr(), &slen);
  if (sd >= 0) {
    errors = 0;
    ldout(msgr->cct,10) << "accepted incoming on sd " << sd << dendl;
    msgr->add_accept_pipe(sd);
  ...
}

// create a new Pipe
Pipe *SimpleMessenger::add_accept_pipe(int sd)
{
  lock.Lock();
  Pipe *p = new Pipe(this, Pipe::STATE_ACCEPTING, NULL);
  p->sd = sd;
  p->pipe_lock.Lock();
  p->start_reader();
  p->pipe_lock.Unlock();
  pipes.insert(p);
  accepting_pipes.insert(p);
  lock.Unlock();
  return p;
}
(2) It contains all the connection objects (Pipe). Each Pipe has a reader thread and a writer thread: the reader thread reads data off the socket and puts the messages into the DispatchQueue; the writer thread takes Messages off the send queue and writes them to the socket (a sketch of how that queue is drained follows this list).
class Pipe : public RefCountedObject {
  /**
   * The Reader thread handles all reads off the socket -- not just
   * Messages, but also acks and other protocol bits (excepting startup,
   * when the Writer does a couple of reads).
   * All the work is implemented in Pipe itself, of course.
   */
  class Reader : public Thread {
    Pipe *pipe;
  public:
    Reader(Pipe *p) : pipe(p) {}
    void *entry() { pipe->reader(); return 0; }
  } reader_thread;  // the reader thread
  friend class Reader;

  /**
   * The Writer thread handles all writes to the socket (after startup).
   * All the work is implemented in Pipe itself, of course.
   */
  class Writer : public Thread {
    Pipe *pipe;
  public:
    Writer(Pipe *p) : pipe(p) {}
    void *entry() { pipe->writer(); return 0; }
  } writer_thread;  // the writer thread
  friend class Writer;
  ...
  map<int, list<Message*> > out_q;  // priority queue for outbound msgs (the send queue)
  DispatchQueue *in_q;              // the receive queue
(3) It contains a dispatch queue (DispatchQueue), which has a dedicated dispatch thread (DispatchThread) that delivers messages to the Dispatcher (e.g. the OSD) for the actual logical processing.
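Note that out_q above is keyed by priority. A simplified sketch of how such a queue can be drained, modeled on what Pipe::_get_next_outgoing() does (Message is Ceph's message type; the free-function form here is ours): take the front message of the highest-priority non-empty list, dropping priority levels that go empty.

#include <list>
#include <map>

class Message;  // Ceph's message type (opaque here)

Message *get_next_outgoing(std::map<int, std::list<Message*>>& out_q) {
  Message *m = nullptr;
  while (!m && !out_q.empty()) {
    auto p = out_q.rbegin();        // highest-priority bucket (largest key)
    if (!p->second.empty()) {
      m = p->second.front();
      p->second.pop_front();
    }
    if (p->second.empty())
      out_q.erase(p->first);        // drop the now-empty priority level
  }
  return m;
}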
Receiving a connection request
Listening for and handling requests is implemented by SimpleMessenger::ready() --> Accepter::entry():
void SimpleMessenger::ready()
{
  ldout(cct,10) << "ready " << get_myaddr() << dendl;

  // start the DispatchQueue thread
  dispatch_queue.start();

  lock.Lock();
  if (did_bind)
    // start the Accepter thread to listen for client connections,
    // see Accepter::entry() below
    accepter.start();
  lock.Unlock();
}

/*
  poll.h:

  struct pollfd {
    int   fd;
    short events;
    short revents;
  };

  fd is the file descriptor to watch, events is the set of events to check
  for, and revents is the set of events returned by the check; when the
  state of a descriptor changes, revents is non-zero.
*/

// the listen loop
void *Accepter::entry()
{
  ldout(msgr->cct,1) << __func__ << " start" << dendl;

  int errors = 0;

  struct pollfd pfd[2];
  memset(pfd, 0, sizeof(pfd));
  // listen_sd is the socket created and bound in Accepter::bind()
  pfd[0].fd = listen_sd;                                 // the descriptor to watch
  pfd[0].events = POLLIN | POLLERR | POLLNVAL | POLLHUP; // the events we care about
  pfd[1].fd = shutdown_rd_fd;
  pfd[1].events = POLLIN | POLLERR | POLLNVAL | POLLHUP;
  while (!done) { // keep listening
    ldout(msgr->cct,20) << __func__ << " calling poll for sd:" << listen_sd << dendl;
    int r = poll(pfd, 2, -1);
    if (r < 0) {
      if (errno == EINTR) {
        continue;
      }
      ldout(msgr->cct,1) << __func__ << " poll got error"
                         << " errno " << errno << " " << cpp_strerror(errno) << dendl;
      ceph_abort();
    }
    ldout(msgr->cct,10) << __func__ << " poll returned oke: " << r << dendl;
    ldout(msgr->cct,20) << __func__ << " pfd.revents[0]=" << pfd[0].revents << dendl;
    ldout(msgr->cct,20) << __func__ << " pfd.revents[1]=" << pfd[1].revents << dendl;

    if (pfd[0].revents & (POLLERR | POLLNVAL | POLLHUP)) {
      ldout(msgr->cct,1) << __func__ << " poll got errors in revents "
                         << pfd[0].revents << dendl;
      ceph_abort();
    }
    if (pfd[1].revents & (POLLIN | POLLERR | POLLNVAL | POLLHUP)) {
      // We got "signaled" to exit the poll
      // clean the selfpipe
      char ch;
      if (::read(shutdown_rd_fd, &ch, sizeof(ch)) == -1) {
        if (errno != EAGAIN)
          ldout(msgr->cct,1) << __func__ << " Cannot read selfpipe: "
                             << " errno " << errno << " " << cpp_strerror(errno) << dendl;
      }
      break;
    }
    if (done) break;

    // accept
    sockaddr_storage ss;
    socklen_t slen = sizeof(ss);
    int sd = accept_cloexec(listen_sd, (sockaddr*)&ss, &slen);
    if (sd >= 0) {
      errors = 0;
      ldout(msgr->cct,10) << __func__ << " incoming on sd " << sd << dendl;

      // A client connected. add_accept_pipe() (in SimpleMessenger.cc) creates
      // a Pipe for socket sd and starts the Pipe's reader thread.
      msgr->add_accept_pipe(sd);
    } else {
      int e = errno;
      ldout(msgr->cct,0) << __func__ << " no incoming connection? sd = " << sd
                         << " errno " << e << " " << cpp_strerror(e) << dendl;
      if (++errors > msgr->cct->_conf->ms_max_accept_failures) {
        lderr(msgr->cct) << "accetper has encoutered enough errors, just do ceph_abort()." << dendl;
        ceph_abort();
      }
    }
  }

  ldout(msgr->cct,20) << __func__ << " closing" << dendl;
  // socket is closed right after the thread has joined.
  // closing it here might race
  if (shutdown_rd_fd >= 0) {
    ::close(shutdown_rd_fd);
    shutdown_rd_fd = -1;
  }

  ldout(msgr->cct,10) << __func__ << " stopping" << dendl;
  return 0;
}
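The loop above combines poll() on the listening socket with the read end of a "self-pipe" for shutdown: writing one byte to the pipe's write end wakes poll() and makes the loop exit. A standalone sketch of the same pattern (not Ceph code; accept_loop and on_accept are our names):

#include <cerrno>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Calls on_accept(sd) for every accepted socket; returns when
// shutdown_rd_fd becomes readable (someone wrote to the self-pipe).
template <typename F>
void accept_loop(int listen_sd, int shutdown_rd_fd, F on_accept) {
  struct pollfd pfd[2] = {};
  pfd[0].fd = listen_sd;      pfd[0].events = POLLIN | POLLERR | POLLNVAL | POLLHUP;
  pfd[1].fd = shutdown_rd_fd; pfd[1].events = POLLIN | POLLERR | POLLNVAL | POLLHUP;

  for (;;) {
    int r = poll(pfd, 2, -1);          // block until a connection or shutdown
    if (r < 0) {
      if (errno == EINTR) continue;    // interrupted by a signal: retry
      return;                          // real error: give up
    }
    if (pfd[1].revents) {              // "signaled" via the self-pipe
      char ch;
      (void)::read(shutdown_rd_fd, &ch, 1);  // drain the pipe, then exit
      return;
    }
    if (pfd[0].revents & POLLIN) {
      int sd = ::accept(listen_sd, nullptr, nullptr);
      if (sd >= 0)
        on_accept(sd);                 // e.g. create a Pipe and start its reader
    }
  }
}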
A Pipe is then created and message handling begins:
Pipe *SimpleMessenger::add_accept_pipe(int sd)
{
  lock.Lock();
  Pipe *p = new Pipe(this, Pipe::STATE_ACCEPTING, NULL);
  p->sd = sd;
  p->pipe_lock.Lock();
  /*
   * Pipe::start_reader() creates a reader thread that starts consuming
   * messages: Pipe::start_reader() --> Pipe::reader()
   */
  p->start_reader();
  p->pipe_lock.Unlock();
  pipes.insert(p);
  accepting_pipes.insert(p);
  lock.Unlock();
  return p;
}
Creating the message read/processing threads
Message handling starts with Pipe::start_reader() --> Pipe::reader(), which already runs in the Reader thread. It first calls accept() to do some simple handling, then creates the Writer thread, which waits to send reply messages. After that it reads messages off the socket; each completed read is wrapped in a Message and handed to the DispatchQueue.
The DispatchQueue finds the registered Dispatcher and hands the message to it; once processing completes, the Writer thread is woken up to send the reply.
/* read msgs from socket.
 * also, server.
 */
/*
 * Handling starts in Pipe::start_reader() --> Pipe::reader(); we are in the
 * Reader thread here. accept() does some simple handling and creates the
 * Writer thread, which waits to send replies. Messages read off the socket
 * are wrapped in a Message and handed to the DispatchQueue, which finds the
 * registered Dispatcher; when processing is done the Writer thread is woken
 * up to send the reply.
 */
void Pipe::reader()
{
  pipe_lock.Lock();

  /*
   * Pipe::accept() calls Pipe::start_writer() to create the writer thread;
   * once in the writer thread it cond.Wait()s until activated (the wake-up
   * flow is described below). See the analysis of Pipe::accept() below for
   * how the Writer thread is created.
   */
  if (state == STATE_ACCEPTING) {
    accept();
    ceph_assert(pipe_lock.is_locked());
  }

  // loop.
  while (state != STATE_CLOSED &&
         state != STATE_CONNECTING) {
    ceph_assert(pipe_lock.is_locked());

    // sleep if (re)connecting
    if (state == STATE_STANDBY) {
      ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << dendl;
      cond.Wait(pipe_lock);
      continue;
    }

    // get a reference to the AuthSessionHandler while we have the pipe_lock
    std::shared_ptr<AuthSessionHandler> auth_handler = session_security;

    pipe_lock.Unlock();

    char tag = -1;
    ldout(msgr->cct,20) << "reader reading tag..." << dendl;
    // read the message tag; some tags wake the writer thread immediately
    if (tcp_read((char*)&tag, 1) < 0) { // read failed
      pipe_lock.Lock();
      ldout(msgr->cct,2) << "reader couldn't read tag, " << cpp_strerror(errno) << dendl;
      fault(true);
      continue;
    }

    if (tag == CEPH_MSGR_TAG_KEEPALIVE) { // keepalive message
      ldout(msgr->cct,2) << "reader got KEEPALIVE" << dendl;
      pipe_lock.Lock();
      connection_state->set_last_keepalive(ceph_clock_now());
      continue;
    }
    if (tag == CEPH_MSGR_TAG_KEEPALIVE2) { // keepalive2 message
      ldout(msgr->cct,30) << "reader got KEEPALIVE2 tag ..." << dendl;
      ceph_timespec t;
      int rc = tcp_read((char*)&t, sizeof(t));
      pipe_lock.Lock();
      if (rc < 0) {
        ldout(msgr->cct,2) << "reader couldn't read KEEPALIVE2 stamp "
                           << cpp_strerror(errno) << dendl;
        fault(true);
      } else {
        send_keepalive_ack = true;
        keepalive_ack_stamp = utime_t(t);
        ldout(msgr->cct,2) << "reader got KEEPALIVE2 " << keepalive_ack_stamp
                           << dendl;
        connection_state->set_last_keepalive(ceph_clock_now());
        cond.Signal(); // wake the writer thread directly
      }
      continue;
    }
    if (tag == CEPH_MSGR_TAG_KEEPALIVE2_ACK) {
      ldout(msgr->cct,2) << "reader got KEEPALIVE_ACK" << dendl;
      struct ceph_timespec t;
      int rc = tcp_read((char*)&t, sizeof(t));
      pipe_lock.Lock();
      if (rc < 0) {
        ldout(msgr->cct,2) << "reader couldn't read KEEPALIVE2 stamp " << cpp_strerror(errno) << dendl;
        fault(true);
      } else {
        connection_state->set_last_keepalive_ack(utime_t(t));
      }
      continue;
    }

    // open ...
    if (tag == CEPH_MSGR_TAG_ACK) {
      ldout(msgr->cct,20) << "reader got ACK" << dendl;
      ceph_le64 seq;
      int rc = tcp_read((char*)&seq, sizeof(seq));
      pipe_lock.Lock();
      if (rc < 0) {
        ldout(msgr->cct,2) << "reader couldn't read ack seq, " << cpp_strerror(errno) << dendl;
        fault(true);
      } else if (state != STATE_CLOSED) {
        handle_ack(seq);
      }
      continue;
    }

    else if (tag == CEPH_MSGR_TAG_MSG) {
      ldout(msgr->cct,20) << "reader got MSG" << dendl;
      // got a MSG message
      Message *m = 0;
      // read the message into a newly allocated Message object
      int r = read_message(&m, auth_handler.get());

      pipe_lock.Lock();

      if (!m) {
        if (r < 0)
          fault(true);
        continue;
      }

      m->trace.event("pipe read message");

      if (state == STATE_CLOSED ||
          state == STATE_CONNECTING) {
        in_q->dispatch_throttle_release(m->get_dispatch_throttle_size());
        m->put();
        continue;
      }

      // check received seq#. if it is old, drop the message.
      // note that incoming messages may skip ahead. this is convenient for the client
      // side queueing because messages can't be renumbered, but the (kernel) client will
      // occasionally pull a message out of the sent queue to send elsewhere. in that case
      // it doesn't matter if we "got" it or not.
      if (m->get_seq() <= in_seq) {
        ldout(msgr->cct,0) << "reader got old message "
                           << m->get_seq() << " <= " << in_seq << " " << m << " " << *m
                           << ", discarding" << dendl;
        in_q->dispatch_throttle_release(m->get_dispatch_throttle_size());
        m->put();
        if (connection_state->has_feature(CEPH_FEATURE_RECONNECT_SEQ) &&
            msgr->cct->_conf->ms_die_on_old_message)
          ceph_abort_msg("old msgs despite reconnect_seq feature");
        continue;
      }
      if (m->get_seq() > in_seq + 1) {
        ldout(msgr->cct,0) << "reader missed message? skipped from seq "
                           << in_seq << " to " << m->get_seq() << dendl;
        if (msgr->cct->_conf->ms_die_on_skipped_message)
          ceph_abort_msg("skipped incoming seq");
      }

      m->set_connection(connection_state.get());

      // note last received message.
      in_seq = m->get_seq();

      // wake the writer thread first so it can ACK this message
      cond.Signal();  // wake up writer, to ack this

      ldout(msgr->cct,10) << "reader got message "
                          << m->get_seq() << " " << m << " " << *m
                          << dendl;
      in_q->fast_preprocess(m);

      /*
       * If this request may be delayed, put the msg on
       * Pipe::DelayedDelivery::delay_queue and let the delay machinery handle
       * it later. In general, received messages fall into three classes:
       * 1. handled directly in the reader thread, like CEPH_MSGR_TAG_ACK above;
       * 2. normal processing: the message is put into the DispatchQueue,
       *    handled by the registered dispatcher, and the writer thread is
       *    then woken up to send the reply;
       * 3. delayed delivery: messages like the one below, sent whenever the
       *    timer decides.
       */
      if (delay_thread) {
        utime_t release;
        if (rand() % 10000 < msgr->cct->_conf->ms_inject_delay_probability * 10000.0) {
          release = m->get_recv_stamp();
          release += msgr->cct->_conf->ms_inject_delay_max * (double)(rand() % 10000) / 10000.0;
          lsubdout(msgr->cct, ms, 1) << "queue_received will delay until " << release << " on " << m << " " << *m << dendl;
        }
        delay_thread->queue(release, m);
      } else {
        /*
         * Normal processing: if the message can_fast_dispatch, it is handled
         * right here in the reader thread; otherwise it is put on
         * Pipe::DispatchQueue *in_q. The whole flow is:
         * DispatchQueue::enqueue()
         *   --> mqueue.enqueue() -> cond.Signal() (wakes the
         *       DispatchQueue::dispatch_thread)
         *   --> DispatchQueue::DispatchThread::entry() wakes up
         *   --> Messenger::ms_deliver_XXX
         *   --> the concrete Dispatcher, e.g. Monitor::ms_dispatch()
         *   --> Messenger::send_message()
         *   --> SimpleMessenger::submit_message()
         *   --> Pipe::_send()
         *   --> Pipe::out_q[].push_back(m) -> cond.Signal() wakes the writer thread
         *   --> ::sendmsg() // write to the socket
         */
        if (in_q->can_fast_dispatch(m)) {
          reader_dispatching = true;
          pipe_lock.Unlock();
          in_q->fast_dispatch(m);
          pipe_lock.Lock();
          reader_dispatching = false;
          if (state == STATE_CLOSED ||
              notify_on_dispatch_done) { // there might be somebody waiting
            notify_on_dispatch_done = false;
            cond.Signal();
          }
        } else {
          in_q->enqueue(m, m->get_priority(), conn_id); // hand off to the dispatch queue
        }
      }
    }

    else if (tag == CEPH_MSGR_TAG_CLOSE) {
      ldout(msgr->cct,20) << "reader got CLOSE" << dendl;
      pipe_lock.Lock();
      if (state == STATE_CLOSING) {
        state = STATE_CLOSED;
        state_closed = true;
      } else {
        state = STATE_CLOSING;
      }
      cond.Signal();
      break;
    }
    else {
      ldout(msgr->cct,0) << "reader bad tag " << (int)tag << dendl;
      pipe_lock.Lock();
      fault(true);
    }
  }

  // reap?
  reader_running = false;
  reader_needs_join = true;
  unlock_maybe_reap();
  ldout(msgr->cct,10) << "reader done" << dendl;
}
Pipe::accept() performs some simple protocol checks and authentication, then creates the Writer thread: Pipe::start_writer() --> Pipe::Writer
// the Ceph 14 version contains much more than is shown here
int Pipe::accept()
{
  ldout(msgr->cct,10) << "accept" << dendl;
  // check that our protocol version etc. match the peer's
  // ......

  while (1) {
    // protocol checks and similar operations
    // ......

    /**
     * Notify registered dispatchers that a new accept request has arrived;
     * if a Dispatcher subclass implements Dispatcher::ms_handle_accept(),
     * that method is called to handle it.
     */
    msgr->dispatch_queue.queue_accept(connection_state.get());

    // send the reply and authentication-related messages
    // ......

    if (state != STATE_CLOSED) {
      /**
       * Once the protocol checks and authentication above are done, create
       * the Writer() thread, which waits for the dispatcher to finish
       * processing and then sends the reply.
       */
      start_writer();
    }
    ldout(msgr->cct,20) << "accept done" << dendl;

    /**
     * If this message is a delayed-delivery message and the delay thread is
     * not running yet, start it:
     * Pipe::maybe_start_delay_thread()
     *   --> Pipe::DelayedDelivery::entry()
     */
    maybe_start_delay_thread();
    return 0; // success.
  }
}
The writer thread then waits to be woken up to send reply messages:
/* write msgs to socket.
 * also, client.
 */
void Pipe::writer()
{
  pipe_lock.Lock();
  while (state != STATE_CLOSED) { // && state != STATE_WAIT) {
    ldout(msgr->cct,10) << "writer: state = " << get_state_name()
                        << " policy.server=" << policy.server << dendl;

    // standby?
    if (is_queued() && state == STATE_STANDBY && !policy.server)
      state = STATE_CONNECTING;

    // connect?
    if (state == STATE_CONNECTING) {
      ceph_assert(!policy.server);
      connect();
      continue;
    }

    if (state == STATE_CLOSING) {
      // write close tag
      ldout(msgr->cct,20) << "writer writing CLOSE tag" << dendl;
      char tag = CEPH_MSGR_TAG_CLOSE;
      state = STATE_CLOSED;
      state_closed = true;
      pipe_lock.Unlock();
      if (sd >= 0) {
        // we can ignore return value, actually; we don't care if this succeeds.
        int r = ::write(sd, &tag, 1);
        (void)r;
      }
      pipe_lock.Lock();
      continue;
    }

    if (state != STATE_CONNECTING && state != STATE_WAIT && state != STATE_STANDBY &&
        (is_queued() || in_seq > in_seq_acked)) {

      // handle keepalive, keepalive2 and ack packets
      // keepalive?
      if (send_keepalive) {
        int rc;
        if (connection_state->has_feature(CEPH_FEATURE_MSGR_KEEPALIVE2)) {
          pipe_lock.Unlock();
          rc = write_keepalive2(CEPH_MSGR_TAG_KEEPALIVE2,
                                ceph_clock_now());
        } else {
          pipe_lock.Unlock();
          rc = write_keepalive();
        }
        pipe_lock.Lock();
        if (rc < 0) {
          ldout(msgr->cct,2) << "writer couldn't write keepalive[2], "
                             << cpp_strerror(errno) << dendl;
          fault();
          continue;
        }
        send_keepalive = false;
      }
      if (send_keepalive_ack) {
        utime_t t = keepalive_ack_stamp;
        pipe_lock.Unlock();
        int rc = write_keepalive2(CEPH_MSGR_TAG_KEEPALIVE2_ACK, t);
        pipe_lock.Lock();
        if (rc < 0) {
          ldout(msgr->cct,2) << "writer couldn't write keepalive_ack, " << cpp_strerror(errno) << dendl;
          fault();
          continue;
        }
        send_keepalive_ack = false;
      }

      // send ack?
      if (in_seq > in_seq_acked) {
        uint64_t send_seq = in_seq;
        pipe_lock.Unlock();
        int rc = write_ack(send_seq);
        pipe_lock.Lock();
        if (rc < 0) {
          ldout(msgr->cct,2) << "writer couldn't write ack, " << cpp_strerror(errno) << dendl;
          fault();
          continue;
        }
        in_seq_acked = send_seq;
      }

      // take a message off Pipe::out_q to send
      // grab outgoing message
      Message *m = _get_next_outgoing();
      if (m) {
        m->set_seq(++out_seq);
        if (!policy.lossy) {
          // put on sent list
          sent.push_back(m);
          m->get();
        }

        // associate message with Connection (for benefit of encode_payload)
        m->set_connection(connection_state.get());

        uint64_t features = connection_state->get_features();

        if (m->empty_payload())
          ldout(msgr->cct,20) << "writer encoding " << m->get_seq() << " features " << features
                              << " " << m << " " << *m << dendl;
        else
          ldout(msgr->cct,20) << "writer half-reencoding " << m->get_seq() << " features " << features
                              << " " << m << " " << *m << dendl;

        // encode the message (including crc/signing work below)
        // encode and copy out of *m
        m->encode(features, msgr->crcflags);

        // message header and footer
        // prepare everything
        const ceph_msg_header& header = m->get_header();
        const ceph_msg_footer& footer = m->get_footer();

        // Now that we have all the crcs calculated, handle the
        // digital signature for the message, if the pipe has session
        // security set up. Some session security options do not
        // actually calculate and check the signature, but they should
        // handle the calls to sign_message and check_signature. PLR
        if (session_security.get() == NULL) {
          ldout(msgr->cct, 20) << "writer no session security" << dendl;
        } else {
          if (session_security->sign_message(m)) {
            ldout(msgr->cct, 20) << "writer failed to sign seq # " << header.seq
                                 << "): sig = " << footer.sig << dendl;
          } else {
            ldout(msgr->cct, 20) << "writer signed seq # " << header.seq
                                 << "): sig = " << footer.sig << dendl;
          }
        }

        // gather the binary data to send
        bufferlist blist = m->get_payload();
        blist.append(m->get_middle());
        blist.append(m->get_data());

        pipe_lock.Unlock();

        m->trace.event("pipe writing message");
        // send the message: Pipe::write_message() --> Pipe::do_sendmsg() --> ::sendmsg()
        ldout(msgr->cct,20) << "writer sending " << m->get_seq() << " " << m << dendl;
        int rc = write_message(header, footer, blist);

        pipe_lock.Lock();
        if (rc < 0) {
          ldout(msgr->cct,1) << "writer error sending " << m << ", "
                             << cpp_strerror(errno) << dendl;
          fault();
        }
        m->put();
      }
      continue;
    }

    // wait to be woken by the Reader or a Dispatcher
    // wait
    ldout(msgr->cct,20) << "writer sleeping" << dendl;
    cond.Wait(pipe_lock);
  }

  ldout(msgr->cct,20) << "writer finishing" << dendl;

  // reap?
  writer_running = false;
  unlock_maybe_reap();
  ldout(msgr->cct,10) << "writer done" << dendl;
}
Dispatching the message
The reader hands messages to the dispatch_queue; the flow is as follows. Messages for which ms_can_fast_dispatch() is true are handled immediately via ms_fast_dispatch(); all others go into in_q:
Pipe::reader() --> Pipe::in_q --> DispatchQueue::enqueue()
void DispatchQueue::enqueue(const Message::ref& m, int priority, uint64_t id)
{
  Mutex::Locker l(lock);
  if (stop) {
    return;
  }
  ldout(cct,20) << "queue " << m << " prio " << priority << dendl;
  add_arrival(m);
  // put the message into DispatchQueue::mqueue according to its priority
  if (priority >= CEPH_MSG_PRIO_LOW) {
    mqueue.enqueue_strict(id, priority, QueueItem(m));
  } else {
    mqueue.enqueue(id, priority, m->get_cost(), QueueItem(m));
  }

  // wake up DispatchQueue::entry() to process the message
  cond.Signal();
}

/*
 * This function delivers incoming messages to the Messenger.
 * Connections with messages are kept in queues; when beginning a message
 * delivery the highest-priority queue is selected, the connection from the
 * front of the queue is removed, and its message read. If the connection
 * has remaining messages at that priority level, it is re-placed on to the
 * end of the queue. If the queue is empty; it's removed.
 * The message is then delivered and the process starts again.
 */
void DispatchQueue::entry()
{
  lock.Lock();
  while (true) {
    while (!mqueue.empty()) {
      QueueItem qitem = mqueue.dequeue();
      if (!qitem.is_code())
        remove_arrival(qitem.get_message());
      lock.Unlock();

      if (qitem.is_code()) {
        if (cct->_conf->ms_inject_internal_delays &&
            cct->_conf->ms_inject_delay_probability &&
            (rand() % 10000)/10000.0 < cct->_conf->ms_inject_delay_probability) {
          utime_t t;
          t.set_from_double(cct->_conf->ms_inject_internal_delays);
          ldout(cct, 1) << "DispatchQueue::entry inject delay of " << t
                        << dendl;
          t.sleep();
        }
        switch (qitem.get_code()) {
        case D_BAD_REMOTE_RESET:
          msgr->ms_deliver_handle_remote_reset(qitem.get_connection());
          break;
        case D_CONNECT:
          msgr->ms_deliver_handle_connect(qitem.get_connection());
          break;
        case D_ACCEPT:
          msgr->ms_deliver_handle_accept(qitem.get_connection());
          break;
        case D_BAD_RESET:
          msgr->ms_deliver_handle_reset(qitem.get_connection());
          break;
        case D_CONN_REFUSED:
          msgr->ms_deliver_handle_refused(qitem.get_connection());
          break;
        default:
          ceph_abort();
        }
      } else {
        const Message::ref& m = qitem.get_message();
        if (stop) {
          ldout(cct,10) << " stop flag set, discarding " << m << " " << *m << dendl;
        } else {
          uint64_t msize = pre_dispatch(m);

          /**
           * Hand off to Messenger::ms_deliver_dispatch(), which finds the
           * ms_dispatch() of the Monitor/OSD/etc. and starts the logical
           * handling of the message:
           * Messenger::ms_deliver_dispatch()
           *   --> OSD::ms_dispatch()
           */
          msgr->ms_deliver_dispatch(m);
          post_dispatch(m, msize);
        }
      }

      lock.Lock();
    }
    if (stop)
      break;

    // wait to be woken by DispatchQueue::enqueue()
    // wait for something to be put on queue
    cond.Wait(lock);
  }
  lock.Unlock();
}
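Note the two enqueue paths above: enqueue_strict() is for high-priority items that must always be served first, while enqueue() carries a cost so items can share bandwidth fairly. A toy model of that distinction (not Ceph's actual PrioritizedQueue; the fair side is reduced to plain FIFO per priority to keep the sketch short):

#include <list>
#include <map>
#include <utility>

template <typename T>
class ToyPrioQueue {
  std::map<int, std::list<T>> strict;  // always served first, highest priority first
  std::map<int, std::list<T>> fair;    // served after strict; cost is ignored here
public:
  void enqueue_strict(int prio, T item) { strict[prio].push_back(std::move(item)); }
  void enqueue(int prio, unsigned cost, T item) { (void)cost; fair[prio].push_back(std::move(item)); }

  bool empty() const { return strict.empty() && fair.empty(); }

  // precondition: !empty()
  T dequeue() {
    auto take = [](std::map<int, std::list<T>>& qs) {
      auto p = qs.rbegin();                // highest-priority bucket
      T item = std::move(p->second.front());
      p->second.pop_front();
      if (p->second.empty()) qs.erase(p->first);
      return item;
    };
    return !strict.empty() ? take(strict) : take(fair);
  }
};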
Now let's see how the subscriber's (dispatcher's) outgoing messages are put into out_q:
Messenger::ms_deliver_dispatch()
  --> OSD::ms_dispatch()
    --> OSD::_dispatch()
      --> handling of some command?? (no example given)
        --> SimpleMessenger::send_message()
          --> SimpleMessenger::_send_message()
            --> SimpleMessenger::submit_message()
              --> Pipe::_send()
                --> cond.Signal() wakes the writer thread
void OSD::_dispatch(Message *m)
{
  ceph_assert(osd_lock.is_locked());
  dout(20) << "_dispatch " << m << " " << *m << dendl;

  switch (m->get_type()) {
  // -- don't need OSDMap --

  // map and replication
  case CEPH_MSG_OSD_MAP:
    handle_osd_map(static_cast<MOSDMap*>(m));
    break;

  // osd
  case MSG_OSD_SCRUB:
    handle_scrub(static_cast<MOSDScrub*>(m));
    break;

  case MSG_COMMAND:
    handle_command(static_cast<MCommand*>(m));
    return;

  // -- need OSDMap --

  case MSG_OSD_PG_CREATE:
    {
      OpRequestRef op = op_tracker.create_request<OpRequest, Message*>(m);
      if (m->trace)
        op->osd_trace.init("osd op", &trace_endpoint, &m->trace);
      // no map? starting up?
      if (!osdmap) {
        dout(7) << "no OSDMap, not booted" << dendl;
        logger->inc(l_osd_waiting_for_map);
        waiting_for_osdmap.push_back(op);
        op->mark_delayed("no osdmap");
        break;
      }

      // need OSDMap
      dispatch_op(op);
    }
  }
}
Receiving messages
The receive flow: the Pipe's reader thread reads a Message off the socket and puts it into the receive queue; the dispatch thread then takes the Message off the queue and hands it to the Dispatcher for processing.
Sending messages
The send flow: the Dispatcher (e.g. the OSD) calls Messenger::send_message(), which goes through SimpleMessenger::submit_message() to Pipe::_send(); the Message is appended to Pipe::out_q and cond.Signal() wakes the writer thread, which encodes the message and writes it to the socket via Pipe::write_message() --> ::sendmsg().
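For completeness, a hedged sketch of what a replying Dispatcher can look like (assuming the Dispatcher/Connection interfaces of the SimpleMessenger era; EchoDispatcher and the choice of MPing as the reply are illustrative, not from this article):

#include "msg/Dispatcher.h"
#include "messages/MPing.h"

class EchoDispatcher : public Dispatcher {
public:
  explicit EchoDispatcher(CephContext *cct) : Dispatcher(cct) {}

  bool ms_dispatch(Message *m) override {
    // Handle the request, then queue a reply on the same connection.
    // Connection::send_message() ends up in Pipe::out_q and signals the writer.
    m->get_connection()->send_message(new MPing());
    m->put();      // drop our reference to the request
    return true;   // we consumed the message
  }

  bool ms_handle_reset(Connection *con) override { return false; }
  void ms_handle_remote_reset(Connection *con) override {}
  bool ms_handle_refused(Connection *con) override { return false; }
};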
Other references
The post 解析Ceph: 网络层的处理 gives a brief introduction to Ceph's network layer, but its description of the relationship between Pipe and Connection seems inaccurate: Pipe is the wrapper around the socket, while Connection is the higher-level, more abstract object.