go netpoll源码解析

前言

　　网络一直是一个老生常谈的内容，谈及网络编程，始终离不开网络编程的五种模型：阻塞IO(Blocking IO)，非阻塞IO(Nonblocking IO)，多路复用IO(Multiplexing IO)，事件驱动IO(Signal driven IO)和异步IO(Asyncronous IO)。而构建一个高性能的网络库，基本都是基于多路复用(epoll, kqueue)或异步IO来实现，在IO之后，通过线程池来实现IO数据处理。go 语言内置的网络库netpoll，整体框架上也是采用这样一种模式。在IO收发上，底层封装了epoll, kqueue, IOCP等操作系统级别的多路复用和异步IO。在IO数据处理上，得益于goroutine的轻量级和高效的并发调度模式，不是采用线程池的思路，而是其特有的one goroutine per connection(每一个连接启用一个goroutine进行处理)。本文主要从源码角度(以epoll为例)分析go netpoll的设计和封装，解释其one goroutine per connection工作原理及其优劣性，然后针对go netpoll的缺点总结现在更为适合的主从reactor模式。期间会涉及网络编程相关知识(主要是多路复用IO)，go语言使用与runtime调度的基本原理，需要读者有一定的背景知识。

1. 基本编程模式与数据结构

　　首先直观对比一下go net的网络编程和epoll的异同。关于epoll基础和IO处理模式，可以参考文献[1][2]。使用go net网络编程，流程如下：

package main

import (
    "log"
    "net"
)

func main() {
    Listener, err := net.Listen("tcp", "localhost:9000")
    if err != nil {
        log.Println("Listen err: ", err)
        return
    }
    for {
        conn, err := Listener.Accept()
        if err != nil {
            log.Println("Accept err: ", err)
            continue
        }
        go HandleConn(conn)
    }
}

func HandleConn(conn net.Conn) {
    // 处理函数, 调用Read()->处理业务逻辑->调用Write写回
}

可见，go的网络编程是极其简单的，在短短的几行内，底层封装了epoll需要的所有操作，主要包括：

调用socket()创建服务端socket，并进行bind()；
调用epoll_create创建epfd，将服务端socket绑定到epfd，并进行监听；
每当新的连接到来，调用accept绑定客户端socket，注册到epfd。

所有的操作，都离不开两个重要的组成：socket描述符和epoll描述符epfd。go netpoll对这些内容进行了封装，主要形成了以下结构体：

// netFD, 可以认为是go net的网络描述符
type netFD struct {
pfd poll.FD // 底层封装了socket描述符

// immutable until Close
family int
sotype int
isConnected bool // handshake completed or use of association with peer
net string
laddr Addr
raddr Addr
}

type FD struct {
// Lock sysfd and serialize access to Read and Write methods.
fdmu fdMutex

// System file descriptor. Immutable until Close.
Sysfd int // 保存linux的fd

// I/O poller.
pd pollDesc // pollDesc变量，用该变量的指针，将所有的socket封装成链表，每个socket对应的goroutine，也封装在pollDesc中，可以说是go net的灵魂部分

// Writev cache.
iovecs *[]syscall.Iovec

// Semaphore signaled when file is closed.
csema uint32

// Non-zero if this file has been set to blocking mode.
isBlocking uint32

// Whether this is a streaming descriptor, as opposed to a
// packet-based descriptor like a UDP socket. Immutable.
IsStream bool

// Whether a zero byte read indicates EOF. This is false for a
// message based socket connection.
ZeroReadIsEOF bool

// Whether this is a file rather than a network socket.
isFile bool
}

//pollDesc
type pollDesc struct {
link *pollDesc // in pollcache, protected by pollcache.lock. pollDesc是一个链表，每当1有新的socket连接时，就会有一个新的pollDesc

// The lock protects pollOpen, pollSetDeadline, pollUnblock and deadlineimpl operations.
// This fully covers seq, rt and wt variables. fd is constant throughout the PollDesc lifetime.
// pollReset, pollWait, pollWaitCanceled and runtime·netpollready (IO readiness notification)
// proceed w/o taking the lock. So closing, everr, rg, rd, wg and wd are manipulated
// in a lock-free way by all operations.
// TODO(golang.org/issue/49008): audit these lock-free fields for continued correctness.
// NOTE(dvyukov): the following code uses uintptr to store *g (rg/wg),
// that will blow up when GC starts moving objects.
lock mutex // protects the following fields，用于对pd进行加锁保护
fd uintptr // 保存网络文件描述符
closing bool
everr bool // marks event scanning error happened
user uint32 // user settable cookie
rseq uintptr // protects from stale read timers
rg uintptr // pdReady, pdWait, G waiting for read or nil. Accessed atomically. 用于表示监听的fd是否就绪，只有pdReady, pdWait，等待读取的G，nil取值
rt timer // read deadline timer (set if rt.f != nil)
rd int64 // read deadline
wseq uintptr // protects from stale write timers
wg uintptr // pdReady, pdWait, G waiting for write or nil. Accessed atomically. 作用同rg
wt timer // write deadline timer
wd int64 // write deadline
self *pollDesc // storage for indirect interface. See (*pollDesc).makeArg.
}

其中netFD可以认为是go net 的网络描述符，包含了对应的socket；而pollDesc则是更底层的结构体，通过pollDesc组成的链表，将所有socket链接到一起，并且将每个socket绑定的goroutine，封装到pollDesc.wg和pollDesc.rg中(当某个goroutine处于等待IO就绪状态时，pollDesc.rg和wg就用于保存该socket对应的goroutine的结构体g)。然后net.Listen反回的Listener，net.Accept返回的net.Conn，都包含了netFD结构体：

// TCPListener is a TCP network listener. Clients should typically
// use variables of type Listener instead of assuming TCP.
type TCPListener struct {
    fd *netFD
    lc ListenConfig
}

// TCPConn is an implementation of the Conn interface for TCP network
// connections.
type TCPConn struct {
    conn
}

type conn struct {
    fd *netFD
}

接下来重点介绍，go net是如何通过net.Listen和net.Accept实现以上所有操作的。

2. 等待端(阻塞端)

　　等待端就是当epoll监听的读写事件到来前，阻塞当前的goroutine；当读写事件到来时，调用对应的处理函数(Accept, Read, Write)。这个过程涵盖了epoll编程中的：

服务端socket创建，绑定；epoll的初始化；将服务端socket添加到epoll的监听队列中。都通过net.Listen完成。
通过调用accept函数，将服务端goroutine阻塞，直到客户端连接到来。然后将新的客户端连接添加到epoll的监听队列(这个过程跟服务端socket添加到epoll的监听队列一致，底层也是复用同一个函数pollDesc.init)。
调用Read，Write函数，将客户端goroutine阻塞，直到客户端读写事件到来，执行业务逻辑(这个过程跟服务端socket的accept逻辑一致，底层也是复用同一个函数pollDesc.WaitRead和pollDesc.waitWrite)。

下面分别分析对应的go net实现。

2.1 net.Listen

　　Listen函数首先创建了服务端监听描述符servsock，用这个servsock初始化自己的netFD；然后在listenStream()中实现servsock的绑定；调用netFD.init()，在netFD.init()中首先创建了epoll监听描述符epfd，并创建了一个管道用于跟epfd进行通信(该管道主要用于打断epoll_wait的阻塞过程)；然后在netFD.init()中实现对servsock的监听，并初始化了pollDesc结构体返回。

以上步骤基本上是对epoll使用过程的封装，对比epoll使用过程，每个阶段的具体实现如下：

2.1.1 创建servsocket

// linux中
servsock=socket(PF_INET,SOCK_STREAM,0) // 创建socket

/* go netpoll中
* 调用链：Listen() -> ListenConfig.Listen() -> listenTCP() -> internetSocket() -> socket() 
* socket()函数如下
*/ 
func socket(ctx context.Context, net string, family, sotype, proto int, ipv6only bool, laddr, raddr sockaddr, ctrlFn func(string, string, syscall.RawConn) error) (fd *netFD, err error) {
   s, err := sysSocket(family, sotype, proto) //创建servsock
   if err != nil {
      return nil, err
   }
   if err = setDefaultSockopts(s, family, sotype, ipv6only); err != nil {
      poll.CloseFunc(s)
      return nil, err
   }
   
   // 初始化netFD
   if fd, err = newFD(s, family, sotype, net); err != nil {
      poll.CloseFunc(s)
      return nil, err
   }
   ...
   // 调用listenStream()
    if laddr != nil && raddr == nil {
       switch sotype {
       case syscall.SOCK_STREAM, syscall.SOCK_SEQPACKET:
          if err := fd.listenStream(laddr, listenerBacklog(), ctrlFn); err != nil {
             fd.Close()
             return nil, err
          }
          return fd, nil
       case syscall.SOCK_DGRAM:
          if err := fd.listenDatagram(laddr, ctrlFn); err != nil {
             fd.Close()
             return nil, err
          }
          return fd, nil
       }
    }
    ...
}

2.1.2 绑定servsocket

// Linux中
bind(servsock,(struct sockaddr*)&servaddr,sizeof(servaddr) // 将servsock与一个sockaddr的结构体绑定

/* go netpoll中
*/
func (fd *netFD) listenStream(laddr sockaddr, backlog int, ctrlFn func(string, string, syscall.RawConn) error) error {
   var err error
   if err = setDefaultListenerSockopts(fd.pfd.Sysfd); err != nil {
      return err
   }
   var lsa syscall.Sockaddr
   if lsa, err = laddr.sockaddr(fd.family); err != nil {
      return err
   }
   if ctrlFn != nil {
      c, err := newRawConn(fd)
      if err != nil {
         return err
      }
      if err := ctrlFn(fd.ctrlNetwork(), laddr.String(), c); err != nil {
         return err
      }
   }
   if err = syscall.Bind(fd.pfd.Sysfd, lsa); err != nil {
      return os.NewSyscallError("bind", err)
   }
   if err = listenFunc(fd.pfd.Sysfd, backlog); err != nil {
      return os.NewSyscallError("listen", err)
   }
   // 在fd.init中实现创建epfd，监听servsock
   if err = fd.init(); err != nil {
      return err
   }
   lsa, _ = syscall.Getsockname(fd.pfd.Sysfd)
   fd.setAddr(fd.addrFunc()(lsa), nil)
   return nil
}

2.1.3 创建epoll实例epfd

// linux中
epoll_event event
event.events=EPOLLIN //输入监听
event.data.fd=servsock
epfd=epoll_create(20) //创建epoll实例

/* go netpoll中
* 调用链：netFD.init()->pollDesc.init()-> runtime_pollServerInit()->netpollinit()
*/

// pollDesc.init()
func (pd *pollDesc) init(fd *FD) error {
    // serverInit定义为 var serverInit sync.Once，保证在init函数中runtime_pollServerInit只执行一次；后续的net.Accpet也会调用pollDesc.init，主要目的是调用runtime_pollOpen将客户端连接添加到epfd中
   serverInit.Do(runtime_pollServerInit) // 创建epfd，底层根据操作系统，调用netpollinit()
   ctx, errno := runtime_pollOpen(uintptr(fd.Sysfd)) // 在epfd中监听servsock
   if errno != 0 {
      if ctx != 0 {
         runtime_pollUnblock(ctx)
         runtime_pollClose(ctx)
      }
      return errnoErr(syscall.Errno(errno))
   }
   pd.runtimeCtx = ctx
   return nil
}

// 创建epfd并创建一个管道跟epfd通信, 等效于Linux的epoll_create
func netpollinit() {
   epfd = epollcreate1(_EPOLL_CLOEXEC) // 创建epfd实例
   if epfd < 0 {
      epfd = epollcreate(1024)
      if epfd < 0 {
         println("runtime: epollcreate failed with", -epfd)
         throw("runtime: netpollinit failed")
      }
      closeonexec(epfd)
   }
   r, w, errno := nonblockingPipe() // 创建管道，用于打断epoll_wait的监听等待
   if errno != 0 {
      println("runtime: pipe failed with", -errno)
      throw("runtime: pipe failed")
   }
   ev := epollevent{
      events: _EPOLLIN,
   }
   *(**uintptr)(unsafe.Pointer(&ev.data)) = &netpollBreakRd
   errno = epollctl(epfd, _EPOLL_CTL_ADD, r, &ev) // 管道用于监听读事件
   if errno != 0 {
      println("runtime: epollctl failed with", -errno)
      throw("runtime: epollctl failed")
   }
   netpollBreakRd = uintptr(r)
   netpollBreakWr = uintptr(w)
}

go netpoll底层通过识别不同的操作系统，对netpollinit函数有不同的实现，比如在Linux中，调用的是epoll_create(), Mac OS为kqueue(), Windows为IOCP，其他函数在底层调用时，也会根据操作系统自动绑定到具体的实现上。

2.1.4 将servsock添加到epfd的监听对象中

// linux中
epoll_ctl(epfd,EPOLL_CTL_ADD,servsock,&event); //向epfd添加fd的输入监听事件

/* go netpoll中
* 调用链：pollDesc.init() -> runtime_pollOpen() -> poll_runtime_pollOpen() -> netpollopen()
*/

func poll_runtime_pollOpen(fd uintptr) (*pollDesc, int) {
   pd := pollcache.alloc() // 申请一个pollDesc, pollDesc是一个链表结构，预先申请好的内存
   lock(&pd.lock)
   wg := atomic.Loaduintptr(&pd.wg)
   if wg != 0 && wg != pdReady {
      throw("runtime: blocked write on free polldesc")
   }
   rg := atomic.Loaduintptr(&pd.rg)
   if rg != 0 && rg != pdReady {
      throw("runtime: blocked read on free polldesc")
   }
   pd.fd = fd // 将servsock绑定到pollDesc
   pd.closing = false
   pd.everr = false
   pd.rseq++
   atomic.Storeuintptr(&pd.rg, 0)
   pd.rd = 0
   pd.wseq++
   atomic.Storeuintptr(&pd.wg, 0)
   pd.wd = 0
   pd.self = pd
   unlock(&pd.lock) // 初始化 pd.rg和pd.wg，这两个变量用于指示pd.fd是否已经存在可读和可写事件，用于唤阻塞在这个fd上的goroutine

   errno := netpollopen(fd, pd) // 将servsock添加到epfd的监视队列中
   if errno != 0 {
      pollcache.free(pd)
      return nil, int(errno)
   }
   return pd, 0
}

// netpollopen的具体实现
func netpollopen(fd uintptr, pd *pollDesc) int32 {
   var ev epollevent
   ev.events = _EPOLLIN | _EPOLLOUT | _EPOLLRDHUP | _EPOLLET
   *(**pollDesc)(unsafe.Pointer(&ev.data)) = pd // 将pollDesc结构体绑定到epoll_event.data上
   return -epollctl(epfd, _EPOLL_CTL_ADD, int32(fd), &ev) // 调用epollctl
}

这一部分会反复用到，因为每来一个新的socket连接，都会使用相同的步骤将其添加到epfd中。

总结

　　pollDesc.init()函数在新建一个socket的时候都会用到(包括服务端和客户端socket创建)，主要做的事情就是初始化netFD和pollDesc结构体，并将pollDesc添加到epoll的等待队列中（通过epoll_event.data=pollDesc）。其中服务端和客户端socket连接唯一的不同在于服务端pollDesc.init()通过sync.Once()语义执行了epoll_create()，并绑定了一个pipe()到epoll的监听队列中。

注：epoll在create时添加了一个pipe监听，主要原因是为了能打断epoll_wait()，比如timer过期检查的时候。通过调用netpollBreak()往pipe中写入数据，epoll_wait()由于pipe有写事件就会返回。

2.2 net.Accept

　　net.Accept用于处理客户端的连接请求，主要处理的的步骤如下：

创建客户端连接的socket，用于初始化客户端netFD。
调用netFD.init()函数，由net.Listen的分析可知，这一步主要用于将客户端连接的socket添加到epfd中。

对应的源码如下：

// Listener.Accept调用的是netFD.cccept。
func (fd *netFD) accept() (netfd *netFD, err error) {
   d, rsa, errcall, err := fd.pfd.Accept() // 底层为linux accept，获取到客户端socket
   if err != nil {
      if errcall != "" {
         err = wrapSyscallError(errcall, err)
      }
      return nil, err
   }
　　// 初始化netFD
   if netfd, err = newFD(d, fd.family, fd.sotype, fd.net); err != nil {
      poll.CloseFunc(d)
      return nil, err
   }
   // 跟net.Listen()中的net.init()是一个函数，主要用于将客户端socket添加到epfd中
   if err = netfd.init(); err != nil { 
      netfd.Close()
      return nil, err
   }
   lsa, _ := syscall.Getsockname(netfd.pfd.Sysfd)
   netfd.setAddr(netfd.addrFunc()(lsa), netfd.addrFunc()(rsa))
   return netfd, nil
}

如果编写过epoll的编程代码，应该知道accept()和read() ,write()本质都是在监听的IO端口上，发生了EPOLLIN事件，所以它们底层做的事情都是一样的，都是循环“读取数据->处理数据->写回处理结果”的过程，fd.pfd.Accept()的底层实现如下：

// Accept wraps the accept network call.
func (fd *FD) Accept() (int, syscall.Sockaddr, string, error) {
   if err := fd.readLock(); err != nil {
      return -1, nil, "", err
   }
   defer fd.readUnlock()  // 添加读写锁

    // prepareRead主要检查是否读写超时
   if err := fd.pd.prepareRead(fd.isFile); err != nil {
      return -1, nil, "", err
   }
   for {
       // 由于服务端socket设置成了非阻塞模式，所以会立即返回，因此需要for循环
      s, rsa, errcall, err := accept(fd.Sysfd) 
      if err == nil {
         return s, rsa, "", err
      }
      switch err {
      case syscall.EINTR:
         continue
      case syscall.EAGAIN:
         if fd.pd.pollable() {
            //  waitRead将当前goroutine挂起，直到有读取事件到来时，再度恢复运行
            if err = fd.pd.waitRead(fd.isFile); err == nil {
               continue
            }
         }
      case syscall.ECONNABORTED:
         // This means that a socket on the listen
         // queue was closed before we Accept()ed it;
         // it's a silly error, so try again.
         continue
      }
      return -1, nil, errcall, err
   }
}

　　可见，Accept()函数就是在for循环中，等待客户端连接的到来；在客户端连接到来之前，通过waitRead()让Accept()对应的goroutine挂起；等客户端连接来临时，再调用linux的accept()函数，获取客户端连接。以上Accept()函数，还有几个小细节：

1. 为什么每个socket加读写锁

　　一般的socket编程，在单线程的情况下，读写是基本不可能并发执行的，只有在读写分离的情况下，一个线程用于读，一个线程用于写，某个socket才会同时读写。但是go net的编程逻辑是one goroutine per connection，对外表现成同步阻塞，一个goroutine一旦在等待读或写的状态，goroutine就被挂起了，因此相当于是禁止了读写同时进行的，因此添加读写锁，防止应用层代码不正确的使用。

2. pollDesc.waitRead()函数的作用

　　该方法实际上是让等待读写的goroutine挂起，直到发生相应的读写事件再唤醒。这个会在Read()和Write()部分详细介绍。

2.3 Read和Write

　　通过net.Listen和net.Accept，一个服务端socket fd成功添加到了epfd中，每当有新的客户端来临，将其添加到epoll的监听队列，并生成一个专门的goroutine处理该客户端socket的读写。客户端的读写过程本质上是跟accept()是一致的，都是挂起对应的goroutine，等待读写事件来临，底层都是调用pollDesc.waitRead和pollDesc.waitWrite。以Read()函数为例，调用过程为：

netFD.Read()->FD.Read()，源码如下：

func (fd *FD) Read(p []byte) (int, error) {
   // 添加读写锁 
   if err := fd.readLock(); err != nil {
      return 0, err
   }
   defer fd.readUnlock()
   if len(p) == 0 {
      // If the caller wanted a zero byte read, return immediately
      // without trying (but after acquiring the readLock).
      // Otherwise syscall.Read returns 0, nil which looks like
      // io.EOF.
      // TODO(bradfitz): make it wait for readability? (Issue 15735)
      return 0, nil
   }
   // 通过pollDesc.prepareRead()检查fd并初始化对应的fd状态
   if err := fd.pd.prepareRead(fd.isFile); err != nil {
      return 0, err
   }
   if fd.IsStream && len(p) > maxRW {
      p = p[:maxRW]
   }
   for {
      // 调用系统的读写函数
      n, err := ignoringEINTRIO(syscall.Read, fd.Sysfd, p)
      if err != nil {
         n = 0
         if err == syscall.EAGAIN && fd.pd.pollable() {
             // 没有发生读写事件，通过waitRead()挂起当前goroutine
            if err = fd.pd.waitRead(fd.isFile); err == nil {
               continue
            }
         }
      }
      err = fd.eofError(n, err)
      return n, err
   }
}

整个过程跟Accpet基本一致，调用过程的伪代码都是：

{
    pollDesc.prepareRead() 
    ...
    for { 
        read()/write()/accpet()
        if no_read/no_write/no_accpet {
            pollDesc.waitRead()/waitWrite()
        }
    }
}

而pollDesc.waitRead（pollDesc.waitWrite）就是等待端的主角，它将一个socket和其对应的goroutine联系到了一起。

2.4 pollDesc.waitRead和pollDesc.waitWrite

　　waitRead()的目的是挂起当前的goroutine，直到读写事件的到来，这是将非阻塞包装成同步阻塞的关键。调用链为pollDesc.waitRead()->pollDesc.wait()->runtime_pollWait()->poll_runtime_pollWait()，源码如下：

func poll_runtime_pollWait(pd *pollDesc, mode int) int {
   errcode := netpollcheckerr(pd, int32(mode))
   if errcode != pollNoError {
      return errcode
   }
   // As for now only Solaris, illumos, and AIX use level-triggered IO.
   if GOOS == "solaris" || GOOS == "illumos" || GOOS == "aix" {
      netpollarm(pd, mode)
   }
   for !netpollblock(pd, int32(mode), false) {
      errcode = netpollcheckerr(pd, int32(mode))
      if errcode != pollNoError {
         return errcode
      }
      // Can happen if timeout has fired and unblocked us,
      // but before we had a chance to run, timeout has been reset.
      // Pretend it has not happened and retry.
   }
   return pollNoError
}

// returns true if IO is ready, or false if timedout or closed
// waitio - wait only for completed IO, ignore errors
// Concurrent calls to netpollblock in the same mode are forbidden, as pollDesc
// can hold only a single waiting goroutine for each mode.
func netpollblock(pd *pollDesc, mode int32, waitio bool) bool {
   // gpp指向pd对应的rg或wg，在介绍pollDesc时描述过，rg和wg只有pdReady, pdWait, 等待读写的G或nil四种状态，用于表示该io是否可用，从而选择挂起或唤醒对应的goroutine
   gpp := &pd.rg
   if mode == 'w' {
      gpp = &pd.wg
   }

   // set the gpp semaphore to pdWait
   for {
      // 已经就绪了，直接返回true
      // Consume notification if already ready.
      if atomic.Casuintptr(gpp, pdReady, 0) {
         return true
      }
      // 否则，将gpp设置为pdWait
      if atomic.Casuintptr(gpp, 0, pdWait) {
         break
      }

      // Double check that this isn't corrupt; otherwise we'd loop
      // forever.
      if v := atomic.Loaduintptr(gpp); v != pdReady && v != 0 {
         throw("runtime: double wait")
      }
   }

   // need to recheck error states after setting gpp to pdWait
   // this is necessary because runtime_pollUnblock/runtime_pollSetDeadline/deadlineimpl
   // do the opposite: store to closing/rd/wd, membarrier, load of rg/wg
   if waitio || netpollcheckerr(pd, mode) == pollNoError {
      // 通过gopark()挂起当前的goroutine，直到新的读写事件来临
      gopark(netpollblockcommit, unsafe.Pointer(gpp), waitReasonIOWait, traceEvGoBlockNet, 5)
   }
   // 新的读写事件到来，获取gpp的值，判断其是否为pdReady，并将pollDesc.rd重置
   // be careful to not lose concurrent pdReady notification
   old := atomic.Xchguintptr(gpp, 0)
   if old > pdWait {
      throw("runtime: corrupted polldesc")
   }
   return old == pdReady
}

可以看到，底层是通过gopark来阻塞对应的goroutine的，gopark内部直接调用runtime调度相关的函数，用于挂起或执行对应的goroutine。

2.4.1 gopark如何挂起或执行goroutine

gopark的源代码如下：

func gopark(unlockf func(*g, unsafe.Pointer) bool, lock unsafe.Pointer, reason waitReason, traceEv byte, traceskip int) {
    if reason != waitReasonSleep {
        checkTimeouts() // timeouts may expire while two goroutines keep the scheduler busy
    }
    mp := acquirem()
    gp := mp.curg
    status := readgstatus(gp)
    if status != _Grunning && status != _Gscanrunning {
        throw("gopark: bad g status")
    }
    mp.waitlock = lock
    mp.waitunlockf = unlockf
    gp.waitreason = reason
    mp.waittraceev = traceEv
    mp.waittraceskip = traceskip
    releasem(mp)
    // can't do anything that might move the G between Ms here.
    mcall(park_m)
}

gopark调用了park_m，而park_m的源码如下：

// park continuation on g0.
func park_m(gp *g) {
    _g_ := getg()

    if trace.enabled {
        traceGoPark(_g_.m.waittraceev, _g_.m.waittraceskip)
    }

    casgstatus(gp, _Grunning, _Gwaiting)
    dropg()
　　// 调用gopark传递进来的netpollblockcommit函数
    if fn := _g_.m.waitunlockf; fn != nil {
        ok := fn(gp, _g_.m.waitlock)
        _g_.m.waitunlockf = nil
        _g_.m.waitlock = nil
        if !ok {
            if trace.enabled {
                traceGoUnpark(gp, 2)
            }
            casgstatus(gp, _Gwaiting, _Grunnable)
　　　　　　　// 就绪，执行对应的goroutine
            execute(gp, true) // Schedule it back, never returns.
        }
    }
　　//未就绪，调用schedule重新调度

　　schedule() 
}

　　理解gopark和park_m的关联有一个关键点。首先，gopark将当前的goroutine.m.waitunlockf设置为了netpollblockcommit函数(通过gopark函数调用时，传入的unlockf参数为netpollblockcommit)，而park_m首先获取了当前的g，然后调用了g.m.waitunlockf（即gopark传进去的netpollblockcommit）。可见，gopark和park_m通过当前的goroutine这一个关系联系到了一起（执行gopark和park_m的为同一个goroutine）, 最终执行了netpollblockcommit。而netpollblockcommit函数做的事情，就是把每个socket对应goroutine绑定到对应的pollDesc.rg(wg)中。源码如下：

// netpollblockcommit检查当前的状态是否是pdWait：是，则将当前的gorouttine的*g放入pollDesc.rg或wg中，返回true；否，则说明当前的goroutine等待的事件已经ready，返回false
func netpollblockcommit(gp *g, gpp unsafe.Pointer) bool {
   r := atomic.Casuintptr((*uintptr)(gpp), pdWait, uintptr(unsafe.Pointer(gp))) // gpp即netpollblock中，将pollDesc.rg（wg）赋值给了它，此处将当前goroutine的结构体g赋值给了gpp
   if r {
      // Bump the count of goroutines waiting for the poller.
      // The scheduler uses this to decide whether to block
      // waiting for the poller if there is nothing else to do.
      atomic.Xadd(&netpollWaiters, 1)
   }
   return r
}

　　至此，等待端的所有事情都已经完成。通过在netpollopen中，将epoll的监听事件epoll_event.data绑定到对应的pollDesc结构体上；而在waitRead，waitWrite中，将goroutine绑定到pollDesc.wg和rg中，并挂起对应的goroutine（当socket无就绪事件时），完成等待端的所有操作。当一个socket就绪时，就能直接从epoll_event.data.wg（rg）获取到就绪的goroutine，加入runtime调度。而这就是就绪执行端做的事情。

3. 就绪执行端

　　前面已经分析了等待端的行为，而就绪执行端(epoll_wait)只需要在epoll_wait响应返回之后，取出对应的ev.data.rg（wg），重新加入runtime调度队列即可。这个过程就是通过netpoll实现的。netpoll对应的源码如下：

// netpoll checks for ready network connections.
// Returns list of goroutines that become runnable.
// delay < 0: blocks indefinitely
// delay == 0: does not block, just polls
// delay > 0: block for up to that many nanoseconds
// gList表示所有的就绪goroutine
func netpoll(delay int64) gList {
   if epfd == -1 {
      return gList{}
   }
   // 计算epoll_wait的等待时间
   var waitms int32
   if delay < 0 {
      waitms = -1
   } else if delay == 0 {
      waitms = 0
   } else if delay < 1e6 {
      waitms = 1
   } else if delay < 1e15 {
      waitms = int32(delay / 1e6)
   } else {
      // An arbitrary cap on how long to wait for a timer.
      // 1e9 ms == ~11.5 days.
      waitms = 1e9
   }
   var events [128]epollevent
retry:
    // 调用epoll_wait
   n := epollwait(epfd, &events[0], int32(len(events)), waitms)
   if n < 0 {
      if n != -_EINTR {
         println("runtime: epollwait on fd", epfd, "failed with", -n)
         throw("runtime: netpoll failed")
      }
      // If a timed sleep was interrupted, just return to
      // recalculate how long we should sleep now.
      if waitms > 0 {
         return gList{}
      }
      goto retry
   }
   var toRun gList
   for i := int32(0); i < n; i++ {
      ev := &events[i]
      if ev.events == 0 {
         continue
      }
    // 判断epoll_wait是否是被打断
      if *(**uintptr)(unsafe.Pointer(&ev.data)) == &netpollBreakRd {
         if ev.events != _EPOLLIN {
            println("runtime: netpoll: break fd ready for", ev.events)
            throw("runtime: netpoll: break fd ready for something unexpected")
         }
         if delay != 0 {
            // netpollBreak could be picked up by a
            // nonblocking poll. Only read the byte
            // if blocking.
            var tmp [16]byte
            read(int32(netpollBreakRd), noescape(unsafe.Pointer(&tmp[0])), int32(len(tmp)))
            atomic.Store(&netpollWakeSig, 0)
         }
         continue
      }

      var mode int32
      if ev.events&(_EPOLLIN|_EPOLLRDHUP|_EPOLLHUP|_EPOLLERR) != 0 {
         mode += 'r'
      }
      if ev.events&(_EPOLLOUT|_EPOLLHUP|_EPOLLERR) != 0 {
         mode += 'w'
      }
      if mode != 0 {
         // 因为pollDesc保存在epoll_event.data中，而goroutine保存在pollDesc.rg和pollDeesc.wg中，通过epoll_event.data就能拿到阻塞的goroutine
         pd := *(**pollDesc)(unsafe.Pointer(&ev.data))
         pd.everr = false
         if ev.events == _EPOLLERR {
            pd.everr = true
         }
         // 将就绪的goroutines放到就绪队列
         netpollready(&toRun, pd, mode)
      }
   }
   return toRun
}

// netpollready()唤醒就绪的goroutine
func netpollready(toRun *gList, pd *pollDesc, mode int32) {
   var rg, wg *g
   if mode == 'r' || mode == 'r'+'w' {
      // netpollunblock()负责从pollDesc中取出goroutine(g)
      rg = netpollunblock(pd, 'r', true)
   }
   if mode == 'w' || mode == 'r'+'w' {
      wg = netpollunblock(pd, 'w', true)
   }
   if rg != nil {
      toRun.push(rg)
   }
   if wg != nil {
      toRun.push(wg)
   }
}

// netpollunblock()负责从pollDesc中取出goroutine(g)
func netpollunblock(pd *pollDesc, mode int32, ioready bool) *g {
   gpp := &pd.rg
   if mode == 'w' {
      gpp = &pd.wg
   }

   for {
      old := atomic.Loaduintptr(gpp)
      // 已经是就绪事件，没有新的就绪事件发生，就没有需要唤醒的g
      if old == pdReady {
         return nil
      }
      if old == 0 && !ioready {
         // Only set pdReady for ioready. runtime_pollWait
         // will check for timeout/cancel before waiting.
         return nil
      }
      var new uintptr
      if ioready {
         new = pdReady
      }
      // 将pollDesc.rg置为pdReady并返回g
      if atomic.Casuintptr(gpp, old, new) {
         if old == pdWait {
            old = 0
         }
         return (*g)(unsafe.Pointer(old))
      }
   }
}

相对于等待端，就绪执行端的函数直接很多。就是epoll_wait检查就绪的socket，取出就绪的ev.data.rg，然后返回所有就绪的gList。

3.1 何时调用netpoll

　　netpoll只是返回了就绪的gList，但是还需要有函数来调用它，将就绪的gList重新放回runtime调度。换言之，唤醒goroutine是一个runtime重新调度的过程。

搜索netpoll的调用位置，发现调用netpoll的代码都在runtime/proc下（即golang的调度程序），调用函数和场景有以下几个：

1. findrunnable()函数。实际上，findrunnable()是在golang调度器调用 schedule()进行调度时，会找到当前就绪的goroutine，放入本地或全局的p中(G-M-P调度模型)。该函数的相关代码如下：

// schedule()函数调度时会调用findrunnable()找到就绪goroutine放入p中
func findrunnable() (gp *g, inheritTime bool) {
    ...
    // 优先检查是否已经有网络io就绪。此函数在本地P上没有可执行的G时，在steal from others 之前执行，如果有网络io就绪，立即唤醒并调度该goroutine。因此nepoll(0)（非阻塞调用）
    // Poll network.
    // This netpoll is only an optimization before we resort to stealing.
    // We can safely skip it if there are no waiters or a thread is blocked
    // in netpoll already. If there is any kind of logical race with that
    // blocked thread (e.g. it has already returned from netpoll, but does
    // not set lastpoll yet), this thread will do blocking netpoll below
    // anyway.
    if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 {
        // 拿到就绪的goroutine，通过injectglist()将就绪的goroutine放入p中进行调度
       if list := netpoll(0); !list.empty() { // non-blocking
          gp := list.pop()
          injectglist(&list)
          casgstatus(gp, _Gwaiting, _Grunnable)
          if trace.enabled {
             traceGoUnpark(gp, 0)
          }
          return gp, false
       }
    }
    ...
    
    // 再次调用netpoll，此次是阻塞调用
    if netpollinited() && (atomic.Load(&netpollWaiters) > 0 || pollUntil != 0) && atomic.Xchg64(&sched.lastpoll, 0) != 0 {
       atomic.Store64(&sched.pollUntil, uint64(pollUntil))
       if _g_.m.p != 0 {
          throw("findrunnable: netpoll with p")
       }
       if _g_.m.spinning {
          throw("findrunnable: netpoll with spinning")
       }
       delay := int64(-1)
       if pollUntil != 0 {
          if now == 0 {
             now = nanotime()
          }
          delay = pollUntil - now
          if delay < 0 {
             delay = 0
          }
       }
       if faketime != 0 {
          // When using fake time, just poll.
          delay = 0
       }
       list := netpoll(delay) // block until new work is available
       atomic.Store64(&sched.pollUntil, 0)
       atomic.Store64(&sched.lastpoll, uint64(nanotime()))
       if faketime != 0 && list.empty() {
          // Using fake time and nothing is ready; stop M.
          // When all M's stop, checkdead will call timejump.
          stopm()
          goto top
       }
       lock(&sched.lock)
       _p_ = pidleget()
       unlock(&sched.lock)
       if _p_ == nil {
          injectglist(&list)
       } else {
          acquirep(_p_)
          if !list.empty() {
             gp := list.pop()
             injectglist(&list)
             casgstatus(gp, _Gwaiting, _Grunnable)
             if trace.enabled {
                traceGoUnpark(gp, 0)
             }
             return gp, false
          }
          if wasSpinning {
             _g_.m.spinning = true
             atomic.Xadd(&sched.nmspinning, 1)
          }
          goto top
       }
    }
}

2. 在sysmon监控中调用。sysmon是golang设置的监控线程(是一个系统级的daemon)，独立于G-M-P的所有模型之外，不与逻辑P绑定而直接在M上运行。他跟其他G-M-P的关系如下：

sysmon会检查距离上一次netpoll()调用是否超过了10ms，超过则会调用netpoll()唤醒可执行的G。具体的代码如下：

func sysmon() {
    ...
    lastpoll := int64(atomic.Load64(&sched.lastpoll))
    if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
       atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
       list := netpoll(0) // non-blocking - returns list of goroutines
       if !list.empty() {
          // Need to decrement number of idle locked M's
          // (pretending that one more is running) before injectglist.
          // Otherwise it can lead to the following situation:
          // injectglist grabs all P's but before it starts M's to run the P's,
          // another M returns from syscall, finishes running its G,
          // observes that there is no work to do and no other running M's
          // and reports deadlock.
          incidlelocked(-1)
          injectglist(&list)
          incidlelocked(1)
       }
    }
    ...
}

注：等待端和就绪端中间是通过golang的runtime调度(G-M-P调度模型)联系到一起的，等待端挂起goroutine，就绪端重新将goroutine放入调度队列，此部分需要对GPM调度模型有个基本的了解，才能理解整个过程。GMP的调度可以参看文献[3]。

4. 总结

4.1 go netpoll执行流

　　Golang netpoll底层其实就是对epoll， kqueue等IO多路复用模型的封装，实现one goroutine per connection。对于这种one goroutine per connection，其实就是多线程IO处理模型，不同点在于：

goroutine是golang特有的轻量级线程，在用户空间进行调度，结合golang的runtime调度模型，其调度所需资源远小于线程。因此使得之前不适用的多线程IO处理重新大放异彩。
通过封装，golang利用其runtime调度，将IO等待与IO读写执行分离开来。通过pollDeesc.waitRead()和pollDesc.waitWrite()，将goroutine加入到epoll的等待队列，完成等待端的处理；通过netpoll()，将goroutine从等待端唤醒，重新放入就绪可调度的状态；而连接等待和执行的，就是golang强大的G-M-P调度模型。整个调度过程如下图:

4.2 关于go netpoll带来的启示

　　个人觉得，golang netpoll最精妙的设计在于对epoll的封装跟自身的runtime调度的完美结合。一个epoll程序的通用编写流程如下：

void ListenAndServe() {
    // 1. 创建并绑定服务端socket servsock
    // 2. 绑定servsock到epoll
    epoll_event event;
    event.events=EPOLLIN; //输入监听
    event.data.fd=servsock;    
    int epfd=epoll_create(20); //创建epoll例程
    epoll_ctl(epfd,EPOLL_CTL_ADD,servsock,&event); //向epfd添加fd的输入监听事件
    struct epoll_event* events;
    events=(epoll_event*)malloc(sizeof(struct epoll_event)*EPOLL_SIZE); //申请存放发生事件的数组
    // 3. 循环监听
    while(1) 
    {
        int event_cnt=epoll_wait(epfd,events,EPOLL_SIZE,-1); //设置无限等待
        if(event_cnt==-1)
             printf("epoll_wait() error");         
             for(int i=0;i<event_cnt;i++)
             {   
                 // 4. 服务端连接请求
                 if(events[i].data.fd==servsock)  //说明是新的客户端请求
                 {
                     // 调用accept进行连接并注册新的epoll监听
                     clntsock=accept(...);
                     event.events=EPOLLIN;                
                     event.data.fd=clntsock; //通过event.data.fd存放clntsock，用于标记返回的event属于哪个socket
                     epoll_ctl(epfd,EPOLL_CTL_ADD,clntsock,&event); //将新申请的连接加入epfd
                 } 
                 else 
                 {
                     // 5. 客户端读写事件，进行相应的处理
                 }
             }
        }
        ...
    }           
}

　　以上是单线程处理epoll的过程，相关变体也仅在于后续处理过程，比如变成多线程/线程池；读取和计算分离等。当时理所当然的认为epoll_event.data的作用就是用来存放对应的socket的。但是本质上epoll_event.data只是一个结构体指针，如果将其用来存放自己封装的结构体变量（当然封装的结构体里必须包含socket描述符用于区分是哪个socket的读写事件），则后续的所有流程，都能基于这个自定义结构体进行扩展。golang的netpoll就是在epoll_event.data中存储了封装的pollDesc结构体，而pollDesc.rg和pollDesc.wg又直接标记了绑定到该socket的goroutine结构体，从而将epoll的等待状态能直接反映到对应的goroutine上，然后利用golang的调度器对就绪的goroutine进行调度。这样的设计在所有基于epoll的网络通信模型中，都可以借鉴。

4.3 优缺点分析

　　golang充分利用了其goroutine的轻量级和多goroutine并行的特点，基于IO复用，实现了其netpoll网络库。由于goroutine的轻量级特性，单机goroutine数量可以轻松超越多线程模型好几个数量级。但是goroutine的数量也不是无限制的，即使goroutine再轻量化，其context switch也会消耗资源。当goroutine的数量超过一定标准（比如几万个goroutine同时运行），runtime调度性能会急剧下降，导致goroutine积压。简单的说，golang的netpoll没有实现服务降级。当连接数达到上限之后，就不应该继续增加goroutine的数量，而是应该让后续连接排队，即用固定数量的协程池来处理网络请求。

4.4 主从Reactor模型

　　go netpoll对于编写小规模网络应用，应该是绰绰有余的。但是，对于大型互联网公司的大量http和rpc请求，无限制的开启goroutine且没有服务降级是致命的，当大量的请求同时到达时，会直接导致golang的runtime调度瘫痪，因此大型公司往往都会基于操作系统的多路复用，封装自己的网络库。其中比较经典的模型是主从Reactor模型。主从Reactor模型通常由一个主reactor线程，一个从actor线程（或者多个），一个worker线程池组成。主reactor负责服务端socket的监听事件，并将连接的客户端socket封装成connection交给从reactor；从reactor专门负责接收read/write事件，将收到的数据发送给worker进行业务处理；worker即一个线程池（协程池），负责处理读写之后的业务逻辑。如下图所示：