But we’re still not done. Because generating and reading the select() bit arrays takes time proportional to the largest fd that you provided for select(), the select() call scales terribly when the number of sockets is high.
Different operating systems have provided different replacement functions for select. These include poll(), epoll(), kqueue(), evports, and /dev/poll. All of these give better performance than select(), and all but poll() give O(1) performance for adding a socket, removing a socket, and for noticing that a socket is ready for IO.
Unfortunately, none of the efficient interfaces is a ubiquitous standard. Linux has epoll(), the BSDs (including Darwin) have kqueue(), Solaris has evports and /dev/poll… and none of these operating systems has any of the others. So if you want to write a portable high-performance asynchronous application, you’ll need an abstraction that wraps all of these interfaces, and provides whichever one of them is the most efficient.
And that’s what the lowest level of the Libevent API does for you. It provides a consistent interface to various select() replacements, using the most efficient version available on the computer where it’s running.
Here’s yet another version of our asynchronous ROT13 server. This time, it uses Libevent 2 instead of select(). Note that the fd_sets are gone now: instead, we associate and disassociate events with a struct event_base, which might be implemented in terms of select(), poll(), epoll(), kqueue(), etc.
On the userspace side, generating and reading the bit arrays takes time proportional to the number of fds that you provided for select(). But on the kernel side, reading the bit arrays takes time proportional to the largest fd in the bit array, which tends to be around the total number of fds in use in the whole program, regardless of how many fds are added to the sets in select().
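Because the maximum number of fds is fixed at compile time, whenever select() traps into the kernel, the structure it is handed to describe the file descriptors to check is a bit array with one bit per possible fd. Scanning this bit array is not efficient, so it is no surprise that the cost grows with the fd numbers involved ("takes time proportional to the largest fd that you provided for select(), the select() call scales terribly when the number of sockets is high").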
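To make the "largest fd" point concrete, here is a minimal sketch of a select() call in C. The helper name count_readable is hypothetical (it is not part of Libevent or any library); it only illustrates that the first argument to select() is the highest fd plus one, which is the range of bits the kernel has to scan, and that every fd must fit below FD_SETSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: poll the given fds once (zero timeout) for
 * readability. Returns select()'s result (number of ready fds, 0 if none,
 * -1 on error) and stores the first ready fd, or -1, in *first_ready. */
int count_readable(const int *fds, size_t count, int *first_ready)
{
    fd_set readset;
    int maxfd = -1;
    struct timeval no_wait = {0, 0};

    *first_ready = -1;
    FD_ZERO(&readset);
    for (size_t i = 0; i < count; ++i) {
        /* An fd_set is a fixed-size bit array: fds >= FD_SETSIZE do not fit. */
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &readset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* The first argument is the highest fd + 1, so bits 0..maxfd are
     * examined even if only one of them was set. */
    int n = select(maxfd + 1, &readset, NULL, NULL, &no_wait);
    if (n <= 0)
        return n;

    /* select() overwrote readset; walk our list to see which bits survived. */
    for (size_t i = 0; i < count; ++i) {
        if (FD_ISSET(fds[i], &readset)) {
            *first_ready = fds[i];
            break;
        }
    }
    return n;
}
```

Calling it with the read end of a pipe returns 0 before anything is written and 1 afterwards. Note that the set has to be rebuilt on every call, which is the userspace half of the cost described above.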
The following is a discussion of bit arrays:
Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:
- OR can be used to set a bit to one: 11101010 OR 00000100 = 11101110
- AND can be used to set a bit to zero: 11101010 AND 11111101 = 11101000
- AND together with zero-testing can be used to determine if a bit is set:
  - 11101010 AND 00000001 = 00000000 = 0
  - 11101010 AND 00000010 = 00000010 ≠ 0
- XOR can be used to invert or toggle a bit:
  - 11101010 XOR 00000100 = 11101110
  - 11101110 XOR 00000100 = 11101010
To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by the appropriate number of places, as well as bitwise negation if necessary.
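The four operations and the shift-based mask can be sketched in C as follows. The function names are illustrative only, and an 8-bit word is assumed so that the values match the binary examples in the list (11101010 is 0xEA).

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with only bit `pos` set (pos 0 = least significant bit)
 * by shifting the number 1 left by `pos` places. */
static inline uint8_t bit_mask(unsigned pos)
{
    assert(pos < 8);
    return (uint8_t)(1u << pos);
}

/* OR with the mask sets the bit to one. */
uint8_t bit_set(uint8_t w, unsigned pos)
{
    return (uint8_t)(w | bit_mask(pos));
}

/* AND with the negated mask sets the bit to zero. */
uint8_t bit_clear(uint8_t w, unsigned pos)
{
    return (uint8_t)(w & (uint8_t)~bit_mask(pos));
}

/* AND with the mask plus a zero test tells whether the bit is set. */
int bit_test(uint8_t w, unsigned pos)
{
    return (w & bit_mask(pos)) != 0;
}

/* XOR with the mask inverts the bit. */
uint8_t bit_toggle(uint8_t w, unsigned pos)
{
    return (uint8_t)(w ^ bit_mask(pos));
}
```

For example, bit_set(0xEA, 2) gives 0xEE (11101110) and bit_clear(0xEA, 1) gives 0xE8 (11101000), matching the OR and AND lines above.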
We can view a bit array as a subset of {1,2,...,n}, where a 1 bit indicates a number in the set and a 0 bit a number not in the set. This set data structure uses about n/w words of space, where w is the number of bits in each machine word. Whether the least significant bit or the most significant bit indicates the smallest-index number is largely irrelevant, but the former tends to be preferred.
Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either.
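A minimal sketch of these word-wise set operations in C, assuming 32-bit words (w = 32) and caller-provided arrays of nwords words each. The function names are made up for illustration. Each loop performs one bit operation per word, except the difference, which needs a NOT and an AND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* out = a UNION b: one OR per word. */
void bits_union(uint32_t *out, const uint32_t *a, const uint32_t *b, size_t nwords)
{
    assert(out && a && b);
    for (size_t i = 0; i < nwords; ++i)
        out[i] = a[i] | b[i];
}

/* out = a INTERSECT b: one AND per word. */
void bits_intersection(uint32_t *out, const uint32_t *a, const uint32_t *b, size_t nwords)
{
    assert(out && a && b);
    for (size_t i = 0; i < nwords; ++i)
        out[i] = a[i] & b[i];
}

/* out = a MINUS b: a NOT and an AND per word (hence 2n/w operations). */
void bits_difference(uint32_t *out, const uint32_t *a, const uint32_t *b, size_t nwords)
{
    assert(out && a && b);
    for (size_t i = 0; i < nwords; ++i)
        out[i] = a[i] & ~b[i];
}

/* out = complement of a: one NOT per word. */
void bits_complement(uint32_t *out, const uint32_t *a, size_t nwords)
{
    assert(out && a);
    for (size_t i = 0; i < nwords; ++i)
        out[i] = ~a[i];
}
```

If n is not a multiple of w, the complement also flips the unused padding bits in the last word, so a real implementation would likely mask those off.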
Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:
- They are extremely compact; few other data structures can store n independent pieces of data in n/w words.
- They allow small arrays of bits to be stored and manipulated in the register set for long periods of time with no memory accesses.
- Because of their ability to exploit bit-level parallelism, limit memory access, and maximally use the data cache, they often outperform many other data structures on practical data sets, even those that are more asymptotically efficient.
However, bit arrays aren't the solution to everything (pay close attention to these weaknesses). In particular:
- Without compression, they are wasteful set data structures for sparse sets (those with few elements compared to their range) in both time and space. For such applications, compressed bit arrays, Judy arrays, tries, or even Bloom filters should be considered instead.
- Accessing individual elements can be expensive and difficult to express in some languages. If random access is more common than sequential and the array is relatively small, a byte array may be preferable on a machine with byte addressing. A word array, however, is probably not justified due to the huge space overhead and additional cache misses it causes, unless the machine only has word addressing.
epoll is a scalable I/O event notification mechanism for Linux, first introduced in Linux 2.5.44 [1]. It is meant to replace the older POSIX select(2) and poll(2) system calls, to achieve better performance in more demanding applications, where the number of watched file descriptors is large (unlike the older system calls, which operate at O(n), epoll operates in O(1) [2]). epoll is similar to FreeBSD's kqueue, in that it operates on a configurable kernel object, exposed to user space as a file descriptor of its own.
epoll provides both edge-triggered and level-triggered modes. In edge-triggered mode, a call to epoll_wait will return only when a new event is enqueued with the epoll object, while in level-triggered mode, epoll_wait will return as long as the condition holds.
For instance, if a pipe registered with epoll has received data, a call to epoll_wait will return, signaling the presence of data to read. Suppose the reader consumed only part of the data from the buffer. In level-triggered mode, further calls to epoll_wait will return immediately, as long as the pipe's buffer contains data to be read. In edge-triggered mode, however, epoll_wait will return only once new data is written to the pipe.
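The pipe scenario can be reproduced with a short Linux-only sketch. The function events_after_partial_read is a hypothetical helper written for this note: it registers the read end of a pipe with epoll, writes two bytes, waits once, reads only one byte, and then reports what a second, non-blocking epoll_wait returns. Pass 0 for level-triggered mode or EPOLLET for edge-triggered mode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of events the second epoll_wait()
 * reports after a partial read, or -1 if any setup step fails. */
int events_after_partial_read(uint32_t trigger_flags)
{
    int pipefd[2];
    int result = -1;
    char byte;
    struct epoll_event ev, out;

    assert(trigger_flags == 0 || trigger_flags == EPOLLET);
    if (pipe(pipefd) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        goto close_pipe;

    /* Register the read end; EPOLLET (if given) selects edge-triggered mode. */
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | trigger_flags;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != 0)
        goto close_all;

    /* Two bytes arrive; the first wait reports the pipe as readable. */
    if (write(pipefd[1], "ab", 2) != 2)
        goto close_all;
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto close_all;

    /* Consume only part of the data, leaving one byte in the buffer. */
    if (read(pipefd[0], &byte, 1) != 1)
        goto close_all;

    /* Timeout 0: report what is pending right now without blocking. */
    result = epoll_wait(epfd, &out, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

In level-triggered mode the second wait should still report the pipe, because one byte remains unread. In edge-triggered mode it should report nothing, because no new data has been written since the first notification.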
Here is another reference:
http://kovyrin.net/2006/04/13/epoll-asynchronous-network-programming/
epoll is a new system call introduced in Linux 2.6. It is designed to replace the deprecated select (and also poll). Unlike these earlier system calls, which are O(n), epoll is an O(1) algorithm – this means that it scales well as the number of watched file descriptors increases. select uses a linear search through the list of watched file descriptors, which causes its O(n) behaviour, whereas epoll uses callbacks in the kernel file structure.
Another fundamental difference of epoll is that it can be used in an edge-triggered, as opposed to level-triggered, fashion. This means that you receive “hints” when the kernel believes the file descriptor has become ready for I/O, as opposed to being told “I/O can be carried out on this file descriptor”. This has a couple of minor advantages: kernel space doesn’t need to keep track of the state of the file descriptor, although it might just push that problem into user space, and user space programs can be more flexible (e.g. the readiness change notification can just be ignored).
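One practical consequence of pushing state tracking into user space, which is a common convention rather than something the quoted article spells out, is that an edge-triggered reader usually puts the descriptor in non-blocking mode and keeps reading until read() fails with EAGAIN. Otherwise leftover data might never trigger another notification. A minimal sketch, with the hypothetical helper name drain_nonblocking:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode and read until the
 * kernel buffer is empty (EAGAIN) or EOF is reached. Returns the number of
 * bytes discarded, or -1 on error. A real program would process the data
 * instead of throwing it away. */
ssize_t drain_nonblocking(int fd)
{
    char buf[4096];
    ssize_t total = 0;

    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;          /* got data, keep going */
            continue;
        }
        if (n == 0)
            break;               /* EOF: the writer closed its end */
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;               /* buffer is empty: safe to wait again */
        return -1;               /* genuine error */
    }
    return total;
}
```

After this returns, the program can call epoll_wait again and expect the next edge only when new data arrives.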
Here is also a performance comparison of poll and epoll:
http://lse.sourceforge.net/epoll/index.html#testing