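1. Theory
Zero copy is a key technique in server-side network programming, and it comes up in almost any I/O performance optimization. In the Java world, the commonly used zero-copy mechanisms are mmap (memory mapping) and sendFile. "Zero copy" does not mean that nothing is copied; it means that there are no CPU copies, while DMA copies remain unavoidable. From the operating system's point of view, no data is duplicated between kernel buffers (only the kernel buffer holds a copy of the data).
2. Baseline BIO test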
import java.io.File;
import java.io.RandomAccessFile;
import java.net.ServerSocket;
import java.net.Socket;

public class NIOSocket {
    public static void main(String[] args) throws Exception {
        // Create the server socket
        ServerSocket serverSocket = new ServerSocket(8080);
        System.out.println("serverSocket init 8080 ==== ");
        while (true) {
            Thread.sleep(1 * 1000);
            // Once a connection is accepted, write the data back to it
            Socket socket = serverSocket.accept();
            System.out.println("socket; " + socket.getRemoteSocketAddress());
            // Read the file's data into a byte array in user space
            File file = new File("index.html");
            RandomAccessFile raf = new RandomAccessFile(file, "rw");
            byte[] arr = new byte[(int) file.length()];
            raf.read(arr);
            socket.getOutputStream().write(arr);
        }
    }
}
1. The content of index.html is as follows:
[root@192 zerocopy]# cat index.html
index hello
2. Run the program under strace on Linux
[root@192 zerocopy]# strace -ff -o out ../jdk8/jdk1.8.0_291/bin/java NIOSocket
serverSocket init 8080 ====
3. List the out files
[root@192 zerocopy]# ll
total 2568
-rw-r--r--. 1 root root      12 Jul 23 04:25 index.html
-rw-r--r--. 1 root root    1389 Jul 24 20:17 NIOSocket.class
-rw-r--r--. 1 root root     916 Jul 24 20:17
-rw-r--r--. 1 root root   12828 Jul 24 20:21 out.51588
-rw-r--r--. 1 root root 1278064 Jul 24 20:21 out.51589
-rw-r--r--. 1 root root   13861 Jul 24 20:22 out.51590
-rw-r--r--. 1 root root    1614 Jul 24 20:21 out.51591
-rw-r--r--. 1 root root    1558 Jul 24 20:21 out.51592
-rw-r--r--. 1 root root    1145 Jul 24 20:21 out.51593
-rw-r--r--. 1 root root    9446 Jul 24 20:22 out.51594
-rw-r--r--. 1 root root    1190 Jul 24 20:21 out.51595
-rw-r--r--. 1 root root  387057 Jul 24 20:22 out.51596
4. Connect with nc to test
[root@192 zerocopy]# nc localhost 8080
index hello
5. Inspect the file out.51589
...
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 5
...
bind(5, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(5, 50) = 0
...
...
accept(5, {sa_family=AF_INET6, sin6_port=htons(53122), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 6
...
open("index.html", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
fstat64(7, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
stat64("index.html", {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
read(7, "index hello\n", 12) = 12
...
...
send(6, "index hello\n", 12, 0) = 12
...
6. The program calls read to load the content of index.html into a byte array, then calls write to send those bytes to the socket. For these two calls, the following happens at the OS level:
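1. The read call triggers a switch from user mode to kernel mode, and the first copy begins: the DMA engine reads index.html from disk and places the data in the kernel buffer. (DMA, Direct Memory Access, means the data is moved into memory by the DMA engine rather than by the CPU, which frees the CPU for other work.)
2. The second copy takes place: the data is copied from the kernel buffer to the user buffer, together with a context switch from kernel mode back to user mode.
3. The third copy takes place: we call write, and the system copies the data from the user buffer to the socket buffer. This involves another context switch, from user mode to kernel mode.
4. The fourth copy: the data is copied asynchronously from the socket buffer to the network protocol engine by the DMA engine. No context switch is needed for this step.
5. The write call returns, switching from kernel mode back to user mode once more.
In total, the steps above involve 4 copies (2 DMA copies and 2 CPU copies) and 4 context switches. Reducing this overhead needs support from the kernel in the form of more efficient system calls.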
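The read/write path described above can be sketched in Java as a loop that moves every chunk through a byte array in user space. This is only an illustrative sketch: the class name BufferedCopySketch, the method copyThroughUserBuffer, the 8192-byte buffer size, and the use of a ByteArrayOutputStream in place of a real socket stream are all assumptions made for the example, not part of the original test program.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedCopySketch {

    /** Copies src to out through a user-space byte[] and returns the number of bytes copied. */
    static long copyThroughUserBuffer(Path src, OutputStream out) throws IOException {
        byte[] userBuffer = new byte[8192]; // the user buffer of steps 2 and 3 (size is arbitrary)
        long total = 0;
        try (InputStream in = Files.newInputStream(src)) {
            int n;
            while ((n = in.read(userBuffer)) != -1) { // kernel buffer -> user buffer
                out.write(userBuffer, 0, n);          // user buffer -> destination (a socket in the article)
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("index", ".html");
        try {
            Files.write(tmp, "index hello\n".getBytes(StandardCharsets.UTF_8));
            // Hypothetical stand-in for socket.getOutputStream()
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            long copied = copyThroughUserBuffer(tmp, sink);
            System.out.println("copied " + copied + " bytes");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Every iteration pays for one read and one write system call, which is the cost that mmap and sendFile try to reduce.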
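3. mmap optimization
mmap uses memory mapping to map the file into the kernel buffer, and user space shares that kernel-space data. The principle behind this kind of I/O is to map the memory address of the user buffer to the memory address of the kernel buffer, so that code running in user mode can read and operate on kernel-space data directly. As a result, a network transfer needs one fewer copy from kernel space to user space.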
MMAP(2)                 Linux Programmer's Manual                 MMAP(2)

NAME
       mmap, munmap - map or unmap files or devices into memory

SYNOPSIS
       #include <sys/mman.h>

       void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);
       int munmap(void *addr, size_t length);

       See NOTES for information on feature test macro requirements.

DESCRIPTION
       mmap() creates a new mapping in the virtual address space of the
       calling process. The starting address for the new mapping is
       specified in addr. The length argument specifies the length of
       the mapping.
       ...

RETURN VALUE
       On success, mmap() returns a pointer to the mapped area. On error,
       the value MAP_FAILED (that is, (void *) -1) is returned, and errno
       is set appropriately. On success, munmap() returns 0, on failure
       -1, and errno is set (probably to EINVAL).
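With mmap, the process is as follows:
The user buffer and the kernel buffer share index.html. To send index.html from disk to the network, the data no longer has to be copied into user space and then copied again from user space into the socket buffer.
Now only the copy from the kernel buffer to the socket buffer is needed. This removes one memory copy (from 4 down to 3), but it does not reduce the number of context switches.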
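In Java, the mmap path is reached through FileChannel.map, which returns a MappedByteBuffer that can be handed to a channel's write method. The following is a minimal sketch under stated assumptions: the names MmapSendSketch and sendMapped are hypothetical, the file is assumed to be smaller than 2 GB (a single mapping cannot exceed Integer.MAX_VALUE bytes), and a channel wrapped around a ByteArrayOutputStream stands in for the SocketChannel that a real server would use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSendSketch {

    /** Maps src read-only and writes the mapped region to out; returns the bytes written. */
    static long sendMapped(Path src, WritableByteChannel out) throws IOException {
        try (FileChannel ch = FileChannel.open(src, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) {
                return 0; // nothing to map or send
            }
            // The mapping lets user code address the file's pages without a read() into a byte[]
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long written = 0;
            while (mapped.hasRemaining()) {
                written += out.write(mapped); // a write may be partial, so loop until drained
            }
            return written;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("index", ".html");
        try {
            Files.write(tmp, "index hello\n".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            // Hypothetical stand-in for a SocketChannel
            try (WritableByteChannel out = Channels.newChannel(sink)) {
                System.out.println("sent " + sendMapped(tmp, out) + " bytes");
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The stand-in channel copies the bytes onto the heap, so only a real SocketChannel as the target would show the reduced copy count described above.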
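4. sendFile optimization
Linux introduced the sendfile system call during the 2.1 development series (the sendfile(2) man page documents it as a new feature in Linux 2.2). The basic principle: the data never passes through user space at all, but goes straight from the kernel buffer into the socket buffer. Because user space is not involved, the number of context switches also drops.
During the sendFile system call, the DMA engine copies the data from the file into the kernel buffer, and the data is then copied from the kernel buffer into the socket buffer. No context switch happens at this point, because both buffers live in kernel space.
Finally, the data moves from the socket buffer into the protocol stack.
At this stage the data goes through 3 copies and 2 context switches (one system call: entering the kernel and returning from it).
5. Further sendFile optimization
In version 2.4, Linux made further changes that avoid copying the data from the kernel buffer into the socket buffer; the data goes directly to the protocol stack, which removes one more copy. This depends on the network hardware supporting gather operations.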
SENDFILE(2)             Linux Programmer's Manual             SENDFILE(2)

NAME
       sendfile - transfer data between file descriptors

SYNOPSIS
       #include <sys/sendfile.h>

       ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

DESCRIPTION
       sendfile() copies data between one file descriptor and another.
       Because this copying is done within the kernel, sendfile() is more
       efficient than the combination of read(2) and write(2), which would
       require transferring data to and from user space.

       in_fd should be a file descriptor opened for reading and out_fd
       should be a descriptor opened for writing.

       If offset is not NULL, then it points to a variable holding the
       file offset from which sendfile() will start reading data from
       in_fd. When sendfile() returns, this variable will be set to the
       offset of the byte following the last byte that was read. If offset
       is not NULL, then sendfile() does not modify the current file
       offset of in_fd; otherwise the current file offset is adjusted to
       reflect the number of bytes read from in_fd.

       If offset is NULL, then data will be read from in_fd starting at
       the current file offset, and the file offset will be updated by the
       call.

       count is the number of bytes to copy between the file descriptors.

       The in_fd argument must correspond to a file which supports
       mmap(2)-like operations (i.e., it cannot be a socket).

       In Linux kernels before 2.6.33, out_fd must refer to a socket.
       Since Linux 2.6.33 it can be any file. If it is a regular file,
       then sendfile() changes the file offset appropriately.

RETURN VALUE
       If the transfer was successful, the number of bytes written to
       out_fd is returned. On error, -1 is returned, and errno is set
       appropriately.
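Getting index.html from the file into the network protocol stack now takes only 2 copies: first the DMA engine copies the data from the file into the kernel buffer, then the data is copied from the kernel buffer to the protocol stack. Only some offset and length information is copied from the kernel buffer into the socket buffer, which costs almost nothing. In other words a CPU copy still exists, but it only copies descriptors such as the data's address and offset, so it can be ignored.
Differences between mmap and sendFile:
1. mmap suits reads and writes of small amounts of data; sendFile suits large file transfers.
2. mmap needs 4 context switches and 3 data copies; sendFile needs 2 context switches (it is a single system call) and as few as 2 data copies.
3. sendFile can rely on DMA to reduce CPU copies; mmap cannot (the data must still be copied from the kernel buffer to the socket buffer).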
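1. mmap (memory-mapped files)
How it works
Memory mapping: mmap maps a file or device directly into the process's address space, so the file content can be accessed like ordinary memory.
Zero copy: with mmap, the file content is shared between user space and kernel space, which reduces the number of data copies.
Random access: efficient random reads and writes are supported, because the file content can be indexed byte by byte like one large array.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MMapExample {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("example.txt", "rw");
             FileChannel channel = raf.getChannel()) {
            // Map the file into memory
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            // Modify the file content
            buffer.put(0, (byte) 'A'); // set the first byte to 'A'
        }
    }
}

2. sendfile
How it works
Zero copy: sendfile reads data from one file descriptor and sends it to another file descriptor (usually a socket) without copying the data into user space.
Fewer context switches: because the data moves entirely inside kernel space, there are fewer switches between user space and kernel space.
Efficient transfer: it is particularly suited to large file transfers, because it avoids unnecessary data copies.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.channels.FileChannel;

public class SendfileExample {
    public static void main(String[] args) throws Exception {
        try (FileInputStream fis = new FileInputStream("source.txt");
             FileOutputStream fos = new FileOutputStream("destination.txt");
             FileChannel sourceChannel = fis.getChannel();
             FileChannel destChannel = fos.getChannel()) {
            // Use transferTo to get sendfile-like behavior
            long position = 0;
            long count = sourceChannel.size();
            sourceChannel.transferTo(position, count, destChannel);
        }
    }
}

Disadvantages
Restricted use: it only works for file-to-file or file-to-socket transfers, not for other kinds of I/O.
Platform dependent: the implementation and performance of sendfile may differ between operating systems.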
6. NIO test
1. The code is as follows:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NIOSocket {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        ServerSocket serverSocket = serverSocketChannel.socket();
        serverSocket.bind(new InetSocketAddress(8080));
        // Ideally SO_REUSEADDR is set before bind(); the original order is kept here
        serverSocket.setReuseAddress(true);
        System.out.println("serverSocketChannel init 8080 !!!");
        while (true) {
            try {
                SocketChannel socketChannel = serverSocketChannel.accept();
                System.out.println("Client connected: " + socketChannel.getRemoteAddress());
                // The file to send
                File file = new File("index.html");
                RandomAccessFile raf = new RandomAccessFile(file, "rw");
                FileChannel channel = raf.getChannel();
                long size = channel.size();
                System.out.println("ready reansfer to !");
                // Hand the transfer over to the kernel
                channel.transferTo(0, size, socketChannel);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2. Test with nc
[root@192 zerocopy]# nc localhost 8080
index hello
3. Check the log printed by the main process
strace -ff -o out ../jdk8/jdk1.8.0_291/bin/java NIOSocket
serverSocketChannel init 8080 !!!
Client connected: /0:0:0:0:0:0:0:1:53126
ready reansfer to !
4. List the out files
[root@192 zerocopy]# ll
total 3064
-rw-r--r--. 1 root root      12 Jul 23 04:25 index.html
-rw-r--r--. 1 root root    1824 Jul 24 21:55 NIOSocket.class
-rw-r--r--. 1 root root    1410 Jul 24 21:54
-rw-r--r--. 1 root root   13099 Jul 24 21:58 out.52182
-rw-r--r--. 1 root root 1490301 Jul 24 21:58 out.52183
-rw-r--r--. 1 root root   56765 Jul 24 21:58 out.52184
-rw-r--r--. 1 root root    1688 Jul 24 21:58 out.52185
-rw-r--r--. 1 root root    1626 Jul 24 21:58 out.52186
-rw-r--r--. 1 root root    5189 Jul 24 21:58 out.52187
-rw-r--r--. 1 root root   19040 Jul 24 21:58 out.52188
-rw-r--r--. 1 root root    1175 Jul 24 21:58 out.52189
-rw-r--r--. 1 root root 1509105 Jul 24 21:58 out.52190
-rw-r--r--. 1 root root    7216 Jul 24 21:58 out.52221
5. Inspect the file out.52183
...
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
...
bind(4, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50)
...
accept(4, {sa_family=AF_INET6, sin6_port=htons(53126), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 6
fcntl64(6, F_GETFL) = 0x2 (flags O_RDWR)
...
open("index.html", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
fstat64(7, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
...
sendfile64(6, 7, [0] => [12], 12) = 12
...
The trace shows that the output is ultimately performed by sendfile64. The arguments passed are the two file descriptors, a pointer to the offset, and the byte count; the file data itself never enters user space.
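Zero copy in NIO is implemented by the transferTo() method, which transfers data from a FileChannel to a writable byte channel (such as a SocketChannel). Internally it is implemented by the native method transferTo0(), which depends on support from the underlying operating system. On UNIX and Linux systems, calling this method results in a sendfile() system call. Its signature is:
// Transfers from src to dst, or returns -2 if kernel can't do that
private native long transferTo0(FileDescriptor src, long position, long count, FileDescriptor dst);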
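The transferTo() path can also be exercised without nc, by serving a file once over a loopback connection inside a single process. The sketch below is an illustration under assumptions: the class TransferToLoopbackSketch and the method serveOnce are hypothetical names, the server binds to an ephemeral loopback port rather than 8080, and transferTo is called in a loop because its documentation allows it to transfer fewer bytes than requested. Whether the JVM actually issues sendfile depends on the platform.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferToLoopbackSketch {

    /** Serves src once over loopback via transferTo and returns what the client received. */
    static byte[] serveOnce(Path src) throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port (an assumption; the article uses 8080)
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress addr = (InetSocketAddress) server.getLocalAddress();

            Thread sender = new Thread(() -> {
                try (SocketChannel peer = server.accept();
                     FileChannel file = FileChannel.open(src, StandardOpenOption.READ)) {
                    long pos = 0;
                    long size = file.size();
                    while (pos < size) {
                        // May send only part of the range, so advance by the returned count
                        pos += file.transferTo(pos, size - pos, peer);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            sender.start();

            ByteArrayOutputStream received = new ByteArrayOutputStream();
            try (SocketChannel client = SocketChannel.open(addr)) {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                while (client.read(buf) != -1) { // -1 once the sender closes its side
                    buf.flip();
                    received.write(buf.array(), 0, buf.limit());
                    buf.clear();
                }
            }
            sender.join();
            return received.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("index", ".html");
        try {
            Files.write(tmp, "index hello\n".getBytes(StandardCharsets.UTF_8));
            byte[] got = serveOnce(tmp);
            System.out.print(new String(got, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Unlike the test server above, this sketch closes the accepted channel after the transfer, which is what lets the client see end of stream.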
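1. Files are large, reads and writes are slow, and speed is the priority
2. Memory is limited, so large amounts of data cannot be loaded
3. Bandwidth is insufficient, that is, other programs or threads are already doing heavy I/O and little bandwidth is left
Addendum: modifying data through NIO mapped (direct) memory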
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NIOSocket {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        ServerSocket serverSocket = serverSocketChannel.socket();
        serverSocket.bind(new InetSocketAddress(8080));
        serverSocket.setReuseAddress(true);
        System.out.println("serverSocketChannel init 8080 !!!");
        while (true) {
            try {
                SocketChannel socketChannel = serverSocketChannel.accept();
                System.out.println("Client connected: " + socketChannel.getRemoteAddress());
                // The file to send (relative path, matching the open() call in the trace below)
                File file = new File("index.html");
                RandomAccessFile raf = new RandomAccessFile(file, "rw");
                FileChannel channel = raf.getChannel();
                System.out.println(" start ... ");
                // Map two bytes past the current end of the file
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size() + 2);
                // Add one char; a change made to the mapping takes effect directly in the underlying file
                buffer.putChar((int) (channel.size() - 2), 'C');
                System.out.println(" end ... ");
                long size = channel.size();
                System.out.println("ready reansfer to !");
                channel.transferTo(0, size, socketChannel);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
(1) Start the program under strace
strace -ff -o out ../jdk8/jdk1.8.0_291/bin/java NIOSocket
serverSocketChannel init 8080 !!!
(2) Connect with nc
[root@192 zerocopy]# nc localhost 8080
index hello
(3) Inspect the out file
...
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
...
bind(4, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50)
...
accept(4, {sa_family=AF_INET6, sin6_port=htons(53132), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 6
fcntl64(6, F_GETFL) = 0x2 (flags O_RDWR)
...
open("index.html", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
fstat64(7, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
...
mmap2(NULL, 14, PROT_READ|PROT_WRITE, MAP_SHARED, 7, 0) = 0xf7714000
...
sendfile64(6, 7, [0] => [14], 14) = 14

MMAP2(2)                Linux Programmer's Manual                MMAP2(2)

NAME
       mmap2 - map files or devices into memory

SYNOPSIS
       #include <sys/mman.h>

       void *mmap2(void *addr, size_t length, int prot, int flags,
                   int fd, off_t pgoffset);

DESCRIPTION
       This is probably not the system call you are interested in;
       instead, see mmap(2), which describes the glibc wrapper function
       that invokes this system call.

       The mmap2() system call provides the same interface as mmap(2),
       except that the final argument specifies the offset into the file
       in 4096-byte units (instead of bytes, as is done by mmap(2)). This
       enables applications that use a 32-bit off_t to map large files (up
       to 2^44 bytes).

RETURN VALUE
       On success, mmap2() returns a pointer to the mapped area. On error
       -1 is returned and errno is set appropriately.

Because a MappedByteBuffer uses off-heap memory, it is not managed by Minor GC and is only reclaimed when a Full GC occurs. DirectByteBuffer improves on this: it is a subclass of MappedByteBuffer, it implements the DirectBuffer interface, and it maintains a Cleaner object to release the memory. It can therefore be reclaimed by a Full GC, or released explicitly by calling clean().
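The effect seen in the trace, where a 12-byte file is mapped with length 14 and then sent as 14 bytes, can be reproduced in isolation. The following is a small sketch under assumptions: MappedWriteSketch and appendCharViaMapping are hypothetical names, a temporary file replaces index.html, and the call to force() is an addition that asks the OS to flush the modified pages. It relies on the JDK growing the file when a READ_WRITE mapping extends past the current end, which is what the trace in this article shows for JDK 8 on Linux.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedWriteSketch {

    /** Maps the file two bytes past its end, writes one char there, and flushes the mapping. */
    static void appendCharViaMapping(Path file, char c) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long oldSize = ch.size();
            // Mapping past the current end of a writable channel extends the file to fit
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_WRITE, 0, oldSize + 2);
            // A Java char occupies 2 bytes; the buffer's default byte order is big-endian
            mapped.putChar((int) oldSize, c);
            // Ask the OS to write the modified pages back to the file
            mapped.force();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("index", ".html");
        try {
            Files.write(tmp, "index hello\n".getBytes(StandardCharsets.UTF_8));
            appendCharViaMapping(tmp, 'C');
            System.out.println("file size is now " + Files.size(tmp) + " bytes");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The write goes through the mapping, with no write() call on the channel, yet the change is visible to anything that reads the file afterwards, including a later transferTo.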
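Addendum: the difference between read/write and recv/send
recv and send provide much the same functionality as read and write, but they operate specifically on socket file descriptors, and they take a fourth argument, flags, that controls the read or write operation.
On Linux, man 2 cmd shows the manual page of each call, as follows: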
1. read
READ(2)                 Linux Programmer's Manual                 READ(2)

NAME
       read - read from a file descriptor

SYNOPSIS
       #include <unistd.h>

       ssize_t read(int fd, void *buf, size_t count);

DESCRIPTION
       read() attempts to read up to count bytes from file descriptor fd
       into the buffer starting at buf.

       On files that support seeking, the read operation commences at the
       current file offset, and the file offset is incremented by the
       number of bytes read. If the current file offset is at or past the
       end of file, no bytes are read, and read() returns zero.

       If count is zero, read() may detect the errors described below. In
       the absence of any errors, or if read() does not check for errors,
       a read() with a count of 0 returns zero and has no other effects.

       If count is greater than SSIZE_MAX, the result is unspecified.

RETURN VALUE
       On success, the number of bytes read is returned (zero indicates
       end of file), and the file position is advanced by this number. It
       is not an error if this number is smaller than the number of bytes
       requested; this may happen for example because fewer bytes are
       actually available right now (maybe because we were close to
       end-of-file, or because we are reading from a pipe, or from a
       terminal), or because read() was interrupted by a signal. On error,
       -1 is returned, and errno is set appropriately. In this case it is
       left unspecified whether the file position (if any) changes.
2. write
WRITE(2)                Linux Programmer's Manual                WRITE(2)

NAME
       write - write to a file descriptor

SYNOPSIS
       #include <unistd.h>

       ssize_t write(int fd, const void *buf, size_t count);

DESCRIPTION
       write() writes up to count bytes from the buffer pointed buf to the
       file referred to by the file descriptor fd.

       The number of bytes written may be less than count if, for example,
       there is insufficient space on the underlying physical medium, or
       the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)),
       or the call was interrupted by a signal handler after having
       written less than count bytes. (See also pipe(7).)

       For a seekable file (i.e., one to which lseek(2) may be applied,
       for example, a regular file) writing takes place at the current
       file offset, and the file offset is incremented by the number of
       bytes actually written. If the file was open(2)ed with O_APPEND,
       the file offset is first set to the end of the file before writing.
       The adjustment of the file offset and the write operation are
       performed as an atomic step.

       POSIX requires that a read(2) which can be proved to occur after a
       write() has returned returns the new data. Note that not all file
       systems are POSIX conforming.

RETURN VALUE
       On success, the number of bytes written is returned (zero indicates
       nothing was written). On error, -1 is returned, and errno is set
       appropriately.
3. recv
RECV(2)                 Linux Programmer's Manual                 RECV(2)

NAME
       recv, recvfrom, recvmsg - receive a message from a socket

SYNOPSIS
       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t recv(int sockfd, void *buf, size_t len, int flags);

       ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags,
                        struct sockaddr *src_addr, socklen_t *addrlen);

       ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);

DESCRIPTION
       The recvfrom() and recvmsg() calls are used to receive messages
       from a socket, and may be used to receive data on a socket whether
       or not it is connection-oriented.

RETURN VALUE
       These calls return the number of bytes received, or -1 if an error
       occurred. In the event of an error, errno is set to indicate the
       error. The return value will be 0 when the peer has performed an
       orderly shutdown.
4. send
NAME
       send, sendto, sendmsg - send a message on a socket

SYNOPSIS
       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t send(int sockfd, const void *buf, size_t len, int flags);

       ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
                      const struct sockaddr *dest_addr, socklen_t addrlen);

       ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);

DESCRIPTION
       The system calls send(), sendto(), and sendmsg() are used to
       transmit a message to another socket.
