An Introduction to the io_uring Asynchronous I/O Framework, with Examples

This article is translated from https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework

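Introduction

This post is a brief introduction to the io_uring asynchronous I/O framework, available starting with release 6 of the Unbreakable Enterprise Kernel (UEK R6). It explains the motivation for the new mechanism and uses sample code to describe its system call and library interfaces. It also offers a set of more detailed technical descriptions and usage examples.

io_uring is a new asynchronous I/O (AIO) interface for Linux, first introduced in upstream kernel 5.1 (merged in March 2019, released in May 2019). It provides a low-latency, feature-rich interface for applications that need AIO functionality but prefer to have the kernel perform the I/O. The reason may be to benefit from running on top of a file system, or to use kernel features such as mirroring and block-level encryption. This contrasts with SPDK-based applications, which explicitly do not want the kernel involved in I/O because they implement features such as file systems themselves.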

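Motivation

The native Linux AIO framework suffers from the following limitations, which io_uring sets out to overcome:

It supports only direct I/O, not buffered I/O.
Its behavior is non-deterministic, and it can block in some scenarios.
Its API is sub-optimal: every I/O needs two system calls, one to submit the request and one to wait for its completion.
- Each submission has to copy 64 + 8 bytes of data, and each completion has to copy 32 bytes.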

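Communication channel

An io_uring instance has two rings, a submission queue (SQ) and a completion queue (CQ), both shared between the kernel and the application. Each queue is single producer, single consumer, and its size is a power of 2.

The queues provide a lock-less interface, coordinated with memory barriers.

The application creates one or more SQ entries (SQEs) and then updates the SQ tail. The kernel consumes the SQEs and updates the SQ head.

The kernel creates a CQ entry (CQE) for each completed request and updates the CQ tail. The application consumes the CQEs and updates the CQ head.

Completion events may arrive in any order, but each one is always associated with a specific SQE.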
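
The head/tail protocol above can be illustrated with a small user-space model. The following is a minimal sketch in C11, not the kernel's implementation: struct spsc_ring and the ring_push() and ring_pop() helpers are hypothetical names, and C11 acquire/release atomics stand in for the memory barriers. It shows why a power-of-2 size is convenient (indices wrap with a mask) and how each side updates only its own pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* must be a power of 2 */
#define RING_MASK    (RING_ENTRIES - 1u)

static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be a power of 2");

/* Hypothetical ring; the real SQ/CQ rings live in memory shared with the kernel. */
struct spsc_ring {
    _Atomic uint32_t head;               /* advanced only by the consumer */
    _Atomic uint32_t tail;               /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: fill a slot, then publish it by advancing the tail. */
bool ring_push(struct spsc_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;                    /* ring is full */
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot contents must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: read a slot, then hand it back by advancing the head. */
bool ring_pop(struct spsc_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;                    /* ring is empty */
    *value = r->slots[head & RING_MASK];
    /* Release: the slot read must finish before the slot can be reused. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The producer only ever writes the tail and the consumer only ever writes the head, which is what lets a single producer and a single consumer cooperate without locks.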

System call API

The io_uring API consists of three system calls: io_uring_setup(2), io_uring_register(2) and io_uring_enter(2), described in the sections below. Each has a complete manual page.

io_uring_setup

Set up a context for performing asynchronous I/O.

int io_uring_setup(u32 entries, struct io_uring_params *params);

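The io_uring_setup() system call sets up a submission queue and a completion queue with at least entries entries each, and returns a file descriptor used to perform subsequent operations on the io_uring instance. The submission and completion queues are shared between the application and the kernel, so no data has to be copied when initiating and completing I/O.

The params argument lets the application configure the io_uring instance. The kernel also uses it to return information about the configured submission and completion queues.

An io_uring instance can be configured in one of three operating modes:

Interrupt driven - By default, the io_uring instance is set up for interrupt-driven I/O. I/O is submitted with io_uring_enter() and reaped by checking the completion queue directly.
Polled - Busy-wait for I/O completion instead of being notified through an asynchronous IRQ (interrupt request). The file system (if any) and the block device must support polling for this to work. Busy-waiting gives lower latency but may consume more CPU than interrupt-driven I/O. Currently this feature is usable only on file descriptors opened with the O_DIRECT flag. After a read or write is submitted to a polled context, the application must poll for completions on the CQ ring by calling io_uring_enter(). Mixing polled and non-polled I/O on the same io_uring instance is illegal.
Kernel polled - In this mode a kernel thread is created to poll the submission queue. An instance configured this way lets the application issue I/O without a context switch into the kernel. By filling SQEs into the submission queue and watching the completion queue for completions, the application can submit and reap I/O without making any system call. If the kernel thread is idle for longer than a configurable period, it notifies the application and then goes idle. When that happens, the application must call io_uring_enter() to wake the kernel thread up again. If I/O is kept busy, the kernel thread never sleeps.

On success, io_uring_setup() returns a file descriptor. The application then passes it to mmap(2) to map the submission and completion queues, or to the io_uring_register() and io_uring_enter() system calls.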
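
A rough sketch of the setup step using the raw system call, assuming a Linux system whose headers provide <linux/io_uring.h>. The helper names ring_setup() and map_sq_ring() are made up for this illustration, and the 2000 ms idle time is an arbitrary value. Passing 0 for flags selects the default interrupt-driven mode, while IORING_SETUP_IOPOLL and IORING_SETUP_SQPOLL select the polled and kernel-polled modes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/*
 * Hypothetical wrapper around the raw system call.
 * Returns the ring file descriptor, or -errno on failure.
 */
int ring_setup(unsigned entries, unsigned flags, struct io_uring_params *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->flags = flags;                /* 0, IORING_SETUP_IOPOLL or IORING_SETUP_SQPOLL */
    if (flags & IORING_SETUP_SQPOLL)
        p->sq_thread_idle = 2000;    /* ms before the poller thread goes idle (arbitrary) */

    long fd = syscall(__NR_io_uring_setup, entries, p);
    return fd < 0 ? -errno : (int)fd;
}

/*
 * Map the submission ring that the kernel described in *p.
 * Returns MAP_FAILED on error; *len receives the mapping length.
 */
void *map_sq_ring(int ring_fd, const struct io_uring_params *p, size_t *len)
{
    *len = p->sq_off.array + p->sq_entries * sizeof(__u32);
    return mmap(NULL, *len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
}
```

The completion ring and the SQE array are mapped the same way, at the IORING_OFF_CQ_RING and IORING_OFF_SQES offsets. On kernels or sandboxes without io_uring support, ring_setup() simply returns a negative errno value.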

io_uring_register

Register files or user buffers for asynchronous I/O.

int io_uring_register(unsigned int fd, unsigned int opcode,
                      void *arg, unsigned int nr_args);

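io_uring_register() registers user buffers or files with the io_uring instance identified by fd. Registering files and user buffers lets the kernel hold long-term references to the internal kernel data structures associated with the files, and create long-term mappings of the application memory behind the buffers. Registering once, rather than on every I/O request, reduces the per-I/O overhead.

Registered buffers are locked in memory and are charged against the user's RLIMIT_MEMLOCK resource limit. In addition, each buffer is limited to 1GiB. Currently, buffers must be anonymous, non-file-backed memory, such as memory returned by malloc(3) or by mmap(2) with the MAP_ANONYMOUS flag set. Huge pages are supported as well. Note that the entire huge page is pinned in the kernel even if only a portion of it is used.

It is perfectly valid to set up a large buffer and then use only part of it for an I/O, as long as the range falls within the originally mapped region.

An application can increase or decrease the number or size of registered buffers by first unregistering the existing buffers and then calling io_uring_register() again with the new ones.

An application can dynamically update the set of registered files without unregistering them first.

It is also possible to use eventfd(2) to get notified of completion events on an io_uring instance. To do so, register the eventfd file descriptor through this system call.

The credentials of the running application can be registered with io_uring, which returns an id associated with those credentials. Applications that want to share a ring between separate users or processes can pass this credential id in the SQE personality field. If set, that particular SQE is issued with those credentials.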
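
As an illustration of buffer registration, the following sketch allocates anonymous memory with mmap(2) and registers it through the raw system call. It assumes Linux headers that provide <linux/io_uring.h>; ring_register() and register_anon_buffers() are hypothetical helper names, and error handling is kept to a minimum.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Hypothetical raw wrapper: returns a non-negative value on success, -errno on failure. */
int ring_register(int ring_fd, unsigned opcode, void *arg, unsigned nr_args)
{
    long ret = syscall(__NR_io_uring_register, ring_fd, opcode, arg, nr_args);
    return ret < 0 ? -errno : (int)ret;
}

/*
 * Allocate nr anonymous buffers of len bytes each and register them in one call.
 * The caller owns the mappings recorded in iov[] and must munmap() them,
 * including when this function fails part-way through.
 */
int register_anon_buffers(int ring_fd, struct iovec *iov, unsigned nr, size_t len)
{
    assert(iov != NULL && nr > 0);
    for (unsigned i = 0; i < nr; i++) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return -errno;
        iov[i].iov_base = p;
        iov[i].iov_len = len;
    }
    return ring_register(ring_fd, IORING_REGISTER_BUFFERS, iov, nr);
}
```

To change the set of buffers, the same wrapper would first be called with IORING_UNREGISTER_BUFFERS (with a NULL argument and a count of 0) and then again with IORING_REGISTER_BUFFERS.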

io_uring_enter

Initiate and/or complete asynchronous I/O.

int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);

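io_uring_enter() initiates and completes I/O using the submission and completion queues set up by io_uring_setup(). A single call can both submit new I/O and wait for completions of I/O initiated by this call or by earlier calls to io_uring_enter().

fd is the file descriptor returned by io_uring_setup(). to_submit specifies the number of I/Os to submit from the submission queue. If the application specifies min_complete, the call waits for at least min_complete events before returning. If the io_uring instance was configured for polling, min_complete has a slightly different meaning. Passing 0 asks the kernel to return any events that are already complete, without blocking. If min_complete is non-zero and completion events are already available, the kernel still returns immediately. If none are available, the call polls until one or more completions become available or until the process has exceeded its scheduler time slice.

For interrupt-driven I/O, the application can also check the completion queue for completions without entering the kernel at all.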
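
The two common calling patterns described above can be sketched with a thin wrapper over the raw system call. The helper names ring_enter() and submit_and_wait_one() are hypothetical, and the sketch assumes Linux headers that provide <linux/io_uring.h>. Note that waiting on min_complete requires the IORING_ENTER_GETEVENTS flag, and that the raw system call takes the signal set size as an extra trailing argument.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/*
 * Hypothetical raw wrapper. Returns the number of SQEs consumed,
 * or -errno on failure. No signal mask is passed in this sketch.
 */
int ring_enter(int ring_fd, unsigned to_submit, unsigned min_complete, unsigned flags)
{
    long ret = syscall(__NR_io_uring_enter, ring_fd, to_submit, min_complete,
                       flags, NULL, _NSIG / 8);
    return ret < 0 ? -errno : (int)ret;
}

/* Submit n queued SQEs and block until at least one completion is available. */
int submit_and_wait_one(int ring_fd, unsigned n)
{
    assert(n > 0);
    return ring_enter(ring_fd, n, 1, IORING_ENTER_GETEVENTS);
}
```

Calling ring_enter(ring_fd, n, 0, 0) submits without waiting. In either case the completions themselves are read from the CQ ring in shared memory, not returned by the call.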

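io_uring_enter() supports a wide range of operations, including:

Open, close, and stat files
Read and write into multiple buffers or pre-mapped buffers
Socket I/O operations
Synchronize file state
Asynchronously monitor a set of file descriptors
Create a timeout linked to a specific operation in the ring
Attempt to cancel an operation that is currently in flight
Create I/O chains
- Ordered execution within a chain
- Parallel execution of multiple chains

When the system call returns that a certain number of SQEs have been consumed and submitted, it is safe to reuse those SQE entries in the ring. This holds even if the actual I/O submission had to be punted to an async context, which means the SQE may in fact not have been submitted yet. In that case the kernel makes a private copy of any SQE it still needs later.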

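liburing

liburing provides a simple higher-level API for basic use cases, so applications do not have to deal with the details of the raw system calls. The API also avoids duplicating boilerplate such as setting up an io_uring instance.

For example, after obtaining the ring file descriptor from io_uring_setup(), an application always has to call mmap() to map the submission and completion queues before it can access them, as described in the io_uring_setup() manual page. That whole sequence is rather lengthy, but it can be accomplished with the following liburing call: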

int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags)

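The sample applications below, which are available in the liburing source tree, help illustrate these points.

At the time of writing there is no standalone liburing API documentation; the API is described in the liburing.h header file.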

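Sample application: io_uring-test

io_uring-test reads up to 16KB from a user-specified file using 4 SQEs. Each SQE requests a 4KB read from a fixed file offset. io_uring-test then reaps each CQE and checks whether the full 4KB was read from the file as requested.

If the file is smaller than 16KB, all 4 SQEs are still submitted, but some of the CQE results will show a partial read or a read of zero bytes, depending on the actual size of the file.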
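
To make the short-file case concrete, here is a small helper that computes the result each of the 4 reads would be expected to report for a given file size. The function name expected_res() is hypothetical, and the model ignores I/O errors; it only mirrors the offset arithmetic of the sample.

```c
#include <assert.h>
#include <stdint.h>

#define QD    4
#define BLOCK 4096

/* Bytes the i-th read (at offset i * BLOCK) can return for a file of file_size bytes. */
int64_t expected_res(int64_t file_size, int i)
{
    assert(i >= 0 && i < QD);
    int64_t remaining = file_size - (int64_t)i * BLOCK;

    if (remaining <= 0)
        return 0;                       /* read starts at or past end of file */
    return remaining < BLOCK ? remaining : BLOCK;
}
```

For a 10000-byte file this gives 4096, 4096, 1808 and 0 for the four SQEs: two full reads, one partial read, and one read of zero bytes.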

Finally, io_uring-test reports the number of SQEs and CQEs it processed.

The complete code follows:

/* SPDX-License-Identifier: MIT */
/*
 * Simple app that demonstrates how to setup an io_uring interface,
 * submit and complete IO against it, and then tear it down.
 *
 * gcc -Wall -O2 -D_GNU_SOURCE -o io_uring-test io_uring-test.c -luring
 */
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include "liburing.h"
 
#define QD  4
 
int main(int argc, char *argv[])
{
    struct io_uring ring;
    int i, fd, ret, pending, done;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec *iovecs;
    off_t offset;
    void *buf;
 
    if (argc < 2) {
        printf("%s: file\n", argv[0]);
        return 1;
    }
 
    ret = io_uring_queue_init(QD, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }
 
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
 
    iovecs = calloc(QD, sizeof(struct iovec));
    for (i = 0; i < QD; i++) {
        if (posix_memalign(&buf, 4096, 4096))
            return 1;
        iovecs[i].iov_base = buf;
        iovecs[i].iov_len = 4096;
    }
 
    offset = 0;
    i = 0;
    do {
        sqe = io_uring_get_sqe(&ring);
        if (!sqe)
            break;
        io_uring_prep_readv(sqe, fd, &iovecs[i], 1, offset);
        offset += iovecs[i].iov_len;
        i++;
    } while (1);
 
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        return 1;
    }
 
    done = 0;
    pending = ret;
    for (i = 0; i < pending; i++) {
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret < 0) {
            fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
            return 1;
        }
 
        done++;
        ret = 0;
        if (cqe->res != 4096) {
            fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
            ret = 1;
        }
        io_uring_cqe_seen(&ring, cqe);
        if (ret)
            break;
    }
 
    printf("Submitted=%d, completed=%d\n", pending, done);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}

Walkthrough

An io_uring instance is created in the default interrupt-driven mode, specifying only the size of the ring.

ret = io_uring_queue_init(QD, &ring, 0);

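All of the ring's SQEs are then fetched and prepared for the IORING_OP_READV operation, which provides an asynchronous interface to the readv(2) system call. liburing provides several helper functions for preparing io_uring operations.

Each SQE points to a pre-allocated buffer described by an iovec structure. The buffer holds the data read once the readv operation completes.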

sqe = io_uring_get_sqe(&ring);

io_uring_prep_readv(sqe, fd, &iovecs[i], 1, offset);

All of the SQEs are then submitted with a single call to io_uring_submit(), which returns the number of SQEs submitted successfully.

ret = io_uring_submit(&ring);

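The CQEs are reaped with repeated calls to io_uring_wait_cqe(), and the completion status of each request is available in the cqe->res field. Once a CQE has been processed, io_uring_cqe_seen() must be called to tell the kernel that the CQE has been consumed.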

ret = io_uring_wait_cqe(&ring, &cqe);

io_uring_cqe_seen(&ring, cqe);

Finally, the io_uring instance is torn down.

void io_uring_queue_exit(struct io_uring *ring)

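Sample application: link-cp

link-cp copies a file using the io_uring SQE chaining feature.

As noted earlier, io_uring supports the creation of I/O chains. The I/O operations within a chain execute sequentially, while multiple chains can execute in parallel.

To copy the file, link-cp creates many SQE chains of length two. The first SQE in a chain is a request to read data from the input file. The second SQE, linked to the first, is a request to write the same buffer to the output file.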

/* SPDX-License-Identifier: MIT */
/*
 * Very basic proof-of-concept for doing a copy with linked SQEs. Needs a
 * bit of error handling and short read love.
 */
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include "liburing.h"
 
#define QD  64
#define BS  (32*1024)
 
struct io_data {
    size_t offset;
    int index;
    struct iovec iov;
};
 
static int infd, outfd;
static unsigned inflight;
 
 
static int setup_context(unsigned entries, struct io_uring *ring)
{
    int ret;
 
    ret = io_uring_queue_init(entries, ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return -1;
    }
 
    return 0;
}
 
static int get_file_size(int fd, off_t *size)
{
    struct stat st;
 
    if (fstat(fd, &st) < 0)
        return -1;
    if (S_ISREG(st.st_mode)) {
        *size = st.st_size;
        return 0;
    } else if (S_ISBLK(st.st_mode)) {
        unsigned long long bytes;
 
        if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
            return -1;
 
        *size = bytes;
        return 0;
    }
 
    return -1;
}
 
static void queue_rw_pair(struct io_uring *ring, off_t size, off_t offset)
{
    struct io_uring_sqe *sqe;
    struct io_data *data;
    void *ptr;
 
    ptr = malloc(size + sizeof(*data));
    data = ptr + size;
    data->index = 0;
    data->offset = offset;
    data->iov.iov_base = ptr;
    data->iov.iov_len = size;
 
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_readv(sqe, infd, &data->iov, 1, offset);
    sqe->flags |= IOSQE_IO_LINK;
    io_uring_sqe_set_data(sqe, data);
 
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_writev(sqe, outfd, &data->iov, 1, offset);
    io_uring_sqe_set_data(sqe, data);
}
 
static int handle_cqe(struct io_uring *ring, struct io_uring_cqe *cqe)
{
    struct io_data *data = io_uring_cqe_get_data(cqe);
    int ret = 0;
 
    data->index++;
 
    if (cqe->res < 0) {
        if (cqe->res == -ECANCELED) {
            queue_rw_pair(ring, BS, data->offset);
            inflight += 2;
        } else {
            printf("cqe error: %s\n", strerror(-cqe->res));
            ret = 1;
        }
    }
 
    if (data->index == 2) {
        void *ptr = (void *) data - data->iov.iov_len;
 
        free(ptr);
    }
    io_uring_cqe_seen(ring, cqe);
    return ret;
}
 
static int copy_file(struct io_uring *ring, off_t insize)
{
    struct io_uring_cqe *cqe;
    size_t this_size;
    off_t offset;
 
    offset = 0;
    while (insize) {
        int has_inflight = inflight;
        int depth;
 
        while (insize && inflight < QD) {
            this_size = BS;
            if (this_size > insize)
                this_size = insize;
            queue_rw_pair(ring, this_size, offset);
            offset += this_size;
            insize -= this_size;
            inflight += 2;
        }
 
        if (has_inflight != inflight)
            io_uring_submit(ring);
 
        if (insize)
            depth = QD;
        else
            depth = 1;
        while (inflight >= depth) {
            int ret;
 
            ret = io_uring_wait_cqe(ring, &cqe);
            if (ret < 0) {
                printf("wait cqe: %s\n", strerror(-ret));
                return 1;
            }
            if (handle_cqe(ring, cqe))
                return 1;
            inflight--;
        }
    }
 
    return 0;
}
 
int main(int argc, char *argv[])
{
    struct io_uring ring;
    off_t insize;
    int ret;
 
    if (argc < 3) {
        printf("%s: infile outfile\n", argv[0]);
        return 1;
    }
 
    infd = open(argv[1], O_RDONLY);
    if (infd < 0) {
        perror("open infile");
        return 1;
    }
    outfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (outfd < 0) {
        perror("open outfile");
        return 1;
    }
 
    if (setup_context(QD, &ring))
        return 1;
    if (get_file_size(infd, &insize))
        return 1;
 
    ret = copy_file(&ring, insize);
 
    close(infd);
    close(outfd);
    io_uring_queue_exit(&ring);
    return ret;
}

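Walkthrough

Three functions, copy_file(), queue_rw_pair() and handle_cqe(), implement the file copy.

copy_file() implements the high-level copy loop. It calls queue_rw_pair() to construct each SQE pair:

queue_rw_pair(ring, this_size, offset);

On each pass through the loop, it calls io_uring_submit() once to submit all the SQE pairs it has constructed:

if (has_inflight != inflight)
    io_uring_submit(ring);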

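copy_file() keeps up to QD SQEs in flight for as long as data remains to be copied. Once the input file has been fully read, it waits for and reaps all outstanding CQEs.

ret = io_uring_wait_cqe(ring, &cqe);

if (handle_cqe(ring, cqe))
    return 1;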

queue_rw_pair() constructs a read-write SQE pair. The IOSQE_IO_LINK flag is set on the read SQE, which marks the start of the chain. The flag is not set on the write SQE, which marks the end of the chain. The user data field of both SQEs is set to the same data descriptor, which is used again during completion handling.

sqe = io_uring_get_sqe(ring);
io_uring_prep_readv(sqe, infd, &data->iov, 1, offset);
sqe->flags |= IOSQE_IO_LINK;
io_uring_sqe_set_data(sqe, data);

sqe = io_uring_get_sqe(ring);
io_uring_prep_writev(sqe, outfd, &data->iov, 1, offset);
io_uring_sqe_set_data(sqe, data);

handle_cqe() retrieves from the CQE the data descriptor that queue_rw_pair() stored, and records the retrieval in the descriptor.

struct io_data *data = io_uring_cqe_get_data(cqe);

data->index++;

handle_cqe() queues the read-write pair again if the request was canceled.

if (cqe->res == -ECANCELED) {
    queue_rw_pair(ring, BS, data->offset);
    inflight += 2;
}

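The following excerpt from the io_uring_enter() manual page describes the behavior of linked requests in more detail:

IOSQE_IO_LINK

When this flag is specified, it forms a link with the next SQE in the submission ring. That next SQE will not be started before this one completes. This, in effect, forms a chain of SQEs, which can be arbitrarily long. The tail of the chain is denoted by the first SQE that does not have this flag set. This flag has no effect on previous SQE submissions, nor does it impact SQEs that are outside of the chain tail. This means that multiple chains can be executing in parallel, or chains and individual SQEs. Only members inside the chain are serialized. A chain of SQEs will be broken if any request in that chain ends in error. io_uring considers any unexpected result an error. This means that, for example, a short read will also terminate the remainder of the chain. If a chain of SQE links is broken, the remaining unstarted part of the chain will be terminated and completed with -ECANCELED as the error code.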
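
The chaining rules quoted above can be captured in a toy model. This is a sketch for illustration only and not how the kernel implements links: struct model_sqe and resolve_chains() are hypothetical names, each request's outcome is supplied up front, and any negative outcome stands for an error or unexpected result (a short read would be modelled as a negative outcome as well).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an SQE: only what the chain rule needs. */
struct model_sqe {
    bool link;      /* models IOSQE_IO_LINK: the chain continues with the next SQE */
    int  outcome;   /* result the request would produce if it ran (< 0 = error) */
};

/* Fill res[i] with the completion result each SQE would report. */
void resolve_chains(const struct model_sqe *sqes, size_t n, int *res)
{
    bool in_chain = false;   /* is sqes[i] linked to its predecessor? */
    bool broken = false;     /* did an earlier member of this chain fail? */

    assert(sqes != NULL && res != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!in_chain)
            broken = false;              /* a new chain (or a lone SQE) starts here */
        if (broken) {
            res[i] = -ECANCELED;         /* never started */
        } else {
            res[i] = sqes[i].outcome;
            if (res[i] < 0)
                broken = true;           /* the rest of this chain is canceled */
        }
        in_chain = sqes[i].link;         /* the flag on this SQE links it to the next */
    }
}
```

In the model, a failure only affects the later members of the same chain; SQEs before the failing one and SQEs after the chain tail keep their own results.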

handle_cqe() frees the shared data descriptor once both CQEs of a pair have been processed.

if (data->index == 2) {
    void *ptr = (void *) data - data->iov.iov_len;

    free(ptr);
}
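
The pointer arithmetic works because queue_rw_pair() places the descriptor directly after the payload in a single allocation. The sketch below reproduces that layout in portable C. The helper names padded(), pair_alloc() and pair_complete() are hypothetical. Unlike the listing, this version uses char * arithmetic instead of the GNU void * extension and rounds the payload up to the descriptor's alignment, an assumption added here for portability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/uio.h>

struct io_data {
    size_t offset;
    int index;
    struct iovec iov;
};

/* Round the payload size up so the trailing descriptor is properly aligned. */
static size_t padded(size_t size)
{
    size_t a = _Alignof(struct io_data);
    return (size + a - 1) / a * a;
}

/* One allocation: [ payload (padded) | struct io_data ]. Returns NULL on failure. */
struct io_data *pair_alloc(size_t size, size_t offset)
{
    char *ptr = malloc(padded(size) + sizeof(struct io_data));
    if (!ptr)
        return NULL;

    struct io_data *data = (struct io_data *)(ptr + padded(size));
    data->index = 0;
    data->offset = offset;
    data->iov.iov_base = ptr;
    data->iov.iov_len = size;
    return data;
}

/* Call once per completion; frees the allocation on the second. Returns 1 when freed. */
int pair_complete(struct io_data *data)
{
    assert(data->index < 2);
    if (++data->index < 2)
        return 0;

    /* Walk back from the descriptor to the start of the allocation. */
    char *ptr = (char *)data - padded(data->iov.iov_len);
    free(ptr);
    return 1;
}
```

Keeping the buffer and its bookkeeping in one allocation means a single pointer in the SQE user data field is enough to find both when the completions arrive.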

Finally, handle_cqe() tells the kernel that the CQE has been consumed.

io_uring_cqe_seen(ring, cqe);

liburing API

The io_uring-test and link-cp applications use the following parts of the liburing API:

/*
 * Returns -errno on error, or zero on success. On success, 'ring'
 * contains the necessary information to read/write to the rings.
 */
int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags);
 
/*
 * Return an sqe to fill. Application must later call io_uring_submit()
 * when it's ready to tell the kernel about it. The caller may call this
 * function multiple times before calling io_uring_submit().
 *
 * Returns a vacant sqe, or NULL if we're full.
 */
struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring);
 
/*
 * Set the SQE user_data field.
 */
void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *data);
 
/*
 * Prepare a readv I/O operation.
 */
void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd,
                         const struct iovec *iovecs,
                         unsigned nr_vecs, off_t offset);
 
/*
 * Prepare a writev I/O operation.
 */
void io_uring_prep_writev(struct io_uring_sqe *sqe, int fd,
                          const struct iovec *iovecs,
                          unsigned nr_vecs, off_t offset);
 
/*
 * Submit sqes acquired from io_uring_get_sqe() to the kernel.
 *
 * Returns number of sqes submitted
 */
int io_uring_submit(struct io_uring *ring);
 
/*
 * Return an IO completion, waiting for it if necessary. Returns 0 with
 * cqe_ptr filled in on success, -errno on failure.
 */
int io_uring_wait_cqe(struct io_uring *ring,
                      struct io_uring_cqe **cqe_ptr);
 
/*
 * Must be called after io_uring_{peek,wait}_cqe() after the cqe has
 * been processed by the application.
 */
static inline void io_uring_cqe_seen(struct io_uring *ring,
                                     struct io_uring_cqe *cqe);
 
void io_uring_queue_exit(struct io_uring *ring);