Revisiting eBPF: Communication and Data Structure Analysis

　　Message passing to invoke behavior in a program is a widely used technique in software engineering. A program can modify another program’s behavior by sending messages; this also allows the exchange of information between those programs. One of the most fascinating aspects of BPF is that the code running in the kernel and the program that loaded that code can communicate with each other at runtime using message passing.

　　BPF maps are key/value stores that reside in the kernel. They can be accessed by any BPF program that knows about them. Programs that run in user-space can also access these maps by using file descriptors. You can store any kind of data in a map, as long as you specify the data size correctly beforehand. The kernel treats keys and values as binary blobs, and it doesn’t care about what you keep in a map.
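To make the "binary blob" idea concrete, here is a toy user-space model of a fixed-size key/value store. It is not kernel code, and the names blob_map, blob_map_init, blob_map_lookup, and blob_map_update are hypothetical. It only illustrates that sizes are fixed at creation time and that keys are compared byte by byte, with no knowledge of their type.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of a map: sizes are fixed at creation, contents are opaque bytes.
 * All names here are hypothetical and exist only for illustration. */
struct blob_map {
    size_t key_size;      /* size of every key, in bytes */
    size_t value_size;    /* size of every value, in bytes */
    size_t max_entries;   /* capacity */
    size_t used;          /* entries currently stored */
    unsigned char *slots; /* max_entries * (key_size + value_size) bytes */
};

/* Returns 0 on success, -1 on invalid sizes or allocation failure. */
int blob_map_init(struct blob_map *m, size_t key_size,
                  size_t value_size, size_t max_entries)
{
    if (key_size == 0 || value_size == 0 || max_entries == 0)
        return -1;
    m->slots = calloc(max_entries, key_size + value_size);
    if (!m->slots)
        return -1;
    m->key_size = key_size;
    m->value_size = value_size;
    m->max_entries = max_entries;
    m->used = 0;
    return 0;
}

/* Return a pointer to the stored value bytes for key, or NULL if absent. */
void *blob_map_lookup(struct blob_map *m, const void *key)
{
    size_t stride = m->key_size + m->value_size;

    for (size_t i = 0; i < m->used; i++) {
        unsigned char *slot = m->slots + i * stride;
        if (memcmp(slot, key, m->key_size) == 0)
            return slot + m->key_size;
    }
    return NULL;
}

/* Insert or overwrite an entry; returns -1 when the map is full. */
int blob_map_update(struct blob_map *m, const void *key, const void *value)
{
    void *existing = blob_map_lookup(m, key);
    unsigned char *slot;

    if (existing) {
        memcpy(existing, value, m->value_size);
        return 0;
    }
    if (m->used == m->max_entries)
        return -1;
    slot = m->slots + m->used * (m->key_size + m->value_size);
    memcpy(slot, key, m->key_size);
    memcpy(slot + m->key_size, value, m->value_size);
    m->used++;
    return 0;
}
```

Because the store never interprets the bytes, the caller copies values out with memcpy and is responsible for using the same sizes it declared at creation.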

Creating BPF Maps

　　The most direct way to create a BPF map is by using the bpf syscall. When the first argument in the call is BPF_MAP_CREATE, you’re telling the kernel that you want to create a new map. This call will return the file descriptor identifier associated with the map you just created. The second argument in the syscall is the configuration for this map:

union bpf_attr {
    struct {
        __u32 map_type; /* one of the values from bpf_map_type */
        __u32 key_size; /* size of the keys, in bytes */
        __u32 value_size; /* size of the values, in bytes */
        __u32 max_entries; /* maximum number of entries in the map */
        __u32 map_flags; /* flags to modify how we create the map */
    };
};

　　The third argument in the syscall is the size of this configuration attribute.
For example, you can create a hash-table map to store unsigned integers as keys and values as follows:

union bpf_attr my_map = {
    .map_type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 100,
    .map_flags = BPF_F_NO_PREALLOC,
};

/* bpf() here stands for syscall(__NR_bpf, ...); glibc provides no bpf() wrapper */
int fd = bpf(BPF_MAP_CREATE, &my_map, sizeof(my_map));

　　If the call fails, it returns -1 and errno indicates the reason. Three failures are common: if one of the attributes is invalid, errno is set to EINVAL; if the user executing the operation doesn’t have enough privileges, errno is set to EPERM; and if there is not enough memory to store the map, errno is set to ENOMEM.
　　The helper function bpf_create_map wraps the code you just saw to make it easier to initialize maps on demand. We can use it to create the previous map with only one line of code:

int fd;
fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(int), 100, BPF_F_NO_PREALLOC);
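The error cases described above can be handled explicitly. The following is a minimal sketch, assuming a Linux system with the kernel UAPI header linux/bpf.h installed. The names create_map_raw and explain_map_create_errno are hypothetical helpers introduced here for illustration; they are not part of any library.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Create a map through the raw bpf(2) syscall. glibc ships no bpf() wrapper,
 * so syscall(2) is used directly. Returns a map fd, or -1 with errno set. */
int create_map_raw(enum bpf_map_type type, unsigned int key_size,
                   unsigned int value_size, unsigned int max_entries,
                   unsigned int flags)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr)); /* fields not used by this command stay zero */
    attr.map_type = type;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;
    attr.map_flags = flags;

    return (int)syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* Map the three errno values discussed above to a short explanation;
 * any other value falls back to strerror(). Hypothetical helper. */
const char *explain_map_create_errno(int err)
{
    switch (err) {
    case EINVAL: return "invalid map attribute";
    case EPERM:  return "insufficient privileges";
    case ENOMEM: return "not enough memory for the map";
    default:     return strerror(err);
    }
}

/* Possible usage:
 *   int fd = create_map_raw(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(int),
 *                           100, BPF_F_NO_PREALLOC);
 *   if (fd < 0)
 *       fprintf(stderr, "map create failed: %s\n",
 *               explain_map_create_errno(errno));
 */
```

Checking errno right after the failed call matters, because any later library call may overwrite it.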

　　If you know which kind of map you’re going to need in your program, you can also predefine it. This gives you more visibility into the maps your program is using beforehand:

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 100,
    .map_flags = BPF_F_NO_PREALLOC,
};

　　When you define a map in this way, you’re using what’s called a section attribute, in this case SEC("maps"). This macro places the structure in a dedicated ELF section named maps, which tells the loader that the structure is a BPF map and should be created accordingly.
　　You might have noticed that we don’t have the file descriptor identifier associated with the map in this new example. In this case, the loader (bpf_load.c in the kernel samples) uses a global variable called map_data to store information about the maps in your program. This variable is an array of structures, ordered by how you specified each map in your code. For example, if the previous map was the first one specified in your code, you’d get the file descriptor identifier from the first element in the array:
fd = map_data[0].fd;
　　You can also access the map’s name and its definition from this structure; this information is sometimes useful for debugging and tracing purposes.
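Looking a map up by name rather than by position can be sketched as follows. The struct map_info type is a simplified, hypothetical stand-in for the loader's per-map record (the real structure may carry more fields), and find_map_fd is a helper name invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for the loader's per-map record;
 * the real structure in the samples may carry additional fields. */
struct map_info {
    int fd;           /* file descriptor returned when the map was created */
    const char *name; /* symbol name of the map in the object file */
};

/* Return the fd of the map called name, or -1 if no such map was loaded. */
int find_map_fd(const struct map_info *maps, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++) {
        if (maps[i].name && strcmp(maps[i].name, name) == 0)
            return maps[i].fd;
    }
    return -1;
}
```

A name-based lookup keeps user-space code working even if the order of map definitions in the kernel-side source changes.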

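In essence: the .o file compiled from the kernel-side program is an ELF file that is parsed and then loaded into the kernel. For that reason, maps are placed in their own dedicated ELF section.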

#define SEC(NAME) __attribute__((section(NAME), used))

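The user program creates a map through the bpf system call (with cmd set to BPF_MAP_CREATE). The arguments are the map's parameters, and the return value is the fd that refers to the map. In the official samples, the user-space program creates its maps as follows.

The Makefile under samples/bpf in the Linux kernel source shows: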

sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
always += sockex1_kern.o

One rule builds sockex1_kern.o, and the other builds the executable sockex1.

So sockex1 pulls in bpf_load.c, libbpf.c, and related files.

#include <stdio.h>
#include <assert.h>
#include <linux/bpf.h>
#include "libbpf.h"
#include "bpf_load.h"
#include <unistd.h>
#include <arpa/inet.h>

int main(int ac, char **argv)
{
    char filename[256];
    FILE *f;
    int i, sock;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    sock = open_raw_sock("lo");

    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
              sizeof(prog_fd[0])) == 0);

    f = popen("ping -c5 localhost", "r");
    (void) f;

    for (i = 0; i < 5; i++) {
        long long tcp_cnt, udp_cnt, icmp_cnt;
        int key;

        key = IPPROTO_TCP;
        assert(bpf_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);

        key = IPPROTO_UDP;
        assert(bpf_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);

        key = IPPROTO_ICMP;
        assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);

        printf("TCP %lld UDP %lld ICMP %lld bytes\n",
               tcp_cnt, udp_cnt, icmp_cnt);
        sleep(1);
    }

    return 0;
}

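　　As explained earlier, load_bpf_file uses the bpf(BPF_PROG_LOAD, ...) system call to load the BPF instructions of the kernel-side code into the kernel and gets back an associated fd. But how do both user space and the kernel come to know about the map used for communication? In other words, what ties the two sides together?

Through what link can user space access the map defined in the kernel-side .o file?

The answer: everything is a file!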

int load_bpf_file(char *path)
{
    -------------------------

    fd = open(path, O_RDONLY, 0);
    if (fd < 0)
        return 1;

    elf = elf_begin(fd, ELF_C_READ, NULL);

    if (!elf)
        return 1;
    // parse the ELF file
    if (gelf_getehdr(elf, &ehdr) != &ehdr)
        return 1;

---------------------------------------

    /* scan over all elf sections to get license and map info */
    for (i = 1; i < ehdr.e_shnum; i++) {

        --------------------------------------------
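        } else if (strcmp(shname, "maps") == 0) { // on finding the maps section, call load_maps to create each map and bind it to an fd
            processed_sec[i] = true;
            // once SEC("maps") has been found, load_maps handles the BPF map setup; the version shown here calls bpf_create_map(), while other versions of bpf_load.c use bpf_create_map_node() and bpf_create_map_in_map_node() as the key map-creation functions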
            if (load_maps(data->d_buf, data->d_size))
                return 1;
        } else if (shdr.sh_type == SHT_SYMTAB) {
            symbols = data;
        }
    }

static int load_maps(struct bpf_map_def *maps, int len)
{
    int i;

    for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {

        map_fd[i] = bpf_create_map(maps[i].type,
                       maps[i].key_size,
                       maps[i].value_size,
                       maps[i].max_entries,
                       maps[i].map_flags);
        if (map_fd[i] < 0) {
            printf("failed to create a map: %d %s\n",
                   errno, strerror(errno));
            return 1;
        }

        if (maps[i].type == BPF_MAP_TYPE_PROG_ARRAY)
            prog_array_fd = map_fd[i];
    }
    return 0;
}

int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
           int max_entries, int map_flags)
{
    union bpf_attr attr = {
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries,
        .map_flags = map_flags,
    };

    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}

load_bpf_file  
  |
  |-- load_maps 
      |
      |-- bpf_create_map
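The loop bound in load_maps, len / sizeof(struct bpf_map_def), works because the maps section is just an array of definitions laid out back to back. The sketch below walks such a buffer in plain user-space C. The struct map_def type is a hypothetical stand-in that mirrors the five fields shown earlier, and count_map_defs and read_map_def are names invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in mirroring the five fields of the map definition
 * shown earlier; this is not the kernel's own header. */
struct map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
};

/* Number of whole definitions that fit in a section of len bytes. */
size_t count_map_defs(size_t len)
{
    return len / sizeof(struct map_def);
}

/* Copy definition number idx out of the raw section bytes.
 * Returns 0 on success, -1 if idx is out of range. */
int read_map_def(const void *section, size_t len, size_t idx,
                 struct map_def *out)
{
    if (idx >= count_map_defs(len))
        return -1;
    memcpy(out, (const unsigned char *)section + idx * sizeof(*out),
           sizeof(*out));
    return 0;
}
```

Note that the division silently ignores trailing bytes that do not form a whole definition, so a loader that wants to reject malformed sections would also need to check len % sizeof(struct map_def).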

Creating the map in kernel space

Kernel space handles the BPF_MAP_CREATE system call and allocates memory for the map.

/kernel/bpf/syscall.c

static int map_create(union bpf_attr *attr)
{
    struct bpf_map *map;
    int err;

    /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
    map = find_and_alloc_map(attr);

    // code omitted
    err = bpf_map_new_fd(map);

    return err;
}

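Writing to the map from the kernel program

The kernel program typically writes data into the map: it calls bpf_map_lookup_elem() to find the memory for the given key (index below) and then modifies that memory in place.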

int bpf_prog1(struct __sk_buff *skb)
{
    int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    long *value;

    if (skb->pkt_type != PACKET_OUTGOING)
        return 0;

    value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
        __sync_fetch_and_add(value, skb->len);

    return 0;
}
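The lookup-then-atomic-add pattern can be tried out in ordinary user space. The sketch below is an analogue, not BPF code: a static array stands in for the map (assuming one slot per IP protocol number), and lookup_counter and account_packet are hypothetical names. __sync_fetch_and_add is the same GCC/Clang builtin used in the kernel-side program above.

```c
#include <stddef.h>

#define NUM_PROTOCOLS 256 /* assumption: one slot per possible IP protocol number */

/* User-space stand-in for a map of per-protocol byte counters. */
static long byte_counters[NUM_PROTOCOLS];

/* Analogue of bpf_map_lookup_elem: pointer to the slot, or NULL if the
 * key is out of range. */
long *lookup_counter(int protocol)
{
    if (protocol < 0 || protocol >= NUM_PROTOCOLS)
        return NULL;
    return &byte_counters[protocol];
}

/* Add len bytes to the protocol's counter; returns 0, or -1 on a miss. */
int account_packet(int protocol, long len)
{
    long *value = lookup_counter(protocol);

    if (!value)
        return -1; /* always check the lookup result before using it */
    __sync_fetch_and_add(value, len); /* atomic read-modify-write */
    return 0;
}
```

The atomic add matters because the same counter may be updated concurrently, for instance when packets are processed on several CPUs at once; a plain `*value += len` could lose updates.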

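Reading the map from the user program

A user program can read the value stored under a specific key with the BPF_MAP_LOOKUP_ELEM command of the bpf system call. The first argument of the wrapper below is the fd returned when the map was created.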

int bpf_lookup_elem(int fd, void *key, void *value)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
    };

    return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
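The wrapper above relies on ptr_to_u64, which is not shown in this post. A self-contained version might look like the sketch below, assuming a Linux system with linux/bpf.h available. The name lookup_elem_raw is hypothetical, chosen to avoid clashing with the sample's own bpf_lookup_elem.

```c
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* The bpf ABI carries user pointers as 64-bit integers so that the layout
 * of union bpf_attr is the same for 32-bit and 64-bit user space. */
__u64 ptr_to_u64(const void *ptr)
{
    return (__u64)(unsigned long)ptr;
}

/* Read the value stored under key into the buffer at value.
 * The buffer must be at least as large as the map's value_size.
 * Returns 0 on success, or -1 with errno set. */
int lookup_elem_raw(int fd, const void *key, void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr)); /* fields not used by this command stay zero */
    attr.map_fd = (__u32)fd;
    attr.key = ptr_to_u64(key);
    attr.value = ptr_to_u64(value);

    return (int)syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
```

The kernel copies value_size bytes into the caller's buffer, so user space gets a snapshot of the value rather than a pointer into the map, unlike the kernel-side bpf_map_lookup_elem().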

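BPF community sites

https://ebpf.io - the most comprehensive site for BPF learning resources, maintained mainly by the Cilium team, with BPF documentation and videos kept up to date.
https://lwn.net/Kernel/Index/#Berkeley_Packet_Filter - LWN is the best site for learning Linux kernel technology. This index of BPF articles records the story behind many BPF milestones, so you pick up both the knowledge and the background.
https://cilium.readthedocs.io/en/stable/bpf/ - the BPF documentation provided by Cilium. It is the most practically useful BPF handbook I have come across and is well worth a careful read.
https://www.kernel.org/doc/html/latest/bpf/bpf_devel_QA.html - a must-read Q&A for BPF development, with advice from the maintainers of the kernel BPF code. Reading it shows how the community runs BPF development.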

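Learning a technology still has to start from the source code. These are the BPF-related code repositories:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ This is the standalone bpf repository officially maintained by the Linux community. Once a new version is released, the code no longer changes substantially and only bug fixes are accepted. It acts as the master repo and is eventually merged into the mainline Linux kernel tree.
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ This repository is also officially maintained by the Linux community. It is updated frequently and is used to introduce new features or optimize existing ones; once stable, changes are merged into the master repo above, so it acts as the feature repo. Recent commits include quite a few contributions from Chinese developers. If you are interested, come and take part.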

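Learning a technology also takes discussion. These are the recommended channels:

https://cilium.slack.com/archives/C4XCTGYEM The eBPF channel on the Cilium Slack; any question can be asked there.
https://github.com/DavadDi/bpf_study A collection of BPF articles and tutorials gathered by 狄卫华 (Di Weihua); questions can be raised as issues.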

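Bypassing conntrack: using eBPF to enhance IPVS and optimize k8s network performance: https://v.qq.com/x/page/s3137ehoq8i.html
A deep dive into service mesh data plane performance and tuning: https://v.qq.com/x/page/v3137ax6zss.html
BPF for chaos and tracing in Kubernetes: https://v.qq.com/x/page/f3130lpe0iv.html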
https://kccnceu20.sched.com/event/ZejN/tutorial-using-bpf-in-cloud-native-environments-alban-crequy-marga-manterola-kinvolk
https://kccnceu20.sched.com/event/Zeoz/hubble-ebpf-based-observability-for-kubernetes-sebastian-wicki-isovalent
https://kccnceu20.sched.com/event/Zexb/designing-a-grpc-interface-for-kernel-tracing-with-ebpf-leonardo-di-donato-sysdig
https://kccnceu20.sched.com/event/ZemQ/ebpf-and-kubernetes-little-helper-minions-for-scaling-microservices-daniel-borkmann-cilium
https://kccnceu20.sched.com/event/Zewd/intro-to-falco-intrusion-detection-for-containers-shane-lawrence-shopify
https://kccnceu20.sched.com/event/ZetL/seccomp-security-profiles-and-you-a-practical-guide-duffie-cooley-vmware
https://kccnceu20.sched.com/event/ZeqL/k8s-in-the-datacenter-integrating-with-preexisting-bare-metal-environments-max-stritzinger-bloomberg

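Brendan Gregg: the foremost BPF evangelist, from Netflix. His blog is all about Linux system optimization, with original insights, and every post is worth reading.
Alexei Starovoitov: creator of eBPF, currently at Facebook. His name shows up frequently in kernel commits.
Daniel Borkmann: eBPF kernel co-maintainer, currently at Isovalent, the company behind Cilium. He is a key contributor of new eBPF features.
Thomas Graf: the father of Cilium and CTO of Isovalent. He is also a strong evangelist for eBPF and Cilium.
Quentin Monnet: BPFTool co-maintainer. He answers a great many bpf questions on Stack Overflow and has posted a series of short hands-on eBPF write-ups on Twitter.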

posted @ 2021-05-08 22:27 codestacklinuxer
