Revisiting eBPF: Communication and Data Structure Analysis
Message passing to invoke behavior in a program is a widely used technique in software engineering. A program can modify another program's behavior by sending messages; this also allows the exchange of information between those programs. One of the most fascinating aspects of BPF is that the code running in the kernel and the program that loaded said code can communicate with each other at runtime using message passing.
BPF maps are key/value stores that reside in the kernel. They can be accessed by any BPF program that knows about them. Programs that run in user-space can also access these maps by using file descriptors. You can store any kind of data in a map, as long as you specify the data size correctly beforehand. The kernel treats keys and values as binary blobs, and it doesn’t care about what you keep in a map.
Creating BPF Maps
The most direct way to create a BPF map is by using the bpf syscall. When the first argument in the call is BPF_MAP_CREATE, you're telling the kernel that you want to create a new map. This call will return the file descriptor identifier associated with the map you just created. The second argument in the syscall is the configuration for this map:
union bpf_attr {
    struct {
        __u32 map_type;    /* one of the values from bpf_map_type */
        __u32 key_size;    /* size of the keys, in bytes */
        __u32 value_size;  /* size of the values, in bytes */
        __u32 max_entries; /* maximum number of entries in the map */
        __u32 map_flags;   /* flags to modify how we create the map */
    };
}
The third argument in the syscall is the size of this configuration attribute.
For example, you can create a hash-table map to store unsigned integers as keys and values as follows:
union bpf_attr my_map = {
    .map_type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 100,
    .map_flags = BPF_F_NO_PREALLOC,
};
int fd = bpf(BPF_MAP_CREATE, &my_map, sizeof(my_map));
If the call fails, the kernel returns a value of -1. There might be three reasons why it fails. If one of the attributes is invalid, the kernel sets the errno variable to EINVAL. If the user executing the operation doesn't have enough privileges, the kernel sets the errno variable to EPERM. Finally, if there is not enough memory to store the map, the kernel sets the errno variable to ENOMEM.
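The three failure cases above can be sketched in C roughly as follows. This is a minimal sketch, not the book's code: the function names map_create_error_reason and create_hash_map are hypothetical, and it assumes a Linux system whose headers provide linux/bpf.h and __NR_bpf.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: turn the errno values named in the text
 * into a short human-readable reason. */
const char *map_create_error_reason(int err)
{
    switch (err) {
    case EINVAL:
        return "invalid attribute";
    case EPERM:
        return "insufficient privileges";
    case ENOMEM:
        return "not enough memory";
    default:
        return "other error";
    }
}

/* Hypothetical wrapper: create a hash map through the raw bpf syscall.
 * Returns the map fd on success, or -1 with errno set by the kernel. */
int create_hash_map(unsigned int max_entries)
{
    union bpf_attr attr;

    /* Fields that are not used by BPF_MAP_CREATE must be zero. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = sizeof(int);
    attr.value_size = sizeof(int);
    attr.max_entries = max_entries;
    attr.map_flags = BPF_F_NO_PREALLOC;

    return (int)syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

A caller would check the return value of create_hash_map(100) and, when it is negative, pass errno to map_create_error_reason to report why the kernel refused the request.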
The helper function bpf_create_map wraps the code you just saw to make it easier to initialize maps on demand. We can use it to create the previous map with only one line of code:
int fd;
fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(int), 100, BPF_F_NO_PREALLOC);
If you know which kind of map you’re going to need in your program, you can also predefine it. This is helpful to get more visibility in the maps your program is using beforehand:
struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 100,
    .map_flags = BPF_F_NO_PREALLOC,
};
When you define a map in this way, you're using what's called a section attribute, in this case SEC("maps"). This macro places the structure in the ELF section named maps, which tells the loader that this structure is a BPF map and that it should be created accordingly.
You might have noticed that we don't have the file descriptor identifier associated with the map in this new example. In this case, the loader code from the kernel samples (bpf_load.c) uses a global variable called map_data to store information about the maps in your program. This variable is an array of structures, and it's ordered by how you specified each map in your code. For example, if the previous map was the first one specified in your code, you'd get the file descriptor identifier from the first element in the array:
fd = map_data[0].fd;
You can also access the map’s name and its definition from this structure; this information is sometimes useful for debugging and tracing purposes.
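In essence: the .o file compiled from the kernel-side program is parsed as an ELF file and loaded into the kernel. For that to work, maps are placed in their own dedicated ELF section: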
#define SEC(NAME) __attribute__((section(NAME), used))
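To see what the SEC macro does on its own, here is a small user-space sketch. It is an illustration under assumptions, not how bpf_load.c works (bpf_load.c reads the section out of the .o file with libelf): struct demo_map_def and the two map variables are hypothetical stand-ins, and the __start_maps / __stop_maps symbols rely on GNU ld or lld on Linux, which generate them for any section whose name is a valid C identifier.

```c
#include <stddef.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/* Hypothetical stand-in for struct bpf_map_def. It is 16 bytes and
 * 16-byte aligned so that entries sit back to back in the section. */
struct demo_map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
} __attribute__((aligned(16)));

/* Both definitions land in the ELF section named "maps". */
struct demo_map_def SEC("maps") first_map = { 1, 4, 8, 256 };
struct demo_map_def SEC("maps") second_map = { 2, 4, 4, 100 };

/* Provided by the linker: start and end addresses of the "maps" section. */
extern struct demo_map_def __start_maps[];
extern struct demo_map_def __stop_maps[];

/* Number of map definitions, computed the same way a loader would:
 * section length divided by the size of one definition. */
size_t demo_map_count(void)
{
    return (size_t)(__stop_maps - __start_maps);
}

/* Walk every definition in the section and sum max_entries. */
unsigned int demo_total_entries(void)
{
    unsigned int total = 0;

    for (struct demo_map_def *m = __start_maps; m < __stop_maps; m++)
        total += m->max_entries;
    return total;
}
```

The point of the sketch is that a section is just a contiguous byte range holding the definitions, which is why a loader can treat it as an array of map descriptors.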
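A user program creates a map through the bpf system call (with cmd set to BPF_MAP_CREATE). The input is the set of map parameters, and the return value is the fd for the map. In the official samples, the user-space program creates its maps as follows.
From the Makefile under samples/bpf in the Linux kernel source we can see: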
sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
always += sockex1_kern.o
One rule builds sockex1_kern.o, and the other builds the executable sockex1.
So sockex1 involves files such as bpf_load.c and libbpf.c.
#include <stdio.h>
#include <assert.h>
#include <linux/bpf.h>
#include "libbpf.h"
#include "bpf_load.h"
#include <unistd.h>
#include <arpa/inet.h>

int main(int ac, char **argv)
{
    char filename[256];
    FILE *f;
    int i, sock;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    sock = open_raw_sock("lo");

    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
                      sizeof(prog_fd[0])) == 0);

    f = popen("ping -c5 localhost", "r");
    (void) f;

    for (i = 0; i < 5; i++) {
        long long tcp_cnt, udp_cnt, icmp_cnt;
        int key;

        key = IPPROTO_TCP;
        assert(bpf_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);

        key = IPPROTO_UDP;
        assert(bpf_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);

        key = IPPROTO_ICMP;
        assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);

        printf("TCP %lld UDP %lld ICMP %lld bytes\n",
               tcp_cnt, udp_cnt, icmp_cnt);
        sleep(1);
    }

    return 0;
}
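As covered earlier, load_bpf_file uses the bpf(BPF_PROG_LOAD, ...) system call to load the BPF instructions of the kernel-side code into the kernel and gets back an associated fd. But how do both the user program and the kernel come to know about the map used for communication? In other words, what ties them together so that the user program can access the map declared in the kernel .o file?
The answer: everything is a file!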
int load_bpf_file(char *path)
{
    /* ... */
    fd = open(path, O_RDONLY, 0);
    if (fd < 0)
        return 1;

    elf = elf_begin(fd, ELF_C_READ, NULL);
    if (!elf)
        return 1;

    /* parse the ELF file */
    if (gelf_getehdr(elf, &ehdr) != &ehdr)
        return 1;
    /* ... */

    /* scan over all elf sections to get license and map info */
    for (i = 1; i < ehdr.e_shnum; i++) {
        /* ... */
        } else if (strcmp(shname, "maps") == 0) {
            /* when the maps section is found, call load_maps to create
             * each map and associate it with an fd */
            processed_sec[i] = true;
            /* once SEC("maps") has been found, the BPF map handling is
             * done by load_maps; bpf_create_map_node() and
             * bpf_create_map_in_map_node() are the key functions that
             * create the BPF maps */
            if (load_maps(data->d_buf, data->d_size))
                return 1;
        } else if (shdr.sh_type == SHT_SYMTAB) {
            symbols = data;
        }
    }
static int load_maps(struct bpf_map_def *maps, int len)
{
    int i;

    for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {
        map_fd[i] = bpf_create_map(maps[i].type,
                                   maps[i].key_size,
                                   maps[i].value_size,
                                   maps[i].max_entries,
                                   maps[i].map_flags);
        if (map_fd[i] < 0) {
            printf("failed to create a map: %d %s\n",
                   errno, strerror(errno));
            return 1;
        }

        if (maps[i].type == BPF_MAP_TYPE_PROG_ARRAY)
            prog_array_fd = map_fd[i];
    }
    return 0;
}

int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
                   int max_entries, int map_flags)
{
    union bpf_attr attr = {
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries,
        .map_flags = map_flags,
    };

    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
load_bpf_file
|
|-- load_maps
|
|-- bpf_create_map
Creating the map in kernel space
The kernel handles the BPF_MAP_CREATE system call and allocates the memory that backs the map.
/* kernel/bpf/syscall.c */
static int map_create(union bpf_attr *attr)
{
    struct bpf_map *map;
    int err;

    /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
    map = find_and_alloc_map(attr);

    // code omitted

    err = bpf_map_new_fd(map);
    return err;
}
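Writing the map from the kernel program
What a kernel program usually does is write data into the map: it calls bpf_map_lookup_elem() to find the memory that belongs to the key (index here) and then modifies that memory in place.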
int bpf_prog1(struct __sk_buff *skb)
{
    int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    long *value;

    if (skb->pkt_type != PACKET_OUTGOING)
        return 0;

    value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
        __sync_fetch_and_add(value, skb->len);

    return 0;
}
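Reading the map from the user program
A user program can read the value stored under a given key through the BPF_MAP_LOOKUP_ELEM system call. The first parameter is the fd that was returned when the map was created.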
int bpf_lookup_elem(int fd, void *key, void *value)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
    };

    return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
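The lookup wrapper above uses ptr_to_u64 without showing it. A minimal self-contained sketch could look like this; lookup_elem is a hypothetical name chosen to avoid clashing with the sample's bpf_lookup_elem, and the code assumes Linux headers that provide linux/bpf.h and __NR_bpf.

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The key and value fields of union bpf_attr are 64-bit integers, so a
 * pointer is widened through unsigned long to work on both 32-bit and
 * 64-bit builds. */
static inline __u64 ptr_to_u64(const void *ptr)
{
    return (__u64)(unsigned long)ptr;
}

/* Hypothetical wrapper: copy the value stored under *key into *value.
 * Returns 0 on success, or -1 with errno set by the kernel. */
int lookup_elem(int fd, const void *key, void *value)
{
    union bpf_attr attr;

    /* Zero the whole union so unused fields do not carry stack garbage. */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);
    attr.value = ptr_to_u64(value);

    return (int)syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
```

The value buffer passed in must be at least value_size bytes, as declared when the map was created, because the kernel copies exactly that many bytes into it.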
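BPF community sites
- https://ebpf.io The most complete site for BPF learning resources, maintained mainly by the Cilium team; BPF documentation and videos are added there promptly.
- https://lwn.net/Kernel/Index/#Berkeley_Packet_Filter LWN is the best site for learning Linux kernel technology. This collection of BPF articles records the story behind many BPF milestones, so you pick up both the knowledge and its background.
- https://cilium.readthedocs.io/en/stable/bpf/ The BPF documentation provided by Cilium. It is the most practically useful BPF manual I have seen and is well worth a careful read.
- https://www.kernel.org/doc/html/latest/bpf/bpf_devel_QA.html The must-read Q&A for BPF development. It contains development advice from the maintainers of the BPF kernel code, and reading it shows how the community runs BPF.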
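Learning a technology still has to start from the source code. These are the code repositories related to bpf:
- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ This repo is the standalone bpf code repository officially maintained by the Linux community. Once a new version is released, the code no longer changes much and only bug fixes are accepted. It acts as the master repo and is eventually merged into the mainline Linux kernel.
- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ This repo is also a bpf code repository officially maintained by the Linux community. It is updated frequently and is used to introduce new features or optimize existing ones; once stable, changes are merged into the master repo above, so it acts as the feature repo. The recent commits include quite a few contributions from Chinese developers. If you are interested, come and take part.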
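Learning a technology also takes communication. These are the recommended channels:
- https://cilium.slack.com/archives/C4XCTGYEM The eBPF thread provided by Cilium; you can ask any question there.
- https://github.com/DavadDi/bpf_study A collection of BPF articles and tutorials gathered by 狄卫华 (DavadDi); if you have questions, open an issue.
- Bypassing conntrack: using eBPF to enhance IPVS and optimize k8s network performance: https://v.qq.com/x/page/s3137ehoq8i.html
- A deep dive into service mesh data plane performance and tuning: https://v.qq.com/x/page/v3137ax6zss.html
- BPF for chaos and tracing in Kubernetes: https://v.qq.com/x/page/f3130lpe0iv.html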
- https://kccnceu20.sched.com/event/ZejN/tutorial-using-bpf-in-cloud-native-environments-alban-crequy-marga-manterola-kinvolk
- https://kccnceu20.sched.com/event/Zeoz/hubble-ebpf-based-observability-for-kubernetes-sebastian-wicki-isovalent
- https://kccnceu20.sched.com/event/Zexb/designing-a-grpc-interface-for-kernel-tracing-with-ebpf-leonardo-di-donato-sysdig
- https://kccnceu20.sched.com/event/ZemQ/ebpf-and-kubernetes-little-helper-minions-for-scaling-microservices-daniel-borkmann-cilium
- https://kccnceu20.sched.com/event/Zewd/intro-to-falco-intrusion-detection-for-containers-shane-lawrence-shopify
- https://kccnceu20.sched.com/event/ZetL/seccomp-security-profiles-and-you-a-practical-guide-duffie-cooley-vmware
- https://kccnceu20.sched.com/event/ZeqL/k8s-in-the-datacenter-integrating-with-preexisting-bare-metal-environments-max-stritzinger-bloomberg
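- Brendan Gregg, the strongest BPF evangelist, from Netflix. His blog is all about Linux system optimization, with original insights, and every post is worth reading.
- Alexei Starovoitov, the creator of eBPF, currently at Facebook; his name shows up often in kernel commits.
- Daniel Borkmann, eBPF kernel co-maintainer, currently at Isovalent, the company behind Cilium; he is one of the people who add features to eBPF.
- Thomas Graf, the father of Cilium and CTO of Isovalent; he is also a strong evangelist for eBPF and Cilium.
- Quentin Monnet, BPFTool co-maintainer. Quentin is the go-to answerer of bpf questions on stackoverflow and has a series of short hands-on eBPF posts on twitter.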