
Aya: your tRusty eBPF companion

  • Original article: https://deepfence.io/aya-your-trusty-ebpf-companion/

Aya is a library that makes it possible to write eBPF programs entirely in Rust, with a focus on making the developer experience as friendly as possible. In this post we explain what eBPF is, why Aya was started, and what is unique about it.

What is eBPF?

eBPF (extended Berkeley Packet Filter) is a technology for running sandboxed programs inside a virtual machine that has its own minimal instruction set.

It originated in the Linux kernel, where the eBPF VM (part of the kernel) runs eBPF programs when specific events happen in the system. New kernel versions keep adding more events, and each type of event has its own kind of eBPF program. Currently, the best-known program types in Linux are:

  • Kprobes (kernel probes), fentry – can be attached to any kernel function at runtime.
  • Tracepoints – hooks placed at various points in the Linux kernel code. They are more stable than Kprobes, whose target functions can change between kernel versions.
  • TC classifier – can be attached to the egress and ingress qdisc (“queuing discipline” in Linux networking) to inspect packets on a network interface and perform operations such as accepting, dropping, redirecting, or requeuing them.
  • XDP – similar to TC classifier, but attaches to the NIC driver and receives raw packets before they pass through any layer of the kernel networking stack. Its limitation is that it sees only ingress packets.
  • LSM – stands for Linux Security Modules; these programs decide whether a particular security-related action is allowed to happen.

eBPF projects are usually built from two parts:

  • The eBPF program itself, which runs in the kernel and reacts to events.
  • A user space program, which loads eBPF programs into the kernel and manages their lifetime.

There are two ways to share data between eBPF programs and user space:

  • Maps – data structures used by eBPF programs and, depending on the type, also by user space. With standard map types like HashMap, both the eBPF program and user space can read and write entries.
  • Perf / ring buffers (PerfEventArray) – buffers through which an eBPF program can push events (in the form of custom structures) to user space. This is a way to notify the user space program immediately.

Although eBPF started in Linux, there is now also an implementation for Windows, and eBPF is not even limited to operating system kernels. There are several user space implementations of eBPF, such as rbpf, a user space VM used in production by projects like Solana.

What is Aya and how did it start?

Today, eBPF programs are usually written in either C or eBPF assembly. In 2021, however, the Rust compiler gained support for compiling eBPF programs. Since then, Rust Nightly can compile a limited subset of Rust into an eBPF object file.

If you are interested in the implementation details, we recommend the blog post by Alessandro Decina, the author of the pull request.

Aya is the first library that supports writing the whole eBPF project (both the user space and kernel space parts) in Rust, without any dependency on libbpf or clang. In most environments, Rust Nightly is the only dependency needed for building. Some environments where rustc doesn’t expose its internal LLVM.so library (e.g. aarch64) require installing a shared LLVM library. But there is no need for libbpf, clang, or bcc!

As mentioned before, the main focus of Aya is developer experience: making it as easy as possible to write eBPF programs. The following sections describe how Aya achieves that.

More (type) safety

Although the eBPF verifier ensures memory safety, using Rust instead of C is still beneficial in terms of type safety. Both Rust itself and the macros inside Aya are strict about which types may be used in which context.

Let’s look at this example in C.

Headers:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

Program:

SEC("xdp")
int incorrect_xdp(struct __sk_buff *skb) {
    return XDP_PASS;
}

It compiles without any problems:

$ clang -O2 -emit-llvm -c incorrect_xdp.c -o - | llc -march=bpf -filetype=obj -o bpf.o
$

… despite the fact that the function signature of that program is incorrect. struct __sk_buff *skb is the argument provided to TC classifier programs, not to XDP programs, which take an argument of type struct xdp_md *ctx. Clang does not catch that mistake during compilation.

Let’s try to make a similar mistake in Rust:

#[xdp(name = "incorrect_xdp")]
pub fn incorrect_xdp(ctx: SkBuffContext) -> u32 {
    xdp_action::XDP_PASS
}

$ cargo xtask build-ebpf
[...]
   Compiling incorrect-xdp-ebpf v0.1.0 (/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf)
     Running `rustc --crate-name incorrect_xdp --edition=2021 src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C opt-level=3 -C panic=abort -C lto -C codegen-units=1 -C metadata=c92607119e7c631d -C extra-filename=-c92607119e7c631d --out-dir /home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps --target bpfel-unknown-none -L dependency=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps -L dependency=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/debug/deps --extern aya_bpf=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libaya_bpf-85e7be8a52b56ed9.rlib --extern aya_log_ebpf=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libaya_log_ebpf-1b46466744bed2bc.rlib --extern 'noprelude:compiler_builtins=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libcompiler_builtins-bb297dda66d0a4e2.rlib' --extern 'noprelude:core=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libcore-65086a797df2a9a7.rlib' --extern incorrect_xdp_common=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libincorrect_xdp_common-114ad60c902270da.rlib -Z unstable-options`
error[E0308]: mismatched types
 --> src/main.rs:7:1
  |
7 | #[xdp(name = "incorrect_xdp")]
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `SkBuffContext`, found struct `XdpContext`
8 | pub fn incorrect_xdp(ctx: SkBuffContext) -> u32 {
[...]

The Rust compiler detected the mismatch between SkBuffContext (the context of a TC classifier program) and XdpContext (the context of an XDP program, which is what we should use with the xdp macro).
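The mechanism behind that error can be illustrated without any eBPF tooling. The following is only a simplified model in plain Rust, not Aya's actual macro expansion: the two context structs, `run_xdp_hook`, and both program functions are hypothetical names invented for this sketch. The point is that a hook which names the context type in its signature will only accept programs written against that type.

```rust
/// Hypothetical stand-in for an XDP context; not Aya's real definition.
pub struct XdpContext {
    pub packet_len: usize,
}

/// Hypothetical stand-in for a TC classifier context; not Aya's real definition.
pub struct SkBuffContext {
    pub packet_len: usize,
}

// XDP action codes as defined in the Linux UAPI headers.
pub const XDP_DROP: u32 = 1;
pub const XDP_PASS: u32 = 2;

/// A hook that only accepts programs taking an `XdpContext`.
pub fn run_xdp_hook(program: fn(XdpContext) -> u32, packet_len: usize) -> u32 {
    program(XdpContext { packet_len })
}

/// A correctly typed program: it compiles when passed to `run_xdp_hook`.
pub fn pass_small_packets(ctx: XdpContext) -> u32 {
    if ctx.packet_len <= 1500 {
        XDP_PASS
    } else {
        XDP_DROP
    }
}

/// A program with the TC-style signature. Passing it to the XDP hook,
///     run_xdp_hook(wrong_signature, 64);
/// is rejected at compile time with a "mismatched types" error.
pub fn wrong_signature(_ctx: SkBuffContext) -> u32 {
    XDP_PASS
}
```

Calling `run_xdp_hook(pass_small_packets, 64)` returns `XDP_PASS`, while the commented-out call with `wrong_signature` never gets as far as producing a binary.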

Error handling

The usual way of handling errors in C is to return an integer indicating success or failure from a function and to check that integer at the call site. Since the return value is an error code, the actual result of a successful call is usually stored through a pointer provided as an argument. A basic example (which is triggered when a new process is cloned and saves the PID in a HashMap) looks like this:

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, pid_t);
	__type(value, u32);
} pids SEC(".maps");

SEC("fentry/kernel_clone")
int BPF_PROG(kernel_clone, struct kernel_clone_args *args)
{
	/* Get the pid */
	pid_t pid = bpf_get_current_pid_tgid() >> 32;
	/* Save the pid in map */
	u32 val = 0;
	int err = bpf_map_update_elem(&pids, &pid, &val, 0);
	if (err < 0)
		return err;
	return 0;
}

Aya lets you use the Result enum and handle errors the way most Rust projects do. The only trick is to create two or more functions: one with a C function signature that returns only an integer (the actual eBPF program), and others that return Result (called by the first function). Example:

#[map(name = "pids")]
static mut PIDS: HashMap<u32, u32> = HashMap::<u32, u32>::with_max_entries(1024, 0);

#[fentry(name = "kernel_clone")]
pub fn kernel_clone(ctx: FEntryContext) -> u32 {
    match unsafe { try_kernel_clone(ctx) } {
        Ok(ret) => ret,
        Err(_) => 1,
    }
}

fn try_kernel_clone(ctx: FEntryContext) -> Result<u32, c_long> {
    // Get the pid
    let pid = ctx.pid();
    // Save the pid in map.
    unsafe { PIDS.insert(&pid, &0, 0)? };
    Ok(0)
}

The difference becomes significant when the eBPF code grows and there are multiple errors to handle.
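To make that point concrete, here is a small sketch of the same two-function pattern with more than one fallible step. It is plain user space Rust, with a standard `HashMap` standing in for the eBPF map; `TrackError`, `validate`, `insert`, and the tiny `MAX_ENTRIES` limit are all assumptions made up for the illustration.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch; real eBPF helpers report integer codes.
#[derive(Debug, PartialEq)]
pub enum TrackError {
    InvalidPid,
    MapFull,
}

/// Deliberately tiny capacity so the "map full" path is easy to hit.
const MAX_ENTRIES: usize = 2;

fn validate(pid: u32) -> Result<u32, TrackError> {
    if pid == 0 {
        Err(TrackError::InvalidPid)
    } else {
        Ok(pid)
    }
}

fn insert(map: &mut HashMap<u32, u32>, pid: u32) -> Result<(), TrackError> {
    if map.len() >= MAX_ENTRIES && !map.contains_key(&pid) {
        return Err(TrackError::MapFull);
    }
    map.insert(pid, 0);
    Ok(())
}

/// Inner function: every fallible step is chained with `?`.
pub fn try_track(map: &mut HashMap<u32, u32>, pid: u32) -> Result<u32, TrackError> {
    let pid = validate(pid)?;
    insert(map, pid)?;
    Ok(0)
}

/// Outer function: the only place where errors are turned into an integer.
pub fn track(map: &mut HashMap<u32, u32>, pid: u32) -> u32 {
    match try_track(map, pid) {
        Ok(ret) => ret,
        Err(_) => 1,
    }
}
```

Adding a third or fourth fallible step only adds one more `?` line to `try_track`; the integer-returning wrapper stays unchanged.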

The Rust toolchain is all you need

That’s right: to start playing with Aya and eBPF, you usually need to install only the Rust toolchain (nightly) and a few crates. Detailed instructions are here. Cargo is enough to build the whole project, and the produced binary loads the eBPF program into the kernel. Clang, bpftool, and iproute2 are not needed.

With Aya and Rust, you can use libraries in your eBPF program as long as they support no_std usage. More details about no_std are here. It’s also possible to release eBPF code as crates.

A crate that is very often used in Aya-based eBPF programs is memoffset, which obtains the offsets of struct members. We will see it in the code examples later.

aya-log

Aya-log is a library that lets you easily log from eBPF programs to the user space program. There is no need to use bpf_printk() or bpftool prog tracelog together with the kernel trace buffer, which is centralized. Aya-log sends the logs through a PerfEventArray to the user space part of the project. That is something eBPF developers often implement from scratch, but with Aya there is no need to do so!

Logging in Aya is as simple as:

#[fentry(name = "kernel_clone")]
pub fn kernel_clone(ctx: FEntryContext) -> u32 {
    let pid = ctx.pid();
    info!(&ctx, "new process: pid: {}", pid);
    0
}

The log message then shows up in the output of the user space process.

aya-template

cargo-generate is a tool that helps create new Rust projects from a template stored in a git repository. You can use it to create a new Aya-based eBPF project from our aya-template repository.

Starting a new project is as simple as:

cargo install cargo-generate
cargo generate https://github.com/aya-rs/aya-template

cargo-generate then asks a few questions about the project, which mostly depend on the chosen type of eBPF program.

The generated project contains a firewall user space crate, a firewall-common crate for shared code, a firewall-ebpf crate with the eBPF code, and an xtask crate for build commands.

Sharing the common types and logic between user space and eBPF

Many eBPF projects keep the eBPF (kernel space) part in C but write the user space part in another language such as Go or Rust. In such cases, when some structures are used in both parts of the project, bindings to the C structure definitions have to be generated.

In projects based entirely on Aya and Rust, it’s common practice to keep shared types in a crate called [project-name]-common. That crate is created by default when you generate a new project with aya-template. It can contain, for example, the struct definitions used in maps.
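As a rough illustration of why a single shared definition helps, the sketch below defines one `#[repr(C)]` struct and uses it on both the producing and the consuming side of a byte buffer. The `ConnEvent` type and the `to_bytes` / `from_bytes` helpers are hypothetical and only model what a `-common` crate enables; they are not part of Aya.

```rust
use std::mem::size_of;

/// Hypothetical event type of the kind that could live in a `-common` crate.
/// `#[repr(C)]` fixes the field order; the explicit `reserved` byte means the
/// struct has no implicit padding.
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(C)]
pub struct ConnEvent {
    pub pid: u32,
    pub port: u16,
    pub direction: u8,
    pub reserved: u8,
}

/// Producer side: copy the struct out as raw bytes.
pub fn to_bytes(event: &ConnEvent) -> Vec<u8> {
    let ptr = event as *const ConnEvent as *const u8;
    // SAFETY: `event` is a valid reference and the struct has no padding bytes,
    // so all `size_of::<ConnEvent>()` bytes are initialized.
    unsafe { std::slice::from_raw_parts(ptr, size_of::<ConnEvent>()) }.to_vec()
}

/// Consumer side: reinterpret received bytes using the same definition.
pub fn from_bytes(buf: &[u8]) -> Option<ConnEvent> {
    if buf.len() < size_of::<ConnEvent>() {
        return None;
    }
    // SAFETY: the length was checked above, every bit pattern is valid for the
    // integer fields, and `read_unaligned` tolerates any buffer alignment.
    Some(unsafe { (buf.as_ptr() as *const ConnEvent).read_unaligned() })
}
```

Because both sides compile against the same struct, a change to the layout is picked up by both at build time, with no separate bindings to regenerate.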

Async support

The user space part of a project based on Aya can be asynchronous; both Tokio and async-std are supported. Aya can load eBPF programs and operate on eBPF maps and perf buffers in an asynchronous context.

How Deepfence leverages Aya

Packet inspection overview

We analyze network traffic on virtual machines, Docker, Kubernetes, and public cloud environments.

On each node, that analysis is done in two places for different purposes:

  • Inline in an eBPF program, with the TC classifier context. At this level, we:
    • Perform a network layer (L3) inspection to check the source and destination addresses.
    • Perform a transport layer (L4) inspection to check the local and remote ports.
    • Apply network policies based on that information. If the given address (and port) is in our blocklist, we drop the packet. We also have allowlist logic for cases where there is a wildcard blocklist policy.
    • Perform a basic application layer (L7) inspection:
      • HTTP – some HTTP headers might contain information about the client that was masked by load balancers, which we then also use for enforcing network policies and dropping the packet.
  • In user space, after retrieving a packet from eBPF (via PerfEventArray), for further analysis.
    • We match all packets against our sets of security rules, which are compatible with Suricata and ModSecurity.
    • When a packet matches any rule, we raise an alert.
    • Each rule has an alert severity: critical, high, medium, or low. Our thresholds for each severity are configurable.
    • Once a threshold is reached, we automatically create a new network policy, which blocks that particular traffic inline, in eBPF.
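The last two points (configurable per-severity thresholds that eventually produce a blocking policy) could be modeled roughly as follows. This is only a guess at the shape of such logic, not Deepfence's implementation: the `AlertTracker` type, counting alerts per source address, and the "reached means count >= limit" rule are all assumptions made for the sketch.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Severity {
    Critical,
    High,
    Medium,
    Low,
}

/// Hypothetical tracker: counts alerts per (source IPv4 address, severity).
pub struct AlertTracker {
    thresholds: HashMap<Severity, u32>,
    counts: HashMap<(u32, Severity), u32>,
}

impl AlertTracker {
    /// `thresholds` maps a severity to the number of alerts that triggers a policy.
    pub fn new(thresholds: HashMap<Severity, u32>) -> Self {
        AlertTracker {
            thresholds,
            counts: HashMap::new(),
        }
    }

    /// Records one alert. Returns `true` when the configured threshold for this
    /// severity has been reached, i.e. the source should be added to a blocklist.
    /// Severities without a configured threshold never trigger a policy.
    pub fn record(&mut self, source: u32, severity: Severity) -> bool {
        let count = self.counts.entry((source, severity)).or_insert(0);
        *count += 1;
        match self.thresholds.get(&severity) {
            Some(limit) => *count >= *limit,
            None => false,
        }
    }
}
```

In a complete system, a `true` result would lead user space to insert the source address into a map that the inline eBPF classifier consults.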

Example of a TC classifier eBPF program

At the beginning, we mentioned the TC classifier type of eBPF program. All incoming traffic passes through TC and is then delivered to a bound socket, where the data can be consumed in user space. Outgoing traffic follows the same path in reverse: the data goes through the socket API and then through TC. This means that by attaching to TC, you can intercept the kernel socket buffer (sk_buff, for those who have ventured into the kernel code) and analyze all of it. On top of accessing the content, you can also make decisions such as dropping the packet or letting it through.

This is an example of an eBPF program applying a simple ingress network policy:

const ETH_HDR_LEN: usize = mem::size_of::<ethhdr>();

const ETH_P_IP: u16 = 0x0800;
const IPPROTO_TCP: u8 = 6;
const IPPROTO_UDP: u8 = 17;

#[map]
static mut BLOCKLIST_V4_INGRESS: HashMap<u32, u8> = HashMap::with_max_entries(1024, 0);

#[classifier(name = "tc_cls_ingress")]
pub fn tc_cls_ingress(ctx: SkBuffContext) -> i32 {
    match try_tc_cls_ingress(ctx) {
        Ok(_) => TC_ACT_PIPE,
        Err(_) => TC_ACT_SHOT,
    }
}

fn try_tc_cls_ingress(ctx: SkBuffContext) -> Result<(), i64> {
    let eth_proto = u16::from_be(ctx.load(offset_of!(ethhdr, h_proto))?);
    let ip_proto = ctx.load::<u8>(ETH_HDR_LEN + offset_of!(iphdr, protocol))?;
    if !(eth_proto == ETH_P_IP && (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP)) {
        return Ok(());
    }

    let saddr = u32::from_be(ctx.load(ETH_HDR_LEN + offset_of!(iphdr, saddr))?);

    if unsafe { BLOCKLIST_V4_INGRESS.get(&saddr) }.is_some() {
        error!(&ctx, "blocked packet");
        return Err(-1);
    }

    info!(&ctx, "accepted packet");
    Ok(())
}

Rust compiles this code into an eBPF ELF object file (tc-filter in our example) that the Linux eBPF VM can execute.

Loading and attaching the eBPF program in user space:

    let mut bpf = Bpf::load(include_bytes_aligned!(
    ".../target/bpfel-unknown-none/release/tc-filter"
    ))?;
    let prog: &mut SchedClassifier = bpf.program_mut("tc_cls_ingress").unwrap().try_into()?;
    prog.load()?;
    let _ = tc::qdisc_add_clsact("eth0");
    prog.attach("eth0", TcAttachType::Ingress)?;

The code above loads the eBPF binary, loads the specific program, and adds a new qdisc to TC. qdisc is short for “queuing discipline”, and qdiscs are mandatory. They allow multiple eBPF programs to be attached to the same interface. Finally, we attach the eBPF classifier tc_cls_ingress to TC on ingress for the interface eth0, so any incoming packet reaching TC calls the tc_cls_ingress function. The same can be done on egress.

Now that we have seen how eBPF programs can be built and triggered, let’s go deeper and see what they can do with the socket buffer. A socket buffer fully encapsulates one TCP or one UDP packet. This means that if you want to reconstruct HTTP messages, you need to collect the TCP packets and reorder them properly.
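The header checks performed by the ingress classifier earlier in this section can also be tried out in ordinary user space Rust, which is handy for experimenting with the offsets. The sketch below is a model, not eBPF code: `classify` and `Verdict` are hypothetical names, the frame is a plain byte slice, and a `HashSet` stands in for the blocklist map. It assumes an untagged Ethernet frame carrying IPv4, where the EtherType sits at byte 12, the IP protocol at byte 9 of the IP header, and the source address at byte 12 of the IP header.

```rust
use std::collections::HashSet;

const ETH_HDR_LEN: usize = 14;
const ETH_P_IP: u16 = 0x0800;
const IPPROTO_TCP: u8 = 6;
const IPPROTO_UDP: u8 = 17;

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    Drop,
}

/// Returns `None` when the frame is too short to contain the inspected fields
/// (the eBPF version treats a failed load as an error and drops the packet).
pub fn classify(frame: &[u8], blocklist: &HashSet<u32>) -> Option<Verdict> {
    // EtherType: bytes 12..14 of the Ethernet header, big-endian on the wire.
    let eth_proto = u16::from_be_bytes([*frame.get(12)?, *frame.get(13)?]);
    // IPv4 protocol field: byte 9 of the IP header.
    let ip_proto = *frame.get(ETH_HDR_LEN + 9)?;
    if !(eth_proto == ETH_P_IP && (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP)) {
        // Anything that is not TCP or UDP over IPv4 is let through uninspected.
        return Some(Verdict::Pass);
    }

    // IPv4 source address: bytes 12..16 of the IP header.
    let s = frame.get(ETH_HDR_LEN + 12..ETH_HDR_LEN + 16)?;
    let saddr = u32::from_be_bytes([s[0], s[1], s[2], s[3]]);

    if blocklist.contains(&saddr) {
        Some(Verdict::Drop)
    } else {
        Some(Verdict::Pass)
    }
}
```

Feeding it a hand-built 34-byte frame with EtherType 0x0800, protocol 6, and source 10.0.0.1 yields `Pass` until `u32::from_be_bytes([10, 0, 0, 1])` is added to the blocklist, after which it yields `Drop`.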

Sending data to user space with PerfEventArray

There are multiple ways for eBPF programs and user space programs to communicate back and forth. For quickly transmitting data from eBPF to user space, PerfEventArray is the most efficient option, and fortunately Aya provides nice utilities around it.

PerfEventArray was initially designed just to report metrics about traffic performance (hence the name), but in reality you can use those arrays to pass any data you want. Aya makes this dead easy, because the same data type can be used for the PerfEventArray and in your user space program.

In the future we want to support ring buffers in Aya, which bring better performance and are supported by newer kernels. That work is in progress.

Let’s see how to transmit socket buffers to user space.

First, we define a custom data type:

#[derive(Copy, Clone, Debug, Hash, Eq, PartialEq)]
#[repr(C)]
pub struct OobBuffer {
    pub direction: TrafficDirection,
    pub size: usize,
}

PerfEventArray in eBPF programs:

#[map]
static mut OOB_DATA: PerfEventArray<OobBuffer> = PerfEventArray::new(0);

#[classifier(name = "my_ingress_endpoint")]
fn tc_cls_ingress(skb: SkBuffContext) -> i32 {
    unsafe {
        OOB_DATA.output(
            // Pass the context by reference so `skb` can still be used below.
            &skb,
            &OobBuffer {
                direction: TrafficDirection::Ingress,
                size: skb.len() as usize,
            },
            // Packet length: the packet bytes are appended after the custom type.
            skb.len(),
        );
    }

    TC_ACT_PIPE
}

PerfEventArray in user space:

    let mut oob_events: AsyncPerfEventArray<_> =
        bpf.map_mut("OOB_DATA").unwrap().try_into().unwrap();

    for cpu_id in online_cpus()? {
        // Open one perf buffer per online CPU.
        let mut oob_cpu_buf = oob_events.open(cpu_id, Some(256))?;
        spawn(&format!("cpu_{}_perf_read", cpu_id), async move {
            loop {
                // `sizes.buf_count` is defined elsewhere in the project.
                let mut bufs = (0..sizes.buf_count)
                    .map(|_| BytesMut::with_capacity(128 * 4096))
                    .collect::<Vec<_>>();
                let events = oob_cpu_buf.read_events(&mut bufs).await.unwrap();
                // Play with the received events in bufs
            }
        });
    }

The PerfEventArray is a ring buffer, bound under the map name OOB_DATA so that it is shared between the eBPF program and user space. The faster you retrieve events from the buffer, the fewer you are going to miss. Here, we open the PerfEventArray and spawn one tokio task per CPU, since one perf buffer is allocated per CPU. Then we start reading from it asynchronously. When an event is sent from eBPF, the user space task is woken up and starts reading the event. Note that our PerfEventArray data is composed of the custom type followed by the appended socket buffer. So to retrieve the underlying socket buffer, we can simply skip past the custom type and access the remaining bytes.

Here we are: we have a way to attach an eBPF program to TC and retrieve socket buffers in user space. The fun work can start!
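Skipping past the custom type can be sketched in a few lines of plain Rust. The `OobHeader` struct below is a simplified, hypothetical stand-in for the custom type used above (two `u32` fields instead of an enum and a `usize`), and `split_event` is an illustrative helper rather than an Aya API.

```rust
use std::mem::size_of;

/// Hypothetical fixed-size header that precedes the packet bytes in each event.
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(C)]
pub struct OobHeader {
    pub direction: u32,
    pub size: u32,
}

/// Splits one event buffer into the header and the packet bytes that follow it.
/// Returns `None` if the buffer is shorter than the header claims.
pub fn split_event(buf: &[u8]) -> Option<(OobHeader, &[u8])> {
    let header_len = size_of::<OobHeader>();
    if buf.len() < header_len {
        return None;
    }
    // SAFETY: the length was checked above, any bit pattern is a valid pair of
    // u32 values, and `read_unaligned` does not require an aligned pointer.
    let header = unsafe { (buf.as_ptr() as *const OobHeader).read_unaligned() };
    // The buffer may carry trailing padding, so trust `size` for the packet
    // length, but only if that many bytes are really present.
    let packet = buf[header_len..].get(..header.size as usize)?;
    Some((header, packet))
}
```

The returned slice borrows from the event buffer, so the packet bytes are not copied before they are handed to the next processing stage.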

Processing packets in user space

Deepfence does deep packet inspection by reordering the TCP frames and reconstructing HTTP messages from the socket buffer data gathered by eBPF. Once the HTTP payload is reconstructed, we apply rule matching to detect whether something malicious is present. If so, we generate an alert and notify customers. We handle the other application layer (L7) protocols the same way.
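The reordering step can be illustrated with a very small reassembler keyed by TCP sequence number. This is a sketch under strong simplifying assumptions, not Deepfence's implementation: `StreamReassembler` is a hypothetical type that handles a single direction of a single connection and ignores retransmissions, overlapping segments, and memory limits.

```rust
use std::collections::BTreeMap;

/// Hypothetical single-stream reassembler keyed by TCP sequence number.
pub struct StreamReassembler {
    /// Sequence number of the next byte we expect.
    next_seq: u32,
    /// Segments that arrived ahead of `next_seq`.
    pending: BTreeMap<u32, Vec<u8>>,
    /// Contiguous payload reconstructed so far.
    assembled: Vec<u8>,
}

impl StreamReassembler {
    pub fn new(initial_seq: u32) -> Self {
        StreamReassembler {
            next_seq: initial_seq,
            pending: BTreeMap::new(),
            assembled: Vec::new(),
        }
    }

    /// Buffers a segment, then appends every segment that now continues the stream.
    pub fn push(&mut self, seq: u32, payload: &[u8]) {
        self.pending.insert(seq, payload.to_vec());
        while let Some(data) = self.pending.remove(&self.next_seq) {
            // Sequence numbers wrap around at 2^32.
            self.next_seq = self.next_seq.wrapping_add(data.len() as u32);
            self.assembled.extend_from_slice(&data);
        }
    }

    pub fn assembled(&self) -> &[u8] {
        &self.assembled
    }
}
```

Segments pushed out of order stay in `pending` until the gap before them is filled, at which point the contiguous bytes become available for HTTP parsing and rule matching.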

A rule defines how to detect specific malicious content in HTTP payloads. It also includes meta information on its purpose (the reason for its creation, or what it detects). Rules usually rely on a needle-in-a-haystack (substring search) approach, but also on regular expression matching. Matching happens on different parts of the HTTP message: headers, port, or even the HTML body. Needless to say, such operations are CPU intensive. Here is an example rule:

alert http $HOME_NET any -> any any (msg:"ET POLICY Outgoing Basic Auth Base64 HTTP Password detected unencrypted";
flow:established,to_server; threshold: type both, count 1, seconds 300, track by_src;
http.header; content:"Authorization|3a 20|Basic"; nocase; content:!"YW5vbnltb3VzOg==";
within:32; content:!"Proxy-Authorization|3a 20|Basic"; nocase;
content:!"KG51bGwpOihudWxsKQ=="; 
reference:url,doc.emergingthreats.net/bin/view/Main/2006380;
classtype:policy-violation; sid:2006380; rev:15; 
metadata:created_at 2010_07_30, former_category POLICY, updated_at 2022_06_14;)

The rule above is an Emerging Threats rule. It can be roughly translated as: any HTTP payload whose headers contain the text Authorization: Basic followed by anything but YW5vbnltb3VzOg== should trigger an alert saying that an unencrypted password was found.
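That rough translation can be turned into a small substring-search function. The sketch below is a loose approximation and does not reproduce Suricata's exact matching semantics: `basic_auth_alert` is a hypothetical helper, and treating `within:32` as "look only at the 32 bytes after the header name" is a simplifying assumption.

```rust
// Values taken from the rule: base64 of "anonymous:" and of "(null):(null)".
const ANONYMOUS_B64: &str = "YW5vbnltb3VzOg==";
const NULL_B64: &str = "KG51bGwpOihudWxsKQ==";

/// Plain needle-in-a-haystack search over bytes.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Returns true when the headers look like they carry real Basic credentials.
pub fn basic_auth_alert(headers: &str) -> bool {
    const NEEDLE: &str = "authorization: basic";
    const PROXY: &str = "proxy-authorization: basic";

    // `nocase` in the rule: compare case-insensitively. ASCII lowercasing
    // keeps byte offsets identical to the original string.
    let lower = headers.to_ascii_lowercase();
    if lower.contains(PROXY) || headers.contains(NULL_B64) {
        return false;
    }
    let start = match lower.find(NEEDLE) {
        Some(i) => i + NEEDLE.len(),
        None => return false,
    };
    // Only inspect the 32 bytes that follow the header name.
    let end = (start + 32).min(headers.len());
    let window = &headers.as_bytes()[start..end];
    !contains_bytes(window, ANONYMOUS_B64.as_bytes())
}
```

A real engine compiles many such rules into shared multi-pattern matchers rather than scanning once per rule, which is part of why rule matching is CPU intensive.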

Finding the right set of rules is challenging, and applying them at the right time is crucial. At Deepfence, we aggregate rules from different sources and in different formats, such as the Emerging Threats rules and the ModSecurity Core Rule Set. Users can also provide their own rules. We apply them to the live traffic captured by the eBPF program to achieve real-time alerting.

Watching processes and containers

Apart from network tracing, Deepfence also focuses on monitoring processes and container workloads. That can be achieved with a tracepoint eBPF program triggered by new processes in the system.

#[map]
pub static mut RUNC_EVENT_SCRATCH: PerCpuArray<RuncEvent> = PerCpuArray::with_max_entries(1, 0);
#[map]
pub static mut RUNC_EVENTS: PerfEventArray<RuncEvent> = PerfEventArray::new(0);

#[tracepoint(name = "runc_tracepoint")]
pub fn runc_tracepoint(ctx: TracePointContext) -> i64 {
    match try_runc_tracepoint(ctx) {
        Ok(ret) => ret,
        Err(ret) => ret,
    }
}

fn try_runc_tracepoint(ctx: TracePointContext) -> Result<i64, i64> {
    // To check offset values:
    // sudo cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/format
    const FILENAME_POINTER_OFFSET: usize = 8;
    let buf = unsafe {
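        // FILENAME_BUF is a per-CPU scratch map holding a `buf` byte array;
        // its definition is not shown in this snippet.
        let ptr = FILENAME_BUF.get_ptr_mut(0).ok_or(0i64)?;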
        &mut *ptr
    };
    let filename = unsafe {
        let len = bpf_probe_read_kernel_str(
            (ctx.as_ptr() as *const u8).add(ctx.read_at::<u16>(FILENAME_POINTER_OFFSET)? as usize),
            &mut buf.buf,
        )?;
        core::str::from_utf8_unchecked(&buf.buf[..len])
    };
    if filename.ends_with("runc\0") {
        let pid = bpf_get_current_pid_tgid() as u32;
        let event = RuncEvent { pid };
        unsafe { RUNC_EVENTS.output(&ctx, &event, 0) };
    }
    Ok(0)
}

This program basically:

  • Gets triggered by the sched_process_exec tracepoint, which fires when a new process is spawned by executing a binary.
  • Checks whether the filename ends with runc.
  • If so, outputs an event to user space via a PerfEventArray.

Of course, with such simple filtering, we have a problem if someone runs a binary called foobar-runc that has nothing to do with the real runc. But let’s deal with that in user space.

The definition of RuncEvent is below; it contains just a PID:

#[derive(Debug, Clone)]
#[repr(C)]
pub struct RuncEvent {
    pub pid: u32,
}

Then we can consume the event in user space:

    let mut oob_events: AsyncPerfEventArray<_> =
        bpf.map_mut("RUNC_EVENTS").unwrap().try_into().unwrap();

    for cpu_id in online_cpus()? {
        let mut runc_buf = oob_events.open(cpu_id, Some(256))?;
        spawn(&format!("cpu_{}_runc_perf_read", cpu_id), async move {
            loop {
                // `sizes.buf_count` is defined elsewhere in the project.
                let mut bufs = (0..sizes.buf_count)
                    .map(|_| BytesMut::with_capacity(128 * 4096))
                    .collect::<Vec<_>>();
                let events = runc_buf.read_events(&mut bufs).await.unwrap();
                // Only the first `events.read` buffers contain new events.
                for i in 0..events.read {
                    let buf = &bufs[i];
                    let ptr = buf.as_ptr() as *const RuncEvent;
                    let event = unsafe { ptr.read_unaligned() };
                    handle_runc_event(event.pid).unwrap();
                }
            }
        });
    }

It’s better to put our logic in a separate function, since the loop above is already complex enough. There we can look up the actual process:

fn handle_runc_event(pid: u32) -> Result<(), anyhow::Error> {
    let p = Process::new(pid as i32)?;
    // do something with `p`, parse its cmdline
    // and check if it's actually runc
    Ok(())
}

This way of monitoring runc processes is agnostic to container engines (Docker, podman) and orchestration systems (Kubernetes and its different CRI implementations), so it’s universal and fast. Based on container creation events, we can then start parsing the container configuration.

We use this solution for monitoring and scanning container filesystems whenever a new container is created.
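The user space check that was left as a comment in handle_runc_event (parse the cmdline and decide whether the process really is runc) could look roughly like this. It is a sketch that works on the raw bytes of /proc/<pid>/cmdline, where arguments are separated by NUL bytes; `parse_cmdline` and `is_runc` are hypothetical helpers, and comparing only the basename of the first argument is an assumption about what counts as "actually runc".

```rust
/// Splits the raw contents of /proc/<pid>/cmdline into arguments.
/// Empty arguments are dropped, which is good enough for this sketch.
pub fn parse_cmdline(raw: &[u8]) -> Vec<String> {
    raw.split(|b| *b == 0)
        .filter(|arg| !arg.is_empty())
        .map(|arg| String::from_utf8_lossy(arg).into_owned())
        .collect()
}

/// Returns true only when the basename of argv[0] is exactly "runc",
/// so a binary such as `foobar-runc` is rejected.
pub fn is_runc(raw_cmdline: &[u8]) -> bool {
    let args = parse_cmdline(raw_cmdline);
    match args.first() {
        Some(argv0) => argv0.rsplit('/').next() == Some("runc"),
        None => false,
    }
}
```

The bytes can be obtained with `std::fs::read(format!("/proc/{}/cmdline", pid))`. The read may fail if the process has already exited, so the caller should treat an error as "not runc" rather than unwrap it.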

Conclusion

In this post we introduced eBPF and Aya, and showed how Deepfence leverages those technologies to reliably detect real customer security issues.

If you have questions or want to hear more, we encourage you to read our Deep Packet Inspection documentation and join the Deepfence Slack workspace!

If you want to learn more about Aya, check out the Aya book.

Aya has a very active community on Discord, where conversations happen pretty much every day. We invite you to join, and feel free to ask any questions related to Aya and eBPF!

Finally, if you’re interested in hacking on eBPF, Rust, and Kubernetes, reach out – careers(at)deepfence(dot)io

posted @ 2023-03-30 17:30  _朝晖