什么是 nsenter,nsenter 是 runc 中的一个 package,它包含了一个特殊的构造函数,用来在 Go runtime 启动之前做一些事情,比如 setns()。nsenter 会引入 C 并使用 cgo 实现相关逻辑。在cgo中,如果在 C 的 import 后紧跟注释,则在编译程序包的 C 语言实现部分时,该注释将用作 header。因此,每次 import nsenter 时,nsexec()都会调用 C 函数。
在 runc 中只有 init.go import 了 nsenter。
Why
容器技术最关键的就是 namespace 和 cgroup,其中 namespace 是通过 setns() 函数来实现的,但是 setns() 有一个问题: A multithreaded process may not change user namespace with setns(). 。而 go runtime 是多线程的,所以需要在 go runtime 启动前执行 setns() 设置好 namespace,然后再走 go 相关实现流程。
/* * Okay, so this is quite annoying. * * In order for this unsharing code to be more extensible we need to split * up unshare(CLONE_NEWUSER) and clone() in various ways. The ideal case * would be if we did clone(CLONE_NEWUSER) and the other namespaces * separately, but because of SELinux issues we cannot really do that. But * we cannot just dump the namespace flags into clone(...) because several * usecases (such as rootless containers) require more granularity around * the namespace setup. In addition, some older kernels had issues where * CLONE_NEWUSER wasn't handled before other namespaces (but we cannot * handle this while also dealing with SELinux so we choose SELinux support * over broken kernel support). * * However, if we unshare(2) the user namespace *before* we clone(2), then * all hell breaks loose. * * The parent no longer has permissions to do many things (unshare(2) drops * all capabilities in your old namespace), and the container cannot be set * up to have more than one {uid,gid} mapping. This is obviously less than * ideal. In order to fix this, we have to first clone(2) and then unshare. * * Unfortunately, it's not as simple as that. We have to fork to enter the * PID namespace (the PID namespace only applies to children). Since we'll * have to double-fork, this clone_parent() call won't be able to get the * PID of the _actual_ init process (without doing more synchronisation than * I can deal with at the moment). So we'll just get the parent to send it * for us, the only job of this process is to update * /proc/pid/{setgroups,uid_map,gid_map}. * * And as a result of the above, we also need to setns(2) in the first child * because if we join a PID namespace in the topmost parent then our child * will be in that namespace (and it will not be able to give us a PID value * that makes sense without resorting to sending things with cmsg). * * This also deals with an older issue caused by dumping cloneflags into * clone(2): On old kernels, CLONE_PARENT didn't work with CLONE_NEWPID, so * we have to unshare(2) before clone(2) in order to do this. This was fixed * in upstream commit 1f7f4dde5c945f41a7abc2285be43d918029ecc5, and was * introduced by 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e. As far as we're * aware, the last mainline kernel which had this bug was Linux 3.12. * However, we cannot comment on which kernels the broken patch was * backported to. * * -- Aleksa "what has my life come to?" Sarai */
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】凌霞软件回馈社区,博客园 & 1Panel & Halo 联合会员上线
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步