OCI and runC

I. OCI

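OCI (Open Container Initiative) is an organization whose main goal is to standardize container technology by defining container standards precisely. It exists to resolve the fragmentation of container formats and runtimes: without a common standard, the industry cannot build containers in a consistent, interoperable way. OCI was therefore established in 2015, led by Docker together with other companies, to define these standards.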

II. The OCI standards

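The two core OCI standards discussed here are runtime-spec and image-spec, which define the container runtime standard and the container image standard respectively.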

III. runC

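runC is the container runtime that Docker donated to OCI, and it is one of the most widely used container runtimes. Docker itself still uses runc as its underlying runtime. To run a container with runc, first prepare a bundle directory: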

# create the top most bundle directory
mkdir /mycontainer
cd /mycontainer

# create the rootfs directory
mkdir rootfs

# export busybox via Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -

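These steps unpack a filesystem into the bundle's rootfs directory. Running runc spec inside the bundle then generates a config.json automatically. Together they make up an OCI runtime bundle. config.json defines everything needed to run the container.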

The rootfs directory under the bundle holds the container's root filesystem and its contents. The main fields of config.json are:

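ociVersion: the version of the OCI runtime specification the bundle conforms to.
process: the container process, including its command (args), environment variables, and working directory.
root: the path to the rootfs and whether it is read-only.
mounts: the additional mounts to set up inside the container.
hooks: scripts or programs to run at specific points in the container's lifecycle.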

There are other fields as well; see the OCI runtime specification for details.
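To make the shape of config.json concrete, here is a minimal Go sketch that decodes a few of these fields. The Spec, Process, and Root types and the sampleConfig document are hypothetical, trimmed-down stand-ins written for this example; the full field set is defined by the OCI runtime specification.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Spec mirrors only a handful of top-level config.json fields.
// It is a hypothetical, trimmed-down type for illustration.
type Spec struct {
	Version string   `json:"ociVersion"`
	Process *Process `json:"process,omitempty"`
	Root    *Root    `json:"root,omitempty"`
}

// Process holds the command, environment, and working directory.
type Process struct {
	Args []string `json:"args"`
	Env  []string `json:"env,omitempty"`
	Cwd  string   `json:"cwd"`
}

// Root points at the root filesystem inside the bundle.
type Root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly,omitempty"`
}

// sampleConfig is a made-up, minimal config.json body.
const sampleConfig = `{
  "ociVersion": "1.0.2",
  "process": {"args": ["sh"], "env": ["PATH=/bin"], "cwd": "/"},
  "root": {"path": "rootfs", "readonly": true}
}`

// parseSpec decodes a config.json document into a Spec.
func parseSpec(data []byte) (*Spec, error) {
	var s Spec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	spec, err := parseSpec([]byte(sampleConfig))
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.Version, spec.Process.Args, spec.Root.Path)
}
```

A real runtime would read the file from the bundle directory and validate many more fields before using them.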

IV. How runC is implemented

1. runc and libcontainer

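runc and libcontainer are closely related: runc is a thin wrapper built on top of libcontainer. The runc command creates a new container, while the low-level interaction with the operating system is still done by libcontainer. In effect, runc is Docker's own low-level container code, libcontainer, repackaged and contributed to the community.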

runc does differ from the original libcontainer in some ways, the most important being that runc follows the OCI standard, including support for hooks.

2. The runc startup flow

The starting point is main() in runc/main.go, which registers a number of commands; these commands are runc's main functions. Like many container projects, runc uses github.com/urfave/cli to parse and dispatch command-line commands.

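The focus here is createCommand, which creates a container. It calls startContainer(context, spec, CT_ACT_CREATE, nil), which in turn calls createContainer. createContainer builds a logical container: an object that exists only in memory and is not yet running. The container object is produced by a libcontainer Factory: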

package libcontainer

import (
	"github.com/opencontainers/runc/libcontainer/configs"
)

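// Factory creates and loads containers for a given platform.
type Factory interface {
	// Create builds a new container from an id and a configuration.
	Create(id string, config *configs.Config) (Container, error)
	// Load restores an existing container from its saved state.
	Load(id string) (Container, error)
	// StartInitialization is the entry point used by runc init.
	StartInitialization() error
	Type() string
}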

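A factory is used because containers can be implemented on different platforms, such as Linux or Windows. LinuxFactory implements this interface on Linux, and the container it returns is a linuxContainer. Starting the logical container is then handed over to the runner.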
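The factory idea can be sketched in a few lines of Go. Everything below (the simplified Container and Factory interfaces, linuxFactory, and newFactory) is hypothetical and much smaller than the libcontainer originals; it only illustrates how a caller can ask for a platform-specific implementation through a common interface.

```go
package main

import (
	"fmt"
	"runtime"
)

// Container and Factory are hypothetical, heavily simplified stand-ins
// for the libcontainer interfaces of the same name.
type Container interface {
	ID() string
	Platform() string
}

type Factory interface {
	Create(id string) (Container, error)
	Type() string
}

// linuxContainer is only a logical container: nothing is running yet.
type linuxContainer struct{ id string }

func (c *linuxContainer) ID() string       { return c.id }
func (c *linuxContainer) Platform() string { return "linux" }

type linuxFactory struct{}

func (f *linuxFactory) Create(id string) (Container, error) {
	if id == "" {
		return nil, fmt.Errorf("container id must not be empty")
	}
	return &linuxContainer{id: id}, nil
}

func (f *linuxFactory) Type() string { return "linux" }

// newFactory picks an implementation for the given OS name.
func newFactory(goos string) (Factory, error) {
	switch goos {
	case "linux":
		return &linuxFactory{}, nil
	default:
		return nil, fmt.Errorf("no container factory for %q", goos)
	}
}

func main() {
	f, err := newFactory(runtime.GOOS)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	c, err := f.Create("demo")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(f.Type(), c.ID(), c.Platform())
}
```

The caller depends only on the Factory and Container interfaces, so supporting another platform means adding one more case to newFactory.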

The key method of runner is run. It converts the process section of config.json into a libcontainer.Process value. This too is only a logical process that is not yet running; the container is what actually runs it.

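The container is started through linuxContainer's start method. The first step is to call newParentProcess, which builds the parent-side process object; this is an important method. It begins by creating a socket pair with utils.NewSockPair("init"), which is used mainly for communication between the parent and child processes.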

func (c *linuxContainer) newParentProcess(p *Process) (parentProcess, error) {
	parentInitPipe, childInitPipe, err := utils.NewSockPair("init")
	if err != nil {
		return nil, newSystemErrorWithCause(err, "creating new init pipe")
	}
	messageSockPair := filePair{parentInitPipe, childInitPipe}

	parentLogPipe, childLogPipe, err := os.Pipe()
	if err != nil {
		return nil, fmt.Errorf("Unable to create the log pipe:  %s", err)
	}
	logFilePair := filePair{parentLogPipe, childLogPipe}

	cmd := c.commandTemplate(p, childInitPipe, childLogPipe)
	if !p.Init {
		return c.newSetnsProcess(p, cmd, messageSockPair, logFilePair)
	}
	if err := c.includeExecFifo(cmd); err != nil {
		return nil, newSystemErrorWithCause(err, "including execfifo in cmd.Exec setup")
	}
	return c.newInitProcess(p, cmd, messageSockPair, logFilePair)
}
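As a rough illustration of the socket pair created at the top of newParentProcess, the Go sketch below builds a connected pair of Unix sockets with the standard syscall package and sends one message across it. The helper names newSockPair and roundTrip are made up for this example, and both ends stay in one process here; in runc the child end is inherited by the runc init process instead. The code is Linux-only.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// newSockPair returns the two connected ends of a Unix stream socket pair,
// wrapped as *os.File so they can be read, written, and passed to a child.
func newSockPair(name string) (*os.File, *os.File, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, nil, err
	}
	parent := os.NewFile(uintptr(fds[0]), name+"-p")
	child := os.NewFile(uintptr(fds[1]), name+"-c")
	return parent, child, nil
}

// roundTrip writes msg on the parent end and reads it back on the child end.
func roundTrip(msg string) (string, error) {
	parent, child, err := newSockPair("init")
	if err != nil {
		return "", err
	}
	defer parent.Close()
	defer child.Close()

	if _, err := parent.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(child, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("bootstrap")
	if err != nil {
		panic(err)
	}
	fmt.Println("child end received:", got)
}
```

Unlike a plain pipe, a socket pair is bidirectional, which suits a protocol where parent and child take turns sending messages.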

newParentProcess ultimately returns an initProcess for the container's init process:

type initProcess struct {
	cmd             *exec.Cmd
	messageSockPair filePair
	logFilePair     filePair
	config          *initConfig
	manager         cgroups.Manager
	intelRdtManager intelrdt.Manager
	container       *linuxContainer
	fds             []string
	process         *Process
	bootstrapData   io.Reader
	sharePidns      bool
}

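cmd is the prepared command that the parent runs; it executes runc init. Once cmd is started, the child process (the future user container process) exists, but it does not yet know what to run: that startup configuration is sent to it by the parent. messageSockPair carries the communication between parent and child. The work is ultimately done by initProcess's start method: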

func (p *initProcess) start() (retErr error) {
	defer p.messageSockPair.parent.Close()
	// Start the prepared cmd, spawning a separate child process: the container process.
	err := p.cmd.Start()
	p.process.ops = p
	// close the write-side of the pipes (controlled by child)
	p.messageSockPair.child.Close()
	p.logFilePair.child.Close()
	if err != nil {
		p.process.ops = nil
		return newSystemErrorWithCause(err, "starting init process command")
	}
	defer func() {
		if retErr != nil {
			// terminate the process to ensure we can remove cgroups
			if err := ignoreTerminateErrors(p.terminate()); err != nil {
				logrus.WithError(err).Warn("unable to terminate initProcess")
			}

			p.manager.Destroy()
			if p.intelRdtManager != nil {
				p.intelRdtManager.Destroy()
			}
		}
	}()
	if err := p.manager.Apply(p.pid()); err != nil {
		return newSystemErrorWithCause(err, "applying cgroup configuration for process")
	}
	if p.intelRdtManager != nil {
		if err := p.intelRdtManager.Apply(p.pid()); err != nil {
			return newSystemErrorWithCause(err, "applying Intel RDT configuration for process")
		}
	}
	// Write the bootstrap data to the pipe; the child reads it and proceeds to the next step.
	if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
		return newSystemErrorWithCause(err, "copying bootstrap data to pipe")
	}
	childPid, err := p.getChildPid()
	if err != nil {
		return newSystemErrorWithCause(err, "getting the final child's pid from pipe")
	}

	fds, err := getPipeFds(childPid)
	if err != nil {
		return newSystemErrorWithCausef(err, "getting pipe fds for pid %d", childPid)
	}
	p.setExternalDescriptors(fds)
	if p.config.Config.Namespaces.Contains(configs.NEWCGROUP) && p.config.Config.Namespaces.PathOf(configs.NEWCGROUP) == "" {
		if _, err := p.messageSockPair.parent.Write([]byte{createCgroupns}); err != nil {
			return newSystemErrorWithCause(err, "sending synchronization value to init process")
		}
	}

	// Wait for our first child to exit
	if err := p.waitForChildExit(childPid); err != nil {
		return newSystemErrorWithCause(err, "waiting for our first child to exit")
	}

	if err := p.createNetworkInterfaces(); err != nil {
		return newSystemErrorWithCause(err, "creating network interfaces")
	}
	if err := p.updateSpecState(); err != nil {
		return newSystemErrorWithCause(err, "updating the spec state")
	}
	if err := p.sendConfig(); err != nil {
		return newSystemErrorWithCause(err, "sending config to init process")
	}
	var (
		sentRun    bool
		sentResume bool
	)

	ierr := parseSync(p.messageSockPair.parent, func(sync *syncT) error {
		switch sync.Type {
		case procReady:
			.......
		case procHooks:
			.......
		default:
			return newSystemError(errors.New("invalid JSON payload from child"))
		}
		return nil
	})

	if !sentRun {
		return newSystemErrorWithCause(ierr, "container init")
	}
	if p.config.Config.Namespaces.Contains(configs.NEWNS) && !sentResume {
		return newSystemError(errors.New("could not synchronise after executing prestart and CreateRuntime hooks with container process"))
	}
	if err := unix.Shutdown(int(p.messageSockPair.parent.Fd()), unix.SHUT_WR); err != nil {
		return newSystemErrorWithCause(err, "shutting down init pipe")
	}

	// Must be done after Shutdown so the child will exit and we can wait for it.
	if ierr != nil {
		p.wait()
		return ierr
	}
	return nil
}

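This is the core method. It does the following:

Runs cmd to start a separate process. What that process does is the work of initCommand, which is analyzed in the next section.
Copies bootstrapData into the pipe so the child process can read its configuration from it.
Calls parseSync() to synchronize with the container's init process over the init pipe and, once initialization has progressed far enough, runs callbacks such as the prestart hooks. Finally it shuts down the init pipe, and container creation is complete.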
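The synchronization step can be pictured as a stream of small JSON messages that the parent decodes one by one until the other side closes the pipe. The Go sketch below is a simplified model under that assumption: syncMsg, parseSyncSketch, and runHandshake are hypothetical names, a goroutine plays the role of the child, and the replies that the real parent sends back are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
)

// syncMsg is a hypothetical stand-in for runc's syncT message.
type syncMsg struct {
	Type string `json:"type"`
}

// parseSyncSketch decodes messages from r until EOF and hands each to fn.
func parseSyncSketch(r io.Reader, fn func(syncMsg) error) error {
	dec := json.NewDecoder(r)
	for {
		var m syncMsg
		if err := dec.Decode(&m); err != nil {
			if err == io.EOF {
				return nil // the writer closed its end: handshake finished
			}
			return err
		}
		if err := fn(m); err != nil {
			return err
		}
	}
}

// runHandshake plays both roles in one process: a goroutine acts as the
// child and reports its progress, while the caller acts as the parent.
func runHandshake() ([]string, error) {
	pr, pw := io.Pipe()
	defer pr.Close()

	go func() {
		enc := json.NewEncoder(pw)
		for _, t := range []string{"procHooks", "procReady"} {
			if err := enc.Encode(syncMsg{Type: t}); err != nil {
				pw.CloseWithError(err)
				return
			}
		}
		pw.Close()
	}()

	var seen []string
	err := parseSyncSketch(pr, func(m syncMsg) error {
		switch m.Type {
		case "procReady", "procHooks":
			seen = append(seen, m.Type)
			return nil
		default:
			return fmt.Errorf("invalid sync type %q", m.Type)
		}
	})
	return seen, err
}

func main() {
	seen, err := runHandshake()
	if err != nil {
		panic(err)
	}
	fmt.Println("messages from child:", seen)
}
```

The callback is where a parent would run hooks or apply settings before letting the child continue.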

3. Interaction between the child and parent processes

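The child process is the container process and the parent is the runc process. As the analysis above shows, runc starts a separate container process. This section looks at how that child process starts up.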

var initCommand = cli.Command{
	Name:  "init",
	Usage: `initialize the namespaces and launch the process (do not call it outside of runc)`,
	Action: func(context *cli.Context) error {
		factory, _ := libcontainer.New("")
		if err := factory.StartInitialization(); err != nil {
			// as the error is sent back to the parent there is no need to log
			// or write it to stderr because the parent process will handle this
			os.Exit(1)
		}
		panic("libcontainer: container init failed to exec")
	},
}

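libcontainer.New() creates a new LinuxFactory, and StartInitialization is called on it. StartInitialization reads the configuration and container state from the file descriptors inherited from the parent process and calls newContainerInit to build the initializer.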

func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd int) (initer, error) {
	var config *initConfig
	if err := json.NewDecoder(pipe).Decode(&config); err != nil {
		return nil, err
	}
	if err := populateProcessEnvironment(config.Env); err != nil {
		return nil, err
	}
	switch t {
	case initSetns:
		return &linuxSetnsInit{
			pipe:          pipe,
			consoleSocket: consoleSocket,
			config:        config,
		}, nil
	case initStandard:
		return &linuxStandardInit{
			pipe:          pipe,
			consoleSocket: consoleSocket,
			parentPid:     unix.Getppid(),
			config:        config,
			fifoFd:        fifoFd,
		}, nil
	}
	return nil, fmt.Errorf("unknown init type %q", t)
}

newContainerInit returns a different Linux init implementation depending on the init type.

type linuxStandardInit struct {
	pipe          *os.File
	consoleSocket *os.File
	parentPid     int
	fifoFd        int
	config        *initConfig
}

Finally the Init method of linuxStandardInit is called. It performs the following steps:

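setupNetwork: configures the container's network, calling the third-party netlink.LinkSetUp.
setupRoute: configures the container's static routes, calling the third-party netlink.RouteAdd.
label.Init: checks whether SELinux is enabled and stores the result in a global variable.
finalizeNamespace: adds the required privileged capabilities to the whitelist according to the config, sets up the user namespace, and closes file descriptors that are no longer needed.
unix.Openat: opens the exec FIFO write-only and writes 0 to it. The open blocks until the other end of the FIFO is opened for reading and the content is read.
syscall.Exec: the system call that executes the user-specified program inside the container.

Init also configures hostname, AppArmor, processLabel, sysctl, readonlyPaths, and maskedPaths. Although create does not run the user command, it does check the command path, so an error there is reported during create.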

Summary:

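runC is the low-level implementation of containers and works mainly by invoking system calls provided by Linux. The code analysis shows that container technology is essentially a combination of Linux features such as namespaces, cgroups, chroot, and filesystems like aufs. Together they solve the problem of giving applications a consistent runtime environment. The rootfs in particular removes the mismatch between development and production environments, which makes installing and deploying applications much more convenient.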

Namespaces provide container isolation. Conceptually, the mechanism boils down to the following simplified code:

cmd := exec.Command(initCmd, "init")
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
	syscall.CLONE_NEWNET | syscall.CLONE_NEWIPC,
}

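The code above asks the kernel to create new UTS, PID, mount, network, and IPC namespaces when the child process is created, which gives it its own isolated execution environment.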

cgroups limit the resources a process can use, such as CPU, memory, and blkio. In simplified form the logic looks like this:

	cgroupManager := cgroups.NewCgroupManager(containerID)
	defer cgroupManager.Destroy()
	cgroupManager.Set(res)
	cgroupManager.Apply(parent.Process.Pid)

The code above places the process's PID into the tasks file under the cgroup directory, after which the limits apply to it.

References:

https://github.com/opencontainers/runc

https://cizixs.com/2017/11/05/oci-and-runc/

posted @ 2020-12-23 19:45 周围静地出奇
