Creating Firecracker virtual machines with containers

The walkthrough below uses containerd as the runtime.

Introduction

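Ignite is an engine for launching Firecracker VMs; it runs each Firecracker VM inside a container. The project is currently stalled, which is a pity. I learned a great deal from reading through how Ignite works, and I hope this write-up can help keep the project maintained.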

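Ignite works much like Kubernetes: think of Firecracker as runc and of Ignite as the CRI (there is also Footloose, which plays the role of docker-compose). Ignite also uses a store (referred to below as Storage) that holds the cluster metadata (image/kernel/vm), comparable to etcd in Kubernetes.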

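In Ignite's VM creation flow, the VM is in fact created by the firecracker command running inside a container. The text below uses both "container" and "VM"; take care to keep the two apart.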

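Ignite first uses containerd to create a namespace named firecracker. Images are pulled and containers are created in that namespace, and inside the container ignite-spawn invokes firecracker to create the VM. The whole process involves building and mounting the VM filesystem, configuring and creating the container through containerd, configuring the VM network, and starting the VM with firecracker. The VM is started by the following chain of processes: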

/usr/bin/containerd-shim-runc-v2 -namespace firecracker -id ignite-ddf49307b5b27c34 -address /run/containerd/containerd.sock
/usr/local/bin/ignite-spawn --log-level=info ddf49307b5b27c34
firecracker --api-sock /var/lib/firecracker/vm/ddf49307b5b27c34/firecracker.sock

The resulting process tree:

containerd-shim─┬─ignite-spawn─┬─firecracker───2*[{firecracker}]
                │              └─14*[{ignite-spawn}]
                └─11*[{containerd-shim}]

Ignite operates containers through the following interface, which offers much the same functionality as the usual docker command line:

type Interface interface {
	PullImage(image meta.OCIImageRef) error
	InspectImage(image meta.OCIImageRef) (*ImageInspectResult, error)
	ExportImage(image meta.OCIImageRef) (io.ReadCloser, func() error, error)

	InspectContainer(container string) (*ContainerInspectResult, error)
	AttachContainer(container string) error
	RunContainer(image meta.OCIImageRef, config *ContainerConfig, name, id string) (string, error)
	StopContainer(container string, timeout *time.Duration) error
	KillContainer(container, signal string) error
	RemoveContainer(container string) error
	ContainerLogs(container string) (io.ReadCloser, error)

	Name() Name
	RawClient() interface{}

	PreflightChecker() preflight.Checker
}

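Ignite has three resource kinds: Image, Kernel and VM, representing the base image, the kernel image and the virtual machine respectively. Their metadata is kept in Storage under the path constants.DATA_DIR, which the code defines as /var/lib/firecracker.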

Ignite has two main directories:

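/var/lib/firecracker/: holds the metadata of Image, Kernel and VM objects, along with the kernel files, the VM filesystem files and so on. Each of the three object kinds has a UID, and its files live in the corresponding /var/lib/firecracker/<image/kernel/vm>/<UID> directory.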

/etc/firecracker/manifests: the VM manifest files used by the ignited daemon, which manages VMs by watching these files.
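
To make the layout concrete, here is a minimal sketch of how the per-object paths can be derived from a kind and a UID. The helper names objectDir and metadataPath are hypothetical and are not taken from Ignite; only the directory layout comes from the description above.

```go
package main

import (
	"fmt"
	"path"
)

// dataDir mirrors the value the article gives for constants.DATA_DIR.
const dataDir = "/var/lib/firecracker"

// objectDir returns the directory holding one object's files.
// kind is "image", "kernel" or "vm"; uid is the object's UID.
// (Hypothetical helper, for illustration only.)
func objectDir(kind, uid string) string {
	return path.Join(dataDir, kind, uid)
}

// metadataPath returns the location of the object's metadata.json.
func metadataPath(kind, uid string) string {
	return path.Join(objectDir(kind, uid), "metadata.json")
}

func main() {
	fmt.Println(objectDir("vm", "ddf49307b5b27c34"))
	fmt.Println(metadataPath("image", "669a5721d130ef1d"))
}
```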

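Ignite uses several storage types. The underlying Storage interface is shown below; its methods closely resemble the way client-go operates on Kubernetes resources, and Storage holds Ignite's CRD objects: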

type Storage interface {
	// New creates a new Object for the specified kind
	New(gvk schema.GroupVersionKind) (runtime.Object, error)
	// Get returns a new Object for the resource at the specified kind/uid path, based on the file content
	Get(gvk schema.GroupVersionKind, uid runtime.UID) (runtime.Object, error)
	// GetMeta returns a new Object's APIType representation for the resource at the specified kind/uid path
	GetMeta(gvk schema.GroupVersionKind, uid runtime.UID) (runtime.Object, error)
	// Set saves the Object to disk. If the Object does not exist, the
	// ObjectMeta.Created field is set automatically
	Set(gvk schema.GroupVersionKind, obj runtime.Object) error
	// Patch performs a strategic merge patch on the Object with the given UID, using the byte-encoded patch given
	Patch(gvk schema.GroupVersionKind, uid runtime.UID, patch []byte) error
	// Delete removes an Object from the storage
	Delete(gvk schema.GroupVersionKind, uid runtime.UID) error
	// List lists Objects for the specific kind
	List(gvk schema.GroupVersionKind) ([]runtime.Object, error)
	// ListMeta lists all Objects' APIType representation. In other words,
	// only metadata about each Object is unmarshalled (uid/name/kind/apiVersion).
	// This allows for faster runs (no need to unmarshal "the world"), and less
	// resource usage, when only metadata is unmarshalled into memory
	ListMeta(gvk schema.GroupVersionKind) ([]runtime.Object, error)
	// Count returns the amount of available Objects of a specific kind
	// This is used by Caches to check if all Objects are cached to perform a List
	Count(gvk schema.GroupVersionKind) (uint64, error)
	// Checksum returns a string representing the state of an Object on disk
	// The checksum should change if any modifications have been made to the
	// Object on disk, it can be e.g. the Object's modification timestamp or
	// calculated checksum
	Checksum(gvk schema.GroupVersionKind, uid runtime.UID) (string, error)
	// RawStorage returns the RawStorage instance backing this Storage
	RawStorage() RawStorage
	// Serializer returns the serializer
	Serializer() serializer.Serializer
	// Close closes all underlying resources (e.g. goroutines) used; before the application exits
	Close() error
}

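Ignite defines the resources it manages in CRD style. In the corresponding GVK, the group is ignite.weave.works; the versions are v1alpha2, v1alpha3 and v1alpha4; and the kinds are Image, Kernel and VM. scheme.Serializer provides the encoding and decoding for these CRDs: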

var (
	// Scheme is the runtime.Scheme to which all types are registered.
	Scheme = runtime.NewScheme()

	// codecs provides access to encoding and decoding for the scheme.
	// codecs is private, as Serializer will be used for all higher-level encoding/decoding
	codecs = k8sserializer.NewCodecFactory(Scheme)

	// Serializer provides high-level encoding/decoding functions
	Serializer = serializer.NewSerializer(Scheme, &codecs)
)

func init() {
	AddToScheme(Scheme)
}

// AddToScheme builds the scheme using all known versions of the api.
func AddToScheme(scheme *runtime.Scheme) {
	utilruntime.Must(ignite.AddToScheme(Scheme))
	utilruntime.Must(v1alpha2.AddToScheme(Scheme))
	utilruntime.Must(v1alpha3.AddToScheme(Scheme))
	utilruntime.Must(v1alpha4.AddToScheme(Scheme))
	utilruntime.Must(scheme.SetVersionPriority(v1alpha4.SchemeGroupVersion))
}
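
The GVK is what lets a decoder decide which Go type a stored file belongs to. As a rough illustration of that idea (not Ignite's serializer, and using only the standard library), the sketch below reads kind and apiVersion from a serialized object and splits them into group, version and kind. The names typeMeta, gvk and peekGVK are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// typeMeta holds only the two fields needed to identify an object's type.
type typeMeta struct {
	Kind       string `json:"kind"`
	APIVersion string `json:"apiVersion"`
}

// gvk is a simplified group/version/kind triple.
type gvk struct{ Group, Version, Kind string }

// peekGVK reads kind and apiVersion from a serialized object without
// decoding the rest of it.
func peekGVK(data []byte) (gvk, error) {
	var tm typeMeta
	if err := json.Unmarshal(data, &tm); err != nil {
		return gvk{}, err
	}
	group, version, ok := strings.Cut(tm.APIVersion, "/")
	if !ok || tm.Kind == "" {
		return gvk{}, fmt.Errorf("invalid type metadata: kind=%q apiVersion=%q", tm.Kind, tm.APIVersion)
	}
	return gvk{Group: group, Version: version, Kind: tm.Kind}, nil
}

func main() {
	data := []byte(`{"kind":"Image","apiVersion":"ignite.weave.works/v1alpha4","metadata":{"uid":"669a5721d130ef1d"}}`)
	id, err := peekGVK(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", id)
}
```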

Running

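Ignite offers two ways to manage VMs, each with its own command: ignite and ignited. The former manages VMs manually from the command line; the latter manages them automatically by watching VM manifest files.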

The Storage used by ignite is called GenericStorage, and the one used by ignited is called ManifestStorage. They are initialized as follows:

func SetGenericStorage() error {
	log.Trace("Initializing the GenericStorage provider...")
	providers.Storage = cache.NewCache(
		storage.NewGenericStorage(
			storage.NewGenericRawStorage(constants.DATA_DIR), scheme.Serializer))
	return nil
}

func SetManifestStorage() (err error) {
	log.Trace("Initializing the ManifestStorage provider...")
	ManifestStorage, err = manifest.NewTwoWayManifestStorage(constants.MANIFEST_DIR, constants.DATA_DIR, scheme.Serializer)
	if err != nil {
		return
	}

	providers.Storage = cache.NewCache(ManifestStorage)
	return
}

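Ignite uses a few variables in the providers package to operate on resources: providers.Runtime manages containerd resources, providers.Client manages Ignite's own resources, and providers.NetworkPlugin configures the container's CNI network.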

Building the VM filesystem

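The VM filesystem consists of two parts, a base filesystem and the kernel files, which come from the base image and the kernel image respectively. Building the VM filesystem merges the two into one complete filesystem, which firecracker later mounts as the VM's root fs when it starts the VM.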

The ignite CLI pulls the base image and the kernel image with these two commands:

$ ignite image import <OCI image> [flags]
$ ignite kernel import <OCI image> [flags]

Building the VM base filesystem file

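Pulling an image requires two parameters: providers.RuntimeName (default containerd) and providers.NetworkPluginName (default cni). The former creates the containerdClient, which pulls images and creates containers; the latter configures the CNI network and is not used while building the filesystem.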

Creating the containerdClient

Creating the containerdClient requires two things: containerd.sock and a containerd-shim. A system may have several containerd-shim versions installed; io.containerd.runc.v2 is preferred.

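Ignite uses containerd to create a namespace named firecracker, and all of its images and VMs live in that namespace, so ctr can be used to inspect Ignite's containers and images directly. The listing below shows the containers and images Ignite created. Ignite's container names start with ignite-, a prefix that ignite vm ls does not display (you can think of that command as showing the name of the VM inside the container):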

$ ctr -n firecracker containers ls 
CONTAINER                  IMAGE                                  RUNTIME                  
ignite-ddf49307b5b27c34    docker.io/weaveworks/ignite:v0.10.0    io.containerd.runc.v2    
$ ctr -n firecracker images ls 
REF           TYPE             DIGEST                                                                  SIZE     PLATFORMS                  LABELS 
docker.io/weaveworks/ignite-kernel:5.10.51 application/vnd.docker.distribution.manifest.list.v2+json sha256:c1d99eafa5b2bcaeab26c0a093d83d709a560e4721f52b6e7c5ef7e9e771189d 15.0 MiB linux/amd64,linux/arm64    -      
docker.io/weaveworks/ignite-ubuntu:latest  application/vnd.docker.distribution.manifest.list.v2+json sha256:11550e0912d24aeaad847f06fdf2133302f2af2fd2ce231723d078ffce9216ba 78.1 MiB linux/amd64,linux/arm64/v8 -      
docker.io/weaveworks/ignite:v0.10.0        application/vnd.docker.distribution.manifest.list.v2+json sha256:b8cc53c5cba81d685b1dc95a0f34ca3fa732ddd450b6f0eba0c829ccc1c67462 16.5 MiB linux/amd64,linux/arm64    -

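The containerdClient is created as shown below; all it needs is the containerd socket and a runtime. Note that when looking up the runtime, the code falls back to RuntimeRuncV1 if RuntimeRuncV2 is not available: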

func GetContainerdClient() (*ctdClient, error) {
	ctdSocket, err := StatContainerdSocket()
	if err != nil {
		return nil, err
	}

	runtime, err := getNewestAvailableContainerdRuntime() // find an available runtime
	if err != nil {
		// proceed with the default runtime -- our PATH can't see a shim binary, but containerd might be able to
		log.Warningf("Proceeding with default runtime %q: %v", runtime, err)
	}

	cli, err := containerd.New(
		ctdSocket,
		containerd.WithDefaultRuntime(runtime),
	)
	if err != nil {
		return nil, err
	}

	return &ctdClient{
		client: cli,
		ctx:    namespaces.WithNamespace(context.Background(), ctdNamespace), // use Ignite's namespace
	}, nil
}

func getNewestAvailableContainerdRuntime() (string, error) {
	for _, rt := range v2ShimRuntimes {
		binary := v2shim.BinaryName(rt)
		if binary == "" {
			// this shouldn't happen if the matching test is passing, but it's not fatal -- just log and continue
			log.Errorf("shim binary could not be found -- %q is an invalid runtime/v2/shim", rt)
		} else if _, err := exec.LookPath(binary); err == nil {
			return rt, nil
		}
	}
	...
}

v2ShimRuntimes = []string{
  plugin.RuntimeRuncV2,
  plugin.RuntimeRuncV1,
}

const (
	// RuntimeLinuxV1 is the legacy linux runtime
	RuntimeLinuxV1 = "io.containerd.runtime.v1.linux"
	// RuntimeRuncV1 is the runc runtime that supports a single container
	RuntimeRuncV1 = "io.containerd.runc.v1"
	// RuntimeRuncV2 is the runc runtime that supports multiple containers per shim
	RuntimeRuncV2 = "io.containerd.runc.v2"
)

Finally, containerd.New creates the containerdClient, which is saved in the providers.Runtime variable:

	cli, err := containerd.New(
		ctdSocket,
		containerd.WithDefaultRuntime(runtime),
	)

The containerd-related defaults are:

const (
	// DefaultRootDir is the default location used by containerd to store
	// persistent data
	DefaultRootDir = "/var/lib/containerd"
	// DefaultStateDir is the default location used by containerd to store
	// transient data
	DefaultStateDir = "/run/containerd"
	// DefaultAddress is the default unix socket address
	DefaultAddress = "/run/containerd/containerd.sock"
	// DefaultDebugAddress is the default unix socket address for pprof data
	DefaultDebugAddress = "/run/containerd/debug.sock"
	// DefaultFIFODir is the default location used by client-side cio library
	// to store FIFOs.
	DefaultFIFODir = "/run/containerd/fifo"
	// DefaultRuntime is the default linux runtime
	DefaultRuntime = "io.containerd.runc.v2"
	// DefaultConfigDir is the default location for config files.
	DefaultConfigDir = "/etc/containerd"
)

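For the differences between containerd-shim versions, see the article "Containerd shim 原理深入解读" (an in-depth look at how the containerd shim works).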

Creating the cniInstance

The cniInstance sets up the container's network. It is not used while building the VM filesystem, but it is initialized at the same time.

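Creating the cniInstance depends on the providers.Runtime obtained above, which identifies the container runtime whose CNI network is being configured. The result is saved in providers.NetworkPlugin.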

This step mainly initializes a CNI instance, cniInstance, through gocni.New; cniInstance.Setup is called later to configure the container network (see the "CNI" section below).

func GetCNINetworkPlugin(runtime runtime.Interface) (network.Plugin, error) {
	// If the CNI configuration directory doesn't exist, create it
	if !util.DirExists(CNIConfDir) {
		if err := os.MkdirAll(CNIConfDir, constants.DATA_DIR_PERM); err != nil {
			return nil, err
		}
	}

	binDirs := []string{CNIBinDir}
	cniInstance, err := gocni.New(gocni.WithMinNetworkCount(2),
		gocni.WithPluginConfDir(CNIConfDir),
		gocni.WithPluginDir(binDirs))
	if err != nil {
		return nil, err
	}

	return &cniNetworkPlugin{
		runtime: runtime,
		cni:     cniInstance,
		once:    &sync.Once{},
	}, nil
}

	// CNIBinDir describes the directory where the CNI binaries are stored
	CNIBinDir = "/opt/cni/bin"
	// CNIConfDir describes the directory where the CNI plugin's configuration is stored
	CNIConfDir = "/etc/cni/net.d"

Pulling the base image

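Ignite first looks the image up in its own image metadata to see whether it already exists. If it does not, the containerdClient (when the runtime is containerd) searches locally, much like running ctr --namespace firecracker images ls, and only if that also fails is the image pulled from the remote registry.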

func FindOrImportImage(c *client.Client, ociRef meta.OCIImageRef) (*api.Image, error) {
	log.Debugf("Ensuring image %s exists, or importing it...", ociRef)
	image, err := c.Images().Find(filter.NewIDNameFilter(ociRef.String())) // check whether the image already exists in the metadata
	if err == nil {
		// Return the image found
		log.Debugf("Found image with UID %s", image.GetUID())
		return image, nil
	}

	switch err.(type) {
	case *filterer.NonexistentError:
		return importImage(c, ociRef) // load the image from containerd locally or from the remote registry
	default:
		return nil, err
	}
}

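Now look at how the imageClient is initialized: it is given the Storage and the GVK of the image resource, the same logic client-go uses to look up Kubernetes resources. The kernel client and the VM client are initialized the same way as the image client, only with kind set to the corresponding type.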

func newImageClient(s storage.Storage, gv schema.GroupVersion) ImageClient {
	return &imageClient{
		storage:  s,
		filterer: filterer.NewFilterer(s),
		gvk:      gv.WithKind(api.KindImage.Title()),
	}
}

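The main handler, importImage, is shown below. Once the image has been loaded from containerd locally or from the remote registry, it initializes an image object of the given GVK and sets its fields: the image name, the image's OCI reference (for example weaveworks/ignite-ubuntu:latest) and the image UID. The UID uniquely identifies the image object (note that it identifies the CRD object, not the image's SHA digest). The image metadata can be inspected in /var/lib/firecracker/image/<UID>/metadata.json.

With the image object configured, dmlegacy.CreateImageFilesystem creates a file named image.ext4 in /var/lib/firecracker/image/<UID>/, resizes it with truncate, and runs "mkfs.ext4 -b 4096 -I 256 -F -E lazy_itable_init=0,lazy_journal_init=0 /var/lib/firecracker/image/<UID>/image.ext4" to initialize it as an empty ext4 filesystem. image.ext4 is then attached to a /dev/loop device and mounted, the base image files are imported into it (see "Creating the base filesystem file" for details), and finally the filesystem is unmounted. That completes the base filesystem file (image.ext4). Last, the image metadata is saved to Ignite's storage for later lookup: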

func importImage(c *client.Client, ociRef meta.OCIImageRef) (*api.Image, error) {
	log.Debugf("Importing image with ociRef %q", ociRef)
	// Parse the source
	dockerSource := source.NewDockerSource()
	src, err := dockerSource.Parse(ociRef) // load the image from containerd locally or pull it from the remote registry
	if err != nil {
		return nil, err
	}

	image := c.Images().New() // initialize an Ignite image object
	// Set the image name
	image.Name = ociRef.String()
	// Set the image's ociRef
	image.Spec.OCI = ociRef
	// Set the image's ociSource
	image.Status.OCISource = *src

	// Generate UID automatically
	if err := metadata.SetNameAndUID(image, c); err != nil { // set the image object's UID
		return nil, err
	}

	log.Infoln("Starting image import...")

	// Truncate a file for the filesystem, format it with ext4, and copy in the files from the source
	if err := dmlegacy.CreateImageFilesystem(image, dockerSource); err != nil { // create the ext4 filesystem and import the image files
		return nil, err
	}

	if err := c.Images().Set(image); err != nil { // store the new image's metadata
		return nil, err
	}

	log.Infof("Imported OCI image %q (%s) to base image with UID %q", ociRef, image.Status.OCISource.Size, image.GetUID())
	return image, nil
}

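Next, how Ignite loads an image from containerd or from the remote registry. The implementation is simple and relies on providers.Runtime: it first looks for a local image with providers.Runtime.InspectImage, and if there is none it pulls from the remote (like ctr --namespace firecracker images pull):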

func (ds *DockerSource) Parse(ociRef meta.OCIImageRef) (*api.OCIImageSource, error) {
	res, err := providers.Runtime.InspectImage(ociRef)
	if err != nil {
		log.Infof("%s image %q not found locally, pulling...", providers.Runtime.Name(), ociRef)
		if err := providers.Runtime.PullImage(ociRef); err != nil {
			return nil, err
		}

		if res, err = providers.Runtime.InspectImage(ociRef); err != nil {
			return nil, err
		}
	}

	if res.Size == 0 || res.ID == nil {
		return nil, fmt.Errorf("parsing %s image %q data failed", providers.Runtime.Name(), ociRef)
	}

	ds.imageRef = ociRef

	return &api.OCIImageSource{
		ID:   res.ID,
		Size: meta.NewSizeFromBytes(uint64(res.Size)),
	}, nil
}

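Once the image has been loaded, its metadata can be found under constants.DATA_DIR (/var/lib/firecracker). Below is the image metadata for weaveworks/ignite-ubuntu:latest, stored in /var/lib/firecracker/image/<UID>. metadata.json holds the metadata in JSON format; the CRD's group/version is ignite.weave.works/v1alpha4 and its kind is Image: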

# cd /var/lib/firecracker/image/669a5721d130ef1d

# ll
-rw-r--r--. 1 root root 295698432 Jul 14 10:53 image.ext4
-rw-r--r--. 1 root root       464 Jul 14 10:53 metadata.json

# cat metadata.json 
{
  "kind": "Image",
  "apiVersion": "ignite.weave.works/v1alpha4",
  "metadata": {
    "name": "weaveworks/ignite-ubuntu:latest",
    "uid": "669a5721d130ef1d",
    "created": "2023-07-14T02:53:01Z"
  },
  "spec": {
    "oci": "weaveworks/ignite-ubuntu:latest"
  },
  "status": {
    "ociSource": {
      "id": "oci://docker.io/weaveworks/ignite-ubuntu@sha256:52414720f26c808bc1273845c6d0f0a99472dfa8eaf8df52429261cbac27f1ba",
      "size": "249308KB"
    }
  }
}

The image object is defined as follows; it is a standard CRD-style object that maps directly onto the metadata.json above:

type Image struct {
	runtime.TypeMeta `json:",inline"`
	// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
	// Name is available at the .metadata.name JSON path
	// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
	runtime.ObjectMeta `json:"metadata"`

	Spec   ImageSpec   `json:"spec"`
	Status ImageStatus `json:"status"`
}

// ImageSpec declares what the image contains
type ImageSpec struct {
	OCI meta.OCIImageRef `json:"oci"`
}

type OCIImageSource struct {
	// ID defines the source's content ID (e.g. the canonical OCI path or Docker image ID)
	ID *meta.OCIContentID `json:"id"`
	// Size defines the size of the source in bytes
	Size meta.Size `json:"size"`
}

// ImageStatus defines the status of the image
type ImageStatus struct {
	// OCISource contains the information about how this OCI image was imported
	OCISource OCIImageSource `json:"ociSource"`
}

Creating the base filesystem file

As mentioned above, Ignite's Storage keeps the image metadata, while the image itself is imported into the filesystem created with mkfs.ext4. Here is how that happens.

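First locate the image.ext4 file created by mkfs, create a temporary directory, and mount image.ext4 onto it (a /dev/loop device is what presents the file as a filesystem).
Export the image as a tar archive (similar to docker export) and unpack it into the temporary directory.
Set up /etc/resolv.conf, mainly to make sure the file exists.
Unmount and delete the temporary directory. The base image filesystem file is now complete.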

func addFiles(img *api.Image, src source.Source) (err error) {
	log.Debugf("Copying in files to the image file from a source...")
	p := path.Join(img.ObjectPath(), constants.IMAGE_FS) // path of the ext4 file created by mkfs
	tempDir, err := ioutil.TempDir("", "") // create a temporary directory
	if err != nil {
		return
	}
	defer os.RemoveAll(tempDir)

	if _, err := util.ExecuteCommand("mount", "-o", "loop", p, tempDir); err != nil { // mount the ext4 filesystem on the temporary directory
		return fmt.Errorf("failed to mount image %q: %v", p, err)
	}
	defer util.DeferErr(&err, func() error {
		_, execErr := util.ExecuteCommand("umount", tempDir)
		return execErr
	})

	err = source.TarExtract(src, tempDir) // unpack the base image into the temporary directory
	if err != nil {
		return
	}

	err = setupResolvConf(tempDir) // make sure /etc/resolv.conf exists

	return
}

Building the VM kernel files

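As with the base filesystem file, building the kernel files starts by pulling the required kernel image. The "Creating the containerdClient" and "Creating the cniInstance" steps apply here too; the difference is that the kind in the GVK is Kernel.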

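The method that creates the kernel files is shown below. It is broadly similar to creating the base filesystem, but not every file in the kernel image is needed: only the /boot and /lib/modules directories, and /boot must contain the vmlinux file (a file required for creating the VM under KVM). The steps are: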

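Find the kernel image (locally, or by pulling it from the remote registry).
Create the kernel object and set its fields, such as the name and UID.
Extract the /boot and /lib/modules directories from the kernel image.
Copy the vmlinux file into the kernel's directory under constants.DATA_DIR.
Pack the extracted files into a file named kernel.tar in the same directory, to be merged with the base filesystem later. The resulting directory layout: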

$ pwd
/var/lib/firecracker/kernel/1bdd3b2354873157

$ ll
-rw-r--r--. 1 root root 73574400 Jul 14 10:53 kernel.tar
-rw-r--r--. 1 root root      492 Jul 14 10:53 metadata.json
-rwxr-xr-x. 1 root root 43526368 Jul 14 10:53 vmlinux

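Because the kernel files will later be placed into the filesystem, no separate filesystem has to be built for them; the required files are simply copied and packed locally. During "Create vm" the packed kernel files are unpacked into the base filesystem to merge the two: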

// importKernel imports a kernel from an OCI image
func importKernel(c *client.Client, ociRef meta.OCIImageRef) (*api.Kernel, error) {
	log.Debugf("Importing kernel with ociRef %q", ociRef)
	// Parse the source
	dockerSource := source.NewDockerSource()
	src, err := dockerSource.Parse(ociRef) // load the image from containerd locally or from the remote registry
	if err != nil {
		return nil, err
	}

	kernel := c.Kernels().New() // initialize a kernel object
	// Set the kernel name
	kernel.Name = ociRef.String()
	// Set the kernel's ociRef
	kernel.Spec.OCI = ociRef
	// Set the kernel's ociSource
	kernel.Status.OCISource = *src

	// Generate UID automatically
	if err := metadata.SetNameAndUID(kernel, c); err != nil { // set the kernel object's UID
		return nil, err
	}

	// Cache the kernel contents in the kernel tar file
	kernelTarFile := path.Join(kernel.ObjectPath(), constants.KERNEL_TAR)

	// vmlinuxFile describes the uncompressed kernel file at /var/lib/firecracker/kernel/<id>/vmlinux
	vmlinuxFile := path.Join(kernel.ObjectPath(), constants.KERNEL_FILE)

	// Create both the kernel tar file and the vmlinux file if either doesn't exist
	if !util.FileExists(kernelTarFile) || !util.FileExists(vmlinuxFile) {
		// Create a temporary directory for extracting
		// the necessary files from the OCI image
		tempDir, err := ioutil.TempDir("", "")
		if err != nil {
			return nil, err
		}

		// Extract only the /boot and /lib directories of the tar stream into the tempDir
		err = source.TarExtract(dockerSource, tempDir, "boot", "lib/modules") // extract the required kernel files into the temporary directory
		if err != nil {
			return nil, err
		}

		// Locate the kernel file in the temporary directory
		kernelTmpFile, err := findKernel(tempDir) // look for the vmlinux file
		if err != nil {
			return nil, err
		}

		// Copy the vmlinux file
		if err := util.CopyFile(kernelTmpFile, vmlinuxFile); err != nil {
			return nil, fmt.Errorf("failed to copy kernel file %q to kernel %q: %v", kernelTmpFile, kernel.GetUID(), err)
		}

		// Pack the extracted kernel files into /var/lib/firecracker/kernel/<UID>/kernel.tar
		if _, err := util.ExecuteCommand("tar", "-cf", kernelTarFile, "-C", tempDir, "."); err != nil {
			return nil, err
		}

		// Remove the temporary directory
		if err := os.RemoveAll(tempDir); err != nil {
			return nil, err
		}
	}

	// Populate the kernel version field if possible
	if len(kernel.Status.Version) == 0 {
		cmd := fmt.Sprintf("strings %s | grep 'Linux version' | awk '{print $3}'", vmlinuxFile)
		// Use the pipefail option to return an error if any of the pipeline commands is not available
		out, err := util.ExecuteCommand("/bin/bash", "-o", "pipefail", "-c", cmd)
		if err != nil {
			kernel.Status.Version = "<unknown>"
		} else {
			kernel.Status.Version = out
		}
	}

	if err := c.Kernels().Set(kernel); err != nil { // save the kernel object to Storage
		return nil, err
	}

	log.Infof("Imported OCI image %q (%s) to kernel image with UID %q", ociRef, kernel.Status.OCISource.Size, kernel.GetUID())
	return kernel, nil
}

The kernel image's metadata:

# cat metadata.json 
{
  "kind": "Kernel",
  "apiVersion": "ignite.weave.works/v1alpha4",
  "metadata": {
    "name": "weaveworks/ignite-kernel:5.10.51",
    "uid": "1bdd3b2354873157",
    "created": "2023-07-14T02:53:10Z"
  },
  "spec": {
    "oci": "weaveworks/ignite-kernel:5.10.51"
  },
  "status": {
    "version": "5.10.51",
    "ociSource": {
      "id": "oci://docker.io/weaveworks/ignite-kernel@sha256:a992aa9f7b6f5e7945e72610017c3f4f38338ff1452964e30410bb6110a794a7",
      "size": "72588KB"
    }
  }
}
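
The version field is filled in by searching the vmlinux file for its "Linux version" banner; importKernel does this with a strings | grep | awk pipeline. A rough pure-Go equivalent of that idea is sketched below. kernelVersion is a hypothetical helper that returns the token after the first banner it finds, which is a simplification of what the shell pipeline prints.

```go
package main

import (
	"bytes"
	"fmt"
)

// kernelVersion scans raw kernel image bytes for the "Linux version <x>"
// banner and returns <x>, or "<unknown>" when no banner is present.
func kernelVersion(image []byte) string {
	marker := []byte("Linux version ")
	i := bytes.Index(image, marker)
	if i < 0 {
		return "<unknown>"
	}
	rest := image[i+len(marker):]
	// The version token ends at the first space, NUL byte or newline.
	end := bytes.IndexAny(rest, " \x00\n")
	if end < 0 {
		end = len(rest)
	}
	if end == 0 {
		return "<unknown>"
	}
	return string(rest[:end])
}

func main() {
	// A fabricated byte slice standing in for the contents of vmlinux.
	fake := []byte("\x7fELF\x00\x00Linux version 5.10.51 (root@build) #1 SMP\x00")
	fmt.Println(kernelVersion(fake))
}
```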

The kernel object is defined as follows, matching the metadata.json above:

type Kernel struct {
	runtime.TypeMeta `json:",inline"`
	// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
	// Name is available at the .metadata.name JSON path
	// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
	runtime.ObjectMeta `json:"metadata"`

	Spec   KernelSpec   `json:"spec"`
	Status KernelStatus `json:"status"`
}

// KernelSpec describes the properties of a kernel
type KernelSpec struct {
	OCI meta.OCIImageRef `json:"oci"`
	// Optional future feature, support per-kernel specific default command lines
	// DefaultCmdLine string
}

// KernelStatus describes the status of a kernel
type KernelStatus struct {
	Version   string         `json:"version"`
	OCISource OCIImageSource `json:"ociSource"`
}

Create vm

A VM is created with ignite vm create. This step only prepares everything needed before the VM starts; to actually start the VM you also need to run ignite vm start.

Configuring the VM object

First a VM object has to be initialized, which involves:

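Setting the VM object's image, runtime and network.
Merging in the custom parameters passed on the command line.
Validating the VM object.
Trying to pull the base image and the kernel image, and setting the image and kernel information on the VM object.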

To run a VM, either execute ignite vm create followed by ignite vm start, or simply execute ignite vm run.

func (cf *CreateFlags) NewCreateOptions(args []string, fs *flag.FlagSet) (*CreateOptions, error) {
	// Create a new base VM and configure it by combining the component config,
	// VM config file and flags.
	baseVM := providers.Client.VMs().New() // initialize a VM object

	// If component config is in use, set the VMDefaults on the base VM.
	if providers.ComponentConfig != nil {
		baseVM.Spec = providers.ComponentConfig.Spec.VMDefaults
	}

	// Resolve registry configuration used for pulling image if required.
	cmdutil.ResolveRegistryConfigDir()

	// Initialize the VM's Prefixer
	baseVM.Status.IDPrefix = providers.IDPrefix // set the VM object's basic information
	// Set the runtime and network-plugin on the VM, then override the global config.
	baseVM.Status.Runtime.Name = providers.RuntimeName // set the runtime and the CNI plugin
	baseVM.Status.Network.Plugin = providers.NetworkPluginName
	// Populate the runtime and network-plugin providers.
	if err := config.SetAndPopulateProviders(providers.RuntimeName, providers.NetworkPluginName); err != nil {
		return nil, err
	}

	// Set the passed image argument on the new VM spec.
	// Image is necessary while serializing the VM spec.
	if len(args) == 1 {
		ociRef, err := meta.NewOCIImageRef(args[0])
		if err != nil {
			return nil, err
		}
		baseVM.Spec.Image.OCI = ociRef
	}

	// Generate a VM name and UID if not set yet.
	if err := metadata.SetNameAndUID(baseVM, providers.Client); err != nil { // set the VM's UID and name
		return nil, err
	}

	// Apply the VM config on the base VM, if a VM config is given.
	if len(cf.ConfigFile) != 0 { // if a config file was given for the VM object, merge it into the VM object
		if err := applyVMConfigFile(baseVM, cf.ConfigFile); err != nil {
			return nil, err
		}
	}

	// Apply flag overrides.
	if err := applyVMFlagOverrides(baseVM, cf, fs); err != nil { // override the VM object with command-line flags
		return nil, err
	}

	// If --require-name is true, VM name must be provided.
	if cf.RequireName && len(baseVM.Name) == 0 {
		return nil, fmt.Errorf("must set VM name, flag --require-name set")
	}

	// Assign the new VM to the configFlag.
	cf.VM = baseVM

	// Validate the VM object.
	if err := validation.ValidateVM(cf.VM).ToAggregate(); err != nil { // validate the VM object
		return nil, err
	}

	co := &CreateOptions{CreateFlags: cf}
	// The code below pulls the base image and the kernel image, equivalent to ignite image import and ignite kernel import.
	// Get the image, or import it if it doesn't exist.
	var err error
	co.image, err = operations.FindOrImportImage(providers.Client, cf.VM.Spec.Image.OCI)
	if err != nil {
		return nil, err
	}

	// Populate relevant data from the Image on the VM object.
	cf.VM.SetImage(co.image) // set the VM object's image information

	// Get the kernel, or import it if it doesn't exist.
	co.kernel, err = operations.FindOrImportKernel(providers.Client, cf.VM.Spec.Kernel.OCI)
	if err != nil {
		return nil, err
	}

	// Populate relevant data from the Kernel on the VM object.
	cf.VM.SetKernel(co.kernel) // set the VM object's kernel metadata
	return co, nil
}
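
NewCreateOptions builds the VM spec in layers: component-config defaults first, then the VM config file, then command-line flags. The sketch below illustrates that precedence with a flattened, hypothetical vmSpec. It treats a zero value as "not set", which is a simplification of mine; it does not reproduce how applyVMConfigFile or applyVMFlagOverrides decide which fields to apply.

```go
package main

import "fmt"

// vmSpec is a hypothetical, flattened subset of the VM spec.
type vmSpec struct {
	CPUs   uint64
	Memory string
	Image  string
}

// applyLayer copies every non-zero field of src onto dst, so later layers
// override earlier ones while unset fields fall through.
func applyLayer(dst *vmSpec, src vmSpec) {
	if src.CPUs != 0 {
		dst.CPUs = src.CPUs
	}
	if src.Memory != "" {
		dst.Memory = src.Memory
	}
	if src.Image != "" {
		dst.Image = src.Image
	}
}

// resolve applies the layers in increasing priority:
// defaults, then the config file, then command-line flags.
func resolve(defaults, file, flags vmSpec) vmSpec {
	out := defaults
	applyLayer(&out, file)
	applyLayer(&out, flags)
	return out
}

func main() {
	defaults := vmSpec{CPUs: 1, Memory: "512MB"}
	file := vmSpec{CPUs: 2, Image: "weaveworks/ignite-ubuntu:latest"}
	flags := vmSpec{Memory: "1GB"}
	fmt.Printf("%+v\n", resolve(defaults, file, flags))
}
```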

The VM object's metadata:

$ cat metadata.json 
{
  "kind": "VM",
  "apiVersion": "ignite.weave.works/v1alpha4",
  "metadata": {
    "name": "restless-waterfall",
    "uid": "ddf49307b5b27c34",
    "created": "2023-07-18T08:33:25Z"
  },
  "spec": {
    "image": {
      "oci": "weaveworks/ignite-ubuntu:latest"
    },
    "sandbox": {
      "oci": "weaveworks/ignite:v0.10.0"
    },
    "kernel": {
      "oci": "weaveworks/ignite-kernel:5.10.51",
      "cmdLine": "console=ttyS0 reboot=k panic=1 pci=off ip=dhcp"
    },
    "cpus": 1,
    "memory": "512MB",
    "diskSize": "4GB",
    "network": {
    },
    "storage": {
    },
    "ssh": true
  },
  "status": {
    "running": true,
    "runtime": {
      "id": "ignite-ddf49307b5b27c34",
      "name": "containerd"
    },
    "startTime": "2023-07-18T08:33:25Z",
    "network": {
      "plugin": "cni",
      "ipAddresses": [
        "10.61.0.3"
      ]
    },
    "image": {
      "id": "oci://docker.io/weaveworks/ignite-ubuntu@sha256:52414720f26c808bc1273845c6d0f0a99472dfa8eaf8df52429261cbac27f1ba",
      "size": "249308KB"
    },
    "kernel": {
      "id": "oci://docker.io/weaveworks/ignite-kernel@sha256:a992aa9f7b6f5e7945e72610017c3f4f38338ff1452964e30410bb6110a794a7",
      "size": "72588KB"
    },
    "idPrefix": "ignite"
  }
}

The VM object is defined as follows, matching the metadata.json above:

type VM struct {
	runtime.TypeMeta `json:",inline"`
	// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
	// Name is available at the .metadata.name JSON path
	// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
	runtime.ObjectMeta `json:"metadata"`

	Spec   VMSpec   `json:"spec"`
	Status VMStatus `json:"status"`
}

// VMSpec describes the configuration of a VM
type VMSpec struct {
	Image    VMImageSpec   `json:"image"`   // base filesystem image
	Sandbox  VMSandboxSpec `json:"sandbox"` // sandbox (runtime) image
	Kernel   VMKernelSpec  `json:"kernel"`  // kernel image
	CPUs     uint64        `json:"cpus"`
	Memory   meta.Size     `json:"memory"`
	DiskSize meta.Size     `json:"diskSize"`
	// TODO: Implement working omitempty without pointers for the following entries
	// Currently both will show in the JSON output as empty arrays. Making them
	// pointers requires plenty of nil checks (as their contents are accessed directly)
	// and is very risky for stability. APIMachinery potentially has a solution.
	Network VMNetworkSpec `json:"network,omitempty"`
	Storage VMStorageSpec `json:"storage,omitempty"`
	// This will be done at either "ignite start" or "ignite create" time
	// TODO: We might revisit this later
	CopyFiles []FileMapping `json:"copyFiles,omitempty"`
	// SSH specifies how the SSH setup should be done
	// nil here means "don't do anything special"
	// If SSH.Generate is set, Ignite will generate a new SSH key and copy it in to authorized_keys in the VM
	// Specifying a path in SSH.Generate means "use this public key"
	// If SSH.PublicKey is set, this struct will marshal as a string using that path
	// If SSH.Generate is set, this struct will marshal as a bool => true
	SSH *SSH `json:"ssh,omitempty"`
}

Creating the VM filesystem

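At this point we have created the base filesystem file, extracted the necessary kernel files and created a VM object, but a VM also needs one complete filesystem. "Building the VM filesystem" above only produced the base filesystem file and the kernel files separately; they still have to be merged into a complete filesystem.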

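Below is the entry point of the ignite vm create command. It first sets the VM's UID and name (already done in "Configuring the VM object"; here it mainly makes sure both exist) and its labels, then saves the VM to Storage and creates the VM filesystem: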

func Create(co *CreateOptions) (err error) {
	// Generate a random UID and Name
	if err = metadata.SetNameAndUID(co.VM, providers.Client); err != nil {
		return
	}
	// Set VM labels.
	if err = metadata.SetLabels(co.VM, co.Labels); err != nil {
		return
	}
	defer util.DeferErr(&err, func() error { return metadata.Cleanup(co.VM, false) })

	if err = providers.Client.VMs().Set(co.VM); err != nil { // save the VM object to Storage
		return
	}

	// Allocate and populate the overlay file
	if err = dmlegacy.AllocateAndPopulateOverlay(co.VM); err != nil { // create the VM filesystem
		return
	}

	err = metadata.Success(co.VM)

	return
}

AllocateAndPopulateOverlay is the entry point for building the filesystem, and it ends up producing a devicemapper device:

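First use the image reference (name:tag) in the VM to find the image UID.
Use that UID to locate the base image filesystem /var/lib/firecracker/image/<imageUID>/image.ext4 and check its size, which the overlay is sized against. This file later serves as the origin device of the devicemapper snapshot (devicemapper snapshots are introduced later in this article).
Create the directory /var/lib/firecracker/vm/<vmUID> and the file /var/lib/firecracker/vm/<vmUID>/overlay.dm. The size of overlay.dm comes from the command line or the VM config file and cannot be smaller than image.ext4. This file later serves as the COW device of the devicemapper snapshot.
Create a snapshot-type devicemapper device from image.ext4 and overlay.dm; at this point the snapshot storage contains the base filesystem.
Unpack the kernel files into the snapshot storage to merge them in, and adjust other settings of the VM filesystem such as the hostname and DNS. The VM filesystem is now complete.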

func AllocateAndPopulateOverlay(vm *api.VM) error {
	requestedSize := vm.Spec.DiskSize.Bytes()
	// Truncate only accepts an int64
	if requestedSize > math.MaxInt64 {
		return fmt.Errorf("requested size %d too large, cannot truncate", requestedSize)
	}
	size := int64(requestedSize)

	// Get the base image's UID, used to find image.ext4 under /var/lib/firecracker/image
	imageUID, err := lookup.ImageUIDForVM(vm, providers.Client)
	if err != nil {
		return err
	}

	// Get the size of the image ext4 file
	fi, err := os.Stat(path.Join(constants.IMAGE_DIR, imageUID.String(), constants.IMAGE_FS)) // locate image.ext4
	if err != nil {
		return err
	}
	imageSize := fi.Size()

	// The overlay needs to be at least as large as the image
	if size < imageSize { // raise the overlay.dm size to the image size if needed
		log.Warnf("warning: requested overlay size (%s) < image size (%s), using image size for overlay\n",
			vm.Spec.DiskSize.String(), meta.NewSizeFromBytes(uint64(imageSize)).String())
		size = imageSize
	}

	// Make sure the all directories above the snapshot directory exists
	if err := os.MkdirAll(path.Dir(vm.OverlayFile()), constants.DATA_DIR_PERM); err != nil {
		return err
	}

	overlayFile, err := os.Create(vm.OverlayFile()) // create the VM's overlay file
	if err != nil {
		return fmt.Errorf("failed to create overlay file for %q, %v", vm.GetUID(), err)
	}
	defer overlayFile.Close()

	if err := overlayFile.Truncate(size); err != nil { // set the overlay file size
		return fmt.Errorf("failed to allocate overlay file for VM %q: %v", vm.GetUID(), err)
	}

	// populate the filesystem
	return copyToOverlay(vm) // create the snapshot-type devicemapper device and populate it
}

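copyToOverlay is implemented as follows. ActivateSnapshot creates the devicemapper snapshot storage the VM runs on; everything else adjusts the VM filesystem, for example importing the kernel files and setting up SSH.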

func copyToOverlay(vm *api.VM) (err error) {
	_, err = ActivateSnapshot(vm) // create the devicemapper snapshot storage, used as the VM's boot device
	if err != nil {
		return
	}
	defer util.DeferErr(&err, func() error { return DeactivateSnapshot(vm) })

	mp, err := util.Mount(vm.SnapshotDev()) // mount the snapshot storage
	if err != nil {
		return
	}
	defer util.DeferErr(&err, mp.Umount)

	// Copy the kernel files to the VM. TODO: Use snapshot overlaying instead.
	// Extract /var/lib/firecracker/kernel/<UID>/kernel.tar into the mount path, merging it with the base filesystem
	if err = copyKernelToOverlay(vm, mp.Path); err != nil { 
		return
	}

	// do not mutate vm.Spec.CopyFiles
	fileMappings := vm.Spec.CopyFiles

	if vm.Spec.SSH != nil { // if SSH is requested, install (and optionally generate) a key pair for the VM
		pubKeyPath := vm.Spec.SSH.PublicKey
		if vm.Spec.SSH.Generate {
			// generate a key if PublicKey is empty
			pubKeyPath, err = newSSHKeypair(vm)
			if err != nil {
				return
			}
		}

		if len(pubKeyPath) > 0 {
			fileMappings = append(fileMappings, api.FileMapping{
				HostPath: pubKeyPath,
				VMPath:   vmAuthorizedKeys,
			})
		}
	}

	// TODO: File/directory permissions?
	for _, mapping := range fileMappings { // host-to-VM file mappings are handled by copying
		vmFilePath := path.Join(mp.Path, mapping.VMPath)
		if err = os.MkdirAll(path.Dir(vmFilePath), constants.DATA_DIR_PERM); err != nil {
			return
		}

		if err = util.CopyFile(mapping.HostPath, vmFilePath); err != nil {
			return
		}
	}

	ip := net.IP{127, 0, 0, 1}
	if len(vm.Status.Network.IPAddresses) > 0 {
		ip = vm.Status.Network.IPAddresses[0]
	}

	// Write /etc/hosts for the VM (maps the VM's own hostname to its address)
	if err = writeEtcHosts(mp.Path, vm.GetUID().String(), ip); err != nil {
		return
	}

	// Write the UID to /etc/hostname for the VM (sets the VM's hostname)
	if err = writeEtcHostname(mp.Path, vm.GetUID().String()); err != nil {
		return
	}

	// Populate /etc/fstab with the VM's volume mounts (the volumes defined in vm.Spec.Storage)
	if err = populateFstab(vm, mp.Path); err != nil {
		return
	}

	// Set overlay root permissions
	err = os.Chmod(mp.Path, constants.DATA_DIR_PERM)

	return
}

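ActivateSnapshot creates the devicemapper block storage used by the VM. The main steps are:

Use losetup to attach the image file image.ext4 to a /dev/loop device, which exposes the file as a block device.

Use losetup to attach the overlay file overlay.dm to a /dev/loop device. Running losetup lists the attached devices: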

$ losetup 
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                                              DIO LOG-SEC
/dev/loop1         0      0         1  1 /var/lib/firecracker/image/669a5721d130ef1d/image.ext4   0     512
/dev/loop2         0      0         1  0 /var/lib/firecracker/vm/ddf49307b5b27c34/overlay.dm      0     512

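If the overlay loop device is larger than the image loop device, the image loop device has to be extended (an upstream requirement). This is done by creating a linear devicemapper device that maps the image loop device, and padding that dm device with a zero target. The extension looks like this:

A linear dm target joins several storage devices together, or splits one device into several dm devices.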
```
$ dmsetup create test-snapshot <<EOF
0 8388608 linear /dev/loop0 0
8388608 12582912 zero
EOF
```

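Use dmsetup to create a snapshot-type devicemapper device as shown below, with the image loop device as the origin device and the overlay loop device as the COW device. The origin is written as /dev/{loop0,mapper/ignite-<uid>-base} to mean either the image loop device or, when the image had to be extended, the -base device. It is a placeholder, not something to type literally: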

$ dmsetup create ignite-ddf49307b5b27c34 --table '0 8388608 snapshot /dev/{loop0,mapper/ignite-<uid>-base} /dev/loop1 P 8'

The following command shows the dm devices that were created:

$ dmsetup status
ignite-ddf49307b5b27c34: 0 8388608 snapshot 274328/8388608 1080 # the snapshot device that was created
ignite-ddf49307b5b27c34-base: 0 577536 linear      # devicemapper device created to extend the image loop device; maps the image loop device
ignite-ddf49307b5b27c34-base: 577536 8388608 zero  # zero target that pads ignite-ddf49307b5b27c34-base
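
The table strings passed to dmsetup follow a fixed layout, so they are easy to generate. The sketch below only builds the strings and does not call dmsetup. baseTable and snapshotTable are hypothetical helper names, and the zero target's length mirrors the status output above (where it equals the overlay size in sectors) rather than any documented requirement.

```go
package main

import "fmt"

// baseTable builds the two-line table for the "-base" device: a linear
// mapping of the image loop device followed by a zero target as padding.
func baseTable(imageLoop string, imageSectors, overlaySectors uint64) string {
	return fmt.Sprintf("0 %d linear %s 0\n%d %d zero",
		imageSectors, imageLoop, imageSectors, overlaySectors)
}

// snapshotTable builds the table for a persistent ("P") snapshot target with
// the given chunk size in sectors.
func snapshotTable(origin, cow string, sectors, chunkSectors uint64) string {
	return fmt.Sprintf("0 %d snapshot %s %s P %d", sectors, origin, cow, chunkSectors)
}

func main() {
	// Sizes are in 512-byte sectors: 577536 sectors is 282 MiB, 8388608 is 4 GiB.
	fmt.Println(baseTable("/dev/loop1", 577536, 8388608))
	fmt.Println(snapshotTable("/dev/mapper/ignite-ddf49307b5b27c34-base", "/dev/loop2", 8388608, 8))
}
```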

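The upstream description of the snapshot target is quoted below. Writes to the snapshot only go to the COW device, while reads come from the COW device or, for unchanged data, from the origin device. The text also notes that the COW device is often smaller than the origin. That is typical rather than required; ignite makes overlay.dm at least as large as the image.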

*) snapshot <origin> <COW device> <persistent?> <chunksize>

A snapshot of the <origin> block device is created. Changed chunks of
<chunksize> sectors will be stored on the <COW device>.  Writes will
only go to the <COW device>.  Reads will come from the <COW device> or
from <origin> for unchanged data.  <COW device> will often be
smaller than the origin and if it fills up the snapshot will become
useless and be disabled, returning errors.  So it is important to monitor
the amount of free space and expand the <COW device> before it fills up.

<persistent?> is P (Persistent) or N (Not persistent - will not survive
after reboot).  O (Overflow) can be added as a persistent store option
to allow userspace to advertise its support for seeing "Overflow" in the
snapshot status.  So supported store types are "P", "PO" and "N".

A snapshot-type dm device needs two parts: an origin device, which is read-only, and a COW device, which is readable and writable. The following shows how this target is used:

$ mkdir -p /tmp/mnt

# Copy a VM image file and attach it to /dev/loop5
$ cp /var/lib/firecracker/image/669a5721d130ef1d/image.ext4 /home
$ cd /home
$ losetup /dev/loop5 image.ext4

# Create an overlay file the same size as the VM image file and attach it to /dev/loop6
$ dd if=/dev/zero of=overlay.dm  bs=512 count=577536
$ losetup /dev/loop6 overlay.dm

# Get the size of the device in 512-byte sectors
$ blockdev --getsz /dev/loop5
577536

# Create the snapshot device and mount it on /tmp/mnt
$ dmsetup create test-snapshot --table '0 577536 snapshot /dev/loop5 /dev/loop6 P 8'
$ mount /dev/mapper/test-snapshot /tmp/mnt

# Detach the loop devices. They are still in use, so the detach is deferred until the snapshot device is removed
$ losetup -d /dev/loop5
$ losetup -d /dev/loop6

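Listing the mounted directory shows a complete Linux filesystem. Changes made to this filesystem do not affect the VM image. To check, umount the snapshot device after making a change and mount the image's /dev/loop5 on its own: it is unchanged. Re-creating the snapshot brings the changes back: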

$ ll /tmp/mnt/
total 76
lrwxrwxrwx.  1 root root     7 Oct  7  2021 bin -> usr/bin
drwxr-xr-x.  2 root root  4096 Apr 15  2020 boot
drwxr-xr-x.  2 root root  4096 Oct  7  2021 dev
drwxr-xr-x. 52 root root  4096 Jul 14 10:52 etc
drwxr-xr-x.  2 root root  4096 Apr 15  2020 home
lrwxrwxrwx.  1 root root     7 Oct  7  2021 lib -> usr/lib
lrwxrwxrwx.  1 root root     9 Oct  7  2021 lib32 -> usr/lib32
lrwxrwxrwx.  1 root root     9 Oct  7  2021 lib64 -> usr/lib64
lrwxrwxrwx.  1 root root    10 Oct  7  2021 libx32 -> usr/libx32
drwx------.  2 root root 16384 Jul 14 10:52 lost+found
drwxr-xr-x.  2 root root  4096 Oct  7  2021 media
drwxr-xr-x.  2 root root  4096 Oct  7  2021 mnt
drwxr-xr-x.  2 root root  4096 Oct  7  2021 opt
drwxr-xr-x.  2 root root  4096 Apr 15  2020 proc
drwx------.  2 root root  4096 Oct  7  2021 root
drwxr-xr-x.  8 root root  4096 Nov  9  2021 run
lrwxrwxrwx.  1 root root     8 Oct  7  2021 sbin -> usr/sbin
drwxr-xr-x.  2 root root  4096 Oct  7  2021 srv
drwxr-xr-x.  2 root root  4096 Apr 15  2020 sys
drwxrwxrwt.  2 root root  4096 Oct  7  2021 tmp
drwxr-xr-x. 13 root root  4096 Oct  7  2021 usr
drwxr-xr-x. 11 root root  4096 Oct  7  2021 var
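
The read and write behaviour of a snapshot can be modelled in a few lines. This is a toy in-memory model for illustration only, assuming chunk-sized writes. The snapshot type and its methods are hypothetical and say nothing about how the kernel stores exceptions on the COW device.

```go
package main

import "fmt"

// snapshot models a device-mapper snapshot at chunk granularity.
type snapshot struct {
	origin []string       // chunks of the read-only origin device
	cow    map[int]string // changed chunks, keyed by chunk index
}

// newSnapshot combines an origin with a (possibly pre-populated) COW store,
// which is what makes a persistent snapshot survive being re-created.
func newSnapshot(origin []string, cow map[int]string) *snapshot {
	return &snapshot{origin: origin, cow: cow}
}

// write stores the changed chunk in the COW store only.
func (s *snapshot) write(chunk int, data string) { s.cow[chunk] = data }

// read prefers the COW store and falls back to the origin for unchanged chunks.
func (s *snapshot) read(chunk int) string {
	if data, ok := s.cow[chunk]; ok {
		return data
	}
	return s.origin[chunk]
}

func main() {
	origin := []string{"boot", "etc", "usr"}
	cow := map[int]string{}

	snap := newSnapshot(origin, cow)
	snap.write(1, "etc-modified")
	fmt.Println(snap.read(1), snap.read(2)) // etc-modified usr
	fmt.Println(origin[1])                  // etc

	// Re-creating the snapshot over the same COW store restores the change.
	again := newSnapshot(origin, cow)
	fmt.Println(again.read(1)) // etc-modified
}
```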

Clean up

$ umount  /tmp/mnt
$ dmsetup remove  test-snapshot

Use e2fsck to repair any filesystem errors that may exist:
```
$ e2fsck -p -f /dev/mapper/<snapshot>
```
Use losetup to detach the image and overlay loop devices, so that the underlying loop devices are removed automatically once the snapshot is removed:
```
$ losetup -d /dev/loop0
$ losetup -d /dev/loop1
```
This completes the storage for the VM filesystem. All that remains is to mount it so the VM can use it.

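Configuring SSH

SSH has to be configured while the VM filesystem is being built (in copyToOverlay). The code is shown below. It mainly copies the public key (the file ending in .pub) to /root/.ssh/authorized_keys in the VM. If vm.Spec.SSH.Generate is true, a new key pair is generated with ssh-keygen: the private key is written to /var/lib/firecracker/vm/<UID>/id_<UID> and the public key to the same path with a .pub suffix. The public key is still copied to /root/.ssh/authorized_keys in the VM, and the private key is used by the SSH client to connect.

After this, the VM can be reached over SSH: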

	if vm.Spec.SSH != nil {
		pubKeyPath := vm.Spec.SSH.PublicKey
		if vm.Spec.SSH.Generate {
			// generate a key if PublicKey is empty
			pubKeyPath, err = newSSHKeypair(vm)
			if err != nil {
				return
			}
		}

		if len(pubKeyPath) > 0 {
			fileMappings = append(fileMappings, api.FileMapping{
				HostPath: pubKeyPath,
				VMPath:   vmAuthorizedKeys,
			})
		}
	}

	for _, mapping := range fileMappings {
		vmFilePath := path.Join(mp.Path, mapping.VMPath)
		if err = os.MkdirAll(path.Dir(vmFilePath), constants.DATA_DIR_PERM); err != nil {
			return
		}

		if err = util.CopyFile(mapping.HostPath, vmFilePath); err != nil {
			return
		}
	}

The SSH key pair is generated as follows:

// Generate a new SSH keypair for the vm
func newSSHKeypair(vm *api.VM) (string, error) {
	privKeyPath := path.Join(vm.ObjectPath(), fmt.Sprintf(constants.VM_SSH_KEY_TEMPLATE, vm.GetUID()))
	// TODO: In future versions, let the user specify what key algorithm to use through the API types
	sshKeyAlgorithm := "ed25519"
	if util.FIPSEnabled() {
		// Use rsa on FIPS machines
		sshKeyAlgorithm = "rsa"
	}
	_, err := util.ExecuteCommand("ssh-keygen", "-q", "-t", sshKeyAlgorithm, "-N", "", "-f", privKeyPath)
	if err != nil {
		return "", err
	}

	return fmt.Sprintf("%s.pub", privKeyPath), nil
}

Start vm

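An ignite VM is simply a VM created by the firecracker command inside a container, so creating a VM starts with launching a container. The container has its own filesystem; in the figure below, it is provided by the ignite image. The other filesystem is the one the VM needs, which is the devicemapper device created above. firecracker later attaches it as the VM's root fs.

The command for starting a VM is ignite vm start. It starts a VM object previously created by ignite vm create.

The first step looks up the VM object in Storage by name and then starts the VM. The argument so carries the VM object and its parameters. Once the VM has started, SSH connectivity and operations such as attaching to the VM are handled: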

func Start(so *StartOptions, fs *flag.FlagSet) error {
	// Check if the given VM is already running
	if so.vm.Running() {
		return fmt.Errorf("VM %q is already running", so.vm.GetUID())
	}
  
	// The following mainly sets the names and clients of the runtime and the network plugin
	// Stopped VMs don't contain the runtime and network information. Set the
	// default runtime and network from the providers if empty.
	if so.vm.Status.Runtime.Name == "" {
		so.vm.Status.Runtime.Name = providers.RuntimeName
	}
	if so.vm.Status.Network.Plugin == "" {
		so.vm.Status.Network.Plugin = providers.NetworkPluginName
	}

	// In case the runtime and network-plugin are specified explicitly at
	// start, set the runtime and network-plugin on the VM. This overrides the
	// global config and config on the VM object, if any.
	if fs.Changed("runtime") {
		so.vm.Status.Runtime.Name = providers.RuntimeName
	}
	if fs.Changed("network-plugin") {
		so.vm.Status.Network.Plugin = providers.NetworkPluginName
	}

	// Set the runtime and network-plugin providers from the VM status.
	if err := config.SetAndPopulateProviders(so.vm.Status.Runtime.Name, so.vm.Status.Network.Plugin); err != nil {
		return err
	}

	// Preflight checks, mainly that required files exist: dependent executables, CNI files, and /dev/kvm, /dev/net/tun, /dev/mapper/control, etc.
	ignoredPreflightErrors := sets.NewString(util.ToLower(so.StartFlags.IgnoredPreflightErrors)...)
	if err := checkers.StartCmdChecks(so.vm, ignoredPreflightErrors); err != nil {
		return err
	}

	// Start the VM
	if err := operations.StartVM(so.vm, so.Debug); err != nil {
		return err
	}

	// Wait for the SSH service to become ready
	// When --ssh is enabled, wait until SSH service started on port 22 at most N seconds
	if ssh := so.vm.Spec.SSH; ssh != nil && ssh.Generate && len(so.vm.Status.Network.IPAddresses) > 0 {
		if err := waitForSSH(so.vm, constants.SSH_DEFAULT_TIMEOUT_SECONDS, constants.IGNITE_SPAWN_TIMEOUT); err != nil {
			return err
		}
	}

	// If starting interactively, attach after starting
	if so.Interactive {
		return Attach(so.AttachOptions)
	}
	return nil
}

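StartVM is the entry point for starting the VM. It blocks on vmChans.SpawnFinished, which reports whether the VM object was successfully saved to Storage. The wait times out after 2 minutes, in which case a start-failure error is returned.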

func StartVM(vm *api.VM, debug bool) error {

	vmChans, err := StartVMNonBlocking(vm, debug)
	if err != nil {
		return err
	}

	if err := <-vmChans.SpawnFinished; err != nil {
		return err
	}

	return nil
}

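Starting a VM requires some preparation: the filesystem, the network, directory mounts and so on. Below is the method that starts the VM. Understanding how a VM is started covers most of how ignite works.

First check whether the VM's container already exists and remove it if so. Note that RemoveContainer deletes the container through the containerd client. A running container cannot be deleted, in which case it returns immediately and the rest of the flow is aborted (is a kill missing?)
Call ActivateSnapshot to set up the snapshot device the VM needs
Get the VM directory (/var/lib/firecracker/vm/<UID>) and the kernel directory (/var/lib/firecracker/kernel/<UID>), used to mount the VM's metadata.json file and the kernel's vmlinux file respectively. firecracker later uses these two files to start the VM
Add the environment variables, the mounted devices (such as /dev/mapper/control and /dev/net/tun) and custom directories, which include the VM's filesystem. Use ctr --namespace firecracker containers info <UID> to inspect what a VM has mounted.
Call providers.Runtime.RunContainer to start the VM's container
Configure the container's CNI network
Set the runtime field on the VM object, which is later used to determine which runtime the VM uses
Save the VM metadata to Storage.
Wait on vmChans.SpawnFinished for the VM to be created successfully.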

func StartVMNonBlocking(vm *api.VM, debug bool) (*VMChannels, error) {
	// Inspect the VM container and remove it if it exists
	inspectResult, _ := providers.Runtime.InspectContainer(vm.PrefixedID())
	RemoveVMContainer(inspectResult)

	// Make sure we always initialize all channels
	vmChans := &VMChannels{
		SpawnFinished: make(chan error),
	}

	// Setup the snapshot overlay filesystem
	snapshotDevPath, err := dmlegacy.ActivateSnapshot(vm)
	if err != nil {
		return vmChans, err
	}

	kernelUID, err := lookup.KernelUIDForVM(vm, providers.Client)
	if err != nil {
		return vmChans, err
	}

	// Resolve the VM and kernel paths, used to mount the VM metadata and the kernel vmlinux file
	vmDir := filepath.Join(constants.VM_DIR, vm.GetUID().String())
	kernelDir := filepath.Join(constants.KERNEL_DIR, kernelUID.String())

	// Verify that the image containing ignite-spawn is pulled
	// TODO: Integrate automatic pulling into pkg/runtime
	// Check that the sandbox image (the one containing ignite-spawn) has been pulled, and pull it if it is missing
	if err := verifyPulled(vm.Spec.Sandbox.OCI); err != nil {
		return vmChans, err
	}

	// Set up the mounts: mainly the /var/lib/firecracker/vm/<UID>/ directory and its metadata.json, the vmlinux file from the kernel directory, and some devices under /dev
	config := &runtime.ContainerConfig{
		Cmd: []string{
			fmt.Sprintf("--log-level=%s", logs.Logger.Level.String()),
			vm.GetUID().String(),
		},
		Labels: map[string]string{"ignite.name": vm.GetName()},
		Binds: []*runtime.Bind{
			{
				HostPath:      vmDir,
				ContainerPath: vmDir,
			},
			{
				// Mount the metadata.json file specifically into the container, to a well-known place for ignite-spawn to access
				HostPath:      path.Join(vmDir, constants.METADATA),
				ContainerPath: constants.IGNITE_SPAWN_VM_FILE_PATH,
			},
			{
				// Mount the vmlinux file specifically into the container, to a well-known place for ignite-spawn to access
				HostPath:      path.Join(kernelDir, constants.KERNEL_FILE),
				ContainerPath: constants.IGNITE_SPAWN_VMLINUX_FILE_PATH,
			},
		},
		CapAdds: []string{
			"SYS_ADMIN", // Needed to run "dmsetup remove" inside the container
			"NET_ADMIN", // Needed for removing the IP from the container's interface
		},
		Devices: []*runtime.Bind{
			runtime.BindBoth("/dev/mapper/control"), // This enables containerized Ignite to remove its own dm snapshot
			runtime.BindBoth("/dev/net/tun"),        // Needed for creating TAP adapters
			runtime.BindBoth("/dev/kvm"),            // Pass through virtualization support
			runtime.BindBoth(snapshotDevPath),       // The block device to boot from
		},
		StopTimeout:  constants.STOP_TIMEOUT + constants.IGNITE_TIMEOUT,
		PortBindings: vm.Spec.Network.Ports, // Add the port mappings to Docker
	}

	// Set the environment variables
	var envVars []string
	for k, v := range vm.GetObjectMeta().Annotations {
		if strings.HasPrefix(k, constants.IGNITE_SANDBOX_ENV_VAR) {
			k := strings.TrimPrefix(k, constants.IGNITE_SANDBOX_ENV_VAR)
			envVars = append(envVars, fmt.Sprintf("%s=%s", k, v))
		}
	}
	config.EnvVars = envVars

	// Add the user-defined mounts
	for _, volume := range vm.Spec.Storage.Volumes {
		if volume.BlockDevice == nil {
			continue // Skip all non block device volumes for now
		}

		config.Devices = append(config.Devices, &runtime.Bind{
			HostPath:      volume.BlockDevice.Path,
			ContainerPath: path.Join(constants.IGNITE_SPAWN_VOLUME_DIR, volume.Name),
		})
	}

	// Prepare the networking for the container, for the given network plugin
	if err := providers.NetworkPlugin.PrepareContainerSpec(config); err != nil {
		return vmChans, err
	}

	// If we're not debugging, remove the container post-run
	if !debug {
		config.AutoRemove = true
	}

	// Run the VM container in Docker
	containerID, err := providers.Runtime.RunContainer(vm.Spec.Sandbox.OCI, config, vm.PrefixedID(), vm.GetUID().String())
	if err != nil {
		return vmChans, fmt.Errorf("failed to start container for VM %q: %v", vm.GetUID(), err)
	}

	// Set up the CNI network
	result, err := providers.NetworkPlugin.SetupContainerNetwork(containerID, vm.Spec.Network.Ports...)
	if err != nil {
		return vmChans, err
	}

	if !logs.Quiet {
		log.Infof("Networking is handled by %q", providers.NetworkPlugin.Name())
		log.Infof("Started Firecracker VM %q in a container with ID %q", vm.GetUID(), containerID)
	}

	// Set the container ID for the VM
	vm.Status.Runtime.ID = containerID
	vm.Status.Runtime.Name = providers.RuntimeName

	// Append non-loopback runtime IP addresses of the VM to its state
	for _, addr := range result.Addresses {
		if !addr.IP.IsLoopback() {
			vm.Status.Network.IPAddresses = append(vm.Status.Network.IPAddresses, addr.IP)
		}
	}
	vm.Status.Network.Plugin = providers.NetworkPluginName

	// write the API object in a non-running state before we wait for spawn's network logic and firecracker
	if err := providers.Client.VMs().Set(vm); err != nil {
		return vmChans, err
	}

	// TODO: This is temporary until we have proper communication to the container
	// It's best to perform any imperative changes to the VM object pointer before this go-routine starts
	go waitForSpawn(vm, vmChans)

	return vmChans, nil
}

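The RunContainer method below involves a snapshotService. This snapshot is not the devicemapper snapshot: the snapshotService unpacks and mounts the container image, giving the container the filesystem it needs to start.

Below is containerd's root directory. containerd itself is plugin-based, and each subdirectory here is created by a different plugin. Use ctr plugin list to see the supported plugins: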
$ cd /var/lib/containerd/
$ ll
drwxr-xr-x. 4 root root 33 Mar 15 14:14 io.containerd.content.v1.content
drwx--x--x. 2 root root 21 Mar 15 11:05 io.containerd.metadata.v1.bolt
drwx--x--x. 2 root root  6 Mar 15 11:05 io.containerd.runtime.v1.linux
drwx--x--x. 4 root root 37 Jul 14 10:53 io.containerd.runtime.v2.task
drwx------. 3 root root 23 Mar 15 11:05 io.containerd.snapshotter.v1.native
drwx------. 3 root root 42 Jul 14 10:52 io.containerd.snapshotter.v1.overlayfs
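io.containerd.content.v1.content: stores OCI images; see the OCI image spec for more.

io.containerd.metadata.v1.bolt: stores metadata for the images, containers and snapshots managed by containerd; see the source code for what is stored.

io.containerd.snapshotter.v1.<type>: snapshotter directories; see the Snapshotters documentation.

io.containerd.snapshotter.v1.btrfs: directory for container snapshots created on a btrfs filesystem.

io.containerd.snapshotter.v1.overlayfs: the default snapshotter, which creates snapshots with overlayfs.

That covers containerd's root directory. Image content is placed in io.containerd.content.v1.content, then unpacked by the snapshotter and mounted under io.containerd.snapshotter.v1.overlayfs (overlayfs is used here) for the container to use.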

Below is an ignite VM started on this machine. The snapshotter provides the lowerdir, upperdir and workdir that overlayfs needs.
$ mount|grep ignit
overlay on /run/containerd/io.containerd.runtime.v2.task/firecracker/ignite-272a0eab-75be-4131-a022-1fde8012f9f6/rootfs type overlay (rw,relatime,seclabel,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/10/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/work)
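The containerd snapshots seen here come in two kinds, Active and Committed, which correspond to the running container's container layer (upperdir, workdir) and the image layers (lowerdir). Changes made to an Active snapshot are not preserved unless it is turned into a Committed snapshot with snapshot commit. Use ctr --namespace firecracker snapshot ls to see the current snapshots. Snapshots form a hierarchy, which ctr --namespace firecracker snapshot tree displays.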
$ ctr --namespace firecracker snapshot ls
KEY                                                                     PARENT                                                                  KIND      
ignite-272a0eab-75be-4131-a022-1fde8012f9f6                             sha256:6d1a1092846de7c30d76df9c7aa787b50ad4dee32d32daebe0c7a87ffede14b9 Active    
sha256:0c949a3342f6400d49b4d378bf7b20b768cf09bef107cc6c5d58a1f3e50e06f3 sha256:38facc6304c0b9270805fab2c549a3fef82dce370cab3f24d922e0a3b46c2541 Committed 
sha256:38facc6304c0b9270805fab2c549a3fef82dce370cab3f24d922e0a3b46c2541                                                                         Committed 
sha256:6d1a1092846de7c30d76df9c7aa787b50ad4dee32d32daebe0c7a87ffede14b9                                                                         Committed 
sha256:9f54eef412758095c8079ac465d494a2872e02e90bf1fb5f12a1641c0d1bb78b                                                                         Committed 
sha256:ac9030d17ea3c723f7ff631b7e9c16f0d914ecf43f37b3e0f7cb5cae8012b39d sha256:f0e76d36d3129de5a1ddb77efc4963b2dfec81f9c5ca21e117198a3c2ae9f397 Committed 
sha256:bc98849e95ef9484381c1a36ce97339d7cd8675f23a37766ed47b7fcc947bb91 sha256:9f54eef412758095c8079ac465d494a2872e02e90bf1fb5f12a1641c0d1bb78b Committed 
sha256:f0e76d36d3129de5a1ddb77efc4963b2dfec81f9c5ca21e117198a3c2ae9f397 sha256:bc98849e95ef9484381c1a36ce97339d7cd8675f23a37766ed47b7fcc947bb91 Committed 
sha256:f9e99b137a1976a6aaa287cb3cddea2f6e6545707ad1302c454fd4d06ffbb2ab sha256:ac9030d17ea3c723f7ff631b7e9c16f0d914ecf43f37b3e0f7cb5cae8012b39d Committed 
The following example shows how snapshots work.
# Create a containerd namespace named test
$ ctr ns create test

# Prepare the mount point
$ mkdir /var/lib/containerd/custom_dir  

# First commit (root)
$ ctr -n test snapshot prepare activeLayer0                 # prepare creates a layer in the active (working) state
# Generate and run the mount command for the snapshot filesystem (a bind mount for this first, parentless layer)
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0  | xargs sudo
$ echo "1" > /var/lib/containerd/custom_dir/add01           # add a file as the first change
$ umount /var/lib/containerd/custom_dir                     # umount
$ ctr  -n test snapshot commit commit_add01 activeLayer0    # commit: switch the snapshot to Committed and save the layer
The mount command generated by snapshot mount above is: "mount -t bind /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs /var/lib/containerd/custom_dir -o rw,rbind"

Listing the snapshots now shows one snapshot in the Committed state:
$ ctr -n test snapshot ls
KEY          PARENT KIND      
commit_add01        Committed 
Looking at the io.containerd.snapshotter.v1.overlayfs/snapshots directory shows a new folder, 21866, which holds the filesystem produced by the commit:
$ ll io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs/
-rw-r----- 1 root root    2 Jul 26 01:33 add01
Now commit a second change. First create an active snapshot whose parent is commit_add01:
# Second commit, with the first layer as parent
$ ctr -n test snapshot prepare activeLayer0 commit_add01
Listing the snapshots shows that the parent of the active snapshot is the snapshot committed above:
$ ctr -n test snapshot ls  
KEY          PARENT       KIND      
activeLayer0 commit_add01 Active    
commit_add01              Committed 
Commit the second change. In this step the mount command generated by ctr is: "mount -t overlay overlay /var/lib/containerd/custom_dir -o index=off,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/work,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/fs,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs". This is exactly the overlay filesystem a container runs on, and lowerdir is the filesystem produced by commit_add01.
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0 | xargs sudo
$ echo "2" > /var/lib/containerd/custom_dir/add02
$ umount /var/lib/containerd/custom_dir
$ ctr -n test snapshot commit commit_add02 activeLayer0
Listing the snapshots shows a new snapshot, commit_add02, whose parent is commit_add01. In other words, a commit produces a child snapshot.
$ ctr -n test snapshot ls
KEY          PARENT       KIND      
commit_add01              Committed 
commit_add02 commit_add01 Committed 
If another overlay is created with the new snapshot commit_add02 as parent, does it also include the changes from commit_add01?
$ ctr -n test snapshot prepare activeLayer0 commit_add02
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0 
Below is the mount command generated by snapshot mount. Its lowerdir contains the filesystems of both commit_add01 and commit_add02.
$ mount -t overlay overlay /var/lib/containerd/custom_dir -o index=off,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21869/work,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21869/fs,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs
Clean up. Remove child snapshots first, otherwise the error "cannot remove snapshot with child: failed precondition" is returned.
$ ctr -n test snapshot rm commit_add02
$ ctr -n test snapshot rm commit_add01
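ctr snapshot view can also be used to create a read-only view. Writing to the mounted directory then fails with a "Read-only file system" error.

To summarise: create an active snapshot with prepare (or a read-only one with view), mount it with the mount command, and after modifying the active snapshot, persist the changes with the commit command.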
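
The way lowerdir grows with each committed parent can be sketched by walking the parent chain. This is an illustration only: the snap type, the lowerdirs helper and the in-memory map are hypothetical, and the numeric IDs are taken from the mount commands shown above rather than read from containerd's metadata.

```go
package main

import (
	"fmt"
	"strings"
)

// snap records the on-disk ID of a committed snapshot and the key of its parent.
type snap struct {
	id     int
	parent string
}

// lowerdirs walks from the given parent key up to the root and joins the
// fs directories in that order, nearest layer first.
func lowerdirs(root string, snaps map[string]snap, parent string) string {
	var dirs []string
	for key := parent; key != ""; {
		s, ok := snaps[key]
		if !ok {
			break // unknown key: stop rather than invent a path
		}
		dirs = append(dirs, fmt.Sprintf("%s/snapshots/%d/fs", root, s.id))
		key = s.parent
	}
	return strings.Join(dirs, ":")
}

func main() {
	root := "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
	snaps := map[string]snap{
		"commit_add01": {id: 21866},
		"commit_add02": {id: 21867, parent: "commit_add01"},
	}
	// An active snapshot prepared on top of commit_add02 sees both layers.
	fmt.Println("lowerdir=" + lowerdirs(root, snaps, "commit_add02"))
}
```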

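See Snapshots for more.

containerd has two concepts: container and task. A container can be seen as the environment prepared for running, such as cgroups and mounted volumes, while a task is the process running inside it. As shown below, listing containers shows the image and runtime in use, whereas listing tasks shows processes and their status: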
$ ctr -n firecracker container ls 
CONTAINER                                      IMAGE                                  RUNTIME                  
ignite-4a64e75d-c7fb-43ba-aaed-6e7923374ba5    docker.io/weaveworks/ignite:v0.10.0    io.containerd.runc.v2 
$ ctr -n firecracker task ls 
TASK                                           PID      STATUS    
ignite-4a64e75d-c7fb-43ba-aaed-6e7923374ba5    10332    RUNNING

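With that background, the RunContainer flow is easy to follow:

First, remove the container if it is not running
Write the contents of the host's /etc/resolv.conf to the runtime.containerd.resolv.conf file in the VM directory and add it to the mount configuration, so that it is later mounted as the container's /etc/resolv.conf
Build the OCI spec options needed to create the container, adding the configured environment variables, hostname, mounted volumes and /dev devices
Obtain a containerd snapshot service, which provides the container's rootfs
Build the container creation options from the OCI spec options and the rootfs above
Create the containerd container and task, then start the task

All of this is the standard flow for starting a containerd container. Usage examples can also be found in the _test.go files of the containerd source: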

func (cc *ctdClient) RunContainer(image meta.OCIImageRef, config *runtime.ContainerConfig, name, id string) (s string, err error) {
	img, err := cc.client.GetImage(cc.ctx, image.Normalized())
	if err != nil {
		return
	}

	// Remove the container if it exists
	if err = cc.RemoveContainer(name); err != nil {
		return
	}

	// Load the default snapshotter
	snapshotter := cc.client.SnapshotService(containerd.DefaultSnapshotter)

	// Add the /etc/resolv.conf mount, this isn't done automatically by containerd
	// Ensure a resolv.conf exists in the vmDir. Calculate path using the vm id
	resolvConfPath := filepath.Join(constants.VM_DIR, id, resolvConfName)
	// Read the host's /etc/resolv.conf and write it to runtime.containerd.resolv.conf in the VM directory
	err = resolvconf.EnsureResolvConf(resolvConfPath, constants.DATA_DIR_FILE_PERM) 
	if err != nil {
		return
	}
	config.Binds = append(
		config.Binds,
		&runtime.Bind{
			HostPath:      resolvConfPath,
			ContainerPath: "/etc/resolv.conf", // mount runtime.containerd.resolv.conf into the container
		},
	)

	// Add the stop timeout as a label, as containerd doesn't natively support it
	config.Labels[stopTimeoutLabel] = strconv.FormatUint(uint64(config.StopTimeout), 10)

	// Build the OCI specification
	opts := []oci.SpecOpts{
		oci.WithDefaultSpec(),
		oci.WithDefaultUnixDevices,
		oci.WithTTY,
		oci.WithImageConfigArgs(img, config.Cmd),
		oci.WithEnv(config.EnvVars),
		withAddedCaps(config.CapAdds),
		withHostname(config.Hostname),
		withMounts(config.Binds), // mount the volumes
		withDevices(config.Devices), // mount the devices
	}

	// Known limitations, containerd doesn't support the following config fields:
	// - StopTimeout
	// - AutoRemove
	// - NetworkMode (only CNI supported)
	// - PortBindings

	snapshotOpt := containerd.WithSnapshot(name)
	if _, err = snapshotter.Stat(cc.ctx, name); errdefs.IsNotFound(err) {
		// Even if "read only" is set, we don't use a KindView snapshot here (#1495).
		// We pass the writable snapshot to the OCI runtime, and the runtime remounts
		// it as read-only after creating some mount points on-demand.
		snapshotOpt = containerd.WithNewSnapshot(name, img)
	} else if err != nil {
		return
	}

	cOpts := []containerd.NewContainerOpts{
		containerd.WithImage(img),
		snapshotOpt,
		//containerd.WithImageStopSignal(img, "SIGTERM"),
		containerd.WithNewSpec(opts...),
		containerd.WithContainerLabels(config.Labels),
	}

	cont, err := cc.client.NewContainer(cc.ctx, name, cOpts...)
	if err != nil {
		return
	}

	// This is a dummy PTY to silence output
	// when starting without attach breaking
	con, _, err := console.NewPty()
	if err != nil {
		return
	}
	defer util.DeferErr(&err, con.Close)

	// We need a temporary dummy stdin reader that
	// actually works, can't use nullReader here
	dummyReader, _, err := os.Pipe()
	if err != nil {
		return
	}
	defer util.DeferErr(&err, dummyReader.Close)

	// Spawn the Creator with the dummy streams
	ioCreator := cio.NewCreator(cio.WithTerminal, cio.WithStreams(dummyReader, con, con))

	task, err := cont.NewTask(cc.ctx, ioCreator)
	if err != nil {
		return
	}

	if err = task.Start(cc.ctx); err != nil {
		return
	}

	// TODO: Save task.Pid() somewhere for attaching?
	s = task.ID()
	return
}

The container is now running. Once started, it uses the ignite-spawn command to call firecracker and start the VM. The Dockerfile shows that the container's start command is:

ENTRYPOINT ["/usr/local/bin/ignite-spawn"]

firecracker start vm

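Parsing the configuration

The first step turns the mounted IGNITE_SPAWN_VM_FILE_PATH into a VM object. This file is the /var/lib/firecracker/vm/<UID>/metadata.json mounted when the VM was started. It contains the rules for creating the VM, such as CPU, memory, disk and network. Note that starting the VM happens inside the container.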

func decodeVM(vmID string) (*api.VM, error) {
	filePath := constants.IGNITE_SPAWN_VM_FILE_PATH
	obj, err := scheme.Serializer.DecodeFile(filePath, true)
	if err != nil {
		return nil, err
	}

	vm, ok := obj.(*api.VM)
	if !ok {
		return nil, fmt.Errorf("object couldn't be converted to VM")
	}

	// Explicitly set the GVK on this object
	vm.SetGroupVersionKind(api.SchemeGroupVersion.WithKind(api.KindVM.Title()))
	return vm, nil
}

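Starting the VM

Starting the VM takes three steps:

Configure the container network: mainly check that the interface addresses are in order and create interfaces for the VM
Configure DHCP
Start the VM: this step uses firecracker to start the VM, using the interfaces prepared in the first step, the devicemapper device on the host, and so on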

func StartVM(vm *api.VM) (err error) {

	// Setup networking inside of the container, return the available interfaces
	fcIfaces, dhcpIfaces, err := container.SetupContainerNetworking(vm) // configure the container network
	if err != nil {
		return fmt.Errorf("network setup failed: %v", err)
	}

	// Serve DHCP requests for those interfaces
	// This function returns the available IP addresses that are being
	// served over DHCP now
	if err = container.StartDHCPServers(vm, dhcpIfaces); err != nil { // configure DHCP
		return
	}

	// Serve metrics over an unix socket in the VM's own directory
	metricsSocket := path.Join(vm.ObjectPath(), constants.PROMETHEUS_SOCKET)
	serveMetrics(metricsSocket)

	// Patches the VM object to set state to stopped, and clear IP addresses
	defer util.DeferErr(&err, func() error { return patchStopped(vm) })

	// Remove the snapshot overlay post-run, which also removes the detached backing loop devices
	defer util.DeferErr(&err, func() error { return dmlegacy.DeactivateSnapshot(vm) })

	// Remove the Prometheus socket post-run
	defer util.DeferErr(&err, func() error { return os.Remove(metricsSocket) })

	// Execute Firecracker
	if err = container.ExecuteFirecracker(vm, fcIfaces); err != nil { // start the VM
		return fmt.Errorf("runtime error for VM %q: %v", vm.GetUID(), err)
	}

	return
}

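Configuring the container network

firecracker creates the VM inside the container, so the network environment for the VM has to be prepared inside the container. Extra interfaces can be added to the VM through the ignite.weave.works/interface annotation on the VM object.

Note: ignite supports two network modes, MODE_DHCP and MODE_TC. MODE_DHCP is the one used here.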

func SetupContainerNetworking(vm *api.VM) (firecracker.NetworkInterfaces, []DHCPInterface, error) {
   var dhcpIntfs []DHCPInterface
   var fcIntfs firecracker.NetworkInterfaces

   // Extra interfaces can be added through the ignite.weave.works/interface annotation in the VM's metadata.json
   vmIntfs := parseExtraIntfs(vm) // vmIntfs: map[<interface_name>]<interface_mode>

   // If eth0 is not listed, add it and set it to DHCP mode
   if _, ok := vmIntfs[mainInterface]; !ok {
      vmIntfs[mainInterface] = MODE_DHCP
   }

   interval := 1 * time.Second

   // Wait for the interfaces to become ready; the condition returns true once they are
   err := wait.PollImmediate(interval, constants.IGNITE_SPAWN_TIMEOUT, func() (bool, error) {

      // Check that the interfaces exist and are configured correctly
      retry, err := collectInterfaces(vmIntfs)

      if err == nil {
         // We're done here
         return true, nil
      }
      if retry {
         // We got an error, but let's ignore it and try again
         log.Warnf("Got an error while trying to set up networking, but retrying: %v", err)
         return false, nil
      }
      // The error was fatal, return it
      return false, err
   })

   if err != nil {
      return nil, nil, err
   }

   // Prepare the interfaces and the rest of the network environment for the VM
   if err := networkSetup(&fcIntfs, &dhcpIntfs, vmIntfs); err != nil {
      return nil, nil, err
   }

   return fcIntfs, dhcpIntfs, nil
}

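The collectInterfaces method checks that the interfaces exist and are configured correctly. The flow is:

Get all current interfaces
Verify that the expected interfaces in vmIntfs exist and that they have an IP address configured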

func collectInterfaces(vmIntfs map[string]string) (bool, error) {
   // Get all interfaces
   allIntfs, err := net.Interfaces()
   if err != nil || allIntfs == nil || len(allIntfs) == 0 {
      return false, fmt.Errorf("cannot get local network interfaces: %v", err)
   }

   // create a map of candidate interfaces
   foundIntfs := make(map[string]net.Interface)
   for _, intf := range allIntfs {
      if _, ok := ignoreInterfaces[intf.Name]; ok {
         continue
      }

      foundIntfs[intf.Name] = intf

      // If the interface is explicitly defined, no changes are needed
      if _, ok := vmIntfs[intf.Name]; ok { // the interface is already defined, so no mode needs to be assigned
         continue
      }

      // default fallback behaviour to always consider intfs with an address
      addrs, _ := intf.Addrs()
      if len(addrs) > 0 {
         vmIntfs[intf.Name] = MODE_DHCP
      }
   }

   // Verify that the expected interfaces have been created
   for intfName, mode := range vmIntfs {
      if _, ok := foundIntfs[intfName]; !ok {
         return true, fmt.Errorf("interface %q (mode %q) is still not found", intfName, mode)
      }

      // for DHCP interface, we need to make sure IP and route exist
      if mode == MODE_DHCP {
         intf := foundIntfs[intfName]
         _, _, _, noIPs, err := getAddress(&intf) // returns the interface's IP/mask, gateway and link; used here to check whether an IP is configured
         if err != nil {
            return true, err
         }

         if noIPs {
            return true, fmt.Errorf("IP is still not found on %q", intfName)
         }
      }
   }
   return false, nil
}

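Once the interfaces in vmIntfs have been validated, the network for the VM can be configured. The main flow is:

Iterate over all expected interfaces in the container, take the first address of each interface, remove that address from the interface, and return it. It is later used as the address of the firecracker VM, so the VM effectively borrows the container's address.
For each expected interface, create a tap interface and a bridge interface, then attach both the expected interface and the tap interface to the bridge. The tap interface is later handed to the VM created by firecracker, which is configured with the address returned in the first step.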

func networkSetup(fcIntfs *firecracker.NetworkInterfaces, dhcpIntfs *[]DHCPInterface, vmIntfs map[string]string) error {

	// The order in which interfaces are plugged in is intentionally deterministic
	// All interfaces are sorted alphabetically and 'eth0' is always first
	var keys []string
	for k := range vmIntfs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	sort.Slice(keys, func(i, j int) bool {
		return keys[i] == mainInterface
	})

	for _, intfName := range keys {

		intf, err := net.InterfaceByName(intfName) // look up the container interface by name
		if err != nil {
			return fmt.Errorf("cannot find interface %q: %s", intfName, err)
		}

		switch vmIntfs[intfName] {
		case MODE_DHCP:
			ipNet, gw, err := takeAddress(intf) // take the first address of the container interface, remove it from the interface, and return it for later use as the VM's address
			if err != nil {
				return fmt.Errorf("error parsing interface %q: %s", intfName, err)
			}

			dhcpIface, err := bridge(intf) // create the tap and bridge interfaces and bridge them; returns dhcpIface, the interface used by the VM
			if err != nil {
				return fmt.Errorf("bridging interface %q failed: %v", intfName, err)
			}

			dhcpIface.VMIPNet = ipNet
			dhcpIface.GatewayIP = gw

			*dhcpIntfs = append(*dhcpIntfs, *dhcpIface) // record the DHCP interface

			*fcIntfs = append(*fcIntfs, firecracker.NetworkInterface{
				StaticConfiguration: &firecracker.StaticNetworkConfiguration{
					MacAddress:  dhcpIface.MACFilter,
					HostDevName: dhcpIface.VMTAP,
				},
			})
		case MODE_TC:
			tcInterface, err := addTcRedirect(intf)
			if err != nil {
				log.Errorf("Failed to setup tc redirect %v", err)
				continue
			}

			*fcIntfs = append(*fcIntfs, *tcInterface)
		}
	}

	return nil
}
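
The ordering described in the comment at the top of networkSetup (alphabetical, with eth0 first) can also be expressed as a single comparison function. The following is a minimal sketch rather than ignite's implementation; orderInterfaces is a hypothetical helper, and mainInterface is assumed to be "eth0", as the comments above indicate.

```go
package main

import (
	"fmt"
	"sort"
)

// mainInterface is assumed to be "eth0", following the comments in networkSetup.
const mainInterface = "eth0"

// orderInterfaces returns a sorted copy of names: mainInterface first,
// everything else in alphabetical order.
func orderInterfaces(names []string) []string {
	ordered := append([]string(nil), names...)
	sort.Slice(ordered, func(i, j int) bool {
		a, b := ordered[i], ordered[j]
		// Exactly one of the two is the main interface: it sorts first.
		if (a == mainInterface) != (b == mainInterface) {
			return a == mainInterface
		}
		return a < b
	})
	return ordered
}

func main() {
	fmt.Println(orderInterfaces([]string{"eth2", "eth0", "eth1", "br0"}))
	// prints: [eth0 br0 eth1 eth2]
}
```

A comparator that encodes both rules keeps the result independent of the order in which the sort algorithm happens to compare elements.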

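Below are the container interfaces created when the bridge CNI is used. The IP on eth0 has been removed (by takeAddress); vm_eth0 and br_eth0 are the tap interface and the bridge interface created for eth0: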

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_eth0 state UP group default 
    link/ether 8e:c9:3a:f0:50:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::8cc9:3aff:fef0:5067/64 scope link 
       valid_lft forever preferred_lft forever
4: vm_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br_eth0 state UP group default qlen 1000
    link/ether d6:20:6b:d4:3e:2a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::d420:6bff:fed4:3e2a/64 scope link 
       valid_lft forever preferred_lft forever
5: br_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 92:79:7d:39:28:b6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9079:7dff:fe39:28b6/64 scope link 
       valid_lft forever preferred_lft forever

$ ip link show master br_eth0
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_eth0 state UP mode DEFAULT group default 
    link/ether 8e:c9:3a:f0:50:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
4: vm_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br_eth0 state UP mode DEFAULT group default qlen 1000
    link/ether d6:20:6b:d4:3e:2a brd ff:ff:ff:ff:ff:ff

The VM's interfaces and routes are shown below. The VM's eth0 is backed by vm_eth0 in the container:

$ ip a 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 1e:7d:6c:90:99:18 brd ff:ff:ff:ff:ff:ff
3: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e6:18:3d:88:a9:ff brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 76:6f:77:5d:6d:1c brd ff:ff:ff:ff:ff:ff
    inet 10.61.0.41/16 brd 10.61.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::746f:77ff:fe5d:6d1c/64 scope link 
       valid_lft forever preferred_lft forever
$ ip route
default via 10.61.0.1 dev eth0

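The resulting interface layout is as follows: [figure not included]

Configuring DHCP

This step starts a DHCP server on each bridge interface: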

// StartDHCPServers starts multiple DHCP servers for the VM, one per interface
// It returns the IP addresses that the API object may post in .status, and a potential error
func StartDHCPServers(vm *api.VM, dhcpIfaces []DHCPInterface) error {

	// Fetch the DNS servers given to the container
	clientConfig, err := dns.ClientConfigFromFile("/etc/resolv.conf")
	if err != nil {
		return fmt.Errorf("failed to get DNS configuration: %v", err)
	}

	for i := range dhcpIfaces {
		dhcpIface := &dhcpIfaces[i]
		// Set the VM hostname to the VM ID
		dhcpIface.Hostname = vm.GetUID().String()

		// Add the DNS servers from the container
		dhcpIface.SetDNSServers(clientConfig.Servers)

		go func() {
			log.Infof("Starting DHCP server for interface %q (%s)\n", dhcpIface.Bridge, dhcpIface.VMIPNet.IP)
			if err := dhcpIface.StartBlockingServer(); err != nil {
				log.Errorf("%q DHCP server error: %v\n", dhcpIface.Bridge, err)
			}
		}()
	}

	return nil
}
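
StartDHCPServers hands the container's DNS servers to each DHCP server. ignite reads them with dns.ClientConfigFromFile; the sketch below only illustrates what that step extracts, using the standard library. parseNameservers and the sample file content are assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseNameservers extracts the addresses from the "nameserver" lines of a
// resolv.conf-style document, skipping comments and other directives.
func parseNameservers(conf string) []string {
	var servers []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "nameserver" {
			servers = append(servers, fields[1])
		}
	}
	return servers
}

func main() {
	// Hypothetical resolv.conf content as seen inside the container.
	conf := "# generated\nsearch example.internal\nnameserver 10.61.0.1\nnameserver 8.8.8.8\n"
	fmt.Println(parseNameservers(conf)) // prints: [10.61.0.1 8.8.8.8]
}
```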

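Starting the VM

Starting a VM with firecracker involves the following basic steps:

Read the CPU and memory resources from the VM metadata
Configure the devicemapper device, the network interfaces and the mounted volumes
Initialize and start a firecracker VM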

func ExecuteFirecracker(vm *api.VM, fcIfaces firecracker.NetworkInterfaces) (err error) {
	drivePath := vm.SnapshotDev() // get the VM's devicemapper device; the host's /dev is mounted into the container, so it can be used directly

	vCPUCount := int64(vm.Spec.CPUs) // get the CPU and memory resources
	memSizeMib := int64(vm.Spec.Memory.MBytes())

	cmdLine := vm.Spec.Kernel.CmdLine
	if len(cmdLine) == 0 {
		// if for some reason cmdline would be unpopulated, set it to the default
		cmdLine = constants.VM_DEFAULT_KERNEL_ARGS
	}

	// Convert the logrus error level to a Firecracker compatible error level.
	// Firecracker accepts "Error", "Warning", "Info", and "Debug", case-sensitive.
	fcLogLevel := "Debug"
	switch logs.Logger.Level {
	case log.InfoLevel:
		fcLogLevel = "Info"
	case log.WarnLevel:
		fcLogLevel = "Warning"
	case log.ErrorLevel, log.FatalLevel, log.PanicLevel:
		fcLogLevel = "Error"
	}

	firecrackerSocketPath := path.Join(vm.ObjectPath(), constants.FIRECRACKER_API_SOCKET)
	logSocketPath := path.Join(vm.ObjectPath(), constants.LOG_FIFO)
	metricsSocketPath := path.Join(vm.ObjectPath(), constants.METRICS_FIFO)
	cfg := firecracker.Config{
		SocketPath:      firecrackerSocketPath,
		KernelImagePath: constants.IGNITE_SPAWN_VMLINUX_FILE_PATH, // path of the vmlinux file mounted into the container
		KernelArgs:      cmdLine,
		Drives: []models.Drive{{
			DriveID:      firecracker.String("1"),
			IsReadOnly:   firecracker.Bool(false),
			IsRootDevice: firecracker.Bool(true),
			PathOnHost:   &drivePath, // set the devicemapper device
		}},
		NetworkInterfaces: fcIfaces, // set the VM interfaces
		MachineCfg: models.MachineConfiguration{
			VcpuCount:  &vCPUCount,
			MemSizeMib: &memSizeMib,
			HtEnabled:  firecracker.Bool(true),
		},
		//JailerCfg: firecracker.JailerConfig{
		//	GID:      firecracker.Int(0),
		//	UID:      firecracker.Int(0),
		//	ID:       vm.ID,
		//	NumaNode: firecracker.Int(0),
		//	ExecFile: "firecracker",
		//},

		LogLevel: fcLogLevel,
		// TODO: We could use /dev/null, but firecracker-go-sdk issues Mkfifo which collides with the existing device
		LogFifo:     logSocketPath,
		MetricsFifo: metricsSocketPath,
	}

	// Add the volumes to the VM
	for i, volume := range vm.Spec.Storage.Volumes {
		volumePath := path.Join(constants.IGNITE_SPAWN_VOLUME_DIR, volume.Name)
		if !util.FileExists(volumePath) {
			log.Warnf("Skipping nonexistent volume: %q", volume.Name)
			continue // Skip all nonexistent volumes
		}

		cfg.Drives = append(cfg.Drives, models.Drive{
			DriveID:      firecracker.String(strconv.Itoa(i + 2)),
			IsReadOnly:   firecracker.Bool(false), // TODO: Support read-only volumes
			IsRootDevice: firecracker.Bool(false),
			PathOnHost:   &volumePath, // attach the volume; the path is passed host --> container --> VM
		})
	}

	// Remove these FIFOs for now
	defer os.Remove(logSocketPath)
	defer os.Remove(metricsSocketPath)

	ctx, vmmCancel := context.WithCancel(context.Background())
	defer vmmCancel()

	cmd := firecracker.VMCommandBuilder{}.
		WithBin("firecracker").
		WithSocketPath(firecrackerSocketPath).
		WithStdin(os.Stdin).
		WithStdout(os.Stdout).
		WithStderr(os.Stderr).
		Build(ctx)

	m, err := firecracker.NewMachine(ctx, cfg, firecracker.WithProcessRunner(cmd))
	if err != nil {
		return fmt.Errorf("failed to create machine: %s", err)
	}

	//defer os.Remove(cfg.SocketPath)

	//if opts.validMetadata != nil {
	//	m.EnableMetadata(opts.validMetadata)
	//}

	if err = m.Start(ctx); err != nil { // start the VM
		return fmt.Errorf("failed to start machine: %v", err)
	}
	defer util.DeferErr(&err, m.StopVMM)

	installSignalHandlers(ctx, m)

	// wait for the VMM to exit
	if err = m.Wait(ctx); err != nil {
		return fmt.Errorf("wait returned an error %s", err)
	}

	return
}
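
The drive numbering in ExecuteFirecracker has a consequence that is easy to miss: IDs are derived from the volume index, so a skipped volume leaves a gap. The sketch below reproduces only that numbering with plain structs; the drive type, buildDrives, the existing map (standing in for util.FileExists) and all paths are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// drive is a simplified stand-in for the firecracker drive model.
type drive struct {
	ID     string
	Path   string
	IsRoot bool
}

// buildDrives mirrors the numbering used above: the root device gets ID "1"
// and the volume at index i gets ID i+2. Volumes whose path is not marked as
// existing are skipped, which leaves a gap in the IDs.
func buildDrives(rootPath string, volumePaths []string, existing map[string]bool) []drive {
	drives := []drive{{ID: "1", Path: rootPath, IsRoot: true}}
	for i, p := range volumePaths {
		if !existing[p] {
			continue // skip nonexistent volumes, as the original does
		}
		drives = append(drives, drive{ID: strconv.Itoa(i + 2), Path: p})
	}
	return drives
}

func main() {
	drives := buildDrives("/dev/mapper/root",
		[]string{"/vol/a", "/vol/b", "/vol/c"},
		map[string]bool{"/vol/a": true, "/vol/c": true})
	for _, d := range drives {
		fmt.Println(d.ID, d.Path, d.IsRoot)
	}
	// prints:
	// 1 /dev/mapper/root true
	// 2 /vol/a false
	// 4 /vol/c false
}
```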

Run vm

Below is the entry point for running a VM. Internally it only calls Create and Start, that is, it performs the "Create vm" and "Start vm" steps described above:

func Run(ro *RunOptions, fs *flag.FlagSet) error {
   if err := Create(ro.CreateOptions); err != nil {
      return err
   }

   // Copy the pointer over for Start
   // TODO: This is pretty bad, fix this
   ro.vm = ro.VM

   return Start(ro.StartOptions, fs)
}

Kill VM

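kill vm forcibly stops a VM but does not delete it; the VM's metadata and storage remain under /var/lib/firecracker/vm/<UID>.

kill vm relies on the same StopVM method that remove vm calls, but it executes providers.Runtime.KillContainer, which stops the containerd task.

A containerd task must be killed before it can be deleted.

Note that the network resources are released here; they are configured again for the container when ignite vm start is executed.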

func StopVM(vm *api.VM, kill, silent bool) error {
	var err error
	container := vm.PrefixedID()
	action := "stop"

	if !vm.Running() && !logs.Quiet {
		log.Warnf("VM %q is not running but trying to cleanup networking for stopped container\n", vm.GetUID())
	}

	// Release the network resources
	if err = removeNetworking(vm.Status.Runtime.ID, vm.Spec.Network.Ports...); err != nil {
		log.Warnf("Failed to cleanup networking for stopped container %s %q: %v", vm.GetKind(), vm.GetUID(), err)

		return err
	}

	if vm.Running() {
		// Stop or kill the VM container
		if kill {
			action = "kill"
			err = providers.Runtime.KillContainer(container, signalSIGQUIT) // TODO: common constant for SIGQUIT
		} else {
			err = providers.Runtime.StopContainer(container, nil)
		}

		if err != nil {
			return fmt.Errorf("failed to %s container for %s %q: %v", action, vm.GetKind(), vm.GetUID(), err)
		}

		if silent {
			return nil
		}

		if logs.Quiet {
			fmt.Println(vm.GetUID())
		} else {
			log.Infof("Stopped %s with name %q and ID %q", vm.GetKind(), vm.GetName(), vm.GetUID())
		}
	}

	return nil
}

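KillContainer is implemented as follows: it loads the containerd task of the container, sends syscall.SIGQUIT to the task to force it to stop, and waits for the exit on the channel returned by task.Wait. Note that the signal argument is not used; SIGQUIT is hard-coded.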

func (cc *ctdClient) KillContainer(container, signal string) (err error) {
	cont, err := cc.client.LoadContainer(cc.ctx, container)
	if err != nil {
		// If the container is not found, return nil, no-op.
		if errdefs.IsNotFound(err) {
			log.Warn(err)
			err = nil
		}
		return
	}

	task, err := cont.Task(cc.ctx, cio.Load)
	if err != nil {
		// If the task is not found, return nil, no-op.
		if errdefs.IsNotFound(err) {
			log.Warn(err)
			err = nil
		}
		return
	}

	// Initiate a wait
	waitC, err := task.Wait(cc.ctx)
	if err != nil {
		return
	}

	// Send a SIGQUIT signal to force stop
	if err = task.Kill(cc.ctx, syscall.SIGQUIT); err != nil {
		return
	}

	// Wait for the container to stop
	<-waitC

	// Delete the task
	_, err = task.Delete(cc.ctx)
	return
}

Stop VM

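stop vm also uses the StopVM method, but executes providers.Runtime.StopContainer. Compared with kill vm it adds a grace period, which makes the shutdown more graceful.

First send the syscall.SIGTERM signal to the container process to request a graceful shutdown
If the process has not exited within the timeout (30s), send the syscall.SIGQUIT signal to force it to stop
Finally call task.Delete to delete the task

The core code is as follows: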

	waitC, err := task.Wait(cc.ctx)
	if err != nil {
		return
	}

	// Send a SIGTERM signal to request a clean shutdown
	if err = task.Kill(cc.ctx, syscall.SIGTERM); err != nil {
		return
	}

	// After sending the signal, start the timer to force-kill the task
	timeoutC := make(chan error)
	timer := time.AfterFunc(*timeout, func() {
		timeoutC <- task.Kill(cc.ctx, syscall.SIGQUIT)
	})

	// Wait for the task to stop or the timer to fire
	select {
	case exitStatus := <-waitC:
		timer.Stop()             // Cancel the force-kill timer
		err = exitStatus.Error() // TODO: Handle exit code
	case err = <-timeoutC: // The kill timer has fired
	}

	// Delete the task
	if _, e := task.Delete(cc.ctx); e != nil {
		if err != nil {
			err = fmt.Errorf("%v, task deletion failed: %v", err, e) // TODO: Multierror
		} else {
			err = e
		}
	}
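
The escalation pattern above (ask politely, then force after a timeout) does not depend on containerd. Below is a minimal sketch with the signal sends abstracted as callbacks; stopGracefully and its parameters are hypothetical names, not part of ignite or containerd. One deliberate difference from the excerpt: after a forced kill, the sketch waits for the exit notification before returning.

```go
package main

import (
	"fmt"
	"time"
)

// stopGracefully asks a task to terminate and force-kills it if it has not
// exited within timeout. It reports whether the force-kill was needed.
func stopGracefully(terminate, kill func() error, exited <-chan struct{}, timeout time.Duration) (forced bool, err error) {
	if err = terminate(); err != nil {
		return false, err
	}
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-exited: // clean shutdown within the grace period
		return false, nil
	case <-timer.C: // grace period elapsed: escalate
		if err = kill(); err != nil {
			return true, err
		}
		<-exited // wait for the exit before any cleanup happens
		return true, nil
	}
}

func main() {
	exited := make(chan struct{})
	terminate := func() error { return nil } // simulate a task that ignores the polite request
	kill := func() error { close(exited); return nil }
	forced, err := stopGracefully(terminate, kill, exited, 50*time.Millisecond)
	fmt.Println(forced, err) // prints: true <nil>
}
```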

Remove vm

Below is the entry point for removing a VM:

func Rm(ro *RmOptions) error {
	for _, vm := range ro.vms {
		// A running VM can only be removed when force is specified, just like removing a running container with the docker CLI
		if vm.Running() && !ro.Force {
			return fmt.Errorf("%s is running", vm.GetUID())
		}

		// Runtime and network info are present only when the VM is running.
		if vm.Running() {
			// Set the runtime and network-plugin providers from the VM status.
			if err := config.SetAndPopulateProviders(vm.Status.Runtime.Name, vm.Status.Network.Plugin); err != nil {
				return err
			}
		}

		// This will first kill the VM container, and then remove it
		if err := operations.DeleteVM(providers.Client, vm); err != nil {
			return err
		}
	}

	return nil
}

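A running VM comprises several resources: the containerd task, the containerd container, the CNI network, the devicemapper snapshot device mounted by the VM, the VM log file, and the VM object stored in Storage. Removing a VM means cleaning up all of them.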

func DeleteVM(c *client.Client, vm *api.VM) error {
	if err := CleanupVM(vm); err != nil {
		return err
	}

	// Delete the VM object and the /var/lib/firecracker/vm/<UID>/ directory
	return c.VMs().Delete(vm.GetUID())
}

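CleanupVM is the main cleanup method. It first calls StopVM (see the "Kill VM" and "Stop VM" sections) to stop and delete the container task and remove the container network, then calls RemoveVMContainer to clean up the containerd resources, and finally calls dmlegacy.DeactivateSnapshot to remove the VM's file system (internally this runs the dmsetup remove command). The steps are:

If the VM is running, call StopVM to remove the network and stop the task of the containerd container
Delete the container that hosts the VM. The container must be removed together with the VM, otherwise resources leak; see: issue
Remove the devicemapper snapshot device mounted by the VM, and the VM log file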

func CleanupVM(vm *api.VM) error {
	// Runtime information is available only when the VM is running.
	if vm.Running() {
		// Inspect the container before trying to stop it and it gets auto-removed
		inspectResult, _ := providers.Runtime.InspectContainer(vm.PrefixedID())

		// If the VM is running, try to kill it first so we don't leave dangling containers. Otherwise, try to cleanup VM networking.
		if err := StopVM(vm, true, true); err != nil {
			if vm.Running() {
				return err
			}
		}

		// Remove the VM container if it exists
		// TODO should this function return a proper error?
		RemoveVMContainer(inspectResult)
	}

	// After removing the VM container, if the Snapshot Device is still there, clean up
	if _, err := os.Stat(vm.SnapshotDev()); err == nil {
		// try remove it again with DeactivateSnapshot
		if err := dmlegacy.DeactivateSnapshot(vm); err != nil {
			return err
		}
	}

	if logs.Quiet {
		fmt.Println(vm.GetUID())
	} else {
		log.Infof("Removed %s with name %q and ID %q", vm.GetKind(), vm.GetName(), vm.GetUID())
	}

	return nil
}

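RemoveContainer performs the following cleanup:

Load the container that hosts the VM from containerd by name
Get and delete the container's task
Delete the container itself
Remove the VM log file /tmp/<containerName>.log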

func (cc *ctdClient) RemoveContainer(container string) error {
	// Remove the container if it exists
	cont, contLoadErr := cc.client.LoadContainer(cc.ctx, container)
	if errdefs.IsNotFound(contLoadErr) {
		log.Debug(contLoadErr)
		return nil
	} else if contLoadErr != nil {
		return contLoadErr
	}

	// Load the container's task without attaching
	task, taskLoadErr := cont.Task(cc.ctx, nil)
	if errdefs.IsNotFound(taskLoadErr) {
		log.Debug(taskLoadErr)
	} else if taskLoadErr != nil {
		return taskLoadErr
	} else {
		_, taskDeleteErr := task.Delete(cc.ctx)
		if taskDeleteErr != nil {
			log.Debug(taskDeleteErr)
		}
	}

	// Delete the container
	deleteContErr := cont.Delete(cc.ctx, containerd.WithSnapshotCleanup)
	if errdefs.IsNotFound(contLoadErr) { // NOTE: contLoadErr is always nil at this point; deleteContErr is presumably what was meant
		log.Debug(contLoadErr)
	} else if deleteContErr != nil {
		return deleteContErr
	}

	// Remove the log file if it exists
	logFile := fmt.Sprintf(logPathTemplate, container)
	if util.FileExists(logFile) {
		logDeleteErr := os.RemoveAll(logFile)
		if logDeleteErr != nil {
			return logDeleteErr
		}
	}

	return nil
}

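Auxiliary commands

vm logs

Fetching the VM logs amounts to fetching the output of the task of the container that hosts the VM, which is written to the file /tmp/ignite-<UID>.log: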

func (cc *ctdClient) ContainerLogs(container string) (r io.ReadCloser, err error) {
	var (
		cont containerd.Container
	)

	if cont, err = cc.client.LoadContainer(cc.ctx, container); err != nil {
		return
	}

	var retriever *logRetriever
	if retriever, err = newlogRetriever(fmt.Sprintf(logPathTemplate, container)); err != nil {
		return
	}

	if _, err = cont.Task(cc.ctx, cio.NewAttach(retriever.Opt())); err != nil {
		return
	}

	// Currently we have no way of detecting if the task's attach has filled the stdout and stderr
	// buffers without asynchronous I/O (syscall.Conn and syscall.Splice). If the read reaches
	// the end, the application hangs indefinitely waiting for new output from the container.
	// TODO: Get rid of this, implement asynchronous I/O and read until the streams have been exhausted
	time.Sleep(time.Second)

	// Close the writer to signal EOF
	if err = retriever.CloseWriter(); err != nil {
		return
	}

	return retriever, nil
}

Attach vm

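ignite offers two ways to connect to a terminal: attach and ssh. The difference is that every ssh invocation creates a new session, whereas attach always operates on the system console, so ssh is usually the way to obtain a terminal session.

This part of ignite's code is based on the attach implementation in containerd.

The attach operation first obtains the current console and then handles input and output. When ignite starts the container it configures a terminal with oci.WithTTY.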

func (cc *ctdClient) AttachContainer(container string) (err error) {
	var (
		cont containerd.Container
		spec *oci.Spec
	)

	if cont, err = cc.client.LoadContainer(cc.ctx, container); err != nil {
		return
	}

	if spec, err = cont.Spec(cc.ctx); err != nil {
		return
	}

	var (
		con console.Console
		tty = spec.Process.Terminal
	)

	if tty {
		con = console.Current() // get the current console
		defer util.DeferErr(&err, con.Reset)
		if err = con.SetRaw(); err != nil {
			return
		}
	}

	var (
		task     containerd.Task
		statusC  <-chan containerd.ExitStatus
		igniteIO *igniteIO
	)

	if igniteIO, err = newIgniteIO(fmt.Sprintf(logPathTemplate, container)); err != nil {
		return
	}
	defer util.DeferErr(&err, igniteIO.Close)

	if task, err = cont.Task(cc.ctx, cio.NewAttach(igniteIO.Opt())); err != nil { // attach the task I/O, including the log output
		return
	}

	if statusC, err = task.Wait(cc.ctx); err != nil {
		return
	}

	if tty {
		if err := HandleConsoleResize(cc.ctx, task, con); err != nil {
			log.Errorf("console resize failed: %v", err)
		}
	} else {
		sigc := ForwardAllSignals(cc.ctx, task)
		defer StopCatch(sigc)
	}

	var code uint32
	select {
	case ec := <-statusC:
		code, _, err = ec.Result()
	case <-igniteIO.Detach():
		fmt.Println() // Use a new line for the log entry
		log.Println("Detached")
	}

	if code != 0 && err == nil {
		err = fmt.Errorf("attach exited with code %d", code)
	}

	return
}

Inspect vm

inspect can display all three resource kinds (image/kernel/vm): it loads the object from Storage, decodes it and prints the result.

ssh vm

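SSH access is available when the VM was created with the --ssh flag:

$ ignite ssh my-vm

The "Create vm -> Configuring ssh" section already described how the ssh service is set up in the VM. This section looks at how the client connects to it.

The connection uses a key pair; most of the code is standard ssh client code, see the demo.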

// runSSH creates and runs ssh session based on the provided arguments.
// If the command list is empty, ssh shell is created, else the ssh command is
// executed.
func runSSH(vm *api.VM, privKeyFile string, command []string, tty bool, timeout uint32) (err error) {
	// Check if the VM is running.
	if !vm.Running() {
		return fmt.Errorf("VM %q is not running", vm.GetUID())
	}

	// Get the IP address.
	ipAddrs := vm.Status.Network.IPAddresses // IP addresses used for the ssh connection
	if len(ipAddrs) == 0 {
		return fmt.Errorf("VM %q has no usable IP addresses", vm.GetUID())
	}

	// Get private key file path.
	if len(privKeyFile) == 0 { // fall back to the VM's local private key
		privKeyFile = path.Join(vm.ObjectPath(), fmt.Sprintf(constants.VM_SSH_KEY_TEMPLATE, vm.GetUID()))
		if !util.FileExists(privKeyFile) {
			return fmt.Errorf("no private key found for VM %q", vm.GetUID())
		}
	}

	// Create a new ssh signer for the private key.
	signer, err := newSignerForKey(privKeyFile)
	if err != nil {
		return fmt.Errorf("unable to create signer for private key: %v", err)
	}

	// Defer exit here and set the exit code based on any ssh error, so that
	// this ssh command returns the correct ssh exit code. Since this function
	// results in an os.Exit, any error returned by this function won't be
	// received by the caller. Print the error to make the error message
	// visible and set the error code when an error is found.
	exitCode := 0
	defer func() {
		os.Exit(exitCode)
	}()

	// printErrAndSetExitCode is used to print an error message, set exit code
	// and return nil. This is needed because once the ssh connection is
	// established, to return the error code of the actual ssh session, instead
	// of returning an error, the runSSH function defers os.Exit with the ssh
	// exit code. For showing any error to the user, it needs to be printed.
	printErrAndSetExitCode := func(errMsg error, exitCode *int, code int) error {
		log.Errorf("%v\n", errMsg)
		*exitCode = code
		return nil
	}

	// Create an SSH client, and connect.
	config := newSSHConfig(signer, timeout)
	client, err := ssh.Dial(defaultSSHNetwork, net.JoinHostPort(ipAddrs[0].String(), defaultSSHPort), config)
	if err != nil {
		return printErrAndSetExitCode(fmt.Errorf("failed to dial: %v", err), &exitCode, 1)
	}
	defer util.DeferErr(&err, client.Close)

	// Create a session.
	session, err := client.NewSession()
	if err != nil {
		return printErrAndSetExitCode(fmt.Errorf("failed to create session: %v", err), &exitCode, 1)
	}
	defer util.DeferErr(&err, session.Close)

	// Configure tty if requested.
	if tty {
		// Get stdin file descriptor reference.
		fd := int(os.Stdin.Fd())

		// Store the raw state of the terminal.
		state, err := terminal.MakeRaw(fd)
		if err != nil {
			return printErrAndSetExitCode(fmt.Errorf("failed to make terminal raw: %v", err), &exitCode, 1)
		}
		defer util.DeferErr(&err, func() error { return terminal.Restore(fd, state) })

		// Get the terminal dimensions.
		w, h, err := terminal.GetSize(fd)
		if err != nil {
			return printErrAndSetExitCode(fmt.Errorf("failed to get terminal size: %v", err), &exitCode, 1)
		}

		// Set terminal modes.
		modes := ssh.TerminalModes{
			ssh.ECHO: 1,
		}

		// Read the TERM environment variable and use it to request the PTY.
		term := os.Getenv("TERM")
		if term == "" {
			term = defaultTerm
		}

		if err = session.RequestPty(term, h, w, modes); err != nil {
			return printErrAndSetExitCode(fmt.Errorf("request for pseudo terminal failed: %v", err), &exitCode, 1)
		}
	}

	// Connect input / output.
	// TODO: these should come from the cobra command instead of hardcoding
	// os.Stderr etc.
	session.Stderr = os.Stderr
	session.Stdout = os.Stdout
	session.Stdin = os.Stdin

	if len(command) == 0 {
		if err = session.Shell(); err != nil {
			return printErrAndSetExitCode(fmt.Errorf("failed to start shell: %v", err), &exitCode, 1)
		}

		if err = session.Wait(); err != nil {
			if e, ok := err.(*ssh.ExitError); ok {
				return printErrAndSetExitCode(err, &exitCode, e.ExitStatus())
			}
			return printErrAndSetExitCode(fmt.Errorf("failed waiting for session to exit: %v", err), &exitCode, 1)
		}
	} else {
		if err = session.Run(joinShellCommand(command)); err != nil {
			if e, ok := err.(*ssh.ExitError); ok {
				return printErrAndSetExitCode(err, &exitCode, e.ExitStatus())
			}
			return printErrAndSetExitCode(fmt.Errorf("failed to run shell command: %s", err), &exitCode, 1)
		}
	}
	return
}

func newSSHConfig(publicKey ssh.Signer, timeout uint32) *ssh.ClientConfig {
   return &ssh.ClientConfig{
      User: "root",
      Auth: []ssh.AuthMethod{
         ssh.PublicKeys(publicKey),
      },
      HostKeyCallback: ssh.InsecureIgnoreHostKey(), // TODO: use ssh.FixedPublicKey instead
      Timeout:         time.Second * time.Duration(timeout),
   }
}

exec vm

As shown below, exec is implemented on top of ssh: it first uses waitForSSH to wait until the ssh service is working, then uses runSSH to log in:

func Exec(eo *ExecOptions) error {
	if err := waitForSSH(eo.vm, constants.SSH_DEFAULT_TIMEOUT_SECONDS, time.Duration(eo.Timeout)*time.Second); err != nil {
		return err
	}
	return runSSH(eo.vm, eo.IdentityFile, eo.command, eo.Tty, eo.Timeout)
}

func waitForSSH(vm *ignite.VM, dialSeconds int, sshTimeout time.Duration) error {
	if err := dialSuccess(vm, dialSeconds); err != nil { // check that the ssh service is reachable
		return err
	}

	certCheck := &ssh.CertChecker{
		IsHostAuthority: func(auth ssh.PublicKey, address string) bool {
			return true
		},
		IsRevoked: func(cert *ssh.Certificate) bool {
			return false
		},
		HostKeyFallback: func(hostname string, remote net.Addr, key ssh.PublicKey) error {
			return nil
		},
	}

	config := &ssh.ClientConfig{ // configure a login without any authentication method
		HostKeyCallback: certCheck.CheckHostKey,
		Timeout:         sshTimeout,
	}

	addr := vm.Status.Network.IPAddresses[0].String() + ":22"
	sshConn, err := ssh.Dial("tcp", addr, config) // an "unable to authenticate" error is expected; receiving it shows that the ssh service is up
	if err != nil {
		if strings.Contains(err.Error(), "unable to authenticate") {
			// we connected to the ssh server and received the expected failure
			return nil
		}
		return err
	}

	defer sshConn.Close()
	return fmt.Errorf("waitForSSH: connected successfully with no authentication -- failure was expected")
}
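
The implementation of dialSuccess is not shown here. As a rough idea of what a reachability check of this kind can look like, the following sketch retries a plain TCP connect until it succeeds or a deadline passes. waitForTCP is a hypothetical helper and an assumption about the approach, not ignite's code.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForTCP retries a TCP connect to addr until it succeeds or the timeout
// has passed. Each attempt is limited to interval, and attempts are spaced
// by interval.
func waitForTCP(addr string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, interval)
		if err == nil {
			return conn.Close() // reachable: nothing else to do
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s not reachable after %s: %w", addr, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// A local listener stands in for the VM's ssh port.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println(waitForTCP(ln.Addr().String(), 100*time.Millisecond, time.Second))
}
```

A successful connect only proves that something is listening on the port; the unauthenticated ssh.Dial in waitForSSH is what confirms that the listener actually speaks ssh.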

rm image

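Find the image object in Storage by image ID, and fetch all VM objects
If the --force flag is specified when removing the image, the VMs that use the image are deleted as well
Delete the image directory /var/lib/firecracker/image/<UID>

If multiple images are specified, they are processed one by one.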
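
The steps above can be sketched as a small planning function. This is an illustration, not ignite's implementation: the vm type and both function names are hypothetical, and the behaviour without --force (refusing to remove an image that is still in use) is an assumption, since only the --force case is described above.

```go
package main

import "fmt"

// vm is a simplified stand-in for a VM object that references an image.
type vm struct {
	UID      string
	ImageUID string
}

// vmsUsingImage returns the UIDs of all VMs that reference imageUID.
func vmsUsingImage(imageUID string, vms []vm) []string {
	var users []string
	for _, v := range vms {
		if v.ImageUID == imageUID {
			users = append(users, v.UID)
		}
	}
	return users
}

// planImageRemoval returns the VMs to remove before the image itself.
// Assumption: without force, an image that is still in use is rejected.
func planImageRemoval(imageUID string, vms []vm, force bool) ([]string, error) {
	users := vmsUsingImage(imageUID, vms)
	if len(users) > 0 && !force {
		return nil, fmt.Errorf("image %s is used by %d VM(s)", imageUID, len(users))
	}
	return users, nil
}

func main() {
	vms := []vm{
		{UID: "vm-1", ImageUID: "img-a"},
		{UID: "vm-2", ImageUID: "img-b"},
		{UID: "vm-3", ImageUID: "img-a"},
	}
	fmt.Println(planImageRemoval("img-a", vms, false))
	fmt.Println(planImageRemoval("img-a", vms, true))
}
```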

rm kernel

The logic is the same as for rm image.

For more CLI operations, see the official documentation.

Ignited Daemon

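The ignited daemon is ignite's daemon process. When a user creates a VM manifest under constants.MANIFEST_DIR (default /etc/firecracker/manifests), ignited detects the file change automatically and reads the images and metadata needed to create the VM from constants.DATA_DIR (default /var/lib/firecracker).

ignited uses a ManifestStorage to manage the two directories constants.MANIFEST_DIR and constants.DATA_DIR, and stores the resulting ManifestStorage, wrapped in a cache, in providers.Storage: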

func SetManifestStorage() (err error) {
	log.Trace("Initializing the ManifestStorage provider...")
	ManifestStorage, err = manifest.NewTwoWayManifestStorage(constants.MANIFEST_DIR, constants.DATA_DIR, scheme.Serializer)
	if err != nil {
		return
	}

	providers.Storage = cache.NewCache(ManifestStorage)
	return
}

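Because changes to the constants.MANIFEST_DIR directory have to be watched, a GenericWatchStorage is used here; internally it uses the rjeczalik/notify library to report file changes (Create/Modify/Delete).

constants.DATA_DIR holds the image-related files. They only need to be read when a VM is created and do not need to be watched, so a plain GenericStorage manages them.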

func NewTwoWayManifestStorage(manifestDir, dataDir string, ser serializer.Serializer) (*ManifestStorage, error) {
   ws, err := watch.NewGenericWatchStorage(storage.NewGenericStorage(storage.NewGenericMappedRawStorage(manifestDir), ser))
   if err != nil {
      return nil, err
   }

   ss := sync.NewSyncStorage(
      storage.NewGenericStorage(
         storage.NewGenericRawStorage(dataDir), ser),
      ws)

   return &ManifestStorage{
      Storage: ss,
   }, nil
}

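The relationship between syncStorage and the main loop of the ignited daemon (ReconcileManifests) is as follows. A syncStorage can front several Storages and operate on them together, for example getting, setting or deleting a resource object across multiple Storages.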

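watchStorage passes file events to syncStorage through a channel named eventStream, and syncStorage passes them on to the main loop of the ignited daemon through a channel named updateStream. The main loop then acts (create/delete/update) according to the event type and the object that produced the event. Note that the main loop only cares about events for VM objects.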
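
The two-stage hand-off described above can be modelled with plain channels. The sketch below is a simplification under stated assumptions: event, relay and reconcile are hypothetical stand-ins for the file events, syncStorage and ReconcileManifests, and the string kinds replace the real API types.

```go
package main

import "fmt"

// event is a simplified stand-in for a storage update.
type event struct {
	Kind string // e.g. "VM" or "Image"
	Op   string // "create", "modify" or "delete"
	UID  string
}

// relay forwards every event from in to the returned channel, standing in
// for the syncStorage stage between the watcher and the reconcile loop.
func relay(in <-chan event) <-chan event {
	out := make(chan event)
	go func() {
		defer close(out)
		for ev := range in {
			out <- ev
		}
	}()
	return out
}

// reconcile consumes updates until the channel is closed and returns the VM
// events it acted on, ignoring every other kind.
func reconcile(updates <-chan event) []string {
	var handled []string
	for ev := range updates {
		if ev.Kind != "VM" {
			continue
		}
		handled = append(handled, ev.Op+":"+ev.UID)
	}
	return handled
}

func main() {
	eventStream := make(chan event)
	go func() {
		defer close(eventStream)
		eventStream <- event{Kind: "VM", Op: "create", UID: "vm-1"}
		eventStream <- event{Kind: "Image", Op: "modify", UID: "img-1"}
		eventStream <- event{Kind: "VM", Op: "delete", UID: "vm-1"}
	}()
	fmt.Println(reconcile(relay(eventStream))) // prints: [create:vm-1 delete:vm-1]
}
```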

With that in mind, the main loop of the ignited daemon is straightforward: it performs the operation on the VM that matches the event type.

func ReconcileManifests(s *manifest.ManifestStorage) {
	startMetricsThread()

	// Wrap the Manifest Storage with a cache for better performance, and create a client
	c = client.NewClient(cache.NewCache(s))

	// Listen for the events passed on by syncStorage
	for upd := range s.GetUpdateStream() {

		// Only events for VM resources are of interest
		if upd.APIType.GetKind() != api.KindVM {
			log.Tracef("GitOps: Ignoring kind %s", upd.APIType.GetKind())
			kindIgnored.Inc()
			continue
		}

		var vm *api.VM
		var err error
		// For a delete event the VM's manifest has already been removed from the ManifestStorage, so the VM cannot be fetched from it
		if upd.Event == update.ObjectEventDelete { 
			// As we know this VM was deleted, it wouldn't show up in a Get() call
			// Construct a temporary VM object for passing to the delete function
			vm = &api.VM{
				TypeMeta:   *upd.APIType.GetTypeMeta(),
				ObjectMeta: *upd.APIType.GetObjectMeta(),
				Status: api.VMStatus{
					Running: true, // TODO: Fix this in StopVM
				},
			}
		} else {
			// Get the real API object
			vm, err = c.VMs().Get(upd.APIType.GetUID())
			if err != nil {
				log.Errorf("Getting %s %q returned an error: %v", upd.APIType.GetKind(), upd.APIType.GetUID(), err)
				continue
			}

			// If the object was existent in the storage; validate it
			// Validate the VM object
			// TODO: Validate name uniqueness
			if err := validation.ValidateVM(vm).ToAggregate(); err != nil {
				log.Warnf("Skipping %s of %s %q, not valid: %v.", upd.Event, upd.APIType.GetKind(), upd.APIType.GetUID(), err)
				continue
			}
		}

		// TODO: Parallelization
		switch upd.Event {
		case update.ObjectEventCreate, update.ObjectEventModify: // handle create and modify events
			runHandle(func() error {
				return handleChange(vm)
			})

		case update.ObjectEventDelete: // handle delete events
			runHandle(func() error {
				// TODO: Temporary VM Object for removal
				return handleDelete(vm)
			})
		default:
			log.Infof("Unrecognized Git update type %s\n", upd.Event)
			continue
		}
	}
}

The event handlers are as follows:

func handleChange(vm *api.VM) (err error) {
	// Only apply the new state if it
	// differs from the current state
	running := currentState(vm)
	if vm.Status.Running && !running { // the metadata says running but the VM is not: start it
		err = start(vm)
	} else if !vm.Status.Running && running { // the metadata says not running but the VM is: stop it
		err = stop(vm)
	}

	return
}

func handleDelete(vm *api.VM) error {
	return remove(vm)
}

func remove(vm *api.VM) error {
	log.Infof("Removing VM %q with name %q...", vm.GetUID(), vm.GetName())
	vmDeleted.Inc()
	// Object deletion is performed by the SyncStorage, so we just
	// need to clean up any remaining resources of the VM here
	return operations.CleanupVM(vm)
}
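
handleChange is a small instance of desired-state reconciliation: compare what the manifest asks for with what is observed, and act only on a difference. A minimal sketch of that decision table follows, with decide as a hypothetical helper name.

```go
package main

import "fmt"

// decide compares the desired running state from the manifest with the
// observed state and returns the action to take.
func decide(desiredRunning, actuallyRunning bool) string {
	switch {
	case desiredRunning && !actuallyRunning:
		return "start"
	case !desiredRunning && actuallyRunning:
		return "stop"
	default:
		return "none" // states already agree
	}
}

func main() {
	for _, desired := range []bool{true, false} {
		for _, actual := range []bool{true, false} {
			fmt.Printf("desired=%v actual=%v -> %s\n", desired, actual, decide(desired, actual))
		}
	}
}
```

Because nothing happens when the two states agree, replaying the same event is harmless, which suits a loop driven by file-change notifications.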

CNI

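ignite uses CNI to configure the host and container networks.

Default CNI

The default CNI plugin is bridge, whose source lives in plugins. Its main drawback is that VMs on different nodes cannot communicate with each other. The structure is as follows: [figure not included]

The bridge CNI configuration, /etc/cni/net.d/10-ignite.conflist, is as follows: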

{
	"cniVersion": "0.4.0",
	"name": "ignite-cni-bridge",
	"plugins": [
		{
			"type": "bridge",
			"bridge": "ignite0",
			"isGateway": true,
			"isDefaultGateway": true,
			"promiscMode": true,
			"ipMasq": true,
			"ipam": {
				"type": "host-local",
				"subnet": "10.61.0.0/16"
			}
		},
		{
			"type": "portmap",
			"capabilities": {
				"portMappings": true
			}
		},
		{
			"type": "firewall"
		}
	]
}
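
To see which plugins the chain contains and in what order they are listed, the conflist can be decoded with encoding/json. The sketch below is illustrative only: the confList struct models just the fields it reads, and pluginTypes and sampleConflist are hypothetical names. The sample is the configuration above with most plugin options omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// confList models only the fields of a CNI conflist that this sketch reads.
type confList struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Plugins    []struct {
		Type string `json:"type"`
	} `json:"plugins"`
}

// sampleConflist is a reduced copy of the ignite bridge configuration.
const sampleConflist = `{
	"cniVersion": "0.4.0",
	"name": "ignite-cni-bridge",
	"plugins": [
		{"type": "bridge", "bridge": "ignite0", "ipam": {"type": "host-local", "subnet": "10.61.0.0/16"}},
		{"type": "portmap", "capabilities": {"portMappings": true}},
		{"type": "firewall"}
	]
}`

// pluginTypes parses a conflist and returns its name and the plugin types in
// the order in which they appear.
func pluginTypes(data []byte) (string, []string, error) {
	var cl confList
	if err := json.Unmarshal(data, &cl); err != nil {
		return "", nil, err
	}
	types := make([]string, 0, len(cl.Plugins))
	for _, p := range cl.Plugins {
		types = append(types, p.Type)
	}
	return cl.Name, types, nil
}

func main() {
	name, types, err := pluginTypes([]byte(sampleConflist))
	if err != nil {
		panic(err)
	}
	fmt.Println(name, types) // prints: ignite-cni-bridge [bridge portmap firewall]
}
```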

ignite uses go-cni to configure CNI. It simply follows basic go-cni usage, namely the following interface:

New(config ...CNIOpt) (CNI, error)  // initialize a CNI object
// Setup setup the network for the namespace
Setup(ctx context.Context, id string, path string, opts ...NamespaceOpts) (*CNIResult, error)
// Remove tears down the network of the namespace.
Remove(ctx context.Context, id string, path string, opts ...NamespaceOpts) error
// Load loads the cni network config
Load(opts ...CNIOpt) error

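In addition, the Flannel plugin can be used for cross-host communication, but flannel requires etcd to maintain the network. See the official documentation for more.

Building and creating images

References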

posted @ 2023-08-16 09:33 charlieroro
