Linux Namespaces (reposted from Ed King)

Building Containers from Scratch in Go (GitHub source code)

netns - network namespaces in Go (GitHub source code)

how to build a container from scratch

Linux containers in 500 lines of code by Lizzie Dixon
Building Containers from Scratch with Go by Liz Rice
Building Containers from Scratch in Go by mugli
- Part 1: Linux Namespaces https://medium.com/@teddyking/linux-namespaces-850489d3ccf
- Part 2: Namespaces in Go - Basics https://medium.com/@teddyking/namespaces-in-go-basics-e3f0fc1ff69a
- Part 3: Namespaces in Go - User https://medium.com/@teddyking/namespaces-in-go-user-a54ef9476f2a
- Part 4: Namespaces in Go - reexec https://medium.com/@teddyking/namespaces-in-go-reexec-3d1295b91af8
- Part 5: Namespaces in Go - Mount https://medium.com/@teddyking/namespaces-in-go-mount-e4c04fe9fb29
- Part 6: Namespaces in Go - Network https://medium.com/@teddyking/namespaces-in-go-network-fdcf63e76100
- Part 7: Namespaces in Go - UTS https://medium.com/@teddyking/namespaces-in-go-uts-d47aebcdf00e
Build Your Own Container Using Less than 100 Lines of Go by Julian Friedman
Creating Your Own Containers
Building Containers in Pure Bash and C
HN: https://news.ycombinator.com/item?id=16734440

The original article can no longer be opened, so this copy was taken from the Google cache.

https://medium.com/@teddyking/linux-namespaces-850489d3ccf

Linux Namespaces

Ed King

Dec 10, 2016·3 min read

Linux namespaces comprise some of the fundamental technologies behind most modern-day container implementations. At a high level, they allow for isolation of global system resources between independent processes. For example, the PID namespace isolates the process ID number space. This means that two processes running on the same host can have the same PID!

This level of isolation is clearly useful in the world of containers. Without namespaces, a process running in container A could, for example, umount an important filesystem in container B, or change the hostname of container C, or remove a network interface from container D. By namespacing these resources, the process in container A isn’t even aware that the processes in containers B, C and D exist.

It follows that you can’t interfere with something if it’s not visible to you. And that’s really what namespaces provide - a way to limit what a process can see, to make it appear as though it’s the only process running on a host.

Note that namespaces do not restrict access to physical resources such as CPU, memory and disk. That access is metered and restricted by a kernel feature called ‘cgroups’.

👟 Kicking the tyres

The following has been tested on an Ubuntu 16.04 Xenial machine

Let’s jump straight in with a practical example of namespaces in action.

$ unshare -h

Usage:  
 unshare [options] <program> [<argument>...]

Run a program with some namespaces unshared from the parent.

Options:  
 -m, --mount[=<file>]      unshare mounts namespace
 -u, --uts[=<file>]        unshare UTS namespace (hostname etc)
...

The unshare command allows you to run a program with some namespaces ‘unshared’ from its parent. Essentially what this means is that unshare will run whatever program you pass it in a new set of namespaces.

Let’s run through an example using the UTS namespace. The UTS namespace provides isolation of the hostname and domainname system identifiers. This isolation can be tested by running hostname my-new-hostname inside a UTS namespaced /bin/sh process, and confirming that the hostname change is not reflected outside that process.

$ sudo su                   # become root user
$ hostname                  # check current hostname
dev-ubuntu  
$ unshare -u /bin/sh        # create a shell in new UTS namespace
$ hostname my-new-hostname  # set hostname
$ hostname                  # confirm new hostname
my-new-hostname  
$ exit                      # exit new UTS namespace
$ hostname                  # confirm original hostname unchanged
dev-ubuntu

Breaking this down, we start by running sudo su to become the root user. Root privileges are required to create most namespaces (the exception being the user namespace - more on that in a later article). Then we run hostname to confirm our current hostname ('dev-ubuntu' in my case).

Now for the exciting part! The unshare -u /bin/sh command drops us into a shell that's running in a new, separate UTS namespace. We then run hostname my-new-hostname to set the hostname inside the new UTS namespace only. The change can be confirmed by running hostname again.

Lastly we exit the namespaced shell and run hostname one last time. We can see that the value for the hostname matches the original value, despite having run hostname my-new-hostname in between. This is because that change only took effect inside the new UTS namespace.

👑 7 namespaces to rule them all

The above example demonstrates the UTS namespace, but the fun doesn’t end there. At the time of writing there are 7 namespaces available:

Mount - isolate filesystem mount points
UTS - isolate hostname and domainname
IPC - isolate interprocess communication (IPC) resources
PID - isolate the PID number space
Network - isolate network interfaces
User - isolate UID/GID number spaces
Cgroup - isolate cgroup root directory

Most container implementations make use of the above namespaces in order to provide the highest level of isolation between separate container processes. Note, however, that the cgroup namespace is slightly more recent than the others and isn’t as widely used.
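To see these namespaces for a running process, the kernel exposes one symlink per namespace under /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name nsLinkPath is made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// namespaceNames lists the seven namespace types described above, using the
// file names the kernel exposes under /proc/<pid>/ns.
var namespaceNames = []string{"mnt", "uts", "ipc", "pid", "net", "user", "cgroup"}

// nsLinkPath (hypothetical helper) returns the path of the namespace symlink
// for a process. pid may be a numeric string, or "self" for the caller.
func nsLinkPath(pid, name string) string {
	return filepath.Join("/proc", pid, "ns", name)
}

func main() {
	for _, name := range namespaceNames {
		target, err := os.Readlink(nsLinkPath("self", name))
		if err != nil {
			// Older kernels may lack some entries (for example cgroup).
			fmt.Printf("%-6s unavailable: %v\n", name, err)
			continue
		}
		fmt.Printf("%-6s %s\n", name, target)
	}
}
```

Each line of output pairs a namespace type with an identifier for the namespace the current process belongs to.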

📺 On the next …

The unshare command is great, but what happens when we want more fine-grained control over the namespaces in our programs? The answer to this and plenty more coming up, stay tuned…

Update: Part 2, “Namespaces in Go - Basics” has been published and is available here.

Namespaces in Go - Basics

Ed King

Dec 11, 2016·5 min read

In the previous article we dipped our toes in the namespace waters with the unshare command. unshare is great for simple scripting around namespaces but it's not so well suited for when we need more fine-grained and precise control, as is the case with containers. For this use case it's much better to have the support of a fully fledged programming language.

Go has emerged as the container implementation language of choice. This is due in part to the fact that Docker was, and still is, written in Go. Docker is one of the most successful open source Go projects to date (37,680 GitHub ⭐️s at time of writing) and it showed the world that Go was a language to be taken seriously.

The Docker developers have previously outlined the reasons they chose to write Docker in Go. Some of the top reasons include static compilation, good asynchronous primitives, low-level interfaces, a full development environment and strong cross compilation support.

For me personally the real beauty of Go is in its apparent simplicity. Containers are hard! And by using a ‘simple’ language it makes it much easier to reason about what exactly is going on under the hood. There is a great talk by Rob Pike, “Simplicity is Complicated”, in which he discusses how simplicity is part of Go’s design. It’s definitely worth a watch if you’re interested.

👉 Let’s Go

The aim for this series of articles is to provide an understanding of how to work with Linux namespaces inside Go programs. To achieve this, we will be building out a sample application named ns-process.

ns-process will be fairly simple to begin with - it will create a /bin/sh process in a new set of namespaces. Over the course of the next few articles it will evolve into something much more exciting - a program capable of creating unprivileged containers! Don’t worry if you’re not sure what “unprivileged” means in this context, all will be explained along the way.

The code for ns-process is available on GitHub and I highly recommend cloning the repo so you can follow along at home.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 1.0
# Filename: ns_process.go

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Env = []string{"PS1=-[ns-process]- # "}

	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}

	if err := cmd.Run(); err != nil {
		fmt.Printf("Error running the /bin/sh command - %s\n", err)
		os.Exit(1)
	}
}

As you can see, there’s nothing particularly complicated here. We’re simply creating a *exec.Cmd, piping through stdin/out/err from the calling process and setting the PS1 environment variable on the new process (this just makes it easier to identify the namespaced shell when executing the program).

The interesting part is cmd.SysProcAttr, but before understanding SysProcAttr we need to take a deeper look at the underlying system calls that make up the namespaces API.

📝 The namespaces API

The namespaces(7) man page tells us there are 3 system calls that make up the API:

clone(2) - creates a new process
setns(2) - allows the calling process to join an existing namespace
unshare(2) - moves the calling process to a new namespace

unshare() may look familiar from the previous article. This is the system call that gets invoked when running the unshare command. The call we're interested in this time is clone(), as clone() gets called when Go’s *exec.Cmd is started via cmd.Run().

When calling clone() it's possible to pass one or more CLONE_* flags. Each namespace has a corresponding CLONE flag - CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWUSER and CLONE_NEWCGROUP. The execution context of the cloned process is, in part, defined by the flags passed in.
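Because each CLONE_NEW* flag is a single bit, several namespaces can be requested at once by OR-ing the flags together. The sketch below illustrates this; the constants repeat the values from the Linux kernel headers only so that the example compiles anywhere (real code would use syscall.CLONE_NEWNS and friends), and the cloneFlags helper and its name-to-flag table are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Values of the CLONE_NEW* flags as defined by the Linux kernel headers.
// They are repeated here only to keep the sketch platform independent.
const (
	cloneNewNS     uintptr = 0x00020000 // Mount
	cloneNewCgroup uintptr = 0x02000000 // Cgroup
	cloneNewUTS    uintptr = 0x04000000 // UTS
	cloneNewIPC    uintptr = 0x08000000 // IPC
	cloneNewUser   uintptr = 0x10000000 // User
	cloneNewPID    uintptr = 0x20000000 // PID
	cloneNewNet    uintptr = 0x40000000 // Network
)

// flagByName is a hypothetical lookup table from a friendly name to a flag.
var flagByName = map[string]uintptr{
	"mount":   cloneNewNS,
	"cgroup":  cloneNewCgroup,
	"uts":     cloneNewUTS,
	"ipc":     cloneNewIPC,
	"user":    cloneNewUser,
	"pid":     cloneNewPID,
	"network": cloneNewNet,
}

// cloneFlags ORs together the flags for the named namespaces.
func cloneFlags(names ...string) (uintptr, error) {
	var flags uintptr
	for _, n := range names {
		f, ok := flagByName[strings.ToLower(n)]
		if !ok {
			return 0, fmt.Errorf("unknown namespace %q", n)
		}
		flags |= f
	}
	return flags, nil
}

func main() {
	flags, err := cloneFlags("mount", "uts", "pid")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Cloneflags: %#x\n", flags)
}
```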

Back up in Go land, SysProcAttr allows us to set attributes on the *exec.Cmd. By specifying the Cloneflags attribute, we're telling Go to pass the corresponding CLONE_* flags through to the clone() system call. And thus we can control which namespaces we'd like our process to be executed in.

Compile and run the program and you will be dropped into a /bin/sh process that's running in a new UTS namespace. Note that the program must be run as the root user.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

$ go build
$ sudo ./ns-process
-[ns-process]- #

Great! We’ve been dropped into a new shell that’s supposedly running in a new UTS namespace. Let’s confirm that this is the case.

-[ns-process]- # readlink /proc/self/ns/uts
uts:[4026532410]
-[ns-process]- # exit
$ readlink /proc/self/ns/uts
uts:[4026531838]

The contents of /proc/self/ns/uts include the namespace type (uts) and the inode number of the namespace. The fact that the inode number is different inside the ns-process shell compared to outside it implies that these two processes are indeed running in different UTS namespaces.
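The same comparison can be done in code. The following sketch parses a link target of the form "type:[inode]" and checks whether two targets name the same namespace; parseNsLink and sameNamespace are illustrative names, not part of ns-process.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "uts:[4026531838]"
// into its type ("uts") and inode number (4026531838).
func parseNsLink(target string) (string, uint64, error) {
	i := strings.Index(target, ":[")
	if i < 0 || !strings.HasSuffix(target, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(target[i+2:len(target)-1], 10, 64)
	if err != nil {
		return "", 0, err
	}
	return target[:i], inode, nil
}

// sameNamespace reports whether two link targets refer to the same namespace.
func sameNamespace(a, b string) (bool, error) {
	typeA, inodeA, err := parseNsLink(a)
	if err != nil {
		return false, err
	}
	typeB, inodeB, err := parseNsLink(b)
	if err != nil {
		return false, err
	}
	return typeA == typeB && inodeA == inodeB, nil
}

func main() {
	// The two values printed by readlink in the session above.
	same, err := sameNamespace("uts:[4026532410]", "uts:[4026531838]")
	fmt.Println("same namespace:", same, "error:", err)

	// The UTS namespace of the current process, if /proc is available.
	if target, err := os.Readlink("/proc/self/ns/uts"); err == nil {
		nsType, inode, _ := parseNsLink(target)
		fmt.Println("current process:", nsType, inode)
	}
}
```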

Not bad at all! But, we can do better. At the moment we’re only requesting a single new namespace for the process. Let’s throw in a few more to spice things up a little. This can be achieved by adding additional flags to Cloneflags, as follows.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 1.1
# Filename: ns_process.go

...
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWNS |
			syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWIPC |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNET |
			syscall.CLONE_NEWUSER,
	}
...

Compile and run the program again, and this time you’ll be dropped into a /bin/sh process that's running in a new Mount, UTS, IPC, PID, Network and User namespace.

💡 When requesting a new User namespace alongside other namespaces, the User namespace will be created first. User namespaces can be created without root permissions, which means we can now drop the sudo and run our program as a non-root user! I’ll go into more detail about the user namespace in a later article.

This is all well and good, and at a basic level does allow us to run processes in new namespaces from Go. However, IRL it’s not really all that useful … We’re missing a lot of setup required to fully initialise and configure the namespaces. For example:

We’ve requested a new Mount namespace (CLONE_NEWNS) but are currently piggybacking off the host's mounts and rootfs
We’ve requested a new PID namespace (CLONE_NEWPID) but haven't mounted a new /proc filesystem
We’ve requested a new Network namespace (CLONE_NEWNET) but haven't set up any interfaces inside the namespace
We’ve requested a new User namespace (CLONE_NEWUSER) but have failed to provide a UID/GID mapping

And so it appears that we’ve still got plenty of work cut out for us.

📺 On the next…

We’ve seen how to run a process in a new set of namespaces using Go, but how do we configure and initialise the namespaces so they are ready for use? The answer to this and plenty more coming up, stay tuned…

Update: Part 3, “Namespaces in Go - User” has been published and is available here.

Namespaces in Go - User

Ed King

Dec 13, 2016·4 min read

In the previous article we saw how to create and run a process in various Linux namespaces using Go. We left with some code that runs a /bin/sh process in a new Mount, UTS, IPC, PID, Network and User namespace.

You may recall that once we added the User namespace to ns-process we no longer had to run it as the root user. This is a great feature to have as it means ns-process can be run much more securely. However, in adding the User namespace to the program, we have inadvertently introduced some less desirable behaviour.

This behaviour can be demonstrated by comparing the output of whoami from within the namespaced shell both before and after we added the User namespace, as follows.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 1.0
# Prior to adding the User namespace

$ go build
$ sudo ./ns-process
-[ns-process]- # whoami
root
-[ns-process]- # id root
uid=0(root) gid=0(root) groups=0(root)

# Git tag: 1.1
# After adding the User namespace

$ go build
$ ./ns-process
-[ns-process]- # whoami
nobody
-[ns-process]- # id nobody
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)

Although we are now able to run ns-process as a non-root user, once inside the namespaced shell we have lost our root identity.

In this article we will work through a fix for this regression, and learn a little bit more about the User namespace along the way.

🗺 UID and GID mapping

The reason behind our loss of identity is that we’re missing some important configuration. It is not enough to simply add the CLONE_NEWUSER flag and expect the User namespace to be ready for use. In order to set up the namespace properly, we also need to provide what is known as a UID and a GID mapping.

💁 If you’re not interested in the theory and are eager to crack on with the Go coding, feel free to skip the rest of this section

ID mapping and how it relates to User namespaces is a huge topic in itself, and it falls mostly out of scope for this article. Having said that, there are a few things you need to know in order to understand how we’re going to fix our identity crisis. Here are the TL;DR essentials.

The User namespace provides isolation of UIDs and GIDs
There can be multiple, distinct User namespaces in use on the same host at any given time
Every Linux process runs in one of these User namespaces
User namespaces allow for the UID of a process in User namespace 1 to be different to the UID for the same process in User namespace 2
UID/GID mapping provides a mechanism for mapping IDs between two separate User namespaces

The following diagram attempts to visualise the above.

Pictured are two User namespaces, 1 and 2, with their corresponding UID and GID tables. Note that process C, running as non-root-user is able to spawn Process D, which is running as root.

The key implementation detail, and the thing that prevents the universe from imploding is the mapping between the two User namespaces (represented here by the dashed lines).

Process D only has root privileges within the context of User namespace 2. From the perspective of processes in User namespace 1, process D is running as non-root-user, and as such, doesn’t have those all-important root privileges.

This mapping is exactly what’s missing from ns-process at the moment, and it’s about time we sorted that out.

👉 Let’s Go

ID mappings can be applied by setting the UidMappings and GidMappings fields on cmd.SysProcAttr. Both fields are of type SysProcIDMap found in Go’s syscall package.

type SysProcIDMap struct {
        ContainerID int // Container ID.
        HostID      int // Host ID.
        Size        int // Size.
}

The ContainerID and HostID fields should be fairly self-explanatory. Size is slightly less so. Size basically determines the range of IDs to map, which allows us to map more than one ID at a time. Let’s update our program to include some mappings.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 2.0
# Filename: ns_process.go

# ...
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWNS |
			syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWIPC |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNET |
			syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{
			{
				ContainerID: 0,
				HostID:      os.Getuid(),
				Size:        1,
			},
		},
		GidMappings: []syscall.SysProcIDMap{
			{
				ContainerID: 0,
				HostID:      os.Getgid(),
				Size:        1,
			},
		},
	}
# ...

Here we are adding a single UID and GID mapping. We set ContainerID to 0, HostID to the current user’s UID/GID and Size equal to 1. In other words, we are mapping ID = 0 (aka root) in our new User namespace to the ID of the user who invokes the ns-process command.
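To make the role of Size concrete, here is a small sketch of how a mapping translates an ID seen inside the namespace to an ID on the host. The idMap type mirrors the three fields of syscall.SysProcIDMap so that the example compiles anywhere; toHostID, mapLine and the host UID 1000 are assumptions made for illustration. The rendered line uses the "inside outside length" layout of /proc/<pid>/uid_map.

```go
package main

import "fmt"

// idMap mirrors the fields of syscall.SysProcIDMap for illustration.
type idMap struct {
	ContainerID int // first ID inside the namespace
	HostID      int // first ID outside the namespace
	Size        int // number of consecutive IDs covered
}

// toHostID translates an ID seen inside the namespace into the host ID it
// maps to. ok is false when no mapping covers the ID.
func toHostID(maps []idMap, containerID int) (hostID int, ok bool) {
	for _, m := range maps {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

// mapLine renders a mapping in the "<inside> <outside> <length>" layout
// used by /proc/<pid>/uid_map and /proc/<pid>/gid_map.
func mapLine(m idMap) string {
	return fmt.Sprintf("%d %d %d", m.ContainerID, m.HostID, m.Size)
}

func main() {
	// Hypothetical invoking user with UID 1000 on the host.
	maps := []idMap{{ContainerID: 0, HostID: 1000, Size: 1}}

	fmt.Println("uid_map line:", mapLine(maps[0]))

	for _, id := range []int{0, 1} {
		if host, ok := toHostID(maps, id); ok {
			fmt.Printf("container uid %d -> host uid %d\n", id, host)
		} else {
			fmt.Printf("container uid %d is unmapped\n", id)
		}
	}
}
```

With Size set to 1 only root inside the namespace is mapped; any other ID has no counterpart on the host.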

With all this in place, we should be able to build and run ns-process and see that we now become the root user inside the namespaced shell.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

$ go build
$ ./ns-process
-[ns-process]- # whoami
root
-[ns-process]- # id
uid=0(root) gid=0(root) groups=0(root)

And there we have it! With the addition of a simple UidMapping/GidMapping we have been able to restore our root identity inside the namespaced shell, while retaining the ability to run ns-process as a non-root user.

📺 On the next…

In the next article we’ll take a look at reexec. What is reexec and why is it relevant to Namespaces in Go? The answer to this and plenty more coming up, stay tuned…

Update: Part 4, “Namespaces in Go - Reexec” has been published and is available here.

Namespaces in Go - reexec

Ed King

Dec 14, 2016·4 min read

In the previous article we learnt how to apply a UID/GID mapping to ns-process such that we are now running as the root user once inside the namespaced shell.

The purpose of this article is to provide an understanding of the reexec package. reexec is part of the Docker codebase and provides a convenient way for an executable to “re-exec” itself. In all honesty reexec is a bit of a hack, but it’s a really useful one that is required to circumvent a limitation in how Go handles process forking. Before going into too much more detail, let’s take a look at the problem reexec helps to solve.

It’s probably best to demonstrate the problem by way of an example. Consider the following - we want to update ns-process such that a randomly-generated hostname is set inside the new UTS namespace we’ve cloned. For security reasons, it’s essential that the hostname has been set before the namespaced /bin/sh process starts running. After all, we don’t want programs running inside ns-process to be able to discover the Host’s hostname.

As far as I’m aware, Go doesn’t provide a built-in way to allow us to do this. Namespaces are created by setting attributes on an *exec.Cmd, which is also where we specify the process we'd like to run. For example:

cmd := exec.Command("/bin/echo", "Process already running")
cmd.SysProcAttr = &syscall.SysProcAttr{
 Cloneflags: syscall.CLONE_NEWUTS,
}
cmd.Run()

Once cmd.Run() is called, the namespaces get cloned and then the process gets started straight away. There’s no hook or anything here that allows us to run code after the namespace creation but before the process starts. This is where reexec comes in.

🎤 reexec yourself before you wreck yourself

Let’s open up the reexec package and take a look at what’s inside (I won’t paste full code snippets here for sake of simplicity, but I advise you read along with the full implementations of the methods).

// Register adds an initialization func under the specified name
func Register(name string, initializer func()) {
 // ...
}

First up we have Register, which exposes a way for us to register arbitrary functions by some name and to store them in memory. We will use this to register some sort of “Initialise Namespace” function when ns-process first starts up.

// Init is called as the first part of the exec process
// and returns true if an initialization function was called.
func Init() bool {
 // ...
}

Next up we have Init, which gives us a mechanism for determining whether or not the process is running after having been reexeced, and for running one of the registered functions if it is. It does this by checking os.Args[0] for the name of one of the previously-registered functions.

// Command returns *exec.Cmd which have Path as current binary.
// ...
func Command(args ...string) *exec.Cmd {
 return &exec.Cmd{
  Path: Self(),
  Args: args,
  SysProcAttr: &syscall.SysProcAttr{
   Pdeathsig: syscall.SIGTERM,
  },
 }
}

Command ties it all together by creating an *exec.Cmd with Path set to Self(), which evaluates to /proc/self/exe on Linux machines. We can choose which of the registered functions we’d like to invoke upon reexec by providing the registered name of the function in args[0].

💁 /proc/self/exe is a symlink file that points to the path of the currently-running executable
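The pattern is small enough to sketch without the Docker package. The following is a simplified stand-in, not the reexec implementation itself: register, runIfRegistered and selfCommand are hypothetical names that play the roles of Register, Init and Command, and the sketch assumes Linux so that /proc/self/exe exists.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// registry holds initialisation functions keyed by name.
var registry = map[string]func(){}

// register stores fn under name, refusing duplicate names.
func register(name string, fn func()) {
	if _, exists := registry[name]; exists {
		panic(fmt.Sprintf("function already registered under name %q", name))
	}
	registry[name] = fn
}

// runIfRegistered runs the function registered under arg0, if any,
// and reports whether it did.
func runIfRegistered(arg0 string) bool {
	fn, ok := registry[arg0]
	if !ok {
		return false
	}
	fn()
	return true
}

// selfCommand builds a command that runs the current binary again with
// os.Args[0] set to name, followed by any extra arguments.
func selfCommand(name string, args ...string) *exec.Cmd {
	return &exec.Cmd{
		Path: "/proc/self/exe",
		Args: append([]string{name}, args...),
	}
}

func main() {
	register("childInit", func() {
		fmt.Println("running inside the re-executed child")
	})

	// In the child, os.Args[0] is "childInit", so the registered function
	// runs and the program stops here instead of re-executing again.
	if runIfRegistered(os.Args[0]) {
		return
	}

	cmd := selfCommand("childInit")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Printf("re-exec failed: %v\n", err)
	}
}
```

The early return after runIfRegistered is what stops the child from launching yet another copy of itself.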

Now that we have an understanding of how reexec works, it’s time to wire it up inside ns-process.

👉 Let’s Go

The first thing we need to do is to create a function and register it using reexec.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0
# Filename: ns_process.go

# ...
func init() {
	reexec.Register("nsInitialisation", nsInitialisation)
	if reexec.Init() {
		os.Exit(0)
	}
}
# ...

There are two important things happening here. First, we register a function nsInitialisation under the name “nsInitialisation”. We'll add that function in a moment. Secondly, we call reexec.Init() and os.Exit(0) the program if it returns true. This is vitally important to prevent an infinite loop situation whereby the program gets stuck reexecing itself forever! Let’s add nsInitialisation next.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0
# Filename: ns_process.go

# ...
func nsInitialisation() {
	fmt.Printf("\n>> namespace setup code goes here <<\n\n")
	nsRun()
}

func nsRun() {
	cmd := exec.Command("/bin/sh")

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Env = []string{"PS1=-[ns-process]- # "}

	if err := cmd.Run(); err != nil {
		fmt.Printf("Error running the /bin/sh command - %s\n", err)
		os.Exit(1)
	}
}

Here we’ve added nsInitialisation() simply as a placeholder function. It will become much more important in future articles when we actually need to start configuring the namespaces. For now, it simply passes through to nsRun(), which runs the /bin/sh process.

All that’s left to do now is modify main() such that it runs the /bin/sh process via reexec and nsInitialisation rather than calling it directly.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0
# Filename: ns_process.go

func main() {
	cmd := reexec.Command("nsInitialisation")
	// ...
}

By specifying nsInitialisation as the first arg to Command, we're essentially telling reexec to run /proc/self/exe with os.Args[0] set to nsInitialisation. Finally, once the program has been reexeced, Init will detect the registered function and then actually run it. Let’s give it a whirl.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

$ go build
$ ./ns-process

>> namespace setup code goes here <<

-[ns-process]- #

And there we have it. We now have nsInitialisation available in which to run any namespace setup we need, including the ability, as discussed earlier, to set the hostname in the new UTS namespace if we so desire.
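As an illustration of the hostname use case, the sketch below shows one way such setup code might look. It is not taken from the ns-process repository: randomHostname, setRandomHostname and the "ns-" prefix are assumptions. syscall.Sethostname is Linux-specific and is only safe to call once the process is already inside its own UTS namespace, so main here only generates a name.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"syscall"
)

// randomHostname returns prefix followed by 8 random hex characters.
func randomHostname(prefix string) (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(buf), nil
}

// setRandomHostname is intended to be called from a function like
// nsInitialisation, i.e. after the new UTS namespace has been created
// but before /bin/sh is started.
func setRandomHostname() error {
	name, err := randomHostname("ns-")
	if err != nil {
		return err
	}
	return syscall.Sethostname([]byte(name))
}

func main() {
	// Only generate a name here: calling setRandomHostname outside a new
	// UTS namespace would attempt to rename the host itself.
	name, err := randomHostname("ns-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("candidate hostname:", name)
}
```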

📺 On the next…

We’re now in a position to configure our namespaces, but what configuration remains to be done? The answer to this and plenty more coming up, stay tuned…

Update: Part 5, “Namespaces in Go - Mount” has been published and is available here.

Namespaces in Go - Mount

Ed King

Dec 17, 2016·7 min read

One of the fundamental features of container implementations today is the ability to run containers of differing linux distros on the same host machine. It’s not uncommon, for example, to install Docker on an Ubuntu host and to then start a bunch of containers on that host using BusyBox, CentOS, or any other distro you like the look of.

In this article we will take a look at what makes this possible - namely a combination of the Mount namespace and the pivot_root system call. Let's start by reviewing the Mount namespace implementation in ns-process as it currently stands. If you’ve not been following along with this series so far, be sure to check out the previous article(s) first.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0

$ go build
$ ./ns-process

>> namespace setup code goes here <<

-[ns-process]- # cat /proc/mounts
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
# ...

There are a number of mounts already listed in the /proc/mounts file. This may be a little surprising given that we’re requesting a new Mount namespace (via the CLONE_NEWNS flag) and have yet to do any explicit Mount namespace setup.
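The mount list can also be inspected from Go. Here is a minimal sketch that reads /proc/mounts and splits each line into its whitespace-separated fields; the mountEntry type and parseMountLine helper are illustrative names, and only the first four fields are kept.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mountEntry holds the first four fields of a /proc/mounts line.
type mountEntry struct {
	Source  string
	Target  string
	FSType  string
	Options []string
}

// parseMountLine parses a line such as
// "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0".
func parseMountLine(line string) (mountEntry, error) {
	fields := strings.Fields(line)
	if len(fields) < 4 {
		return mountEntry{}, fmt.Errorf("malformed mount line %q", line)
	}
	return mountEntry{
		Source:  fields[0],
		Target:  fields[1],
		FSType:  fields[2],
		Options: strings.Split(fields[3], ","),
	}, nil
}

func main() {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		fmt.Println("cannot read mounts:", err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		m, err := parseMountLine(scanner.Text())
		if err != nil {
			continue // skip lines that do not look like mount entries
		}
		fmt.Printf("%s on %s type %s\n", m.Source, m.Target, m.FSType)
	}
}
```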

This doesn’t feel very container-like. Our namespaced process should know as little as possible about the host it’s running on, and certainly shouldn’t be able to see a list of all the host’s mounts. So why’s this happening? Fortunately, an explanation can be found within the mount_namespaces(7) man page.

“When a process creates a new mount namespace using clone(2) or unshare(2) with the CLONE_NEWNS flag, the mount point list for the new namespace is a copy of the caller’s mount point list.”

It seems that this is actually intended behaviour, and it explains why /proc/mounts is already populated as soon as our namespaced process starts. With this in mind the question now becomes, “What do we do about it?”. We need some way of clearing the host’s mounts from the new Mount namespace in order to keep them secure and away from prying eyes - we need to pivot_root.

🔄 pivot_root

pivot_root allows you to set a new root filesystem for the calling process. I.e. it allows you to change what / is. It does this by mounting the current root filesystem somewhere else while simultaneously mounting some new root filesystem on /. Once the previous root has been moved, it is then possible to umount it. Thus we have a mechanism for 'clearing' the host's mounts from inside a new Mount namespace - we simply pivot away and then umount them!

This is what allows the aforementioned Ubuntu host machine to run a CentOS container. As long as the Ubuntu host has a copy of a CentOS filesystem on disk, we can create a new Mount namespace, call pivot_root pointing to the CentOS filesystem and then run whatever processes we want to inside the 'pivoted' namespace. The processes will believe they’re running on CentOS the entire time.

Incidentally this is where the reexec from the previous article comes in handy. pivot_root must be called from within the new Mount namespace, otherwise we'll end up changing the host's / which is not the intention! And we want all this to happen before the namespaced shell starts so that the requested root filesystem is ready for when it does.

👉 Let’s Go

In Go, pivot_root is implemented via the PivotRoot func found in the syscall package.

func PivotRoot(newroot string, putold string) (err error)

newroot is the path to the desired new root filesystem and putold is a path to a directory in which to move the current root. There are a few restrictions imposed on newroot and putold by the underlying pivot_root sys call that we need to be aware of:

They must both be directories
They must not be on the same filesystem as the current root
putold must be underneath newroot
No other filesystem may be mounted on putold

Most of these are fine but the second point there will require a small workaround, as we’ll see in a moment. We’re also going to need a suitable newroot in which to pivot to.
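Two of these restrictions can be checked cheaply from user space before making the call. The sketch below does so for the "directories" and "putold underneath newroot" rules and leaves the mount-related rules to the kernel; isUnder, isDir and checkPivotArgs are hypothetical helpers, not part of ns-process.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isUnder reports whether child is located strictly underneath parent.
func isUnder(parent, child string) bool {
	rel, err := filepath.Rel(parent, child)
	if err != nil {
		return false
	}
	return rel != "." && rel != ".." &&
		!strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// isDir reports whether path exists and is a directory.
func isDir(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

// checkPivotArgs verifies that both paths are directories and that putold
// sits underneath newroot. It does not inspect mounts.
func checkPivotArgs(newroot, putold string) error {
	if !isDir(newroot) {
		return fmt.Errorf("newroot %q is not a directory", newroot)
	}
	if !isDir(putold) {
		return fmt.Errorf("putold %q is not a directory", putold)
	}
	if !isUnder(newroot, putold) {
		return fmt.Errorf("putold %q is not underneath newroot %q", putold, newroot)
	}
	return nil
}

func main() {
	newroot := "/tmp/ns-process/rootfs"
	putold := filepath.Join(newroot, ".pivot_root")

	if err := checkPivotArgs(newroot, putold); err != nil {
		fmt.Println("pivot_root would likely fail:", err)
		return
	}
	fmt.Println("arguments look sane")
}
```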

The process of preparing a newroot filesystem can be quite a detailed and complex one. Take for example Docker’s layered filesystem approach in which many filesystem “layers” are joined together to present a single coherent root. We’re going to do something much simpler, which is to assume that a suitable root filesystem has already been prepared for use.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0

$ mkdir -p /tmp/ns-process/rootfs
$ tar -C /tmp/ns-process/rootfs -xf assets/busybox.tar

From now on, ns-process will expect a root filesystem to exist at this path and will raise an error if one can’t be found. Note that although we’re using BusyBox for this particular example, you could just as easily use any other distro.

Now that we have our newroot, let’s write some code to make use of it.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0
# Filename: rootfs.go

func pivotRoot(newroot string) error {
	putold := filepath.Join(newroot, "/.pivot_root")

	// bind mount newroot to itself - this is a slight hack
	// needed to work around a pivot_root requirement
	if err := syscall.Mount(
		newroot,
		newroot,
		"",
		syscall.MS_BIND|syscall.MS_REC,
		"",
	); err != nil {
		return err
	}

	// create putold directory
	if err := os.MkdirAll(putold, 0700); err != nil {
		return err
	}

	// call pivot_root
	if err := syscall.PivotRoot(newroot, putold); err != nil {
		return err
	}

	// ensure current working directory is set to new root
	if err := os.Chdir("/"); err != nil {
		return err
	}

	// umount putold, which now lives at /.pivot_root
	putold = "/.pivot_root"
	if err := syscall.Unmount(
		putold,
		syscall.MNT_DETACH,
	); err != nil {
		return err
	}

	// remove putold
	if err := os.RemoveAll(putold); err != nil {
		return err
	}

	return nil
}

With the pivotRoot func in place, it’s time to put nsInitialisation to good use.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0
# Filename: ns_process.go

func nsInitialisation() {
	newrootPath := os.Args[1]

	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}

	nsRun()
}

func main() {
	var rootfsPath string
	// ...
	cmd := reexec.Command("nsInitialisation", rootfsPath)
}

Notice that we’re now passing an argument, rootfsPath, to nsInitialisation. Once reexeced, this argument can be picked up by reading from os.Args[1]. Also notice how the call to pivotRoot comes before nsRun. By doing this, we're ensuring that the new root filesystem will already have been pivoted to before the /bin/sh process starts.

With all that in place, let's run the updated Go program and check to see which mounts, if any, are available to us now.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0

$ go build
$ ./ns-process
-[ns-process]- # cat /proc/mounts
cat: can't open '/proc/mounts': No such file or directory

Ah … now that we’ve pivoted to a new /, we no longer have a /proc! This is actually a good thing as it means we definitely can’t see the host’s mounts anymore, which is one of the main reasons for doing all this work in the first place. But, there’s probably only so far we can get without a working /proc, so let’s add one to our new root.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
# Filename: rootfs.go

func mountProc(newroot string) error {
	source := "proc"
	target := filepath.Join(newroot, "/proc")
	fstype := "proc"
	flags := 0
	data := ""

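	// create the mount point, failing early if that isn't possible
	if err := os.MkdirAll(target, 0755); err != nil {
		return err
	}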
	if err := syscall.Mount(
		source,
		target,
		fstype,
		uintptr(flags),
		data,
	); err != nil {
		return err
	}

	return nil
}

And just as with pivotRoot, mountProc should be called from nsInitialisation.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
# Filename: ns_process.go

func nsInitialisation() {
	newrootPath := os.Args[1]

	if err := mountProc(newrootPath); err != nil {
		fmt.Printf("Error mounting /proc - %s\n", err)
		os.Exit(1)
	}

	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}

	nsRun()
}

Ok, that should now be everything. Let’s try it out.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1

$ go build
$ ./ns-process
-[ns-process]- # cat /proc/mounts
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
proc /proc proc rw,nodev,relatime 0 0

That’s looking much better - the host’s mounts are no longer visible to us and we have a new /proc mounted and ready for action. But wait … there is one more thing …
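The same check can be made from Go rather than by eye. Below is a small sketch, not part of ns-process, that parses text in the /proc/mounts format and looks for a given mount. The type mountEntry and the helpers parseMounts and hasMount are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// mountEntry holds the first four fields of a /proc/mounts line.
type mountEntry struct {
	Source, Target, FSType, Options string
}

// parseMounts parses text in the /proc/mounts format, skipping lines
// that have fewer than four whitespace-separated fields.
func parseMounts(text string) []mountEntry {
	var entries []mountEntry
	for _, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue
		}
		entries = append(entries, mountEntry{
			Source:  fields[0],
			Target:  fields[1],
			FSType:  fields[2],
			Options: fields[3],
		})
	}
	return entries
}

// hasMount reports whether any entry is mounted at target with fstype.
func hasMount(entries []mountEntry, target, fstype string) bool {
	for _, e := range entries {
		if e.Target == target && e.FSType == fstype {
			return true
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		// Expected when no /proc is mounted, as in the 4.0 run above.
		fmt.Println("cannot read /proc/mounts:", err)
		return
	}
	entries := parseMounts(string(data))
	fmt.Println("mount count:", len(entries))
	fmt.Println("proc mounted at /proc:", hasMount(entries, "/proc", "proc"))
}
```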

🤔 PID namespace

The changes implemented above have had an unintentional side effect on the PID namespace setup. Prior to mounting the new /proc, running ps inside the namespaced shell would’ve resulted in all the host’s processes being listed. This is because ps relies on /proc to detect running processes and we were still referencing the host’s /proc.

This is obviously a pretty terrible thing to happen from a container perspective! But fortunately, now that we have our own /proc (and are requesting a new PID namespace via the CLONE_NEWPID flag), running ps shows only processes that are relevant to us.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1

$ go build
$ ./ns-process
-[ns-process]- # ps
PID   USER     TIME   COMMAND
    1 root       0:00 {exe} nsInitialisation /tmp/ns-process/rootfs
    5 root       0:00 /bin/sh
    8 root       0:00 ps
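To make the dependency on /proc concrete, here is a rough sketch of how a tool might discover processes by listing the numeric directory names under /proc. It is an illustration only and makes no claim about how ps itself is implemented; pidsFromNames is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
)

// pidsFromNames returns the entries in names that parse as positive
// integers, sorted in ascending order. Everything else (self, net,
// meminfo, ...) is ignored.
func pidsFromNames(names []string) []int {
	var pids []int
	for _, name := range names {
		pid, err := strconv.Atoi(name)
		if err != nil || pid <= 0 {
			continue
		}
		pids = append(pids, pid)
	}
	sort.Ints(pids)
	return pids
}

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Println("cannot read /proc:", err)
		return
	}

	names := make([]string, 0, len(entries))
	for _, e := range entries {
		if e.IsDir() {
			names = append(names, e.Name())
		}
	}

	// Which PIDs appear depends entirely on which /proc is mounted.
	fmt.Println("visible PIDs:", pidsFromNames(names))
}
```

Run against the host's /proc this lists every host process; run against a /proc mounted inside the new PID namespace it should list only the namespaced ones.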

📺 On the next…

We’re nearing the season finale of “Namespaces in Go”, but we’re still missing one key piece of configuration - networking. What needs to be done to allow our namespaced shell to talk to the Internets? The answer to this and plenty more coming up, stay tuned…

Update: Part 6, “Namespaces in Go - Network” has been published and is available at https://medium.com/@teddyking/namespaces-in-go-network-fdcf63e76100.

Namespaces in Go - Network

Ed King

Jan 9, 2017·8 min read

In the previous article we saw how to make use of PivotRoot and the Mount namespace to swap in a new root filesystem for ns-process. With that change in place, ns-process is starting to look and feel an awful lot like any other container. Sure, it only runs a single /bin/sh process at the moment, but it does have a number of extremely cool features:

Can be run as a non-root user thanks to the User namespace
Can choose a root filesystem to run in thanks to the Mount namespace
Cannot see any of the host’s processes thanks to the PID namespace

That’s pretty impressive! But there’s still a piece of vital functionality missing - networking. At the moment, ns-process doesn’t have any network connectivity!

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1

$ go build
$ ./ns-process
-[ns-process]- # ifconfig
-[ns-process]- # route
Kernel IP routing table
Destination     Gateway         Genmask         ...   Use Iface
-[ns-process]- # ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network is unreachable

That’s slightly less impressive… This lack of connectivity arises because ns-process clones a new Network namespace, the very purpose of which is to isolate all network-related resources (IPs, ports, interfaces, etc.).

In this article we will set about configuring the new Network namespace such that it ends up with an interface and a routable IP address.

🌐 A quick lesson in networking

If we are to have any hope of adding network connectivity to ns-process, a solid understanding of the Network namespace is going to be essential. To that end, I highly recommend you read through Introducing Linux Network Namespaces. The knowledge and ideas presented in that article will form the basis for the Network namespace configuration in ns-process. To briefly summarise, here’s what we’ll need to do:

Create a bridge device in the host’s Network namespace
Create a veth pair
Attach one side of the pair to the bridge
Place the other side of the pair in ns-process's Network namespace
Ensure all traffic originating in the namespaced process gets routed via the veth
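The steps above can be sketched as the iproute2 commands one might run by hand. The Go program below only builds and prints those commands; it does not execute them, and it is not how netsetgo works internally. The names brg0 and veth1 and the 10.10.10.0/24 range are taken from output shown elsewhere in this article, while veth0, the choice of 10.10.10.1 as the bridge address, and the helper networkPlan are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	bridgeName    = "brg0"
	hostVeth      = "veth0" // hypothetical name for the host side of the pair
	containerVeth = "veth1"
	bridgeCIDR    = "10.10.10.1/24" // assumed bridge address
	containerCIDR = "10.10.10.2/24"
	gatewayIP     = "10.10.10.1"
)

// networkPlan returns, in order, the commands that would connect the
// Network namespace of the process with the given PID to a host bridge.
func networkPlan(pid int) [][]string {
	p := strconv.Itoa(pid)

	// inNS prefixes a command so it runs inside the target process's
	// Network namespace.
	inNS := func(args ...string) []string {
		return append([]string{"nsenter", "-t", p, "-n"}, args...)
	}

	return [][]string{
		// 1. Create a bridge device in the host's Network namespace.
		{"ip", "link", "add", "name", bridgeName, "type", "bridge"},
		{"ip", "addr", "add", bridgeCIDR, "dev", bridgeName},
		{"ip", "link", "set", bridgeName, "up"},
		// 2. Create a veth pair.
		{"ip", "link", "add", hostVeth, "type", "veth", "peer", "name", containerVeth},
		// 3. Attach one side of the pair to the bridge.
		{"ip", "link", "set", hostVeth, "master", bridgeName},
		{"ip", "link", "set", hostVeth, "up"},
		// 4. Place the other side in the new Network namespace.
		{"ip", "link", "set", containerVeth, "netns", p},
		// 5. Address the interface and route all traffic via the veth.
		inNS("ip", "addr", "add", containerCIDR, "dev", containerVeth),
		inNS("ip", "link", "set", containerVeth, "up"),
		inNS("ip", "route", "add", "default", "via", gatewayIP),
	}
}

func main() {
	// 12345 stands in for the PID of the namespaced process.
	for _, cmd := range networkPlan(12345) {
		fmt.Println(strings.Join(cmd, " "))
	}
}
```

Every one of these commands needs root privileges on the host, which is the complication discussed next.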

The general idea is to establish a connection between ns-process's Network namespace and the host’s Network namespace. Visually this looks a little something like this:

This is actually a fair amount of work! And it’s made complicated by the fact that setup and configuration need to occur in two different Network namespaces. There’s also a further complexity in that the network setup requires root privileges, which means we could end up regressing on one of ns-process's most lovely features - that it can be run as a non-root user.

Fortunately this can be avoided by making use of setuid. setuid allows a process to run as the user that owns an executable. The idea then is to extract the network setup code into a separate executable, ensure the executable is owned by the root user and to apply the setuid permission on it. We can then call out to the executable from within ns-process (running as a non-root user) as and when we need to. With all this in mind, allow me to introduce netsetgo.

🚦 On your marks, net set, GO!

netsetgo is a small binary that helps to set up Network namespaces for containers. It achieves this by applying the configuration outlined above. For the sake of brevity I’m not going to paste the full netsetgo code here, but I will briefly point out the most useful parts so you can take a more detailed look for yourself.

Bridge creation occurs via a call to netlink.LinkAdd
Veth creation occurs via another call to netlink.LinkAdd
The veth is attached to the bridge via a call to netlink.LinkSetMaster
The veth is moved to the new Network namespace via a call to netlink.LinkSetNsPid
A default route is added to the new Network namespace via a call to netlink.RouteAdd

In order to make use of netsetgo from ns-process, we’ll need to download the binary and set the correct permissions on it, as follows.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1

$ wget "https://github.com/teddyking/netsetgo/releases/download/0.0.1/netsetgo"
$ sudo mv netsetgo /usr/local/bin/
$ sudo chown root:root /usr/local/bin/netsetgo
$ sudo chmod 4755 /usr/local/bin/netsetgo

The 4 in the chmod 4755 signifies that the setuid bit should be set.
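If you want to verify the permissions from Go, the sketch below shows one way to do it. Note that Go's os.FileMode keeps the setuid bit in a separate flag (os.ModeSetuid) rather than at octal 04000, so the traditional mode has to be translated. The helpers fileModeFromUnix and hasSetuid are hypothetical names, and this check is not something ns-process itself performs.

```go
package main

import (
	"fmt"
	"os"
)

// fileModeFromUnix converts traditional Unix mode bits (such as 04755)
// into an os.FileMode, which stores setuid, setgid and sticky as
// separate flag bits.
func fileModeFromUnix(bits uint32) os.FileMode {
	mode := os.FileMode(bits & 0777)
	if bits&04000 != 0 {
		mode |= os.ModeSetuid
	}
	if bits&02000 != 0 {
		mode |= os.ModeSetgid
	}
	if bits&01000 != 0 {
		mode |= os.ModeSticky
	}
	return mode
}

// hasSetuid reports whether the setuid bit is set in mode.
func hasSetuid(mode os.FileMode) bool {
	return mode&os.ModeSetuid != 0
}

func main() {
	// How Go renders the mode that chmod 4755 produces.
	fmt.Println("chmod 4755 as os.FileMode:", fileModeFromUnix(04755))

	path := "/usr/local/bin/netsetgo"
	info, err := os.Stat(path)
	if err != nil {
		fmt.Println("cannot stat netsetgo:", err)
		return
	}
	fmt.Printf("%s mode=%s setuid=%t\n", path, info.Mode(), hasSetuid(info.Mode()))
}
```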

👉 Let’s Go

Now that netsetgo is primed and ready it’s time to turn our attention back to ns-process. We need to modify ns-process so that it calls out to netsetgo to configure the network. At first glance this would appear to be relatively simple - we can just create a *exec.Cmd pointing to netsetgo and run it at the appropriate moment?

Of course, nothing’s ever quite as easy as it seems, and here the question of when to run netsetgo requires a bit more thought. Let’s start by looking at how we kick off Namespace creation at the moment (output trimmed for simplicity).

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
# Filename: ns_process.go

func main() {
	cmd := reexec.Command("nsInitialisation", rootfsPath)

	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWNS |
			syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWIPC |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNET |
			syscall.CLONE_NEWUSER,
	}

	if err := cmd.Run(); err != nil {
		fmt.Printf("Error running Command - %s\n", err)
		os.Exit(1)
	}
}

Here we’re using cmd.Run() to run a reexec command with a number of CLONE_NEW* flags set. Note that cmd.Run() does not return until the underlying process has exited. Up until now this has been fine because all subsequent namespace configuration has taken place inside the newly-cloned namespaces (via the nsInitialisation func to be specific).

However, netsetgo needs to configure the host’s Network namespace as well as the new one, which means we can no longer rely on the blocking call to cmd.Run().

Fortunately cmd.Run() can be split into two separate calls - cmd.Start() (which returns immediately) and cmd.Wait() (which blocks until the started command exits). This is exactly what we need as it allows us to run netsetgo after the new namespaces have been created but while still executing in the host’s namespaces. Let’s see this in action.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
# Filename: ns_process.go

if err := cmd.Start(); err != nil {
	fmt.Printf("Error starting the reexec.Command - %s\n", err)
	os.Exit(1)
}

pid := fmt.Sprintf("%d", cmd.Process.Pid)
netsetgoCmd := exec.Command(netsetgoPath, "-pid", pid)
if err := netsetgoCmd.Run(); err != nil {
	fmt.Printf("Error running netsetgo - %s\n", err)
	os.Exit(1)
}

if err := cmd.Wait(); err != nil {
	fmt.Printf("Error waiting for reexec.Command - %s\n", err)
	os.Exit(1)
}

Great! This change allows netsetgo to configure the networking across both Network namespaces as required. All that’s left to do now is to ensure that the namespaced /bin/sh process doesn’t start until the network is ready.

Let’s consider the network to be ready once a veth interface has appeared in the new Network namespace. We can use a simple for loop to wait until this is true, as follows.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
# Filename: net.go

func waitForNetwork() error {
	maxWait := time.Second * 3
	checkInterval := time.Second
	timeStarted := time.Now()

	for {
		interfaces, err := net.Interfaces()
		if err != nil {
			return err
		}

		// pretty basic check ...
		// > 1 as a lo device will already exist
		if len(interfaces) > 1 {
			return nil
		}

		if time.Since(timeStarted) > maxWait {
			return fmt.Errorf("Timeout after %s waiting for network", maxWait)
		}

		time.Sleep(checkInterval)
	}
}

Here we have a very basic for loop which blocks until either more than one network interface is reported or a timeout of 3 seconds is reached. As the comment mentions, we check for more than one interface as the loopback interface will already exist by default.
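Counting interfaces is, as noted, a pretty basic check. A slightly stricter variation would be to look for a specific interface by name. The sketch below assumes the container side of the veth pair is called veth1, as it is in the output later in this article; containsName and interfaceNames are hypothetical helpers and are not part of ns-process.

```go
package main

import (
	"fmt"
	"net"
)

// containsName reports whether want appears in names.
func containsName(names []string, want string) bool {
	for _, name := range names {
		if name == want {
			return true
		}
	}
	return false
}

// interfaceNames returns the names of all network interfaces visible in
// the current Network namespace.
func interfaceNames() ([]string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(ifaces))
	for _, iface := range ifaces {
		names = append(names, iface.Name)
	}
	return names, nil
}

func main() {
	names, err := interfaceNames()
	if err != nil {
		fmt.Println("cannot list interfaces:", err)
		return
	}
	fmt.Println("interfaces:", names)
	// Assumption: the interface we are waiting for is named veth1.
	fmt.Println("veth1 present:", containsName(names, "veth1"))
}
```

Inside the wait loop, the len(interfaces) > 1 condition could then be swapped for a containsName check, at the cost of hardcoding the interface name.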

Finally, let’s update nsInitialisation to call the above function.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
# Filename: ns_process.go

func nsInitialisation() {
	newrootPath := os.Args[1]

	if err := mountProc(newrootPath); err != nil {
		fmt.Printf("Error mounting /proc - %s\n", err)
		os.Exit(1)
	}

	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}

	if err := waitForNetwork(); err != nil {
		fmt.Printf("Error waiting for network - %s\n", err)
		os.Exit(1)
	}

	nsRun()
}

With all that in place, let’s run the updated Go program.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0

$ go build
$ ./ns-process
-[ns-process]- # ifconfig
veth1     Link encap:Ethernet  HWaddr 6A:DD:B4:30:1A:49
          inet addr:10.10.10.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::68dd:b4ff:fe30:1a49/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2359 (2.3 KiB)  TX bytes:578 (578.0 B)

-[ns-process]- # route
Kernel IP routing table
Destination     Gateway         Genmask         ... Iface
default         10.10.10.1      0.0.0.0         ... veth1
10.10.10.0      *               255.255.255.0   ... veth1
-[ns-process]- # ping 10.10.10.1
PING 10.10.10.1 (10.10.10.1): 56 data bytes
64 bytes from 10.10.10.1: seq=0 ttl=64 time=0.098 ms
^C
--- 10.10.10.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.098/0.098/0.098 ms

Much better! We now have a network interface veth1 available and a routable IP address of 10.10.10.2.
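The routing table above explains why the ping to 10.10.10.1 succeeds straight away: that address sits inside the 10.10.10.0/24 network attached to veth1, whereas anything else is handed to the default gateway. The sketch below illustrates that decision using the net package; onLink is a hypothetical helper, and this is a simplification of what the kernel's route lookup actually does.

```go
package main

import (
	"fmt"
	"net"
)

// onLink reports whether ip falls inside the network described by cidr,
// meaning it can be reached without going through a gateway.
func onLink(cidr, ip string) (bool, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, err
	}
	addr := net.ParseIP(ip)
	if addr == nil {
		return false, fmt.Errorf("invalid IP address %q", ip)
	}
	return network.Contains(addr), nil
}

func main() {
	// 10.10.10.2/24 is the address assigned to veth1 in the output above.
	for _, dst := range []string{"10.10.10.1", "8.8.8.8"} {
		ok, err := onLink("10.10.10.2/24", dst)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if ok {
			fmt.Printf("%s: delivered directly over veth1\n", dst)
		} else {
			fmt.Printf("%s: sent via the default gateway\n", dst)
		}
	}
}
```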

☁️ Internet connectivity

Enabling Internet access for ns-process is a little out of scope for this particular article. This is mostly because a lack of Internet connectivity could be the result of any number of things, and attempting to cover all environmental setups would be pretty difficult.

Having said that, the following steps do enable Internet connectivity for ns-process on my generic Ubuntu 16.04 Xenial machine. There’s no guarantee this will work for you, but feel free to try it out if you’re interested.

First up we need to configure a few iptables rules on the host.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0

$ sudo iptables -tnat -N netsetgo
$ sudo iptables -tnat -A PREROUTING -m addrtype --dst-type LOCAL -j netsetgo
$ sudo iptables -tnat -A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j netsetgo
$ sudo iptables -tnat -A POSTROUTING -s 10.10.10.0/24 ! -o brg0 -j MASQUERADE
$ sudo iptables -tnat -A netsetgo -i brg0 -j RETURN

And then we also need to add a DNS nameserver for the namespaced process.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0

$ go build
$ ./ns-process
-[ns-process]- # echo "nameserver 8.8.8.8" >> /etc/resolv.conf
-[ns-process]- # ping google.com
PING google.com (172.217.23.14): 56 data bytes
64 bytes from 172.217.23.14: seq=0 ttl=51 time=4.766 ms

And there we have it - ns-process running with full Internet connectivity.

📺 On the next…

With network configuration complete, ns-process is now set up to configure the User, Mount, PID and Network namespaces, but what needs to be done about the remaining namespaces? The answer to this and plenty more coming up, stay tuned…

Update: Part 7, “Namespaces in Go - UTS” has been published and is available at https://medium.com/@teddyking/namespaces-in-go-uts-d47aebcdf00e.

Namespaces in Go - UTS

Ed King

Jan 13, 2017·2 min read

In the previous article we configured the Network namespace to provide ns-process with a routable IP address. Now that ns-process is able to join a network, it’d be a good idea to make sure it starts up with a unique hostname. In this article (the last in the series) we will configure the UTS namespace to make this so. Let’s start, as always, by reviewing the current behaviour.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0

$ hostname
ubuntu-xenial
$ go build
$ ./ns-process
-[ns-process]- # hostname
ubuntu-xenial

The hostname reported inside the namespaced /bin/sh process is the same as the hostname reported on the host. Obviously this isn’t ideal and could lead to confusion further down the line.

Fortunately the fix for this is pretty simple (much easier than the network setup from before) so let’s jump straight in.

👉 Let’s Go

In Go, the hostname can be set via the Sethostname func from the syscall package.

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 6.0
# Filename: ns_process.go

func nsInitialisation() {
	newrootPath := os.Args[1]

	if err := mountProc(newrootPath); err != nil {
		fmt.Printf("Error mounting /proc - %s\n", err)
		os.Exit(1)
	}

	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}

	if err := syscall.Sethostname([]byte("ns-process")); err != nil {
		fmt.Printf("Error setting hostname - %s\n", err)
		os.Exit(1)
	}

	if err := waitForNetwork(); err != nil {
		fmt.Printf("Error waiting for network - %s\n", err)
		os.Exit(1)
	}

	nsRun()
}

The call to Sethostname occurs just before the wait for the network. As you can see, the hostname has been hardcoded to ns-process here. Most container implementations today set the hostname to the ID/name of the container, which is usually some random UUID by default.
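If you would rather not hardcode the hostname, a random identifier can be generated instead. The following is a minimal sketch, assuming a hex-encoded random string is an acceptable stand-in for a container ID; randomHostname is a hypothetical helper and is not part of ns-process.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomHostname returns a hostname built from n random bytes, hex
// encoded, so the result is 2*n characters long.
func randomHostname(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	// 6 random bytes give a 12-character name.
	name, err := randomHostname(6)
	if err != nil {
		fmt.Println("cannot generate hostname:", err)
		return
	}
	fmt.Println("generated hostname:", name)
}
```

The resulting string could then be passed to syscall.Sethostname as []byte(name) in place of the hardcoded value.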

And that’s really all there is to it! Let’s confirm our implementation works as expected.

💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1

# Git repo: https://github.com/teddyking/ns-process
# Git tag: 6.0

$ hostname
ubuntu-xenial
$ go build
$ ./ns-process
-[ns-process]- # hostname
ns-process

Perfect!

🎬 That’s a wrap

That’s all for this particular series of articles! Many congratulations on making it to the end. You should now be fully equipped to head out into container land to write your very own Docker. I hope you’ve had fun and have maybe learnt a little bit about Linux namespaces in Go in the process.

If you’ve got any feedback, questions or rants you’d like to send my way you can find me over on Twitter as edking2 (damn you edking and edking1!).

📺 Epilogue

More astute readers may have noticed that in publishing the last article in this series I’ve totally ignored 2 of the 7 namespaces - IPC and Cgroup. This isn’t an oversight, rather that I’ve never actually had to configure these two myself. The IPC namespace seems to Just Work™ and the Cgroup namespace is so new that I just haven’t got round to playing with it yet. Besides, I need to save some material for season 2…

WRITTEN BY

Ed King

A Software Engineer currently working with Cloud Foundry and Kubernetes.


Posted 2020-12-06 20:27 by lvmxh
