Linux Namespaces, reposted from Ed King
Building Containers from Scratch in Go (GitHub source code)
netns - network namespaces in Go (GitHub source code)
how to build a container from scratch
- Linux containers in 500 lines of code by Lizzie Dixon
- Building Containers from Scratch with Go by Liz Rice
- Building Containers from Scratch in Go by mugli
- Part 1: Linux Namespaces https://medium.com/@teddyking/linux-namespaces-850489d3ccf
- Part 2: Namespaces in Go - Basics https://medium.com/@teddyking/namespaces-in-go-basics-e3f0fc1ff69a
- Part 3: Namespaces in Go - User https://medium.com/@teddyking/namespaces-in-go-user-a54ef9476f2a
- Part 4: Namespaces in Go - reexec https://medium.com/@teddyking/namespaces-in-go-reexec-3d1295b91af8
- Part 5: Namespaces in Go - Mount https://medium.com/@teddyking/namespaces-in-go-mount-e4c04fe9fb29
- Part 6: Namespaces in Go - Network https://medium.com/@teddyking/namespaces-in-go-network-fdcf63e76100
- Part 7: Namespaces in Go - UTS https://medium.com/@teddyking/namespaces-in-go-uts-d47aebcdf00e
- Build Your Own Container Using Less than 100 Lines of Go by Julian Friedman
- Creating Your Own Containers
- Building Containers in Pure Bash and C
- HN: https://news.ycombinator.com/item?id=16734440
The original article can no longer be opened, so the text below was taken from the Google cache:
https://medium.com/@teddyking/linux-namespaces-850489d3ccf
Linux namespaces comprise some of the fundamental technologies behind most modern-day container implementations. At a high level, they allow for isolation of global system resources between independent processes. For example, the PID namespace isolates the process ID number space. This means that two processes running on the same host can have the same PID!
This level of isolation is clearly useful in the world of containers. Without namespaces, a process running in container A could, for example, umount an important filesystem in container B, or change the hostname of container C, or remove a network interface from container D. By namespacing these resources, the process in container A isn’t even aware that the processes in containers B, C and D exist.
It follows that you can’t interfere with something if it’s not visible to you. And that’s really what namespaces provide - a way to limit what a process can see, to make it appear as though it’s the only process running on a host.
Note that namespaces do not restrict access to physical resources such as CPU, memory and disk. That access is metered and restricted by a kernel feature called ‘cgroups’.
👟 Kicking the tyres
The following has been tested on an Ubuntu 16.04 Xenial machine
Let’s jump straight in with a practical example of namespaces in action.
$ unshare -h
Usage:
unshare [options] <program> [<argument>...]
Run a program with some namespaces unshared from the parent.
Options:
-m, --mount[=<file>] unshare mounts namespace
-u, --uts[=<file>] unshare UTS namespace (hostname etc)
...
The unshare command allows you to run a program with some namespaces ‘unshared’ from its parent. Essentially what this means is that unshare will run whatever program you pass it in a new set of namespaces.
Let’s run through an example using the UTS namespace. The UTS namespace provides isolation of the hostname and domainname system identifiers. This isolation can be tested by running hostname my-new-hostname inside a UTS namespaced /bin/sh process, and confirming that the hostname change is not reflected outside that process.
$ sudo su # become root user
$ hostname # check current hostname
dev-ubuntu
$ unshare -u /bin/sh # create a shell in new UTS namespace
$ hostname my-new-hostname # set hostname
$ hostname # confirm new hostname
my-new-hostname
$ exit # exit new UTS namespace
$ hostname # confirm original hostname unchanged
dev-ubuntu
Breaking this down, we start by running sudo su to become the root user. Root privileges are required to create most namespaces (the exception being the user namespace - more on that in a later article). Then we run hostname to confirm our current hostname ('dev-ubuntu' in my case).
Now for the exciting part! The unshare -u /bin/sh command drops us into a shell that's running in a new, separate UTS namespace. We then run hostname my-new-hostname to set the hostname inside the new UTS namespace only. The change can be confirmed by running hostname again.
Lastly we exit the namespaced shell and run hostname one last time. We can see that the value for the hostname matches the original value, despite having run hostname my-new-hostname in between. This is because that change only took effect inside the new UTS namespace.
👑 7 namespaces to rule them all
The above example demonstrates the UTS namespace, but the fun doesn’t end there. At the time of writing there are 7 namespaces available:
- Mount - isolate filesystem mount points
- UTS - isolate hostname and domainname
- IPC - isolate interprocess communication (IPC) resources
- PID - isolate the PID number space
- Network - isolate network interfaces
- User - isolate UID/GID number spaces
- Cgroup - isolate cgroup root directory
Most container implementations make use of the above namespaces in order to provide the highest level of isolation between separate container processes. Note, however, that the cgroup namespace is more recent than the others and isn’t as widely used.
📺 On the next …
The unshare command is great, but what happens when we want more fine-grained control over the namespaces in our programs? The answer to this and plenty more coming up, stay tuned…
Update: Part 2, “Namespaces in Go - Basics” has been published and is available here.
WRITTEN BY
Ed King
A Software Engineer currently working with Cloud Foundry and Kubernetes.
In the previous article we dipped our toes in the namespace waters with the unshare command. unshare is great for simple scripting around namespaces but it's not so well suited for when we need more fine-grained and precise control, as is the case with containers. For this use case it's much better to have the support of a fully fledged programming language.
Go has emerged as the container implementation language of choice. This is due in part to the fact that Docker was, and still is, written in Go. Docker is one of the most successful open source Go projects to date (37,680 GitHub ⭐️s at time of writing) and it showed the world that Go was a language to be taken seriously.
The Docker developers have previously outlined the reasons they chose to write Docker in Go. Some of the top reasons include static compilation, good asynchronous primitives, low-level interfaces, a full development environment and strong cross compilation support.
For me personally the real beauty of Go is in its apparent simplicity. Containers are hard! And by using a ‘simple’ language it makes it much easier to reason about what exactly is going on under the hood. There is a great talk by Rob Pike, “Simplicity is Complicated”, in which he discusses how simplicity is part of Go’s design. It’s definitely worth a watch if you’re interested.
👉 Let’s Go
The aim for this series of articles is to provide an understanding of how to work with Linux namespaces inside Go programs. To achieve this, we will be building out a sample application named ns-process.
ns-process will be fairly simple to begin with - it will create a /bin/sh process in a new set of namespaces. Over the course of the next few articles it will evolve into something much more exciting - a program capable of creating unprivileged containers! Don’t worry if you’re not sure what “unprivileged” means in this context, all will be explained along the way.
The code for ns-process is available on GitHub and I highly recommend cloning the repo so you can follow along at home.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 1.0
# Filename: ns_process.go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Env = []string{"PS1=-[ns-process]- # "}

	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}

	if err := cmd.Run(); err != nil {
		fmt.Printf("Error running the /bin/sh command - %s\n", err)
		os.Exit(1)
	}
}
As you can see, there’s nothing particularly complicated here. We’re simply creating an *exec.Cmd, piping through stdin/out/err from the calling process and setting the PS1 environment variable on the new process (this just makes it easier to identify the namespaced shell when executing the program).
The interesting part is cmd.SysProcAttr, but before understanding SysProcAttr we need to take a deeper look at the underlying system calls that make up the namespaces API.
📝 The namespaces API
The namespaces(7) man page tells us there are 3 system calls that make up the API:
- clone(2) - creates a new process
- setns(2) - allows the calling process to join an existing namespace
- unshare(2) - moves the calling process to a new namespace
unshare() may look familiar from the previous article. This is the system call that gets invoked when running the unshare command. The call we're interested in this time is clone(), as clone() gets called as part of Go’s cmd.Run() (the Run method on *exec.Cmd).
When calling clone() it's possible to pass one or more CLONE_* flags. Each namespace has a corresponding CLONE flag - CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWUSER and CLONE_NEWCGROUP. The execution context of the cloned process is, in part, defined by the flags passed in.
Back up in Go land, SysProcAttr allows us to set attributes on the *exec.Cmd. By specifying the Cloneflags attribute, we're telling Go to pass the corresponding CLONE_* flags through to the clone() system call. And thus we can control which namespaces we'd like our process to be executed in.
Compile and run the program and you will be dropped into a /bin/sh process that's running in a new UTS namespace. Note that the program must be run as the root user.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
$ go build
$ sudo ./ns-process
-[ns-process]- #
Great! We’ve been dropped into a new shell that’s supposedly running in a new UTS namespace. Let’s confirm that this is the case.
-[ns-process]- # readlink /proc/self/ns/uts
uts:[4026532410]
-[ns-process]- # exit
$ readlink /proc/self/ns/uts
uts:[4026531838]
The contents of /proc/self/ns/uts include the namespace type (uts) and the inode number of the namespace. The fact that the inode number is different inside the ns-process shell compared to outside it implies that these two processes are indeed running in different UTS namespaces.
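The same check can be done from Go rather than with readlink. The sketch below is only an illustration: parseNsLink is a made-up helper, the list of namespace names is an assumption, and strings.Cut needs Go 1.18 or newer (more recent than the Go version the article was tested with).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a link target such as "uts:[4026531838]" into the
// namespace type and its inode number.
func parseNsLink(target string) (string, uint64, error) {
	nsType, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return nsType, inode, nil
}

func main() {
	// Print the namespaces of the current process.
	for _, name := range []string{"mnt", "uts", "ipc", "pid", "net", "user"} {
		target, err := os.Readlink("/proc/self/ns/" + name)
		if err != nil {
			fmt.Println("skipping", name, "-", err)
			continue
		}
		nsType, inode, err := parseNsLink(target)
		if err != nil {
			fmt.Println("skipping", name, "-", err)
			continue
		}
		fmt.Printf("%-5s inode %d\n", nsType, inode)
	}
}
```

Running this once inside the namespaced shell and once outside it should show a different inode for every namespace that was actually unshared.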
Not bad at all! But, we can do better. At the moment we’re only requesting a single new namespace for the process. Let’s throw in a few more to spice things up a little. This can be achieved by adding additional flags to Cloneflags, as follows.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 1.1
# Filename: ns_process.go
...
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWNS |
		syscall.CLONE_NEWUTS |
		syscall.CLONE_NEWIPC |
		syscall.CLONE_NEWPID |
		syscall.CLONE_NEWNET |
		syscall.CLONE_NEWUSER,
}
...
Compile and run the program again, and this time you’ll be dropped into a /bin/sh process that's running in a new Mount, UTS, IPC, PID, Network and User namespace.
💡 When requesting a new User namespace alongside other namespaces, the User namespace will be created first. User namespaces can be created without root permissions, which means we can now drop the sudo and run our program as a non-root user! I’ll go into more detail about the user namespace in a later article.
This is all well and good, and at a basic level does allow us to run processes in new namespaces from Go. However, IRL it’s not really all that useful … We’re missing a lot of setup required to fully initialise and configure the namespaces. For example:
- We’ve requested a new Mount namespace (CLONE_NEWNS) but are currently piggybacking off the host's mounts and rootfs
- We’ve requested a new PID namespace (CLONE_NEWPID) but haven't mounted a new /proc filesystem
- We’ve requested a new Network namespace (CLONE_NEWNET) but haven't set up any interfaces inside the namespace
- We’ve requested a new User namespace (CLONE_NEWUSER) but have failed to provide a UID/GID mapping
And so it appears that we’ve still got plenty of work cut out for us.
📺 On the next…
We’ve seen how to run a process in a new set of namespaces using Go, but how do we configure and initialise the namespaces so they are ready for use? The answer to this and plenty more coming up, stay tuned…
Update: Part 3, “Namespaces in Go - User” has been published and is available here.
In the previous article we saw how to create and run a process in various Linux namespaces using Go. We left with some code that runs a /bin/sh process in a new Mount, UTS, IPC, PID, Network and User namespace.
You may recall that once we added the User namespace to ns-process we no longer had to run it as the root user. This is a great feature to have as it means ns-process can be run much more securely. However, in adding the User namespace to the program, we have inadvertently introduced some less desirable behaviour.
This behaviour can be demonstrated by comparing the output of whoami from within the namespaced shell both before and after we added the User namespace, as follows.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 1.0
# Prior to adding the User namespace
$ go build
$ sudo ./ns-process
-[ns-process]- # whoami
root
-[ns-process]- # id root
uid=0(root) gid=0(root) groups=0(root)

# Git tag: 1.1
# After adding the User namespace
$ go build
$ ./ns-process
-[ns-process]- # whoami
nobody
-[ns-process]- # id nobody
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
Although we are now able to run ns-process as a non-root user, once inside the namespaced shell we have lost our root identity.
In this article we will work through a fix for this regression, and learn a little bit more about the User namespace along the way.
🗺 UID and GID mapping
The reason behind our loss of identity is that we’re missing some important configuration. It is not enough to simply add the CLONE_NEWUSER flag and expect the User namespace to be ready for use. In order to set up the namespace properly, we also need to provide what is known as a UID and a GID mapping.
💁 If you’re not interested in the theory and are eager to crack on with the Go coding, feel free to skip the rest of this section
ID mapping and how it relates to User namespaces is a huge topic in itself, and it falls mostly out of scope for this article. Having said that, there are a few things you need to know in order to understand how we’re going to fix our identity crisis. Here are the TL;DR essentials.
- The User namespace provides isolation of UIDs and GIDs
- There can be multiple, distinct User namespaces in use on the same host at any given time
- Every Linux process runs in one of these User namespaces
- User namespaces allow for the UID of a process in User namespace 1 to be different to the UID for the same process in User namespace 2
- UID/GID mapping provides a mechanism for mapping IDs between two separate User namespaces
The following diagram attempts to visualise the above.
Pictured are two User namespaces, 1 and 2, with their corresponding UID and GID tables. Note that process C, running as non-root-user, is able to spawn process D, which is running as root.
The key implementation detail, and the thing that prevents the universe from imploding is the mapping between the two User namespaces (represented here by the dashed lines).
Process D only has root privileges within the context of User namespace 2. From the perspective of processes in User namespace 1, process D is running as non-root-user, and as such, doesn’t have those all-important root privileges.
This mapping is exactly what’s missing from ns-process at the moment, and it’s about time we sorted that out.
👉 Let’s Go
ID mappings can be applied by setting the UidMappings and GidMappings fields on cmd.SysProcAttr. Both fields are slices of SysProcIDMap, a type found in Go’s syscall package.
type SysProcIDMap struct {
	ContainerID int // Container ID.
	HostID      int // Host ID.
	Size        int // Size.
}
The ContainerID and HostID fields should be fairly self-explanatory. Size is slightly less so. Size basically determines the range of IDs to map, which allows us to map more than one ID at a time. Let’s update our program to include some mappings.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 2.0
# Filename: ns_process.go
# ...
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWNS |
		syscall.CLONE_NEWUTS |
		syscall.CLONE_NEWIPC |
		syscall.CLONE_NEWPID |
		syscall.CLONE_NEWNET |
		syscall.CLONE_NEWUSER,
	UidMappings: []syscall.SysProcIDMap{
		{
			ContainerID: 0,
			HostID:      os.Getuid(),
			Size:        1,
		},
	},
	GidMappings: []syscall.SysProcIDMap{
		{
			ContainerID: 0,
			HostID:      os.Getgid(),
			Size:        1,
		},
	},
}
# ...
Here we are adding a single UID and GID mapping. We set ContainerID to 0, HostID to the current user’s UID/GID and Size equal to 1. In other words, we are mapping ID = 0 (aka root) in our new User namespace to the ID of the user who invokes the ns-process command.
With all this in place, we should be able to build and run ns-process and see that we now become the root user inside the namespaced shell.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
$ go build
$ ./ns-process
-[ns-process]- # whoami
root
-[ns-process]- # id
uid=0(root) gid=0(root) groups=0(root)
And there we have it! With the addition of a simple UidMapping/GidMapping we have been able to restore our root identity inside the namespaced shell, while retaining the ability to run ns-process as a non-root user.
📺 On the next…
In the next article we’ll take a look at reexec. What is reexec and why is it relevant to Namespaces in Go? The answer to this and plenty more coming up, stay tuned…
Update: Part 4, “Namespaces in Go - Reexec” has been published and is available here.
In the previous article we learnt how to apply a UID/GID mapping to ns-process such that we are now running as the root user once inside the namespaced shell.
The purpose of this article is to provide an understanding of the reexec package. reexec is part of the Docker codebase and provides a convenient way for an executable to “re-exec” itself. In all honesty reexec is a bit of a hack, but it’s a really useful one that is required to circumvent a limitation in how Go handles process forking. Before going into too much more detail, let’s take a look at the problem reexec helps to solve.
It’s probably best to demonstrate the problem by way of an example. Consider the following - we want to update ns-process such that a randomly-generated hostname is set inside the new UTS namespace we’ve cloned. For security reasons, it’s essential that the hostname has been set before the namespaced /bin/sh process starts running. After all, we don’t want programs running inside ns-process to be able to discover the host’s hostname.
As far as I’m aware, Go doesn’t provide a built-in way to allow us to do this. Namespaces are created by setting attributes on an *exec.Cmd, which is also where we specify the process we'd like to run. For example:
cmd := exec.Command("/bin/echo", "Process already running")
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS,
}
cmd.Run()
Once cmd.Run() is called, the namespaces get cloned and then the process gets started straight away. There’s no hook or anything here that allows us to run code after the namespace creation but before the process starts. This is where reexec comes in.
🎤 reexec yourself before you wreck yourself
Let’s open up the reexec package and take a look at what’s inside (I won’t paste full code snippets here for sake of simplicity, but I advise you read along with the full implementations of the methods).
// Register adds an initialization func under the specified name
func Register(name string, initializer func()) {
	// ...
}
First up we have Register, which exposes a way for us to register arbitrary functions by some name and to store them in memory. We will use this to register some sort of “Initialise Namespace” function when ns-process first starts up.
// Init is called as the first part of the exec process
// and returns true if an initialization function was called.
func Init() bool {
	// ...
}
Next up we have Init, which gives us a mechanism for determining whether or not the process is running after having been reexec'ed, and for running one of the registered functions if it has. It does this by checking os.Args[0] for the name of one of the previously-registered functions.
// Command returns *exec.Cmd which have Path as current binary.
// ...
func Command(args ...string) *exec.Cmd {
	return &exec.Cmd{
		Path: Self(),
		Args: args,
		SysProcAttr: &syscall.SysProcAttr{
			Pdeathsig: syscall.SIGTERM,
		},
	}
}
Command ties it all together by creating an *exec.Cmd with Path set to Self(), which evaluates to /proc/self/exe on Linux machines. We can choose which of the registered functions we’d like to invoke upon reexec by providing the registered name of the function in args[0].
💁 /proc/self/exe is a symlink file that points to the path of the currently-running executable
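To see the whole mechanism in one place, here is a minimal, stdlib-only sketch of the same idea. It is not the Docker reexec package: the names register, runIfReexeced and selfCommand are invented for this example, and it creates no namespaces, so it can be run without any special privileges.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// initializers holds functions keyed by the name they were registered under.
var initializers = map[string]func(){}

// register stores fn under name (a stand-in for reexec.Register).
func register(name string, fn func()) {
	if _, dup := initializers[name]; dup {
		panic("initializer already registered: " + name)
	}
	initializers[name] = fn
}

// runIfReexeced runs the initializer registered under argv0, if any,
// and reports whether one was run (a stand-in for reexec.Init).
func runIfReexeced(argv0 string) bool {
	fn, ok := initializers[argv0]
	if !ok {
		return false
	}
	fn()
	return true
}

// selfCommand builds a command that runs the current binary again with
// args as its argv (a stand-in for reexec.Command).
func selfCommand(args ...string) *exec.Cmd {
	return &exec.Cmd{Path: "/proc/self/exe", Args: args}
}

func main() {
	register("childInit", func() {
		fmt.Println("child: setup would run here, argv:", os.Args)
	})

	// In the re-exec'ed child, os.Args[0] is "childInit", so we stop here.
	if runIfReexeced(os.Args[0]) {
		return
	}

	cmd := selfCommand("childInit", "some-arg")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("re-exec failed:", err)
		return
	}
	fmt.Println("parent: child finished")
}
```

The trick is that the first element of Args becomes os.Args[0] in the child, so the same binary can tell whether it was started normally or re-exec'ed.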
Now that we have an understanding of how reexec works, it’s time to wire it up inside ns-process.
👉 Let’s Go
The first thing we need to do is to create a function and register it using reexec.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0
# Filename: ns_process.go
# ...
func init() {
	reexec.Register("nsInitialisation", nsInitialisation)
	if reexec.Init() {
		os.Exit(0)
	}
}
# ...
There are two important things happening here. First, we register a function nsInitialisation under the name “nsInitialisation”. We'll add that function in a moment. Secondly, we call reexec.Init() and exit the program with os.Exit(0) if it returns true. This is vitally important to prevent an infinite loop situation whereby the program gets stuck reexec'ing itself forever! Let’s add nsInitialisation next.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0
# Filename: ns_process.go
# ...
func nsInitialisation() {
	fmt.Printf("\n>> namespace setup code goes here <<\n\n")
	nsRun()
}

func nsRun() {
	cmd := exec.Command("/bin/sh")

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Env = []string{"PS1=-[ns-process]- # "}

	if err := cmd.Run(); err != nil {
		fmt.Printf("Error running the /bin/sh command - %s\n", err)
		os.Exit(1)
	}
}
Here we’ve added nsInitialisation() simply as a placeholder function. It will become much more important in future articles when we actually need to start configuring the namespaces. For now, it simply passes through to nsRun(), which runs the /bin/sh process.
All that’s left to do now is modify main() such that it runs the /bin/sh process via reexec and nsInitialisation rather than calling it directly.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0
# Filename: ns_process.go
func main() {
	cmd := reexec.Command("nsInitialisation")
	// ...
}
By specifying nsInitialisation as the first arg to Command, we're essentially telling reexec to run /proc/self/exe with os.Args[0] set to nsInitialisation. Finally, once the program has been reexec'ed, Init will detect the registered function and then actually run it. Let’s give it a whirl.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
$ go build
$ ./ns-process

>> namespace setup code goes here <<

-[ns-process]- #
And there we have it. We now have nsInitialisation available in which to run any namespace setup we need, including the ability, as discussed earlier, to set the hostname in the new UTS namespace if we so desire.
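As a rough idea of what that hostname setup could look like, here is a sketch. The helpers randomHostname and setHostname and the "ns-" prefix are assumptions made for this example and are not taken from ns-process. Note that main deliberately does not call setHostname: outside a new UTS namespace, and when run as root, it would rename the host itself.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"syscall"
)

// randomHostname returns "ns-" followed by 2*n random hex characters.
func randomHostname(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return "ns-" + hex.EncodeToString(buf), nil
}

// setHostname is intended to be called from a function such as
// nsInitialisation, i.e. after the new UTS namespace exists but before
// the namespaced shell starts. It is not called from main below.
func setHostname(name string) error {
	return syscall.Sethostname([]byte(name))
}

func main() {
	name, err := randomHostname(4)
	if err != nil {
		fmt.Println("could not generate hostname:", err)
		return
	}
	fmt.Println("hostname that would be set inside the namespace:", name)
}
```

Calling setHostname from nsInitialisation, just before nsRun, would give the ordering asked for at the start of this article.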
📺 On the next…
We’re now in a position to configure our namespaces, but what configuration remains to be done? The answer to this and plenty more coming up, stay tuned…
Update: Part 5, “Namespaces in Go - Mount” has been published and is available here.
One of the fundamental features of container implementations today is the ability to run containers of differing Linux distros on the same host machine. It’s not uncommon, for example, to install Docker on an Ubuntu host and to then start a bunch of containers on that host using BusyBox, CentOS, or any other distro you like the look of.
In this article we will take a look at what makes this possible - namely a combination of the Mount namespace and the pivot_root system call. Let's start by reviewing the Mount namespace implementation in ns-process as it currently stands. If you’ve not been following along with this series so far, be sure to check out the previous article(s) first.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 3.0
$ go build
$ ./ns-process

>> namespace setup code goes here <<

-[ns-process]- # cat /proc/mounts
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
# ...
There are a number of mounts already listed in the /proc/mounts file. This may be a little surprising given that we’re requesting a new Mount namespace (via the CLONE_NEWNS flag) and have yet to do any explicit Mount namespace setup.
This doesn’t feel very container-like. Our namespaced process should know as little as possible about the host it’s running on, and certainly shouldn’t be able to see a list of all the host’s mounts. So why’s this happening? Fortunately, an explanation can be found within the mount_namespaces(7) man page.
“When a process creates a new mount namespace using clone(2) or unshare(2) with the CLONE_NEWNS flag, the mount point list for the new namespace is a copy of the caller’s mount point list.”
It seems that this is actually intended behaviour, and it explains why /proc/mounts is already populated as soon as our namespaced process starts. With this in mind the question now becomes, “What do we do about it?”. We need some way of clearing the host’s mounts from the new Mount namespace in order to keep them secure and away from prying eyes - we need to pivot_root.
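For anyone who would rather inspect the inherited mounts from Go than with cat, the sketch below reads the mount list for the current process. The mountEntry type and parseMountLine helper are invented for this example, and only the first four fields of each line are kept.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mountEntry holds the first four fields of a /proc/mounts line.
type mountEntry struct {
	Device     string
	MountPoint string
	FSType     string
	Options    string
}

// parseMountLine parses a line such as
// "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0".
func parseMountLine(line string) (mountEntry, error) {
	fields := strings.Fields(line)
	if len(fields) < 4 {
		return mountEntry{}, fmt.Errorf("malformed mount line %q", line)
	}
	return mountEntry{
		Device:     fields[0],
		MountPoint: fields[1],
		FSType:     fields[2],
		Options:    fields[3],
	}, nil
}

func main() {
	f, err := os.Open("/proc/self/mounts")
	if err != nil {
		fmt.Println("cannot read mounts:", err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		entry, err := parseMountLine(scanner.Text())
		if err != nil {
			continue // skip lines that don't look like mount entries
		}
		fmt.Printf("%s on %s type %s\n", entry.Device, entry.MountPoint, entry.FSType)
	}
}
```

Run inside the namespaced shell at this point in the series, this would list the copied host mounts, which is exactly the problem described above.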
🔄 pivot_root
pivot_root allows you to set a new root filesystem for the calling process. I.e. it allows you to change what / is. It does this by mounting the current root filesystem somewhere else while simultaneously mounting some new root filesystem on /. Once the previous root has been moved, it is then possible to umount it. Thus we have a mechanism for 'clearing' the host's mounts from inside a new Mount namespace - we simply pivot away and then umount them!
This is what allows the aforementioned Ubuntu host machine to run a CentOS container. As long as the Ubuntu host has a copy of a CentOS filesystem on disk, we can create a new Mount namespace, call pivot_root pointing to the CentOS filesystem and then run whatever processes we want to inside the 'pivoted' namespace. The processes will believe they’re running on CentOS the entire time.
Incidentally this is where the reexec from the previous article comes in handy. pivot_root must be called from within the new Mount namespace, otherwise we'll end up changing the host's / which is not the intention! And we want all this to happen before the namespaced shell starts so that the requested root filesystem is ready for when it does.
👉 Let’s Go
In Go, pivot_root is implemented via the PivotRoot func found in the syscall package.
func PivotRoot(newroot string, putold string) (err error)
newroot is the path to the desired new root filesystem and putold is a path to a directory in which to move the current root. There are a few restrictions imposed on newroot and putold by the underlying pivot_root system call that we need to be aware of:
- They must both be directories
- They must not be on the same filesystem as the current root
- putold must be underneath newroot
- No other filesystem may be mounted on putold
Most of these are fine but the second point there will require a small workaround, as we’ll see in a moment. We’re also going to need a suitable newroot to pivot to.
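Two of these restrictions can be checked up front with nothing more than path handling. The sketch below is a hypothetical pre-flight check, not something ns-process does: checkPivotArgs is a made-up name, and it only covers the "directories" and "underneath" rules. The filesystem and mount rules are left to the kernel to enforce.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// checkPivotArgs verifies that newroot and putold are both directories
// and that putold is at or underneath newroot.
func checkPivotArgs(newroot, putold string) error {
	for _, p := range []string{newroot, putold} {
		info, err := os.Stat(p)
		if err != nil {
			return err
		}
		if !info.IsDir() {
			return fmt.Errorf("%s is not a directory", p)
		}
	}

	rel, err := filepath.Rel(newroot, putold)
	if err != nil {
		return err
	}
	// A relative path that climbs out of newroot starts with "..".
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return fmt.Errorf("%s is not underneath %s", putold, newroot)
	}
	return nil
}

func main() {
	newroot := "/tmp/ns-process/rootfs" // the rootfs path used in this article
	putold := filepath.Join(newroot, ".pivot_root")

	if err := checkPivotArgs(newroot, putold); err != nil {
		fmt.Println("not ready to pivot:", err)
		return
	}
	fmt.Println("arguments look plausible for pivot_root")
}
```

If the directories don't exist yet, the check simply reports that, which is also a useful hint that putold has to be created before calling pivot_root.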
The process of preparing a newroot filesystem can be quite a detailed and complex one. Take for example Docker’s layered filesystem approach in which many filesystem “layers” are joined together to present a single coherent root. We’re going to do something much simpler, which is to assume that a suitable root filesystem has already been prepared for use.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0
$ mkdir -p /tmp/ns-process/rootfs
$ tar -C /tmp/ns-process/rootfs -xf assets/busybox.tar
From now on, ns-process will expect a root filesystem to exist at this path and will raise an error if one can’t be found. Note that although we’re using BusyBox for this particular example, you could just as easily use any other distro.
Now that we have our newroot, let’s write some code to make use of it.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0
# Filename: rootfs.go
func pivotRoot(newroot string) error {
	putold := filepath.Join(newroot, "/.pivot_root")

	// bind mount newroot to itself - this is a slight hack
	// needed to work around a pivot_root requirement
	if err := syscall.Mount(
		newroot,
		newroot,
		"",
		syscall.MS_BIND|syscall.MS_REC,
		"",
	); err != nil {
		return err
	}

	// create putold directory
	if err := os.MkdirAll(putold, 0700); err != nil {
		return err
	}

	// call pivot_root
	if err := syscall.PivotRoot(newroot, putold); err != nil {
		return err
	}

	// ensure current working directory is set to new root
	if err := os.Chdir("/"); err != nil {
		return err
	}

	// umount putold, which now lives at /.pivot_root
	putold = "/.pivot_root"
	if err := syscall.Unmount(
		putold,
		syscall.MNT_DETACH,
	); err != nil {
		return err
	}

	// remove putold
	if err := os.RemoveAll(putold); err != nil {
		return err
	}

	return nil
}
With the pivotRoot func in place, it’s time to put nsInitialisation to good use.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0
# Filename: ns_process.go
func nsInitialisation() {
	newrootPath := os.Args[1]

	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}

	nsRun()
}

func main() {
	var rootfsPath string
	// ...
	cmd := reexec.Command("nsInitialisation", rootfsPath)
}
Notice that we’re now passing an argument, rootfsPath, to nsInitialisation. Once reexec'ed, this argument can be picked up by reading from os.Args[1]. Also notice how the call to pivotRoot comes before nsRun. By doing this, we're ensuring that the new root filesystem will already have been pivoted to before the /bin/sh process starts.
With all that in place, let's run the updated Go program and check to see which mounts, if any, are available to us now.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.0
$ go build
$ ./ns-process
-[ns-process]- # cat /proc/mounts
cat: can't open '/proc/mounts': No such file or directory
Ah … now that we’ve pivoted to a new /, we no longer have a /proc! This is actually a good thing as it means we definitely can’t see the host’s mounts anymore, which is one of the main reasons for doing all this work in the first place. But, there’s probably only so far we can get without a working /proc, so let’s add one to our new root.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
# Filename: rootfs.go
func mountProc(newroot string) error {
	source := "proc"
	target := filepath.Join(newroot, "/proc")
	fstype := "proc"
	flags := 0
	data := ""

	os.MkdirAll(target, 0755)
	if err := syscall.Mount(
		source,
		target,
		fstype,
		uintptr(flags),
		data,
	); err != nil {
		return err
	}

	return nil
}
And just as with pivotRoot, mountProc should be called from nsInitialisation.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
# Filename: ns_process.go
func nsInitialisation() {
	newrootPath := os.Args[1]
	if err := mountProc(newrootPath); err != nil {
		fmt.Printf("Error mounting /proc - %s\n", err)
		os.Exit(1)
	}
	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}
	nsRun()
}
Ok, that should now be everything. Let’s try it out.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
$ go build
$ ./ns-process
-[ns-process]- # cat /proc/mounts
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
proc /proc proc rw,nodev,relatime 0 0
That’s looking much better - the host’s mounts are no longer visible to us and we have a new /proc mounted and ready for action. But wait … there is one more thing …
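If you would rather check the mount table from code than by eye, something like the following could work. It is only a sketch: the mount type and the parseMounts and hasMount helpers are hypothetical names, and the parser assumes the simple whitespace-separated layout shown in the output above (it does not decode escaped characters such as \040 in paths).

```go
package main

import (
	"fmt"
	"strings"
)

// mount holds the first four fields of a /proc/mounts line.
type mount struct {
	Source  string
	Target  string
	FSType  string
	Options []string
}

// parseMounts parses text in the /proc/mounts format, skipping blank or
// malformed lines.
func parseMounts(data string) []mount {
	var mounts []mount
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue
		}
		mounts = append(mounts, mount{
			Source:  fields[0],
			Target:  fields[1],
			FSType:  fields[2],
			Options: strings.Split(fields[3], ","),
		})
	}
	return mounts
}

// hasMount reports whether a filesystem of the given type is mounted at target.
func hasMount(mounts []mount, target, fstype string) bool {
	for _, m := range mounts {
		if m.Target == target && m.FSType == fstype {
			return true
		}
	}
	return false
}

func main() {
	// Sample data copied from the output above; inside the namespaced
	// process this would come from reading /proc/mounts.
	sample := "/dev/sda1 / ext4 rw,relatime,data=ordered 0 0\n" +
		"proc /proc proc rw,nodev,relatime 0 0\n"
	mounts := parseMounts(sample)
	for _, m := range mounts {
		fmt.Printf("%s on %s type %s\n", m.Source, m.Target, m.FSType)
	}
	fmt.Println("proc mounted:", hasMount(mounts, "/proc", "proc"))
}
```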
🤔 PID namespace
The changes implemented above have had an unintentional side effect on the PID namespace setup. Prior to mounting the new /proc, running ps inside the namespaced shell would’ve resulted in all the host’s processes being listed. This is because ps relies on /proc to detect running processes and we were still referencing the host’s /proc.
This is obviously a pretty terrible thing to happen from a container perspective! But fortunately now that we have our own /proc (and are requesting a new PID namespace via the CLONE_NEWPID flag), running ps shows only processes that are relevant to us.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
$ go build
$ ./ns-process
-[ns-process]- # ps
PID USER TIME COMMAND
1 root 0:00 {exe} nsInitialisation /tmp/ns-process/rootfs
5 root 0:00 /bin/sh
8 root 0:00 ps
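To see why ps is so dependent on /proc, here is a rough sketch of the same idea in Go: every running process shows up as a numeric directory under /proc, so listing those directories gives the visible PIDs. The helper names pidsFromNames and listPIDs are hypothetical, and the code assumes Go 1.16 or later for os.ReadDir.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
)

// pidsFromNames keeps only the directory names that look like process IDs
// and returns them in ascending order.
func pidsFromNames(names []string) []int {
	var pids []int
	for _, name := range names {
		if pid, err := strconv.Atoi(name); err == nil && pid > 0 {
			pids = append(pids, pid)
		}
	}
	sort.Ints(pids)
	return pids
}

// listPIDs reads the directory entries of procDir (normally /proc) and
// returns the PIDs that are visible through it.
func listPIDs(procDir string) ([]int, error) {
	entries, err := os.ReadDir(procDir)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, entry := range entries {
		if entry.IsDir() {
			names = append(names, entry.Name())
		}
	}
	return pidsFromNames(names), nil
}

func main() {
	pids, err := listPIDs("/proc")
	if err != nil {
		fmt.Println("could not read /proc:", err)
		return
	}
	fmt.Println(len(pids), "processes visible via /proc")
}
```

Run against the host’s /proc this reports every host process; run against a /proc mounted inside a new PID namespace it should only report the handful shown by ps above.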
📺 On the next…
We’re nearing the season finale of “Namespaces in Go”, but we’re still missing one key piece of configuration - networking. What needs to be done to allow our namespaced shell to talk to the Internets? The answer to this and plenty more coming up, stay tuned…
Update: Part 6, “Namespaces in Go - Network” has been published and is available here.
In the previous article we saw how to make use of PivotRoot and the Mount namespace to swap in a new root filesystem for ns-process. With that change in place, ns-process is starting to look and feel an awful lot like any other container. Sure, it only runs a single /bin/sh process at the moment, but it does have a number of extremely cool features:
- Can be run as a non-root user thanks to the User namespace
- Can choose a root filesystem to run in thanks to the Mount namespace
- Cannot see any of the host’s processes thanks to the PID namespace
That’s pretty impressive! But there’s still a piece of vital functionality missing - networking. At the moment, ns-process doesn’t have any network connectivity!
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
$ go build
$ ./ns-process
-[ns-process]- # ifconfig
-[ns-process]- # route
Kernel IP routing table
Destination Gateway Genmask ... Use Iface
-[ns-process]- # ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network is unreachable
That’s slightly less impressive… The reason for this lack of connectivity is that ns-process clones a new Network namespace, the very purpose of which is to isolate all network-related resources (IPs, ports, interfaces, etc.).
In this article we will set about configuring the new Network namespace such that it ends up with an interface and a routable IP address.
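One way to confirm that a process really is in its own Network namespace is to look at the /proc/<pid>/ns/net symlink, whose target has the form net:[inode]. Two processes share a Network namespace only if the inode numbers match. The sketch below parses that link; parseNsLink and netNamespaceOf are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "net:[4026531992]"
// into the namespace type and its inode number.
func parseNsLink(target string) (string, uint64, error) {
	i := strings.Index(target, ":[")
	if i <= 0 || !strings.HasSuffix(target, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(target[i+2:len(target)-1], 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("unexpected namespace link %q: %v", target, err)
	}
	return target[:i], inode, nil
}

// netNamespaceOf returns the Network namespace inode for a PID.
// Passing "self" inspects the calling process.
func netNamespaceOf(pid string) (uint64, error) {
	target, err := os.Readlink("/proc/" + pid + "/ns/net")
	if err != nil {
		return 0, err
	}
	_, inode, err := parseNsLink(target)
	return inode, err
}

func main() {
	inode, err := netNamespaceOf("self")
	if err != nil {
		fmt.Println("could not read namespace link:", err)
		return
	}
	fmt.Println("this process is in network namespace", inode)
}
```

Running this on the host and again inside the namespaced shell should print two different inode numbers.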
🌐 A quick lesson in networking
If we are to have any hope of adding network connectivity to ns-process, a solid understanding of the Network namespace is going to be essential. To that end, I highly recommend you read through Introducing Linux Network Namespaces. The knowledge and ideas presented in that article will form the basis for the Network namespace configuration in ns-process. To briefly summarise, here’s what we’ll need to do:
- Create a bridge device in the host’s Network namespace
- Create a veth pair
- Attach one side of the pair to the bridge
- Place the other side of the pair in ns-process's Network namespace
- Ensure all traffic originating in the namespaced process gets routed via the veth
The general idea is to establish a connection between ns-process's Network namespace and the host’s Network namespace. Visually this looks a little something like this:
This is actually a fair amount of work! And it’s made complicated by the fact that setup and configuration need to occur in two different Network namespaces. There’s also a further complexity in that the network setup requires root privileges, which means we could end up regressing on one of ns-process's most lovely features - that it can be run as a non-root user.
Fortunately this can be avoided by making use of setuid. setuid allows a process to run as the user that owns an executable. The idea then is to extract the network setup code into a separate executable, ensure the executable is owned by the root user and to apply the setuid permission on it. We can then call out to the executable from within ns-process (running as a non-root user) as and when we need to. With all this in mind, allow me to introduce netsetgo.
🚦 On your marks, net set, GO!
netsetgo is a small binary that helps to set up Network namespaces for containers. It achieves this by applying the configuration outlined above. For the sake of brevity I’m not going to paste the full netsetgo code here, but I will briefly point out the most useful parts so you can take a more detailed look for yourself.
- Bridge creation occurs via a call to netlink.LinkAdd
- Veth creation occurs via another call to netlink.LinkAdd
- The veth is attached to the bridge via a call to netlink.LinkSetMaster
- The veth is moved to the new Network namespace via a call to netlink.LinkSetNsPid
- A default route is added to the new Network namespace via a call to netlink.RouteAdd
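If the netlink calls feel a bit abstract, it may help to see roughly the same steps expressed as ip commands. To be clear, this is not how netsetgo works (it talks netlink directly); it is just a sketch that builds, but does not run, an equivalent command sequence. The netConfig type and helper names are hypothetical, veth0 is an assumed name for the host side of the pair, and the PID is a placeholder. brg0, veth1 and the 10.10.10.x addresses match the output shown later in this article.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// netConfig describes the bridge, veth pair and addresses to configure.
type netConfig struct {
	Bridge     string
	HostVeth   string
	NsVeth     string
	BridgeCIDR string
	NsCIDR     string
	Gateway    string
	PID        int
}

// hostCommands returns the commands to run in the host's Network namespace.
func hostCommands(c netConfig) [][]string {
	return [][]string{
		{"ip", "link", "add", "name", c.Bridge, "type", "bridge"},
		{"ip", "addr", "add", c.BridgeCIDR, "dev", c.Bridge},
		{"ip", "link", "set", c.Bridge, "up"},
		{"ip", "link", "add", c.HostVeth, "type", "veth", "peer", "name", c.NsVeth},
		{"ip", "link", "set", c.HostVeth, "master", c.Bridge},
		{"ip", "link", "set", c.HostVeth, "up"},
		{"ip", "link", "set", c.NsVeth, "netns", strconv.Itoa(c.PID)},
	}
}

// namespaceCommands returns the commands to run inside the new Network
// namespace, entered here via nsenter.
func namespaceCommands(c netConfig) [][]string {
	inNs := func(args ...string) []string {
		prefix := []string{"nsenter", "-t", strconv.Itoa(c.PID), "-n"}
		return append(prefix, args...)
	}
	return [][]string{
		inNs("ip", "addr", "add", c.NsCIDR, "dev", c.NsVeth),
		inNs("ip", "link", "set", c.NsVeth, "up"),
		inNs("ip", "route", "add", "default", "via", c.Gateway),
	}
}

// render joins a command into a single printable line.
func render(cmd []string) string {
	return strings.Join(cmd, " ")
}

func main() {
	c := netConfig{
		Bridge: "brg0", HostVeth: "veth0", NsVeth: "veth1",
		BridgeCIDR: "10.10.10.1/24", NsCIDR: "10.10.10.2/24",
		Gateway: "10.10.10.1", PID: 1234, // placeholder PID
	}
	for _, cmd := range append(hostCommands(c), namespaceCommands(c)...) {
		fmt.Println(render(cmd))
	}
}
```

Printing the commands rather than executing them keeps the sketch safe to run as any user; actually applying them would need root, which is exactly the problem setuid is solving for netsetgo.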
In order to make use of netsetgo from ns-process, we’ll need to download the binary and set the correct permissions on it, as follows.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
$ wget "https://github.com/teddyking/netsetgo/releases/download/0.0.1/netsetgo"
$ sudo mv netsetgo /usr/local/bin/
$ sudo chown root:root /usr/local/bin/netsetgo
$ sudo chmod 4755 /usr/local/bin/netsetgo
The 4 in the chmod 4755 signifies that the setuid bit should be set.
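As an aside, if you ever want to check or set this bit from Go rather than from the shell, be aware that os.FileMode does not store setuid at the octal 04000 position. It has its own os.ModeSetuid flag instead. The sketch below converts a chmod-style mode into an os.FileMode and tests for the bit; fromUnixMode and hasSetuid are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
)

// fromUnixMode converts a chmod-style octal mode such as 04755 into an
// os.FileMode, mapping the special bits onto Go's own flags.
func fromUnixMode(mode uint32) os.FileMode {
	fm := os.FileMode(mode & 0777)
	if mode&04000 != 0 {
		fm |= os.ModeSetuid
	}
	if mode&02000 != 0 {
		fm |= os.ModeSetgid
	}
	if mode&01000 != 0 {
		fm |= os.ModeSticky
	}
	return fm
}

// hasSetuid reports whether the setuid bit is set in fm.
func hasSetuid(fm os.FileMode) bool {
	return fm&os.ModeSetuid != 0
}

func main() {
	fm := fromUnixMode(04755)
	fmt.Println("setuid bit set:", hasSetuid(fm))

	// Check the installed helper, if present. The path matches the
	// install commands above.
	if info, err := os.Stat("/usr/local/bin/netsetgo"); err == nil {
		fmt.Println("netsetgo is setuid:", hasSetuid(info.Mode()))
	} else {
		fmt.Println("netsetgo not found:", err)
	}
}
```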
👉 Let’s Go
Now that netsetgo is primed and ready it’s time to turn our attention back to ns-process. We need to modify ns-process so that it calls out to netsetgo to configure the network. At first glance this would appear to be relatively simple - we can just create a *exec.Cmd pointing to netsetgo and run it at the appropriate moment?
Of course, nothing’s ever quite as easy as it seems, and here the question of when to run netsetgo requires a bit more thought. Let’s start by looking at how we kick off Namespace creation at the moment (output trimmed for simplicity).
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 4.1
# Filename: ns_process.go
func main() {
	cmd := reexec.Command("nsInitialisation", rootfsPath)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWNS |
			syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWIPC |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNET |
			syscall.CLONE_NEWUSER,
	}
	if err := cmd.Run(); err != nil {
		fmt.Printf("Error running Command - %s\n", err)
		os.Exit(1)
	}
}
Here we’re using cmd.Run() to run a reexec command with a number of CLONE_NEW* flags set. Note that cmd.Run() does not return until the underlying process has exited. Up until now this has been fine because all subsequent namespace configuration has taken place inside the newly-cloned namespaces (via the nsInitialisation func to be specific).
However, netsetgo needs to configure the host’s Network namespace as well as the new one, which means we can no longer rely on the blocking call to cmd.Run().
Fortunately cmd.Run() can be split into two separate calls - cmd.Start() (which returns immediately) and cmd.Wait() (which blocks until the started command exits). This is exactly what we need as it allows us to run netsetgo after the new namespaces have been created but while still executing in the host’s namespaces. Let’s see this in action.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
# Filename: ns_process.go
if err := cmd.Start(); err != nil {
	fmt.Printf("Error starting the reexec.Command - %s\n", err)
	os.Exit(1)
}

pid := fmt.Sprintf("%d", cmd.Process.Pid)
netsetgoCmd := exec.Command(netsetgoPath, "-pid", pid)
if err := netsetgoCmd.Run(); err != nil {
	fmt.Printf("Error running netsetgo - %s\n", err)
	os.Exit(1)
}

if err := cmd.Wait(); err != nil {
	fmt.Printf("Error waiting for reexec.Command - %s\n", err)
	os.Exit(1)
}
Great! This change allows netsetgo to configure the networking across both Network namespaces as required. All that’s left to do now is to ensure that the namespaced /bin/sh process doesn’t start until the network is ready.
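The Start, configure, Wait pattern can also be wrapped up in a small helper. The sketch below is one possible shape for it, not what ns-process does: runWithHostSetup and demo are hypothetical names, and unlike the snippet above it kills and reaps the child if the host-side setup fails, so that nothing is left running. The demo assumes a true binary is available on the PATH as a stand-in for the namespaced process.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runWithHostSetup starts cmd, runs setup with the child's PID while the
// parent is still in the host's namespaces, then waits for the child to
// exit. If setup fails, the child is killed and reaped before returning.
func runWithHostSetup(cmd *exec.Cmd, setup func(pid int) error) error {
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start: %w", err)
	}
	if err := setup(cmd.Process.Pid); err != nil {
		_ = cmd.Process.Kill() // the child may already have exited
		_ = cmd.Wait()         // reap the child either way
		return fmt.Errorf("host setup: %w", err)
	}
	if err := cmd.Wait(); err != nil {
		return fmt.Errorf("wait: %w", err)
	}
	return nil
}

// demo runs a harmless child process and returns the PID that the setup
// hook was given. With failSetup set, the hook reports an error instead.
func demo(failSetup bool) (int, error) {
	seen := 0
	err := runWithHostSetup(exec.Command("true"), func(pid int) error {
		seen = pid
		if failSetup {
			return errors.New("simulated setup failure")
		}
		// This is where a tool such as netsetgo would be invoked.
		return nil
	})
	return seen, err
}

func main() {
	pid, err := demo(false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("host-side setup ran for child pid", pid)
}
```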
Let’s consider the network to be ready once a veth interface has appeared in the new Network namespace. We can use a simple for loop to wait until this is true, as follows.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
# Filename: net.go
func waitForNetwork() error {
	maxWait := time.Second * 3
	checkInterval := time.Second
	timeStarted := time.Now()

	for {
		interfaces, err := net.Interfaces()
		if err != nil {
			return err
		}

		// pretty basic check ...
		// > 1 as a lo device will already exist
		if len(interfaces) > 1 {
			return nil
		}

		if time.Since(timeStarted) > maxWait {
			return fmt.Errorf("Timeout after %s waiting for network", maxWait)
		}

		time.Sleep(checkInterval)
	}
}
Here we have a very basic for loop which blocks until either more than one network interface is reported or a timeout of 3 seconds is reached. As the comment mentions, we check for more than one interface as the loopback interface will already exist by default.
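If you wanted something a little stricter than counting interfaces, one option is to split the polling from the condition being polled. The following is a sketch under my own assumptions, not code from ns-process: waitFor, hasUsableInterface, systemInterfaces and the iface type are hypothetical names, and whether to also require the interface to be up depends on when the configuring tool brings the link up.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

var errTimeout = errors.New("timed out waiting for condition")

// iface is a trimmed-down view of net.Interface that is easy to fake.
type iface struct {
	Name     string
	Loopback bool
	Up       bool
}

// hasUsableInterface reports whether any non-loopback interface is up.
func hasUsableInterface(ifaces []iface) bool {
	for _, i := range ifaces {
		if !i.Loopback && i.Up {
			return true
		}
	}
	return false
}

// systemInterfaces converts the result of net.Interfaces into ifaces.
func systemInterfaces() ([]iface, error) {
	raw, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	out := make([]iface, 0, len(raw))
	for _, r := range raw {
		out = append(out, iface{
			Name:     r.Name,
			Loopback: r.Flags&net.FlagLoopback != 0,
			Up:       r.Flags&net.FlagUp != 0,
		})
	}
	return out, nil
}

// waitFor polls cond every interval until it returns true, returns an
// error, or maxWait has elapsed.
func waitFor(cond func() (bool, error), maxWait, interval time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for {
		ok, err := cond()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	err := waitFor(func() (bool, error) {
		ifaces, err := systemInterfaces()
		return hasUsableInterface(ifaces), err
	}, 3*time.Second, 100*time.Millisecond)
	fmt.Println("network ready:", err == nil)
}
```

Keeping the condition separate also makes the timeout logic testable without touching real network interfaces.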
Finally, let’s update nsInitialisation to call waitForNetwork.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
# Filename: ns_process.go
func nsInitialisation() {
	newrootPath := os.Args[1]
	if err := mountProc(newrootPath); err != nil {
		fmt.Printf("Error mounting /proc - %s\n", err)
		os.Exit(1)
	}
	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}
	if err := waitForNetwork(); err != nil {
		fmt.Printf("Error waiting for network - %s\n", err)
		os.Exit(1)
	}
	nsRun()
}
With all that in place, let’s run the updated Go program.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
$ go build
$ ./ns-process
-[ns-process]- # ifconfig
veth1 Link encap:Ethernet HWaddr 6A:DD:B4:30:1A:49
inet addr:10.10.10.2 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::68dd:b4ff:fe30:1a49/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18 errors:0 dropped:0 overruns:0 frame:0
TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2359 (2.3 KiB) TX bytes:578 (578.0 B)
-[ns-process]- # route
Kernel IP routing table
Destination Gateway Genmask ... Iface
default 10.10.10.1 0.0.0.0 ... veth1
10.10.10.0 * 255.255.255.0 ... veth1
-[ns-process]- # ping 10.10.10.1
PING 10.10.10.1 (10.10.10.1): 56 data bytes
64 bytes from 10.10.10.1: seq=0 ttl=64 time=0.098 ms
^C
--- 10.10.10.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.098/0.098/0.098 ms
Much better! We now have a network interface veth1 available and a routable IP address of 10.10.10.2.
☁️ Internet connectivity
Enabling Internet access for ns-process is a little out of scope for this particular article. This is mostly because a lack of Internet connectivity could be the result of any number of things, and attempting to cover all environmental setups would be pretty difficult.
Having said that, the following steps do enable Internet connectivity for ns-process on my generic Ubuntu 16.04 Xenial machine. There’s no guarantee this will work for you, but feel free to try it out if you’re interested.
First up we need to configure a few iptables rules on the host.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
$ sudo iptables -tnat -N netsetgo
$ sudo iptables -tnat -A PREROUTING -m addrtype --dst-type LOCAL -j netsetgo
$ sudo iptables -tnat -A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j netsetgo
$ sudo iptables -tnat -A POSTROUTING -s 10.10.10.0/24 ! -o brg0 -j MASQUERADE
$ sudo iptables -tnat -A netsetgo -i brg0 -j RETURN
And then we also need to add a DNS nameserver for the namespaced process.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
$ go build
$ ./ns-process
-[ns-process]- # echo "nameserver 8.8.8.8" >> /etc/resolv.conf
-[ns-process]- # ping google.com
PING google.com (172.217.23.14): 56 data bytes
64 bytes from 172.217.23.14: seq=0 ttl=51 time=4.766 ms
And there we have it - ns-process running with full Internet connectivity.
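Rather than echoing the nameserver by hand each time, the resolv.conf file could in principle be written into the root filesystem before pivoting. The following is only a sketch of that idea and is not part of ns-process: resolvConf and writeResolvConf are hypothetical helpers, the code assumes Go 1.16 or later, and note that it overwrites any existing file rather than appending as the echo above does.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolvConf renders one "nameserver" line per address.
func resolvConf(nameservers []string) string {
	var b strings.Builder
	for _, ns := range nameservers {
		fmt.Fprintf(&b, "nameserver %s\n", ns)
	}
	return b.String()
}

// writeResolvConf writes etc/resolv.conf under rootfs, creating the etc
// directory if needed and replacing any existing file.
func writeResolvConf(rootfs string, nameservers []string) error {
	etc := filepath.Join(rootfs, "etc")
	if err := os.MkdirAll(etc, 0755); err != nil {
		return err
	}
	path := filepath.Join(etc, "resolv.conf")
	return os.WriteFile(path, []byte(resolvConf(nameservers)), 0644)
}

func main() {
	// Use a throwaway directory in place of a real rootfs.
	rootfs, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		fmt.Println("could not create temp dir:", err)
		return
	}
	defer os.RemoveAll(rootfs)

	if err := writeResolvConf(rootfs, []string{"8.8.8.8"}); err != nil {
		fmt.Println("could not write resolv.conf:", err)
		return
	}
	data, err := os.ReadFile(filepath.Join(rootfs, "etc", "resolv.conf"))
	if err != nil {
		fmt.Println("could not read resolv.conf:", err)
		return
	}
	fmt.Print(string(data))
}
```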
📺 On the next…
With network configuration complete, ns-process is now set up to configure the User, Mount, PID and Network namespaces, but what needs to be done about the remaining namespaces? The answer to this and plenty more coming up, stay tuned…
Update: Part 7, “Namespaces in Go - UTS” has been published and is available here.
In the previous article we configured the Network namespace to provide ns-process with a routable IP address. Now that ns-process is able to join a network, it’d be a good idea to make sure it starts up with a unique hostname. In this article (the last in the series) we will configure the UTS namespace to make this so. Let’s start, as always, by reviewing the current behaviour.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 5.0
$ hostname
ubuntu-xenial
$ go build
$ ./ns-process
-[ns-process]- # hostname
ubuntu-xenial
The hostname reported inside the namespaced /bin/sh process is the same as the hostname reported on the host. Obviously this isn’t ideal and could lead to confusion further down the line.
Fortunately the fix for this is pretty simple (much easier than the network setup from before) so let’s jump straight in.
👉 Let’s Go
In Go, the hostname can be set via the Sethostname func from the syscall package.
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 6.0
# Filename: ns_process.go
func nsInitialisation() {
	newrootPath := os.Args[1]
	if err := mountProc(newrootPath); err != nil {
		fmt.Printf("Error mounting /proc - %s\n", err)
		os.Exit(1)
	}
	if err := pivotRoot(newrootPath); err != nil {
		fmt.Printf("Error running pivot_root - %s\n", err)
		os.Exit(1)
	}
	if err := syscall.Sethostname([]byte("ns-process")); err != nil {
		fmt.Printf("Error setting hostname - %s\n", err)
		os.Exit(1)
	}
	if err := waitForNetwork(); err != nil {
		fmt.Printf("Error waiting for network - %s\n", err)
		os.Exit(1)
	}
	nsRun()
}
The call to Sethostname occurs just before the wait for the network. As you can see, the hostname has been hardcoded to ns-process here. Most container implementations today set the hostname to the ID/name of the container, which is usually some random UUID by default.
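For illustration, here is one way a random identifier could be generated for use as the hostname. It is a sketch only: randomID is a hypothetical helper, and it produces a short random hex string rather than a proper UUID. The resulting string would be passed to syscall.Sethostname in place of the hardcoded name.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomID returns a random identifier made of nBytes random bytes,
// encoded as 2*nBytes lowercase hex characters.
func randomID(nBytes int) (string, error) {
	buf := make([]byte, nBytes)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	id, err := randomID(6)
	if err != nil {
		fmt.Println("could not generate ID:", err)
		return
	}
	// Inside nsInitialisation this value could replace the hardcoded
	// name, e.g. syscall.Sethostname([]byte(id)).
	fmt.Println("hostname would be:", id)
}
```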
And that’s really all there is to it! Let’s confirm our implementation works as expected.
💁 The following has been tested on Ubuntu 16.04 Xenial with Go 1.7.1
# Git repo: https://github.com/teddyking/ns-process
# Git tag: 6.0
$ hostname
ubuntu-xenial
$ go build
$ ./ns-process
-[ns-process]- # hostname
ns-process
Perfect!
🎬 That’s a wrap
That’s all for this particular series of articles! Many congratulations on making it to the end. You should now be fully equipped to head out into container land to write your very own Docker. I hope you’ve had fun and have maybe learnt a little bit about Linux namespaces in Go in the process.
If you’ve got any feedback, questions or rants you’d like to send my way you can find me over on twitter as edking2 (damn you edking and edking1!).
📺 Epilogue
More astute readers may have noticed that in publishing the last article in this series I’ve totally ignored 2 of the 7 namespaces - IPC and Cgroup. This isn’t an oversight, rather that I’ve never actually had to configure these two myself. The IPC namespace seems to Just Work™ and the Cgroup namespace is so new that I just haven’t got round to playing with it yet. Besides, I need to save some material for season 2…
WRITTEN BY
Ed King
A Software Engineer currently working with Cloud Foundry and Kubernetes.
ref:
- Linux namespace in Go - Part 1, UTS and PID
- Linux namespace in Go - Part 2, UID and Mount
- Linux namespace in Go - Part 3, Cgroups resource limit
28. put a program in jail (a series of Linux articles)