Worked investigation - Build a container by hand

Companion to Linux Kernel -> Month 03 -> Week 9-11: Namespaces, Cgroups. The chapter explains that containers are "just namespaces + cgroups." This page makes you build one without Docker: create the isolation primitives one at a time and watch a process lose sight of the rest of the system. By the end, "a container is not a thing, it's a process with a restricted view" stops being a slogan and becomes something you've done with your hands. Allow about 40 minutes on any Linux box where you have root.

The symptom you're learning to understand

"What actually is a container?" People say "a lightweight VM" (wrong - there's no second kernel) or "an image" (that's just the filesystem) or wave at Docker as if it's magic. Engineers who can't answer this precisely make bad decisions about security, networking, and debugging. You're going to dispel the magic by building the isolation by hand, primitive by primitive.

Step 0: the one fact

A container is a normal Linux process that the kernel has given a restricted view of the system, using two unrelated features:

Namespaces - control what a process can see: which PIDs, which network interfaces, which mounts, which hostname. Each namespace type isolates one kind of resource.
Cgroups - control what a process can use: how much CPU, memory, I/O (the OOM and scheduler investigations).

That's it. Docker, Podman, Kubernetes are orchestration and packaging on top of these kernel primitives. Strip them away and a container is clone() with some flags plus a cgroup. We'll do exactly that with unshare (the command-line front-end to the namespace syscalls).
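
To make "clone() with some flags" concrete, here is a small sketch that maps each unshare option to the kernel flag it requests. The helper name clone_flag_for is invented for illustration; the CLONE_NEW* names are the ones documented in namespaces(7).

```shell
# Map an unshare(1) option to the CLONE_NEW* flag it asks the kernel for.
# clone_flag_for is a hypothetical helper name, not part of util-linux.
clone_flag_for() {
    case "$1" in
        --mount)  echo CLONE_NEWNS ;;      # mount namespace (the oldest one, hence the generic name)
        --uts)    echo CLONE_NEWUTS ;;
        --ipc)    echo CLONE_NEWIPC ;;
        --net)    echo CLONE_NEWNET ;;
        --pid)    echo CLONE_NEWPID ;;
        --user)   echo CLONE_NEWUSER ;;
        --cgroup) echo CLONE_NEWCGROUP ;;
        *)        echo "unknown option: $1" >&2; return 1 ;;
    esac
}

# The four options this page combines in Step 5:
for opt in --pid --net --uts --mount; do
    printf '%-8s -> %s\n' "$opt" "$(clone_flag_for "$opt")"
done
```

Each line of output names one flag; a runtime ORs the same flags together in a single clone() or unshare() call.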

Step 1: a process that can't see other processes (PID namespace)

Normally every process sees every other process. Watch us take that away:

$ ps aux | wc -l
312                              # this shell sees 312 processes - the whole system

$ sudo unshare --pid --fork --mount-proc bash    # new PID namespace + fresh /proc
# now inside the new namespace:
$ ps aux | wc -l
3                               # sees only 3 processes!
$ echo $$
1                               # this bash is PID 1 in its namespace
$ ps aux
USER  PID  ...  COMMAND
root    1  ...  bash            # we ARE init now
root    9  ...  ps aux

Inside, ps shows 3 processes, not 312. This shell believes it is PID 1 - the init process - because the PID namespace gives it a private PID numbering starting at 1. The other 309 processes still exist; this process simply can't see them. That's isolation: not removal, restricted view.

The --mount-proc matters: ps reads /proc, so we gave the namespace a fresh /proc reflecting only its own processes. Without it, ps would read the host's /proc and the illusion would break - a real lesson in how isolation is layered.

Exit (exit) and confirm you're back to seeing all 312. The host never changed.
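
The "PID 1 inside, ordinary PID outside" duality is visible from the host in /proc/<pid>/status: its NSpid: line lists the process's PID in each nested PID namespace, outermost first (Linux 4.1 and later). A minimal sketch of reading it follows; the helper names and the sample host PID 48213 are made up for illustration.

```shell
# Both helpers read a /proc/<pid>/status file on stdin.
# The NSpid: line holds one PID per nested PID namespace, outermost first.
host_pid()      { awk '$1 == "NSpid:" { print $2 }'; }    # PID as the host sees it
container_pid() { awk '$1 == "NSpid:" { print $NF }'; }   # PID inside the innermost namespace

# Sample line for a shell started with `unshare --pid --fork`
# (48213 is an invented host PID):
printf 'NSpid:\t48213\t1\n' | host_pid        # prints 48213
printf 'NSpid:\t48213\t1\n' | container_pid   # prints 1
```

On a real system, feed it the file directly from the host, for example container_pid < /proc/<pid>/status with the PID of your namespaced shell.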

Step 2: a process with its own network (network namespace)

A network namespace gives a process its own interfaces, routes, and firewall - the basis of every container's networking:

$ ip addr | grep -c '^[0-9]'        # host: several interfaces (lo, eth0, docker0...)
4
$ sudo unshare --net bash
# inside the new network namespace:
$ ip addr
1: lo: <LOOPBACK> mtu 65536 ...      # ONLY loopback, and it's DOWN
$ ping -c1 8.8.8.8
ping: connect: Network is unreachable     # no route, no connectivity at all

Inside, there is exactly one interface (loopback, down) and no connectivity - a completely empty network world. This is why a fresh container can't reach anything until the runtime wires up a virtual ethernet pair (veth) bridging the namespace to the host - the plumbing Docker's docker0 bridge and Kubernetes' CNI (the Cilium investigation) do for you. You're seeing the blank canvas they start from. Exit to return to the host's network.
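
For a feel of the plumbing a runtime does, here is a hedged sketch that prints (but does not run) the commands to connect a network namespace to the host with a veth pair. The interface names veth0/veth1 and the 10.200.0.0/24 subnet are arbitrary choices for this sketch, and veth_plan is an invented helper name; the ip and nsenter invocations themselves are standard.

```shell
# Print the commands that wire a network namespace to the host with a veth pair.
# Nothing is executed: review the output, then pipe it to `sudo sh` to apply it.
# $1 is the host PID of the namespaced shell.
# Assumptions: veth0/veth1 and 10.200.0.0/24 are arbitrary picks for this sketch.
veth_plan() {
    pid=$1
    cat <<EOF
ip link add veth0 type veth peer name veth1
ip link set veth1 netns $pid
ip addr add 10.200.0.1/24 dev veth0
ip link set veth0 up
nsenter --target $pid --net ip addr add 10.200.0.2/24 dev veth1
nsenter --target $pid --net ip link set veth1 up
nsenter --target $pid --net ip link set lo up
EOF
}

veth_plan 12345    # 12345 stands in for the host PID of your namespaced shell
```

Once applied, the namespaced shell should be able to ping 10.200.0.1 (the host end of the pair). Reaching the internet would additionally need routing and NAT on the host, which is the part the bridge and CNI plugins automate.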

Step 3: a process with its own hostname (UTS namespace)

Small but illustrative - the namespace that lets each container have its own hostname:

$ hostname
my-laptop
$ sudo unshare --uts bash
$ hostname container-01           # change it INSIDE
$ hostname
container-01                      # the namespace sees the new name
# (in another terminal, the host still says my-laptop - unchanged)
$ exit
$ hostname
my-laptop                        # host hostname never changed

You changed "the hostname" and the host was unaffected, because the UTS namespace gave this process a private copy. This is how docker run --hostname works.

Step 4: a process with its own root filesystem (mount namespace + chroot)

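Filesystem isolation is the part people mistake for "the container." A mount namespace plus a changed root directory gives a process a completely different /. We use chroot here because it is the simplest way to change the root; production runtimes use the stricter pivot_root syscall for the same job: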

# Get a minimal root filesystem (a few MB):
$ mkdir /tmp/newroot && cd /tmp/newroot
$ docker export $(docker create alpine) | tar -xf -    # borrow alpine's rootfs (or debootstrap)
# (no docker? use `debootstrap` or busybox; the point is a directory tree with /bin, /lib, etc.)

$ sudo unshare --mount --pid --fork chroot /tmp/newroot /bin/sh
# inside: a totally different filesystem
/ # ls /
bin  etc  lib  proc  sys  usr        # alpine's files, NOT your host's
/ # cat /etc/os-release
NAME="Alpine Linux"                  # we're "in" alpine - on your kernel
/ # ls /home                         # your host's /home is invisible
(empty)

The process now sees alpine's filesystem as /. Your host's files are gone from its view. But run uname -r inside: it prints your host's kernel version. That's the whole difference from a VM: the container shares the host kernel; only the filesystem and namespaces differ. "An image" is essentially this rootfs directory, tarred up (real images split it into layer tarballs and add a small config file). Now you know what docker pull actually downloads.
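
Before pointing chroot at a directory, it helps to check that it really is a rootfs. The sketch below is one rough way to do that, assuming a shell at bin/sh is the only requirement; looks_like_rootfs is an invented helper name.

```shell
# Return 0 if DIR looks like a root filesystem that chroot could start a shell in.
# Only bin/sh is checked. The -L test matters because in Alpine /bin/sh is
# typically a symlink to /bin/busybox, an absolute target that does not resolve
# from outside the chroot, so -e alone would report it as missing.
looks_like_rootfs() {
    [ -d "$1" ] && { [ -e "$1/bin/sh" ] || [ -L "$1/bin/sh" ]; }
}

if looks_like_rootfs /tmp/newroot; then
    echo "ok: /tmp/newroot has bin/sh"
else
    echo "not a usable rootfs: /tmp/newroot"
fi
```

A failure here usually means the tar extraction did not run or ran in the wrong directory.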

Step 5: cap what it can USE (cgroup) - the other half

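Namespaces restrict view; cgroups restrict resources. Combine them and you have a real container. Put the namespaced process in a memory- and CPU-limited cgroup (from the OOM and scheduler investigations). The commands assume cgroup v2 mounted at /sys/fs/cgroup with the memory and cpu controllers enabled for child groups (they must appear in /sys/fs/cgroup/cgroup.subtree_control, which is usually the case on systemd-based distributions):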

$ sudo mkdir /sys/fs/cgroup/handmade
$ echo "256M" | sudo tee /sys/fs/cgroup/handmade/memory.max
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/handmade/cpu.max    # 50% of one core

# launch the contained process and drop it into the cgroup:
$ sudo unshare --pid --net --uts --mount --fork bash -c '
    echo $$ > /sys/fs/cgroup/handmade/cgroup.procs
    exec bash'
# this shell is now: isolated (PID/net/uts/mount) AND capped (256M RAM, 50% CPU)

You have now built, by hand, the isolation core of what docker run --memory=256m --cpus=0.5 alpine sets up: a process with a private PID space, private network, private hostname, private mount table, and hard resource limits. Add chroot /tmp/newroot from Step 4 to the command and it gets alpine's root filesystem as well. No Docker daemon involved. The "magic" was four unshare flags and a cgroup.
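
The two values written in this step line up with the Docker flags by simple arithmetic. The sketch below shows the conversion; the helper names are invented, and the 100000 microsecond period is the one used in the cpu.max line above.

```shell
# Translate Docker-style limits into the strings written to cgroup v2 files.
# cpu_max_for and mem_bytes are hypothetical helper names for this sketch.

# cpu_max_for CPUS: quota and period in microseconds, period fixed at 100000.
cpu_max_for() {
    awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

# mem_bytes SIZE: size in bytes; K, M and G suffixes are powers of 1024.
mem_bytes() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                              # numeric prefix, e.g. 256 from "256M"
        u = toupper(substr(v, length(v)))      # last character is the unit, if any
        if (u == "K") n *= 1024
        else if (u == "M") n *= 1024 * 1024
        else if (u == "G") n *= 1024 * 1024 * 1024
        printf "%.0f\n", n
    }'
}

cpu_max_for 0.5     # prints: 50000 100000
mem_bytes 256M      # prints: 268435456
```

After writing "256M" to memory.max, reading the file back shows the byte count, which is a quick way to confirm the limit took effect.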

What's healthy vs what people get wrong

ACCURATE mental model:                   WRONG mental models people hold:
"a process with a restricted view        "a lightweight VM"  (no - shares the kernel)
 (namespaces) and capped resources       "an image"          (no - that's just the rootfs)
 (cgroups), on the host kernel"          "magic from Docker" (no - kernel primitives)

The accurate model has real consequences: because the kernel is shared, a kernel exploit escapes every container (why you also need seccomp/AppArmor - the Container Internals security month); because it's just a process, you can strace it, nsenter into its namespaces, and see it in the host's process list (it's PID 1 inside, but a normal PID outside). Knowing it's a process, not a box, is what lets you debug it.

Clean up

$ exit                                          # leave the namespaces
$ sudo rmdir /sys/fs/cgroup/handmade            # remove the cgroup
$ rm -rf /tmp/newroot

Now you do it

PID namespace: run sudo unshare --pid --fork --mount-proc bash, then ps aux, and confirm you see ~3 processes and that you're PID 1. Then repeat without --mount-proc and watch the illusion break: ps reads the host's /proc and lists every host process again. Understand why the flag matters.
Net namespace: run sudo unshare --net bash and confirm there is only a down loopback and no connectivity. (Bonus: research ip link add veth0 type veth peer name veth1 to wire it to the host - that's container networking from scratch.)
From the host, find your namespaced shell's PID and run sudo ls -l /proc/<pid>/ns/ - you'll see the namespace handles (pid, net, mnt, uts). Compare with your normal shell's - different inode numbers = different namespaces. This is how the kernel tracks them.
From the host, run sudo nsenter --target <pid> --pid --net bash to enter a running container's namespaces - the same setns()-based mechanism that docker exec and kubectl exec use under the hood.
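
The inode comparison in the third exercise can be scripted. This is a sketch under the assumption that comparing the link targets of /proc/<pid>/ns/<type> is enough; ns_inode and same_ns are invented helper names, and the inode shown is only a sample value.

```shell
# Extract the inode from a namespace link target such as "pid:[4026531836]".
ns_inode() {
    printf '%s\n' "$1" | sed 's/^[a-z_]*:\[\([0-9]*\)]$/\1/'
}

# same_ns TYPE PID_A PID_B: succeed if both processes share that namespace.
# Returns 2 if a link cannot be read (reading another user's /proc/<pid>/ns
# needs root, hence the sudo in the exercise).
same_ns() {
    a=$(readlink "/proc/$2/ns/$1") || return 2
    b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

ns_inode 'pid:[4026531836]'    # prints 4026531836
```

From a root shell on the host, same_ns pid $$ <pid> should fail for your namespaced shell's PID and succeed for any ordinary host process.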

What you might wonder

"Is this actually how Docker works?" Yes, fundamentally. Docker/containerd call clone() with namespace flags (the syscall behind unshare), set up cgroups, pivot to the image's rootfs, apply a seccomp filter and capabilities (Container Internals security), and configure veth networking. They add image management, layering (OverlayFS), and a daemon/API - but the isolation core is exactly what you just did by hand.

"User namespaces - why didn't we use one?" The user namespace (--user) maps UIDs so a process can be "root" inside but unprivileged outside - the basis of rootless containers and a major security boundary. It's the trickiest namespace (UID mapping setup) so we skipped it here, but it's why modern rootless Podman doesn't need root. Worth a follow-up once the others are solid.

"If it's just a process, how isolated is it really?" Less than a VM. Namespaces + cgroups + seccomp + capabilities + LSM (AppArmor/SELinux) together make container escape hard, but they all share one kernel - one kernel vuln can breach all of it. For hostile multi-tenant workloads, people add a second boundary: gVisor (a user-space kernel) or Kata (a real lightweight VM per container). The Container Internals path covers this; the key insight from this page is why it's needed - shared kernel.

"Why is my container showing host CPU/memory counts?" Same cgroup-awareness issue as the page-cache and OOM investigations - tools reading host /proc instead of the cgroup. The namespace isolates the process view but /proc/cpuinfo etc. aren't fully namespaced; cgroup-aware tools read /sys/fs/cgroup/.../. Now you understand the layering well enough to know where to look.

What this gave you

You built a container by hand: PID, network, UTS, and mount namespaces plus a resource-capped cgroup, no Docker.
You watched a process lose sight of other processes, the network, and the host filesystem - isolation as restricted view, not removal.
You know an "image" is just a rootfs directory and a container shares the host kernel (the real VM distinction).
You can enter a running container's namespaces with nsenter (how docker exec works) and inspect them via /proc/<pid>/ns/.
You understand why shared-kernel isolation needs seccomp/caps/LSM and when gVisor/Kata are warranted.

Back to the Namespaces & cgroups month, or on to the bpftrace investigation.