Build a Container Engine from Scratch with Linux Primitives

If you've used Docker for any length of time, you've probably internalized the vocabulary. Images. Containers. Dockerfiles. docker run -d --name my-app nginx. It works. You ship code. Everyone's happy.

But here's a question: what actually happens when you type docker run?

Most people answer with something about "lightweight VMs" or "process isolation." The second is closer to the truth, but neither tells you much. The real answer involves several Linux kernel primitives that have existed for over a decade. Docker didn't invent containers. It packaged them.

In this post, I'm going to walk you through building a minimal container engine using raw Linux primitives. No Docker. No Podman. No LXC. Just the kernel, the Go toolchain, and a few short Go programs. By the end, you'll understand isolation at the syscall level, and you'll never look at docker run the same way again.

I won't be building something production-ready. This is a learning tool. The goal is understanding, not shipping. If you want a real container runtime after this, go read the runC source. You'll actually understand it.

What a Container Actually Is

Before we touch a terminal, let's define what we're building. A container is a process with a lie told to it by the kernel.

That lie comes in three parts:

Visibility. The process can only see what the kernel lets it see — other processes, network interfaces, files, user IDs. This is namespaces.
Resources. The process can only consume what the kernel allocates — CPU shares, memory limits, I/O bandwidth. This is cgroups.
Filesystem. The process thinks it has a root filesystem, but it's actually looking at a directory (or a stack of directory layers) we prepared. This is chroot/pivot_root plus a union filesystem.

That's it. Three lies. Together they create the illusion of a dedicated machine. Docker's entire value proposition — the CLI, the registry, the Dockerfile format, the networking bridge — sits on top of these three primitives. Everything else is tooling.

Lie #1: Namespaces — Controlling What a Process Can See

Linux namespaces partition kernel resources so one set of processes sees a different reality from another. Modern Linux has eight namespace types: mount, PID, network, IPC, UTS, user, cgroup, and time. Containers typically rely most heavily on mount, PID, network, IPC, UTS, and user namespaces.

The Six Namespaces That Matter

Namespace	What It Isolates	Clone Flag	Why Containers Need It
PID	Process IDs	`CLONE_NEWPID`	So PID 1 inside the container isn't the host's init process
Mount	Filesystem mount points	`CLONE_NEWNS`	So the container has its own `/proc`, `/sys`, and root filesystem
Network	Network interfaces, routing tables, firewall rules	`CLONE_NEWNET`	So the container gets its own `eth0` and IP address
UTS	Hostname and domain name	`CLONE_NEWUTS`	So `hostname` inside the container returns something different
IPC	SysV message queues, shared memory, semaphores	`CLONE_NEWIPC`	So processes can't read shared memory segments from the host
User	UID/GID mappings	`CLONE_NEWUSER`	So root inside the container is unprivileged outside

[Figure: Linux namespace isolation. The PID namespace (container sees PID 1), UTS namespace (hostname my-container), and network namespace (separate eth0) are shown as separate kernel partitions.]
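The User row of the table is the least intuitive one, so here is a minimal sketch of it. It assumes your kernel allows unprivileged user namespaces (some distributions disable them), and the helper name userNsAttr is made up for this example. It maps UID and GID 0 inside the namespace to your ordinary IDs outside, so the child believes it is root without any sudo.

```go
package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

// userNsAttr (hypothetical helper) builds process attributes that place a child
// in new user and UTS namespaces, mapping UID/GID 0 inside the namespace
// to the given unprivileged IDs on the host.
func userNsAttr(hostUID, hostGID int) *syscall.SysProcAttr {
    return &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWUTS,
        UidMappings: []syscall.SysProcIDMap{
            {ContainerID: 0, HostID: hostUID, Size: 1},
        },
        GidMappings: []syscall.SysProcIDMap{
            {ContainerID: 0, HostID: hostGID, Size: 1},
        },
    }
}

func main() {
    // `id -u` runs inside the new user namespace, where our UID is mapped to 0.
    cmd := exec.Command("id", "-u")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    cmd.SysProcAttr = userNsAttr(os.Getuid(), os.Getgid())
    if err := cmd.Run(); err != nil {
        // Hosts that forbid unprivileged user namespaces end up here.
        fmt.Fprintf(os.Stderr, "could not create user namespace: %v\n", err)
    }
}
```

The rest of this post sticks with sudo to keep the examples short, but this mapping is the mechanism behind rootless containers.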

The PID namespace is the one that surprises people most. When you create a new PID namespace, the first process inside gets PID 1 and takes on init's duties for that namespace, including adopting orphaned processes. If that process exits, the kernel sends SIGKILL to every other process in the namespace. This is why Docker containers stop when the main process exits.

Here's the simplest possible namespace demonstration. Save this as ns_demo.go:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Usage: ns_demo <command> [args...]")
        os.Exit(1)
    }

    cmd := exec.Command(os.Args[1], os.Args[2:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    // Run the command in new namespaces
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS |  // new hostname
            syscall.CLONE_NEWPID |          // new PID space
            syscall.CLONE_NEWNS,            // new mount space
    }

    if err := cmd.Run(); err != nil {
        fmt.Fprintf(os.Stderr, "Error: %v\n", err)
        os.Exit(1)
    }
}

Build it, then run it with a shell:

go build -o ns_demo ns_demo.go
sudo ./ns_demo /bin/bash

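Inside this shell, run hostname my-container and then hostname. You'll see my-container. Now open another terminal and run hostname there: it still shows your real hostname, because the UTS namespace isolated the change. Back in the demo shell, run echo $$ and you'll see 1. The PID namespace gave the shell its own process tree. (ps will still list host processes for now, because /proc is still the host's; the full engine later in this post mounts a fresh /proc to fix that.)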

You're already running a container. A terrible one with no filesystem isolation and no resource limits, but a container nonetheless. This is what Docker does under the hood — it just wraps it with a lot of tooling.
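To confirm from the host that the demo shell really lives in different namespaces, you can compare the links under /proc/<pid>/ns/. Two processes share a namespace exactly when the inode numbers in those links match. The sketch below is one way to print them; parseNsLink is a name invented for this example.

```go
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// parseNsLink splits a /proc/<pid>/ns/* link target such as "uts:[4026531838]"
// into the namespace type and its inode number.
func parseNsLink(target string) (string, uint64, error) {
    kind, rest, ok := strings.Cut(target, ":[")
    if !ok || !strings.HasSuffix(rest, "]") {
        return "", 0, fmt.Errorf("unexpected ns link format: %q", target)
    }
    inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
    if err != nil {
        return "", 0, err
    }
    return kind, inode, nil
}

func main() {
    // Defaults to the current process; pass a PID to inspect another one
    // (reading another user's process usually needs root).
    pid := "self"
    if len(os.Args) > 1 {
        pid = os.Args[1]
    }
    for _, ns := range []string{"uts", "pid", "mnt", "net", "ipc", "user"} {
        target, err := os.Readlink("/proc/" + pid + "/ns/" + ns)
        if err != nil {
            fmt.Printf("%-4s unavailable: %v\n", ns, err)
            continue
        }
        kind, inode, err := parseNsLink(target)
        if err != nil {
            fmt.Printf("%-4s %v\n", ns, err)
            continue
        }
        fmt.Printf("%-4s inode %d\n", kind, inode)
    }
}
```

Run it once with no argument and once with the host-side PID of the ns_demo shell. With the flags used above, the uts, pid, and mnt inodes should differ while net, ipc, and user stay the same.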

Lie #2: cgroups — Controlling What a Process Can Use

Namespaces control visibility. cgroups (control groups) control consumption. Without cgroups, a container process can eat 100% of the host CPU, fill all available RAM, and saturate the disk. Namespaces won't stop it.

cgroups are organized as a filesystem, typically mounted at /sys/fs/cgroup. In cgroup v1, controllers such as CPU, memory, blkio, and pids often had separate hierarchies. In cgroup v2, they live under a unified hierarchy. Creating a new cgroup means making a directory. Setting limits means writing to files inside that directory.

The cgroup v2 interface, stable since Linux 4.5 and now the default on most modern distributions, is the one we'll use here. One detail matters in practice: a controller's files only appear in a child cgroup if the parent lists that controller in its cgroup.subtree_control file.

Note: these commands assume a simple cgroup v2 setup where root can create and write to a child cgroup under /sys/fs/cgroup. On many systemd-managed distributions, direct writes may fail unless the process is running inside a delegated cgroup or scope.
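Before writing any limits, it helps to know whether the host actually exposes a cgroup v2 hierarchy and which controllers it offers. A rough check, assuming the conventional mount point /sys/fs/cgroup, is to read cgroup.controllers at the root; parseControllers is a helper name made up for this sketch.

```go
package main

import (
    "fmt"
    "os"
    "strings"
)

// parseControllers turns the contents of cgroup.controllers
// (for example "cpuset cpu io memory pids\n") into a set of names.
func parseControllers(contents string) map[string]bool {
    set := make(map[string]bool)
    for _, name := range strings.Fields(contents) {
        set[name] = true
    }
    return set
}

func main() {
    // On a cgroup v2 (unified) host this file exists at the root of the hierarchy.
    data, err := os.ReadFile("/sys/fs/cgroup/cgroup.controllers")
    if err != nil {
        fmt.Println("no cgroup v2 hierarchy at /sys/fs/cgroup:", err)
        return
    }
    available := parseControllers(string(data))
    for _, want := range []string{"cpu", "memory", "pids"} {
        fmt.Printf("%-6s available: %v\n", want, available[want])
    }
}
```

If a controller shows up here but its files are missing in your child cgroup, the parent's cgroup.subtree_control is the next place to look.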

Here's how you'd limit a container to 100MB of RAM and half a CPU core:

# Create a new cgroup for the container
CGROUP=/sys/fs/cgroup/my-container
sudo mkdir -p "$CGROUP"

# Make sure the controllers we need are enabled for child cgroups
echo "+cpu +memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Limit memory to 100 MiB (value is in bytes)
echo "104857600" | sudo tee "$CGROUP/memory.max"

# Limit CPU to 50% of one core (50000 microseconds per 100ms period)
echo "50000 100000" | sudo tee "$CGROUP/cpu.max"

# Limit to 10 PIDs (prevents fork bombs inside the container)
echo "10" | sudo tee "$CGROUP/pids.max"

# Move the container process into the cgroup
# (CONTAINER_PID is the host-side PID of the process you want to confine)
echo "$CONTAINER_PID" | sudo tee "$CGROUP/cgroup.procs"

[Figure: cgroup v2 filesystem tree for /sys/fs/cgroup/my-container/, showing memory.max (104857600), cpu.max (50000 100000), pids.max (10), and cgroup.procs with the container PID.]

On a compatible cgroup v2 setup, those writes are enough to limit the container to 100MB of RAM, 50% of one CPU core, and a maximum of 10 processes. Docker's --memory, --cpus, and --pids-limit flags map directly to these files.
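The arithmetic behind those values is simple enough to sketch. The helpers below are illustrative only (cpuMax and memoryMax are invented names, and this is not Docker's actual code): a fractional CPU count becomes a quota over a fixed period, and a size in MiB becomes a byte count.

```go
package main

import (
    "fmt"
    "math"
)

const defaultPeriodUs = 100000 // 100 ms, the period used in the example above

// cpuMax renders a fractional CPU count as a cpu.max value: "<quota> <period>".
func cpuMax(cpus float64, periodUs int) string {
    quota := int(math.Round(cpus * float64(periodUs)))
    return fmt.Sprintf("%d %d", quota, periodUs)
}

// memoryMax renders a size in MiB as the byte count memory.max expects.
func memoryMax(mib int) string {
    return fmt.Sprintf("%d", mib*1024*1024)
}

func main() {
    fmt.Println("cpu.max    =", cpuMax(0.5, defaultPeriodUs)) // half a core
    fmt.Println("memory.max =", memoryMax(100))               // 100 MiB
    fmt.Println("cpu.max    =", cpuMax(2, defaultPeriodUs))   // two full cores
}
```

A quota larger than the period is how cpu.max expresses more than one core.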

The kernel does all the enforcement. A real container engine would create the cgroup, write the limits, and then start the container process with its PID added to cgroup.procs. If the container exceeds its memory limit and the kernel cannot reclaim enough memory, the OOM killer terminates a process inside the cgroup (usually the largest consumer). CPU overuse is throttled rather than killed, and once pids.max is reached, further fork calls simply fail. No polling. No daemon watching. Just the kernel saying "no."
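You can watch that enforcement happen. In cgroup v2, memory.events is a small "key value" file whose max and oom_kill counters increase when the limit is hit. The sketch below parses that format; parseFlatKeyed is a name chosen for this example, and the path assumes the my-container cgroup from the shell commands above.

```go
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// parseFlatKeyed parses a cgroup v2 "flat keyed" file such as memory.events,
// where every non-empty line is "<key> <value>".
func parseFlatKeyed(contents string) (map[string]uint64, error) {
    out := make(map[string]uint64)
    for _, line := range strings.Split(contents, "\n") {
        fields := strings.Fields(line)
        if len(fields) == 0 {
            continue // skip blank lines
        }
        if len(fields) != 2 {
            return nil, fmt.Errorf("malformed line: %q", line)
        }
        n, err := strconv.ParseUint(fields[1], 10, 64)
        if err != nil {
            return nil, err
        }
        out[fields[0]] = n
    }
    return out, nil
}

func main() {
    // Assumed path: the cgroup created in the shell example above.
    path := "/sys/fs/cgroup/my-container/memory.events"
    data, err := os.ReadFile(path)
    if err != nil {
        fmt.Println("cannot read memory.events:", err)
        return
    }
    events, err := parseFlatKeyed(string(data))
    if err != nil {
        fmt.Println("cannot parse memory.events:", err)
        return
    }
    fmt.Printf("limit hit %d times, OOM kills: %d\n", events["max"], events["oom_kill"])
}
```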

Lie #3: The Filesystem — chroot, pivot_root, and Union Mounts

Namespaces control what the process sees. cgroups control what it uses. But the process can still see your host's entire filesystem — your SSH keys, your /etc/passwd, your home directory. That's a security disaster.

Phase 1: chroot (The Simple Way)

The chroot syscall changes the root directory of a process. Everything under / becomes the contents of whatever directory you point at.

func chrootInto(newRoot string) error {
    // chroot to the new root
    if err := syscall.Chroot(newRoot); err != nil {
        return err
    }
    return syscall.Chdir("/")
}

Give this a Debian or Ubuntu rootfs (debootstrap can build one) and the process thinks it's running on a dedicated machine. But chroot was never designed as a security boundary. A root process inside a chroot can escape in several well-documented ways. It's fine for learning but not for production.

Phase 2: pivot_root (The Proper Way)

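pivot_root is the syscall container runtimes use instead. It was originally designed for switching from an initial RAM disk to the real root filesystem at boot, but it fits containers well: it makes a new mount the root of the current mount namespace and moves the old root to a subdirectory. Once that subdirectory is unmounted, the old root is no longer reachable from inside the mount namespace, which closes the classic chroot escape routes.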

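// Needs imports: fmt, os, path/filepath, syscall.
// Must be called from inside a new mount namespace (CLONE_NEWNS).
func pivotRoot(newRoot string) error {
    // Make every mount private first. This keeps our mounts from propagating
    // to the host, and pivot_root fails with EINVAL while the parent mount
    // is shared (the default on systemd hosts).
    if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
        return fmt.Errorf("make / private: %w", err)
    }

    // Bind mount newRoot to itself (pivot_root requires a mount point)
    if err := syscall.Mount(newRoot, newRoot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
        return fmt.Errorf("bind mount rootfs: %w", err)
    }

    // Create a directory for the old root
    putOld := filepath.Join(newRoot, ".pivot_root")
    if err := os.MkdirAll(putOld, 0700); err != nil {
        return fmt.Errorf("mkdir putOld: %w", err)
    }

    // pivot_root: swap the root filesystem
    if err := syscall.PivotRoot(newRoot, putOld); err != nil {
        return fmt.Errorf("pivot_root: %w", err)
    }

    // Change working directory to the new root
    if err := syscall.Chdir("/"); err != nil {
        return fmt.Errorf("chdir /: %w", err)
    }

    // Unmount the old root so the process can't escape
    putOld = "/.pivot_root"
    if err := syscall.Unmount(putOld, syscall.MNT_DETACH); err != nil {
        return fmt.Errorf("unmount old root: %w", err)
    }
    return os.Remove(putOld)
}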

This is essentially what runC, the runtime Docker uses by default, does, and LXC follows the same pattern. Runtimes converge on it because pivot_root is the kernel's purpose-built way to swap the root mount of a mount namespace.
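pivot_root is picky about mount propagation: per its man page it returns EINVAL when the parent mount of the new root (or of the current root) is shared, which is the default on systemd hosts. If you hit that error, a quick look at /proc/self/mountinfo tells you whether that is the cause. The sketch below is one way to do it; mountPropagation is a helper name invented here.

```go
package main

import (
    "fmt"
    "os"
    "strings"
)

// mountPropagation reads one line of /proc/self/mountinfo and reports the
// mount point plus whether the mount carries a "shared:N" optional field.
func mountPropagation(line string) (mountPoint string, shared bool, err error) {
    fields := strings.Fields(line)
    if len(fields) < 7 {
        return "", false, fmt.Errorf("short mountinfo line: %q", line)
    }
    // Optional fields start at index 6 and end at the "-" separator.
    for _, f := range fields[6:] {
        if f == "-" {
            break
        }
        if strings.HasPrefix(f, "shared:") {
            shared = true
        }
    }
    return fields[4], shared, nil
}

func main() {
    data, err := os.ReadFile("/proc/self/mountinfo")
    if err != nil {
        fmt.Println("cannot read mountinfo:", err)
        return
    }
    for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
        mp, shared, err := mountPropagation(line)
        if err != nil || mp != "/" {
            continue
        }
        fmt.Printf("/ is shared: %v\n", shared)
    }
}
```

If / turns out to be shared, remounting it with MS_REC|MS_PRIVATE inside the new mount namespace, before the bind mount, is the usual remedy.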

Phase 3: OverlayFS (Where Images Come From)

Your container has a root filesystem — but where did that filesystem come from? Docker "pulls images." What's actually happening?

A Docker image is a stack of read-only layers, each representing a filesystem diff. When you RUN apt install nginx in a Dockerfile, it creates a new layer containing only the files that command added or changed. When you run the container, Docker combines those layers into a single view using a union filesystem — typically OverlayFS on modern kernels.

Here's how OverlayFS works:

# Create the directory structure
mkdir lower upper work merged

# lower = read-only base (e.g., Ubuntu rootfs)
# upper = writable layer (container changes go here)
# work  = internal OverlayFS scratch space
# merged = the unified view the container sees

# Mount the overlay
sudo mount -t overlay overlay \
    -o lowerdir=lower,upperdir=upper,workdir=work \
    merged

# Now 'merged' shows lower + upper combined
# Reads prefer upper; writes go to upper (copy-on-write)

[Figure: OverlayFS layers. lowerdir (read-only Ubuntu base layer), upperdir (container writes), workdir (internal scratch), and merged (the unified view the container sees, with copy-on-write semantics).]

When you docker pull ubuntu:22.04, you're downloading compressed filesystem layer blobs, usually tar-based, along with metadata and digests. When you docker run, Docker mounts them with OverlayFS and passes the merged directory as the root filesystem to pivot_root. That's the whole image system. No magic. Just a clever filesystem trick that's been in the Linux kernel since 2014.
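Real images have more than one read-only layer, and OverlayFS accepts several lower directories separated by colons, with the top-most layer listed first. Here is a small sketch of how an engine might build that option string from a base-first layer list. The function name overlayOptions and the directory names are made up for illustration.

```go
package main

import (
    "fmt"
    "strings"
)

// overlayOptions builds the data string for an overlay mount.
// layers is ordered base-first; OverlayFS expects the top-most lower layer
// first in lowerdir, so the order is reversed here.
func overlayOptions(layers []string, upper, work string) string {
    lower := make([]string, len(layers))
    for i, l := range layers {
        lower[len(layers)-1-i] = l
    }
    return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
        strings.Join(lower, ":"), upper, work)
}

func main() {
    // Hypothetical layer directories, for illustration only.
    opts := overlayOptions([]string{"layers/base", "layers/nginx"}, "upper", "work")
    fmt.Println(opts)
    // With root privileges, the Go equivalent of the mount command above would be:
    //   syscall.Mount("overlay", "merged", "overlay", 0, opts)
}
```

The program only prints the option string, so it is safe to run without root.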

Putting It All Together: The Minimal Container Engine

Let's assemble what we've learned. Here's the flow our engine follows:

Prepare the rootfs. Either unpack a tarball or set up an OverlayFS mount
Create cgroup limits. Write to /sys/fs/cgroup/mini-container/
Fork a child process with namespace flags (CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC)
In the parent: add the child PID to the cgroup for resource enforcement
Inside the child: call pivot_root to the prepared rootfs
Inside the child: mount /proc so ps and friends work inside the container
Inside the child: run the container command (e.g., /bin/bash)

This is a simplified teaching demo. It shows the core mechanics, but it is not a portable or production-quality runtime. In particular, cgroup setup may differ on systemd-managed hosts, and the child process may start before all resource limits are fully applied.
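The start-up race mentioned above is commonly closed with a pipe: the child blocks on a read until the parent has finished cgroup setup and closes its end. The sketch below simulates that handshake inside a single process, with a goroutine standing in for the child. In a real engine the read end would be handed to the re-executed child, for example through exec.Cmd.ExtraFiles. The names waitForParent and simulate are invented for this example.

```go
package main

import (
    "fmt"
    "io"
    "os"
)

// waitForParent blocks until the write end of the pipe is closed (EOF),
// which the parent does only after cgroup setup is complete.
func waitForParent(r io.Reader) error {
    _, err := io.Copy(io.Discard, r)
    return err
}

// simulate runs the handshake in one process and returns the order of events.
func simulate() ([]string, error) {
    r, w, err := os.Pipe()
    if err != nil {
        return nil, err
    }
    defer r.Close()

    done := make(chan error, 1)
    go func() { done <- waitForParent(r) }() // stand-in for the child

    events := []string{"parent: limits written, child added to cgroup.procs"}
    if err := w.Close(); err != nil { // closing the pipe releases the child
        return nil, err
    }
    if err := <-done; err != nil {
        return nil, err
    }
    return append(events, "child: released, safe to start the workload"), nil
}

func main() {
    events, err := simulate()
    if err != nil {
        fmt.Fprintln(os.Stderr, "handshake failed:", err)
        return
    }
    for _, e := range events {
        fmt.Println(e)
    }
}
```

The program below skips this step to stay short, which is why its limits may apply a moment after the child starts.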

Here's a complete Go program that does this. It's under 200 lines and runs a real container:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "syscall"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Usage: mini-container <command> [args...]")
        os.Exit(1)
    }

    rootfs := "/tmp/mini-container-rootfs"
    cgroupPath := "/sys/fs/cgroup/mini-container"

    // 1. Prepare rootfs (requires pre-existing directory or tarball)
    if _, err := os.Stat(rootfs); os.IsNotExist(err) {
        fmt.Fprintf(os.Stderr, "Rootfs not found at %s\n", rootfs)
        fmt.Fprintf(os.Stderr, "Create one with: sudo debootstrap stable %s\n", rootfs)
        os.Exit(1)
    }

    // 2. Set up cgroup
    if err := setupCgroup(cgroupPath); err != nil {
        fmt.Fprintf(os.Stderr, "cgroup setup failed: %v\n", err)
        os.Exit(1)
    }

    // 3. Fork the container process
    cmd := exec.Command("/proc/self/exe", append([]string{"init"}, os.Args[1:]...)...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWPID |
            syscall.CLONE_NEWNET |
            syscall.CLONE_NEWNS |
            syscall.CLONE_NEWUTS |
            syscall.CLONE_NEWIPC,
    }

    if err := cmd.Start(); err != nil {
        fmt.Fprintf(os.Stderr, "Failed to start container: %v\n", err)
        os.Exit(1)
    }

    // 4. Add child to cgroup
    if err := addToCgroup(cgroupPath, cmd.Process.Pid); err != nil {
        fmt.Fprintf(os.Stderr, "Failed to add process to cgroup: %v\n", err)
    }

    if err := cmd.Wait(); err != nil {
        if exitErr, ok := err.(*exec.ExitError); ok {
            os.Exit(exitErr.ExitCode())
        }
        os.Exit(1)
    }
}

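// init is Go's package initializer, so it runs before main in every process.
// In the parent it returns immediately. In the re-executed child
// (os.Args[1] == "init") it finishes container setup inside the new
// namespaces and never returns.
func init() {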
    if len(os.Args) < 2 || os.Args[1] != "init" {
        return
    }

    rootfs := "/tmp/mini-container-rootfs"

    // pivot_root into the container rootfs
    if err := pivotRoot(rootfs); err != nil {
        fmt.Fprintf(os.Stderr, "pivot_root failed: %v\n", err)
        os.Exit(1)
    }

    // Mount /proc
    if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
        fmt.Fprintf(os.Stderr, "mount /proc failed: %v\n", err)
    }

    // Execute the requested command
    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Run(); err != nil {
        if exitErr, ok := err.(*exec.ExitError); ok {
            os.Exit(exitErr.ExitCode())
        }
        os.Exit(1)
    }
    os.Exit(0)
}

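func pivotRoot(newRoot string) error {
    // Make mounts private: keeps changes off the host and avoids the EINVAL
    // that pivot_root returns when the parent mount is shared.
    if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
        return fmt.Errorf("make / private: %w", err)
    }
    if err := syscall.Mount(newRoot, newRoot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
        return fmt.Errorf("bind mount rootfs: %w", err)
    }
    putOld := filepath.Join(newRoot, ".pivot_root")
    if err := os.MkdirAll(putOld, 0700); err != nil {
        return fmt.Errorf("mkdir putOld: %w", err)
    }
    if err := syscall.PivotRoot(newRoot, putOld); err != nil {
        return fmt.Errorf("pivot_root: %w", err)
    }
    if err := syscall.Chdir("/"); err != nil {
        return fmt.Errorf("chdir /: %w", err)
    }
    putOld = "/.pivot_root"
    if err := syscall.Unmount(putOld, syscall.MNT_DETACH); err != nil {
        return fmt.Errorf("unmount old root: %w", err)
    }
    return os.Remove(putOld)
}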

func setupCgroup(path string) error {
    if err := os.MkdirAll(path, 0755); err != nil {
        return err
    }
    limits := []struct{ file, value string }{
        {"memory.max", "104857600"}, // memory limit: 100 MiB
        {"cpu.max", "50000 100000"}, // CPU limit: 50% of a core
        {"pids.max", "10"},          // PID limit: 10
    }
    for _, l := range limits {
        // Report failed writes instead of silently running without limits
        if err := os.WriteFile(filepath.Join(path, l.file), []byte(l.value), 0644); err != nil {
            return fmt.Errorf("write %s: %w", l.file, err)
        }
    }
    return nil
}

func addToCgroup(path string, pid int) error {
    return os.WriteFile(
        filepath.Join(path, "cgroup.procs"),
        []byte(fmt.Sprintf("%d", pid)),
        0644,
    )
}

To use it:

# Step 1: Create a root filesystem
sudo debootstrap stable /tmp/mini-container-rootfs

# Step 2: Build and run the container
go build -o mini-container main.go
sudo ./mini-container /bin/bash

# You're now root in a container with:
# - Its own PID namespace (ps aux shows only your processes)
# - Its own mount namespace (/ is the debootstrap rootfs)
# - Memory capped at 100 MB
# - CPU capped at 50%
# - Max 10 tasks (threads count too, including the engine's own init process)

That's a container. Not a complete one: there's no networking, no seccomp profile, no image pulling, no volume mounts. But each of those features is built from more of the same ingredients: a few additional syscalls and some bookkeeping layered on top.

What Docker Adds (And Why You Should Still Use It)

After building this, you might wonder why anyone uses Docker at all. The short answer: you don't want to manage overlay mounts by hand. You don't want to write Go code every time you need a container. Docker gives you:

The registry. docker pull nginx downloads image layer blobs, verifies digests, and unpacks the filesystem layers. Building that from scratch is hundreds of lines of HTTP + tar + hash verification code.
The networking bridge. docker network create sets up bridge interfaces, iptables rules, and DNS resolution. Raw network namespaces give you only a loopback interface. Getting packets in and out requires veth pairs and manual routing.
The image cache. Docker tracks which layers you've already downloaded and reuses them. An nginx:latest base image pulled once is shared across every container that uses it. OverlayFS makes this efficient at the filesystem level, but Docker's daemon tracks metadata so you don't have to.
seccomp and capabilities. Docker runs containers with a reduced default capability set and applies a default seccomp profile based on an allowlist, blocking many higher-risk syscalls by default. Your mini-container above runs with full root capabilities. That's fine for learning; it's a disaster in production.
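The capabilities point is easy to check for yourself. Every process exposes its effective capability mask as the CapEff line in /proc/<pid>/status, and CAP_SYS_ADMIN is bit 21 of that mask. The sketch below reads the mask for the current process; hasCap and parseCapEff are names made up for this example.

```go
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

const capSysAdmin = 21 // CAP_SYS_ADMIN bit number from <linux/capability.h>

// hasCap reports whether the given capability bit is set in the mask.
func hasCap(mask uint64, capability uint) bool {
    return mask&(uint64(1)<<capability) != 0
}

// parseCapEff extracts the CapEff mask from the text of /proc/<pid>/status.
func parseCapEff(status string) (uint64, error) {
    sc := bufio.NewScanner(strings.NewReader(status))
    for sc.Scan() {
        line := sc.Text()
        if strings.HasPrefix(line, "CapEff:") {
            hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
            return strconv.ParseUint(hex, 16, 64)
        }
    }
    return 0, fmt.Errorf("CapEff line not found")
}

func main() {
    data, err := os.ReadFile("/proc/self/status")
    if err != nil {
        fmt.Println("cannot read /proc/self/status:", err)
        return
    }
    mask, err := parseCapEff(string(data))
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("CapEff=%#x CAP_SYS_ADMIN=%v\n", mask, hasCap(mask, capSysAdmin))
}
```

Run inside the mini-container, this should report CAP_SYS_ADMIN=true. A process in a default Docker container would be expected to report false, since that capability is not part of Docker's default set.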

The point of building the minimal version isn't to replace Docker. It's to understand what Docker actually does. When a production container won't start, you'll know to check /proc/<pid>/ns/ to see which namespaces are active. When memory limits aren't working, you'll know to check that the process actually appears in cgroup.procs and that memory.max holds the value you expect. When a file written inside a container disappears after the container is recreated, you'll understand that it lived in the upper OverlayFS layer, which the runtime deletes along with the container.

How I Use This Knowledge in My Own Setup

I run Docker Compose on my homelab mini PC for Plex, n8n, and about a dozen other services. Before I understood the internals, Docker was a black box that either worked or didn't — and when it didn't, I was stuck reading Stack Overflow threads from 2019.

Understanding namespaces and cgroups changed how I debug. When a container leaked memory, I checked /sys/fs/cgroup/system.slice/docker-<id>.scope/memory.current instead of guessing. When networking broke, I inspected the veth pairs instead of restarting the Docker daemon. When a container couldn't write files, I checked the OverlayFS upper directory permissions instead of rebuilding the image.

You don't need to build a container engine to use these insights. Just knowing the kernel primitives gives you a troubleshooting map that docker logs doesn't provide.

The Kernel Is the Platform

Containers aren't lightweight VMs. They're processes with kernel-enforced blindfolds. The magic isn't in Docker Engine or containerd or runC. It's in Linux kernel primitives that have been developed and refined over many years.

Build the minimal version once — even if you never run it again. Once you've called pivot_root by hand and watched a process wake up inside a different filesystem, Docker stops being a mysterious platform and becomes what it actually is: a well-designed convenience layer on top of Linux primitives you can use yourself.
