Linux Kernel¶
Kernel foundations, mm, namespaces, cgroups, eBPF, networking.
Linux Systems & Kernel Engineering: A 24-Week Mastery Roadmap¶
Authoring lens: Principal Systems Engineer / Linux Kernel Specialist. Target outcome: A graduate of this curriculum should be capable of (a) reading kernel source and contributing patches to a subsystem, (b) operating a fleet of Linux hosts with a coherent observability and security posture, and (c) writing custom kernel modules, eBPF programs, and systemd integrations to solve real production problems.
This is not "Linux command line in 24 weeks." It assumes the reader is already comfortable on the shell, has shipped userspace code, and is ready to read C source from linux/, man-pages, and the kernel documentation tree as primary literature.
Repository Layout¶
| File | Purpose |
|---|---|
| `00_PRELUDE_AND_PHILOSOPHY.md` | The Linux design ethos; the kernel/userspace contract; reading list. |
| `01_MONTH_KERNEL_FOUNDATIONS.md` | Weeks 1–4. Boot, syscalls, VFS, processes & threads. |
| `02_MONTH_MEMORY_AND_SCHEDULING.md` | Weeks 5–8. Paging, swapping, HugePages, CFS/EEVDF scheduler. |
| `03_MONTH_NAMESPACES_CGROUPS_EBPF.md` | Weeks 9–12. Namespaces, cgroups v2, eBPF, observability. |
| `04_MONTH_NETWORKING.md` | Weeks 13–16. Netfilter, IPVS, XDP, bridges, OVS. |
| `05_MONTH_SECURITY_AND_HARDENING.md` | Weeks 17–20. SELinux/AppArmor, LUKS, sysctl, audit, secure boot. |
| `06_MONTH_KERNEL_MODULES_CAPSTONE.md` | Weeks 21–24. LKM development, perf tuning, capstone defense. |
| `APPENDIX_A_HARDENING_AND_TUNING.md` | sysctl, perf, SystemTap, BCC/bpftrace recipes. |
| `APPENDIX_B_TOOLBOX.md` | Build-from-scratch reference: a tiny init, a custom systemd unit, a kernel module skeleton, an eBPF skeleton. |
| `APPENDIX_C_CONTRIBUTING_TO_THE_KERNEL.md` | LKML; git send-email; first-patch playbook; subsystem map. |
| `CAPSTONE_PROJECTS.md` | Three terminal projects: self-healing systemd unit, custom LKM, eBPF observability tool. |
How Each Week Is Structured¶
Every weekly module follows the same five-section format:
- Conceptual Core - the why, with a mental model.
- Mechanical Detail - the how, down to kernel source and `man-pages` references.
- Lab - a hands-on exercise that cannot be completed without internalizing the concept.
- Hardening Drill - `sysctl`, AppArmor/SELinux, audit rules, or `systemd` security directives that follow from the topic.
- Performance Tuning Slice - a `perf`/`bpftrace`/`ftrace` micro-task that compounds across weeks.
Each week is sized for ~12–16 focused hours.
Progression Strategy¶
```
Kernel Foundations ──► Memory & Scheduling ──► Namespaces / cgroups / eBPF
        │                        │                           │
        └────────────────────────┴─────────────┬─────────────┘
                                               ▼
                                          Networking
                                               │
                                               ▼
                                      Security & Hardening
                                               │
                                               ▼
                                    LKM Development & Capstone
```
Non-Goals¶
- This is not an LPIC/RHCSA exam-prep guide. The exam objectives focus on operational fluency; this curriculum focuses on internals.
- Not a guide to a specific distribution. Examples skew toward modern systemd-based distros (Debian/Ubuntu, Fedora/RHEL, Arch), with kernel paths from upstream.
- Not "How to use Docker." That belongs in the Container Internals curriculum.
Capstone Tracks (pick one in Month 6)¶
- Kernel Module Track - a non-trivial out-of-tree LKM (a character device, a netfilter hook, or a tracepoint consumer) with KUnit tests.
- eBPF Observability Track - a production-grade tracing tool comparable to `bpftrace`'s `runqlat` or `tcpconnect`, packaged with a userspace consumer.
- Self-Healing Service Track - a systemd-managed application with health checks, automatic restart, watchdog, and integration with the cgroup memory-pressure interface.
Details in CAPSTONE_PROJECTS.md.
Prelude: The Philosophy Behind the Syllabus¶
Sit with this document for an evening before week 1.
1. Linux Is a Kernel + a Contract¶
Linus Torvalds writes the kernel. The GNU project, glibc, BusyBox, systemd, and countless others build a userspace around it. What people call "Linux" is the interface between these layers, and that interface is what a Linux engineer must master.
The contract has three surfaces:
- The system call ABI - `man 2 syscalls`. Stable across kernel versions for 25+ years. The kernel's most binding promise.
- The procfs / sysfs / netlink interfaces - semi-stable, documented in `Documentation/ABI/` in the kernel tree. The control surface for nearly every tunable.
- The character/block device file interface - `/dev`. Hardware abstracted as files; the most Unix idea in Linux.
If you internalize "everything is a file or a syscall," the kernel/userspace boundary becomes legible.
2. The Five-Axis Cost Model¶
A working Linux engineer reasons about every system change along five axes:
| Axis | Question to ask |
|---|---|
| Boundary cost | Does this cross a syscall? A context switch? A copy from userspace? |
| Memory | Where is this allocated-slab, page, anon, file-backed, hugepage? |
| Scheduler | What runqueue does this run on? Will it preempt? Is it RT-class? |
| Isolation | Which namespace, cgroup, capability, MAC label? |
| Failure | What does the OOM killer do? What does the audit log show? |
Beginner courses teach axis 1 only. This curriculum forces all five into your hands by week 12.
3. The Reading List¶
Primary

- *Linux Kernel Development*, Robert Love (3e). The canonical introductory text.
- *Understanding the Linux Kernel*, Bovet & Cesati. Older but unmatched depth on memory management and process control.
- *The Linux Programming Interface*, Michael Kerrisk. The stdlib/syscall reference. Treat as a pinned tab.
- *Systems Performance*, Brendan Gregg (2e). The performance bible.
- *BPF Performance Tools*, Brendan Gregg. Required for Month 3.

Documentation

- `Documentation/` in the kernel tree. Particularly `Documentation/admin-guide/`, `Documentation/networking/`, `Documentation/admin-guide/cgroup-v2.rst`, `Documentation/scheduler/`, `Documentation/mm/`.
- *man-pages* - Kerrisk's project. Read sections 2 (syscalls), 3 (libc), 5 (file formats), 7 (overviews).

Adjacent

- *TCP/IP Illustrated, Vol. 1*, Stevens. For the networking month.
- *Operating Systems: Three Easy Pieces*, Arpaci-Dusseau. Free, excellent, foundational.
4. Curriculum Philosophy: "Read the Source, Trace the Syscall"¶
Three rules:
- Source first, blog second. When the curriculum says "study the page-fault handler," it means open `mm/memory.c::handle_mm_fault` and read it. Blogs go stale; commits are dated.
- `strace` and `perf` are the teachers. When you do not understand why a program behaves as it does, the first response is `strace -fc`, the second is `perf trace`, and only the third is to ask another human.
- One lab per concept, one upstream interaction per phase. By the end of each month: an lkml.org reply, a documentation typo fix, a `bpftrace` recipe shared, or a kernel-module experiment posted publicly.
5. What Linux Is Not For¶
A graduate of this curriculum should be able to argue these points:
- Hard real-time. Stock Linux is preemptible but not hard-RT. PREEMPT_RT is in tree but still not microsecond-deterministic. Use Xenomai, RTEMS, or QNX.
- Storage with strict tail-latency requirements. The Linux block layer adds tens of microseconds of variance. SPDK moves the storage data path into userspace and bypasses the kernel entirely (as DPDK does for networking).
- Code where the team has no C, no syscall, no signal-handling intuition. A team that has not debugged a `signalfd`, a `vfork()` race, or a misconfigured `ulimit` will struggle.
6. A Note on AI-Assisted Workflows¶
- Never run AI-suggested `dd`, `mkfs`, `cryptsetup`, `iptables -F`, or `systemctl mask` without reading the man page first. The blast radius of a bad command at this layer is the entire system.
- Verify privileged commands on a VM before any production host. `qemu-system-x86_64` with a Debian cloud image and 30 seconds of `cloud-init` is a reasonable scratch environment.
You are now ready for Week 1. Open 01_MONTH_KERNEL_FOUNDATIONS.md.
Month 1 - Kernel Foundations: Boot, Syscalls, VFS, Processes¶
Goal: by the end of week 4 you can (a) trace a process from fork() through execve() to _exit(), (b) describe the VFS layer and the path of open("/etc/passwd", O_RDONLY) from libc to filesystem driver, (c) read /proc/<pid>/ and /sys/ to debug a misbehaving process, and (d) write a basic systemd unit with security hardening.
Weeks¶
- Week 1 - Boot, Init, Systemd
- Week 2 - Syscalls and the Kernel/Userspace Boundary
- Week 3 - The Virtual File System (VFS)
- Week 4 - Processes, Threads, and Signals
Week 1 - Boot, Init, Systemd¶
1.1 Conceptual Core¶
- A modern Linux boot is a chain of progressively more-Linux-like stages: firmware (UEFI / BIOS) → bootloader (GRUB / systemd-boot) → kernel + initramfs → `/sbin/init` (systemd, mostly).
- systemd is the dominant init + service manager. It is not SysV init with `Type=simple` units bolted on; it is a unit-graph dependency engine that supervises sockets, timers, mounts, slices, and services as first-class objects.
- The unit hierarchy: `target` (a runlevel equivalent) ← `service`/`socket`/`timer`/`mount`/`device`/`slice`/`path` ← drop-ins (`/etc/systemd/system/foo.service.d/*.conf`).
1.2 Mechanical Detail¶
- Boot trace: `dmesg | head -200` plus `journalctl -b 0 --no-pager` shows the kernel and userspace boot logs from the current boot.
- `systemd-analyze blame` and `systemd-analyze critical-chain` decompose boot time.
- A unit file's anatomy: `[Unit]` (deps, ordering), `[Service]` (exec, restart, security), `[Install]` (alias, enable target).
- Hardening directives: `NoNewPrivileges=yes`, `ProtectSystem=strict`, `ProtectHome=yes`, `PrivateTmp=yes`, `RestrictAddressFamilies=AF_INET AF_INET6`, `CapabilityBoundingSet=`, `SystemCallFilter=@system-service`, `MemoryMax=`, `CPUQuota=`. Every long-running service should set these (a sketch follows below).
- `systemctl edit <unit>` for drop-ins; never edit `/lib/systemd/system/*` (overwritten by package updates).
1.3 Lab - "A Hardened Echo Service"¶
- Write a tiny C program that listens on a Unix socket and echoes input. Static-link with `-static`.
- Write an `echo.socket` and `echo.service` pair using socket activation.
- Apply every hardening directive that is plausible for an echo server. Run `systemd-analyze security echo.service` and aim for a score under 1.0.
- Verify isolation: from inside the service (debug via `systemd-run --shell --unit=echo.service`), confirm `ProtectSystem` makes `/usr` read-only.
1.4 Hardening Drill¶
- Read `man systemd.exec` cover to cover. Make a one-page cheat sheet of hardening directives.
1.5 Performance Tuning Slice¶
- Capture `systemd-analyze plot > boot.svg` from a fresh VM. Identify the longest-blocking unit and propose a `Before=`/`After=` adjustment.
Week 2 - Syscalls and the Kernel/Userspace Boundary¶
2.1 Conceptual Core¶
- A system call is a transfer of control from userspace to the kernel via a defined ABI: trigger an interrupt or a `syscall` instruction; the kernel reads register-passed arguments and dispatches via a table indexed by syscall number.
- On x86_64 Linux: arguments in `rdi, rsi, rdx, r10, r8, r9`; syscall number in `rax`; return value in `rax`. Errors are returned as negative values in `rax` (`-errno`).
- libc wraps each syscall in a function (`open(2)` is a thin wrapper; some wrappers, like `fork(3)`, glue to `clone(2)`). A raw-syscall sketch follows below.
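A minimal look at the boundary from C, assuming x86_64 Linux and glibc's `syscall(2)` wrapper (which places the arguments in the registers listed above and converts a negative kernel return into `errno`):

```c
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
    /* Raw write(2): syscall number in rax, args in rdi/rsi/rdx. */
    long n = syscall(SYS_write, 1, "hello\n", 6);
    printf("wrote %ld bytes\n", n);

    /* Deliberate error: fd -1 exercises the -errno return path. */
    n = syscall(SYS_write, -1, "x", 1);
    printf("ret=%ld errno=%d (EBADF=%d)\n", n, errno, EBADF);
    return 0;
}
```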
2.2 Mechanical Detail¶
- Read `arch/x86/entry/syscalls/syscall_64.tbl` - the syscall table.
- The path: userspace `syscall` instruction → `entry_SYSCALL_64` (`arch/x86/entry/entry_64.S`) → `do_syscall_64` → `sys_<name>` in C.
- `strace -f -e trace=%file ./prog` traces file-related syscalls only.
- `ltrace` for library-level tracing (less useful, since most interesting actions hit the kernel anyway).
- `perf trace` is the modern equivalent of `strace`, with much lower overhead.
- audit (`auditd`) for production-grade syscall logging - gated by rules, written via netlink.
2.3 Lab - "Syscall Forensics"¶
- `strace -c ls /etc` - produce a count summary of syscalls. Predict the top 5; verify.
- Implement `cat` in pure C using only `open`, `read`, `write`, `close` - no libc helpers (`syscall(SYS_open, ...)`).
- Run under `strace -f` to verify zero unexpected calls.
- Build a minimal seccomp allowlist for your `cat`, allowing only the syscalls actually used. Verify it kills attempts to invoke other syscalls.
2.4 Hardening Drill¶
- Configure `auditctl -a always,exit -F arch=b64 -S execve -k exec` to log every `execve`. Read the resulting `aureport -x` output. Document the operational cost (log volume).
2.5 Performance Tuning Slice¶
- Run a workload under `perf stat -e 'syscalls:sys_enter_*'`. Identify the highest-frequency syscall. Hypothesize a reduction (batching, larger buffers, `splice`).
Week 3 - The Virtual File System (VFS)¶
3.1 Conceptual Core¶
- The VFS is the kernel's abstraction over filesystem implementations. Userspace sees one consistent API (`open`, `read`, `stat`, `mmap`); the kernel dispatches to ext4, btrfs, xfs, tmpfs, procfs, sysfs, or FUSE via per-FS operation tables.
- Four core VFS objects:
  - inode - a file's metadata (owner, perms, size, pointers to data blocks).
  - dentry - a directory entry; the cached mapping from a name to an inode.
  - file - an open file description (one per `open()` call); holds offset, flags, refcount.
  - superblock - a mounted filesystem instance.
- The dentry cache (`dcache`) and inode cache (`icache`) are why repeated `stat()`s are fast.
3.2 Mechanical Detail¶
- Read `fs/open.c::do_sys_openat2` - the entry point of `openat(2)`.
- `path_openat` resolves the path through the dcache, allocating new dentries on miss.
- Each FS implements a `struct file_operations` and `struct inode_operations`. ext4's are in `fs/ext4/file.c` and `fs/ext4/inode.c`.
- Mount namespaces (preview) - each mount namespace has its own mount tree. Containers exploit this.
- Pseudo-filesystems:
  - procfs (`/proc`) - kernel introspection: `/proc/<pid>/`, `/proc/cpuinfo`, `/proc/meminfo`, `/proc/sys/`.
  - sysfs (`/sys`) - device/driver introspection, with most kernel tunables under `/sys/kernel/`, `/sys/class/`, `/sys/block/`, `/sys/fs/cgroup/`.
  - `cgroupfs`, `devtmpfs`, `tmpfs`, `bpf`, `tracefs`, `debugfs`, `securityfs`.
3.3 Lab - "VFS Forensics"¶
- Catalogue every entry in `/proc/<pid>/` for one of your processes. Document what each gives you.
- Read `/proc/<pid>/maps` and explain every region (text, heap, stack, vdso, vvar, shared libs).
- Use eBPF's `vfs_open` kprobe (via `bpftrace`; a one-liner is sketched below) to log every open system-wide for 5 seconds. Triage the noise.
- Mount `tmpfs` at a custom path, fill it, and observe the allocator behavior in `/proc/meminfo` (`Shmem`).
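One plausible form of that one-liner, assuming a kernel with BTF so `bpftrace` can resolve `struct path` (`vfs_open`'s first argument):

```
bpftrace -e 'kprobe:vfs_open {
    printf("%s %s\n", comm,
           str(((struct path *)arg0)->dentry->d_name.name));
}'
```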
3.4 Hardening Drill¶
- Lock down `/proc` with `hidepid=2` (mount option). Verify a non-root user can no longer see other users' processes.
3.5 Performance Tuning Slice¶
- Use `perf top` to find the hottest VFS function under your workload. If it's `__d_lookup`, your dcache is being thrashed; if it's `__find_get_block`, your buffer cache is. Document the inference.
Week 4 - Processes, Threads, and Signals¶
4.1 Conceptual Core¶
- A process is the unit of resource ownership: address space, file descriptors, credentials, signal handlers.
- A thread in Linux is a process that shares its address space (and most other resources) with its siblings - the kernel calls them all "tasks." `clone(2)` with various flags is the underlying syscall; `fork()` is a special case (`clone(SIGCHLD)`); `pthread_create()` is another (`clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | ...)`).
- Signals are the kernel's asynchronous notification mechanism: `kill(2)`, `sigaction(2)`, signal masks, `signalfd(2)`. The most error-prone part of Unix.
4.2 Mechanical Detail¶
- `task_struct` - the kernel's per-task struct, ~3 KB. Read `include/linux/sched.h`.
- The PID is the per-namespace task identifier; the TGID (thread-group ID) is what userspace `getpid()` returns. Threads share a TGID and differ in PID.
- Process tree: `/proc/<pid>/task/<tid>/` for each thread; `/proc/<pid>/status` for credentials, capabilities, OOM score.
- Signal mechanics:
  - Synchronous signals (`SIGSEGV`, `SIGBUS`, `SIGFPE`) - delivered to the offending thread.
  - Asynchronous signals (`SIGTERM`, `SIGINT`, `SIGUSR1/2`) - delivered to any thread that doesn't block them, usually the main thread.
  - Real-time signals (`SIGRTMIN`–`SIGRTMAX`) - queued (regular signals can be coalesced).
- The fork-then-exec pattern: `fork()` is heavy (copy page tables, COW), `vfork()` is lighter but pauses the parent, and `posix_spawn()` and `clone3(CLONE_VFORK | CLONE_VM)` are the modern, cheaper alternatives.
4.3 Lab - "Process Forensics"¶
- Write a C program that forks 4 children, each computing for 5 s. Use `ptrace` or `strace -f` to observe all four.
- Add a signal handler that catches `SIGTERM`, forwards it to all children, and exits cleanly.
- Reproduce a classic bug: a parent that ignores `SIGCHLD` and a child that exits, producing zombies. Verify with `ps -ef | grep defunct`.
- Convert to `signalfd` + `epoll` - the modern signal-handling pattern that integrates with event loops. A sketch of the pattern follows below.
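A minimal sketch of that pattern (socket handling elided; error checks trimmed):

```c
/* signalfd + epoll for graceful shutdown (Linux, glibc). */
#include <sys/signalfd.h>
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGINT);

    /* Block the signals so they arrive via the fd, not a handler. */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        exit(1);

    int sfd  = signalfd(-1, &mask, SFD_CLOEXEC);
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);

    for (;;) {
        struct epoll_event out;
        if (epoll_wait(epfd, &out, 1, -1) < 1)
            continue;                       /* EINTR etc. */
        if (out.data.fd == sfd) {
            struct signalfd_siginfo si;
            read(sfd, &si, sizeof(si));
            printf("signal %u: shutting down cleanly\n", si.ssi_signo);
            break;                          /* drain connections, exit */
        }
        /* ... listening and client sockets handled here ... */
    }
    return 0;
}
```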
4.4 Hardening Drill¶
- Set `RLIMIT_NPROC` and `RLIMIT_STACK` for your service via `LimitNPROC=` and `LimitSTACK=` in the systemd unit. Verify with `prlimit -p <pid>`.
4.5 Performance Tuning Slice¶
- `perf sched record -- sleep 10` during your workload. Analyze with `perf sched latency`. Identify the top wakeup latencies.
Month 1 Capstone Deliverable¶
A `kernel-foundations/` directory:
1. `hardened-echo/` - week 1's echo service with maximal systemd hardening, score documented.
2. `syscall-cat/` - week 2's libc-free `cat` plus a seccomp policy.
3. `vfs-explorer/` - week 3's `bpftrace` recipes capturing VFS activity.
4. `signal-disciplined-server/` - a TCP echo server using `signalfd` + `epoll` for graceful shutdown.
A `RUNBOOK.md` documenting the boot trace, the syscall-trace methodology, and the signal-handling decisions.
Month 2 - Memory Management and Scheduling¶
Goal: by the end of week 8 you can (a) read /proc/meminfo and explain every line, (b) trace a page fault from userspace through do_page_fault to a returned PTE, (c) explain CFS (and EEVDF since 6.6) and read /proc/<pid>/sched, and (d) tune a memory-pressure-sensitive service using MemoryHigh/MemoryMax/PSI.
Weeks¶
- Week 5 - Virtual Memory, Paging, and the Page Cache
- Week 6 - Swapping, OOM, Memory Pressure (PSI)
- Week 7 - The CPU Scheduler (CFS, EEVDF)
- Week 8 - Disk I/O Scheduling, Filesystems Beyond ext4
Week 5 - Virtual Memory, Paging, and the Page Cache¶
5.1 Conceptual Core¶
- Each process has a private virtual address space (`mm_struct`). The MMU translates virtual to physical addresses via page tables (4-level on x86_64; 5-level on newer CPUs).
- Pages are 4 KiB by default. HugePages (2 MiB or 1 GiB) reduce TLB pressure for memory-intensive workloads.
- Memory is divided into anonymous (heap, stack) and file-backed (mmap'd files, page cache for read/write).
- The page cache is Linux's most aggressive optimization: nearly every read of a regular file is cached; writes are buffered until writeback (or `fsync()`).
5.2 Mechanical Detail¶
- `/proc/meminfo` line decoding:
  - `MemTotal` / `MemFree` / `MemAvailable` - the latter is what to monitor.
  - `Buffers` - block-device caches.
  - `Cached` - page cache.
  - `Active`/`Inactive` (anon, file) - LRU lists.
  - `Dirty` / `Writeback` - outstanding writeback work.
  - `Slab` (`SReclaimable` / `SUnreclaim`) - kernel object allocators.
  - `AnonHugePages` / `HugePages_*` - transparent and explicit hugepages.
- `vm.dirty_ratio` / `vm.dirty_background_ratio` - when the kernel starts (and forces) writeback.
- `vm.swappiness` - bias between swapping anon pages vs. evicting file pages. Default 60; for DB servers often lowered to 10 or even 1.
- `mm/memory.c::handle_mm_fault` - the page-fault entry point. Three classes: minor (already in the page cache, just map it), major (must read from disk), and COW (write to a shared mapping).
5.3 Lab - "Memory Forensics"¶
- Run `vmstat 1` and `free -h` while loading a 4-GB file with `cat file > /dev/null`. Watch `Cached` grow.
- `echo 3 > /proc/sys/vm/drop_caches` and observe the eviction.
- `mmap` a large file `MAP_PRIVATE`, write to it, and observe `AnonHugePages` and the COW behavior in `/proc/<pid>/smaps`.
- Configure `vm.nr_hugepages=512` (1 GiB of 2 MiB pages). Allocate via `MAP_HUGETLB` (see the sketch below). Measure the latency-distribution change vs. default pages.
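A sketch of the allocation step, assuming the default hugepage size is 2 MiB and the pool has been sized as above:

```c
/* Explicit hugepage allocation via mmap(MAP_HUGETLB). */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define LEN (512UL * 2 * 1024 * 1024)   /* 512 pages x 2 MiB = 1 GiB */

int main(void)
{
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");    /* ENOMEM: pool too small */
        return 1;
    }
    memset(p, 0xAA, LEN);               /* touch every page */
    /* Check HugePages_Free in /proc/meminfo while this waits. */
    getchar();
    munmap(p, LEN);
    return 0;
}
```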
5.4 Hardening Drill¶
- Set `vm.unprivileged_userfaultfd=0` (a frequently exploited surface) and `vm.mmap_min_addr=65536` (defense against null-pointer kernel exploits). Document the reasoning.
5.5 Performance Tuning Slice¶
- Use `perf stat -e dTLB-load-misses,dTLB-loads ./prog`. If the TLB miss ratio is >1%, evaluate hugepages or `madvise(MADV_HUGEPAGE)`.
Week 6 - Swapping, OOM, Memory Pressure (PSI)¶
6.1 Conceptual Core¶
- Swap is the kernel's overflow valve when anonymous-memory pressure exceeds available RAM. Modern systems often run swapless or with a small compressed swap (e.g., `zswap` or `zram`).
- The OOM killer is the kernel's last-resort mechanism: when all reclaim has failed and an allocation cannot be satisfied, it kills the process with the highest `oom_score` (a heuristic of memory usage, adjusted by `oom_score_adj`).
- PSI (Pressure Stall Information) - `/proc/pressure/{cpu,memory,io}` and the per-cgroup `{cpu,memory,io}.pressure` files - reports the time the system or a cgroup spent stalled on each resource. The modern signal for "this system is sad."
6.2 Mechanical Detail¶
- `swapon`, `swapoff`, `/proc/swaps`, `vm.swappiness`.
- `zram` (compressed RAM-backed swap) configured via `systemd-zram-generator` or manually with `zramctl`.
- OOM tuning: `oom_score_adj` per process (`/proc/<pid>/oom_score_adj`, range -1000 to 1000); the systemd `OOMScoreAdjust=` directive; `vm.overcommit_memory` (0/1/2): heuristic / always allow / strict accounting.
- PSI semantics (see the sample below):
  - `some` - at least one task stalled.
  - `full` - all runnable tasks stalled (system-wide can't reach this for CPU).
  - Numbers are 10 s / 60 s / 300 s averages of the stall percentage.
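The file format, for reference (the values here are illustrative):

```
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.15 avg300=0.40 total=12345678
full avg10=0.00 avg60=0.03 avg300=0.10 total=4567890
```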
6.3 Lab - "Pressure and the OOM Killer"¶
- Write a memory-eater program. Run it inside a `memory.high=512M` cgroup. Observe `memory.pressure` rise.
- Push past `memory.max`; watch the OOM killer. Check `dmesg` and `journalctl -k | grep -i 'killed process'`.
- Set `oom_score_adj=-500` on a critical process; verify it survives an OOM event triggered by another, lower-priority hog.
- Measure PSI under realistic load: capture `memory.pressure` every second for 5 minutes during a workload spike. Plot.
6.4 Hardening Drill¶
- Add `MemoryHigh=` and `MemoryMax=` to every long-running service. Use `MemoryHigh` as the soft target (it throttles allocations) and `MemoryMax` as the hard cliff.
6.5 Performance Tuning Slice¶
- Hook `bpftrace -e 'kprobe:oom_kill_process { printf("oom in %s, victim %s\n", comm, str(((struct oom_control *)arg0)->chosen->comm)); }'` to observe OOM events live (the victim hangs off the `oom_control` argument).
Week 7 - The CPU Scheduler (CFS, EEVDF)¶
7.1 Conceptual Core¶
- The Completely Fair Scheduler (CFS) is a virtual-time, weighted-fair-queueing scheduler: each runnable task accumulates `vruntime`, and the scheduler picks the task with the smallest `vruntime`. Weight comes from the `nice` value.
- Since Linux 6.6, CFS has been replaced by EEVDF (Earliest Eligible Virtual Deadline First), which provides better latency guarantees while preserving fairness. The userspace API and most of the conceptual model are unchanged.
- Scheduling classes (priority order, top to bottom): `dl` (deadline), `rt` (real-time, FIFO/RR), `fair` (CFS/EEVDF), `idle`. Almost everything userspace runs is in `fair`.
7.2 Mechanical Detail¶
- Read `kernel/sched/fair.c` (the EEVDF code lives there as well).
- Per-CPU runqueues; the load balancer migrates tasks between CPUs.
- `sched_setscheduler(2)` and the `chrt` userspace tool.
- CPU affinity: `sched_setaffinity(2)`, `taskset`, systemd `CPUAffinity=`.
- Cgroup cpu controller (v2): `cpu.weight` (proportional), `cpu.max` (bandwidth).
- `/proc/<pid>/sched` - per-task scheduler stats.
- `/proc/sched_debug` - system-wide scheduler state (on newer kernels, `/sys/kernel/debug/sched/debug`).
7.3 Lab - "Scheduler Forensics"¶
- Run two CPU hogs at `nice 0`. Observe the split CPU. Lower one to `nice 19`; verify a roughly 95/5 split.
- Use `bpftrace -e 'tracepoint:sched:sched_switch { @[comm] = count(); }'` to see context-switch rates.
- Pin a workload to specific CPUs with `taskset -c 0,1`. Compare the cache-miss rate vs. unpinned with `perf stat`.
- Place two services in cgroups with `cpu.weight=100` and `cpu.weight=1000`. Verify the 10:1 split under contention.
7.4 Hardening Drill¶
- Constrain `SCHED_FIFO`/`SCHED_RR` with the `kernel.sched_rt_runtime_us` tunable, and use `RestrictRealtime=yes` (plus `RestrictSUIDSGID=yes`) in systemd units to prevent privilege escalation via RT scheduling.
7.5 Performance Tuning Slice¶
- `perf sched record sleep 10; perf sched latency` - identifies wakeup-latency outliers. Tunables on older kernels: `sched_wakeup_granularity_ns`, `sched_min_granularity_ns`; EEVDF largely auto-tunes.
Week 8 - Disk I/O Scheduling, Filesystems Beyond ext4¶
8.1 Conceptual Core¶
- The I/O stack: filesystem → block layer (with merging, sorting) → I/O scheduler → device driver → hardware.
- I/O schedulers (settable per device in `/sys/block/<dev>/queue/scheduler`): `none` (preferred for NVMe), `mq-deadline`, `kyber`, `bfq`. Choose `none` for fast SSDs, `mq-deadline` for mixed workloads, `bfq` for desktop fairness.
- Filesystems:
  - `ext4` - the default; well understood; journaled.
  - `xfs` - high throughput, parallel metadata; the RHEL default.
  - `btrfs` - copy-on-write, snapshots, multi-device. Use cautiously for high-throughput workloads.
  - `zfs` - out-of-tree (CDDL); mature, snapshots, integrity. Heavy memory footprint.
  - `tmpfs` - RAM-backed.
8.2 Mechanical Detail¶
- `/sys/block/<dev>/queue/{nr_requests,read_ahead_kb,scheduler,rotational}` - tunable per device.
- `iostat -xz 1` for per-device I/O stats. Watch `%util`, `await`, `r/s`, `w/s` (`svctm` is deprecated).
- `blktrace` + `btt` for fine-grained I/O timing. Modern alternative: `bpftrace`'s `biolatency`/`biosnoop` recipes.
- Mount options for performance:
  - `noatime` - don't update access times. Always set on busy filesystems.
  - `discard` vs. periodic `fstrim` - for SSDs; periodic is usually better.
  - `commit=N` - ext4 journal commit interval.
8.3 Lab - "I/O Forensics"¶
- Run `fio` with a representative workload. Measure a baseline.
- Toggle the I/O scheduler. Re-run. Compare.
- Use `bpftrace -e 'tracepoint:block:block_rq_issue { @[args->comm] = count(); }'` to see who's hitting the disk.
- Mount with and without `noatime` and measure the difference in metadata-write traffic.
8.4 Hardening Drill¶
- Set `nodev,nosuid,noexec` on the `/tmp`, `/home`, and `/var/tmp` mounts. Document why each matters.
8.5 Performance Tuning Slice¶
- Histogram I/O latencies in microseconds with a `biolatency`-style probe pair: `bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; } tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @us = hist((nsecs - @start[args->dev, args->sector]) / 1000); delete(@start[args->dev, args->sector]); }'`.
Month 2 Capstone Deliverable¶
A `memory-and-scheduling/` directory:
1. `meminfo-decoder/` - a script that reads `/proc/meminfo` and outputs a human-readable health report.
2. `psi-watcher/` - a daemon that alerts when `memory.pressure` exceeds a threshold.
3. `sched-bench/` - comparing nice-weighted, cgroup-weighted, and pinned workloads.
4. `io-tuner/` - a `fio` harness sweeping I/O-scheduler options on the local disk.
Each comes with a markdown writeup of measurements and tuning conclusions.
Month 3 - Namespaces, cgroups v2, eBPF¶
Goal: by the end of week 12 you can (a) construct a "container" by hand using unshare(2) + pivot_root(2) + cgroups, (b) explain every controller in cgroups v2 and design a resource policy for a multi-tenant host, (c) write an eBPF program that traces a kprobe and aggregates results in a map, and (d) read the output of bpftrace recipes from BPF Performance Tools and explain each.
Weeks¶
- Week 9 - Namespaces
- Week 10 - Control Groups v2
- Week 11 - eBPF: Foundations
- Week 12 - eBPF in Production: Observability Tools
Week 9 - Namespaces¶
9.1 Conceptual Core¶
- A namespace is a kernel mechanism that gives a process a private view of a global resource. Eight types exist:
  - `mnt` - mount tree.
  - `pid` - PID space; PID 1 inside is special (signals it has not installed handlers for are ignored; when it dies, the namespace dies).
  - `net` - network stack: interfaces, routing, sockets, iptables tables.
  - `uts` - hostname, domain name.
  - `ipc` - System V IPC, POSIX message queues.
  - `user` - UID/GID mappings; the security-relevant namespace.
  - `cgroup` - view of the cgroup hierarchy.
  - `time` - monotonic and boot-time clock offsets (relatively new; container images rarely use it).
- Namespaces are the primitive containers are built on. Docker / containerd / runc use them; you can use them directly with `unshare(2)`, `clone(2)` flags, and `setns(2)`.
9.2 Mechanical Detail¶
- `unshare --user --pid --net --mount --uts --ipc --cgroup --fork --map-root-user bash` gets you "inside" most namespaces in one shell (but mount propagation is shared until you `mount --make-rslave` or remount).
- `lsns -t <type>` enumerates active namespaces. `/proc/<pid>/ns/{mnt,pid,net,...}` are inode-numbered handles you can `setns(2)` into via `nsenter --target <pid> --all`.
- The `pivot_root(2)` syscall replaces the current root with a new one - this is how containers escape the host's `/`.
- User namespaces allow unprivileged users to "be root" inside the namespace - the foundation of rootless containers.
9.3 Lab - "Hand-Built Container"¶
Write a C program that:
1. `clone()`s with `CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWCGROUP`.
2. Configures UID/GID mappings via `/proc/<pid>/uid_map` and `gid_map`.
3. Creates a veth pair to give the namespace network access.
4. `pivot_root`s into a minimal Alpine rootfs.
5. `execve`s `/bin/sh`.
You should now have a working terminal "inside" a "container" that you wrote in ~150 lines of C.
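A starting skeleton for step 1, as a sketch (error handling minimal; steps 2-5 are left to the lab):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

static int child(void *arg)
{
    /* Steps 2-5 (uid_map, veth, pivot_root, execve) happen here. */
    printf("pid as seen inside: %d\n", getpid());   /* prints 1 */
    execlp("/bin/sh", "sh", (char *)NULL);
    return 1;   /* only reached if execlp failed */
}

int main(void)
{
    int flags = CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS
              | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWCGROUP | SIGCHLD;
    pid_t pid = clone(child, child_stack + STACK_SIZE, flags, NULL);
    if (pid < 0) { perror("clone"); exit(1); }
    waitpid(pid, NULL, 0);
    return 0;
}
```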
9.4 Hardening Drill¶
- Check whether unprivileged user namespaces are enabled (`kernel.unprivileged_userns_clone` on Debian-derived kernels; enabled by default on most modern distros). Read the CVE history for user namespaces - many privilege-escalation CVEs over the years are namespace-related; learn the surface.
9.5 Performance Tuning Slice¶
- `bpftrace -e 'tracepoint:syscalls:sys_enter_setns { @[comm] = count(); }'` - see who's switching namespaces. Useful for debugging container-runtime activity.
Week 10 - Control Groups v2¶
10.1 Conceptual Core¶
cgroups (control groups) are the kernel mechanism for resource limits and accounting. Every process belongs to exactly one cgroup per controller; the controllers enforce CPU, memory, I/O, and other limits collectively rather than per-process.
v2 is the unified hierarchy: a single tree under /sys/fs/cgroup/, every controller attached to it. v1 had a separate tree per controller and a long list of design lessons; v2 is the cleanup. All new code should target v2 - major distros default to it since 2019-2020 (systemd.unified_cgroup_hierarchy=1).
Controllers available in v2: cpu, memory, io, pids, cpuset, hugetlb, rdma, misc. Plus the implicit freezer mechanism (cgroup.freeze).
This is the foundation Kubernetes resource limits, Docker --memory, and systemd MemoryMax= all sit on top of. Master it and Kubernetes' "OOMKilled" alerts become diagnosable rather than mysterious.
10.2 Mechanical Detail¶
- Filesystem layout: `/sys/fs/cgroup/` is a single mount in v2. Each subdirectory is a cgroup; create one with `mkdir`. The files inside (`cpu.max`, `memory.max`, ...) are the controller knobs.
- Memory controller files you'll touch most:
  - `memory.low` - protection threshold; the kernel reclaims from cgroups exceeding their `low` before touching protected ones.
  - `memory.high` - soft target; exceeding it throttles allocations (slows the process) but doesn't OOM.
  - `memory.max` - hard cap; exceeding it triggers cgroup-level OOM.
  - `memory.events` - counters: `low`, `high`, `max`, `oom`, `oom_kill`. Read these to alert on pressure.
  - `memory.pressure` - PSI (Pressure Stall Information): time spent waiting on memory. The signal for "I'm not OOM, but I'm not happy either."
- CPU controller:
  - `cpu.weight` - proportional share (default 100, range 1-10000). Under contention, slices are weight-proportional.
  - `cpu.max` - bandwidth as `"$quota $period"` in microseconds (e.g., `"50000 100000"` = 50% of one CPU). Kubernetes CPU limits compile to this.
- IO controller:
  - `io.weight` - proportional, like `cpu.weight`.
  - `io.max` - per-device bandwidth and IOPS caps: `"8:0 rbps=10485760 wbps=10485760 riops=100 wiops=100"`.
- PIDs controller: `pids.max` - fork-bomb protection; set a sane upper bound per service.
- Moving a process: `echo $PID > /sys/fs/cgroup/foo/cgroup.procs`. Children of moved processes inherit the new cgroup.
- systemd integration: every systemd unit gets its own cgroup automatically. `systemd-cgls` shows the tree; `systemd-cgtop` shows live resource usage; `systemctl set-property foo.service MemoryMax=2G` updates limits live. (A by-hand sketch follows below.)
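Creating a cgroup by hand, sketched (on a systemd host, prefer a delegated subtree; note that a controller must be enabled in the parent's `cgroup.subtree_control` before its knobs appear in children):

```
# Enable controllers for children (often already done by systemd).
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/cgroup.subtree_control

mkdir /sys/fs/cgroup/demo
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max    # 50% of one CPU
echo 1G > /sys/fs/cgroup/demo/memory.high            # throttle point
echo 2G > /sys/fs/cgroup/demo/memory.max             # hard cap
echo $$ > /sys/fs/cgroup/demo/cgroup.procs           # move this shell in
```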
The trap
Setting memory.max without memory.high. When the workload spikes, you go straight from "fine" to OOM-killed with no warning signal. Set memory.high to ~80% of memory.max so you see throttling (and memory.events.high counter ticks) before the kill arrives - your alerting can fire on high events rather than waiting for oom_kill.
10.3 Lab - "Multi-Tenant Cgroups"¶
- Create three sibling cgroups `tenant-a`, `tenant-b`, `tenant-c` under `/sys/fs/cgroup/test/`.
- Set `cpu.weight` to 100/200/400 - under contention (run `stress-ng --cpu N` in each), verify the 1:2:4 split with `top`.
- Set `memory.high=1G` and `memory.max=2G` on each; run a memory hog (`stress-ng --vm 1 --vm-bytes 3G`), observe throttling first (the `high` counter in `memory.events` ticks, latency increases), then OOM (the `oom_kill` counter ticks).
- Set `io.max` to limit disk bandwidth on a specific device for one cgroup; run `fio` inside, verify with `iostat -x 1`.
10.4 Hardening Drill¶
For every long-running service on a managed host, set explicit values for: MemoryHigh=, MemoryMax=, CPUQuota=, TasksMax=, IOWeight=. Document the policy in RESOURCE_POLICY.md (one row per service). The right defaults: MemoryHigh at 80% of MemoryMax; CPUQuota per service tier; TasksMax at expected concurrency × 4 for headroom.
10.5 Performance Tuning Slice¶
Wire two Prometheus metrics from every cgroup:
- cgroup_memory_events_total{event="high|max|oom_kill"} from memory.events.
- cgroup_pressure_seconds_total from memory.pressure / cpu.pressure / io.pressure (PSI).
Alert on high rate > 0 sustained (early warning), and on any oom_kill (incident). PSI is the right signal for "approaching saturation" - fires before any hard limit is hit.
Week 11 - eBPF: Foundations¶
11.1 Conceptual Core¶
- eBPF is an in-kernel virtual machine that runs verified bytecode at hookpoints (kprobes, tracepoints, XDP, socket filters, LSM, etc.). It is the modern way to extend Linux without writing kernel modules.
- The verifier rejects programs that could crash the kernel (unbounded loops, invalid memory access, dereferencing null). This is what makes eBPF safe.
- Programs communicate with userspace via maps (hash, array, ring buffer, LRU, per-CPU variants).
11.2 Mechanical Detail¶
- Tooling tiers, low to high:
  - Raw eBPF C compiled with `clang -target bpf` and loaded with `libbpf`. The production-grade path.
  - `libbpf` + CO-RE (Compile Once, Run Everywhere) - portable across kernel versions.
  - BCC (Python frontend) - older; requires kernel headers at runtime.
  - `bpftrace` - high-level scripting; the fastest path to a one-off observation.
- Hookpoints:
  - kprobes / kretprobes - kernel function entry/exit.
  - uprobes / uretprobes - userspace function entry/exit.
  - tracepoints - stable kernel events with structured args. Prefer over kprobes when available.
  - XDP - packet processing at NIC-driver level (covered in Month 4).
  - `fentry`/`fexit` - modern, lower-overhead replacement for kprobes (BPF trampoline).
  - LSM hooks - security-relevant decisions.
  - Sched, syscalls, perf events - many more.
11.3 Lab - "First eBPF Tools"¶
- Install `bpftrace`. Run `bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'` and watch the system-wide open trace. Triage it.
- Write a `bpftrace` script that histograms `read()` syscall sizes by process (a sketch follows below).
- Convert one of the recipes to `libbpf` C + a userspace consumer, using `libbpf-bootstrap` as the template.
- Read 10 of Brendan Gregg's `bpftrace` recipes (`runqlat.bt`, `tcpaccept.bt`, `vfsstat.bt`, etc.) and run them. Document each.
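One possible answer to the second bullet (a sketch; sizes come from the syscall's return value, so hook the exit tracepoint):

```
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
    @bytes[comm] = hist(args->ret);
}
```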
11.4 Hardening Drill¶
- `kernel.unprivileged_bpf_disabled=1` is the modern default (only root, or processes with `CAP_BPF`, can load programs). Verify and document.
11.5 Performance Tuning Slice¶
- Use `runqlat` (run-queue latency histogram) to detect scheduler stalls. Capture a baseline; document p50/p99/p99.9.
Week 12 - eBPF in Production: Observability Tools¶
12.1 Conceptual Core¶
- eBPF makes "perfect tracing" possible: every important system event can be intercepted with low overhead, aggregated in-kernel, and shipped to userspace.
- The standard observability stack today (in 2026) is: Cilium (networking), Pixie / Parca / Pyroscope (profiling), Tetragon (security observability), Falco (runtime security). All eBPF-based.
12.2 Mechanical Detail¶
- Continuous profiling with Parca / Pyroscope: stack sampling at low frequency across all processes, attributing on-CPU time per function, with flame graphs in a UI.
- `bpftrace`-style tools you'll write yourself:
  - `tcpconnect` - log new TCP connections with PID and process name.
  - `execsnoop` - log every `execve` system-wide.
  - `opensnoop` - every file open.
  - `biosnoop` - every block I/O completion, with latency.
- The ring buffer map (`BPF_MAP_TYPE_RINGBUF`) is the modern way to ship events to userspace; it replaces the older perf-buffer pattern with simpler, faster semantics. A consumer sketch follows below.
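The userspace side of the ring-buffer pattern with libbpf, sketched (the skeleton header `connsnoop.skel.h`, the map name `events`, and the `struct event` layout are assumptions that come from your own BPF object):

```c
#include <bpf/libbpf.h>
#include <stdio.h>
#include <signal.h>

#include "connsnoop.skel.h"   /* hypothetical: bpftool gen skeleton output */

struct event { int pid; char comm[16]; };   /* must match the BPF side */

static volatile sig_atomic_t stop;
static void on_sigint(int sig) { stop = 1; }

static int handle_event(void *ctx, void *data, size_t len)
{
    const struct event *e = data;
    printf("%-7d %s\n", e->pid, e->comm);
    return 0;
}

int main(void)
{
    struct connsnoop_bpf *skel = connsnoop_bpf__open_and_load();
    if (!skel || connsnoop_bpf__attach(skel))
        return 1;

    /* Tie the RINGBUF map's fd to a polling consumer. */
    struct ring_buffer *rb = ring_buffer__new(
        bpf_map__fd(skel->maps.events), handle_event, NULL, NULL);

    signal(SIGINT, on_sigint);
    while (!stop)
        ring_buffer__poll(rb, 100 /* ms */);

    ring_buffer__free(rb);
    connsnoop_bpf__destroy(skel);
    return 0;
}
```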
12.3 Lab - "Build a Production-Grade eBPF Tool"¶
Write `connsnoop`:
- Hooks tcp_v4_connect and tcp_v6_connect (kprobe), inet_csk_accept (kretprobe), tcp_close.
- Records per-connection: 5-tuple, PID, process name, duration, bytes-tx/rx.
- Aggregates in-kernel via per-CPU hash maps, ships completion events through a ring buffer.
- Userspace consumer in C (with libbpf) or Go (with cilium/ebpf). Outputs JSON.
- Verifier-clean, CO-RE-portable across kernels 5.10+.
12.4 Hardening Drill¶
- Add `connsnoop` as a systemd service with full hardening. The eBPF program needs `CAP_BPF` and `CAP_PERFMON`; do not grant `CAP_SYS_ADMIN` (the legacy alternative).
12.5 Performance Tuning Slice¶
- Run `connsnoop` on a host doing real work; measure its CPU overhead with `perf stat`. Target <0.5% in steady state. If it's higher, narrow the hookpoints or aggregate more in-kernel.
Month 3 Capstone Deliverable¶
A `namespaces-cgroups-ebpf/` directory:
1. `mini-container/` (week 9) - the C program that builds a container by hand.
2. `multi-tenant-cgroups/` (week 10) - the cgroup-v2 policy + verification script.
3. `bpf-tour/` (week 11) - five `bpftrace` recipes with annotated output.
4. `connsnoop/` (week 12) - the libbpf tool + userspace consumer.
CI runs the recipes against a CI VM and validates output schemas. Open one upstream interaction: a doc-fix PR to bpftrace, or a tested bpftrace recipe submitted as an example.
Month 4 - Linux Networking: Netfilter, IPVS, XDP, Bridges, OVS¶
Goal: by the end of week 16 you can (a) trace a packet from NIC through XDP, the network stack, conntrack, iptables/nftables, sockets, and back, (b) configure a Linux bridge and an Open vSwitch flow, (c) write an XDP program that drops or redirects packets, and (d) reason about IPVS load-balancing modes.
Weeks¶
- Week 13 - The Network Stack: Sockets, NAPI, conntrack
- Week 14 - Netfilter / nftables / iptables, IPVS
- Week 15 - XDP and AF_XDP
- Week 16 - Bridges, VLANs, OVS
Week 13 - The Network Stack: Sockets, NAPI, conntrack¶
13.1 Conceptual Core¶
- The Linux networking stack is layered: NIC driver → NAPI (interrupt + polling hybrid) → `netif_receive_skb` → protocol handlers (IP, ARP) → transport (TCP/UDP) → socket buffer → userspace via `recv()`.
- An `sk_buff` is the kernel's packet representation: a struct with metadata (~250 bytes) plus pointers into the data buffer. It travels from driver to socket.
- Netfilter is the packet-mangling/filtering framework, with hooks at PRE_ROUTING, INPUT, FORWARD, OUTPUT, and POST_ROUTING. iptables and nftables are the userspace tools that program these hooks.
13.2 Mechanical Detail¶
- `ss -tnp` - show TCP sockets with PIDs. Replaces the deprecated `netstat`.
- The `ip` suite - `ip addr`, `ip route`, `ip rule`, `ip neigh`, `ip link`, `ip tuntap`. The single tool you need; `ifconfig` and `route` are deprecated.
- conntrack: stateful tracking of connections in netfilter. `conntrack -L` shows the table; `nf_conntrack_max` tunes capacity. Each entry is ~300 bytes; a busy host may track millions.
- `tc` - traffic control: queue disciplines (qdiscs), classes, filters. The shaping and policing tool. Modern qdiscs: `fq_codel` (default), `fq` (for high bandwidth), `cake`.
- TSO / GSO / GRO / LRO - segmentation offloads. Disable with `ethtool -K` for debugging; they distort packet timing.
13.3 Lab - "Packet Forensics"¶
- `tcpdump -i any -nn -X 'tcp port 443' -c 10` - capture and dissect TLS-handshake bytes.
- Trace a TCP connection's lifecycle with `bpftrace`'s `tcplife.bt`.
- Add a gratuitous DROP rule with `iptables -I INPUT -p icmp -j DROP` and verify with `ping`. Remove it. Repeat with `nft`.
- Inspect conntrack: `cat /proc/net/nf_conntrack` while a long-lived connection is open.
13.4 Hardening Drill¶
- `sysctl net.ipv4.tcp_syncookies=1`, `net.ipv4.conf.all.rp_filter=1`, `net.ipv4.conf.all.accept_source_route=0`, `net.ipv6.conf.all.accept_redirects=0`. Document each.
13.5 Performance Tuning Slice¶
- `ethtool -S <iface>` - driver stats (drops, errors, checksum issues). `ip -s link show` - interface stats. Identify any non-zero error counter.
Week 14 - Netfilter / nftables / iptables, IPVS¶
14.1 Conceptual Core¶
- iptables is being phased out in favor of nftables, its in-kernel successor. Both program the netfilter hooks.
- IPVS (IP Virtual Server) is the kernel-level L4 load balancer used by `kube-proxy` (`ipvs` mode) and many appliance-style LBs. Three modes: NAT, DR (direct return), TUN.
- The decision matrix: simple SNAT/DNAT/firewall → nftables. L4 LB at scale → IPVS. L7 LB → userspace (Envoy, Nginx, HAProxy).
14.2 Mechanical Detail¶
- nftables tables: families `ip`, `ip6`, `inet`, `arp`, `bridge`, `netdev`. Tables contain chains; chains contain rules; rules match and act.
- The standard pattern for a host firewall:

```
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif lo accept
        tcp dport 22 accept
        icmp type echo-request accept
    }
    chain forward {
        type filter hook forward priority 0; policy drop;
    }
    chain output {
        type filter hook output priority 0; policy accept;
    }
}
```

- IPVS is configured via `ipvsadm`. A virtual service has a VIP+port, a scheduling algorithm (`rr`, `wrr`, `lc`, `wlc`, `sh` source-hash), and real servers.
- `conntrack-tools` for inspecting and manipulating the conntrack table; `conntrackd` for HA replication.
14.3 Lab - "Build a Stateful Firewall and a Load Balancer"¶
- Convert an existing iptables ruleset to nftables. Verify equivalence with packet probes.
- Set up IPVS-DR: a VIP with two real servers; load-test with `wrk`. Compare with HAProxy on the same setup.
- Saturate the conntrack table on purpose; observe `nf_conntrack: table full, dropping packet` in dmesg. Tune `nf_conntrack_max`.
14.4 Hardening Drill¶
- Default-deny INPUT and FORWARD policies. Document the allowed flows. Ship the nftables ruleset as part of the host's idempotent provisioning.
14.5 Performance Tuning Slice¶
- Compare iptables vs. nftables vs. IPVS per-packet overhead with `perf stat` on a packet-flood workload.
Week 15 - XDP and AF_XDP¶
15.1 Conceptual Core¶
- XDP (eXpress Data Path) is an eBPF hookpoint at the driver level, before the kernel constructs an `sk_buff`. It is the earliest possible point to drop, redirect, or pass packets. Used for DDoS scrubbers, custom load balancers (Katran), and eBPF-based service meshes (Cilium).
- AF_XDP is a userspace fast path: pin a NIC queue to a userspace process and exchange packets via shared-memory rings. Throughput approaches DPDK with lower complexity.
- The four main XDP actions: `XDP_DROP`, `XDP_PASS` (continue to the kernel stack), `XDP_TX` (back out the same NIC), `XDP_REDIRECT` (to another NIC, an AF_XDP socket, or a CPU map).
15.2 Mechanical Detail¶
- Drivers vary in XDP support level: native (best), generic (slow software fallback), offloaded (some smartNICs).
- Verify with `ip link show <iface>` - look for the `xdp` mode flag.
- Attach: `bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp; bpftool net attach xdp pinned /sys/fs/bpf/my_xdp dev eth0`.
- XDP programs are constrained: no helpers that allocate memory, no loops past the verifier's bound, no kernel function calls outside the eBPF helper list.
15.3 Lab - "An XDP DDoS Scrubber"¶
Write an XDP program that:
- Drops UDP packets with source port < 1024 (a coarse DDoS-vector heuristic).
- Counts dropped packets per source IP in an LRU-hash map (1M entries).
- Userspace tool reads the map every second and emits Prometheus metrics.
- Test with pktgen or trafgen. Measure throughput and CPU overhead.
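A minimal sketch of the drop-and-count logic (IPv4/UDP only; the userspace metrics side is left out; compile with `clang -O2 -g -target bpf`; every bounds check is required or the verifier rejects the program):

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1 << 20);     /* 1M entries */
    __type(key, __u32);               /* source IPv4 */
    __type(value, __u64);             /* drop count */
} drops SEC(".maps");

SEC("xdp")
int scrub(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    /* Coarse heuristic from the lab: drop UDP with source port < 1024. */
    if (bpf_ntohs(udp->source) < 1024) {
        __u32 src = ip->saddr;
        __u64 one = 1, *cnt = bpf_map_lookup_elem(&drops, &src);
        if (cnt)
            __sync_fetch_and_add(cnt, 1);
        else
            bpf_map_update_elem(&drops, &src, &one, BPF_ANY);
        return XDP_DROP;
    }
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```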
15.4 Hardening Drill¶
- XDP programs require `CAP_NET_ADMIN` (and `CAP_BPF`). Document the operational privilege required.
15.5 Performance Tuning Slice¶
- Measure pps capacity with vs without XDP on the same NIC. Modern 25/40 Gbps NICs can drop 10s of Mpps with native XDP.
Week 16 - Bridges, VLANs, OVS¶
16.1 Conceptual Core¶
- A Linux bridge is an in-kernel L2 switch. Used by virtually every container runtime and VM hypervisor.
- VLAN (802.1Q) tagging segments a single L2 network into many.
- Open vSwitch (OVS) is a programmable virtual switch: flow-table-based (OpenFlow-compatible), with hardware offload to smartNICs. Used by OpenStack Neutron, Kubernetes (older networking), and OVN.
16.2 Mechanical Detail¶
- Linux bridge management with `ip link` and the `bridge` tool (commands sketched after this list).
- VLAN: `ip link add link eth0 name eth0.10 type vlan id 10`.
- OVS: `ovs-vsctl add-br br0; ovs-vsctl add-port br0 eth0; ovs-ofctl dump-flows br0`. The flow table is the programmable part.
- Bridge vs. OVS decision matrix: simple L2 connectivity → bridge. Programmable flows, OpenFlow, hardware offload → OVS.
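A plausible sketch of the basic bridge commands (device names are examples):

```
ip link add br0 type bridge
ip link set br0 up
ip link set eth0 master br0     # enslave a NIC to the bridge
bridge link                     # list bridge ports
bridge fdb show br br0          # the MAC forwarding table
```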
16.3 Lab - "Three Network Topologies"¶
- Two namespaces connected via a Linux bridge: classic container networking.
- Two namespaces on tagged VLANs sharing one bridge.
- The same topology in OVS, with explicit OpenFlow rules.
For each, verify connectivity with ping, capture with tcpdump, document the difference.
16.4 Hardening Drill¶
- Bridge/iptables integration: `sysctl net.bridge.bridge-nf-call-iptables=1` (so bridged traffic traverses iptables). Understand whether you want this - for some setups (e.g., transparent bridges) you don't.
16.5 Performance Tuning Slice¶
- Compare per-packet latency through a bridge vs OVS vs a direct veth pair under load.
Month 4 Capstone Deliverable¶
A `linux-networking/` directory:
1. `nft-firewall/` - a default-deny stateful firewall with a documented allowlist.
2. `ipvs-lb/` - an IPVS-DR load balancer with two backends and a health-check sidecar.
3. `xdp-scrubber/` - the DDoS scrubber + Prometheus exporter.
4. `bridge-vs-ovs/` - three topologies + a comparison report.
A `NETWORK_RUNBOOK.md` documenting interface inventory, MTU, sysctl tunables, and the firewall ruleset.
Month 5 - Security and Hardening¶
Goal: by the end of week 20 you can (a) author SELinux and AppArmor profiles for a service, (b) configure full-disk encryption with LUKS and explain the key derivation chain, (c) write a seccomp-bpf policy, and (d) ship an audited host that passes a basic CIS benchmark.
Weeks¶
- Week 17 - Discretionary and Mandatory Access Control
- Week 18 - Capabilities, Seccomp, no_new_privs
- Week 19 - Encryption at Rest: LUKS, dm-crypt, dm-verity
- Week 20 - Audit, Integrity Measurement, and Compliance
Week 17 - Discretionary and Mandatory Access Control¶
17.1 Conceptual Core¶
- DAC - discretionary access control - the classic `rwx` permissions plus POSIX ACLs (`getfacl`/`setfacl`). The owner decides who can access.
- MAC - mandatory access control - the kernel decides, based on a policy administered separately from file ownership. Two implementations dominate Linux:
  - SELinux (RHEL family, default-enforcing). Type-enforcement based. Powerful, complex.
  - AppArmor (Debian/Ubuntu/SUSE family). Path-based. Simpler, less expressive.
- Both are LSMs (Linux Security Modules); only one major MAC module is typically active per system.
17.2 Mechanical Detail¶
- The SELinux context: `<user>:<role>:<type>:<level>`. The type is the workhorse; rules express "type X may do operation Y on type Z."
- `getenforce`, `setenforce 0|1`, `audit2allow`, `semanage`, `restorecon`, `chcon`. Memorize these six.
- `ausearch -m AVC -ts recent` for SELinux denials. `audit2allow -a -M mymodule` to draft a policy module from observed denials. Production discipline: never `setenforce 0` to unblock; capture the denials, build a policy, ship it.
- AppArmor: profiles live in `/etc/apparmor.d/`. Generate with `aa-genprof`, refine with `aa-logprof`. Modes: `enforce` and `complain` (logs but allows). `aa-status` shows active profiles.
17.3 Lab - "MAC for an Echo Service"¶
Take week 1's echo service. Author:
1. An SELinux type-enforcement module that allows it to bind its socket and read its config but nothing else.
2. An equivalent AppArmor profile.
Verify with deliberate violations (try to read `/etc/shadow`); both should deny and audit.
17.4 Hardening Drill¶
- Survey your distro: which LSM is active? Read its policy for one well-known service (`httpd` on RHEL, `nginx` on Ubuntu) and explain the constraint set.
17.5 Performance Tuning Slice¶
- Measure SELinux/AppArmor overhead with `perf stat` on a syscall-heavy workload - typically <2%.
Week 18 - Capabilities, Seccomp, no_new_privs¶
18.1 Conceptual Core¶
- Linux capabilities subdivide the historical "root" privilege into ~40 discrete capabilities (`CAP_NET_ADMIN`, `CAP_SYS_PTRACE`, `CAP_DAC_OVERRIDE`, etc.). A process holds bounding, effective, permitted, inheritable, and ambient sets.
- The principle: a service should hold only the capabilities it needs. A web server binding a port below 1024 needs `CAP_NET_BIND_SERVICE`, not full root.
- seccomp-bpf is a syscall-level allowlist/denylist enforced by a (classic) BPF program installed via `prctl(PR_SET_SECCOMP)` or `seccomp(2)`. A killer feature for sandboxing.
- `no_new_privs` (`PR_SET_NO_NEW_PRIVS`): once set, neither the calling task nor its descendants can gain privileges via setuid binaries, file capabilities, or LSM transitions. Required before an unprivileged process applies a seccomp filter (and a generally good default).
18.2 Mechanical Detail¶
- `getcap` and `setcap` manage file capabilities; `getpcaps <pid>` shows process capabilities.
- systemd directives: `CapabilityBoundingSet=`, `AmbientCapabilities=`, `NoNewPrivileges=yes`, `SystemCallFilter=`, `SystemCallArchitectures=native`.
- `SystemCallFilter=@system-service` is a curated allowlist that covers most service workloads. Combine with explicit denylists for risky calls.
- For container runtimes, the Docker default seccomp profile (or an equivalent) is a reasonable baseline; understand why each blocked syscall is blocked.
18.3 Lab - "Capabilities and Seccomp"¶
- Convert a service that runs as root to one that runs as a non-root user with only the minimum capabilities.
- Author a seccomp policy using libseccomp that allows only the syscalls the service uses (a sketch follows after this list). Verify by attempting denied syscalls.
- Apply via systemd `SystemCallFilter=` and confirm.
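A minimal libseccomp sketch (assumes the libseccomp headers; link with `-lseccomp`; the allowlist here is illustrative, derive the real one from `strace -c`):

```c
#include <seccomp.h>
#include <unistd.h>

int main(void)
{
    /* Default action: kill the process on any unlisted syscall. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx)
        return 1;

    /* Illustrative allowlist for a tiny cat-like tool. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    if (seccomp_load(ctx) < 0)   /* also sets no_new_privs by default */
        return 1;
    seccomp_release(ctx);

    write(1, "sandboxed\n", 10); /* allowed */
    return 0;                    /* exit_group is allowed, so this works */
}
```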
18.4 Hardening Drill¶
- Review every long-running service on a host. For each: what capabilities does it actually need? Document. Tighten where possible.
18.5 Performance Tuning Slice¶
- seccomp adds a small per-syscall cost. Measure with `perf stat -e 'syscalls:sys_enter_*'` before and after.
Week 19 - Encryption at Rest: LUKS, dm-crypt, dm-verity¶
19.1 Conceptual Core¶
- LUKS (Linux Unified Key Setup) is the standard for full-disk encryption: a header at the start of a block device contains key-slots (each protected by a passphrase or keyfile), which unlock a master key, which is used by dm-crypt to en/decrypt block I/O.
- dm-verity provides integrity (not confidentiality) for read-only filesystems via a Merkle tree. Used in Android, Fedora Silverblue, and increasingly in container hosts.
- fscrypt offers per-file encryption at the ext4/f2fs/UBIFS layer, with per-user keys.
19.2 Mechanical Detail¶
- LUKS2 header structure: binary header + JSON metadata + key-slot area. `cryptsetup luksDump` shows the metadata.
- The chain: passphrase → Argon2id KDF → key-slot key → unlocks the master volume key → dm-crypt encrypts/decrypts with AES-XTS (the default).
- Key management:
  - Multiple key slots (up to 32 in LUKS2, vs. 8 in LUKS1). Add/remove with `cryptsetup luksAddKey`/`luksRemoveKey`.
  - TPM2 binding: `systemd-cryptenroll --tpm2-device=auto` for unattended boot with measured-boot integrity.
  - YubiKey FIDO2: `systemd-cryptenroll --fido2-device=auto`.
  - `crypttab(5)` for boot-time activation; systemd's generator translates it into units.
19.3 Lab - "Encrypt a Disk End to End"¶
- Create a LUKS2 volume on a spare disk or loopback file.
- Format with ext4. Mount.
- Add a TPM2-bound key slot. Enroll a recovery passphrase.
- Configure auto-unlock at boot via `crypttab`.
- Simulate disk theft: dump the device contents; verify they are opaque without the key.
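A sketch of the loopback variant (device and mount names are examples; these commands are destructive, so keep them on a scratch file):

```
truncate -s 2G /tmp/disk.img
losetup /dev/loop7 /tmp/disk.img
cryptsetup luksFormat --type luks2 /dev/loop7
cryptsetup open /dev/loop7 cryptlab
mkfs.ext4 /dev/mapper/cryptlab
mount /dev/mapper/cryptlab /mnt
systemd-cryptenroll --tpm2-device=auto /dev/loop7   # add a TPM2 key slot
cryptsetup luksDump /dev/loop7                      # inspect the key slots
```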
19.4 Hardening Drill¶
- For laptops: enable LUKS with a strong passphrase + TPM2 + measured boot. For servers: TPM2 binding plus `clevis` + `tang` for network-bound disk encryption (auto-unlock only when the host can reach the key server).
19.5 Performance Tuning Slice¶
- Measure encryption overhead with `fio`: LUKS-encrypted vs. plaintext. On modern CPUs with AES-NI, expect <5% throughput cost.
Week 20 - Audit, Integrity Measurement, and Compliance¶
20.1 Conceptual Core¶
- The Linux audit subsystem (`auditd`) generates structured logs of security-relevant events: syscalls, file access, login attempts, privilege escalations.
- IMA/EVM (Integrity Measurement Architecture / Extended Verification Module) hashes files at access time and optionally signs them; it integrates with the TPM for attestation.
- CIS benchmarks and STIGs are industry-standard hardening checklists. `openscap` and `lynis` automate auditing against them.
20.2 Mechanical Detail¶
- `auditctl` configures rules at runtime; `/etc/audit/rules.d/*.rules` for persistence. `aureport` and `ausearch` for querying.
- A reasonable baseline ruleset (sketched below): log every `execve`, every failed open of `/etc/passwd` or `/etc/shadow`, every change to the `auditd` config itself, and every privilege escalation.
- IMA: the `ima_policy=appraise_tcb` kernel parameter. Measures executables; with EVM, signs metadata.
- Compliance scanning: `lynis audit system` for a quick local audit; `openscap` (`oscap xccdf eval`) for SCAP-based formal benchmarks.
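One plausible baseline in `/etc/audit/rules.d/` syntax (the `-k` key names are arbitrary labels):

```
-a always,exit -F arch=b64 -S execve -k exec
-a always,exit -F arch=b64 -S openat -F exit=-EACCES -F path=/etc/shadow -k cred-fail
-w /etc/passwd -p wa -k ident
-w /etc/shadow -p wa -k ident
-w /etc/audit/ -p wa -k auditconfig
-w /usr/bin/sudo -p x -k priv-esc
```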
20.3 Lab - "An Audited Host"¶
- Configure auditd with a baseline ruleset.
- Trigger expected events (a failed `su`, an edit of `/etc/passwd`); verify the logs.
- Run `lynis audit system` - record the score and address the top 5 findings.
- (Optional) Boot with IMA enabled; measure the impact on boot time and observe `/sys/kernel/security/ima/ascii_runtime_measurements`.
20.4 Hardening Drill¶
- Ship an idempotent provisioning playbook (Ansible/shell) that applies the CIS baseline tunables: disable unused services, set `umask 027` for system accounts, limit `at`/`cron` to admins, etc.
20.5 Performance Tuning Slice¶
- Audit logging at high syscall rates is expensive. Measure the log volume; rate-limit chatty rules with `-F` filters, or move to `audisp-remote` to ship logs off-host.
Month 5 Capstone Deliverable¶
A `security-and-hardening/` directory:
1. `mac-profiles/` - SELinux + AppArmor profiles for the echo service.
2. `cap-seccomp/` - a minimum-capability + seccomp-policy template.
3. `luks-tang/` - a fully encrypted volume with network-bound auto-unlock.
4. `audit-baseline/` - auditd rules + a `lynis`-validated host playbook.
A `THREAT_MODEL.md` for the example service: assets, attack surfaces, mitigations.
Month 6 - Kernel Modules, Performance Mastery, Capstone¶
Goal: by the end of week 24 you have shipped one capstone in your chosen track and can defend every design decision in a senior systems-engineering interview.
Weeks¶
- Week 21 - Loadable Kernel Modules (LKM)
- Week 22 - Tracing and Performance Mastery: ftrace, perf, BPF
- Week 23 - Performance Tuning at Scale
- Week 24 - Capstone Integration & Defense
Week 21 - Loadable Kernel Modules (LKM)¶
21.1 Conceptual Core¶
- A loadable kernel module is C code compiled against the running kernel's headers, loaded via `insmod`/`modprobe`, and unloaded via `rmmod`. It runs at ring 0 with full kernel privileges. There is no safety net: a bug crashes the box.
- The legitimate uses for LKMs in 2026 are narrow: device drivers for hardware not supported by mainline, niche filesystem additions, and specific tracing or security modules that cannot be expressed as eBPF.
- For most "I want to extend the kernel" needs today, eBPF is the right answer. Pick an LKM only when you genuinely cannot express the work in eBPF.
21.2 Mechanical Detail¶
- The skeleton: a module is an object with `module_init`/`module_exit` entry points (a sketch follows below).
- Build via an out-of-tree `Makefile` that delegates to the kernel's Kbuild (sketch below).
- Character device skeleton: register a `cdev`, define `file_operations` (`open`, `release`, `read`, `write`, `unlocked_ioctl`), allocate a `dev_t` via `alloc_chrdev_region`, create a `class` and `device` so udev creates `/dev/<name>`.
- Locking: `spin_lock` (interrupt context), `mutex_lock` (sleepable), `rcu_read_lock` (read-mostly). Choosing wrong is a deadlock or a soft lockup.
- Memory: `kmalloc` (small, contiguous), `vmalloc` (large, possibly non-contiguous), `kmem_cache_*` (slab). Accounted in `/proc/slabinfo`.
- Signing & secure boot: production hosts with secure boot will reject unsigned modules. Sign with `scripts/sign-file` against a MOK (Machine Owner Key).
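The referenced skeletons, as minimal sketches (the module name `hello` is a placeholder):

```c
// hello.c - minimal out-of-tree module skeleton.
#include <linux/init.h>
#include <linux/module.h>
#include <linux/printk.h>

static int __init hello_init(void)
{
    pr_info("hello: loaded\n");
    return 0;                     /* a non-zero return aborts the load */
}

static void __exit hello_exit(void)
{
    pr_info("hello: unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Skeleton module");
```

And the matching Makefile (recipe lines must be indented with tabs):

```
obj-m += hello.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
```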
21.3 Lab - "A Character Device LKM"¶
Write `pkv`, a simple in-kernel key/value character device:
- `/dev/pkv` accepts writes of the form `key=value\n`; reads return the value for the last-written key.
- 100 KV slots, in-kernel hash table.
- Concurrency-correct under multiple writers/readers (use a mutex; pursue an rwlock variant as a stretch).
- ioctl operations for LIST and DELETE.
- KUnit tests in tree.
- Loads/unloads cleanly with no lockdep or KASAN warnings (turn both on in your test kernel).
21.4 Hardening Drill¶
- Read the CVE history for `staging` drivers; identify three classes of bugs that recur (use-after-free via `kfree` on an error path, missing `copy_from_user` length check, integer overflow). Audit your `pkv` for each.
21.5 Performance Tuning Slice¶
- Run `bpftrace` with kprobes on your module's functions; measure per-op latency. Compare against an equivalent userspace KV store (e.g., a Unix-socket server).
Week 22 - Tracing and Performance Mastery: ftrace, perf, BPF¶
22.1 Conceptual Core¶
- The Linux observability triad: ftrace (function tracer; in-kernel only), perf (sampling profiler + tracepoint subscriber), eBPF (programmable, low-overhead).
- Each has its niche. For "what is the kernel doing?" → ftrace function_graph. For "where is CPU time spent?" → perf record + flamegraph. For "summarize a behavior with low overhead" → eBPF.
22.2 Mechanical Detail¶
- ftrace via `/sys/kernel/tracing/`. Set `current_tracer`, filter with `set_ftrace_filter`, dump `trace`. Modern frontend: `trace-cmd`. A quick tour follows this list.
- perf:
  - `perf stat` - counter snapshot.
  - `perf record -g` + `perf report` - sampling profiler with call graphs. `perf script` to feed into flamegraph.pl for flamegraphs.
  - `perf trace` - strace-equivalent with low overhead.
  - `perf top` - live profiler.
- `bpftrace` for one-liners; `libbpf` C for production tools.
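A minimal function_graph tour through tracefs (the same interface `trace-cmd` wraps):

cd /sys/kernel/tracing
echo function_graph > current_tracer
echo vfs_read > set_graph_function    # graph only call trees rooted at vfs_read
head -40 trace
echo nop > current_tracer             # reset when done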
22.3 Lab-"End-to-End Profiling"¶
- Take a service running on a host. Capture: `perf record -F 99 -ag -- sleep 30`.
- Generate a flamegraph.
- Identify the top three CPU consumers; for each, propose a hypothesis and a fix.
- Compare with the same workload profiled by `parca` or `pyroscope`, if available.
22.4 Hardening Drill¶
`perf` requires `kernel.perf_event_paranoid` ≤ 2 for unprivileged use. Decide your policy: tighter (`=3`, perf disabled for non-root; a value carried by some distro kernels such as Debian's, not mainline) or looser (`=1`, allowing unprivileged users to profile their own processes with kernel samples).
22.5 Performance Tuning Slice¶
- Run `runqlat`, `cpudist`, `offcputime` (BPF) on a busy host. Build a one-page "what's wrong with this host?" diagnostic flow.
Week 23 - Performance Tuning at Scale¶
23.1 Conceptual Core¶
- Tuning at scale is systematic-not "tweak `vm.swappiness`." It is: measure, hypothesize, change one variable, re-measure, document.
- Brendan Gregg's USE method: for every resource, characterize Utilization, Saturation, Errors. Apply to CPU, memory, disk, network.
- The common bottlenecks, in rough rank order: I/O (latency or throughput), memory pressure (PSI), syscall-frequency or context-switch storms, lock contention, NIC drops.
23.2 Mechanical Detail¶
- CPU: `mpstat -P ALL 1`, `pidstat 1`. Look for one-CPU-pegged. Check for IRQ imbalance (`/proc/interrupts`).
- Memory: `vmstat 1`, `/proc/pressure/memory`. PSI > 0% sustained = problem.
- Disk: `iostat -xz 1`. Look at `await` and `%util`. >70% `%util` and >10 ms `await` = saturated.
- Network: `sar -n DEV 1`, `ethtool -S <iface>`. Drops, errors, frame-too-long counts.
- Kernel: `perf top` - if `__do_softirq` or `_raw_spin_lock` is hot, dig further.
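A one-shot first-pass sweep with the tools above (`eth0` is a placeholder interface):

mpstat -P ALL 1 3        # CPU utilization per core
cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io   # saturation (PSI)
vmstat 1 3               # run queue, swap, context switches
iostat -xz 1 3           # disk await and %util
sar -n DEV 1 3           # NIC throughput
ethtool -S eth0 | grep -iE 'drop|err'   # NIC errors and drops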
23.3 Lab-"Triage Drill"¶
A scripted "broken host" is provided (or build one): a VM with one of {disk-bound, memory-bound, network-bound, lock-contended, scheduler-thrashing} pathologies. Diagnose using only the tools above. Document the inference chain. Then introduce a fix and verify.
23.4 Hardening Drill¶
- Codify the USE-method dashboard for your environment. Wire to Prometheus/Grafana. Alert on saturation > 80% for any resource for 5 min.
23.5 Performance Tuning Slice¶
- Sysctl baseline for high-throughput servers (review and adapt-never copy blindly): see Appendix A.2. Each line should be paired with a justification in your runbook.
Week 24 - Capstone Integration & Defense¶
24.1 Conceptual Core¶
The final week is integration, not new material. Bring your chosen capstone (see CAPSTONE_PROJECTS.md) to defensible quality.
24.2 Final Hardening Checklist¶
- CIS benchmark (or `lynis`) score documented; top findings addressed.
- LSM (SELinux or AppArmor) enforcing for any service touched.
- All long-running services systemd-managed with full hardening directives.
- auditd configured; ruleset documented.
- LUKS (where applicable); TPM2 binding documented.
- Sysctl baseline applied; deviations explained.
- Boot is reproducible: same image → same hash, where applicable.
- Observability: `node_exporter`, eBPF observability tools, log shipping.
- Runbooks for: OOM, disk-full, network-down, runaway-CPU, broken-DNS.
24.3 Lab-"Defend the Host"¶
Schedule a 45-minute mock review with a peer. Walk through: the host's threat model, the capstone artifact, the observability story, and a live demo of triaging a fault. Defend every choice-cgroup policy, LSM type, sysctl values, auditd rules.
24.4 Performance Tuning Slice¶
- Final pass: capture a 1-minute `perf record -ag` flamegraph of the capstone under representative load. Commit it. This is the resume artifact.
Month 6 Deliverable¶
The capstone artifact, plus an aggregated linux-mastery/ repo containing every prior month's deliverable.
Appendix A-Hardening and Performance Tuning Reference¶
Curriculum-wide consolidation of the hardening and tuning slices. By week 24 the reader's host-baseline/ template should contain working examples of each section.
A.1 The systemd Hardening Cheat Sheet¶
For every long-running service, evaluate each:
[Service]
# Identity
DynamicUser=yes
User=svc
Group=svc
# Filesystem isolation
ProtectSystem=strict
ProtectHome=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
ProtectClock=yes
ProtectHostname=yes
ProtectProc=invisible
PrivateTmp=yes
PrivateDevices=yes
PrivateUsers=yes
PrivateMounts=yes
ReadOnlyPaths=/
ReadWritePaths=/var/lib/svc
# Networking
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
IPAddressAllow=localhost
IPAddressDeny=any
RestrictNetworkInterfaces=lo eth0
# Capabilities & syscalls
NoNewPrivileges=yes
CapabilityBoundingSet=
AmbientCapabilities=
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
# Resources
MemoryMax=512M
MemoryHigh=384M
CPUQuota=200%
TasksMax=128
LimitNOFILE=65536
# Other
LockPersonality=yes
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
RemoveIPC=yes
UMask=0077
Validate with `systemd-analyze security <unit>`, which scores each directive and reports an overall exposure level (lower is better).
A.2 Sysctl Baseline (Server)¶
# Network-connection backlog
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.core.netdev_max_backlog = 16384
# Network-TCP behavior
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
# Network-anti-spoof
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
# VM / memory
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
vm.overcommit_memory = 0
vm.mmap_min_addr = 65536
vm.unprivileged_userfaultfd = 0
# File handles & inotify
fs.file-max = 2097152
fs.inotify.max_user_watches = 524288
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
# Kernel hardening
kernel.dmesg_restrict = 1
kernel.kptr_restrict = 2
kernel.unprivileged_bpf_disabled = 1
kernel.yama.ptrace_scope = 1
kernel.perf_event_paranoid = 2
kernel.kexec_load_disabled = 1
Customize per workload; never copy blindly.
A.3 The perf Reference Card¶
| Goal | Command |
|---|---|
| Top CPU consumers (live) | perf top -F 99 -g |
| Sample profile + call graph | perf record -F 99 -g -- sleep 30; perf report |
| Flamegraph | perf record -F 99 -ag -- sleep 30; perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg |
| Counter snapshot | perf stat -e cycles,instructions,cache-misses ./prog |
| Syscalls (strace replacement) | perf trace |
| Sched debug | perf sched record sleep 10; perf sched latency |
| Block I/O | perf trace -e block:* -- sleep 10 |
A.4 The bpftrace Reference Card¶
# top syscalls per process for 10 s
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count() }' -c 'sleep 10'
# file open audit
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)) }'
# tcp connect latency
bpftrace -e 'kprobe:tcp_v4_connect { @start[tid] = nsecs }
kretprobe:tcp_v4_connect /@start[tid]/ {
@lat = hist((nsecs - @start[tid])/1000); delete(@start[tid])
}'
# run-queue latency histogram
bpftrace tools/runqlat.bt
# what's filling the page cache
bpftrace -e 'kprobe:add_to_page_cache_lru { @[comm] = count() }'
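# (on newer kernels add_to_page_cache_lru is gone; probe filemap_add_folio instead)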
A.5 SystemTap (Legacy)¶
SystemTap predates eBPF and is still occasionally useful on RHEL 7-era kernels. One line to show the flavor (the `syscall.openat` probe and its `filename` variable come from the stap tapsets):
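sudo stap -e 'probe syscall.openat { printf("%s %s\n", execname(), filename) }'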
For new work, prefer eBPF / `bpftrace`. Only learn `stap` if your environment forces it.
A.6 The host-baseline/ Template¶
host-baseline/
ansible/
roles/
common/ # hostname, time, base packages
sshd/ # hardened sshd_config
audit/ # auditd ruleset
sysctl/ # the baseline above
firewall/ # nftables ruleset
lsm/ # SELinux or AppArmor profiles
observability/ # node_exporter, journald-remote, eBPF tools
cgroups/ # tenant cgroup template
systemd-units/ # service templates
ebpf-tools/ # in-house tracing tools
RUNBOOK.md
THREAT_MODEL.md
This is the artifact every host you bring up after week 24 should be provisioned from.
Appendix B-Build-From-Scratch Linux Toolbox¶
A working Linux engineer should have implemented each of the following at least once.
B.1 A Self-Healing systemd Service¶
# /etc/systemd/system/self-heal.service
[Unit]
Description=Self-healing application
After=network-online.target
Wants=network-online.target
# restart rate-limiting lives in [Unit] (modern names, systemd >= 229)
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=notify
ExecStart=/usr/local/bin/myapp
WatchdogSec=30s
Restart=always
RestartSec=2s
TimeoutStopSec=10
NotifyAccess=main
# health-check via WatchdogSec: app calls sd_notify(0, "WATCHDOG=1") periodically
# if it stops, systemd kills and restarts
# (paste the hardening block from Appendix A here)
[Install]
WantedBy=multi-user.target
The application calls sd_notify(0, "READY=1") after init and sd_notify(0, "WATCHDOG=1") periodically. If the watchdog timer expires, systemd kills and restarts. This is the standard pattern for self-healing in production.
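To feel the protocol without writing C, a shell sketch (this variant needs `NotifyAccess=all`, because `systemd-notify` sends from a short-lived child process):

#!/bin/bash
systemd-notify --ready              # READY=1
while :; do
    # ... one unit of real work here ...
    systemd-notify WATCHDOG=1       # pet the watchdog
    sleep 10                        # keep this well under WatchdogSec
done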
B.2 A Minimal Init (PID 1)¶
For containers or micro-VMs:
#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
static pid_t child;
static void forward(int sig) { kill(child, sig); }

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    child = fork();
    if (child == 0) { execvp(argv[1], argv + 1); _exit(127); }
    // PID 1 must forward signals and reap zombies
    signal(SIGTERM, forward);
    signal(SIGINT, forward);
    for (;;) {
        int status;
        pid_t p = waitpid(-1, &status, 0);  // reaps any zombie, not just our child
        if (p == child)
            return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
    }
}
Use `tini` or `dumb-init` in production; this is for understanding.
B.3 A Hand-Built Container¶
(Sketch-see Month 3 lab.)
- clone(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWCGROUP, ...)
- Write UID/GID maps via /proc/<pid>/uid_map, gid_map.
- Mount proc, sysfs, tmpfs inside.
- pivot_root into rootfs.
- Configure veth pair on the host; move one end into the new netns.
- execve user command.
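Roughly the same sequence from the shell with util-linux `unshare` (which defaults to private mount propagation in the new mount namespace):

sudo unshare --user --map-root-user --pid --fork --mount --uts --ipc --net --cgroup \
    bash -c 'mount -t proc proc /proc && hostname demo && exec bash'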
B.4 A Kernel Module Skeleton¶
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
static int __init mymod_init(void) {
pr_info("mymod loaded\n");
return 0;
}
static void __exit mymod_exit(void) {
pr_info("mymod unloaded\n");
}
module_init(mymod_init);
module_exit(mymod_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("you");
MODULE_DESCRIPTION("skeleton");
Makefile:
obj-m += mymod.o
KDIR := /lib/modules/$(shell uname -r)/build
all:
$(MAKE) -C $(KDIR) M=$(PWD) modules
clean:
$(MAKE) -C $(KDIR) M=$(PWD) clean
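Build and smoke-test (the module logs via `pr_info`, so check `dmesg`):

make                    # builds mymod.ko against the running kernel's headers
sudo insmod mymod.ko    # "mymod loaded" appears in the kernel log
sudo rmmod mymod        # "mymod unloaded"
dmesg | tail -n 2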
B.5 An eBPF Skeleton (libbpf + CO-RE)¶
prog.bpf.c:
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 1 << 20);
} events SEC(".maps");
SEC("tracepoint/syscalls/sys_enter_execve")
int handle_execve(void *ctx) {
char *e = bpf_ringbuf_reserve(&events, 16, 0);
if (!e) return 0;
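/* fill the reserved bytes with real event data (pid, comm, ...) before submitting */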
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
User-side: use libbpf skeletons (`bpftool gen skeleton prog.bpf.o > prog.skel.h`). The full pattern is in `libbpf-bootstrap`-clone it and modify.
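A typical build sequence, mirroring libbpf-bootstrap (the `-D__TARGET_ARCH_*` define depends on your architecture):

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h   # CO-RE type info
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -c prog.bpf.c -o prog.bpf.o
bpftool gen skeleton prog.bpf.o > prog.skel.h
# compile your userspace loader against prog.skel.h; link with -lbpf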
B.6 A Seccomp-bpf Allowlist¶
#include <seccomp.h>   /* libseccomp; link with -lseccomp */

int main(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);  /* default action: kill the process */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    // ... only what you actually need
    seccomp_load(ctx);
    return 0;   /* exit_group is allowed, so a normal exit still works */
}
For systemd-managed services, prefer SystemCallFilter= directives; this raw API is for embedded code, sandboxes, and runtime libraries.
B.7 A udev Rule¶
/etc/udev/rules.d/99-mydev.rules:
SUBSYSTEM=="usb", ATTR{idVendor}=="abcd", ATTR{idProduct}=="1234", \
MODE="0660", GROUP="plugdev", SYMLINK+="mydev"
udevadm control --reload-rules && udevadm trigger.
B.8 A Repeatable VM Lab Setup¶
The final, hidden artifact: a Vagrantfile (or qemu script) that boots a fresh Ubuntu/Fedora VM with cloud-init, pre-installs bpftrace, perf, trace-cmd, auditd, cryptsetup, your toolbox above. Every lab in this curriculum should be reproducible from this base in under 5 minutes.
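One minimal shape for the qemu variant, assuming cloud-image-utils for the seed image (image URL and user-data contents are examples):

wget -N https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img
cloud-localds seed.img user-data.yaml   # user-data installs bpftrace, perf, trace-cmd, auditd, cryptsetup
qemu-system-x86_64 -enable-kvm -m 4G -smp 4 \
  -drive file=noble-server-cloudimg-amd64.img,if=virtio \
  -drive file=seed.img,if=virtio \
  -nic user,hostfwd=tcp::2222-:22 -nographic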
Appendix C-Contributing to the Linux Kernel: A Playbook¶
The Linux kernel is famously approachable, and famously demanding. This appendix is the on-ramp.
C.1 Mental Model¶
The Linux kernel is developed via mailing lists (LKML and per-subsystem lists), with patches sent inline via git send-email. There is no GitHub PR equivalent. Reviewers reply on-list; subsystem maintainers (MAINTAINERS file) merge into their trees; Linus pulls from those into mainline.
Implications:
1. You send patches as emails, not PRs.
2. The cultural norm is direct, sometimes blunt feedback. Do not take it personally.
3. The maintainer set is finite and busy. A two-week response cycle is normal; six weeks is not unusual; ignored = re-send a polite ping.
C.2 Setting Up¶
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
make defconfig # or copy your distro's config
make -j$(nproc)
For development, prefer working off the subsystem tree (e.g., net-next for networking work, tip for sched/locking, staging for drivers in incubation). The MAINTAINERS file lists per-subsystem trees.
Configure git:
git config sendemail.smtpserver smtp.example.com
git config sendemail.smtpuser you@example.com
git config user.signingkey "..."
C.3 Where the Easy Wins Are¶
C.3.1 Documentation¶
- `Documentation/` is large and uneven. Typo fixes and clarifications are welcomed by `linux-doc@vger.kernel.org`.
- Trivially good first patch.
C.3.2 Coding-style fixes¶
- `scripts/checkpatch.pl` flags style violations. Submit a series fixing them. (But: huge mass-style patches are sometimes resisted as churn-limit to one file or one subsystem.)
C.3.3 Staging drivers¶
- `drivers/staging/` is the incubator for drivers being prepped for mainline. Cleanup work here (sparse fixes, checkpatch, removing unused code) is well-supported.
C.3.4 New tracepoints / debugfs entries¶
- Adding a tracepoint to expose useful info to eBPF tooling is a tractable medium-difficulty contribution. Discuss design on-list first.
C.3.5 Bug fixes¶
- The `bugzilla.kernel.org` tracker has reproducible bugs. Pick one in a subsystem you understand.
C.3.6 Don't start here (yet)¶
- Scheduler core (`kernel/sched/`).
- Memory management core (`mm/`).
- Networking core (`net/core/`, `net/ipv4/tcp*`).
- VFS core (`fs/namei.c`, `fs/dcache.c`).
C.4 The First-Patch Workflow¶
- Pick a small change. A typo, a checkpatch warning, a tested driver fix.
- Make the change in a branch.
- Run `scripts/checkpatch.pl --strict <patch>`. Fix all warnings.
- Build the affected subsystem. For drivers: `make M=drivers/foo/`.
- Test on real or virtual hardware. A patch with no Tested-by is suspect.
- Commit with a kernel-style message:

  subsys: short imperative description (under 70 chars)

  Body explaining the problem and why this change fixes it. Past tense
  for the bug ("crashed when..."), imperative for the fix ("Avoid the
  crash by checking..."). Wrap at 72 columns.

  Fixes: <12-char-sha1> ("subject of bug-introducing commit")
  Cc: stable@vger.kernel.org # if backport-worthy
  Signed-off-by: Your Name <you@example.com>

  The `Signed-off-by` is a DCO (Developer Certificate of Origin) declaration; mandatory.
- Identify the recipients with `scripts/get_maintainer.pl <patch>`. This lists the subsystem maintainer, reviewers, mailing list. Send to all; CC LKML.
- Send: `git send-email --to=<maintainer> --cc=<list> --cc=linux-kernel@vger.kernel.org <patch>` (recipients from the previous step).
- Address review. Reviewers may ask for design changes, additional testing, or a breakup into multiple patches. Each new version is `[PATCH v2] subsys: ...`. Always include a changelog after the `---` line describing what changed since v1.
- Maintainer applies. Your name and email go into the commit log.
C.5 The MAINTAINERS Map¶
The MAINTAINERS file at the root of the tree lists, for every subsystem: maintainer(s), reviewers, mailing list, source files in scope, status (Maintained / Supported / Odd Fixes / Orphan).
`scripts/get_maintainer.pl` parses it. Always run it before sending a patch.
C.6 Reading Map¶
For depth, in this order:
| File | What it teaches |
|---|---|
| `Documentation/process/submitting-patches.rst` | Authoritative process. Read first. |
| `Documentation/process/coding-style.rst` | The kernel's C style; non-negotiable. |
| `Documentation/process/email-clients.rst` | Why your email client matters. |
| `Documentation/dev-tools/sparse.rst` | Kernel-specific static analysis. |
| `Documentation/admin-guide/cgroup-v2.rst` | The cgroup-v2 interface, normative. |
| `Documentation/scheduler/sched-design-CFS.rst` | CFS / EEVDF design. |
| `Documentation/vm/` | Memory management deep dives. |
| `Documentation/networking/` | Per-subsystem networking docs. |
| `Documentation/bpf/` | Modern eBPF design and ABIs. |
C.7 Adjacent Targets if Mainline Is Too Heavy¶
- `bpftrace`-high contribution velocity, friendly maintainers.
- `bcc`-older but still active.
- `perf` userspace tools-`tools/perf/` in the kernel tree, but its own dynamic.
- `util-linux`-`mount`, `lsblk`, `nsenter`, etc. Active, contributor-friendly.
- `systemd`-large but well-organized; on GitHub, with PRs.
- `iproute2`-`ip`, `tc`, `ss`. Smaller surface, important.
A merged contribution to any of these signals real Linux fluency.
C.8 Calibration¶
A reasonable goal for a curriculum graduate:
- By end of week 23: a patch sent to LKML or a subsystem list (could be a doc fix or a checkpatch cleanup).
- By end of capstone: that patch merged.
- 6 months post-curriculum: a substantive contribution-a driver fix, a small tracepoint addition, or a bpftrace tool merged.
Patient, persistent contributors become trusted contributors. Trusted contributors become maintainers.
Capstone Projects-Three Tracks, One Choice¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Kernel Module: An Out-of-Tree LKM¶
Outcome: a non-trivial out-of-tree Linux kernel module, KUnit-tested, sparse-clean, KASAN-clean, with a clear README and a path toward upstream submission (even if you don't take it all the way).
Suggested scopes¶
- A character-device key/value store (week 21 lab, hardened). Adds: `ioctl` for batch ops, an `mmap` interface for zero-copy reads, an RCU-protected reader path.
- A netfilter hook. A small accelerator that, e.g., counts packets matching a configurable BPF filter at the netfilter ingress hook, with stats exposed via a `procfs` entry.
- A custom tracepoint suite. Add tracepoints to a subsystem of your choice (e.g., your `pkv` module from the lab) and write a `bpftrace` consumer.
Acceptance¶
- Loads cleanly on at least two LTS kernels (e.g., 6.6 and 6.12).
- KUnit tests in tree; pass on both kernels.
- KASAN, lockdep, KCSAN warnings: zero across stress-test load.
- Signed for secure boot.
- A `README.md` with build, install, and use; a `DESIGN.md` with locking and memory ownership documented.
Skills exercised¶
- Months 1 (kernel boundary), 2 (memory + scheduling internals), 6 (LKM development).
Track 2-eBPF Observability Tool¶
Outcome: a production-grade tracing tool comparable in quality to one of Brendan Gregg's BCC tools, with a proper userspace consumer, a Prometheus exporter, and CO-RE portability.
Suggested scopes¶
- `syscallat` - system-call latency histograms, per-syscall, per-process, with low overhead. Equivalent of `bpftrace`'s `syscount` but production-quality.
- `tcptop` - top-N connections by bytes/sec, sortable by direction. Cilium's Hubble has equivalents; do this from scratch.
- A profiler-like tool that, given a PID, samples on-CPU stacks at 99 Hz, aggregates with a frequency table, and exposes flamegraph data.
Acceptance¶
- Implemented with `libbpf` + CO-RE.
- Userspace consumer in C or Go (using `cilium/ebpf`).
- Runs on kernels 5.10+ without recompilation.
- Verifier-clean across architectures (x86_64 + aarch64 minimum).
- Prometheus exporter with low-cardinality labels.
- A `bpftrace` equivalent for comparison; document why the production version exists.
- CPU overhead under representative load: < 1%.
Skills exercised¶
- Months 3 (eBPF), 4 (networking, if you pick `tcptop`), 6 (perf tuning).
Track 3-Self-Healing Distributed Service¶
Outcome: a small distributed service (a multi-instance HTTP API, a job runner, a metrics collector) deployed on Linux hosts with a comprehensive self-healing posture.
Suggested scopes¶
- A 3-node deployment of a small HTTP service:
- Each node is a hardened Ubuntu/Debian/Rocky host provisioned by Ansible.
- The service is systemd-managed with watchdog, full hardening directives, cgroups-v2 resource limits.
- On any node failure, the survivors continue serving (use a TCP load balancer + healthcheck, e.g., HAProxy or IPVS).
- Memory pressure (PSI > X%) triggers a soft restart of the worst offender via a cgroup-event watcher.
- Disk pressure triggers log rotation and old-data cleanup.
- A `chaos.sh` script kills random nodes; the cluster recovers without human intervention.
Acceptance¶
- Reproducible from Ansible: `ansible-playbook site.yml` brings up 3 hosts from blank Ubuntu cloud images.
- Full observability: `node_exporter`, journald, eBPF tools, Prometheus + Grafana.
- A documented threat model and CIS-aligned baseline (lynis score).
- A 60-minute "chaos demo": run `chaos.sh`; observe full self-healing; produce a one-page incident report from logs.
- Encryption at rest (LUKS on data volumes); TLS between nodes (`step-ca` or self-signed); auditd shipping logs off-host.
Skills exercised¶
- All months. This is the integrative track-the right choice if you want operations-engineer breadth.
Cross-Track Requirements¶
- `host-baseline/` template integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`, `RUNBOOK.md`, `RECOVERY.md`.
- Defense readiness: a 45-minute walkthrough with a peer.
Worked example - Week 10: cgroups v2 on your laptop, end to end¶
Companion to Linux Kernel → Month 03 → Week 10: Cgroups v2. The week explains the unified hierarchy and the controllers. This page is a hands-on tour: create a cgroup, put a process in it, limit it, watch the limit bite.
You need Linux 5.x+ with cgroups v2 enabled (default since ~2020 on most distros). Verify:
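$ stat -fc %T /sys/fs/cgroup/
cgroup2fs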
If you see tmpfs instead, you're on cgroups v1 hybrid mode - add `systemd.unified_cgroup_hierarchy=1` to the kernel command line and reboot, or use a recent Fedora/Arch/Ubuntu 22.04+ where v2 is the default.
The cgroup filesystem¶
cgroups v2 is just a filesystem. Every directory under /sys/fs/cgroup/ is a cgroup; child directories are nested cgroups; the files inside control what the cgroup does.
$ ls /sys/fs/cgroup/
cgroup.controllers cgroup.procs cpu.stat memory.current ...
cgroup.max.depth cgroup.subtree_control io.stat memory.max ...
cgroup.controllers- what controllers are available to descendants.cgroup.subtree_control- what controllers are enabled for descendants.cgroup.procs- PIDs currently in this cgroup (root has all of them).memory.max,cpu.max,io.max- controller-specific knobs.
Create a cgroup¶
$ sudo mkdir /sys/fs/cgroup/demo
$ ls /sys/fs/cgroup/demo
cgroup.controllers cgroup.events cgroup.procs cgroup.stat cgroup.type ...
The kernel populated the new directory with the standard files. Some are read-only (cgroup.controllers), some are writable knobs.
But notice: there are no memory.max or cpu.max files yet. Controllers must be enabled by the parent before they appear in a child:
$ cat /sys/fs/cgroup/cgroup.subtree_control
(empty)
$ echo "+memory +cpu +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ ls /sys/fs/cgroup/demo | grep -E '^(memory|cpu|io)'
cpu.idle
cpu.max
cpu.stat
io.max
io.stat
memory.current
memory.events
memory.max
memory.peak
memory.stat
...
Now the controllers are available in /sys/fs/cgroup/demo/.
Limit memory¶
Cap the cgroup at 100 MB:
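$ echo 104857600 | sudo tee /sys/fs/cgroup/demo/memory.max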
Or use the M / G suffix syntax (kernel parses it):
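$ echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max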
That's the hard limit. The kernel will OOM-kill anything in this cgroup that pushes past it.
Put a process in the cgroup¶
To move a process in, write its PID to cgroup.procs:
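$ sudo bash -c 'echo $$ > /sys/fs/cgroup/demo/cgroup.procs && exec bash'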
The trick: $$ is the PID of the subshell before exec replaces it; writing to cgroup.procs moves the PID; exec bash then replaces the subshell so the new bash inherits the cgroup. From this new shell, every child process is also in demo.
Verify:
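$ cat /proc/self/cgroup
0::/demo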
Yes - this process is in /demo.
Watch the limit bite¶
In the demo shell, allocate memory aggressively:
$ python3 -c '
chunks = []
for i in range(1000):
chunks.append(b"x" * (1024 * 1024))
if i % 10 == 0:
print(i, "MB")
'
0 MB
10 MB
20 MB
...
90 MB
Killed
At ~100 MB, the OOM killer fires. Note that only the process in this cgroup was killed; the rest of your system is fine.
Check what happened:
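$ grep oom_kill /sys/fs/cgroup/demo/memory.events
oom_kill 1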
`oom_kill 1` - one process was OOM-killed inside this cgroup. Next time a container dies mysteriously, this counter is where to look.
Limit CPU¶
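$ echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max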
The format is "quota period." 50000 100000 means "50ms of CPU per 100ms wall-clock period." That's 50% of one core, regardless of how many cores you have.
Test:
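$ sudo bash -c 'echo $$ > /sys/fs/cgroup/demo/cgroup.procs && exec bash'
$ yes > /dev/null &
$ top -bn1 -p $! | tail -1    # the busy loop plateaus at ~50% of one core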
You can also write `max 100000` to `cpu.max` to remove the cap. Or set `cpu.weight` (proportional sharing) for a softer policy.
Limit IO¶
$ ls -la /dev/nvme0n1 # find the device number
brw-rw---- 1 root disk 259, 0 May 17 09:00 /dev/nvme0n1
$ echo "259:0 wbps=10485760" | sudo tee /sys/fs/cgroup/demo/io.max
That throttles writes to the named device to 10 MB/s. Run `dd if=/dev/zero of=/tmp/test bs=1M count=200 oflag=direct`; watch the throughput cap. (Without `oflag=direct`, the page cache absorbs the writes and hides the throttle; and `/tmp` must actually live on the throttled device - on distros where `/tmp` is tmpfs, write to a path on the NVMe filesystem instead.)
Nested cgroups¶
Make /demo/web and /demo/worker:
$ sudo mkdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker
$ echo "+memory +cpu" | sudo tee /sys/fs/cgroup/demo/cgroup.subtree_control
$ echo "50M" | sudo tee /sys/fs/cgroup/demo/web/memory.max
$ echo "30M" | sudo tee /sys/fs/cgroup/demo/worker/memory.max
The limits compose: web is capped at 50 MB, worker at 30 MB, both inside demo's 100 MB ceiling. If web would exceed 50 MB, its processes are OOM-killed; if the two together would push demo past 100 MB, the OOM killer fires at the demo level and picks the biggest consumer inside it.
This is exactly the model container runtimes use. A pod is a cgroup, each container inside the pod is a sub-cgroup, the pod's memory.max is the pod's resource limit, the container's is each container's limit.
Tear it down¶
To delete a cgroup, first move all its processes out:
$ sudo bash -c 'echo $$ > /sys/fs/cgroup/cgroup.procs'
$ sudo rmdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker /sys/fs/cgroup/demo
rmdir only succeeds if the cgroup is empty (no procs, no descendants).
The trap¶
Setting `memory.max` below the cgroup's current usage is not a benign "from now on" change: on v2 the kernel immediately tries to reclaim, and if it cannot get under the new limit it OOM-kills processes in the cgroup (on v1, the equivalent write would typically just fail). Always reserve headroom; never tighten limits live unless you understand what's currently in there.
The other trap: cpu.weight is relative and useless on an idle system. If your cgroup is the only one running, it gets 100% of the CPU regardless of its weight. Only meaningful under contention.
Exercise¶
- Recreate the demo above. Confirm the OOM kill triggers at exactly the limit.
- Add `memory.high` set to 80M (below `memory.max` = 100M). Re-run the Python allocator. Observe: under `memory.high`, the kernel throttles allocation but doesn't kill. What does `memory.events` show?
- Look at how Docker uses this. `docker run --memory=100m -d busybox`. Then `cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.max`. The match should be exact.
- (Advanced) Read `cpu.pressure` - the PSI (pressure-stall information) interface. It reports how much your cgroup is waiting on CPU. More useful for capacity planning than instantaneous CPU%.
Related reading¶
- The main Week 10 chapter covers the controller architecture and v1 vs v2 differences.
- Container Internals → capabilities walkthrough is the sibling - container runtimes layer cgroups on top of namespaces on top of capabilities.
- Glossary: Cgroup, Namespace, OOM killer, PSI in the main glossary.