Linux Kernel¶
Kernel foundations, mm, namespaces, cgroups, eBPF, networking.
Linux Systems & Kernel Engineering: A 24-Week Mastery Roadmap¶
Authoring lens: Principal Systems Engineer / Linux Kernel Specialist. Target outcome: A graduate of this curriculum should be capable of (a) reading kernel source and contributing patches to a subsystem, (b) operating a fleet of Linux hosts with a coherent observability and security posture, and (c) writing custom kernel modules, eBPF programs, and systemd integrations to solve real production problems.
This is not "Linux command line in 24 weeks." It assumes the reader is already comfortable on the shell, has shipped userspace code, and is ready to read C source from linux/, man-pages, and the kernel documentation tree as primary literature.
Repository Layout¶
| File | Purpose |
|---|---|
| `00_PRELUDE_AND_PHILOSOPHY.md` | The Linux design ethos; the kernel/userspace contract; reading list. |
| `01_MONTH_KERNEL_FOUNDATIONS.md` | Weeks 1–4. Boot, syscalls, VFS, processes & threads. |
| `02_MONTH_MEMORY_AND_SCHEDULING.md` | Weeks 5–8. Paging, swapping, HugePages, CFS/EEVDF scheduler. |
| `03_MONTH_NAMESPACES_CGROUPS_EBPF.md` | Weeks 9–12. Namespaces, cgroups v2, eBPF, observability. |
| `04_MONTH_NETWORKING.md` | Weeks 13–16. Netfilter, IPVS, XDP, bridges, OVS. |
| `05_MONTH_SECURITY_AND_HARDENING.md` | Weeks 17–20. SELinux/AppArmor, LUKS, sysctl, audit, secure boot. |
| `06_MONTH_KERNEL_MODULES_CAPSTONE.md` | Weeks 21–24. LKM development, perf tuning, capstone defense. |
| `APPENDIX_A_HARDENING_AND_TUNING.md` | sysctl, perf, SystemTap, BCC/bpftrace recipes. |
| `APPENDIX_B_TOOLBOX.md` | Build-from-scratch reference: a tiny init, a custom systemd unit, a kernel module skeleton, an eBPF skeleton. |
| `APPENDIX_C_CONTRIBUTING_TO_THE_KERNEL.md` | LKML; git send-email; first-patch playbook; subsystem map. |
| `CAPSTONE_PROJECTS.md` | Three terminal projects: self-healing systemd unit, custom LKM, eBPF observability tool. |
How Each Week Is Structured¶
Every weekly module follows the same five-section format:
- Conceptual Core - the why, with a mental model.
- Mechanical Detail - the how, down to kernel source and `man-pages` references.
- Lab - a hands-on exercise that cannot be completed without internalizing the concept.
- Hardening Drill - `sysctl`, AppArmor/SELinux, audit rules, or `systemd` security directives that follow from the topic.
- Performance Tuning Slice - a `perf`/`bpftrace`/`ftrace` micro-task that compounds across weeks.
Each week is sized for ~12–16 focused hours.
Progression Strategy¶
```
Kernel Foundations ──► Memory & Scheduling ──► Namespaces / cgroups / eBPF
        │                        │                           │
        └────────────────────────┴─────────────┬─────────────┘
                                               ▼
                                          Networking
                                               │
                                               ▼
                                      Security & Hardening
                                               │
                                               ▼
                                    LKM Development & Capstone
```
Non-Goals¶
- This is not an LPIC/RHCSA exam-prep guide. The exam objectives focus on operational fluency; this curriculum focuses on internals.
- Not a guide to a specific distribution. Examples skew toward modern systemd-based distros (Debian/Ubuntu, Fedora/RHEL, Arch), with kernel paths from upstream.
- Not "How to use Docker." That belongs in the Container Internals curriculum.
Capstone Tracks (pick one in Month 6)¶
- Kernel Module Track - a non-trivial out-of-tree LKM (a character device, a netfilter hook, or a tracepoint consumer) with KUnit tests.
- eBPF Observability Track - a production-grade tracing tool comparable to `bpftrace`'s `runqlat` or `tcpconnect`, packaged with a userspace consumer.
- Self-Healing Service Track - a systemd-managed application with health checks, automatic restart, watchdog, and integration with the cgroup memory-pressure interface.
Details in CAPSTONE_PROJECTS.md.
Prelude: The Philosophy Behind the Syllabus¶
Sit with this document for an evening before week 1.
1. Linux Is a Kernel + a Contract¶
Linus Torvalds writes the kernel. The GNU project, glibc, BusyBox, systemd, and countless others build a userspace around it. What people call "Linux" is the interface between these layers, and that interface is what a Linux engineer must master.
The contract has three surfaces:
- The system call ABI - `man 2 syscalls`. Stable across kernel versions for 25+ years. The kernel's most binding promise.
- The procfs / sysfs / netlink interfaces - semi-stable, documented in `Documentation/ABI/` in the kernel tree. The control surface for nearly every tunable.
- The character/block device file interface - `/dev`. Hardware abstracted as files; the most Unix idea in Linux.
If you internalize "everything is a file or a syscall," the kernel/userspace boundary becomes legible.
2. The Five-Axis Cost Model¶
A working Linux engineer reasons about every system change along five axes:
| Axis | Question to ask |
|---|---|
| Boundary cost | Does this cross a syscall? A context switch? A copy from userspace? |
| Memory | Where is this allocated-slab, page, anon, file-backed, hugepage? |
| Scheduler | What runqueue does this run on? Will it preempt? Is it RT-class? |
| Isolation | Which namespace, cgroup, capability, MAC label? |
| Failure | What does the OOM killer do? What does the audit log show? |
Beginner courses teach axis 1 only. This curriculum forces all five into your hands by week 12.
3. The Reading List¶
Primary

- *Linux Kernel Development*, Robert Love (3e). The canonical introductory text.
- *Understanding the Linux Kernel*, Bovet & Cesati. Older but unmatched depth on memory management and process control.
- *The Linux Programming Interface*, Michael Kerrisk. The stdlib/syscall reference. Treat as a pinned tab.
- *Systems Performance*, Brendan Gregg (2e). The performance bible.
- *BPF Performance Tools*, Brendan Gregg. Required for Month 3.

Documentation

- `Documentation/` in the kernel tree. Particularly `Documentation/admin-guide/`, `Documentation/networking/`, `Documentation/admin-guide/cgroup-v2.rst`, `Documentation/scheduler/`, `Documentation/mm/`.
- *man-pages* - Kerrisk's project. Read sections 2 (syscalls), 3 (libc), 5 (file formats), 7 (overviews).

Adjacent

- *TCP/IP Illustrated, Vol. 1*, Stevens. For the networking month.
- *Operating Systems: Three Easy Pieces*, Arpaci-Dusseau. Free, excellent, foundational.
4. Curriculum Philosophy: "Read the Source, Trace the Syscall"¶
Three rules:
- Source first, blog second. When the curriculum says "study the page-fault handler," it means open `mm/memory.c::handle_mm_fault` and read it. Blogs go stale; commits are dated.
- `strace` and `perf` are the teachers. When you do not understand why a program behaves as it does, the first response is `strace -fc`, the second is `perf trace`, and only the third is to ask another human.
- One lab per concept, one upstream interaction per phase. By the end of each month: an lkml.org reply, a documentation typo fix, a `bpftrace` recipe shared, or a kernel-module experiment posted publicly.
5. What Linux Is Not For¶
A graduate of this curriculum should be able to argue these points:
- Hard real-time. Stock Linux is preemptible but not hard-RT. PREEMPT_RT is in tree but still not microsecond-deterministic. Use Xenomai, RTEMS, or QNX.
- Storage with strict tail-latency requirements. The Linux block layer adds tens of microseconds of variance. SPDK moves the storage data path into userspace and bypasses the kernel entirely (as DPDK does for networking).
- Code where the team has no C, no syscall, no signal-handling intuition. A team that has not debugged a `signalfd`, a `vfork()` race, or a misconfigured `ulimit` will struggle.
6. A Note on AI-Assisted Workflows¶
- Never run AI-suggested `dd`, `mkfs`, `cryptsetup`, `iptables -F`, or `systemctl mask` without reading the man page first. The blast radius of a bad command at this layer is the entire system.
- Verify privileged commands on a VM before any production host. `qemu-system-x86_64` with a Debian cloud image and 30 seconds of `cloud-init` is a reasonable scratch environment.
You are now ready for Week 1. Open 01_MONTH_KERNEL_FOUNDATIONS.md.
Month 1 - Kernel Foundations: Boot, Syscalls, VFS, Processes¶
Goal: by the end of week 4 you can (a) trace a process from fork() through execve() to _exit(), (b) describe the VFS layer and the path of open("/etc/passwd", O_RDONLY) from libc to filesystem driver, (c) read /proc/<pid>/ and /sys/ to debug a misbehaving process, and (d) write a basic systemd unit with security hardening.
Weeks¶
- Week 1 - Boot, Init, Systemd
- Week 2 - Syscalls and the Kernel/Userspace Boundary
- Week 3 - The Virtual File System (VFS)
- Week 4 - Processes, Threads, and Signals
Week 1 - Boot, Init, Systemd¶
1.1 Conceptual Core¶
- A modern Linux boot is a chain of progressively more-Linux-like stages: firmware (UEFI / BIOS) → bootloader (GRUB / systemd-boot) → kernel + initramfs → `/sbin/init` (systemd, mostly).
- systemd is the dominant init + service manager. It is not SysV init with `Type=simple` units bolted on; it is a unit-graph dependency engine that supervises sockets, timers, mounts, slices, and services as first-class objects.
- The unit hierarchy: `target` (a runlevel equivalent) ← `service`/`socket`/`timer`/`mount`/`device`/`slice`/`path` ← drop-ins (`/etc/systemd/system/foo.service.d/*.conf`).
1.2 Mechanical Detail¶
- Boot trace: `dmesg | head -200` plus `journalctl -b 0 --no-pager` shows the kernel and userspace boot logs from the current boot.
- `systemd-analyze blame` and `systemd-analyze critical-chain` decompose boot time.
- A unit file's anatomy: `[Unit]` (deps, ordering), `[Service]` (exec, restart, security), `[Install]` (alias, enable target).
- Hardening directives: `NoNewPrivileges=yes`, `ProtectSystem=strict`, `ProtectHome=yes`, `PrivateTmp=yes`, `RestrictAddressFamilies=AF_INET AF_INET6`, `CapabilityBoundingSet=`, `SystemCallFilter=@system-service`, `MemoryMax=`, `CPUQuota=`. Every long-running service should set these (a sketch follows below).
- `systemctl edit <unit>` for drop-ins; never edit `/lib/systemd/system/*` (overwritten by package updates).
1.3 Lab - "A Hardened Echo Service"¶
- Write a tiny C program that listens on a Unix socket and echoes input. Static-link with `-static`.
- Write an `echo.socket` and `echo.service` pair using socket activation.
- Apply every hardening directive that is plausible for an echo server. Run `systemd-analyze security echo.service` and aim for a score under 1.0.
- Verify isolation: from inside the service (debug via `systemd-run --shell --unit=echo.service`), confirm `ProtectSystem` makes `/usr` read-only.
1.4 Hardening Drill¶
- Read `man systemd.exec` cover to cover. Make a one-page cheat sheet of hardening directives.
1.5 Performance Tuning Slice¶
- Capture `systemd-analyze plot > boot.svg` from a fresh VM. Identify the longest-blocking unit and propose a `Before=`/`After=` adjustment.
Week 2 - Syscalls and the Kernel/Userspace Boundary¶
2.1 Conceptual Core¶
- A system call is a transfer of control from userspace to the kernel via a defined ABI: trigger an interrupt or a `syscall` instruction; the kernel reads register-passed arguments and dispatches via a table indexed by syscall number.
- On x86_64 Linux: arguments in `rdi, rsi, rdx, r10, r8, r9`; syscall number in `rax`; return value in `rax`. Errors are returned as negative values in `rax` (`-errno`).
- libc wraps each syscall in a function (`open(2)` is a thin wrapper; some wrappers, like `fork(3)`, glue to `clone(2)`). A raw-syscall sketch follows below.
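A minimal look at the boundary from C, assuming x86_64 Linux and glibc's `syscall(2)` wrapper (which places the arguments in the registers listed above and converts a negative kernel return into `errno`):

```c
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
    /* Raw write(2): syscall number in rax, args in rdi/rsi/rdx. */
    long n = syscall(SYS_write, 1, "hello\n", 6);
    printf("wrote %ld bytes\n", n);

    /* Deliberate error: fd -1 exercises the -errno return path. */
    n = syscall(SYS_write, -1, "x", 1);
    printf("ret=%ld errno=%d (EBADF=%d)\n", n, errno, EBADF);
    return 0;
}
```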
2.2 Mechanical Detail¶
- Read `arch/x86/entry/syscalls/syscall_64.tbl` - the syscall table.
- The path: userspace `syscall` instruction → `entry_SYSCALL_64` (`arch/x86/entry/entry_64.S`) → `do_syscall_64` → `sys_<name>` in C.
- `strace -f -e trace=%file ./prog` traces file-related syscalls only.
- `ltrace` for library-level tracing (less useful, since most interesting actions hit the kernel anyway).
- `perf trace` is the modern equivalent of `strace`, with much lower overhead.
- audit (`auditd`) for production-grade syscall logging - gated by rules, written via netlink.
2.3 Lab - "Syscall Forensics"¶
- `strace -c ls /etc` - produce a count summary of syscalls. Predict the top 5; verify.
- Implement `cat` in pure C using only `open`, `read`, `write`, `close` - no libc helpers (`syscall(SYS_open, ...)`).
- Run under `strace -f` to verify zero unexpected calls.
- Build a minimal seccomp allowlist for your `cat`, allowing only the syscalls actually used. Verify it kills attempts to invoke other syscalls.
2.4 Hardening Drill¶
- Configure `auditctl -a always,exit -F arch=b64 -S execve -k exec` to log every `execve`. Read the resulting `aureport -x` output. Document the operational cost (log volume).
2.5 Performance Tuning Slice¶
- Run a workload under `perf stat -e 'syscalls:sys_enter_*'`. Identify the highest-frequency syscall. Hypothesize a reduction (batching, larger buffers, `splice`).
Week 3 - The Virtual File System (VFS)¶
3.1 Conceptual Core¶
- The VFS is the kernel's abstraction over filesystem implementations. Userspace sees one consistent API (`open`, `read`, `stat`, `mmap`); the kernel dispatches to ext4, btrfs, xfs, tmpfs, procfs, sysfs, or FUSE via per-FS operation tables.
- Four core VFS objects:
  - inode - a file's metadata (owner, perms, size, pointers to data blocks).
  - dentry - a directory entry; the cached mapping from a name to an inode.
  - file - an open file description (one per `open()` call); holds offset, flags, refcount.
  - superblock - a mounted filesystem instance.
- The dentry cache (`dcache`) and inode cache (`icache`) are why repeated `stat()`s are fast.
3.2 Mechanical Detail¶
- Read `fs/open.c::do_sys_openat2` - the entry point of `openat(2)`.
- `path_openat` resolves the path through the dcache, allocating new dentries on miss.
- Each FS implements a `struct file_operations` and `struct inode_operations`. ext4's are in `fs/ext4/file.c` and `fs/ext4/inode.c`.
- Mount namespaces (preview) - each mount namespace has its own mount tree. Containers exploit this.
- Pseudo-filesystems:
  - procfs (`/proc`) - kernel introspection: `/proc/<pid>/`, `/proc/cpuinfo`, `/proc/meminfo`, `/proc/sys/`.
  - sysfs (`/sys`) - device/driver introspection, with most kernel tunables under `/sys/kernel/`, `/sys/class/`, `/sys/block/`, `/sys/fs/cgroup/`.
  - `cgroupfs`, `devtmpfs`, `tmpfs`, `bpf`, `tracefs`, `debugfs`, `securityfs`.
3.3 Lab - "VFS Forensics"¶
- Catalogue every entry in `/proc/<pid>/` for one of your processes. Document what each gives you.
- Read `/proc/<pid>/maps` and explain every region (text, heap, stack, vdso, vvar, shared libs).
- Use eBPF's `vfs_open` kprobe (via `bpftrace`; a one-liner is sketched below) to log every open system-wide for 5 seconds. Triage the noise.
- Mount `tmpfs` at a custom path, fill it, and observe the allocator behavior in `/proc/meminfo` (`Shmem`).
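One plausible form of that one-liner, assuming a kernel with BTF so `bpftrace` can resolve `struct path` (`vfs_open`'s first argument):

```
bpftrace -e 'kprobe:vfs_open {
    printf("%s %s\n", comm,
           str(((struct path *)arg0)->dentry->d_name.name));
}'
```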
3.4 Hardening Drill¶
- Lock down `/proc` with `hidepid=2` (mount option). Verify a non-root user can no longer see other users' processes.
3.5 Performance Tuning Slice¶
- Use `perf top` to find the hottest VFS function under your workload. If it's `__d_lookup`, your dcache is being thrashed; if it's `__find_get_block`, your buffer cache is. Document the inference.
Week 4 - Processes, Threads, and Signals¶
4.1 Conceptual Core¶
- A process is the unit of resource ownership: address space, file descriptors, credentials, signal handlers.
- A thread in Linux is a process that shares its address space (and most other resources) with its siblings - the kernel calls them all "tasks." `clone(2)` with various flags is the underlying syscall; `fork()` is a special case (`clone(SIGCHLD)`); `pthread_create()` is another (`clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | ...)`).
- Signals are the kernel's asynchronous notification mechanism: `kill(2)`, `sigaction(2)`, signal masks, `signalfd(2)`. The most error-prone part of Unix.
4.2 Mechanical Detail¶
- `task_struct` - the kernel's per-task struct, ~3 KB. Read `include/linux/sched.h`.
- The PID is the per-namespace task identifier; the TGID (thread-group ID) is what userspace `getpid()` returns. Threads share a TGID and differ in PID.
- Process tree: `/proc/<pid>/task/<tid>/` for each thread; `/proc/<pid>/status` for credentials, capabilities, OOM score.
- Signal mechanics:
  - Synchronous signals (`SIGSEGV`, `SIGBUS`, `SIGFPE`) - delivered to the offending thread.
  - Asynchronous signals (`SIGTERM`, `SIGINT`, `SIGUSR1/2`) - delivered to any thread that doesn't block them, usually the main thread.
  - Real-time signals (`SIGRTMIN`–`SIGRTMAX`) - queued (regular signals can be coalesced).
- The fork-then-exec pattern: `fork()` is heavy (copy page tables, COW), `vfork()` is lighter but pauses the parent, and `posix_spawn()` and `clone3(CLONE_VFORK | CLONE_VM)` are the modern, cheaper alternatives.
4.3 Lab - "Process Forensics"¶
- Write a C program that forks 4 children, each computing for 5 s. Use `ptrace` or `strace -f` to observe all four.
- Add a signal handler that catches `SIGTERM`, forwards it to all children, and exits cleanly.
- Reproduce a classic bug: a parent that ignores `SIGCHLD` and a child that exits, producing zombies. Verify with `ps -ef | grep defunct`.
- Convert to `signalfd` + `epoll` - the modern signal-handling pattern that integrates with event loops. A sketch of the pattern follows below.
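A minimal sketch of that pattern (socket handling elided; error checks trimmed):

```c
/* signalfd + epoll for graceful shutdown (Linux, glibc). */
#include <sys/signalfd.h>
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGINT);

    /* Block the signals so they arrive via the fd, not a handler. */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        exit(1);

    int sfd  = signalfd(-1, &mask, SFD_CLOEXEC);
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);

    for (;;) {
        struct epoll_event out;
        if (epoll_wait(epfd, &out, 1, -1) < 1)
            continue;                       /* EINTR etc. */
        if (out.data.fd == sfd) {
            struct signalfd_siginfo si;
            read(sfd, &si, sizeof(si));
            printf("signal %u: shutting down cleanly\n", si.ssi_signo);
            break;                          /* drain connections, exit */
        }
        /* ... listening and client sockets handled here ... */
    }
    return 0;
}
```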
4.4 Hardening Drill¶
- Set `RLIMIT_NPROC` and `RLIMIT_STACK` for your service via `LimitNPROC=` and `LimitSTACK=` in the systemd unit. Verify with `prlimit -p <pid>`.
4.5 Performance Tuning Slice¶
- `perf sched record -- sleep 10` during your workload. Analyze with `perf sched latency`. Identify the top wakeup latencies.
Month 1 Capstone Deliverable¶
A `kernel-foundations/` directory:
1. `hardened-echo/` - week 1's echo service with maximal systemd hardening, score documented.
2. `syscall-cat/` - week 2's libc-free `cat` plus a seccomp policy.
3. `vfs-explorer/` - week 3's `bpftrace` recipes capturing VFS activity.
4. `signal-disciplined-server/` - a TCP echo server using `signalfd` + `epoll` for graceful shutdown.
A `RUNBOOK.md` documenting the boot trace, the syscall-trace methodology, and the signal-handling decisions.
Month 2 - Memory Management and Scheduling¶
Goal: by the end of week 8 you can (a) read /proc/meminfo and explain every line, (b) trace a page fault from userspace through do_page_fault to a returned PTE, (c) explain CFS (and EEVDF since 6.6) and read /proc/<pid>/sched, and (d) tune a memory-pressure-sensitive service using MemoryHigh/MemoryMax/PSI.
Weeks¶
- Week 5 - Virtual Memory, Paging, and the Page Cache
- Week 6 - Swapping, OOM, Memory Pressure (PSI)
- Week 7 - The CPU Scheduler (CFS, EEVDF)
- Week 8 - Disk I/O Scheduling, Filesystems Beyond ext4
Week 5 - Virtual Memory, Paging, and the Page Cache¶
5.1 Conceptual Core¶
- Each process has a private virtual address space (`mm_struct`). The MMU translates virtual to physical addresses via page tables (4-level on x86_64; 5-level on newer CPUs).
- Pages are 4 KiB by default. HugePages (2 MiB or 1 GiB) reduce TLB pressure for memory-intensive workloads.
- Memory is divided into anonymous (heap, stack) and file-backed (mmap'd files, page cache for read/write).
- The page cache is Linux's most aggressive optimization: nearly every read of a regular file is cached; writes are buffered until writeback (or `fsync()`).
5.2 Mechanical Detail¶
- `/proc/meminfo` line decoding:
  - `MemTotal` / `MemFree` / `MemAvailable` - the latter is what to monitor.
  - `Buffers` - block-device caches.
  - `Cached` - page cache.
  - `Active`/`Inactive` (anon, file) - LRU lists.
  - `Dirty` / `Writeback` - outstanding writeback work.
  - `Slab` (`SReclaimable` / `SUnreclaim`) - kernel object allocators.
  - `AnonHugePages` / `HugePages_*` - transparent and explicit hugepages.
- `vm.dirty_ratio` / `vm.dirty_background_ratio` - when the kernel starts (and forces) writeback.
- `vm.swappiness` - bias between swapping anon pages vs. evicting file pages. Default 60; for DB servers often lowered to 10 or even 1.
- `mm/memory.c::handle_mm_fault` - the page-fault entry point. Three classes: minor (already in the page cache, just map it), major (must read from disk), and COW (write to a shared mapping).
5.3 Lab - "Memory Forensics"¶
- Run `vmstat 1` and `free -h` while loading a 4-GB file with `cat file > /dev/null`. Watch `Cached` grow.
- `echo 3 > /proc/sys/vm/drop_caches` and observe the eviction.
- `mmap` a large file `MAP_PRIVATE`, write to it, and observe `AnonHugePages` and the COW behavior in `/proc/<pid>/smaps`.
- Configure `vm.nr_hugepages=512` (1 GiB of 2 MiB pages). Allocate via `MAP_HUGETLB` (see the sketch below). Measure the latency-distribution change vs. default pages.
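A sketch of the allocation step, assuming the default hugepage size is 2 MiB and the pool has been sized as above:

```c
/* Explicit hugepage allocation via mmap(MAP_HUGETLB). */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define LEN (512UL * 2 * 1024 * 1024)   /* 512 pages x 2 MiB = 1 GiB */

int main(void)
{
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");    /* ENOMEM: pool too small */
        return 1;
    }
    memset(p, 0xAA, LEN);               /* touch every page */
    /* Check HugePages_Free in /proc/meminfo while this waits. */
    getchar();
    munmap(p, LEN);
    return 0;
}
```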
5.4 Hardening Drill¶
- Set `vm.unprivileged_userfaultfd=0` (a frequently exploited surface) and `vm.mmap_min_addr=65536` (defense against null-pointer kernel exploits). Document the reasoning.
5.5 Performance Tuning Slice¶
- Use `perf stat -e dTLB-load-misses,dTLB-loads ./prog`. If the TLB miss ratio is >1%, evaluate hugepages or `madvise(MADV_HUGEPAGE)`.
Week 6 - Swapping, OOM, Memory Pressure (PSI)¶
6.1 Conceptual Core¶
- Swap is the kernel's overflow valve when anonymous-memory pressure exceeds available RAM. Modern systems often run swapless or with a small compressed swap (e.g., `zswap` or `zram`).
- The OOM killer is the kernel's last-resort mechanism: when all reclaim has failed and an allocation cannot be satisfied, it kills the process with the highest `oom_score` (a heuristic of memory usage, adjusted by `oom_score_adj`).
- PSI (Pressure Stall Information) - `/proc/pressure/{cpu,memory,io}` and the per-cgroup `{cpu,memory,io}.pressure` files - reports the time the system or a cgroup spent stalled on each resource. The modern signal for "this system is sad."
6.2 Mechanical Detail¶
- `swapon`, `swapoff`, `/proc/swaps`, `vm.swappiness`.
- `zram` (compressed RAM-backed swap) configured via `systemd-zram-generator` or manually with `zramctl`.
- OOM tuning: `oom_score_adj` per process (`/proc/<pid>/oom_score_adj`, range -1000 to 1000); the systemd `OOMScoreAdjust=` directive; `vm.overcommit_memory` (0/1/2): heuristic / always allow / strict accounting.
- PSI semantics (see the sample below):
  - `some` - at least one task stalled.
  - `full` - all runnable tasks stalled (system-wide can't reach this for CPU).
  - Numbers are 10 s / 60 s / 300 s averages of the stall percentage.
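The file format, for reference (the values here are illustrative):

```
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.15 avg300=0.40 total=12345678
full avg10=0.00 avg60=0.03 avg300=0.10 total=4567890
```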
6.3 Lab - "Pressure and the OOM Killer"¶
- Write a memory-eater program. Run it inside a `memory.high=512M` cgroup. Observe `memory.pressure` rise.
- Push past `memory.max`; watch the OOM killer. Check `dmesg` and `journalctl -k | grep -i 'killed process'`.
- Set `oom_score_adj=-500` on a critical process; verify it survives an OOM event triggered by another, lower-priority hog.
- Measure PSI under realistic load: capture `memory.pressure` every second for 5 minutes during a workload spike. Plot.
6.4 Hardening Drill¶
- Add `MemoryHigh=` and `MemoryMax=` to every long-running service. Use `MemoryHigh` as the soft target (it throttles allocations) and `MemoryMax` as the hard cliff.
6.5 Performance Tuning Slice¶
- Hook `bpftrace -e 'kprobe:oom_kill_process { printf("oom in %s, victim %s\n", comm, str(((struct oom_control *)arg0)->chosen->comm)); }'` to observe OOM events live (the victim hangs off the `oom_control` argument).
Week 7 - The CPU Scheduler (CFS, EEVDF)¶
7.1 Conceptual Core¶
- The Completely Fair Scheduler (CFS) is a virtual-time, weighted-fair-queueing scheduler: each runnable task accumulates `vruntime`, and the scheduler picks the task with the smallest `vruntime`. Weight comes from the `nice` value.
- Since Linux 6.6, CFS has been replaced by EEVDF (Earliest Eligible Virtual Deadline First), which provides better latency guarantees while preserving fairness. The userspace API and most of the conceptual model are unchanged.
- Scheduling classes (priority order, top to bottom): `dl` (deadline), `rt` (real-time, FIFO/RR), `fair` (CFS/EEVDF), `idle`. Almost everything userspace runs is in `fair`.
7.2 Mechanical Detail¶
- Read `kernel/sched/fair.c` (the EEVDF code lives there as well).
- Per-CPU runqueues; the load balancer migrates tasks between CPUs.
- `sched_setscheduler(2)` and the `chrt` userspace tool.
- CPU affinity: `sched_setaffinity(2)`, `taskset`, systemd `CPUAffinity=`.
- Cgroup cpu controller (v2): `cpu.weight` (proportional), `cpu.max` (bandwidth).
- `/proc/<pid>/sched` - per-task scheduler stats.
- `/proc/sched_debug` - system-wide scheduler state (on newer kernels, `/sys/kernel/debug/sched/debug`).
7.3 Lab - "Scheduler Forensics"¶
- Run two CPU hogs at `nice 0`. Observe the split CPU. Lower one to `nice 19`; verify a roughly 95/5 split.
- Use `bpftrace -e 'tracepoint:sched:sched_switch { @[comm] = count(); }'` to see context-switch rates.
- Pin a workload to specific CPUs with `taskset -c 0,1`. Compare the cache-miss rate vs. unpinned with `perf stat`.
- Place two services in cgroups with `cpu.weight=100` and `cpu.weight=1000`. Verify the 10:1 split under contention.
7.4 Hardening Drill¶
- Constrain `SCHED_FIFO`/`SCHED_RR` with the `kernel.sched_rt_runtime_us` tunable, and use `RestrictRealtime=yes` (plus `RestrictSUIDSGID=yes`) in systemd units to prevent privilege escalation via RT scheduling.
7.5 Performance Tuning Slice¶
- `perf sched record sleep 10; perf sched latency` - identifies wakeup-latency outliers. Tunables on older kernels: `sched_wakeup_granularity_ns`, `sched_min_granularity_ns`; EEVDF largely auto-tunes.
Week 8 - Disk I/O Scheduling, Filesystems Beyond ext4¶
8.1 Conceptual Core¶
- The I/O stack: filesystem → block layer (with merging, sorting) → I/O scheduler → device driver → hardware.
- I/O schedulers (settable per device in `/sys/block/<dev>/queue/scheduler`): `none` (preferred for NVMe), `mq-deadline`, `kyber`, `bfq`. Choose `none` for fast SSDs, `mq-deadline` for mixed workloads, `bfq` for desktop fairness.
- Filesystems:
  - `ext4` - the default; well understood; journaled.
  - `xfs` - high throughput, parallel metadata; the RHEL default.
  - `btrfs` - copy-on-write, snapshots, multi-device. Use cautiously for high-throughput workloads.
  - `zfs` - out-of-tree (CDDL); mature, snapshots, integrity. Heavy memory footprint.
  - `tmpfs` - RAM-backed.
8.2 Mechanical Detail¶
- `/sys/block/<dev>/queue/{nr_requests,read_ahead_kb,scheduler,rotational}` - tunable per device.
- `iostat -xz 1` for per-device I/O stats. Watch `%util`, `await`, `r/s`, `w/s` (`svctm` is deprecated).
- `blktrace` + `btt` for fine-grained I/O timing. Modern alternative: `bpftrace`'s `biolatency`/`biosnoop` recipes.
- Mount options for performance:
  - `noatime` - don't update access times. Always set on busy filesystems.
  - `discard` vs. periodic `fstrim` - for SSDs; periodic is usually better.
  - `commit=N` - ext4 journal commit interval.
8.3 Lab - "I/O Forensics"¶
- Run `fio` with a representative workload. Measure a baseline.
- Toggle the I/O scheduler. Re-run. Compare.
- Use `bpftrace -e 'tracepoint:block:block_rq_issue { @[args->comm] = count(); }'` to see who's hitting the disk.
- Mount with and without `noatime` and measure the difference in metadata-write traffic.
8.4 Hardening Drill¶
- Set `nodev,nosuid,noexec` on the `/tmp`, `/home`, and `/var/tmp` mounts. Document why each matters.
8.5 Performance Tuning Slice¶
- Histogram I/O latencies in microseconds with a `biolatency`-style probe pair: `bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; } tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @us = hist((nsecs - @start[args->dev, args->sector]) / 1000); delete(@start[args->dev, args->sector]); }'`.
Month 2 Capstone Deliverable¶
A `memory-and-scheduling/` directory:
1. `meminfo-decoder/` - a script that reads `/proc/meminfo` and outputs a human-readable health report.
2. `psi-watcher/` - a daemon that alerts when `memory.pressure` exceeds a threshold.
3. `sched-bench/` - comparing nice-weighted, cgroup-weighted, and pinned workloads.
4. `io-tuner/` - a `fio` harness sweeping I/O-scheduler options on the local disk.
Each comes with a markdown writeup of measurements and tuning conclusions.
Month 3 - Namespaces, cgroups v2, eBPF¶
Goal: by the end of week 12 you can (a) construct a "container" by hand using unshare(2) + pivot_root(2) + cgroups, (b) explain every controller in cgroups v2 and design a resource policy for a multi-tenant host, (c) write an eBPF program that traces a kprobe and aggregates results in a map, and (d) read the output of bpftrace recipes from BPF Performance Tools and explain each.
Weeks¶
- Week 9 - Namespaces
- Week 10 - Control Groups v2
- Week 11 - eBPF: Foundations
- Week 12 - eBPF in Production: Observability Tools
Week 9 - Namespaces¶
9.1 Conceptual Core¶
- A namespace is a kernel mechanism that gives a process a private view of a global resource. Eight types exist:
  - `mnt` - mount tree.
  - `pid` - PID space; PID 1 inside is special (signals it has not installed handlers for are ignored; when it dies, the namespace dies).
  - `net` - network stack: interfaces, routing, sockets, iptables tables.
  - `uts` - hostname, domain name.
  - `ipc` - System V IPC, POSIX message queues.
  - `user` - UID/GID mappings; the security-relevant namespace.
  - `cgroup` - view of the cgroup hierarchy.
  - `time` - monotonic and boot-time clock offsets (relatively new; container images rarely use it).
- Namespaces are the primitive containers are built on. Docker / containerd / runc use them; you can use them directly with `unshare(2)`, `clone(2)` flags, and `setns(2)`.
9.2 Mechanical Detail¶
- `unshare --user --pid --net --mount --uts --ipc --cgroup --fork --map-root-user bash` gets you "inside" most namespaces in one shell (but mount propagation is shared until you `mount --make-rslave` or remount).
- `lsns -t <type>` enumerates active namespaces. `/proc/<pid>/ns/{mnt,pid,net,...}` are inode-numbered handles you can `setns(2)` into via `nsenter --target <pid> --all`.
- The `pivot_root(2)` syscall replaces the current root with a new one - this is how containers escape the host's `/`.
- User namespaces allow unprivileged users to "be root" inside the namespace - the foundation of rootless containers.
9.3 Lab - "Hand-Built Container"¶
Write a C program that:
1. `clone()`s with `CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWCGROUP`.
2. Configures UID/GID mappings via `/proc/<pid>/uid_map` and `gid_map`.
3. Creates a veth pair to give the namespace network access.
4. `pivot_root`s into a minimal Alpine rootfs.
5. `execve`s `/bin/sh`.
You should now have a working terminal "inside" a "container" that you wrote in ~150 lines of C.
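A starting skeleton for step 1, as a sketch (error handling minimal; steps 2-5 are left to the lab):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

static int child(void *arg)
{
    /* Steps 2-5 (uid_map, veth, pivot_root, execve) happen here. */
    printf("pid as seen inside: %d\n", getpid());   /* prints 1 */
    execlp("/bin/sh", "sh", (char *)NULL);
    return 1;   /* only reached if execlp failed */
}

int main(void)
{
    int flags = CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS
              | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWCGROUP | SIGCHLD;
    pid_t pid = clone(child, child_stack + STACK_SIZE, flags, NULL);
    if (pid < 0) { perror("clone"); exit(1); }
    waitpid(pid, NULL, 0);
    return 0;
}
```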
9.4 Hardening Drill¶
- Check whether unprivileged user namespaces are enabled (`kernel.unprivileged_userns_clone` on Debian-derived kernels; enabled by default on most modern distros). Read the CVE history for user namespaces - many privilege-escalation CVEs over the years are namespace-related; learn the surface.
9.5 Performance Tuning Slice¶
- `bpftrace -e 'tracepoint:syscalls:sys_enter_setns { @[comm] = count(); }'` - see who's switching namespaces. Useful for debugging container-runtime activity.
Week 10 - Control Groups v2¶
10.1 Conceptual Core¶
cgroups (control groups) are the kernel mechanism for resource limits and accounting. Every process belongs to exactly one cgroup per controller; the controllers enforce CPU, memory, I/O, and other limits collectively rather than per-process.
v2 is the unified hierarchy: a single tree under /sys/fs/cgroup/, every controller attached to it. v1 had a separate tree per controller and a long list of design lessons; v2 is the cleanup. All new code should target v2 - major distros default to it since 2019-2020 (systemd.unified_cgroup_hierarchy=1).
Controllers available in v2: cpu, memory, io, pids, cpuset, hugetlb, rdma, misc. Plus the implicit freezer mechanism (cgroup.freeze).
This is the foundation Kubernetes resource limits, Docker --memory, and systemd MemoryMax= all sit on top of. Master it and Kubernetes' "OOMKilled" alerts become diagnosable rather than mysterious.
10.2 Mechanical Detail¶
- Filesystem layout: `/sys/fs/cgroup/` is a single mount in v2. Each subdirectory is a cgroup; create one with `mkdir`. The files inside (`cpu.max`, `memory.max`, ...) are the controller knobs.
- Memory controller files you'll touch most:
  - `memory.low` - protection threshold; the kernel reclaims from cgroups exceeding their `low` before touching protected ones.
  - `memory.high` - soft target; exceeding it throttles allocations (slows the process) but doesn't OOM.
  - `memory.max` - hard cap; exceeding it triggers cgroup-level OOM.
  - `memory.events` - counters: `low`, `high`, `max`, `oom`, `oom_kill`. Read these to alert on pressure.
  - `memory.pressure` - PSI (Pressure Stall Information): time spent waiting on memory. The signal for "I'm not OOM, but I'm not happy either."
- CPU controller:
  - `cpu.weight` - proportional share (default 100, range 1-10000). Under contention, slices are weight-proportional.
  - `cpu.max` - bandwidth as `"$quota $period"` in microseconds (e.g., `"50000 100000"` = 50% of one CPU). Kubernetes CPU limits compile to this.
- IO controller:
  - `io.weight` - proportional, like `cpu.weight`.
  - `io.max` - per-device bandwidth and IOPS caps: `"8:0 rbps=10485760 wbps=10485760 riops=100 wiops=100"`.
- PIDs controller: `pids.max` - fork-bomb protection; set a sane upper bound per service.
- Moving a process: `echo $PID > /sys/fs/cgroup/foo/cgroup.procs`. Children of moved processes inherit the new cgroup.
- systemd integration: every systemd unit gets its own cgroup automatically. `systemd-cgls` shows the tree; `systemd-cgtop` shows live resource usage; `systemctl set-property foo.service MemoryMax=2G` updates limits live. (A by-hand sketch follows below.)
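Creating a cgroup by hand, sketched (on a systemd host, prefer a delegated subtree; note that a controller must be enabled in the parent's `cgroup.subtree_control` before its knobs appear in children):

```
# Enable controllers for children (often already done by systemd).
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/cgroup.subtree_control

mkdir /sys/fs/cgroup/demo
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max    # 50% of one CPU
echo 1G > /sys/fs/cgroup/demo/memory.high            # throttle point
echo 2G > /sys/fs/cgroup/demo/memory.max             # hard cap
echo $$ > /sys/fs/cgroup/demo/cgroup.procs           # move this shell in
```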
The trap
Setting memory.max without memory.high. When the workload spikes, you go straight from "fine" to OOM-killed with no warning signal. Set memory.high to ~80% of memory.max so you see throttling (and memory.events.high counter ticks) before the kill arrives - your alerting can fire on high events rather than waiting for oom_kill.
10.3 Lab - "Multi-Tenant Cgroups"¶
- Create three sibling cgroups `tenant-a`, `tenant-b`, `tenant-c` under `/sys/fs/cgroup/test/`.
- Set `cpu.weight` to 100/200/400 - under contention (run `stress-ng --cpu N` in each), verify the 1:2:4 split with `top`.
- Set `memory.high=1G` and `memory.max=2G` on each; run a memory hog (`stress-ng --vm 1 --vm-bytes 3G`), observe throttling first (the `high` counter in `memory.events` ticks, latency increases), then OOM (the `oom_kill` counter ticks).
- Set `io.max` to limit disk bandwidth on a specific device for one cgroup; run `fio` inside, verify with `iostat -x 1`.
10.4 Hardening Drill¶
For every long-running service on a managed host, set explicit values for: MemoryHigh=, MemoryMax=, CPUQuota=, TasksMax=, IOWeight=. Document the policy in RESOURCE_POLICY.md (one row per service). The right defaults: MemoryHigh at 80% of MemoryMax; CPUQuota per service tier; TasksMax at expected concurrency × 4 for headroom.
10.5 Performance Tuning Slice¶
Wire two Prometheus metrics from every cgroup:
- cgroup_memory_events_total{event="high|max|oom_kill"} from memory.events.
- cgroup_pressure_seconds_total from memory.pressure / cpu.pressure / io.pressure (PSI).
Alert on high rate > 0 sustained (early warning), and on any oom_kill (incident). PSI is the right signal for "approaching saturation" - fires before any hard limit is hit.
Week 11 - eBPF: Foundations¶
11.1 Conceptual Core¶
- eBPF is an in-kernel virtual machine that runs verified bytecode at hookpoints (kprobes, tracepoints, XDP, socket filters, LSM, etc.). It is the modern way to extend Linux without writing kernel modules.
- The verifier rejects programs that could crash the kernel (unbounded loops, invalid memory access, dereferencing null). This is what makes eBPF safe.
- Programs communicate with userspace via maps (hash, array, ring buffer, LRU, per-CPU variants).
11.2 Mechanical Detail¶
- Tooling tiers, low to high:
  - Raw eBPF C compiled with `clang -target bpf` and loaded with `libbpf`. The production-grade path.
  - `libbpf` + CO-RE (Compile Once, Run Everywhere) - portable across kernel versions.
  - BCC (Python frontend) - older; requires kernel headers at runtime.
  - `bpftrace` - high-level scripting; the fastest path to a one-off observation.
- Hookpoints:
  - kprobes / kretprobes - kernel function entry/exit.
  - uprobes / uretprobes - userspace function entry/exit.
  - tracepoints - stable kernel events with structured args. Prefer over kprobes when available.
  - XDP - packet processing at NIC-driver level (covered in Month 4).
  - `fentry`/`fexit` - modern, lower-overhead replacement for kprobes (BPF trampoline).
  - LSM hooks - security-relevant decisions.
  - Sched, syscalls, perf events - many more.
11.3 Lab - "First eBPF Tools"¶
- Install `bpftrace`. Run `bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'` and watch the system-wide open trace. Triage it.
- Write a `bpftrace` script that histograms `read()` syscall sizes by process (a sketch follows below).
- Convert one of the recipes to `libbpf` C + a userspace consumer, using `libbpf-bootstrap` as the template.
- Read 10 of Brendan Gregg's `bpftrace` recipes (`runqlat.bt`, `tcpaccept.bt`, `vfsstat.bt`, etc.) and run them. Document each.
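One possible answer to the second bullet (a sketch; sizes come from the syscall's return value, so hook the exit tracepoint):

```
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
    @bytes[comm] = hist(args->ret);
}
```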
11.4 Hardening Drill¶
- `kernel.unprivileged_bpf_disabled=1` is the modern default (only root, or processes with `CAP_BPF`, can load programs). Verify and document.
11.5 Performance Tuning Slice¶
- Use `runqlat` (run-queue latency histogram) to detect scheduler stalls. Capture a baseline; document p50/p99/p99.9.
Week 12 - eBPF in Production: Observability Tools¶
12.1 Conceptual Core¶
- eBPF makes "perfect tracing" possible: every important system event can be intercepted with low overhead, aggregated in-kernel, and shipped to userspace.
- The standard observability stack today (in 2026) is: Cilium (networking), Pixie / Parca / Pyroscope (profiling), Tetragon (security observability), Falco (runtime security). All eBPF-based.
12.2 Mechanical Detail¶
- Continuous profiling with Parca / Pyroscope: stack sampling at low frequency across all processes, attributing on-CPU time per function, with flame graphs in a UI.
- `bpftrace`-style tools you'll write yourself:
  - `tcpconnect` - log new TCP connections with PID and process name.
  - `execsnoop` - log every `execve` system-wide.
  - `opensnoop` - every file open.
  - `biosnoop` - every block I/O completion, with latency.
- The ring buffer map (`BPF_MAP_TYPE_RINGBUF`) is the modern way to ship events to userspace; it replaces the older perf-buffer pattern with simpler, faster semantics. A consumer sketch follows below.
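The userspace side of the ring-buffer pattern with libbpf, sketched (the skeleton header `connsnoop.skel.h`, the map name `events`, and the `struct event` layout are assumptions that come from your own BPF object):

```c
#include <bpf/libbpf.h>
#include <stdio.h>
#include <signal.h>

#include "connsnoop.skel.h"   /* hypothetical: bpftool gen skeleton output */

struct event { int pid; char comm[16]; };   /* must match the BPF side */

static volatile sig_atomic_t stop;
static void on_sigint(int sig) { stop = 1; }

static int handle_event(void *ctx, void *data, size_t len)
{
    const struct event *e = data;
    printf("%-7d %s\n", e->pid, e->comm);
    return 0;
}

int main(void)
{
    struct connsnoop_bpf *skel = connsnoop_bpf__open_and_load();
    if (!skel || connsnoop_bpf__attach(skel))
        return 1;

    /* Tie the RINGBUF map's fd to a polling consumer. */
    struct ring_buffer *rb = ring_buffer__new(
        bpf_map__fd(skel->maps.events), handle_event, NULL, NULL);

    signal(SIGINT, on_sigint);
    while (!stop)
        ring_buffer__poll(rb, 100 /* ms */);

    ring_buffer__free(rb);
    connsnoop_bpf__destroy(skel);
    return 0;
}
```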
12.3 Lab - "Build a Production-Grade eBPF Tool"¶
Write `connsnoop`:
- Hooks tcp_v4_connect and tcp_v6_connect (kprobe), inet_csk_accept (kretprobe), tcp_close.
- Records per-connection: 5-tuple, PID, process name, duration, bytes-tx/rx.
- Aggregates in-kernel via per-CPU hash maps, ships completion events through a ring buffer.
- Userspace consumer in C (with libbpf) or Go (with cilium/ebpf). Outputs JSON.
- Verifier-clean, CO-RE-portable across kernels 5.10+.
12.4 Hardening Drill¶
- Add `connsnoop` as a systemd service with full hardening. The eBPF program needs `CAP_BPF` and `CAP_PERFMON`; do not grant `CAP_SYS_ADMIN` (the legacy alternative).
12.5 Performance Tuning Slice¶
- Run `connsnoop` on a host doing real work; measure its CPU overhead with `perf stat`. Target <0.5% in steady state. If it's higher, narrow the hookpoints or aggregate more in-kernel.
Month 3 Capstone Deliverable¶
A `namespaces-cgroups-ebpf/` directory:
1. `mini-container/` (week 9) - the C program that builds a container by hand.
2. `multi-tenant-cgroups/` (week 10) - the cgroup-v2 policy + verification script.
3. `bpf-tour/` (week 11) - five `bpftrace` recipes with annotated output.
4. `connsnoop/` (week 12) - the libbpf tool + userspace consumer.
CI runs the recipes against a CI VM and validates output schemas. Open one upstream interaction: a doc-fix PR to bpftrace, or a tested bpftrace recipe submitted as an example.
Month 4 - Linux Networking: Netfilter, IPVS, XDP, Bridges, OVS¶
Goal: by the end of week 16 you can (a) trace a packet from NIC through XDP, the network stack, conntrack, iptables/nftables, sockets, and back, (b) configure a Linux bridge and an Open vSwitch flow, (c) write an XDP program that drops or redirects packets, and (d) reason about IPVS load-balancing modes.
Weeks¶
- Week 13 - The Network Stack: Sockets, NAPI, conntrack
- Week 14 - Netfilter / nftables / iptables, IPVS
- Week 15 - XDP and AF_XDP
- Week 16 - Bridges, VLANs, OVS
Week 13 - The Network Stack: Sockets, NAPI, conntrack¶
13.1 Conceptual Core¶
- The Linux networking stack is layered: NIC driver → NAPI (interrupt + polling hybrid) → `netif_receive_skb` → protocol handlers (IP, ARP) → transport (TCP/UDP) → socket buffer → userspace via `recv()`.
- An `sk_buff` is the kernel's packet representation: a struct with metadata (~250 bytes) plus pointers into the data buffer. It travels from driver to socket.
- Netfilter is the packet-mangling/filtering framework, with hooks at PRE_ROUTING, INPUT, FORWARD, OUTPUT, and POST_ROUTING. iptables and nftables are the userspace tools that program these hooks.
13.2 Mechanical Detail¶
- `ss -tnp` - show TCP sockets with PIDs. Replaces the deprecated `netstat`.
- The `ip` suite - `ip addr`, `ip route`, `ip rule`, `ip neigh`, `ip link`, `ip tuntap`. The single tool you need; `ifconfig` and `route` are deprecated.
- conntrack: stateful tracking of connections in netfilter. `conntrack -L` shows the table; `nf_conntrack_max` tunes capacity. Each entry is ~300 bytes; a busy host may track millions.
- `tc` - traffic control: queue disciplines (qdiscs), classes, filters. The shaping and policing tool. Modern qdiscs: `fq_codel` (default), `fq` (for high bandwidth), `cake`.
- TSO / GSO / GRO / LRO - segmentation offloads. Disable with `ethtool -K` for debugging; they distort packet timing.
13.3 Lab - "Packet Forensics"¶
- `tcpdump -i any -nn -X 'tcp port 443' -c 10` - capture and dissect TLS-handshake bytes.
- Trace a TCP connection's lifecycle with `bpftrace`'s `tcplife.bt`.
- Add a gratuitous DROP rule with `iptables -I INPUT -p icmp -j DROP` and verify with `ping`. Remove it. Repeat with `nft`.
- Inspect conntrack: `cat /proc/net/nf_conntrack` while a long-lived connection is open.
13.4 Hardening Drill¶
- `sysctl net.ipv4.tcp_syncookies=1`, `net.ipv4.conf.all.rp_filter=1`, `net.ipv4.conf.all.accept_source_route=0`, `net.ipv6.conf.all.accept_redirects=0`. Document each.
13.5 Performance Tuning Slice¶
- `ethtool -S <iface>` - driver stats (drops, errors, checksum issues). `ip -s link show` - interface stats. Identify any non-zero error counter.
Week 14 - Netfilter / nftables / iptables, IPVS¶
14.1 Conceptual Core¶
- iptables is being phased out in favor of nftables, its in-kernel successor. Both program the netfilter hooks.
- IPVS (IP Virtual Server) is the kernel-level L4 load balancer used by `kube-proxy` (`ipvs` mode) and many appliance-style LBs. Three modes: NAT, DR (direct return), TUN.
- The decision matrix: simple SNAT/DNAT/firewall → nftables. L4 LB at scale → IPVS. L7 LB → userspace (Envoy, Nginx, HAProxy).
14.2 Mechanical Detail¶
- nftables tables: families `ip`, `ip6`, `inet`, `arp`, `bridge`, `netdev`. Tables contain chains; chains contain rules; rules match and act.
- The standard pattern for a host firewall:

```
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif lo accept
        tcp dport 22 accept
        icmp type echo-request accept
    }
    chain forward {
        type filter hook forward priority 0; policy drop;
    }
    chain output {
        type filter hook output priority 0; policy accept;
    }
}
```

- IPVS is configured via `ipvsadm`. A virtual service has a VIP+port, a scheduling algorithm (`rr`, `wrr`, `lc`, `wlc`, `sh` source-hash), and real servers.
- `conntrack-tools` for inspecting and manipulating the conntrack table; `conntrackd` for HA replication.
14.3 Lab - "Build a Stateful Firewall and a Load Balancer"¶
- Convert an existing iptables ruleset to nftables. Verify equivalence with packet probes.
- Set up IPVS-DR: a VIP with two real servers; load-test with `wrk`. Compare with HAProxy on the same setup.
- Saturate the conntrack table on purpose; observe `nf_conntrack: table full, dropping packet` in dmesg. Tune `nf_conntrack_max`.
14.4 Hardening Drill¶
- Default-deny INPUT and FORWARD policies. Document the allowed flows. Ship the nftables ruleset as part of the host's idempotent provisioning.
14.5 Performance Tuning Slice¶
- Compare iptables vs. nftables vs. IPVS per-packet overhead with `perf stat` on a packet-flood workload.
Week 15 - XDP and AF_XDP¶
15.1 Conceptual Core¶
- XDP (eXpress Data Path) is an eBPF hookpoint at the driver level, before the kernel constructs an `sk_buff`. It is the earliest possible point to drop, redirect, or pass packets. Used for DDoS scrubbers, custom load balancers (Katran), and eBPF-based service meshes (Cilium).
- AF_XDP is a userspace fast path: pin a NIC queue to a userspace process and exchange packets via shared-memory rings. Throughput approaches DPDK with lower complexity.
- The four main XDP actions: `XDP_DROP`, `XDP_PASS` (continue to the kernel stack), `XDP_TX` (back out the same NIC), `XDP_REDIRECT` (to another NIC, an AF_XDP socket, or a CPU map).
15.2 Mechanical Detail¶
- Drivers vary in XDP support level: native (best), generic (slow software fallback), offloaded (some smartNICs).
- Verify with `ip link show <iface>` - look for the `xdp` mode flag.
- Attach: `bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp; bpftool net attach xdp pinned /sys/fs/bpf/my_xdp dev eth0`.
- XDP programs are constrained: no helpers that allocate memory, no loops past the verifier's bound, no kernel function calls outside the eBPF helper list.
15.3 Lab - "An XDP DDoS Scrubber"¶
Write an XDP program that:
- Drops UDP packets with source port < 1024 (a coarse DDoS-vector heuristic).
- Counts dropped packets per source IP in an LRU-hash map (1M entries).
- Userspace tool reads the map every second and emits Prometheus metrics.
- Test with pktgen or trafgen. Measure throughput and CPU overhead.
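A minimal sketch of the drop-and-count logic (IPv4/UDP only; the userspace metrics side is left out; compile with `clang -O2 -g -target bpf`; every bounds check is required or the verifier rejects the program):

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1 << 20);     /* 1M entries */
    __type(key, __u32);               /* source IPv4 */
    __type(value, __u64);             /* drop count */
} drops SEC(".maps");

SEC("xdp")
int scrub(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    /* Coarse heuristic from the lab: drop UDP with source port < 1024. */
    if (bpf_ntohs(udp->source) < 1024) {
        __u32 src = ip->saddr;
        __u64 one = 1, *cnt = bpf_map_lookup_elem(&drops, &src);
        if (cnt)
            __sync_fetch_and_add(cnt, 1);
        else
            bpf_map_update_elem(&drops, &src, &one, BPF_ANY);
        return XDP_DROP;
    }
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```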
15.4 Hardening Drill¶
- XDP programs require `CAP_NET_ADMIN` (and `CAP_BPF`). Document the operational privilege required.
15.5 Performance Tuning Slice¶
- Measure pps capacity with vs without XDP on the same NIC. Modern 25/40 Gbps NICs can drop 10s of Mpps with native XDP.
Week 16 - Bridges, VLANs, OVS¶
16.1 Conceptual Core¶
- A Linux bridge is an in-kernel L2 switch. Used by virtually every container runtime and VM hypervisor.
- VLAN (802.1Q) tagging segments a single L2 network into many.
- Open vSwitch (OVS) is a programmable virtual switch: flow-table-based (OpenFlow-compatible), with hardware offload to smartNICs. Used by OpenStack Neutron, Kubernetes (older networking), and OVN.
16.2 Mechanical Detail¶
- Linux bridge management with `ip link` and the `bridge` tool (commands sketched after this list).
- VLAN: `ip link add link eth0 name eth0.10 type vlan id 10`.
- OVS: `ovs-vsctl add-br br0; ovs-vsctl add-port br0 eth0; ovs-ofctl dump-flows br0`. The flow table is the programmable part.
- Bridge vs. OVS decision matrix: simple L2 connectivity → bridge. Programmable flows, OpenFlow, hardware offload → OVS.
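A plausible sketch of the basic bridge commands (device names are examples):

```
ip link add br0 type bridge
ip link set br0 up
ip link set eth0 master br0     # enslave a NIC to the bridge
bridge link                     # list bridge ports
bridge fdb show br br0          # the MAC forwarding table
```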
16.3 Lab - "Three Network Topologies"¶
- Two namespaces connected via a Linux bridge: classic container networking.
- Two namespaces on tagged VLANs sharing one bridge.
- The same topology in OVS, with explicit OpenFlow rules.
For each, verify connectivity with ping, capture with tcpdump, document the difference.
16.4 Hardening Drill¶
- Bridge/iptables integration: `sysctl net.bridge.bridge-nf-call-iptables=1` (so bridged traffic traverses iptables). Understand whether you want this - for some setups (e.g., transparent bridges) you don't.
16.5 Performance Tuning Slice¶
- Compare per-packet latency through a bridge vs OVS vs a direct veth pair under load.
Month 4 Capstone Deliverable¶
A `linux-networking/` directory:
1. `nft-firewall/` - a default-deny stateful firewall with a documented allowlist.
2. `ipvs-lb/` - an IPVS-DR load balancer with two backends and a health-check sidecar.
3. `xdp-scrubber/` - the DDoS scrubber + Prometheus exporter.
4. `bridge-vs-ovs/` - three topologies + a comparison report.
A `NETWORK_RUNBOOK.md` documenting interface inventory, MTU, sysctl tunables, and the firewall ruleset.
Month 5 - Security and Hardening¶
Goal: by the end of week 20 you can (a) author SELinux and AppArmor profiles for a service, (b) configure full-disk encryption with LUKS and explain the key derivation chain, (c) write a seccomp-bpf policy, and (d) ship an audited host that passes a basic CIS benchmark.
Weeks¶
- Week 17 - Discretionary and Mandatory Access Control
- Week 18 - Capabilities, Seccomp, no_new_privs
- Week 19 - Encryption at Rest: LUKS, dm-crypt, dm-verity
- Week 20 - Audit, Integrity Measurement, and Compliance
Week 17 - Discretionary and Mandatory Access Control¶
17.1 Conceptual Core¶
- DAC - discretionary access control - the classic `rwx` permissions plus POSIX ACLs (`getfacl`/`setfacl`). The owner decides who can access.
- MAC - mandatory access control - the kernel decides, based on a policy administered separately from file ownership. Two implementations dominate Linux:
  - SELinux (RHEL family, default-enforcing). Type-enforcement based. Powerful, complex.
  - AppArmor (Debian/Ubuntu/SUSE family). Path-based. Simpler, less expressive.
- Both are LSMs (Linux Security Modules); only one major MAC module is typically active per system.
17.2 Mechanical Detail¶
- The SELinux context: `<user>:<role>:<type>:<level>`. The type is the workhorse; rules express "type X may do operation Y on type Z."
- `getenforce`, `setenforce 0|1`, `audit2allow`, `semanage`, `restorecon`, `chcon`. Memorize these six.
- `ausearch -m AVC -ts recent` for SELinux denials. `audit2allow -a -M mymodule` to draft a policy module from observed denials. Production discipline: never `setenforce 0` to unblock; capture the denials, build a policy, ship it.
- AppArmor: profiles live in `/etc/apparmor.d/`. Generate with `aa-genprof`, refine with `aa-logprof`. Modes: `enforce` and `complain` (logs but allows). `aa-status` shows active profiles.
17.3 Lab - "MAC for an Echo Service"¶
Take week 1's echo service. Author:
1. An SELinux type-enforcement module that allows it to bind its socket and read its config but nothing else.
2. An equivalent AppArmor profile.
Verify with deliberate violations (try to read `/etc/shadow`); both should deny and audit.
17.4 Hardening Drill¶
- Survey your distro: which LSM is active? Read its policy for one well-known service (`httpd` on RHEL, `nginx` on Ubuntu) and explain the constraint set.
17.5 Performance Tuning Slice¶
- Measure SELinux/AppArmor overhead with `perf stat` on a syscall-heavy workload - typically <2%.
Week 18 - Capabilities, Seccomp, no_new_privs¶
18.1 Conceptual Core¶
- Linux capabilities subdivide the historical "root" privilege into ~40 discrete capabilities (`CAP_NET_ADMIN`, `CAP_SYS_PTRACE`, `CAP_DAC_OVERRIDE`, etc.). A process holds bounding, effective, permitted, inheritable, and ambient sets.
- The principle: a service should hold only the capabilities it needs. A web server binding a port below 1024 needs `CAP_NET_BIND_SERVICE`, not full root.
- seccomp-bpf is a syscall-level allowlist/denylist enforced by a (classic) BPF program installed via `prctl(PR_SET_SECCOMP)` or `seccomp(2)`. A killer feature for sandboxing.
- `no_new_privs` (`PR_SET_NO_NEW_PRIVS`): once set, neither the calling task nor its descendants can gain privileges via setuid binaries, file capabilities, or LSM transitions. Required before an unprivileged process applies a seccomp filter (and a generally good default).
18.2 Mechanical Detail¶
- `getcap` and `setcap` manage file capabilities; `getpcaps <pid>` shows process capabilities.
- systemd directives: `CapabilityBoundingSet=`, `AmbientCapabilities=`, `NoNewPrivileges=yes`, `SystemCallFilter=`, `SystemCallArchitectures=native`.
- `SystemCallFilter=@system-service` is a curated allowlist that covers most service workloads. Combine with explicit denylists for risky calls.
- For container runtimes, the Docker default seccomp profile (or an equivalent) is a reasonable baseline; understand why each blocked syscall is blocked.
18.3 Lab - "Capabilities and Seccomp"¶
- Convert a service that runs as root to one that runs as a non-root user with only the minimum capabilities.
- Author a seccomp policy using libseccomp that allows only the syscalls the service uses (a sketch follows after this list). Verify by attempting denied syscalls.
- Apply via systemd `SystemCallFilter=` and confirm.
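A minimal libseccomp sketch (assumes the libseccomp headers; link with `-lseccomp`; the allowlist here is illustrative, derive the real one from `strace -c`):

```c
#include <seccomp.h>
#include <unistd.h>

int main(void)
{
    /* Default action: kill the process on any unlisted syscall. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx)
        return 1;

    /* Illustrative allowlist for a tiny cat-like tool. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    if (seccomp_load(ctx) < 0)   /* also sets no_new_privs by default */
        return 1;
    seccomp_release(ctx);

    write(1, "sandboxed\n", 10); /* allowed */
    return 0;                    /* exit_group is allowed, so this works */
}
```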
18.4 Hardening Drill¶
- Review every long-running service on a host. For each: what capabilities does it actually need? Document. Tighten where possible.
18.5 Performance Tuning Slice¶
- seccomp adds a small per-syscall cost. Measure with `perf stat -e 'syscalls:sys_enter_*'` before and after.
Week 19 - Encryption at Rest: LUKS, dm-crypt, dm-verity¶
19.1 Conceptual Core¶
- LUKS (Linux Unified Key Setup) is the standard for full-disk encryption: a header at the start of a block device contains key-slots (each protected by a passphrase or keyfile), which unlock a master key, which is used by dm-crypt to en/decrypt block I/O.
- dm-verity provides integrity (not confidentiality) for read-only filesystems via a Merkle tree. Used in Android, Fedora Silverblue, and increasingly in container hosts.
- fscrypt offers per-file encryption at the ext4/f2fs/UBIFS layer, with per-user keys.
19.2 Mechanical Detail¶
- LUKS2 header structure: binary header + JSON metadata + key-slot area. `cryptsetup luksDump` shows the metadata.
- The chain: passphrase → Argon2id KDF → key-slot key → unlocks the master volume key → dm-crypt encrypts/decrypts with AES-XTS (the default).
- Key management:
  - Multiple key slots (up to 32 in LUKS2, vs. 8 in LUKS1). Add/remove with `cryptsetup luksAddKey`/`luksRemoveKey`.
  - TPM2 binding: `systemd-cryptenroll --tpm2-device=auto` for unattended boot with measured-boot integrity.
  - YubiKey FIDO2: `systemd-cryptenroll --fido2-device=auto`.
  - `crypttab(5)` for boot-time activation; systemd's generator translates it into units.
19.3 Lab - "Encrypt a Disk End to End"¶
- Create a LUKS2 volume on a spare disk or loopback file.
- Format with ext4. Mount.
- Add a TPM2-bound key slot. Enroll a recovery passphrase.
- Configure auto-unlock at boot via `crypttab`.
- Simulate disk theft: dump the device contents; verify they are opaque without the key.
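A sketch of the loopback variant (device and mount names are examples; these commands are destructive, so keep them on a scratch file):

```
truncate -s 2G /tmp/disk.img
losetup /dev/loop7 /tmp/disk.img
cryptsetup luksFormat --type luks2 /dev/loop7
cryptsetup open /dev/loop7 cryptlab
mkfs.ext4 /dev/mapper/cryptlab
mount /dev/mapper/cryptlab /mnt
systemd-cryptenroll --tpm2-device=auto /dev/loop7   # add a TPM2 key slot
cryptsetup luksDump /dev/loop7                      # inspect the key slots
```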
19.4 Hardening Drill¶
- For laptops: enable LUKS with a strong passphrase + TPM2 + measured boot. For servers: TPM2 binding plus `clevis` + `tang` for network-bound disk encryption (auto-unlock only when the host can reach the key server).
19.5 Performance Tuning Slice¶
- Measure encryption overhead with `fio`: LUKS-encrypted vs. plaintext. On modern CPUs with AES-NI, expect <5% throughput cost.
Week 20 - Audit, Integrity Measurement, and Compliance¶
20.1 Conceptual Core¶
- The Linux audit subsystem (`auditd`) generates structured logs of security-relevant events: syscalls, file access, login attempts, privilege escalations.
- IMA/EVM (Integrity Measurement Architecture / Extended Verification Module) hashes files at access time and optionally signs them; it integrates with the TPM for attestation.
- CIS benchmarks and STIGs are industry-standard hardening checklists. `openscap` and `lynis` automate auditing against them.
20.2 Mechanical Detail¶
- `auditctl` configures rules at runtime; `/etc/audit/rules.d/*.rules` for persistence. `aureport` and `ausearch` for querying.
- A reasonable baseline ruleset (sketched below): log every `execve`, every failed open of `/etc/passwd` or `/etc/shadow`, every change to the `auditd` config itself, and every privilege escalation.
- IMA: the `ima_policy=appraise_tcb` kernel parameter. Measures executables; with EVM, signs metadata.
- Compliance scanning: `lynis audit system` for a quick local audit; `openscap` (`oscap xccdf eval`) for SCAP-based formal benchmarks.
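One plausible baseline in `/etc/audit/rules.d/` syntax (the `-k` key names are arbitrary labels):

```
-a always,exit -F arch=b64 -S execve -k exec
-a always,exit -F arch=b64 -S openat -F exit=-EACCES -F path=/etc/shadow -k cred-fail
-w /etc/passwd -p wa -k ident
-w /etc/shadow -p wa -k ident
-w /etc/audit/ -p wa -k auditconfig
-w /usr/bin/sudo -p x -k priv-esc
```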
20.3 Lab - "An Audited Host"¶
- Configure auditd with a baseline ruleset.
- Trigger expected events (a failed `su`, an edit of `/etc/passwd`); verify the logs.
- Run `lynis audit system` - record the score and address the top 5 findings.
- (Optional) Boot with IMA enabled; measure the impact on boot time and observe `/sys/kernel/security/ima/ascii_runtime_measurements`.
20.4 Hardening Drill¶
- Ship an idempotent provisioning playbook (Ansible/shell) that applies the CIS baseline tunables: disable unused services, set `umask 027` for system accounts, limit `at`/`cron` to admins, etc.
20.5 Performance Tuning Slice¶
- Audit logging at high syscall rates is expensive. Measure the log volume; rate-limit chatty rules with `-F` filters, or move to `audisp-remote` to ship logs off-host.
Month 5 Capstone Deliverable¶
A `security-and-hardening/` directory:
1. `mac-profiles/` - SELinux + AppArmor profiles for the echo service.
2. `cap-seccomp/` - a minimum-capability + seccomp-policy template.
3. `luks-tang/` - a fully encrypted volume with network-bound auto-unlock.
4. `audit-baseline/` - auditd rules + a `lynis`-validated host playbook.
A `THREAT_MODEL.md` for the example service: assets, attack surfaces, mitigations.
Month 6 - Kernel Modules, Performance Mastery, Capstone¶
Goal: by the end of week 24 you have shipped one capstone in your chosen track and can defend every design decision in a senior systems-engineering interview.
Weeks¶
- Week 21 - Loadable Kernel Modules (LKM)
- Week 22 - Tracing and Performance Mastery: ftrace, perf, BPF
- Week 23 - Performance Tuning at Scale
- Week 24 - Capstone Integration & Defense
Week 21 - Loadable Kernel Modules (LKM)¶
21.1 Conceptual Core¶
- A loadable kernel module is C code compiled against the running kernel's headers, loaded via `insmod`/`modprobe`, and unloaded via `rmmod`. It runs at ring 0 with full kernel privileges. There is no safety net: a bug crashes the box.
- The legitimate uses for LKMs in 2026 are narrow: device drivers for hardware not supported by mainline, niche filesystem additions, and specific tracing or security modules that cannot be expressed as eBPF.
- For most "I want to extend the kernel" needs today, eBPF is the right answer. Pick an LKM only when you genuinely cannot express the work in eBPF.
21.2 Mechanical Detail¶
- The skeleton: a module is an object with `module_init`/`module_exit` entry points (a sketch follows below).
- Build via an out-of-tree `Makefile` that delegates to the kernel's Kbuild (sketch below).
- Character device skeleton: register a `cdev`, define `file_operations` (`open`, `release`, `read`, `write`, `unlocked_ioctl`), allocate a `dev_t` via `alloc_chrdev_region`, create a `class` and `device` so udev creates `/dev/<name>`.
- Locking: `spin_lock` (interrupt context), `mutex_lock` (sleepable), `rcu_read_lock` (read-mostly). Choosing wrong is a deadlock or a soft lockup.
- Memory: `kmalloc` (small, contiguous), `vmalloc` (large, possibly non-contiguous), `kmem_cache_*` (slab). Accounted in `/proc/slabinfo`.
- Signing & secure boot: production hosts with secure boot will reject unsigned modules. Sign with `scripts/sign-file` against a MOK (Machine Owner Key).
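The referenced skeletons, as minimal sketches (the module name `hello` is a placeholder):

```c
// hello.c - minimal out-of-tree module skeleton.
#include <linux/init.h>
#include <linux/module.h>
#include <linux/printk.h>

static int __init hello_init(void)
{
    pr_info("hello: loaded\n");
    return 0;                     /* a non-zero return aborts the load */
}

static void __exit hello_exit(void)
{
    pr_info("hello: unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Skeleton module");
```

And the matching Makefile (recipe lines must be indented with tabs):

```
obj-m += hello.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
```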
21.3 Lab - "A Character Device LKM"¶
Write `pkv`, a simple in-kernel key/value character device:
- `/dev/pkv` accepts writes of the form `key=value\n`; reads return the value for the last-written key.
- 100 KV slots, in-kernel hash table.
- Concurrency-correct under multiple writers/readers (use a mutex; pursue an rwlock variant as a stretch).
- ioctl operations for LIST and DELETE.
- KUnit tests in tree.
- Loads/unloads cleanly with no lockdep or KASAN warnings (turn both on in your test kernel).
21.4 Hardening Drill¶
- Read the CVE history for `staging` drivers; identify three classes of bugs that recur (use-after-free via `kfree` on an error path, missing `copy_from_user` length check, integer overflow). Audit your `pkv` for each.
21.5 Performance Tuning Slice¶
- Run `bpftrace` with kprobes on your module's functions; measure per-op latency. Compare against an equivalent userspace KV store (e.g., a Unix-socket server).
Week 22 - Tracing and Performance Mastery: ftrace, perf, BPF¶
22.1 Conceptual Core¶
- The Linux observability triad: ftrace (function tracer; in-kernel only), perf (sampling profiler + tracepoint subscriber), eBPF (programmable, low-overhead).
- Each has its niche. For "what is the kernel doing?" → ftrace function_graph. For "where is CPU time spent?" → perf record + flamegraph. For "summarize a behavior with low overhead" → eBPF.
22.2 Mechanical Detail¶
- ftrace via `/sys/kernel/tracing/`. Set `current_tracer`, filter with `set_ftrace_filter`, dump `trace`. Modern frontend: `trace-cmd`. A quick tour follows this list.
- perf:
  - `perf stat` - counter snapshot.
  - `perf record -g` + `perf report` - sampling profiler with call graphs. `perf script` to feed into flamegraph.pl for flamegraphs.
  - `perf trace` - strace-equivalent with low overhead.
  - `perf top` - live profiler.
- `bpftrace` for one-liners; `libbpf` C for production tools.
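A minimal function_graph tour through tracefs (the same interface `trace-cmd` wraps):

cd /sys/kernel/tracing
echo function_graph > current_tracer
echo vfs_read > set_graph_function    # graph only call trees rooted at vfs_read
head -40 trace
echo nop > current_tracer             # reset when done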
22.3 Lab-"End-to-End Profiling"¶
- Take a service running on a host. Capture: `perf record -F 99 -ag -- sleep 30`.
- Generate a flamegraph.
- Identify the top three CPU consumers; for each, propose a hypothesis and a fix.
- Compare with the same workload profiled by `parca` or `pyroscope`, if available.
22.4 Hardening Drill¶
`perf` requires `kernel.perf_event_paranoid` ≤ 2 for unprivileged use. Decide your policy: tighter (`=3`, perf disabled for non-root; a value carried by some distro kernels such as Debian's, not mainline) or looser (`=1`, allowing unprivileged users to profile their own processes with kernel samples).
22.5 Performance Tuning Slice¶
- Run `runqlat`, `cpudist`, `offcputime` (BPF) on a busy host. Build a one-page "what's wrong with this host?" diagnostic flow.
Week 23 - Performance Tuning at Scale¶
23.1 Conceptual Core¶
- Tuning at scale is systematic-not "tweak `vm.swappiness`." It is: measure, hypothesize, change one variable, re-measure, document.
- Brendan Gregg's USE method: for every resource, characterize Utilization, Saturation, Errors. Apply to CPU, memory, disk, network.
- The common bottlenecks, in rough rank order: I/O (latency or throughput), memory pressure (PSI), syscall-frequency or context-switch storms, lock contention, NIC drops.
23.2 Mechanical Detail¶
- CPU: `mpstat -P ALL 1`, `pidstat 1`. Look for one-CPU-pegged. Check for IRQ imbalance (`/proc/interrupts`).
- Memory: `vmstat 1`, `/proc/pressure/memory`. PSI > 0% sustained = problem.
- Disk: `iostat -xz 1`. Look at `await` and `%util`. >70% `%util` and >10 ms `await` = saturated.
- Network: `sar -n DEV 1`, `ethtool -S <iface>`. Drops, errors, frame-too-long counts.
- Kernel: `perf top` - if `__do_softirq` or `_raw_spin_lock` is hot, dig further.
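A one-shot first-pass sweep with the tools above (`eth0` is a placeholder interface):

mpstat -P ALL 1 3        # CPU utilization per core
cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io   # saturation (PSI)
vmstat 1 3               # run queue, swap, context switches
iostat -xz 1 3           # disk await and %util
sar -n DEV 1 3           # NIC throughput
ethtool -S eth0 | grep -iE 'drop|err'   # NIC errors and drops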
23.3 Lab-"Triage Drill"¶
A scripted "broken host" is provided (or build one): a VM with one of {disk-bound, memory-bound, network-bound, lock-contended, scheduler-thrashing} pathologies. Diagnose using only the tools above. Document the inference chain. Then introduce a fix and verify.
23.4 Hardening Drill¶
- Codify the USE-method dashboard for your environment. Wire to Prometheus/Grafana. Alert on saturation > 80% for any resource for 5 min.
23.5 Performance Tuning Slice¶
- Sysctl baseline for high-throughput servers (review and adapt-never copy blindly): see Appendix A.2. Each line should be paired with a justification in your runbook.
Week 24 - Capstone Integration & Defense¶
24.1 Conceptual Core¶
The final week is integration, not new material. Bring your chosen capstone (see CAPSTONE_PROJECTS.md) to defensible quality.
24.2 Final Hardening Checklist¶
- CIS benchmark (or `lynis`) score documented; top findings addressed.
- LSM (SELinux or AppArmor) enforcing for any service touched.
- All long-running services systemd-managed with full hardening directives.
- auditd configured; ruleset documented.
- LUKS (where applicable); TPM2 binding documented.
- Sysctl baseline applied; deviations explained.
- Boot is reproducible: same image → same hash, where applicable.
- Observability: `node_exporter`, eBPF observability tools, log shipping.
- Runbooks for: OOM, disk-full, network-down, runaway-CPU, broken-DNS.
24.3 Lab-"Defend the Host"¶
Schedule a 45-minute mock review with a peer. Walk through: the host's threat model, the capstone artifact, the observability story, and a live demo of triaging a fault. Defend every choice-cgroup policy, LSM type, sysctl values, auditd rules.
24.4 Performance Tuning Slice¶
- Final pass: capture a 1-minute `perf record -ag` flamegraph of the capstone under representative load. Commit it. This is the resume artifact.
Month 6 Deliverable¶
The capstone artifact, plus an aggregated linux-mastery/ repo containing every prior month's deliverable.
Appendix A-Hardening and Performance Tuning Reference¶
Curriculum-wide consolidation of the hardening and tuning slices. By week 24 the reader's host-baseline/ template should contain working examples of each section.
A.1 The systemd Hardening Cheat Sheet¶
For every long-running service, evaluate each:
[Service]
# Identity
DynamicUser=yes
User=svc
Group=svc
# Filesystem isolation
ProtectSystem=strict
ProtectHome=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
ProtectClock=yes
ProtectHostname=yes
ProtectProc=invisible
PrivateTmp=yes
PrivateDevices=yes
PrivateUsers=yes
PrivateMounts=yes
ReadOnlyPaths=/
ReadWritePaths=/var/lib/svc
# Networking
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
IPAddressAllow=localhost
IPAddressDeny=any
RestrictNetworkInterfaces=lo eth0
# Capabilities & syscalls
NoNewPrivileges=yes
CapabilityBoundingSet=
AmbientCapabilities=
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
# Resources
MemoryMax=512M
MemoryHigh=384M
CPUQuota=200%
TasksMax=128
LimitNOFILE=65536
# Other
LockPersonality=yes
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
RemoveIPC=yes
UMask=0077
Validate with `systemd-analyze security <unit>`, which scores each directive and reports an overall exposure level (lower is better).
A.2 Sysctl Baseline (Server)¶
# Network-connection backlog
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.core.netdev_max_backlog = 16384
# Network-TCP behavior
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
# Network-anti-spoof
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
# VM / memory
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
vm.overcommit_memory = 0
vm.mmap_min_addr = 65536
vm.unprivileged_userfaultfd = 0
# File handles & inotify
fs.file-max = 2097152
fs.inotify.max_user_watches = 524288
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
# Kernel hardening
kernel.dmesg_restrict = 1
kernel.kptr_restrict = 2
kernel.unprivileged_bpf_disabled = 1
kernel.yama.ptrace_scope = 1
kernel.perf_event_paranoid = 2
kernel.kexec_load_disabled = 1
Customize per workload; never copy blindly.
A.3 The perf Reference Card¶
| Goal | Command |
|---|---|
| Top CPU consumers (live) | perf top -F 99 -g |
| Sample profile + call graph | perf record -F 99 -g -- sleep 30; perf report |
| Flamegraph | perf record -F 99 -ag -- sleep 30; perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg |
| Counter snapshot | perf stat -e cycles,instructions,cache-misses ./prog |
| Syscalls (strace replacement) | perf trace |
| Sched debug | perf sched record sleep 10; perf sched latency |
| Block I/O | perf trace -e block:* -- sleep 10 |
A.4 The bpftrace Reference Card¶
# top syscalls per process for 10 s
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count() }' -c 'sleep 10'
# file open audit
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)) }'
# tcp connect latency
bpftrace -e 'kprobe:tcp_v4_connect { @start[tid] = nsecs }
kretprobe:tcp_v4_connect /@start[tid]/ {
@lat = hist((nsecs - @start[tid])/1000); delete(@start[tid])
}'
# run-queue latency histogram
bpftrace tools/runqlat.bt
# what's filling the page cache
bpftrace -e 'kprobe:add_to_page_cache_lru { @[comm] = count() }'
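# (on newer kernels add_to_page_cache_lru is gone; probe filemap_add_folio instead)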
A.5 SystemTap (Legacy)¶
SystemTap predates eBPF and is still occasionally useful on RHEL 7-era kernels. One line to show the flavor (the `syscall.openat` probe and its `filename` variable come from the stap tapsets):
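sudo stap -e 'probe syscall.openat { printf("%s %s\n", execname(), filename) }'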
For new work, prefer eBPF / `bpftrace`. Only learn `stap` if your environment forces it.
A.6 The host-baseline/ Template¶
host-baseline/
ansible/
roles/
common/ # hostname, time, base packages
sshd/ # hardened sshd_config
audit/ # auditd ruleset
sysctl/ # the baseline above
firewall/ # nftables ruleset
lsm/ # SELinux or AppArmor profiles
observability/ # node_exporter, journald-remote, eBPF tools
cgroups/ # tenant cgroup template
systemd-units/ # service templates
ebpf-tools/ # in-house tracing tools
RUNBOOK.md
THREAT_MODEL.md
This is the artifact every host you bring up after week 24 should be provisioned from.
Appendix B-Build-From-Scratch Linux Toolbox¶
A working Linux engineer should have implemented each of the following at least once.
B.1 A Self-Healing systemd Service¶
# /etc/systemd/system/self-heal.service
[Unit]
Description=Self-healing application
After=network-online.target
Wants=network-online.target
# restart rate-limiting lives in [Unit] (modern names, systemd >= 229)
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=notify
ExecStart=/usr/local/bin/myapp
WatchdogSec=30s
Restart=always
RestartSec=2s
TimeoutStopSec=10
NotifyAccess=main
# health-check via WatchdogSec: app calls sd_notify(0, "WATCHDOG=1") periodically
# if it stops, systemd kills and restarts
# (paste the hardening block from Appendix A here)
[Install]
WantedBy=multi-user.target
The application calls sd_notify(0, "READY=1") after init and sd_notify(0, "WATCHDOG=1") periodically. If the watchdog timer expires, systemd kills and restarts. This is the standard pattern for self-healing in production.
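To feel the protocol without writing C, a shell sketch (this variant needs `NotifyAccess=all`, because `systemd-notify` sends from a short-lived child process):

#!/bin/bash
systemd-notify --ready              # READY=1
while :; do
    # ... one unit of real work here ...
    systemd-notify WATCHDOG=1       # pet the watchdog
    sleep 10                        # keep this well under WatchdogSec
done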
B.2 A Minimal Init (PID 1)¶
For containers or micro-VMs:
#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
static pid_t child;
static void forward(int sig) { kill(child, sig); }

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    child = fork();
    if (child == 0) { execvp(argv[1], argv + 1); _exit(127); }
    // PID 1 must forward signals and reap zombies
    signal(SIGTERM, forward);
    signal(SIGINT, forward);
    for (;;) {
        int status;
        pid_t p = waitpid(-1, &status, 0);  // reaps any zombie, not just our child
        if (p == child)
            return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
    }
}
Use `tini` or `dumb-init` in production; this is for understanding.
B.3 A Hand-Built Container¶
(Sketch-see Month 3 lab.)
- clone(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWCGROUP, ...)
- Write UID/GID maps via /proc/<pid>/uid_map, gid_map.
- Mount proc, sysfs, tmpfs inside.
- pivot_root into rootfs.
- Configure veth pair on the host; move one end into the new netns.
- execve user command.
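Roughly the same sequence from the shell with util-linux `unshare` (which defaults to private mount propagation in the new mount namespace):

sudo unshare --user --map-root-user --pid --fork --mount --uts --ipc --net --cgroup \
    bash -c 'mount -t proc proc /proc && hostname demo && exec bash'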
B.4 A Kernel Module Skeleton¶
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
static int __init mymod_init(void) {
pr_info("mymod loaded\n");
return 0;
}
static void __exit mymod_exit(void) {
pr_info("mymod unloaded\n");
}
module_init(mymod_init);
module_exit(mymod_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("you");
MODULE_DESCRIPTION("skeleton");
Makefile:
obj-m += mymod.o
KDIR := /lib/modules/$(shell uname -r)/build
all:
$(MAKE) -C $(KDIR) M=$(PWD) modules
clean:
$(MAKE) -C $(KDIR) M=$(PWD) clean
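Build and smoke-test (the module logs via `pr_info`, so check `dmesg`):

make                    # builds mymod.ko against the running kernel's headers
sudo insmod mymod.ko    # "mymod loaded" appears in the kernel log
sudo rmmod mymod        # "mymod unloaded"
dmesg | tail -n 2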
B.5 An eBPF Skeleton (libbpf + CO-RE)¶
prog.bpf.c:
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 1 << 20);
} events SEC(".maps");
SEC("tracepoint/syscalls/sys_enter_execve")
int handle_execve(void *ctx) {
char *e = bpf_ringbuf_reserve(&events, 16, 0);
if (!e) return 0;
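/* fill the reserved bytes with real event data (pid, comm, ...) before submitting */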
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
User-side: use libbpf skeletons (`bpftool gen skeleton prog.bpf.o > prog.skel.h`). The full pattern is in `libbpf-bootstrap`-clone it and modify.
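A typical build sequence, mirroring libbpf-bootstrap (the `-D__TARGET_ARCH_*` define depends on your architecture):

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h   # CO-RE type info
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -c prog.bpf.c -o prog.bpf.o
bpftool gen skeleton prog.bpf.o > prog.skel.h
# compile your userspace loader against prog.skel.h; link with -lbpf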
B.6 A Seccomp-bpf Allowlist¶
#include <seccomp.h>   /* libseccomp; link with -lseccomp */

int main(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);  /* default action: kill the process */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    // ... only what you actually need
    seccomp_load(ctx);
    return 0;   /* exit_group is allowed, so a normal exit still works */
}
For systemd-managed services, prefer SystemCallFilter= directives; this raw API is for embedded code, sandboxes, and runtime libraries.
B.7 A udev Rule¶
/etc/udev/rules.d/99-mydev.rules:
SUBSYSTEM=="usb", ATTR{idVendor}=="abcd", ATTR{idProduct}=="1234", \
MODE="0660", GROUP="plugdev", SYMLINK+="mydev"
udevadm control --reload-rules && udevadm trigger.
B.8 A Repeatable VM Lab Setup¶
The final, hidden artifact: a Vagrantfile (or qemu script) that boots a fresh Ubuntu/Fedora VM with cloud-init, pre-installs bpftrace, perf, trace-cmd, auditd, cryptsetup, your toolbox above. Every lab in this curriculum should be reproducible from this base in under 5 minutes.
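One minimal shape for the qemu variant, assuming cloud-image-utils for the seed image (image URL and user-data contents are examples):

wget -N https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img
cloud-localds seed.img user-data.yaml   # user-data installs bpftrace, perf, trace-cmd, auditd, cryptsetup
qemu-system-x86_64 -enable-kvm -m 4G -smp 4 \
  -drive file=noble-server-cloudimg-amd64.img,if=virtio \
  -drive file=seed.img,if=virtio \
  -nic user,hostfwd=tcp::2222-:22 -nographic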
Appendix C-Contributing to the Linux Kernel: A Playbook¶
The Linux kernel is famously approachable, and famously demanding. This appendix is the on-ramp.
C.1 Mental Model¶
The Linux kernel is developed via mailing lists (LKML and per-subsystem lists), with patches sent inline via git send-email. There is no GitHub PR equivalent. Reviewers reply on-list; subsystem maintainers (MAINTAINERS file) merge into their trees; Linus pulls from those into mainline.
Implications:
1. You send patches as emails, not PRs.
2. The cultural norm is direct, sometimes blunt feedback. Do not take it personally.
3. The maintainer set is finite and busy. A two-week response cycle is normal; six weeks is not unusual; ignored = re-send a polite ping.
C.2 Setting Up¶
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
make defconfig # or copy your distro's config
make -j$(nproc)
For development, prefer working off the subsystem tree (e.g., net-next for networking work, tip for sched/locking, staging for drivers in incubation). The MAINTAINERS file lists per-subsystem trees.
Configure git:
git config sendemail.smtpserver smtp.example.com
git config sendemail.smtpuser you@example.com
git config user.signingkey "..."
C.3 Where the Easy Wins Are¶
C.3.1 Documentation¶
- `Documentation/` is large and uneven. Typo fixes and clarifications are welcomed by `linux-doc@vger.kernel.org`.
- Trivially good first patch.
C.3.2 Coding-style fixes¶
- `scripts/checkpatch.pl` flags style violations. Submit a series fixing them. (But: huge mass-style patches are sometimes resisted as churn-limit to one file or one subsystem.)
C.3.3 Staging drivers¶
- `drivers/staging/` is the incubator for drivers being prepped for mainline. Cleanup work here (sparse fixes, checkpatch, removing unused code) is well-supported.
C.3.4 New tracepoints / debugfs entries¶
- Adding a tracepoint to expose useful info to eBPF tooling is a tractable medium-difficulty contribution. Discuss design on-list first.
C.3.5 Bug fixes¶
- The `bugzilla.kernel.org` tracker has reproducible bugs. Pick one in a subsystem you understand.
C.3.6 Don't start here (yet)¶
- Scheduler core (`kernel/sched/`).
- Memory management core (`mm/`).
- Networking core (`net/core/`, `net/ipv4/tcp*`).
- VFS core (`fs/namei.c`, `fs/dcache.c`).
C.4 The First-Patch Workflow¶
- Pick a small change. A typo, a checkpatch warning, a tested driver fix.
- Make the change in a branch.
- Run `scripts/checkpatch.pl --strict <patch>`. Fix all warnings.
- Build the affected subsystem. For drivers: `make M=drivers/foo/`.
- Test on real or virtual hardware. A patch with no Tested-by is suspect.
- Commit with a kernel-style message:

  subsys: short imperative description (under 70 chars)

  Body explaining the problem and why this change fixes it. Past tense
  for the bug ("crashed when..."), imperative for the fix ("Avoid the
  crash by checking..."). Wrap at 72 columns.

  Fixes: <12-char-sha1> ("subject of bug-introducing commit")
  Cc: stable@vger.kernel.org # if backport-worthy
  Signed-off-by: Your Name <you@example.com>

  The `Signed-off-by` is a DCO (Developer Certificate of Origin) declaration; mandatory.
- Identify the recipients with `scripts/get_maintainer.pl <patch>`. This lists the subsystem maintainer, reviewers, mailing list. Send to all; CC LKML.
- Send: `git send-email --to=<maintainer> --cc=<list> --cc=linux-kernel@vger.kernel.org <patch>` (recipients from the previous step).
- Address review. Reviewers may ask for design changes, additional testing, or a breakup into multiple patches. Each new version is `[PATCH v2] subsys: ...`. Always include a changelog after the `---` line describing what changed since v1.
- Maintainer applies. Your name and email go into the commit log.
C.5 The MAINTAINERS Map¶
The MAINTAINERS file at the root of the tree lists, for every subsystem: maintainer(s), reviewers, mailing list, source files in scope, status (Maintained / Supported / Odd Fixes / Orphan).
`scripts/get_maintainer.pl` parses it. Always run it before sending a patch.
C.6 Reading Map¶
For depth, in this order:
| File | What it teaches |
|---|---|
| `Documentation/process/submitting-patches.rst` | Authoritative process. Read first. |
| `Documentation/process/coding-style.rst` | The kernel's C style; non-negotiable. |
| `Documentation/process/email-clients.rst` | Why your email client matters. |
| `Documentation/dev-tools/sparse.rst` | Kernel-specific static analysis. |
| `Documentation/admin-guide/cgroup-v2.rst` | The cgroup-v2 interface, normative. |
| `Documentation/scheduler/sched-design-CFS.rst` | CFS / EEVDF design. |
| `Documentation/vm/` | Memory management deep dives. |
| `Documentation/networking/` | Per-subsystem networking docs. |
| `Documentation/bpf/` | Modern eBPF design and ABIs. |
C.7 Adjacent Targets if Mainline Is Too Heavy¶
- `bpftrace`-high contribution velocity, friendly maintainers.
- `bcc`-older but still active.
- `perf` userspace tools-`tools/perf/` in the kernel tree, but its own dynamic.
- `util-linux`-`mount`, `lsblk`, `nsenter`, etc. Active, contributor-friendly.
- `systemd`-large but well-organized; on GitHub, with PRs.
- `iproute2`-`ip`, `tc`, `ss`. Smaller surface, important.
A merged contribution to any of these signals real Linux fluency.
C.8 Calibration¶
A reasonable goal for a curriculum graduate:
- By end of week 23: a patch sent to LKML or a subsystem list (could be a doc fix or a checkpatch cleanup).
- By end of capstone: that patch merged.
- 6 months post-curriculum: a substantive contribution-a driver fix, a small tracepoint addition, or a bpftrace tool merged.
Patient, persistent contributors become trusted contributors. Trusted contributors become maintainers.
Capstone Projects-Three Tracks, One Choice¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Kernel Module: An Out-of-Tree LKM¶
Outcome: a non-trivial out-of-tree Linux kernel module, KUnit-tested, sparse-clean, KASAN-clean, with a clear README and a path toward upstream submission (even if you don't take it all the way).
Suggested scopes¶
- A character-device key/value store (week 21 lab, hardened). Adds: `ioctl` for batch ops, an `mmap` interface for zero-copy reads, an RCU-protected reader path.
- A netfilter hook. A small accelerator that, e.g., counts packets matching a configurable BPF filter at the netfilter ingress hook, with stats exposed via a `procfs` entry.
- A custom tracepoint suite. Add tracepoints to a subsystem of your choice (e.g., your `pkv` module from the lab) and write a `bpftrace` consumer.
Acceptance¶
- Loads cleanly on at least two LTS kernels (e.g., 6.6 and 6.12).
- KUnit tests in tree; pass on both kernels.
- KASAN, lockdep, KCSAN warnings: zero across stress-test load.
- Signed for secure boot.
- A `README.md` with build, install, and use; a `DESIGN.md` with locking and memory ownership documented.
Skills exercised¶
- Months 1 (kernel boundary), 2 (memory + scheduling internals), 6 (LKM development).
Track 2-eBPF Observability Tool¶
Outcome: a production-grade tracing tool comparable in quality to one of Brendan Gregg's BCC tools, with a proper userspace consumer, a Prometheus exporter, and CO-RE portability.
Suggested scopes¶
- `syscallat` - system-call latency histograms, per-syscall, per-process, with low overhead. Equivalent of `bpftrace`'s `syscount` but production-quality.
- `tcptop` - top-N connections by bytes/sec, sortable by direction. Cilium's Hubble has equivalents; do this from scratch.
- A profiler-like tool that, given a PID, samples on-CPU stacks at 99 Hz, aggregates with a frequency table, and exposes flamegraph data.
Acceptance¶
- Implemented with `libbpf` + CO-RE.
- Userspace consumer in C or Go (using `cilium/ebpf`).
- Runs on kernels 5.10+ without recompilation.
- Verifier-clean across architectures (x86_64 + aarch64 minimum).
- Prometheus exporter with low-cardinality labels.
- A `bpftrace` equivalent for comparison; document why the production version exists.
- CPU overhead under representative load: < 1%.
Skills exercised¶
- Months 3 (eBPF), 4 (networking, if you pick `tcptop`), 6 (perf tuning).
Track 3-Self-Healing Distributed Service¶
Outcome: a small distributed service (a multi-instance HTTP API, a job runner, a metrics collector) deployed on Linux hosts with a comprehensive self-healing posture.
Suggested scopes¶
- A 3-node deployment of a small HTTP service:
- Each node is a hardened Ubuntu/Debian/Rocky host provisioned by Ansible.
- The service is systemd-managed with watchdog, full hardening directives, cgroups-v2 resource limits.
- On any node failure, the survivors continue serving (use a TCP load balancer + healthcheck, e.g., HAProxy or IPVS).
- Memory pressure (PSI > X%) triggers a soft restart of the worst offender via a cgroup-event watcher.
- Disk pressure triggers log rotation and old-data cleanup.
- A `chaos.sh` script kills random nodes; the cluster recovers without human intervention.
Acceptance¶
- Reproducible from Ansible: `ansible-playbook site.yml` brings up 3 hosts from blank Ubuntu cloud images.
- Full observability: `node_exporter`, journald, eBPF tools, Prometheus + Grafana.
- A documented threat model and CIS-aligned baseline (lynis score).
- A 60-minute "chaos demo": run `chaos.sh`; observe full self-healing; produce a one-page incident report from logs.
- Encryption at rest (LUKS on data volumes); TLS between nodes (`step-ca` or self-signed); auditd shipping logs off-host.
Skills exercised¶
- All months. This is the integrative track-the right choice if you want operations-engineer breadth.
Cross-Track Requirements¶
- `host-baseline/` template integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`, `RUNBOOK.md`, `RECOVERY.md`.
- Defense readiness: a 45-minute walkthrough with a peer.
Worked example - Week 10: cgroups v2 on your laptop, end to end¶
Companion to Linux Kernel → Month 03 → Week 10: Cgroups v2. The week explains the unified hierarchy and the controllers. This page is a hands-on tour: create a cgroup, put a process in it, limit it, watch the limit bite.
You need Linux 5.x+ with cgroups v2 enabled (default since ~2020 on most distros). Verify:
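$ stat -fc %T /sys/fs/cgroup/
cgroup2fs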
If you see tmpfs instead, you're on cgroups v1 hybrid mode - add `systemd.unified_cgroup_hierarchy=1` to the kernel command line and reboot, or use a recent Fedora/Arch/Ubuntu 22.04+ where v2 is the default.
The cgroup filesystem¶
cgroups v2 is just a filesystem. Every directory under /sys/fs/cgroup/ is a cgroup; child directories are nested cgroups; the files inside control what the cgroup does.
$ ls /sys/fs/cgroup/
cgroup.controllers cgroup.procs cpu.stat memory.current ...
cgroup.max.depth cgroup.subtree_control io.stat memory.max ...
cgroup.controllers- what controllers are available to descendants.cgroup.subtree_control- what controllers are enabled for descendants.cgroup.procs- PIDs currently in this cgroup (root has all of them).memory.max,cpu.max,io.max- controller-specific knobs.
Create a cgroup¶
$ sudo mkdir /sys/fs/cgroup/demo
$ ls /sys/fs/cgroup/demo
cgroup.controllers cgroup.events cgroup.procs cgroup.stat cgroup.type ...
The kernel populated the new directory with the standard files. Some are read-only (cgroup.controllers), some are writable knobs.
But notice: there are no memory.max or cpu.max files yet. Controllers must be enabled by the parent before they appear in a child:
$ cat /sys/fs/cgroup/cgroup.subtree_control
(empty)
$ echo "+memory +cpu +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ ls /sys/fs/cgroup/demo | grep -E '^(memory|cpu|io)'
cpu.idle
cpu.max
cpu.stat
io.max
io.stat
memory.current
memory.events
memory.max
memory.peak
memory.stat
...
Now the controllers are available in /sys/fs/cgroup/demo/.
Limit memory¶
Cap the cgroup at 100 MB:
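$ echo 104857600 | sudo tee /sys/fs/cgroup/demo/memory.max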
Or use the M / G suffix syntax (kernel parses it):
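$ echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max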
That's the hard limit. The kernel will OOM-kill anything in this cgroup that pushes past it.
Put a process in the cgroup¶
To move a process in, write its PID to cgroup.procs:
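$ sudo bash -c 'echo $$ > /sys/fs/cgroup/demo/cgroup.procs && exec bash'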
The trick: $$ is the PID of the subshell before exec replaces it; writing to cgroup.procs moves the PID; exec bash then replaces the subshell so the new bash inherits the cgroup. From this new shell, every child process is also in demo.
Verify:
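$ cat /proc/self/cgroup
0::/demo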
Yes - this process is in /demo.
Watch the limit bite¶
In the demo shell, allocate memory aggressively:
$ python3 -c '
chunks = []
for i in range(1000):
chunks.append(b"x" * (1024 * 1024))
if i % 10 == 0:
print(i, "MB")
'
0 MB
10 MB
20 MB
...
90 MB
Killed
At ~100 MB, the OOM killer fires. Note that only the process in this cgroup was killed; the rest of your system is fine.
Check what happened:
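$ grep oom_kill /sys/fs/cgroup/demo/memory.events
oom_kill 1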
`oom_kill 1` - one process was OOM-killed inside this cgroup. Next time a container dies mysteriously, this counter is where to look.
Limit CPU¶
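$ echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max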
The format is "quota period." 50000 100000 means "50ms of CPU per 100ms wall-clock period." That's 50% of one core, regardless of how many cores you have.
Test:
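$ sudo bash -c 'echo $$ > /sys/fs/cgroup/demo/cgroup.procs && exec bash'
$ yes > /dev/null &
$ top -bn1 -p $! | tail -1    # the busy loop plateaus at ~50% of one core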
You can also write `max 100000` to `cpu.max` to remove the cap. Or set `cpu.weight` (proportional sharing) for a softer policy.
Limit IO¶
$ ls -la /dev/nvme0n1 # find the device number
brw-rw---- 1 root disk 259, 0 May 17 09:00 /dev/nvme0n1
$ echo "259:0 wbps=10485760" | sudo tee /sys/fs/cgroup/demo/io.max
That throttles writes to the named device to 10 MB/s. Run `dd if=/dev/zero of=/tmp/test bs=1M count=200 oflag=direct`; watch the throughput cap. (Without `oflag=direct`, the page cache absorbs the writes and hides the throttle; and `/tmp` must actually live on the throttled device - on distros where `/tmp` is tmpfs, write to a path on the NVMe filesystem instead.)
Nested cgroups¶
Make /demo/web and /demo/worker:
$ sudo mkdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker
$ echo "+memory +cpu" | sudo tee /sys/fs/cgroup/demo/cgroup.subtree_control
$ echo "50M" | sudo tee /sys/fs/cgroup/demo/web/memory.max
$ echo "30M" | sudo tee /sys/fs/cgroup/demo/worker/memory.max
The limits compose: web is capped at 50 MB, worker at 30 MB, both inside demo's 100 MB ceiling. If web would exceed 50 MB, its processes are OOM-killed; if the two together would push demo past 100 MB, the OOM killer fires at the demo level and picks the biggest consumer inside it.
This is exactly the model container runtimes use. A pod is a cgroup, each container inside the pod is a sub-cgroup, the pod's memory.max is the pod's resource limit, the container's is each container's limit.
Tear it down¶
To delete a cgroup, first move all its processes out:
$ sudo bash -c 'echo $$ > /sys/fs/cgroup/cgroup.procs'
$ sudo rmdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker /sys/fs/cgroup/demo
rmdir only succeeds if the cgroup is empty (no procs, no descendants).
The trap¶
Setting `memory.max` below the cgroup's current usage is not a benign "from now on" change: on v2 the kernel immediately tries to reclaim, and if it cannot get under the new limit it OOM-kills processes in the cgroup (on v1, the equivalent write would typically just fail). Always reserve headroom; never tighten limits live unless you understand what's currently in there.
The other trap: cpu.weight is relative and useless on an idle system. If your cgroup is the only one running, it gets 100% of the CPU regardless of its weight. Only meaningful under contention.
Exercise¶
- Recreate the demo above. Confirm the OOM kill triggers at exactly the limit.
- Add `memory.high` set to 80M (below `memory.max` = 100M). Re-run the Python allocator. Observe: under `memory.high`, the kernel throttles allocation but doesn't kill. What does `memory.events` show?
- Look at how Docker uses this. `docker run --memory=100m -d busybox`. Then `cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.max`. The match should be exact.
- (Advanced) Read `cpu.pressure` - the PSI (pressure-stall information) interface. It reports how much your cgroup is waiting on CPU. More useful for capacity planning than instantaneous CPU%.
Related reading¶
- The main Week 10 chapter covers the controller architecture and v1 vs v2 differences.
- Container Internals → capabilities walkthrough is the sibling - container runtimes layer cgroups on top of namespaces on top of capabilities.
- Glossary: Cgroup, Namespace, OOM killer, PSI in the main glossary.