Worked investigation - Write a one-line bpftrace tool
Companion to Linux Kernel -> Month 03 -> Week 12: eBPF and Observability. The chapter explains that eBPF lets you run safe programs inside the kernel to observe anything. This page has you write working tracers in one line and watch live kernel events on your own machine - the skill that lets you answer "what is the kernel doing right now?" without recompiling anything or installing an agent. You need about 40 minutes, Linux 5.x+ with bpftrace (apt install bpftrace / dnf install bpftrace), and root.
The symptom you're learning to handle
Something is happening on a production box and no existing tool shows it. "Which process is opening that file 10,000 times a second?" "What's the latency distribution of disk reads right now?" "Who is killing my process?" The metrics dashboard doesn't have it, adding logging means a redeploy, and strace is too slow for production (the syscall investigation). You need to ask the running kernel a custom question, safely, in one line. That's bpftrace.
Step 0: the one fact that makes this safe
eBPF lets you load a small program into the running kernel that fires on an event (a syscall, a function call, a timer) and collects data - without recompiling the kernel, rebooting, or loading a kernel module. The magic is the verifier: before the kernel runs your program, it checks that the program can't crash the kernel, loop forever, or read invalid memory, and rejects it otherwise. So you get kernel-level visibility with production-grade safety. bpftrace is the awk-like front-end: you write a one-liner, it compiles to eBPF, loads it, prints results. This is why eBPF tools (bpftrace, the BCC suite, Cilium, modern observability agents) took over Linux observability.
The shape of every bpftrace program: probe { action } - "when this event fires, do this."
Step 1: your first tracer - who opens files?
Trace every file open, system-wide, showing the process and filename:
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s -> %s\n", comm, str(args.filename)); }'
Attaching 1 probe...
bash -> /etc/ld.so.cache
firefox -> /home/you/.mozilla/firefox/prefs.js
node -> /proc/self/stat
...
Decode the one-liner:
- tracepoint:syscalls:sys_enter_openat - the probe: fire on entry to the openat syscall. Tracepoints are stable, kernel-provided hook points (the safest probe type).
- { ... } - the action, run in-kernel each time the probe fires.
- comm - a built-in: the name of the process that triggered it.
- str(args.filename) - read the syscall's filename argument and convert it to a string. Older bpftrace releases spell this args->filename; bpftrace -lv 'tracepoint:syscalls:sys_enter_openat' lists the argument fields a tracepoint provides.
You're now watching every file open on the machine, live, with the responsible process. That's already more than most logging gives you - and you wrote it in one line. Ctrl-C to stop.
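The firehose can be narrowed with a filter: an expression between slashes, placed between the probe and the action, so the action runs only when the expression is true. A minimal sketch, assuming you only care about processes named bash (the process name and the script filename openat_bash.bt are illustrative choices, not from the chapter). It is saved as a script file because nested quotes get awkward in a one-liner:

```shell
# Write the program to a script file; the quoted 'EOF' keeps the shell
# from touching the backslashes and quotes inside it.
cat > openat_bash.bt <<'EOF'
// probe /filter/ { action }: print opens only for processes named "bash".
tracepoint:syscalls:sys_enter_openat
/comm == "bash"/
{
    printf("%d %s -> %s\n", pid, comm, str(args.filename));
}
EOF
```

Run it with sudo bpftrace openat_bash.bt and stop it with Ctrl-C. Swapping the filter for something like /pid == 1234/ (a hypothetical PID) follows one process instead of one name.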
Step 2: count instead of flood - aggregation
A raw firehose is rarely what you want. bpftrace's killer feature is in-kernel maps that aggregate. "Which processes open the most files in 10 seconds?":
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'
^C (Ctrl-C after ~10s)
@[sshd]: 3
@[bash]: 14
@[systemd]: 88
@[node]: 2847
@[chrome]: 5012
@[comm] = count() builds a map of counters keyed by process name, in the kernel (no per-event userspace overhead - this is why it's production-safe where strace isn't). On Ctrl-C, bpftrace prints the map sorted by count. Instantly you see chrome and node dominate file opens. This is how you find "who is hammering the filesystem?" in 10 seconds, on a live box, with no redeploy. Replace count() with sum(args.count) on a read probe and you total the bytes each process requested (sum the return value on the matching exit probe to get bytes actually read); the pattern generalizes.
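As a sketch of the byte-counting variant: the script below sums the return value of read() at syscall exit, which is the number of bytes actually read, whereas args.count on the entry probe is only the size the caller asked for. It also stops by itself after 10 seconds using an interval probe, so you don't have to time the Ctrl-C. The filename readbytes.bt and the 10-second window are arbitrary choices; older bpftrace releases spell the argument args->ret.

```shell
cat > readbytes.bt <<'EOF'
// Sum the bytes returned by read(), per process name.
// The filter skips errors (negative return) and end-of-file (zero).
tracepoint:syscalls:sys_exit_read
/args.ret > 0/
{
    @bytes[comm] = sum(args.ret);
}

// Stop after 10 seconds; bpftrace prints @bytes when it exits.
interval:s:10
{
    exit();
}
EOF
```

Run it with sudo bpftrace readbytes.bt. The same shape works for any "total X per process" question: change the probe and the value passed to sum().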
Step 3: measure latency distributions - histograms
The most powerful built-in: hist() builds a power-of-2 latency histogram. "What's the latency distribution of block I/O right now?":
$ sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args.dev, args.sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args.dev, args.sector]/ {
    @usecs = hist((nsecs - @start[args.dev, args.sector]) / 1000);
    delete(@start[args.dev, args.sector]);
}'
^C
@usecs:
[16, 32)        12 |@@@                                       |
[32, 64)       103 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
[64, 128)      148 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[128, 256)      41 |@@@@@@@@@@@                               |
[256, 512)       7 |@                                         |
[2048, 4096)     2 |                                          | <- the slow tail!
This is a real, live latency distribution of disk I/O - the kind of thing you'd otherwise need a dedicated APM product for. Read it: most requests complete in 64-128 µs (typical of NVMe), but there's a tail at 2-4 ms - a handful of slow requests. That tail is what causes your p99 latency spikes. Averages would hide it; the histogram shows it. The two-probe pattern (record the start time on issue, compute the delta on complete, keyed by an ID that both events share) is the universal recipe for measuring the latency of any paired kernel events: syscalls, function calls, network ops.
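The same two-probe recipe, sketched for a syscall instead of block I/O: record a timestamp when read() is entered and compute the delta when it returns. The key is the thread ID (tid), since a thread can only be inside one syscall at a time. The filename readlat.bt and the 10-second run time are illustrative assumptions.

```shell
cat > readlat.bt <<'EOF'
// Entry: remember when this thread started its read().
tracepoint:syscalls:sys_enter_read
{
    @start[tid] = nsecs;
}

// Exit: only handle threads whose entry we saw, then record the
// elapsed time in microseconds and free the map slot.
tracepoint:syscalls:sys_exit_read
/@start[tid]/
{
    @usecs = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}

// Stop after 10 seconds.
interval:s:10
{
    exit();
}

// Drop leftover start timestamps so only the histogram is printed.
END
{
    clear(@start);
}
EOF
```

Run it with sudo bpftrace readlat.bt. The filter on the exit probe matters: without it, a read() that was already in flight when tracing began would produce a bogus delta computed from a missing (zero) start time.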
Step 4: trace a kernel function directly - kprobes
Tracepoints are stable but limited to predefined spots. kprobes let you attach to almost any kernel function by name. Catch every OOM kill as it happens (the OOM investigation's live-watcher, generalized); here comm is the process whose allocation triggered the OOM killer, not necessarily the victim:
$ sudo bpftrace -e 'kprobe:oom_kill_process { printf("%s OOM-killed something at %s\n", comm, strftime("%H:%M:%S", nsecs)); }'
Or watch every TCP connection the machine initiates:
$ sudo bpftrace -e 'kprobe:tcp_connect { printf("%s connecting\n", comm); }'
chrome connecting
node connecting
curl connecting
kprobe:<function> fires when that kernel function is called. kretprobe:<function> fires on return (with retval). Between them you can trace and time any kernel internal. List what's available: bpftrace -l 'kprobe:tcp_*' shows every TCP-related kernel function you can hook. This is the "observe literally anything in the kernel" power the chapter promised - now in your hands.
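A sketch of the kprobe/kretprobe pair working together, assuming your kernel exposes vfs_read as a probeable function (check with bpftrace -l 'kprobe:vfs_read'; like any kprobe target it is a kernel internal and may change between versions). It times each call and, on return, builds a histogram of retval, which for vfs_read is the byte count or a negative error code. The filename vfsread.bt and the 10-second window are arbitrary.

```shell
cat > vfsread.bt <<'EOF'
// Entry: remember when this thread entered vfs_read().
kprobe:vfs_read
{
    @start[tid] = nsecs;
}

// Return: record how long the call took and what it returned.
// The cast makes error returns show up as negative values.
kretprobe:vfs_read
/@start[tid]/
{
    @usecs = hist((nsecs - @start[tid]) / 1000);
    @ret_bytes = hist((int64)retval);
    delete(@start[tid]);
}

// Stop after 10 seconds.
interval:s:10
{
    exit();
}

// Drop leftover start timestamps so only the histograms are printed.
END
{
    clear(@start);
}
EOF
```

Run it with sudo bpftrace vfsread.bt. Swapping vfs_read for another function name is all it takes to time a different kernel internal, which is also why such scripts should be re-tested after a kernel upgrade.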
Step 5: a real diagnostic - find the source of disk writes
Put it together to answer a real production question: "what's writing to my disk and how much?" - a frequent "why is my SSD wearing out / disk full" investigation:
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_write { @bytes[comm] = sum(args.count); }'
^C
@bytes[sshd]: 4096
@bytes[bash]: 8192
@bytes[journald]: 1048576
@bytes[postgres]: 524288000 <- 500 MB! there's your disk churn
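In ten seconds, on a live system, you found that postgres issued about 500 MB of writes while everything else was negligible. No agent installed, no redeploy, no guesswork. One caveat: sys_enter_write counts the bytes requested through write() on any file descriptor (files, but also sockets, pipes, and terminals), and it misses other write paths such as pwrite64 and writev, so treat the totals as a strong lead rather than an exact measure of disk traffic. That is the bpftrace value proposition: custom, ad-hoc, production-safe answers to questions no dashboard anticipated.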
The stable-vs-fragile tradeoff
TRACEPOINTS (prefer):                    KPROBES (when you must):
tracepoint:syscalls:sys_enter_openat     kprobe:do_sys_openat2
- stable kernel API, won't break         - any function, max flexibility
- limited to predefined points           - names/args change between kernel versions
- use for syscalls, scheduler, block,    - your one-liner may break on a kernel
  net - the common cases                   upgrade; pin/test accordingly
Reach for tracepoints first (stable); drop to kprobes only when no tracepoint covers what you need. This is the same stable-API-vs-internals tradeoff as everywhere in systems.
Now you do it
- Run the file-open tracer (Step 1). Watch your own actions (open a file in an editor) appear. Then run the @[comm] = count() aggregation (Step 2) for 10 seconds - who's the top file-opener on your box?
- Run the block-I/O latency histogram (Step 3) while copying a large file. Find the tail. What's your p99-ish bucket?
- bpftrace -l 'tracepoint:syscalls:*' | head -40 - browse the available syscall tracepoints. Pick one (sys_enter_connect, sys_enter_execve) and write a one-liner for it.
- Write a tracer that counts execve (new process launches) by parent: tracepoint:syscalls:sys_enter_execve { @[comm] = count(); }. At execve entry, comm is still the name of the calling process, which a freshly forked child inherits from its parent. Run a build or open an app and watch the process explosion.
What you might wonder
"Is this really safe on production?" Yes, within limits - that's the entire point and why eBPF won. The verifier rejects programs that could crash the kernel or loop forever; aggregation happens in-kernel so overhead is usually small; and you attach/detach with no restart. This is the production-safe tracing the strace investigation pointed you toward. (Caveat: overhead scales with how often a probe fires, so extremely high-frequency probes - e.g. tracing every memory allocation - can still add measurable overhead; aggregate, don't printf per-event, on hot paths.)
"bpftrace vs BCC vs the bigger eBPF picture?" bpftrace is for one-liners and short scripts (what you just did). BCC (the BPF Compiler Collection) is a Python/C toolkit for full-blown tools (biolatency, execsnoop, tcplife - many have bpftrace counterparts built on the same patterns). And eBPF underpins far more than tracing now: networking (Cilium - the K8s investigation), security (Falco; note that seccomp filters, despite the "seccomp-bpf" name, use the older classic BPF rather than eBPF), load balancing. Same kernel technology; tracing is the most accessible door into it.
"Do I need to know C to use eBPF?" Not for bpftrace - its language is awk-like and you've now written several programs in it. Full BCC tools involve C for the kernel side, but you can get enormous value from bpftrace one-liners alone. Start here; graduate to BCC when you need a polished, reusable tool.
"What kernel version do I need?" bpftrace needs ~4.9+ for basics, 5.x for the full feature set (and this is the same eBPF infrastructure modern observability agents - Pixie, Parca, Grafana Beyla - are built on). On older kernels, fall back to perf and ftrace (the syscall investigation). Most production Linux today is new enough.
What this gave you
- You wrote working kernel tracers in one line each, loaded safely via the eBPF verifier.
- You can flood (per-event printf) or aggregate (in-kernel count/sum/hist) - and you know to aggregate on hot paths.
- You built a live latency histogram and read its slow tail - the source of p99 spikes.
- You can hook stable tracepoints and (when needed) arbitrary kernel functions via kprobes, and you know the tradeoff.
- You answered a real "who's writing to my disk?" production question in 10 seconds with no agent and no redeploy.
- You see where eBPF goes beyond tracing (networking, security) and how bpftrace relates to BCC and the broader ecosystem.
Back to the Namespaces & eBPF month, or on to following a packet through netfilter.