Worked investigation - Trace a syscall end to end¶
Companion to Linux Kernel -> Month 01 -> Week 1-4: Kernel Foundations (syscalls, VFS). The chapter explains that user programs cross into the kernel via syscalls. This page makes you watch that boundary: see every syscall a real program makes, time them, catch the one that's slow, and then follow a syscall into the kernel with ftrace. ~40 minutes, any Linux box.
The symptom you're learning to diagnose¶
A program is slow or stuck and you have no idea why. It's not CPU-bound (
topshows it idle). It's not obviously crashed. It's just... waiting. Or it works on one machine and fails on another with no useful error. The question "what is this process actually doing?" has a precise answer, andstracegives it to you in seconds.
Every interesting thing a program does - read a file, open a socket, allocate memory, talk to another process - is a syscall, a controlled jump from user space into the kernel. Watch the syscalls and you watch the program's real behavior, stripped of its source code.
Step 0: the one fact¶
Your program runs in user space, which cannot touch hardware, files, or the network directly. To do anything real, it executes a special instruction (syscall on x86-64) that traps into the kernel, which does the privileged work and returns. That boundary crossing is the syscall. There are ~400 of them (open, read, write, mmap, futex, ...). strace sits on that boundary and logs every crossing. Nothing a process does to the outside world escapes it.
Step 1: watch a program's syscalls¶
Trace something simple - cat reading a file:
You'll see a flood. Ignore the startup noise (loading libraries) and find the heart of it:
openat(AT_FDCWD, "/tmp/demo.txt", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
read(3, "hello\n", 131072) = 6
write(1, "hello\n", 6) = 6
read(3, "", 131072) = 0
close(3) = 0
This is cat, stripped to its essence - and you can read the whole program's behavior without its source:
openat(..., "/tmp/demo.txt", O_RDONLY) = 3- open the file read-only. Returns 3, a file descriptor (0/1/2 are stdin/stdout/stderr, so the first opened file is 3).fstat(3, {st_size=6}) = 0- ask the kernel about fd 3; it's a 6-byte regular file.catuses this to size its buffer.read(3, "hello\n", 131072) = 6- read up to 131072 bytes from fd 3; got 6 ("hello\n"). The return value is the actual bytes read.write(1, "hello\n", 6) = 6- write those 6 bytes to fd 1 (stdout). This is what you see on screen.read(3, "", 131072) = 0- read again; got 0 bytes = end of file.close(3) = 0- release the descriptor.
You just read a program's complete I/O behavior from the outside. No source needed. This is the superpower.
Step 2: catch the slow syscall - time them¶
The -T flag shows how long each syscall took; -tt adds timestamps. This is how you find where a program waits:
$ strace -T -tt curl -s https://example.com -o /dev/null 2>&1 | grep -E 'connect|recvfrom|poll' | head
14:22:01.118 connect(5, {sa_family=AF_INET, ...}) = -1 EINPROGRESS <0.000119>
14:22:01.118 poll([{fd=5, events=POLLOUT}], 1, 200) = 1 <0.044271>
14:22:01.163 recvfrom(5, "HTTP/1.1 200..." ) = 1448 <0.000038>
The <...> is seconds spent in that call. The poll took 44 ms - that's the network round trip, the program waiting for the server. The connect/recvfrom are microseconds. When a program is "slow," strace -T shows you the single syscall it's blocked in - almost always a read, poll, recvfrom, or futex (lock wait). You stop guessing and point at the exact wait.
Step 3: the summary view - where does time actually go¶
For a busy program, the per-call log is too much. -c gives a profile - a syscall histogram:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
41.2 0.018429 4 4096 newfstatat
28.7 0.012834 6 2048 getdents64
19.1 0.008544 16 512 12 openat
...
Now you see find is dominated by newfstatat (stat each file) and getdents64 (read directory entries) - exactly what a filesystem walk should do. If instead you saw thousands of unexpected openat calls on a config file, you'd have found a program re-reading config in a loop. The -c profile is "where is this program spending its kernel time," the syscall-level cousin of a CPU profiler.
Step 4: attach to something already running¶
You don't have to launch the program. Attach to a stuck process by PID and see what it's blocked on right now:
$ strace -p <pid>
# ... if it's hung, you'll see it sitting in a single call, e.g.:
futex(0x55..., FUTEX_WAIT_PRIVATE, 2, NULL
# (no return value printed = it's blocked HERE, waiting on a lock)
A syscall line with no return value is the process blocked in that call. futex(... FUTEX_WAIT ...) with no return = waiting on a lock (a deadlock or contention - tie this to the concurrency chapters). read(...) with no return = waiting for input. This single trick - strace -p on a hung process - resolves a huge fraction of "why is it stuck" questions instantly. Press Ctrl-C to detach (the process keeps running).
Step 5: go deeper - follow a syscall INTO the kernel with ftrace¶
strace shows the boundary. To see what the kernel does on the other side, use ftrace, the kernel's built-in tracer. Watch the kernel functions called when anyone opens a file:
$ cd /sys/kernel/debug/tracing # ftrace's control interface (needs root)
$ echo function > current_tracer
$ echo 'do_sys_openat2' > set_ftrace_filter # trace the open implementation
$ echo 1 > tracing_on
$ cat /tmp/demo.txt >/dev/null # trigger an open
$ echo 0 > tracing_on
$ cat trace | head
You just saw the kernel-side function (do_sys_openat2) that services the openat syscall strace showed in Step 1. With function_graph instead of function, you can see the entire call tree inside the kernel - VFS layer, the filesystem driver, the block layer - with timings. That's the full path from your read() to the disk and back, which is the conceptual content of the Kernel Foundations month made literally visible.
$ echo function_graph > current_tracer
$ echo do_sys_openat2 > set_graph_function
$ echo 1 > tracing_on ; cat /tmp/demo.txt >/dev/null ; echo 0 > tracing_on
$ cat trace | head -30 # a timed call tree INSIDE the kernel
(Remember to echo nop > current_tracer and clear set_ftrace_filter when done so tracing doesn't keep running.)
Healthy vs stuck, side by side¶
HEALTHY (program doing work): STUCK (program hung):
read(3,...) = 4096 futex(0x55.., FUTEX_WAIT, 2, NULL
write(1,...) = 4096 (no return - blocked here)
read(3,...) = 4096 ^ deadlock or lock contention
(calls returning, time advancing) ^ strace -p shows ONE frozen call
A running program's strace scrolls (calls returning). A hung one stops on a single call with no return. That visual alone tells you "working hard" from "deadlocked."
Now you do it¶
strace -T ls -l /etc | head- find the slowest syscall. (Probably thegetdents64/statxon the directory.)strace -ca program you wrote or use daily. Read the profile. Are the top syscalls the ones you'd expect? Any surprise (a file opened 1000 times)?- Start a
sleep 300 &, thenstrace -p <pid>. You'll see it blocked inclock_nanosleepwith no return - the process literally waiting. Detach with Ctrl-C; confirmsleepkeeps running. - (Root) ftrace
do_sys_openat2withfunction_graphand read one call tree. Find a recognizable layer name (VFS, ext4/xfs). You're looking at the inside of the syscall boundary.
What you might wonder¶
"Is strace safe to run in production?" With caution. strace uses ptrace, which stops the target on every syscall - it can slow a busy process 10-100x. Fine for a quick look at a stuck or low-traffic process; risky on a hot path. For production-safe syscall tracing, use perf trace or a bpftrace tracepoint:syscalls:* script (the bpftrace investigation) - those sample via eBPF with far lower overhead. strace is the dev/debug tool; eBPF is the production tool.
"Why are there so many syscalls before my program even starts?" The dynamic linker loading shared libraries - lots of openat/mmap/read on .so files. strace from the very first line shows the whole machinery. Filter with -e trace=openat (or -e trace=network, -e trace=file) to see only the family you care about.
"What's the difference between strace and ltrace?" strace traces syscalls (the kernel boundary). ltrace traces library calls (e.g. malloc, printf - user-space function calls into shared libraries). Different boundaries; strace is the one you'll reach for 90% of the time because the syscall boundary is where waiting and external effects happen.
"ftrace vs eBPF/bpftrace?" ftrace is the older, always-present, file-interface tracer - great for "what kernel functions ran." eBPF/bpftrace is the modern, programmable successor that can do custom aggregation and run safely in production. Learn ftrace for the mental model (it's simple and everywhere); reach for bpftrace for real tooling (next investigation).
What this gave you¶
- You can read a program's complete external behavior from its syscalls, no source required.
- You can find where a program waits with
strace -T(the blocked syscall). - You can profile kernel-time with
strace -cand spot pathological patterns. - You can attach to a hung process and see the single call it's frozen in - the fastest "why is it stuck" answer there is.
- You followed a syscall across the boundary into the kernel's own functions with ftrace.
- You know the production-safe alternatives (
perf trace, bpftrace) and when strace's overhead matters.
Back to the Kernel Foundations month, or on to the page-cache investigation.