Saltar a contenido

Worked investigation - Trace a syscall end to end

Companion to Linux Kernel -> Month 01 -> Week 1-4: Kernel Foundations (syscalls, VFS). The chapter explains that user programs cross into the kernel via syscalls. This page makes you watch that boundary: see every syscall a real program makes, time them, catch the one that's slow, and then follow a syscall into the kernel with ftrace. ~40 minutes, any Linux box.

The symptom you're learning to diagnose

A program is slow or stuck and you have no idea why. It's not CPU-bound (top shows it idle). It's not obviously crashed. It's just... waiting. Or it works on one machine and fails on another with no useful error. The question "what is this process actually doing?" has a precise answer, and strace gives it to you in seconds.

Every interesting thing a program does - read a file, open a socket, allocate memory, talk to another process - is a syscall, a controlled jump from user space into the kernel. Watch the syscalls and you watch the program's real behavior, stripped of its source code.

Step 0: the one fact

Your program runs in user space, which cannot touch hardware, files, or the network directly. To do anything real, it executes a special instruction (syscall on x86-64) that traps into the kernel, which does the privileged work and returns. That boundary crossing is the syscall. There are ~400 of them (open, read, write, mmap, futex, ...). strace sits on that boundary and logs every crossing. Nothing a process does to the outside world escapes it.

Step 1: watch a program's syscalls

Trace something simple - cat reading a file:

$ echo "hello" > /tmp/demo.txt
$ strace cat /tmp/demo.txt

You'll see a flood. Ignore the startup noise (loading libraries) and find the heart of it:

openat(AT_FDCWD, "/tmp/demo.txt", O_RDONLY)   = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
read(3, "hello\n", 131072)                    = 6
write(1, "hello\n", 6)                        = 6
read(3, "", 131072)                           = 0
close(3)                                      = 0

This is cat, stripped to its essence - and you can read the whole program's behavior without its source:

  • openat(..., "/tmp/demo.txt", O_RDONLY) = 3 - open the file read-only. Returns 3, a file descriptor (0/1/2 are stdin/stdout/stderr, so the first opened file is 3).
  • fstat(3, {st_size=6}) = 0 - ask the kernel about fd 3; it's a 6-byte regular file. cat uses this to size its buffer.
  • read(3, "hello\n", 131072) = 6 - read up to 131072 bytes from fd 3; got 6 ("hello\n"). The return value is the actual bytes read.
  • write(1, "hello\n", 6) = 6 - write those 6 bytes to fd 1 (stdout). This is what you see on screen.
  • read(3, "", 131072) = 0 - read again; got 0 bytes = end of file.
  • close(3) = 0 - release the descriptor.

You just read a program's complete I/O behavior from the outside. No source needed. This is the superpower.

Step 2: catch the slow syscall - time them

The -T flag shows how long each syscall took; -tt adds timestamps. This is how you find where a program waits:

$ strace -T -tt curl -s https://example.com -o /dev/null 2>&1 | grep -E 'connect|recvfrom|poll' | head
14:22:01.118 connect(5, {sa_family=AF_INET, ...}) = -1 EINPROGRESS <0.000119>
14:22:01.118 poll([{fd=5, events=POLLOUT}], 1, 200) = 1 <0.044271>
14:22:01.163 recvfrom(5, "HTTP/1.1 200..."           ) = 1448 <0.000038>

The <...> is seconds spent in that call. The poll took 44 ms - that's the network round trip, the program waiting for the server. The connect/recvfrom are microseconds. When a program is "slow," strace -T shows you the single syscall it's blocked in - almost always a read, poll, recvfrom, or futex (lock wait). You stop guessing and point at the exact wait.

Step 3: the summary view - where does time actually go

For a busy program, the per-call log is too much. -c gives a profile - a syscall histogram:

$ strace -c -f find /usr/share/doc -name '*.txt' >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 41.2    0.018429           4      4096           newfstatat
 28.7    0.012834           6      2048           getdents64
 19.1    0.008544          16       512        12 openat
  ...

Now you see find is dominated by newfstatat (stat each file) and getdents64 (read directory entries) - exactly what a filesystem walk should do. If instead you saw thousands of unexpected openat calls on a config file, you'd have found a program re-reading config in a loop. The -c profile is "where is this program spending its kernel time," the syscall-level cousin of a CPU profiler.

Step 4: attach to something already running

You don't have to launch the program. Attach to a stuck process by PID and see what it's blocked on right now:

$ strace -p <pid>
# ... if it's hung, you'll see it sitting in a single call, e.g.:
futex(0x55..., FUTEX_WAIT_PRIVATE, 2, NULL
# (no return value printed = it's blocked HERE, waiting on a lock)

A syscall line with no return value is the process blocked in that call. futex(... FUTEX_WAIT ...) with no return = waiting on a lock (a deadlock or contention - tie this to the concurrency chapters). read(...) with no return = waiting for input. This single trick - strace -p on a hung process - resolves a huge fraction of "why is it stuck" questions instantly. Press Ctrl-C to detach (the process keeps running).

Step 5: go deeper - follow a syscall INTO the kernel with ftrace

strace shows the boundary. To see what the kernel does on the other side, use ftrace, the kernel's built-in tracer. Watch the kernel functions called when anyone opens a file:

$ cd /sys/kernel/debug/tracing          # ftrace's control interface (needs root)
$ echo function > current_tracer
$ echo 'do_sys_openat2' > set_ftrace_filter    # trace the open implementation
$ echo 1 > tracing_on
$ cat /tmp/demo.txt >/dev/null          # trigger an open
$ echo 0 > tracing_on
$ cat trace | head
   cat-19847 [003] ...  do_sys_openat2

You just saw the kernel-side function (do_sys_openat2) that services the openat syscall strace showed in Step 1. With function_graph instead of function, you can see the entire call tree inside the kernel - VFS layer, the filesystem driver, the block layer - with timings. That's the full path from your read() to the disk and back, which is the conceptual content of the Kernel Foundations month made literally visible.

$ echo function_graph > current_tracer
$ echo do_sys_openat2 > set_graph_function
$ echo 1 > tracing_on ; cat /tmp/demo.txt >/dev/null ; echo 0 > tracing_on
$ cat trace | head -30      # a timed call tree INSIDE the kernel

(Remember to echo nop > current_tracer and clear set_ftrace_filter when done so tracing doesn't keep running.)

Healthy vs stuck, side by side

HEALTHY (program doing work):            STUCK (program hung):
read(3,...) = 4096                       futex(0x55.., FUTEX_WAIT, 2, NULL
write(1,...) = 4096                      (no return - blocked here)
read(3,...) = 4096                       ^ deadlock or lock contention
(calls returning, time advancing)        ^ strace -p shows ONE frozen call

A running program's strace scrolls (calls returning). A hung one stops on a single call with no return. That visual alone tells you "working hard" from "deadlocked."

Now you do it

  1. strace -T ls -l /etc | head - find the slowest syscall. (Probably the getdents64/statx on the directory.)
  2. strace -c a program you wrote or use daily. Read the profile. Are the top syscalls the ones you'd expect? Any surprise (a file opened 1000 times)?
  3. Start a sleep 300 &, then strace -p <pid>. You'll see it blocked in clock_nanosleep with no return - the process literally waiting. Detach with Ctrl-C; confirm sleep keeps running.
  4. (Root) ftrace do_sys_openat2 with function_graph and read one call tree. Find a recognizable layer name (VFS, ext4/xfs). You're looking at the inside of the syscall boundary.

What you might wonder

"Is strace safe to run in production?" With caution. strace uses ptrace, which stops the target on every syscall - it can slow a busy process 10-100x. Fine for a quick look at a stuck or low-traffic process; risky on a hot path. For production-safe syscall tracing, use perf trace or a bpftrace tracepoint:syscalls:* script (the bpftrace investigation) - those sample via eBPF with far lower overhead. strace is the dev/debug tool; eBPF is the production tool.

"Why are there so many syscalls before my program even starts?" The dynamic linker loading shared libraries - lots of openat/mmap/read on .so files. strace from the very first line shows the whole machinery. Filter with -e trace=openat (or -e trace=network, -e trace=file) to see only the family you care about.

"What's the difference between strace and ltrace?" strace traces syscalls (the kernel boundary). ltrace traces library calls (e.g. malloc, printf - user-space function calls into shared libraries). Different boundaries; strace is the one you'll reach for 90% of the time because the syscall boundary is where waiting and external effects happen.

"ftrace vs eBPF/bpftrace?" ftrace is the older, always-present, file-interface tracer - great for "what kernel functions ran." eBPF/bpftrace is the modern, programmable successor that can do custom aggregation and run safely in production. Learn ftrace for the mental model (it's simple and everywhere); reach for bpftrace for real tooling (next investigation).

What this gave you

  • You can read a program's complete external behavior from its syscalls, no source required.
  • You can find where a program waits with strace -T (the blocked syscall).
  • You can profile kernel-time with strace -c and spot pathological patterns.
  • You can attach to a hung process and see the single call it's frozen in - the fastest "why is it stuck" answer there is.
  • You followed a syscall across the boundary into the kernel's own functions with ftrace.
  • You know the production-safe alternatives (perf trace, bpftrace) and when strace's overhead matters.

Back to the Kernel Foundations month, or on to the page-cache investigation.

Comments