08 - Processes

What this session is

About 45 minutes. You'll learn what a process is, how to see what's running, how to kill misbehaving programs, and how to manage background jobs.

What's a process

Every running program is a process with:

  • A PID (process ID) - a unique number.
  • A user - the user who started it.
  • Resources - open files, memory, network sockets.
  • A state - running, sleeping, waiting, zombie.

When you run a command, the shell creates a child process to execute it. When the command finishes, the process exits.
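A minimal sketch of that parent/child relationship, using only shell built-ins. The sh -c 'exit 3' child is a made-up stand-in for any command, chosen so its exit status is easy to recognize.

```shell
# $$ expands to the PID of the shell running this script.
shell_pid=$$

# Start a child process in the background; $! holds the PID of that child.
sh -c 'exit 3' &
child_pid=$!

# wait blocks until the child exits and hands back the child's exit status.
child_status=0
wait "$child_pid" || child_status=$?

echo "shell PID: $shell_pid"
echo "child PID: $child_pid (exited with status $child_status)"
```

The PIDs differ on every run; the exit status is whatever the child chose to exit with.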

See what's running: ps

ps                       # processes in your current terminal
ps aux                   # ALL processes, with detailed info
ps aux | grep python     # only Python processes

ps aux is the most-used form. Columns:

Column         Meaning
USER           who started it
PID            process ID
%CPU           CPU usage
%MEM           memory usage
VSZ / RSS      virtual / resident memory (KB)
TTY            terminal it's attached to (? if none)
STAT           state (R running, S sleeping, Z zombie, T stopped)
START / TIME   when started / cumulative CPU time
COMMAND        what's running
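When the full table is more than you need, ps can print only the columns you ask for. A small sketch, assuming a ps that accepts -o and -p (procps on Linux does); column widths and exact state flags vary by system.

```shell
# Show only the PID, parent PID, state, and command name of the current shell.
ps -o pid,ppid,stat,comm -p $$

# A trailing '=' after a column name drops its header, which is handy in scripts.
my_state=$(ps -o stat= -p $$)
echo "this shell's STAT is: $my_state"
```

The STAT value often carries extra flag characters after the main letter (for example a session-leader or foreground marker), so scripts usually look only at the first character.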

Live view: top and htop

top is the classic interactive process viewer:

top

Press q to quit, M to sort by memory, P to sort by CPU, and 1 to show a per-CPU breakdown. (M and P are capital letters.)

htop is the modern, prettier alternative. Install:

  • sudo apt install htop
  • brew install htop

htop

F10 or q to quit. Use arrows to scroll; F9 to kill (with menu). Much easier than top.

btop (newer) is similar - colorful, more visual. Try it: sudo apt install btop / brew install btop.

Killing a process

kill PID                # send the default signal (TERM - polite request to exit)
kill -9 PID             # send SIGKILL - force kill (process can't ignore)
killall name            # kill all processes named "name"
pkill name              # similar

The two signals to know:

  • SIGTERM (15) - the default. "Please exit cleanly." The process can save state, close files, then exit.
  • SIGKILL (9) - "Die now." The kernel terminates the process immediately. No cleanup. Use only when SIGTERM doesn't work.

kill 1234          # SIGTERM to PID 1234
kill -TERM 1234    # same
kill -9 1234       # SIGKILL - use as last resort
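To make the "can save state, close files, then exit" part concrete, here is a sketch of a worker that traps SIGTERM, compared with the same worker hit by SIGKILL. The worker script, its ready/result files, and the "cleaned up" message are all invented for illustration.

```shell
workdir=$(mktemp -d)

# Hypothetical worker: on SIGTERM it records that cleanup ran, then exits 0.
cat > "$workdir/worker.sh" <<'EOF'
#!/bin/sh
trap 'echo "cleaned up" > "$1/result"; exit 0' TERM
touch "$1/ready"              # tell the parent the handler is installed
while :; do sleep 1; done     # idle; the trap runs between sleeps
EOF

# Round 1: SIGTERM. The handler runs and the worker exits on its own terms.
sh "$workdir/worker.sh" "$workdir" &
worker_pid=$!
while [ ! -e "$workdir/ready" ]; do sleep 1; done
kill "$worker_pid"                        # plain kill = SIGTERM
term_status=0
wait "$worker_pid" || term_status=$?
term_result=$(cat "$workdir/result")
echo "SIGTERM: exit status $term_status, result file says: $term_result"

# Round 2: SIGKILL. The handler never gets a chance to run.
rm -f "$workdir/ready" "$workdir/result"
sh "$workdir/worker.sh" "$workdir" &
worker_pid=$!
while [ ! -e "$workdir/ready" ]; do sleep 1; done
kill -9 "$worker_pid"
kill_status=0
wait "$worker_pid" || kill_status=$?      # 137 = 128 + 9 (killed by signal 9)
echo "SIGKILL: exit status $kill_status, result file exists: $([ -e "$workdir/result" ] && echo yes || echo no)"
```

The shell may also print its own "Killed" notice for the second worker; that is the shell reporting the signal, not the worker.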

Running things in the background

Add & to run a command in the background, freeing your terminal:

sleep 60 &
[1] 12345          # the shell tells you job # and PID

Your terminal returns. The process runs.

To see your background jobs in this shell:

jobs

To bring a background job to the foreground:

fg              # bring last
fg %1           # bring job #1

To send a foreground job to the background:

  • Press Ctrl-Z to suspend (pause it).
  • Type bg to continue it in the background.

To kill a job:

kill %1         # kill job #1
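fg and bg rely on job control, which shells normally enable only in interactive sessions. In a script, the usual companion to & is wait. A sketch that starts two background jobs and collects each exit status; the two sh -c commands are placeholders for real work.

```shell
# Start two placeholder jobs in the background and remember their PIDs.
sh -c 'sleep 1; exit 0' &
first_pid=$!
sh -c 'sleep 1; exit 7' &
second_pid=$!

# Both run at the same time; wait on each PID to collect its exit status.
first_status=0
wait "$first_pid" || first_status=$?
second_status=0
wait "$second_pid" || second_status=$?

echo "first job exited with $first_status, second with $second_status"
```

Because the jobs overlap, the whole thing takes about one second rather than two.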

Long-running processes that survive logout

Jobs started with & usually die when you log out: closing the terminal sends them SIGHUP (hangup), which kills a process by default. For things that should survive:

nohup ./long-job.sh &      # ignore hangup signal

Output goes to nohup.out by default.
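A common variation is to name the log file yourself and save the PID so you can find the job later. A sketch under assumptions: long-job.sh here is a two-line stand-in created on the spot, and job.log / job.pid are arbitrary file names.

```shell
workdir=$(mktemp -d)

# Hypothetical long-running job; a real one would do actual work.
cat > "$workdir/long-job.sh" <<'EOF'
#!/bin/sh
echo "job started"
sleep 1
echo "job finished"
EOF

# Send stdout and stderr to a named log instead of nohup.out,
# and record the PID so the job can be inspected or killed later.
nohup sh "$workdir/long-job.sh" > "$workdir/job.log" 2>&1 &
echo $! > "$workdir/job.pid"

# Only for this demo: wait for the job so the log can be printed.
wait "$(cat "$workdir/job.pid")"
cat "$workdir/job.log"
```

In real use you would skip the wait and log out; kill "$(cat job.pid)" stops the job later.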

The modern alternative: tmux or screen - terminal multiplexers. Start a session, run things in it, detach, log out, come back hours later, reattach. Out of scope here; install and learn one - tmux is the more popular.

For production services: use systemd units (/etc/systemd/system/myservice.service). Beyond beginner; mentioned for recognition.

Process tree: pstree

pstree
pstree -p          # include PIDs
pstree alice       # only alice's processes

Shows parent-child relationships visually. Useful for understanding what spawned what.
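The same parent links can be followed by hand. A rough sketch, assuming a Linux /proc filesystem, that walks from the current shell up through its ancestors by reading the PPid field; it is not how pstree itself is implemented.

```shell
# Walk from this shell up through its ancestors by following PPid in /proc.
pid=$$
chain=""
while [ "${pid:-0}" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
    name=$(sed -n 's/^Name:[[:space:]]*//p' "/proc/$pid/status")
    chain="$chain$name($pid) "
    # Move one level up: the parent PID of the process just printed.
    pid=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$pid/status")
done
echo "$chain"
```

The line reads child-first: the shell, then whatever started it, and so on until a process whose parent PID is 0.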

Why a process won't die

Sometimes kill PID doesn't work:

  1. The process is in uninterruptible sleep (waiting on disk or kernel - state D in ps). Wait it out; reboot if persistent.
  2. The process is a zombie (state Z in ps): it has already exited, but its parent hasn't collected its exit status. Signal or kill the parent.
  3. You don't own the process. sudo kill PID if you must.

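Last resort: sudo kill -9 PID. It removes anything that is merely ignoring SIGTERM, but it does nothing for a process in D state or for a zombie.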
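Those cases can be told apart before reaching for -9. A sketch of a helper that reads a PID's state and owner and prints a hint; the function name why_wont_it_die and its messages are invented here, and it assumes a Linux /proc filesystem.

```shell
# Print a hint about why a PID might not respond to kill.
why_wont_it_die() {
    status_file="/proc/$1/status"
    if [ ! -r "$status_file" ]; then
        echo "no such process (or not visible to you)"
        return 1
    fi
    # First letter of the State line, and the real UID of the owner.
    state=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "$status_file")
    owner_uid=$(sed -n 's/^Uid:[[:space:]]*\([0-9]*\).*/\1/p' "$status_file")
    my_uid=$(id -u)
    case "$state" in
        D) echo "uninterruptible sleep: look at the I/O it is waiting on" ;;
        Z) echo "zombie: already dead, signal its parent instead" ;;
        *) if [ "$my_uid" -ne 0 ] && [ "$owner_uid" != "$my_uid" ]; then
               echo "owned by uid $owner_uid: needs sudo"
           else
               echo "state $state: should respond to kill"
           fi ;;
    esac
}

# Try it on a harmless background sleep, then clean up.
sleep 30 &
demo_pid=$!
hint=$(why_wont_it_die "$demo_pid")
echo "PID $demo_pid -> $hint"
kill "$demo_pid"
wait "$demo_pid" || true
```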

Going deeper

The commands above handle 95% of process work. This section is the depth that turns "I can run ps" into "I can diagnose what's wrong with a process" - the failure modes you'll actually hit, with what you'll see.

Process states - and the one that means trouble

The STAT column in ps aux is a status code most people ignore. Each letter is a real diagnosis:

$ ps -eo stat,args | head
STAT COMMAND
S    /usr/bin/bash       # S = interruptible sleep (waiting for something, normal - most procs)
R    python train.py     # R = running or runnable (on a CPU, or queued for one)
D    cp bigfile /mnt     # D = UNINTERRUPTIBLE sleep (stuck in a kernel/IO operation)
Z    [python] <defunct>  # Z = zombie (finished, but parent hasn't collected it)
T    vim                 # T = stopped (you hit Ctrl-Z, or it got SIGSTOP)

The two that signal trouble:

  • D (uninterruptible sleep) - the process is stuck inside a kernel operation, almost always waiting on slow or hung I/O (a dead NFS mount, a failing disk). Here's the kicker: a D-state process cannot be killed, not even with kill -9 - it's not ignoring your signal, it's not running user code at all, it's blocked in the kernel. If you see a process stuck in D, the problem is the I/O it's waiting on (check dmesg for disk errors, check the mount). The process will come back when the I/O completes or errors out; if it never does, only a reboot clears it. Seeing D and knowing "that's stuck I/O, not a killable process" saves you from fruitlessly hammering kill -9.

  • Z (zombie) - covered next.

Zombies - what they are and why kill -9 won't clear them

A zombie is a process that has already exited but whose parent hasn't called wait() to read its exit status. It's not running - it's a corpse holding a slot in the process table (just a PID and exit code, no memory). You'll see them as <defunct>:

$ ps aux | grep defunct
user  4823  0.0  0.0  0  0  ?  Z  10:15  0:00 [python] <defunct>

The trap everyone falls into: you cannot kill a zombie. kill -9 4823 does nothing - you can't kill something that's already dead. The fix is counterintuitive: the zombie is the parent's fault (the parent failed to reap it), so you signal or kill the parent. Find the parent:

$ ps -o ppid= -p 4823       # get the zombie's parent PID
3401
$ ps -p 3401                 # who's the negligent parent?
  PID TTY  CMD
 3401 ?    my_buggy_server
$ kill 3401                   # killing/signaling the parent makes init reap the zombie

If the parent finally calls wait() (for example from its SIGCHLD handler), it reaps the zombie itself. If the parent dies instead, the zombie is handed to init (PID 1), which reaps it, and it vanishes. A few transient zombies are normal; thousands accumulating means a buggy parent program that never reaps children - a real bug to report or fix. This is the kind of thing that looks like a mystery ("why won't this dead process go away?") until you know zombies are reaped through the parent.
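The whole sequence can be reproduced in a couple of seconds. A sketch, assuming a Linux /proc filesystem: the parent starts a child that exits at once, then replaces itself with sleep, which never calls wait(). The variable names are illustrative.

```shell
# Parent: start a short-lived child, then exec into sleep (which never reaps it).
sh -c 'true & exec sleep 3' &
parent_pid=$!
sleep 1   # give the child time to exit and turn into a zombie

# Scan /proc for the process whose parent is parent_pid and read its state.
zombie_pid=""
zombie_state=""
for status_file in /proc/[0-9]*/status; do
    ppid=$(sed -n 's/^PPid:[[:space:]]*//p' "$status_file" 2>/dev/null) || continue
    if [ "$ppid" = "$parent_pid" ]; then
        zombie_pid=${status_file#/proc/}
        zombie_pid=${zombie_pid%/status}
        zombie_state=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "$status_file")
    fi
done
echo "child $zombie_pid of parent $parent_pid is in state $zombie_state"

# SIGKILL is accepted but changes nothing: the process is already dead.
kill -9 "$zombie_pid"
state_after_kill=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$zombie_pid/status")
echo "state after kill -9: $state_after_kill"

# Ending the parent is what lets the zombie be cleaned up.
kill "$parent_pid"
wait "$parent_pid" || true
```

Once the parent is gone, the zombie is handed to PID 1. On a normal system that reaps it right away; inside some containers PID 1 is an ordinary program that never reaps, and the entry can linger.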

What you'll actually see: a runaway process

The most common real incident: something pins a CPU core at 100%. Here's the diagnosis, with output:

$ top -o %CPU            # sort by CPU, the runaway floats to the top
  PID USER  %CPU %MEM  COMMAND
 5012 user 99.7  2.1   node                  <- one core fully pegged

99.7% means one core saturated (on an 8-core box, top can show up to 800% for a process using all cores). Before you kill it, see what it's doing - is it a real workload or a stuck loop? Attach strace (the syscall investigation, if you've seen the senior path) or check /proc:

$ cat /proc/5012/wchan       # what kernel function it's waiting in (0 = not waiting, actively running)
$ ls -l /proc/5012/fd        # what files/sockets it has open - hints at what it's doing

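A process at 100% CPU whose wchan reads 0 is genuinely computing (maybe a real job, maybe an infinite loop). If wchan names a kernel function instead, the process was caught waiting inside the kernel at that instant; a pegged process that keeps showing up as waiting is rarer and worth a closer look. This is the difference between "kill it blindly" and "understand it, then decide."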
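Another quick check for "genuinely computing" is whether the process's CPU time keeps climbing. A sketch, assuming a Linux /proc filesystem; the busy loop is a stand-in runaway, and the simple cut parsing assumes the command name contains no spaces.

```shell
# Stand-in runaway: a shell busy loop that never sleeps.
sh -c 'while :; do :; done' &
runaway_pid=$!
sleep 1

# State R means it is on a CPU or ready to run, not blocked on anything.
runaway_state=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$runaway_pid/status")
echo "state: $runaway_state"

# Field 14 of /proc/PID/stat is user-mode CPU time in clock ticks.
ticks_before=$(cut -d' ' -f14 "/proc/$runaway_pid/stat")
sleep 1
ticks_after=$(cut -d' ' -f14 "/proc/$runaway_pid/stat")
echo "user CPU ticks: $ticks_before -> $ticks_after"

# Done looking: stop it with a plain SIGTERM and collect it.
kill "$runaway_pid"
wait "$runaway_pid" || true
```

A tick count that grows by roughly a second's worth between samples means the process really is burning a core; a count that stands still means it is waiting on something.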

Signals are a vocabulary, not just kill -9

kill sends signals, and the signal you send matters:

kill -TERM PID    # (= plain `kill`) SIGTERM: "please shut down" - the process can clean up
kill -INT PID     # SIGINT: same as Ctrl-C - interrupt
kill -HUP PID     # SIGHUP: many servers reload their config on this (no restart!)
kill -KILL PID    # (= kill -9) SIGKILL: cannot be caught or ignored, instant death, NO cleanup
kill -STOP PID    # pause the process (resume with -CONT)
kill -CONT PID    # resume a stopped process

The discipline: always try SIGTERM (plain kill) first. It lets the process flush buffers, close files, finish a transaction, and exit cleanly. kill -9 (SIGKILL) is the sledgehammer - it can't be caught, so the process dies immediately with no chance to clean up, which can corrupt files or leave locks held. Reach for -9 only when SIGTERM is ignored. Bonus: kill -HUP reloads many daemons' config without downtime. For nginx, send it to the master process only (this is what nginx -s reload does) rather than to every PID that pgrep nginx returns, since that list includes the workers.
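That "SIGTERM first, SIGKILL only if ignored" habit is easy to wrap in a function. A sketch, assuming a Linux /proc filesystem; the names stop_gracefully and is_running and the default 5-second grace period are choices made for this example.

```shell
# True if the PID exists and is not a zombie.
is_running() {
    state=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$1/status" 2>/dev/null)
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# Usage: stop_gracefully PID [GRACE_SECONDS]
stop_gracefully() {
    target_pid=$1
    grace=${2:-5}
    kill -TERM "$target_pid" 2>/dev/null || return 0    # already gone
    while [ "$grace" -gt 0 ]; do
        is_running "$target_pid" || return 0            # exited within the grace period
        sleep 1
        grace=$((grace - 1))
    done
    echo "PID $target_pid ignored SIGTERM; sending SIGKILL" >&2
    kill -KILL "$target_pid" 2>/dev/null
}

# A polite process: plain sleep dies on SIGTERM.
sleep 30 &
polite_pid=$!
stop_gracefully "$polite_pid" 2
polite_status=0
wait "$polite_pid" || polite_status=$?

# A stubborn process: ignores SIGTERM, so the fallback is needed.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
stubborn_pid=$!
sleep 1                                   # let it install its trap first
stop_gracefully "$stubborn_pid" 2
stubborn_status=0
wait "$stubborn_pid" || stubborn_status=$?

echo "polite: $polite_status (128+15), stubborn: $stubborn_status (128+9)"
```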

Try it (with what you'll see)

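  1. Make a zombie: run bash -c 'sleep 1 & exec sleep 5'. The child (sleep 1) exits after one second, but its parent has replaced itself with sleep 5 and never calls wait(), so the child sits as a zombie for about four seconds (raise the 5 for more time). From a second terminal, inspect with ps aux | grep defunct. (Or write a tiny program that forks and never wait()s.) Confirm kill -9 on the zombie does nothing, and killing the parent clears it.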
  2. Find a D-state process if you can (run dd if=/dev/sda of=/dev/null as root briefly - it may flash D). Note you can't reliably kill -9 it mid-I/O.
  3. Spin a runaway: python3 -c 'while True: pass' &, find it in top -o %CPU at ~100%, check /proc/<pid>/wchan (0 = actively running), then kill it (SIGTERM) and confirm it dies cleanly.
  4. Send kill -STOP then kill -CONT to a process and watch its STAT flip T then back to S/R.
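Step 4 can also be scripted so the state changes are captured rather than eyeballed. A sketch, assuming a Linux /proc filesystem; read_state is a small helper defined here, and the one-second pauses just give each signal time to take effect.

```shell
# Print the one-letter state of a PID.
read_state() {
    sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$1/status"
}

sleep 30 &
demo_pid=$!
sleep 1
state_before=$(read_state "$demo_pid")    # sleeping normally

kill -STOP "$demo_pid"                    # pause it
sleep 1
state_stopped=$(read_state "$demo_pid")   # stopped

kill -CONT "$demo_pid"                    # resume it
sleep 1
state_resumed=$(read_state "$demo_pid")   # sleeping again

echo "before: $state_before, after STOP: $state_stopped, after CONT: $state_resumed"

kill "$demo_pid"
wait "$demo_pid" || true
```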

Exercise

  1. List your processes:

    ps -u $USER
    

  2. Count Python processes on your system:

    ps aux | grep python | wc -l
    
    (Note: this counts the grep itself too. ps aux | grep python | grep -v grep | wc -l to exclude.)

  3. Start a sleep in the background:

    sleep 120 &
    jobs
    
    Note the job number (the PID was printed when the job started; jobs -l shows it again). Bring it back to the foreground:
    fg
    
    Press Ctrl-Z to suspend, then bg to continue background, then kill it:
    kill %1
    

  4. Launch htop (or top). Sort by memory (in htop, F6 → MEM%; in top, press M). Find the biggest process. Quit.

  5. Bonus: find what process is using a given file:

    lsof /path/to/file         # processes that have this file open (may need to install lsof)
    
    Or what's listening on a port:
    ss -tlnp                   # TCP listening sockets, with process info
    

What you might wonder

"What's a zombie process?" A process that has finished but whose exit status hasn't been reaped by its parent. It still has a PID but no other resources. Mostly harmless; a pile of them points to a buggy parent. They vanish when the parent reaps them or exits; a reboot clears them too.
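
"Why does kill need a number? What are all the signals?" It doesn't need one: names work too (kill -TERM PID), and the numbers are shorthand. kill -l shows them all. The common ones (numbers as on Linux x86; SIGSTOP and SIGCONT use different numbers on some other systems):

  • 1 SIGHUP - terminal hangup; often triggers reload in daemons.
  • 2 SIGINT - what Ctrl-C sends.
  • 9 SIGKILL - uncatchable.
  • 15 SIGTERM - polite termination.
  • 19 SIGSTOP / 18 SIGCONT - pause / resume.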

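Signal numbers also show up in exit statuses: a process killed by signal N is reported by the shell with status 128 + N. A short sketch using the shell's own kill -l to translate in both situations; the sleep is just a convenient target.

```shell
# Translate signal numbers to names.
name_15=$(kill -l 15)
name_9=$(kill -l 9)
echo "15 is $name_15, 9 is $name_9"

# Kill a background sleep with SIGTERM and look at the status wait reports.
sleep 30 &
victim_pid=$!
kill -TERM "$victim_pid"
victim_status=0
wait "$victim_pid" || victim_status=$?

# kill -l also accepts such an exit status and names the signal behind it.
victim_signal=$(kill -l "$victim_status")
echo "exit status $victim_status means killed by SIG$victim_signal"
```

This is why a script that dies with status 137 was almost certainly hit by SIGKILL (128 + 9), and 143 points to SIGTERM (128 + 15).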
"How do I run a job on a schedule?" cron (Linux/macOS) or systemd timers. crontab -e to edit your scheduled jobs. Out of scope here; useful to know it exists.

"What's the relationship between processes and threads?" A process can have multiple threads (lighter-weight units of execution within the process, sharing memory). For most beginner tasks, this distinction doesn't matter. ps -L shows threads if you need to see them.

Done

  • Inspect processes with ps aux, top, htop.
  • Kill processes with kill, killall, pkill.
  • Distinguish SIGTERM (15) from SIGKILL (9).
  • Run jobs in the background (&, jobs, fg, bg).
  • Keep jobs alive past logout (nohup, tmux).

Next: Shell scripting basics →
