Worked investigation - Follow a packet through netfilter

Companion to Linux Kernel -> Month 04 -> Week 13-14: Netfilter, the Network Stack. The chapter explains that packets traverse hooks where rules can filter and rewrite them. This page makes you watch a packet's journey on your own machine: see it hit each netfilter hook, watch a firewall rule drop it, watch NAT rewrite its address. It is the host-network counterpart to the Kubernetes Cilium investigation. ~40 minutes, Linux with root.

The symptom you're learning to diagnose

A connection isn't working and the application logs are useless. "Connection refused"? "Connection timed out"? "No route to host"? Each points at a different layer and a different fix - but most engineers can't tell which, so they flail. Or: a firewall rule "isn't working" and nobody can see whether packets are even reaching it. The Linux network stack is a precise pipeline with observable checkpoints. Learn the checkpoints and "the network is broken" becomes "the packet died at hook X for reason Y."

Step 0: the one fact - packets flow through hooks

Every packet entering or leaving the machine traverses a fixed sequence of netfilter hooks - checkpoints in the kernel network stack where rules can inspect, drop, accept, or rewrite it. The five hooks, in order:

                   ┌─────────────┐
incoming packet -> │ PREROUTING  │ -> [routing decision] ─┬─> for THIS host:  INPUT  -> local process
                   └─────────────┘                        │
                                                          └─> for ANOTHER host: FORWARD -> POSTROUTING -> outgoing packet
local process -> OUTPUT -> [routing decision] -> POSTROUTING -> outgoing packet
  • PREROUTING - first touch for every incoming packet (before the kernel decides where it goes). Where DNAT (destination rewriting) happens.
  • INPUT - packets destined for this host's processes. Where a host firewall accepts/drops inbound.
  • FORWARD - packets passing through (this host as a router - how containers and VMs get connectivity).
  • OUTPUT - packets generated by local processes, heading out.
  • POSTROUTING - last touch before a packet leaves. Where SNAT/masquerade (source rewriting) happens.

iptables/nftables rules attach to these hooks. Every iptables or nftables firewall, every Docker bridge network, and every kube-proxy Kubernetes Service in iptables mode is a set of rules on these five hooks. We'll watch packets hit them.

Step 1: see the hooks and the default rules

Look at what's already attached (Docker, if installed, has added plenty):

$ sudo nft list ruleset | head -30        # nftables (modern)
# or, the classic view:
$ sudo iptables -L -v -n --line-numbers
Chain INPUT (policy ACCEPT 1247 packets, 892K bytes)
num  pkts bytes target  prot  source       destination
1    1247  892K ACCEPT  all   0.0.0.0/0    0.0.0.0/0    ctstate ESTABLISHED
...

The crucial columns are pkts and bytes - per-rule counters of how many packets matched that rule. These counters are your X-ray: they tell you whether a packet actually reached and matched a rule. A rule with pkts 0 never fired - either there was no matching traffic, or an earlier rule or hook already accepted or dropped the packet before it got this far. This is how you debug "my rule isn't working": check whether its counter moves.
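One way to make the counter check quicker is to filter the listing down to rules that have matched at least once. The sketch below assumes the listing was produced with -x (exact counts instead of abbreviated values such as 892K) and --line-numbers, so the packet counter is the second column. rules_with_hits is a hypothetical helper name, not part of iptables.

```shell
# Print only the rules whose packet counter is non-zero.
# Expects `iptables -L <chain> -v -n -x --line-numbers` output on stdin.
rules_with_hits() {
    # Rule lines start with a rule number; header lines do not.
    awk '$1 ~ /^[0-9]+$/ && $2 + 0 > 0 { print }'
}

# Usage (root required):
#   sudo iptables -L INPUT -v -n -x --line-numbers | rules_with_hits
```

Zeroing the chain with sudo iptables -Z INPUT just before reproducing a problem makes any non-zero counter attributable to that test.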

Step 2: watch a rule drop a packet (and see the counter move)

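Add a rule that drops pings to yourself, then watch its counter climb as packets hit it. (-A appends to the end of the chain; if an earlier rule already accepts the traffic, such as a blanket accept on the lo interface, insert at the top with -I INPUT 1 instead.)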

$ sudo iptables -A INPUT -p icmp --icmp-type echo-request -j DROP   # drop incoming pings
$ ping -c 3 127.0.0.1                                                # try to ping
PING 127.0.0.1: 56 data bytes
--- 127.0.0.1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss          # all dropped

$ sudo iptables -L INPUT -v -n | grep icmp
   3   252 DROP  icmp  0.0.0.0/0  0.0.0.0/0  icmptype 8       # pkts=3: our 3 pings, dropped here

The counter says pkts 3 - your three pings hit this exact rule and were dropped at the INPUT hook. You can see where the packet died. Now the diagnostic distinction that trips everyone up:

  • DROP (what we did) - the packet vanishes silently. The sender waits and eventually reports "timed out." No response at all.
  • REJECT - the kernel sends back an ICMP error ("port unreachable" by default; a rule can choose others, such as "administratively prohibited"). A TCP client that receives port-unreachable reports "connection refused" immediately; ping prints "Destination Port Unreachable."

Change the rule to REJECT and ping again - now it fails instantly with an explicit error instead of hanging. This is why "timed out" vs "connection refused" matters: timed out usually means a firewall DROP (or packets blackholed somewhere along the path); refused means something actively said no (a REJECT rule, or a closed port, which the kernel answers with a TCP RST). The error text tells you which behavior you're hitting. Remove the rule when done:

$ sudo iptables -D INPUT -p icmp --icmp-type echo-request -j DROP
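
The DROP-to-REJECT swap can be sketched as a small function. This is a minimal version under stated assumptions: reject_demo is a hypothetical helper name, the rule is inserted at position 1 so no earlier rule can accept the ping first, and --reject-with icmp-port-unreachable is spelled out even though it is the default, so the delete at the end matches the rule exactly.

```shell
# Sketch: swap the Step 2 DROP rule for a REJECT rule, ping, then clean up.
reject_demo() {
    # Remove the DROP rule if it is still installed (ignore "no such rule").
    sudo iptables -D INPUT -p icmp --icmp-type echo-request -j DROP 2>/dev/null || true
    # Insert at the top so no earlier rule can accept the ping first.
    sudo iptables -I INPUT 1 -p icmp --icmp-type echo-request \
        -j REJECT --reject-with icmp-port-unreachable
    # -W 1 caps the wait per reply at one second; a REJECT should answer sooner.
    ping -c 3 -W 1 127.0.0.1 || true
    # The REJECT rule's counter should now show the pings it answered.
    sudo iptables -L INPUT -v -n --line-numbers
    # Clean up: delete by the same specification used to insert.
    sudo iptables -D INPUT -p icmp --icmp-type echo-request \
        -j REJECT --reject-with icmp-port-unreachable
}
```

With the REJECT rule in place, ping should report an error for each request right away rather than waiting out the timeout, which is the contrast this step is about.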

Step 3: watch the packet's journey with trace

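netfilter can report a packet's path through the rules it meets. Enable tracing for ping packets and watch one traverse the pipeline:

$ sudo nft add table ip trace_demo
$ sudo nft add chain ip trace_demo prerouting '{ type filter hook prerouting priority -350; }'
$ sudo nft add rule ip trace_demo prerouting icmp type echo-request meta nftrace set 1
$ sudo modprobe nfnetlink_log 2>/dev/null
# in another terminal, watch the trace:
$ sudo nft monitor trace
# back in the first terminal; a loopback ping arrives on lo and so passes PREROUTING:
$ ping -c1 127.0.0.1

You'll see the packet announced in each nftables chain it passes once the flag is set - here the trace_demo prerouting chain, then any input-hook chains that exist on the box (for example the filter INPUT chain that iptables-nft maintains) - each line naming the chain, the rule, and the verdict. The trace only follows packets that match the rule: a ping to a remote address would not show up, because locally generated packets start at OUTPUT and never pass PREROUTING, and the reply that comes back is an echo-reply. This is the abstract "packets traverse hooks" made literally visible: you watch one packet walk the pipeline. (Clean up: sudo nft delete table ip trace_demo.)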
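
Locally generated packets enter the pipeline at OUTPUT rather than PREROUTING, so tracing a ping to a remote address needs the flag set on the output hook. A minimal sketch, assuming a separate table named trace_out (a name chosen for this example) and the same priority as above:

```shell
# Sketch: trace locally generated pings from the output hook onward.
# trace_out is a hypothetical table name chosen for this example.
trace_output_up() {
    sudo nft add table ip trace_out
    sudo nft add chain ip trace_out output \
        '{ type filter hook output priority -350; }'
    sudo nft add rule ip trace_out output \
        icmp type echo-request meta nftrace set 1
}

trace_output_down() {
    sudo nft delete table ip trace_out
}

# Usage (two terminals, root required):
#   terminal 1:  trace_output_up && sudo nft monitor trace
#   terminal 2:  ping -c1 8.8.8.8
#   afterwards:  trace_output_down
```

The trace should then list the output-hook and postrouting-hook chains present on the machine; hooks with no chain attached produce no trace lines.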

Step 4: watch NAT rewrite an address

NAT (Network Address Translation) is how a home network's many devices share one public IP, and how every container reaches the internet. Watch it rewrite a source address. The simplest observable version is the masquerade rule Docker installs for its bridge network (the same mechanism you would set up by hand for a network namespace, building on the container investigation):

# Counters on the NAT table show rewrites happening:
$ sudo iptables -t nat -L POSTROUTING -v -n
Chain POSTROUTING (policy ACCEPT)
 pkts bytes target      prot source         destination
 1893  114K MASQUERADE  all  172.17.0.0/16  0.0.0.0/0      # Docker's masquerade rule

That MASQUERADE rule (Docker installed it) rewrites the source address of traffic leaving a container (172.17.x.x, a private address that can't be routed on the internet) to the host's real IP, at the POSTROUTING hook - the last touch before the packet leaves. The reply comes back to the host, and conntrack (below) reverses the rewrite to deliver it to the right container. Note what the counter measures: rules in the nat table are consulted only for the first packet of each connection, and conntrack applies the same translation to the rest, so pkts 1893 is roughly the number of container connections that were SNAT'd, not every packet. This is the entire mechanism behind "containers can reach the internet but have private IPs" - one rule on one hook.
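
On a machine without Docker, the same rule can be built by hand around a network namespace. The following is a sketch, not Docker's actual setup: the namespace name natdemo, the interface names veth-host and veth-ns, and the subnet 10.200.0.0/24 are all arbitrary choices for this example.

```shell
# Sketch: a hand-built equivalent of Docker's masquerade rule.
# Assumptions: "natdemo", veth-host/veth-ns and 10.200.0.0/24 are example values.
nat_demo_up() {
    sudo ip netns add natdemo
    sudo ip link add veth-host type veth peer name veth-ns
    sudo ip link set veth-ns netns natdemo
    sudo ip addr add 10.200.0.1/24 dev veth-host
    sudo ip link set veth-host up
    sudo ip netns exec natdemo ip addr add 10.200.0.2/24 dev veth-ns
    sudo ip netns exec natdemo ip link set veth-ns up
    sudo ip netns exec natdemo ip route add default via 10.200.0.1
    # Let the host route between interfaces (stays enabled until set back to 0).
    sudo sysctl -w net.ipv4.ip_forward=1
    # The one rule that matters: rewrite the source address at POSTROUTING.
    sudo iptables -t nat -A POSTROUTING -s 10.200.0.0/24 -j MASQUERADE
}

nat_demo_down() {
    sudo iptables -t nat -D POSTROUTING -s 10.200.0.0/24 -j MASQUERADE
    sudo ip netns del natdemo     # deleting the namespace removes the veth pair too
}

# Usage (root required):
#   nat_demo_up
#   sudo ip netns exec natdemo ping -c 3 8.8.8.8
#   sudo iptables -t nat -L POSTROUTING -v -n    # the MASQUERADE counter should have moved
#   nat_demo_down
```

If the ping gets no reply, the FORWARD chain is the first place to look: on hosts where its policy is DROP (Docker sets this), the namespace's traffic also needs explicit ACCEPT rules in FORWARD for both directions.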

Step 5: connection tracking - the memory behind it all

NAT and stateful firewalls work because the kernel remembers connections in the conntrack table. See it live:

$ sudo conntrack -L | head
tcp  6 431999 ESTABLISHED src=192.168.1.50 dst=140.82.121.4 sport=54221 dport=443 \
     src=140.82.121.4 dst=192.168.1.50 sport=443 dport=54221 [ASSURED]

Each line is a tracked connection with both directions (the second src=/dst= is the reply tuple - note it's the reverse, and for NAT'd connections the addresses differ, which is how the kernel un-rewrites replies). This table is why:

  • A stateful firewall can ACCEPT ... ctstate ESTABLISHED (Step 1's rule 1) - it recognizes reply packets of connections it already allowed.
  • NAT replies find their way home - conntrack stores the rewrite to reverse it.
  • A conntrack table that fills up (nf_conntrack: table full, dropping packet in dmesg) is a real production outage - a busy proxy or NAT box runs out of connection slots and silently drops new connections. Watching conntrack -C (the count) vs net.netfilter.nf_conntrack_max is how you catch it before it bites.
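
The count-versus-maximum check in the last bullet can be sketched as a small function that reads the kernel's two counters from /proc. conntrack_usage is a hypothetical helper name, and the 80% warning threshold is an assumption for illustration, not a recommended value.

```shell
# Sketch: report conntrack table usage as a percentage.
# The optional argument overrides the directory holding the two counter files.
conntrack_usage() {
    dir=${1:-/proc/sys/net/netfilter}
    count=$(cat "$dir/nf_conntrack_count") || return 1
    max=$(cat "$dir/nf_conntrack_max") || return 1
    pct=$((count * 100 / max))
    echo "conntrack: $count / $max (${pct}%)"
    # Assumed threshold: exit status is non-zero at 80% usage or more.
    [ "$pct" -lt 80 ]
}

# Usage:
#   conntrack_usage || echo "conntrack table is getting full"
```

Because the function signals the threshold through its exit status, it can be dropped into a cron job or a monitoring check without parsing its output.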

The diagnostic decision tree

What you can now do that you couldn't before - map the symptom to the layer:

"connection timed out"     -> packet DROPped (firewall) or blackholed; check INPUT/FORWARD
                              counters and `iptables -L -v` for a DROP that's incrementing.
"connection refused"       -> a REJECT rule, or nothing listening (closed port). Check for
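                              REJECT rules; check `ss -tlnp` for a listener on that port.
"no route to host"         -> ARP/neighbor resolution failed, or an ICMP host-unreachable /
                              admin-prohibited came back (including from a REJECT rule using
                              --reject-with icmp-host-prohibited). Check `ip route`,
                              `ip neigh`, and REJECT rules.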
"works locally, not remote"-> FORWARD hook / NAT / the path between hosts; check FORWARD
                              counters and conntrack.
"intermittent drops at scale" -> conntrack table full; check `conntrack -C` vs nf_conntrack_max.

That mapping - error text to hook to fix - is the difference between flailing and diagnosing.
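
As a memory aid, the tree above can be written down as a lookup from error text to the first command worth running. This is a sketch only: where_to_look is a hypothetical helper, and it does plain substring matching on whatever message the failing tool printed.

```shell
# Sketch: turn an error message into the first commands worth running.
where_to_look() {
    case "$1" in
        *"timed out"*)
            echo "sudo iptables -L -v -n    # find a DROP rule whose counter is incrementing" ;;
        *"refused"*)
            echo "ss -tlnp                  # is anything listening? if so, look for REJECT rules" ;;
        *[Nn]"o route to host"*)
            echo "ip route; ip neigh        # routing and neighbor state, then REJECT rules" ;;
        *"table full"*)
            echo "sudo conntrack -C         # compare with net.netfilter.nf_conntrack_max" ;;
        *)
            echo "no mapping for this message" >&2
            return 1 ;;
    esac
}

# Usage:
#   where_to_look "connect: Connection timed out"
```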

Now you do it

  1. Add the ICMP DROP rule, ping yourself, watch the counter hit pkts 3. Switch DROP to REJECT, ping again, and feel the instant rejection vs the hanging timeout. This single contrast is the most useful network-debugging intuition there is.
  2. sudo iptables -t nat -L -v -n (or nft list ruleset) on a box with Docker. The -t nat matters: without it you only see the filter table. Find the MASQUERADE rule and watch its counter climb as a container opens new connections.
  3. sudo conntrack -L while you load a few web pages. Watch connections appear (ESTABLISHED) and expire. Run sudo conntrack -C for the live count.
  4. Enable nft monitor trace for a ping and watch one packet traverse the hooks. Map each line to the diagram in Step 0.

What you might wonder

"iptables or nftables?" nftables is the modern replacement; iptables commands now often run through an nftables backend (iptables-nft). Learn to read both - you'll meet iptables syntax constantly in older docs and Docker, and nftables in new systems. The hooks (PREROUTING/INPUT/FORWARD/OUTPUT/POSTROUTING) and the concepts are identical; only the rule syntax differs.

"How does this relate to the Cilium/eBPF Kubernetes investigation?" Same problem, newer mechanism. Traditional kube-proxy implements Kubernetes Services as iptables rules on these hooks - which scales poorly (O(n) rules). Cilium replaces that with eBPF programs at the network hooks (the bpftrace investigation's technology), doing the same NAT/routing with O(1) map lookups. This page is the classic netfilter foundation; the Cilium page is the eBPF evolution. Understanding netfilter is what makes the Cilium page make sense.

"Why are there so many Docker rules I didn't add?" Docker programs netfilter heavily: MASQUERADE for outbound container traffic (Step 4), DNAT for published ports (-p 8080:80 becomes a DNAT rule reached from PREROUTING), FORWARD rules for the bridge. Reading them is reading exactly how container networking works - the NAT rules are visible in iptables -t nat -L -v -n, and the FORWARD rules in iptables -L FORWARD -v -n.

"What's the difference between a packet DROPped at INPUT vs FORWARD?" INPUT = the packet was for a local process and the host firewall blocked it. FORWARD = the packet was passing through this host (to a container, VM, or another network) and was blocked in transit. If a container can't reach the internet, the FORWARD chain (and masquerade) is where to look, not INPUT. Knowing which chain owns which traffic is half of network debugging.

What this gave you

  • You know the five netfilter hooks and the path a packet takes through them.
  • You watched a firewall rule drop packets and saw the per-rule counter prove it.
  • You can distinguish "timed out" (DROP) from "connection refused" (REJECT) and map each to a cause.
  • You watched NAT/masquerade rewrite addresses at POSTROUTING - the mechanism behind container internet access.
  • You read the conntrack table and know a full conntrack table is a real outage to watch for.
  • You have a symptom-to-hook-to-fix decision tree, and you see how netfilter underlies Docker and (via eBPF) Kubernetes networking.

Back to the Networking month, or revisit the Cilium/eBPF Kubernetes investigation for the modern evolution.
