
Week 8 - Disk I/O Scheduling, Filesystems Beyond ext4

8.1 Conceptual Core

  • The I/O stack: filesystem → block layer (with merging, sorting) → I/O scheduler → device driver → hardware.
  • I/O schedulers (settable per-device in /sys/block/<dev>/queue/scheduler): none (preferred for NVMe), mq-deadline, kyber, bfq. Choose none for fast SSDs, mq-deadline for mixed, bfq for desktop fairness.
  • Filesystems:
  • `ext4` - the default; well-understood; journaled.
  • `xfs` - high-throughput, parallel metadata; the RHEL default.
  • `btrfs` - copy-on-write, snapshots, multi-device. Use cautiously for high-throughput workloads.
  • `zfs` - out-of-tree (CDDL); mature, snapshots, end-to-end integrity. Heavy memory footprint.
  • `tmpfs` - RAM-backed; contents vanish on reboot.
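When you read the sysfs scheduler file, the active scheduler is the one in square brackets (e.g. `[none] mq-deadline kyber bfq`). A minimal Python sketch of parsing that format (function names are my own, not from any tool in this plan):

```python
import re

def parse_scheduler(sysfs_text: str) -> tuple[str, list[str]]:
    """Parse the contents of /sys/block/<dev>/queue/scheduler.

    The file lists every scheduler the device supports, with the
    active one wrapped in square brackets.
    Returns (active, available).
    """
    active = None
    available = []
    for tok in sysfs_text.split():
        m = re.fullmatch(r"\[(.+)\]", tok)
        if m:
            active = m.group(1)       # bracketed token is the active scheduler
            available.append(m.group(1))
        else:
            available.append(tok)
    return active, available

print(parse_scheduler("[none] mq-deadline kyber bfq"))
# → ('none', ['none', 'mq-deadline', 'kyber', 'bfq'])
```

To switch schedulers at runtime, write a name back to the same file, e.g. echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler.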

8.2 Mechanical Detail

  • /sys/block/<dev>/queue/{nr_requests,read_ahead_kb,scheduler,rotational}. Tunable per-device.
  • iostat -xz 1 for per-device I/O stats. Watch %util, r/s, w/s, r_await/w_await, and aqu-sz. (svctm is deprecated and has been removed from recent sysstat releases; don't rely on it.)
  • blktrace + btt for fine-grained I/O timing. Modern alternative: bpftrace's biolatency/biosnoop recipes.
  • Mount options for performance:
  • `noatime` - don't update access times on reads. Set it on busy filesystems; the kernel default relatime already limits most of the cost, but noatime eliminates it.
  • discard vs periodic `fstrim` - for SSDs; a periodic fstrim (e.g. a weekly fstrim.timer) is usually better than the continuous discard mount option.
  • `commit=N` - ext4 journal commit interval in seconds (default 5); larger values trade crash-loss window for fewer journal flushes.
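As a concrete (hypothetical) example, a busy ext4 data volume tuned with the options above might look like this in /etc/fstab; the UUID and mount point are placeholders:

```
# /etc/fstab - hypothetical entry for a busy ext4 data volume
# noatime:   skip access-time updates on reads
# commit=30: flush the journal every 30s instead of the 5s default
#            (larger window of data loss on crash, fewer journal flushes)
UUID=xxxx-xxxx  /srv/data  ext4  defaults,noatime,commit=30  0  2
```

For TRIM, pair this with the periodic timer (systemctl enable --now fstrim.timer) rather than adding the discard mount option.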

8.3 Lab: "I/O Forensics"

  1. Run fio with a representative workload. Measure baseline.
  2. Toggle the I/O scheduler. Re-run. Compare.
  3. Use bpftrace -e 'tracepoint:block:block_rq_issue { @[args->comm] = count() }' to see who's hitting the disk.
  4. Mount with vs without noatime and measure metadata-write traffic difference.
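For step 1, a small fio job file gives a repeatable baseline; the parameters below are illustrative, not a recommendation:

```
; baseline.fio - hypothetical mixed random-I/O job for the lab
[global]
ioengine=libaio      ; io_uring is also available on modern kernels
direct=1             ; bypass the page cache so the block layer is exercised
runtime=60
time_based=1
group_reporting=1

[randrw]
filename=/srv/scratch/fio.dat   ; use a scratch file; raw-device writes destroy data
size=4g
rw=randrw
rwmixread=70
bs=4k
iodepth=32
numjobs=4
```

Run fio baseline.fio, flip the scheduler, re-run, and compare the completion-latency percentiles (clat) rather than just the averages.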

8.4 Hardening Drill

  • Set nodev,nosuid,noexec on /tmp, /home, /var/tmp mounts. Document why each matters.
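Hypothetical /etc/fstab entries for the drill; each option closes a specific hole, documented inline:

```
# nodev:  device nodes on this mount are ignored (blocks smuggled /dev entries)
# nosuid: setuid/setgid bits are ignored (no privilege escalation via dropped binaries)
# noexec: binaries can't be executed from this mount (limits payload staging)
tmpfs      /tmp      tmpfs  defaults,nodev,nosuid,noexec  0  0
UUID=xxxx  /home     ext4   defaults,nodev,nosuid         0  2
UUID=xxxx  /var/tmp  ext4   defaults,nodev,nosuid,noexec  0  2
```

Note the trade-offs in your writeup: noexec on /tmp can break package post-install scripts, and noexec on /home breaks user-compiled tools, which is why it's often omitted there.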

8.5 Performance Tuning Slice

  • Histogram I/O latencies in microseconds with bpftrace. A single completion probe can't know when the request was issued, so timestamp each request at issue and match it at completion (key on device + sector; add a /args->dev == .../ filter to scope to one disk):

    bpftrace -e '
      tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
      tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
        @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
        delete(@start[args->dev, args->sector]);
      }'

Month 2 Capstone Deliverable

A memory-and-scheduling/ directory:

  1. meminfo-decoder/ - a script that reads /proc/meminfo and outputs a human-readable health report.
  2. psi-watcher/ - a daemon that alerts when pressure.memory exceeds a threshold.
  3. sched-bench/ - comparing nice-weighted, cgroup-weighted, and pinned workloads.
  4. io-tuner/ - a fio harness sweeping I/O-scheduler options on the local disk.

Each comes with a markdown writeup of measurements and tuning conclusions.
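A minimal sketch of the meminfo-decoder idea, to calibrate scope: the field names are standard /proc/meminfo keys, but the 90% threshold and all function names here are illustrative assumptions:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: kibibytes}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])   # values are reported in kB
    return fields

def health_report(mem: dict[str, int]) -> str:
    """One-line summary; the 90%-used threshold is an arbitrary example."""
    total = mem["MemTotal"]
    # MemAvailable is the kernel's estimate of reclaimable-without-swapping
    # memory; fall back to MemFree on very old kernels that lack it.
    avail = mem.get("MemAvailable", mem["MemFree"])
    used_pct = 100 * (total - avail) / total
    status = "PRESSURE" if used_pct > 90 else "OK"
    return f"{status}: {used_pct:.0f}% used ({avail} kB available of {total} kB)"

sample = "MemTotal: 16384256 kB\nMemFree: 842120 kB\nMemAvailable: 9120332 kB"
print(health_report(parse_meminfo(sample)))
# → OK: 44% used (9120332 kB available of 16384256 kB)
```

The real deliverable would read /proc/meminfo directly and cover more fields (Dirty, Slab, swap), but the parse-then-judge structure stays the same.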
