
09 - Files and the Standard Library

What this session is

About an hour. You'll learn how to read and write files, work with file paths portably, parse JSON, handle dates and times, and get acquainted with Python's standard library - the giant collection of useful modules that ship with Python itself.

The standard library is one of Python's biggest selling points. "Batteries included" is the slogan. Whatever you need - HTTP, JSON, CSV, dates, sockets, subprocesses, regular expressions, threading - there's a module for it.

Reading a file

The basic pattern (already met in page 07):

with open("notes.txt") as f:
    contents = f.read()
print(contents)

open returns a file object; with closes it automatically when the block ends. .read() returns the entire contents as one string.

For large files, iterate line by line - Python streams (page 08):

with open("notes.txt") as f:
    for line in f:
        print(line.rstrip())   # rstrip removes trailing newline

To get a list of lines:

with open("notes.txt") as f:
    lines = f.readlines()      # list of strings, each with newline

Writing a file

Open in write mode ("w"):

with open("output.txt", "w") as f:
    f.write("hello, world\n")
    f.write("second line\n")

"w" truncates the file if it exists (you lose what was there). For append, use "a". For read-and-write, "r+".
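A quick sketch of the difference between "w" and "a" (the log.txt filename is just for illustration):

```python
from pathlib import Path

# "log.txt" is just an illustrative filename.
path = Path("log.txt")
path.write_text("first run\n")     # like "w": replaces any existing content

with open(path, "a") as f:         # "a": appends to the end instead
    f.write("second run\n")

print(path.read_text())            # both lines survive
```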

You can also print to a file:

with open("output.txt", "w") as f:
    print("hello, world", file=f)
    print("second line", file=f)

print adds a newline automatically (which is sometimes nicer than f.write with manual \n).

Text vs binary mode

open(path) opens in text mode by default - Python decodes bytes to a string. For binary data (images, archives, anything non-text), open in binary mode ("rb", "wb"):

with open("photo.jpg", "rb") as f:
    data = f.read()       # bytes, not str
print(type(data))         # <class 'bytes'>
print(len(data))          # size in bytes

When in doubt: text mode for text, binary for everything else.

File paths: pathlib

You'll see two ways to handle paths in Python:

Old way - strings + os.path:

import os
path = os.path.join("data", "files", "notes.txt")
exists = os.path.exists(path)
parent = os.path.dirname(path)

Modern way - pathlib:

from pathlib import Path
path = Path("data") / "files" / "notes.txt"
exists = path.exists()
parent = path.parent

pathlib's Path object overloads / to mean "join path components." That makes building paths intuitive. It also has useful methods:

p = Path("notes.txt")
p.read_text()                    # whole file as str
p.write_text("hello")            # write str to file (creates file)
p.read_bytes()                   # whole file as bytes
p.exists()                       # does it exist?
p.is_file()                      # is it a file?
p.is_dir()                       # is it a directory?
p.stat().st_size                 # size in bytes
p.suffix                         # ".txt"
p.stem                           # "notes"
p.parent                         # Path(".")
for child in Path(".").iterdir():
    print(child)                 # every file/dir in current folder

Use pathlib for all new code. It's clearer than os.path and works portably across Windows / macOS / Linux.
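One more pathlib method worth knowing: glob, which matches filenames by pattern (rglob does the same recursively). A small sketch - demo_dir is just a scratch directory created for the example:

```python
from pathlib import Path

# "demo_dir" is a scratch directory made just for this example.
d = Path("demo_dir")
d.mkdir(exist_ok=True)
(d / "a.txt").write_text("a")
(d / "b.txt").write_text("b")
(d / "c.md").write_text("c")

txt_files = sorted(d.glob("*.txt"))        # pattern match inside d
print([p.name for p in txt_files])         # ['a.txt', 'b.txt']
```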

JSON: structured data on disk and over the wire

Most APIs and config files use JSON. Python has a built-in json module:

import json

# Python -> JSON string
data = {"name": "Alice", "age": 30, "languages": ["Python", "Go"]}
text = json.dumps(data)
print(text)              # {"name": "Alice", "age": 30, "languages": ["Python", "Go"]}

# JSON string -> Python
loaded = json.loads(text)
print(loaded)            # {'name': 'Alice', 'age': 30, 'languages': ['Python', 'Go']}
print(loaded["age"])     # 30

Reading/writing JSON files:

import json
from pathlib import Path

# Write
data = {"name": "Alice", "age": 30}
Path("data.json").write_text(json.dumps(data, indent=2))

# Read
loaded = json.loads(Path("data.json").read_text())
print(loaded)

indent=2 makes the output human-readable (pretty-printed).

JSON maps cleanly to Python:

- JSON object → Python dict.
- JSON array → Python list.
- JSON string → str.
- JSON number → int or float.
- JSON true/false/null → Python True/False/None.

Anything that can be expressed in JSON can round-trip through json.dumps/json.loads. Things that can't be expressed directly: dates, custom classes, sets, tuples (tuples become lists). For those, you write a custom encoder or use a richer format (msgpack, pickle, protobuf).
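For the common case of dates, you don't need a full custom encoder class: json.dumps takes a default= function that's called for any object it can't serialize. A minimal sketch:

```python
import json
from datetime import datetime, timezone

event = {"name": "login", "at": datetime(2026, 5, 17, 8, 0, tzinfo=timezone.utc)}

# json.dumps(event) alone raises TypeError: datetime isn't JSON-serializable.
# default= is called for any object json doesn't know how to encode.
text = json.dumps(event, default=lambda obj: obj.isoformat())
print(text)    # {"name": "login", "at": "2026-05-17T08:00:00+00:00"}

# Reading it back gives a plain string; re-parse the date yourself:
loaded = json.loads(text)
when = datetime.fromisoformat(loaded["at"])
print(when)    # 2026-05-17 08:00:00+00:00
```

Note the asymmetry: the encoding side is automatic once default= is set, but on the decoding side JSON has no date type, so you must know which fields to re-parse.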

Dates and times: datetime

The standard module is datetime:

from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
print(now)                  # 2026-05-17 14:23:45.123456+00:00

# Construct a specific time
launch = datetime(2026, 12, 1, 9, 0, 0, tzinfo=timezone.utc)
print(launch)               # 2026-12-01 09:00:00+00:00

# Arithmetic
diff = launch - now
print(diff)                 # 197 days, 18:36:14.876544
print(diff.days)            # 197

# Add/subtract
later = now + timedelta(hours=3, minutes=15)
print(later)

# Format
print(now.strftime("%Y-%m-%d %H:%M"))         # "2026-05-17 14:23"

# Parse
parsed = datetime.strptime("2026-05-17 14:23", "%Y-%m-%d %H:%M")
print(parsed)

Lessons:

- Always use timezone-aware datetimes for anything that touches more than one machine. datetime.now() without timezone.utc is "naive" - no zone info - and silently breaks across timezones.
- timedelta is the type for durations. Use it for arithmetic.
- strftime formats; strptime parses. The format codes (%Y, %m, %d, %H, ...) are the same as C's strftime.
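One more parsing method worth knowing, since the exercise below relies on it: datetime.fromisoformat parses ISO-8601 strings directly, no format codes needed, and keeps the timezone offset:

```python
from datetime import datetime, timezone

# fromisoformat parses ISO-8601 strings directly - no format string needed
ts = datetime.fromisoformat("2026-05-17T14:23:00+00:00")
print(ts)                         # 2026-05-17 14:23:00+00:00
print(ts.tzinfo == timezone.utc)  # True - the offset made it timezone-aware
```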

Other essential standard library modules

You don't need to learn all of these now - just know they exist. Each is a single import away (the module name below is the import name).

Module What it does
os OS-level operations (env vars, processes, working directory)
sys Python interpreter info, argv, stdin/stdout/stderr
pathlib File paths (you met it)
json JSON encoding/decoding (you met it)
datetime Dates and times (you met it)
csv CSV files
re Regular expressions
urllib, urllib.request Basic HTTP (use requests or httpx for anything serious)
http.server A simple HTTP server. python -m http.server 8000 serves the current directory.
subprocess Run external commands
argparse Command-line argument parsing
logging Structured logging
collections Specialized data types (Counter, defaultdict, deque, namedtuple)
itertools Combinatorics and iterator helpers (chain, groupby, combinations, product)
functools Higher-order function helpers (partial, reduce, lru_cache)
unittest Built-in testing framework (most code uses pytest instead)

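To give a flavor of one entry from the table: subprocess.run executes an external command and captures its output. A minimal sketch (using sys.executable so the example doesn't assume any particular shell command exists):

```python
import subprocess
import sys

# sys.executable is the path to the running Python interpreter, so this
# launches a child Python process - portable across platforms.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True, text=True,
)
print(result.returncode)        # 0 on success
print(result.stdout.strip())    # hello from a subprocess
```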
Two especially useful ones to know about:

collections.Counter - count things easily:

from collections import Counter
words = "the quick brown fox jumps over the lazy dog the end".split()
counts = Counter(words)
print(counts)               # Counter({'the': 3, 'quick': 1, ...})
print(counts.most_common(2))  # [('the', 3), ('quick', 1)]

(Compare to your page 06 wordcount exercise - Counter is the one-liner.)

itertools - useful iterator combinators:

from itertools import chain, groupby, combinations
list(chain([1, 2, 3], [4, 5, 6]))         # [1, 2, 3, 4, 5, 6]
list(combinations([1, 2, 3, 4], 2))       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
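groupby (imported above) deserves its own small example, because it has a gotcha: it only groups runs of adjacent equal keys, so you usually sort by the same key first:

```python
from itertools import groupby

# groupby only groups adjacent equal keys - sort by the same key first
words = sorted(["bear", "apple", "cat", "ant", "bat"], key=lambda w: w[0])
for letter, group in groupby(words, key=lambda w: w[0]):
    print(letter, list(group))
# a ['apple', 'ant']
# b ['bear', 'bat']
# c ['cat']
```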

Exercise

In a new file summarize.py:

Write a program that:

  1. Reads a small JSON file events.json containing a list of events, where each event has name, timestamp (ISO format like "2026-05-17T14:23:00+00:00"), and severity ("low", "medium", "high").

Example data:

[
  {"name": "login", "timestamp": "2026-05-17T08:00:00+00:00", "severity": "low"},
  {"name": "error", "timestamp": "2026-05-17T08:15:00+00:00", "severity": "high"},
  {"name": "login", "timestamp": "2026-05-17T09:00:00+00:00", "severity": "low"},
  {"name": "error", "timestamp": "2026-05-17T10:30:00+00:00", "severity": "high"},
  {"name": "warn",  "timestamp": "2026-05-17T11:00:00+00:00", "severity": "medium"}
]
Save this as events.json first.

  2. Parses the timestamps with datetime.fromisoformat.

  3. Counts events by severity (Counter). Print the result.

  4. Finds the earliest and latest event timestamps. Print them.

  5. Writes a summary to summary.json containing:

     - total: total event count.
     - by_severity: the counts.
     - first: ISO timestamp of the earliest.
     - last: ISO timestamp of the latest.

Use pathlib, json, datetime, and collections.Counter.

What you might wonder

"Why two ways to do file paths?" History. os.path is the original; pathlib was added in 3.4 and is now the recommended way. Old code uses os.path; new code uses pathlib. You'll see both.

"Are there standard library bits I should NOT use?" A few. urllib.request for HTTP is awkward - use the third-party httpx or requests instead. xml.etree is OK but lxml is faster for serious XML work. pickle is convenient but unsafe - never pickle.loads untrusted data. (Pickle can execute arbitrary code during deserialization - security CVE territory.)

"What's the right way to do dates?" For business logic: datetime with explicit UTC. For date-only (no time): datetime.date. For more complex calendar work or natural-language parsing: third-party arrow or pendulum. Avoid naive datetimes (no timezone) like the plague.

"How big should my standard library tour be?" Don't try to read all of it. Skim the index at docs.python.org/3/library/ once so you know what categories exist. Then look up specifics when you have a real need.

Done

You can now:

- Read and write text and binary files with with open(...).
- Manipulate file paths portably with pathlib.
- Encode and decode JSON for both API payloads and config files.
- Work with timezone-aware datetimes and durations.
- Reach for Counter, defaultdict, and itertools when they fit.
- Know that the standard library is huge and worth skimming.

You can now do practical I/O work. Next page: testing your own code with pytest.

Next: Tests →
