04 - Your First Neural Network

What this session is

About 45 minutes. Build a small multi-layer perceptron (MLP) in PyTorch - the simplest interesting neural network. You'll see how nn.Module, layers, forward pass, and autograd compose.

The plan

We'll build a network that classifies handwritten digits (MNIST - the "hello world" of ML). 28×28 grayscale images → one of 10 digit classes.

We won't train it yet (page 05 covers training). This page is about defining the model.

nn.Module: PyTorch's central abstraction

Every model in PyTorch is a class extending torch.nn.Module. Define your layers in __init__; define how data flows through them in forward.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)     # input layer: 784 → 128
        self.fc2 = nn.Linear(128, 64)       # hidden:      128 → 64
        self.fc3 = nn.Linear(64, 10)        # output:      64 → 10

    def forward(self, x):
        # x is shape (batch_size, 784)
        x = F.relu(self.fc1(x))             # → (batch, 128)
        x = F.relu(self.fc2(x))             # → (batch, 64)
        x = self.fc3(x)                     # → (batch, 10) - raw logits
        return x

model = MLP()
print(model)

Run this. You'll see:

MLP(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
)

nn.Module's repr lists every registered child module, so printing a model shows its architecture for free.

What each line does

nn.Linear(in_features, out_features) - a fully-connected layer. Internally: a weight matrix (stored with shape (out_features, in_features)) and a bias vector. When you call it with input x, it computes x @ W.T + b.
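
A quick sanity check of those shapes and that computation (reusing the imports from the block above):

layer = nn.Linear(784, 128)
print(layer.weight.shape)                 # torch.Size([128, 784])
print(layer.bias.shape)                   # torch.Size([128])

x = torch.randn(4, 784)
manual = x @ layer.weight.T + layer.bias  # same math, done by hand
print(torch.allclose(layer(x), manual))   # True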

F.relu(x) - element-wise ReLU activation. max(0, x). Adds non-linearity (page 03).

The last layer produces logits - raw scores, one per class. We don't apply softmax here; the loss function (page 05) does it more numerically stably.
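
A preview of page 05: F.cross_entropy fuses log-softmax and negative log-likelihood, so you feed it raw logits - a minimal sketch with made-up labels just to show the call:

logits = torch.randn(4, 10)               # pretend model output
targets = torch.tensor([3, 7, 0, 1])      # pretend class labels
loss = F.cross_entropy(logits, targets)   # applies softmax internally, stably
print(loss)                               # a scalar tensor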

Run a forward pass

x = torch.randn(4, 784)            # batch of 4 random "images"
out = model(x)
print(out.shape)                   # torch.Size([4, 10])
print(out[0])                      # 10 logits for the first sample

model(x) calls forward(x) under the hood. The output is (batch_size, num_classes) - 10 logits per input.

To turn logits into probabilities:

probs = F.softmax(out, dim=1)
print(probs[0])
print(probs.sum(dim=1))            # each row sums to 1

To get the predicted class (argmax over the classes):

preds = out.argmax(dim=1)
print(preds)                       # tensor of class indices

The model is randomly initialized - predictions are garbage. Training (page 05) fixes that.

Counting parameters

total = sum(p.numel() for p in model.parameters())
print(f"total params: {total:,}")    # ~109,000

.parameters() yields all the learnable tensors (weights + biases). For this MLP:

  • fc1: 784 × 128 weights + 128 biases = 100,480
  • fc2: 128 × 64 + 64 = 8,256
  • fc3: 64 × 10 + 10 = 650
  • Total: 109,386
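
To verify the breakdown per layer:

for name, p in model.named_parameters():
    print(f"{name}: {tuple(p.shape)} -> {p.numel():,}")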

For comparison: GPT-2 (small) has 124M parameters, GPT-3 has 175B, and Llama 3 70B (a modern open-weights model) has 70B. Parameter count is a rough proxy for capability and a direct proxy for memory cost: at float32 (4 bytes per parameter), this MLP's weights fit in about 0.44 MB, while a 70B-parameter model needs roughly 280 GB at the same precision.

A more compact form: nn.Sequential

For simple feed-forward stacks:

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

Layers in order. Use Sequential for "just call these in sequence" cases. Use the class form when forward needs anything more than that (skip connections, conditional flow).
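
For instance, a skip connection - a hypothetical variant, just to show what the class form can express and Sequential can't:

class SkipMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = h + F.relu(self.fc2(h))   # skip connection: h flows around fc2 and is added back
        return self.fc3(h)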

Activations as modules vs functions

Two ways to apply ReLU:

# As a function (no parameters):
x = F.relu(x)

# As a module (in Sequential):
nn.ReLU()

Both work. Functions are cleaner inside forward. Modules are required inside Sequential. Use whichever fits.

Initialization

PyTorch initializes weights with sensible defaults (Kaiming uniform for linear layers). For most cases this is fine. To override:

for p in model.parameters():
    if p.dim() > 1:                 # weights are 2-D matrices; skip 1-D biases
        nn.init.kaiming_normal_(p)

Initialization matters for very deep networks; modern architectures (with normalization layers) are robust enough that the default usually works.

Move to GPU

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(4, 784).to(device)
out = model(x)

.to(device) moves all parameters. After that, your input must also be on the same device.
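
Models don't carry a .device attribute; a common idiom for checking where one currently lives is to inspect a parameter:

print(next(model.parameters()).device)   # cuda:0 or cpu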

Save and load

# Save just the parameters
torch.save(model.state_dict(), "mlp.pt")

# Load into the same architecture
model2 = MLP()
model2.load_state_dict(torch.load("mlp.pt"))
model2.eval()

state_dict() returns a dict of {name: tensor} for every parameter. Saving the state dict (not the whole model object) is the recommended pattern - portable across code changes.
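
You can inspect it directly - for the class-based MLP above, the keys follow the attribute names:

for name, tensor in model2.state_dict().items():
    print(name, tuple(tensor.shape))
# fc1.weight (128, 784)
# fc1.bias (128,)
# ... fc2 and fc3 follow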

.eval() switches the model into evaluation mode (matters for layers like dropout and batch norm that behave differently during training vs inference).
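
You can watch the difference with a standalone dropout layer (a sketch - the MLP above has no dropout):

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()             # training mode: zeros ~half the entries, scales survivors by 2
print(drop(x))

drop.eval()              # eval mode: identity - values pass through unchanged
print(drop(x))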

Common architectures (very brief preview)

The MLP is the simplest. You'll meet others:

  • CNN (Convolutional Neural Network) - for images. Layers detect local patterns (edges, textures) at multiple scales.
  • RNN / LSTM - for sequences (older approach). Largely replaced by transformers.
  • Transformer - attention-based, the modern default for language and increasingly vision. Page 07 covers it.

For this beginner path: MLPs for the early pages; transformers for the LLM pages.

Exercise

Create mlp_demo.py:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    def __init__(self, input_dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("device:", device)

    model = MLP().to(device)
    print(model)
    total = sum(p.numel() for p in model.parameters())
    print(f"params: {total:,}")

    # Forward pass on a fake batch
    x = torch.randn(8, 784, device=device)
    logits = model(x)
    print("logits shape:", logits.shape)
    probs = F.softmax(logits, dim=1)
    print("sample probs:", probs[0])
    print("sample sum (should be ~1):", probs[0].sum().item())


if __name__ == "__main__":
    main()

Run it. Verify everything makes sense.

Stretch: rewrite the model as nn.Sequential. Same output, fewer lines.

Bigger stretch: add a dropout layer (nn.Dropout(p=0.2)) between the hidden layers. Dropout randomly zeros some activations during training, which makes it an effective regularizer. Print the model to see the new architecture.

What you might wonder

"What's super().__init__() for?" Calls nn.Module's constructor - sets up internal bookkeeping so .parameters() and .to() work on your model. Always call it first in __init__.

"Why F.relu vs nn.ReLU?" Same thing. Functional form is shorter inside forward; module form composes with Sequential and is registered as a child module (so it shows in print(model)).

"What's a 'logit'?" A pre-softmax output. The raw score the model assigns to each class. The class with the largest logit is the prediction. Logits aren't probabilities; softmax converts them.

"Should I worry about backpropagation math?" No - PyTorch's autograd handles it. You define the forward pass; gradients are computed automatically. Page 05 shows the loop.

Done

  • Define a model by extending nn.Module.
  • Use nn.Linear, F.relu, nn.Sequential.
  • Run a forward pass.
  • Count parameters.
  • Save and load state_dict.
  • Move models to GPU.

Next: Training loop →