
AI Expert Roadmap From Scratch (Beginner)

Beginner orientation: 12-month career arc, what to learn when, math you actually need, picking a specialization, portfolio, interview prep, first 90 days.




The 12-month path from "I've used ChatGPT a few times" to "I'm employable as an AI engineer," with the hype removed.

This is not a coding tutorial. The hands-on coding lives in AI Systems From Scratch - tensors, training loops, LoRA, RAG, serving. This path is the meta-layer: what to learn in what order, the math you actually need (and don't), how to read papers, how to build a portfolio, how to evaluate jobs, how the first 90 days go.

Read this first if you're trying to figure out where to start. Read it alongside AI Systems From Scratch once you've picked a direction.

Table of contents

  1. Where are you starting from
  2. The 12-month arc, honest version
  3. Math you actually need (and what you don't)
  4. The Python + Linux baseline
  5. ML mental model in one page
  6. Transformers in one page
  7. The Hugging Face ecosystem map
  8. Picking a specialization
  9. Reading papers without drowning
  10. Building in public
  11. Your portfolio: 3 projects
  12. Evaluating job postings honestly
  13. Open source as resume
  14. Interview prep: what they actually ask
  15. First 90 days on the job
  16. Done - the next 12 months

How to use this path

  • Read in order, once, fast. ~3 hours.
  • Then come back to specific pages when the decisions actually need making.
  • Pair with AI Systems From Scratch for the coding side.

What this path is not

  • A tutorial. Almost no code.
  • A motivational deck. No "AI will change the world" pep talk.
  • A guarantee. The market is real, brutal, and changing monthly.

What you get instead: an honest map.

00 - Where are you starting from

What this session is

Ten minutes. Honest self-assessment. Determines which pages of this roadmap actually apply to you and which you can skip.

The four starting points

People show up to "I want to be an AI engineer" from very different places. The advice for each is different. Find yours.

A. Never coded

You've used ChatGPT. You've read articles. You've never written a for loop. You're not sure if you're "the kind of person" who codes.

Reality: 12-18 months. Most of it is just becoming a programmer. The AI-specific layer is the last 4-6 months.

Start with: Python from Scratch on this site. Finish all 16 pages, do every exercise, get one PR merged. Then come back here.

B. Coded a bit (HTML/CSS, Excel macros, a Python tutorial)

You can write a script that prints something. Loops, ifs, basic data structures are familiar in some language but you're shaky.

Reality: 9-12 months. Skip the absolute basics, but reinforce - there's a difference between "did the tutorial" and "could rebuild it from memory."

Start with: Finish the back half of Python from Scratch. Read this roadmap. Then start AI Systems from Scratch.

C. Working programmer, not in ML

Backend, frontend, SRE, data engineering, whatever. You ship code for a living. You haven't trained a model.

Reality: 6-9 months. The hardest transition isn't programming - it's the math, the model intuition, and learning to think in tensors.

Start with: Skim Python from Scratch (1 hour). Read this roadmap. Spend most time in AI Systems from Scratch and the senior AI Systems path.

D. ML-adjacent (data scientist, ML researcher, analyst)

You've trained models. You've read papers. You're trying to move from notebooks to production AI engineering.

Reality: 3-6 months. The gap is engineering: serving, infra, evaluation in production, cost, latency.

Start with: Read this roadmap quickly. Skip ahead to Picking a specialization. Spend most time in serving / infra / OSS contribution.

The honest test

Answer these out loud:

  1. Can you, right now, write a Python script that reads a file, counts the words, and prints the top 10 most common?
  2. Do you know what a function, a class, and a list comprehension are without looking them up?
  3. Have you ever installed a Python library with pip and used it?
  4. Have you ever used a Linux terminal for anything beyond cd?

Scoring:

  • 0-1 yes → bucket A.
  • 2 yes → bucket B.
  • 3 yes → bucket C (lean B if shaky).
  • 4 yes + can train a model → bucket D.
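
For calibration, question 1's script is about this much code. A minimal sketch, standard library only; input.txt is a placeholder filename:

from collections import Counter
from pathlib import Path

text = Path("input.txt").read_text()            # any text file you have on hand
words = text.lower().split()                    # naive whitespace tokenization
for word, count in Counter(words).most_common(10):
    print(f"{word}\t{count}")

If that sketch reads as obvious, you've cleared question 1.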

What this means for the rest of this path

The rest of these 16 pages assume bucket C - a working programmer moving into AI. Bucket A/B readers: complete the prerequisite paths first, then come back. Bucket D readers: the orientation pages will feel obvious; the strategy pages (specialization, jobs, portfolio) are the value.

What you might wonder

"Can I skip programming and just learn 'AI'?" No. The job is "AI engineer." The first word is descriptive; the second is the role. Engineers code.

"What about no-code AI tools?" Different career. Useful skill, but not what hiring managers mean by "AI engineer."

"I'm 40 / 50 / changing careers. Too late?" No. The field is 5 years old in its current form. Everyone is mid-career-pivot.

Done

  • Located your starting bucket.
  • Picked the right entry point on this site.

Next: The 12-month arc, honest version →

01 - The 12-month arc, honest version

What this session is

The real shape of 12 months. Not "learn AI in 30 days." Not "transformer specialist in 6 weeks." The actual phases, the actual time each takes, and which months are demoralizing.

The four phases

Months 1-2: Foundation

What you do: Python you can write from memory. Linux comfortable. Git for real. NumPy and pandas. Read about ML at a conceptual level. No models trained yet.

What it feels like: Slow. You're not "doing AI." You'll wonder if you picked the wrong path. You haven't. The people who skip this phase plateau at month 4.

Signal you're ready for phase 2: You can write a Python script that loads a CSV, does some pandas operations, and saves the result. Without Googling syntax.
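
Concretely, phase-2 readiness looks something like this sketch (sales.csv and its columns are hypothetical placeholders):

import pandas as pd

df = pd.read_csv("sales.csv")
summary = (
    df.dropna(subset=["region", "revenue"])           # basic cleaning
      .groupby("region", as_index=False)["revenue"].sum()
      .sort_values("revenue", ascending=False)
)
summary.to_csv("revenue_by_region.csv", index=False)  # save the result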

Months 3-5: Hands-on ML and DL

What you do: Train a logistic regression, a small neural net, an MLP, a CNN on MNIST/CIFAR. Read the PyTorch docs front-to-back. Build the things in AI Systems From Scratch. Fine-tune a small open model. Run RAG.

What it feels like: Fun. Confusing. You'll keep hitting "but why" walls. You'll Google a lot. You'll feel like you're memorizing magic incantations.

Signal you're ready for phase 3: You can explain, out loud, what backprop roughly does and why a learning rate matters.

Months 6-9: Specialization + portfolio

What you do: Pick a specialization (see page 7). Build two real projects (page 10). Read 5 papers in your area. Start contributing to one OSS project. Build in public.

What it feels like: Lonely. Slow visible progress. Mostly hard work with no audience yet.

Signal you're ready for phase 4: You have one project that's live, written up, and you can demo end-to-end in 5 minutes.

Months 10-12: Job search

What you do: Apply, interview, fail, learn, repeat. Continue contributing to OSS. Continue building. Network.

What it feels like: Bruising. Most applications get no reply. The interviews you do get teach you what's missing. You go back to building, then re-apply.

Signal you're ready to ship: Offer in hand. Or, more realistically: a couple of final-round interviews, one of which converts.

What's not on this list

  • "Read every paper on arXiv." You can't. Don't try.
  • "Master CUDA." Optional. Mostly not needed.
  • "Get a PhD." Different job.
  • "Learn every framework." Pick one stack, get deep.

The demoralizing months

Statistically, the people who quit do so around:

  • Month 2 - "this is too much math/code, maybe I'm not smart enough." (You are. Push through.)
  • Month 6 - "I've been doing this for half a year and I don't feel hireable." (You aren't yet. That's fine. Two more phases to go.)
  • Month 10 - "I've applied to 80 jobs and nobody's writing back." (Yes. That's the job search. Don't stop.)

If you know these slumps are coming, they hurt less.

What you can accelerate, and what you can't

Accelerate: Programming. Linux. Tooling. Code-reading speed. Project shipping pace.

Cannot accelerate: Building real intuition for what models do and don't do. That's reps. Lots of small experiments, watching loss curves, breaking things. There's no shortcut.

If you have less than 12 months

Compress, but don't skip. Phase 1 is the only one you can shorten meaningfully (by leveraging existing programming skill). The rest scale with calendar weeks, not hours per week - your brain needs sleep between concepts.

People who go faster than 6 months usually had a programming + math background already.

What you might wonder

"Can I do this part-time?" Yes. Most people do. Add 50-100% to every timeline. 15-20 hours/week sustained beats 40 hours/week for 3 weeks and burn out.

"Should I quit my job?" Usually no. Pay for the runway with your current job. The first AI engineering offer often pays less than your current senior backend role does. Plan for that.

"What about bootcamps?" Some are good. Most are expensive and teach exactly what's on YouTube. If you're disciplined, you don't need one. If you're not, a bootcamp's structure may be worth the money.

Done

  • Know the four phases.
  • Know the slumps to expect.
  • Have a realistic calendar.

Next: Math you actually need →

02 - Math you actually need (and what you don't)

What this session is

The math debate. What's hype, what's required, what's gatekeeping, what's optional. With specific resources for each.

The honest answer

You can be a productive AI engineer with high-school math plus a working understanding - not formal mastery - of four topics: linear algebra, calculus (just gradients), probability (just expectations), and basic statistics (just sampling, distributions).

Anyone who tells you you need to grind Strang, Spivak, and Casella before writing PyTorch is wrong, or preparing you for a different job (research, not engineering).

What you need, ranked by ROI

Tier 1: load-bearing

  • Matrix multiplication, dot products. What shape times what shape gives what shape. You'll do this every day.
  • Gradients (one-variable, multi-variable, intuitively). What backprop is updating. Why learning rates matter. Why gradient explosion/vanishing happens.
  • Probability distributions, basic. Normal, uniform, categorical. Sampling vs argmax.
  • Logarithms, exponents, softmax. Why log-likelihood loss looks weird. Why softmax over logits.
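
Tier 1 in practice is mostly shape tracking. A minimal NumPy sketch of a matmul plus softmax (names and sizes arbitrary):

import numpy as np

x = np.random.randn(4, 8)                        # batch of 4 vectors, dim 8
W = np.random.randn(8, 3)                        # weights: shape (8, 3)
logits = x @ W                                   # (4, 8) @ (8, 3) -> (4, 3)

probs = np.exp(logits - logits.max(axis=-1, keepdims=True))   # stable softmax
probs /= probs.sum(axis=-1, keepdims=True)
assert np.allclose(probs.sum(axis=-1), 1.0)      # each row is a distribution

If you can predict every shape in that snippet before running it, Tier 1 is within reach.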

Tier 2: helpful

  • Eigenvalues / SVD, conceptually. Used in PCA, embeddings, attention analysis. You don't need to compute by hand.
  • Information theory basics. Entropy, cross-entropy, KL divergence. Shows up in loss functions and evaluation.
  • Basic statistics. Variance, expectation, central limit theorem. For understanding evaluation noise.

Tier 3: nice-to-have

  • Convex optimization. Useful when reading papers. Not blocking.
  • Measure-theoretic probability. Required for research; not for engineering.
  • Tensor calculus / differential geometry. Required for very specific specializations (e.g., diffusion models theory).

What you don't need

  • Rigorous epsilon-delta calculus proofs.
  • Real analysis.
  • PDEs from scratch.
  • Group theory.
  • The full Strang course unless you enjoy it.

What "working understanding" means

You can:

  • Read a paper's equations and parse the shape of what's happening, even if you couldn't reproduce the derivation.
  • Know when a derivative would be near zero (saturating activations, etc).
  • Sanity-check that a probability distribution sums to one.
  • Compute matrix shapes without paper.

You can't (and don't need to):

  • Derive backprop on a paper napkin.
  • Prove convergence properties.
  • Write your own optimizer from scratch.

Resources, ranked

In order, by efficiency for the engineering track:

  1. 3Blue1Brown's "Essence of Linear Algebra" + "Essence of Calculus" YouTube series. ~6 hours total. Best ROI on this list. Watch even if you "know" linear algebra.
  2. fast.ai's "Practical Deep Learning" course. Math taught just in time, alongside code. Many people learn the math better here than in standalone math courses, because it's grounded.
  3. MIT 18.06 (Strang) Linear Algebra lectures. If you want depth. Watch at 1.5x. Skip the homework.
  4. MIT 6.041 Probability (Tsitsiklis). Same deal. Watch the first 8 lectures, skip the rest unless interested.
  5. The Deep Learning Book (Goodfellow et al), chapters 2-4. Free online. The "math you need" chapters. Skim, don't grind.

What I'd skip

  • Long-form Coursera ML math specializations. Slow, repetitive, will-sapping.
  • "Mathematics for Machine Learning" book. Fine but encyclopedic; you'll bog down.
  • Khan Academy linear algebra. Too elementary; you'll be bored.

The pragmatic plan

Weeks 1-2: 3Blue1Brown linear algebra + calculus. 6 hours total.

Weeks 3-4: First 4 chapters of the Deep Learning book. Skim, take notes on confusion points, move on.

Ongoing: When a paper or library confuses you, look up the one math concept you need. Wikipedia is fine. Don't preemptively learn things.

That's it. You can always come back. Math-first is a trap that costs people 3-6 months.

What you might wonder

"But I keep hearing AI is 'all math.'" By people who do research. Engineers use frameworks that abstract the math. The math you need to understand is to debug what your model is doing, not to derive new architectures.

"What if I want to do research?" Different career. Different roadmap. Read Picking a specialization. Research roles usually require a PhD or equivalent published work.

"What if I'm bad at math?" "Bad at math" usually means "didn't have a good teacher" or "stopped before things got interesting." 3Blue1Brown will likely change your relationship with the material. Try it before deciding.

Done

  • Know what's required, helpful, optional, and gatekeeping.
  • Have a specific 4-week plan.
  • Are not going to do a 6-month math detour.

Next: The Python + Linux baseline →

03 - The Python + Linux baseline

What this session is

The non-AI prerequisites people skip and then can't debug their way out of. Specific skill checklist.

Why this page exists

Almost every blocker I've watched a new AI engineer hit was actually a Python or Linux blocker, dressed up. CUDA out-of-memory turning out to be a path issue. "The model won't train" turning out to be a shell environment problem. "The script hangs" turning out to be a buffered stdout question.

If your Python and Linux are strong, AI engineering is just engineering.

Python baseline checklist

You should be able to do all of these without Googling syntax:

Language

  • Write a function with default args, keyword args, and *args, **kwargs.
  • Write a class with __init__, an instance method, and a __repr__.
  • Write a generator with yield. Know when generators help.
  • Use list, dict, set comprehensions.
  • Use with statements. Know what a context manager does.
  • Read a stack trace top-to-bottom and find the actual cause.
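
A sketch that exercises several of these at once - generator, context manager, default args, a class with __repr__. All names are hypothetical; standard library only:

from pathlib import Path

def line_lengths(path, skip_blank=True):
    # generator with a default arg: yields lazily, O(1) memory
    with Path(path).open() as f:                 # context manager closes the file
        for line in f:
            line = line.rstrip("\n")
            if skip_blank and not line:
                continue
            yield len(line)

class Histogram:
    # __init__, an instance method, and a __repr__
    def __init__(self):
        self.counts = {}

    def add(self, value):
        self.counts[value] = self.counts.get(value, 0) + 1

    def __repr__(self):
        return f"Histogram({self.counts!r})"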

Standard library

  • pathlib over os.path.
  • json for serialization.
  • argparse or click for CLIs.
  • subprocess.run with check=True.
  • logging (not print) for non-trivial scripts.
  • pytest for tests.

Ecosystem

  • pip and pip install -e . for a local package.
  • venv or uv for environments. (uv is faster; use it.)
  • pyproject.toml over setup.py for new projects.
  • Know the difference between declared dependencies and a pinned lockfile, and which role a given requirements.txt is playing.
  • Read a pyproject.toml and know what [project], [tool.uv], [tool.pytest.ini_options] blocks do.

AI-specific

  • numpy: arrays, shapes, broadcasting, slicing, np.einsum for sanity-check matmul.
  • pandas: load CSV, filter, group, save. Know when not to use pandas (huge data, structured tensors).
  • matplotlib or seaborn: plot a curve, plot a histogram. Save to PNG.

If any of this is shaky, do Python from Scratch before moving on.

Linux baseline checklist

You should be able to:

Filesystem and process

  • Navigate with cd, ls, tree. Use find and grep (or rg).
  • chmod, chown for permission issues.
  • ps aux | grep <thing>, kill -9 <pid>.
  • top / htop for "what's eating my CPU/memory."
  • df -h, du -sh * | sort -h - disk usage matters for AI.
  • nvidia-smi if you have a GPU. Watch live with watch -n 1 nvidia-smi.

Shell

  • Pipes (|), redirects (>, >>, 2>&1).
  • Environment variables: export FOO=bar, $FOO, env | grep FOO.
  • Background jobs (&), nohup, tmux or screen.
  • Editing in vim or nano at least for quick edits.

Networking

  • curl -v <url> to debug an API.
  • ss -tnlp or lsof -i :8000 to find what's on a port.
  • ssh user@host, scp, rsync -avh for moving files.

Python-on-Linux specifics

  • Where which python is and why it matters.
  • Why pip install sometimes installs to the wrong env. Use python -m pip install instead.
  • Reading journalctl or systemd logs if a service won't start.

If any of this is shaky, do Linux from Scratch before moving on.

Git baseline

The minimum:

  • clone, add, commit, push, pull.
  • branch, checkout -b, merge, rebase (use merge until you understand rebase).
  • Resolve merge conflicts at least once with intent.
  • stash, stash pop.
  • log --oneline --graph, diff, blame.
  • Read a .gitignore. Write one for a new repo.

The 30-minute self-test

Do this in your shell:

mkdir /tmp/baseline && cd /tmp/baseline            # scratch directory
uv init && uv venv && source .venv/bin/activate    # new project + environment
uv add numpy pandas matplotlib pytest              # the baseline stack

cat > work.py <<'EOF'
import numpy as np, pandas as pd
df = pd.DataFrame({"x": np.random.randn(1000), "y": np.random.randn(1000)})
df["z"] = df.x * 2 + df.y
print(df.describe())
df.to_csv("out.csv", index=False)
EOF

python work.py
head out.csv

If every step felt natural, you're at baseline. If any step was confusing, that's the gap to close.

What you might wonder

"Conda vs uv vs pip vs poetry?" uv is the current right answer. Fast, modern, replaces venv/pip/pip-tools/pyenv. Conda is fine if your team uses it; don't fight a team's choice. Avoid switching mid-project.

"Mac, Linux, or WSL?" Linux native is best. Mac is fine (mps works for many models). WSL2 is fine for development; production is Linux. Windows native is painful for AI; avoid.

"Do I need to learn Bash scripting properly?" Read it. Don't write big ones. Reach for Python when a shell script exceeds 30 lines.

Done

  • Self-tested Python baseline.
  • Self-tested Linux baseline.
  • Know what to revisit.

Next: ML mental model in one page →

04 - ML mental model in one page

What this session is

The whole of "what is machine learning" in one page. The mental model you'll keep coming back to. Not a course; a frame.

The one-sentence definition

Machine learning is fitting a function to data, where the function has lots of knobs ("parameters"), and the fitting is done by an optimizer that adjusts the knobs to make the function's outputs closer to known answers on training examples - and you hope it generalizes to new examples.

That's it. Everything else is variations.

The four pieces

Every ML system has these four:

1. Data

Examples. Inputs paired (usually) with desired outputs.

  • Supervised: labeled (input → output known).
  • Unsupervised: no labels (find structure in the inputs).
  • Self-supervised: the input is the label, in a clever way (e.g., predict next token of text).
  • Reinforcement: no labels; reward signal from the environment.

2. Model

A function with parameters. For neural nets, "parameters" means "the weights." A model with 7 billion parameters is a function with 7 billion knobs.

Bigger models can fit more complex functions. They also need more data and compute.

3. Loss function

A number that measures "how wrong is the model right now." Lower is better.

  • For regression (predict a number): mean squared error.
  • For classification (predict a category): cross-entropy.
  • For generation (predict a sequence): next-token prediction loss.

4. Optimizer

The algorithm that adjusts the knobs to reduce the loss. Usually a variant of gradient descent (SGD, Adam, AdamW). It looks at the gradient of the loss with respect to each knob and nudges that knob in the loss-reducing direction.

That's training: data → model predicts → loss measures wrongness → optimizer adjusts knobs → repeat.
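
Here is that sentence as a minimal PyTorch sketch - fitting y ≈ 3x on toy data. Nothing here is from a real project; it's the four pieces in their smallest form:

import torch

x = torch.randn(256, 1)                          # data: inputs...
y = 3 * x + 0.1 * torch.randn(256, 1)            # ...paired with noisy answers

model = torch.nn.Linear(1, 1)                    # model: a function with two knobs
loss_fn = torch.nn.MSELoss()                     # loss: how wrong are we
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # optimizer: the knob-nudger

for step in range(200):
    loss = loss_fn(model(x), y)                  # predict, measure wrongness
    opt.zero_grad()
    loss.backward()                              # gradient of loss w.r.t. each knob
    opt.step()                                   # nudge knobs downhill

print(model.weight.item())                       # ≈ 3 after training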

What "learning" really is

Imagine you're trying to fit a curve through dots on a graph. The curve has 1000 wiggles you can adjust. Each step:

  1. The curve makes a guess at each dot.
  2. You measure how far off it is.
  3. You nudge each wiggle a tiny bit to reduce the error.
  4. Repeat 10,000 times.

That's it. Neural network training is this, with billions of wiggles instead of 1000, and the "dots" being images, text tokens, or actions.

Why it works (mostly)

Two reasons:

  1. Universal approximation: big enough neural nets can fit (approximate) any reasonable function.
  2. Gradient descent finds good-enough minima: in high dimensions, the loss landscape has many decent solutions and the optimizer usually finds one.

It works better than people predicted in the 2010s. Nobody fully understands why so much of it generalizes. The empirical answer: "scale + transformers + lots of compute."

The three common shapes

Classification

Inputs → one of N categories. "Is this email spam?" Output: probabilities over categories.

Regression

Inputs → a number. "What price will this house sell for?" Output: a continuous value.

Sequence-to-sequence (generation)

Inputs → another sequence. "Translate this French to English." "Continue this paragraph." Output: tokens one at a time.

Most modern AI (LLMs, image generation, speech, video) is some flavor of sequence generation.

Generalization, overfitting, underfitting

Three states:

  • Underfitting: model too small or undertrained. Wrong on training and test data.
  • Good fit: correct on training data, mostly correct on new data.
  • Overfitting: memorized training data but fails on new data.

The whole game is finding the sweet spot. Techniques: more data, regularization, dropout, early stopping, smaller models.

What "deep learning" adds

Same four pieces. The model is a neural network - a function built of many simple layers stacked. "Deep" means lots of layers. Deep nets:

  • Can fit much more complex functions than classical ML.
  • Need more data and compute.
  • Discover useful intermediate representations automatically (the layers' outputs).

The last point is huge. In classical ML, you'd hand-craft features. In deep learning, the model figures them out.

What "LLMs" add

LLMs (Large Language Models) are deep nets, trained self-supervised on enormous text corpora, with a specific architecture (transformer - see page 5). They're "trained to predict the next token" at planet-scale, and out of that simple task comes the ability to translate, summarize, code, reason at-least-somewhat, and converse.

Nobody planned all those abilities. They emerged from scale. This is both the most exciting and most uncomfortable fact about modern AI.

The map of the field

  • Classical ML: decision trees, SVMs, random forests, linear regression. Still used. Still useful. Often the right answer for tabular data.
  • Deep learning: neural nets for perception (images, audio, video).
  • NLP / LLMs: transformers for language. Currently the loudest part of the field.
  • RL: agents learning from reward. Used heavily in LLM post-training (RLHF, DPO).
  • Generative models: GANs (older), diffusion (current). For images, video, audio.

You don't need to specialize in all of these. Pick one in page 7.

What you might wonder

"Is ML just statistics?" Sort of. Heavy overlap. ML emphasizes prediction over inference; statistics emphasizes inference over prediction. The math is largely shared.

"Why does GPT 'understand' me if it's just predicting tokens?" Open question. Empirically, next-token prediction at scale produces models that pass many tests of understanding. Whether they "really" understand is philosophy. Engineers treat them as useful tools and measure outputs.

"How can a model with no understanding write working code?" Patterns in training data + emergent generalization. The model has seen billions of examples of code-and-explanation pairs. It learns the conditional distribution. Often-but-not-always, this produces correct code.

Done

  • One mental model for all of ML.
  • Four pieces, three shapes, three fit-states.
  • Know where LLMs sit in the landscape.

Next: Transformers in one page →

05 - Transformers in one page

What this session is

Transformers explained at the level an AI engineer needs. Not a paper reading. Not a derivation. Just: what they are, what they do, the four ideas that make them work, and why they took over.

The one-sentence definition

A transformer is a neural network architecture where every position in the input can directly attend to (look at, weight, and pull information from) every other position, in parallel, using a mechanism called attention.

That's the whole thing. "Attention is all you need" was the 2017 paper's title; it remains the elevator pitch.

The four ideas

1. Tokens

Text gets split into tokens - pieces of words. "Tokenization" turns a string into a list of integers. The transformer never sees text; it sees token IDs.

"Hello, world"[15496, 11, 995] (or similar).

The model's vocabulary is ~30K-200K tokens. Each token has a learned vector (an "embedding") of some hundreds or thousands of numbers.
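
Concretely, with the GPT-2 tokenizer via transformers (a sketch; IDs differ between tokenizers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")      # BPE, ~50K-token vocabulary
ids = tok.encode("Hello, world")
print(ids)                                       # [15496, 11, 995] for GPT-2
print(tok.convert_ids_to_tokens(ids))            # ['Hello', ',', 'Ġworld']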

2. Attention

For each token in the sequence, the model asks: "Which other tokens in this sequence should I pay attention to, and how much?"

Concretely, every token computes three vectors: a query, a key, and a value. Each token's attention to every other token is query · key (dot product). High dot product = high attention. The output for each position is a weighted sum of all values, weighted by attention.

This is the "every position attends to every position" part. It happens in parallel - that's why transformers train fast on GPUs.
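
The whole mechanism fits in a dozen lines of NumPy. A minimal single-head sketch - no masking, no batching, no learned projections:

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # every query dotted with every key
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax rows: attention weights
    return weights @ V                           # weighted sum of values

n, d = 5, 16                                     # 5 tokens, 16-dim vectors
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(attention(Q, K, V).shape)                  # (5, 16): one output per position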

3. Layers

A transformer has many layers. Each layer does: attention, then a small per-position feed-forward network. Stack 12-100+ of these. Each layer can route information differently. Lower layers tend to handle local structure (syntax); higher layers handle semantics; the very highest handle task-specific behavior.

Nobody hard-codes this. The optimizer discovers it.

4. Positional encoding

Attention is order-agnostic by default - it sees a set, not a sequence. So you add a positional encoding to each token embedding to inject "this is position 0, this is position 1, ..." Different schemes exist (sinusoidal, learned, RoPE, ALiBi). Modern LLMs use RoPE.

That's it. Tokens → embeddings + position → many attention layers → output logits over vocab → softmax → predicted next token.

Why transformers won

Three reasons:

  1. Parallelism. Unlike RNNs, transformers can process all positions in parallel during training. GPUs love that. Training is 10-100x faster per parameter.
  2. Scaling. Transformer performance scales smoothly with parameters, data, and compute. Doubling each gives predictable improvements. This made the AI investment thesis possible.
  3. Generality. Same architecture works for text, code, images (Vision Transformer), audio, video. One architecture, many domains.

How LLMs use them

LLMs are decoder-only transformers trained to predict the next token. Show them billions of token sequences with the task "predict token N+1 given tokens 0..N." The optimizer tunes hundreds of billions of parameters until the model gets very good at this.

To generate text:

  1. Tokenize the prompt.
  2. Forward pass: get logits over vocabulary at the last position.
  3. Sample one token (greedy, top-k, top-p, temperature).
  4. Append to sequence.
  5. Repeat from step 2.

That's inference. Slow because it's sequential - one token at a time. The whole serving optimization industry exists to make this fast.
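
That loop, as a minimal sketch with transformers - gpt2 as a small stand-in, greedy sampling, deliberately no KV cache:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                              # steps 2-5, repeated
        logits = model(ids).logits[:, -1, :]         # logits at the last position
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy: most likely token
        ids = torch.cat([ids, next_id], dim=-1)      # append, go again
print(tok.decode(ids[0]))

Real serving stacks cache the keys and values from earlier steps instead of re-running the whole prefix each iteration - see the KV cache item below.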

What transformers don't do (intuition)

  • They don't reason in the human sense. They compute conditional probabilities over tokens.
  • They don't have memory across conversations unless you give it to them (context window or external memory).
  • They don't know what's true. They know what's statistically likely given the training data.
  • They don't refuse to make things up unless trained to.

Everything an LLM appears to do - reasoning, planning, refusing - is a learned behavior. Some emerge from scale (basic reasoning). Some are added in post-training (refusals, helpfulness).

What to know to do AI engineering

  • Context window. How many tokens the model can attend to at once. 8K, 128K, 1M depending on model.
  • Tokenization. Different tokenizers = different costs and behaviors. BPE, sentencepiece, tiktoken.
  • Attention is O(n²) in sequence length. Long context is expensive. This drives most serving optimizations.
  • KV cache. During generation, you cache the keys and values from previous tokens so you don't recompute. Huge memory consumer.
  • Quantization. Reducing the precision of weights (16-bit → 8-bit → 4-bit) to fit bigger models in memory. Some accuracy lost; often worth it.

These five items come up in every serving conversation. Know them.

Architectural variants

You'll hear these in interviews:

  • Encoder-only (BERT): good at understanding tasks, embeddings.
  • Decoder-only (GPT, Llama): good at generation. The current default for LLMs.
  • Encoder-decoder (T5): good at translation, summarization.
  • Mixture-of-Experts (MoE): only some parameters activate per token. Bigger total capacity, cheaper inference. DeepSeek, Mixtral.
  • Diffusion transformers: transformers for image/video diffusion. Stable Diffusion 3, Sora.

What you might wonder

"Do I need to read the original 'Attention Is All You Need' paper?" Once, yes. Skim it. It's clear. The Annotated Transformer (Harvard NLP) walks through it with code.

"Will transformers be replaced soon?" Maybe. Active research: state-space models (Mamba), linear attention, hybrid architectures. Whatever replaces them will be a variation on "every position attends to every other position, cheaply." Skills transfer.

"Should I implement a transformer from scratch?" Optional. Karpathy's "Let's build GPT" video does it in ~2 hours. Worth it once for intuition. Don't make it your project.

Done

  • Know what tokens, attention, layers, positional encoding are.
  • Know why transformers won.
  • Know the engineer's checklist: context window, tokenization, O(n²), KV cache, quantization.

Next: The Hugging Face ecosystem map →

06 - The Hugging Face ecosystem map

What this session is

The Hugging Face universe is huge and overwhelming on first contact. This page is the map: what each piece does, what it competes with, which to actually use.

Why HF matters

HF is the GitHub of AI. Hundreds of thousands of models, tens of thousands of datasets, the leaderboard everyone watches, and the libraries that wire them together. If you're doing applied AI today, you're using HF for something, even if you don't know it.

The pieces

transformers

The main library. Loads thousands of pretrained models. Same API for GPT, Llama, BERT, T5, vision models, audio models. Use it to:

  • Run inference (pipeline, AutoModelForX).
  • Fine-tune (Trainer).
  • Mix and match tokenizers, models, configurations.

If you only learn one HF library, learn this one.

datasets

Library to load and stream datasets. ~100K datasets available. Standardized format. Memory-mapped - works with bigger-than-RAM data. Use for:

  • Loading common datasets (load_dataset("squad")).
  • Streaming huge datasets (streaming=True).
  • Tokenizing pipelines via .map().
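
The first two uses above, as a short sketch:

from datasets import load_dataset

ds = load_dataset("squad", split="train")        # downloads once, caches locally
print(ds[0]["question"])

stream = load_dataset("squad", split="train", streaming=True)   # no full download
print(next(iter(stream))["title"])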

tokenizers

Fast tokenizer implementations (Rust core, Python bindings). Most users access tokenizers via transformers - but tokenizers directly is what you need if you're training a new tokenizer or doing volume processing.

accelerate

Wraps PyTorch training to run on CPU / single GPU / multi-GPU / TPU with the same code. Handles distributed training boilerplate. Use for any training that won't fit on one GPU.

peft

Parameter-efficient fine-tuning. LoRA, QLoRA, prefix tuning, IA3. Lets you fine-tune big models on small GPUs by only training tiny adapters. This is the right way to fine-tune almost always.
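
A minimal LoRA sketch with peft - gpt2 as a small stand-in; target_modules names gpt2's fused attention projection:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"])
model = get_peft_model(base, config)             # base weights frozen, adapters added
model.print_trainable_parameters()               # a fraction of a percent is trainable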

trl

Transformer reinforcement learning. RLHF, DPO, ORPO, KTO - all the modern preference alignment algorithms. Higher-level than transformers. Use for fine-tuning chat models with human feedback or synthetic preferences.

bitsandbytes

Quantization library. 8-bit and 4-bit weight quantization. Lets you load a 70B model on 24GB of VRAM. Used heavily by peft for QLoRA.
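
What loading a quantized model looks like, as a sketch - assumes a CUDA GPU with bitsandbytes installed; "org/model-name" is a placeholder for any causal LM on the Hub:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "org/model-name",                            # placeholder, not a real repo
    quantization_config=cfg,
    device_map="auto",                           # spread across available GPUs
)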

diffusers

For image, video, and audio generation models. Stable Diffusion, FLUX, AnimateDiff, music gen. Same pipeline UX as transformers but for diffusion models.

sentence-transformers

For embeddings. Vector representations of text. Used everywhere in RAG. Different from transformers because optimized for embedding production, not generation.
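
The core RAG retrieval operation in a few lines (all-MiniLM-L6-v2 is a small, widely used embedder):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(["What is RAG?", "Retrieval-augmented generation, explained"])
print(vecs.shape)                                # (2, 384)

sim = vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(float(sim))                                # cosine similarity of the pair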

evaluate

Library for computing metrics (BLEU, ROUGE, accuracy, perplexity, plus custom). Used in training loops and benchmarks.

Spaces

Hosted demos. Build a Gradio or Streamlit app, push to Spaces, free hosting (with limits). Great for portfolio.

Hub (the website)

Where models, datasets, and Spaces live. Free for public, paid for private and Pro features. You'll spend hours here.

Inference Endpoints / Inference API / TGI

Hosted inference services. Useful for prototyping, expensive for production.

What competes with what

You'll see overlapping options. Honest comparison:

For...        | HF tool                   | Competition                        | Pick
Model serving | text-generation-inference | vLLM, Ollama, llama.cpp            | vLLM for prod, Ollama for local
Fine-tuning   | Trainer                   | axolotl, unsloth, lit-gpt          | Trainer + trl for flexibility; axolotl for less code
Embeddings    | sentence-transformers     | OpenAI embeddings API, Cohere      | sentence-transformers if self-hosting; APIs if not
Datasets      | datasets                  | pandas, polars, raw files          | datasets for ML workflows
Vector DB     | (none)                    | Chroma, Qdrant, Weaviate, pgvector | Qdrant or pgvector for prod
Eval          | evaluate, lm-eval-harness | Promptfoo, Ragas, custom           | lm-eval-harness for benchmarks; Promptfoo for app eval

The pattern: HF's libraries are usually a strong default for model-side work. For infrastructure (serving, vector DBs), specialized tools tend to win.

Minimum-viable HF skill

You should be able to:

  1. Load any model from the Hub: AutoModel.from_pretrained("org/model-name").
  2. Use pipeline("text-generation", model=...) for quick inference.
  3. Fine-tune with Trainer on a dataset from datasets.
  4. LoRA fine-tune with peft + trl.
  5. Push a model or Space to your account.

If you can do all five, you're past the on-ramp.
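
Items 1 and 2 of that list fit in four lines (gpt2 as a small stand-in for any Hub model):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The key idea behind LoRA is", max_new_tokens=30)
print(out[0]["generated_text"])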

The HF version pain

HF moves fast and breaks things. A tutorial from 6 months ago might not work. Solutions:

  • Pin versions in requirements.txt when you find a working combo.
  • Check release notes for breaking changes before upgrading.
  • Use transformers[testing] extras when you need the full test toolchain.

This isn't HF being careless; it's that the field is moving and they're tracking it. Plan for churn.

How to keep up

  • HF blog (huggingface.co/blog) - solid technical posts.
  • @huggingface on Twitter/X - release announcements.
  • The Hub leaderboards - open LLM leaderboard, embedding leaderboard.
  • Daily papers (huggingface.co/papers) - curated arXiv. Use this instead of trying to read all of arXiv.

What you might wonder

"Does everyone in production use HF?" For training and prototyping: very often. For serving: usually not their inference services (cost). The libraries are widely used; the hosted services less so.

"Are HF models commercially usable?" Depends on the license. Some yes (Apache 2.0, MIT, Llama license with caveats), some no (research-only). Always check the model card.

"What about Anthropic / OpenAI / Google models?" Closed-source. Different mental model: you call an API, you don't own the weights. Often the right answer for product features; the wrong answer when you need control, customization, or low cost at scale.

Done

  • Mapped the HF ecosystem.
  • Know which HF tools to use and which to swap.
  • Have a "minimum-viable HF skill" target.

Next: Picking a specialization →

07 - Picking a specialization

What this session is

"AI engineer" is not one job. It's six. This page lays them out, honestly, with skills, day-to-day, salary range, and how to break in.

Why specialization matters now

In 2020 you could be a generalist. In 2026 the field has differentiated. Hiring managers ask "what kind of AI engineer" and expect a specific answer. "All of it" reads as "none of it."

Pick one direction by month 4. Build your portfolio around it. You can always pivot - but pivoting without ever specializing means you never look hireable to anyone.

The six specializations

1. Applied LLM engineer

Day-to-day: Building product features on top of LLM APIs (OpenAI, Anthropic) or self-hosted models. Prompt engineering. RAG pipelines. Tool use. Evals. Agents.

Skills: Strong Python. API integration. Vector DBs. Frontend competence helpful. Eval discipline.

Salary range (US, 2026): $130-220K early career; $200-400K senior.

Hires the most. Lowest barrier to entry. Easy to demonstrate via portfolio (build apps). Most competitive for entry roles because everyone is targeting these. The senior end is wide open.

Break in by: Build 3 real LLM apps. Write them up. Contribute to LangChain / LlamaIndex / similar. Get good at evals.

2. Inference / serving engineer

Day-to-day: Making models run fast in production. vLLM, TGI, custom kernels. Latency optimization. Cost optimization. Quantization. Batching strategies. GPU scheduling.

Skills: Strong systems background (the "polyglot" engineer profile). C++/CUDA helpful. Kubernetes. Profiling tools. PyTorch internals.

Salary range (US, 2026): $180-280K early; $300-600K senior.

Hires steadily. Higher barrier; fewer candidates with the right systems background. Excellent fit if you came from backend/SRE.

Break in by: Contribute to vLLM. Benchmark inference setups. Write blog posts comparing serving strategies. Build a serving setup for an open model and document end-to-end.

3. Fine-tuning / training engineer

Day-to-day: Train custom models. Fine-tune with LoRA/full FT/RLHF/DPO. Manage training infrastructure. Hyperparameter sweeps. Data curation. Eval-driven iteration.

Skills: PyTorch deep. Distributed training (DeepSpeed, FSDP). Familiarity with paper-reading. Data work.

Salary range (US, 2026): $180-300K early; $300-500K senior.

Hires fewer than serving. Many companies use off-the-shelf models. Real demand at frontier labs (Anthropic, OpenAI, DeepMind, smaller research orgs).

Break in by: Fine-tune and publish 2-3 open models. Write up the experiment design. Contribute to trl, peft, transformers.

4. MLOps / platform engineer

Day-to-day: Build the platform other AI engineers use. Experiment tracking. Model registry. Deployment pipelines. Monitoring. Feature stores. Data pipelines for ML.

Skills: Strong infra/DevOps. Kubernetes. Airflow/Dagster. Familiar with the ML lifecycle. Some ML enough to talk to ML engineers.

Salary range (US, 2026): $160-260K early; $250-450K senior.

Hires consistently. The non-glamorous backbone. Often the easiest pivot from existing DevOps/SRE.

Break in by: Build an end-to-end MLOps stack for a personal project. Contribute to MLflow / Kubeflow / similar. Run your own model serving in K8s and document.

5. ML researcher / research engineer

Day-to-day: Reproduce papers. Run experiments. Propose new architectures, training objectives, evaluation methods. Write papers. (Research engineers do less paper-writing, more implementation.)

Skills: PhD or equivalent published work for researcher. For research engineer, very strong ML + systems but no PhD required. Mathematics. Paper-reading speed.

Salary range (US, 2026): $200-400K early; $400-1M+ senior at frontier labs.

Hires few but pays the most. Frontier labs. Highly competitive.

Break in by: This roadmap probably isn't your path. Research roles want PhDs or extraordinary equivalents. If you're set on this, the path is academia first.

6. AI safety / evaluation engineer

Day-to-day: Build evaluation pipelines. Red-team models. Measure capabilities and harms. Build alignment tooling. Write up findings.

Skills: Eval discipline. Critical thinking. Some ML. Strong writing.

Salary range (US, 2026): $180-300K early; $300-500K senior.

Hiring grew fast 2024-2025. Frontier labs and AI safety orgs (Anthropic, Apollo, METR, AI Safety Institutes). Mission-driven.

Break in by: Public evaluation work. Replicate published evals. Blog about failure modes. Apply to safety-specific fellowships.

How to choose

Three honest questions:

  1. What's your existing background?
     • Backend/SRE → serving or MLOps.
     • Frontend/product → applied LLM.
     • Researcher/ML-adjacent → fine-tuning or research engineer.
     • DevOps platform → MLOps.

  2. What do you want your day to look like?
     • Building product? Applied LLM.
     • Debugging GPU memory? Serving.
     • Reading papers? Research engineer or fine-tuning.
     • Building pipelines? MLOps.
     • Probing failure modes? Safety.

  3. What's the market in your area / target company?
     • Some specializations cluster geographically (research at SF/UK hubs; applied LLM everywhere).
     • Check job postings at companies you'd want to join. What's the actual mix?

What if you pick wrong

You can switch within the field. Applied LLM → serving is a common one. Serving → fine-tuning is harder (lots of math gap). Research → applied is easy (downward in pay, easier transition).

The cost of switching specialization mid-roadmap: 2-3 months of regroup. Not catastrophic. Don't paralyze yourself trying to "perfectly" pick.

What you might wonder

"Can I just be a generalist?" For your first AI role, no. The portfolio has to read as something specific. After 1-2 years on the job, generalist makes sense again.

"Aren't agents going to replace all this?" Maybe. Eventually. Not on a 12-month timeline. Build for the world that exists.

"Is it too late?" For research, harder than 5 years ago. For applied / serving / MLOps, the market is wider than ever because every company suddenly needs AI capabilities.

Done

  • Know the six specializations.
  • Have a leaning toward one (or two to compare further).
  • Will pick by end of month 4.

Next: Reading papers without drowning →

08 - Reading papers without drowning

What this session is

How to engage with AI research without burning out. The discipline most engineers don't have but should.

The trap

There are ~10,000 AI papers per month on arXiv. People who try to "stay current" by reading all of them burn out by month 3. People who don't read any never level up past tutorial-follower.

The right answer is curated, slow, deliberate. Maybe one paper a week, read deeply. Plus a fast skim queue for awareness.

The two reading modes

Skim mode (5 minutes per paper)

For breadth. ~10 papers per week. Goal: know it exists, vaguely what it does, whether it's worth a deep read later.

Read in order:

  1. Title.
  2. Abstract.
  3. The headline figure (usually Figure 1 or 2).
  4. First sentence of each section.
  5. Final paragraph of the conclusion.

That's it. 5 minutes. Move on.

Tag papers worth a deep read.

Deep mode (2-3 hours per paper)

For depth. 1 paper per week, sometimes 2. The paper you tagged from skim. Goal: understand it well enough to explain to a colleague.

Read in order:

  1. Abstract + intro. Get the problem.
  2. Related work. Skim; skip if you know it.
  3. Method. Slow. Re-read until you can explain the diagram out loud.
  4. Experiments. Look at the tables. What did they ablate? What's the baseline?
  5. Limitations / discussion. Honest? Hand-wavy?
  6. The code. Most influential papers release it. Read the actual implementation. Where the paper and the code disagree, the code is what's true.
  7. A blog post / video / podcast about it. Cross-reference your understanding.

Take notes. Specifically: write 5 sentences answering "what's the new idea, why does it matter, what was the baseline, what does it cost, what would I do with it."

What to read

The 10 papers every AI engineer should have skimmed

(2026 edition; the list updates)

  1. Attention Is All You Need (Vaswani et al, 2017) - the transformer paper.
  2. BERT (Devlin et al, 2018) - bidirectional transformer for understanding.
  3. GPT-3 paper (Brown et al, 2020) - scaling laws + few-shot.
  4. Chinchilla (Hoffmann et al, 2022) - the compute/data tradeoff.
  5. LoRA (Hu et al, 2021) - low-rank adaptation.
  6. InstructGPT / RLHF (Ouyang et al, 2022) - how we made GPT helpful.
  7. Constitutional AI (Bai et al, 2022) - Anthropic's alignment approach.
  8. Mixture of Experts (Switch Transformer) (Fedus et al, 2021) - sparsity for scale.
  9. vLLM (PagedAttention) (Kwon et al, 2023) - serving with KV cache management.
  10. DPO (Rafailov et al, 2023) - preference fine-tuning without RL.

Most are findable on arXiv. Many have annotated walkthroughs (The Illustrated Transformer, etc.).

Where to find what's worth reading

  • HuggingFace Daily Papers (huggingface.co/papers) - curated daily list, voted by community. Use this instead of raw arXiv.
  • arXiv-sanity-lite (arxiv-sanity-lite.com) - Karpathy's filter. Filter by tags.
  • Papers With Code (paperswithcode.com) - sorted by benchmarks. Useful for finding state-of-the-art in your specialization.
  • Newsletters: Sebastian Raschka's "Ahead of AI", Jack Clark's "Import AI". One per week, no more.
  • Twitter/X: follow 10-20 researchers in your specialization. Their retweets surface things.

Where NOT to look for papers

  • Twitter timeline directly. Will eat your week.
  • arXiv firehose. Drowns you.
  • LinkedIn AI influencers. Mostly recycled content.

Reading specific specializations

By specialization, here's what's worth tracking:

  • Applied LLM: new prompting techniques (rare these days), agent frameworks, eval papers, retrieval improvements. ~5 papers per month.
  • Inference / serving: vLLM team output, FlashAttention variants, kernel-level optimization papers, MoE serving. ~3 papers per month.
  • Fine-tuning: new PEFT methods, preference algorithm variants, synthetic data papers. ~5 papers per month.
  • MLOps: less paper-driven. Read blog posts (Anyscale, Databricks, Modal). Conference talks (Ray Summit, MLOps World).
  • Research engineer: depends on subfield. Talk to your lab's mentors.
  • Safety: Anthropic / OpenAI / DeepMind safety blogs. METR. Apollo Research. AI safety reading lists curated by AGI Safety Fundamentals.

How to know if a paper is solid

Honest tells:

  • Code released? Major positive signal.
  • Multiple seeds in experiments? A single seed hides noise; good papers report variance across seeds.
  • Honest comparison to strong baselines? Suspicious if their baseline is weirdly weak.
  • Limitations section that admits real things? Good papers do this.
  • Reproduced by others? Look for follow-up papers citing it. Did others get similar results?
  • Author track record? Researchers at top labs aren't always right, but they're usually rigorous.

Red flags: dataset-cherry-picking, only one seed, no released code, breathless claims, no comparison to obvious baselines.

What to do with what you read

  • Add the 5-sentence summary to a notes file.
  • If you can apply the idea to your project, try a minimal experiment.
  • Talk about it (blog, lunch chat, Slack). You learn by explaining.

Reading without writing or applying decays in days. Reading + applying makes it stick.

What you might wonder

"Do I need to read papers to be an applied AI engineer?" A few, yes. Enough to converse with researchers and not look ignorant in interviews. You don't need to be at the frontier.

"What if the math defeats me?" Try anyway. Get what you can. Look up specific equations. Often the math is more intimidating than the actual idea. Karpathy's videos and "The Annotated X" series rescue many readers.

"Where do PhD-level researchers learn to read this fast?" Reading 10x more papers over 6 years. There's no shortcut. You don't need to match them. You need a working pace, sustained.

Done

  • Have two-mode reading discipline.
  • Have a curated source list.
  • Know the 10 papers to skim first.

Next: Building in public →

09 - Building in public

What this session is

Why and how to share your work as you learn. The non-obvious part of breaking into AI engineering.

Why this matters more than people admit

The traditional "build resume → apply → get filtered by ATS" funnel is broken in AI specifically because:

  • The market is flooded with applicants for entry roles.
  • Recruiters can't tell from a resume who actually understands LLMs vs who took a Coursera course.
  • Hiring managers increasingly pre-filter by signals outside the resume: GitHub, blog, Twitter, a working demo.

If a hiring manager Googles your name and finds nothing, you're starting at zero. If they find a blog with real technical writing, an active GitHub, a Hugging Face profile with a couple of models, a working demo - you're starting in the "let's interview this person" pile.

This effect is more pronounced in AI than in traditional software engineering.

What "building in public" means

Three behaviors:

  1. Share your work as you do it, not after.
  2. Write honestly about what you tried, what worked, what didn't.
  3. Stay consistent for months, not weeks.

It does not mean: hot takes on Twitter, follower-chasing, motivational posts, AI-generated threads.

The minimum viable public footprint

By month 6 you should have:

  • GitHub profile that looks alive. Not 80 forks. A few real repos with READMEs, commits across several months. Pinned repos representing your best work.
  • Hugging Face profile with at least one model or dataset you uploaded.
  • A personal site or blog with 3-5 honest technical posts.
  • One social account (Twitter/X, Bluesky, or LinkedIn) where you occasionally share what you're building. Not daily. Weekly is fine. Don't overthink it.

That's it. No need for newsletters, podcasts, YouTube. Diminishing returns past the basics.

What to write about

Tutorials are saturated. Don't add another "intro to RAG." Instead:

  • "I built X and here's what surprised me." Specific projects. Honest reactions.
  • "I benchmarked A vs B." Comparison posts age well and get linked.
  • "I tried to reproduce paper X and here's what happened." Honest reproduction work is valuable and uncommon.
  • "Here's what I got wrong about X and learned." Vulnerability posts often resonate.
  • "Here's the bug that took me 4 hours." Future-you and other engineers will Google this.

What to avoid:

  • "Top 10 AI tools" listicles.
  • "Why I think AGI will arrive in 2027." (You don't know.)
  • AI-generated content. Readers notice.
  • "I'm learning AI" posts with nothing to show.

Formats that work

In order of return:

  1. Working demos (HF Space, Modal app, Vercel deploy). One demo > ten blog posts.
  2. Long-form technical write-ups (1500-3000 words). On a personal site or Substack. Cross-post excerpts.
  3. Annotated GitHub repos. README quality matters more than star count.
  4. Short videos (5-15 min walking through code on your screen). Most underused format.
  5. Tweet threads or Bluesky posts linking to the above.

Consistency over volume

Six honest posts over six months beats sixty posts in the first month.

The signal a hiring manager wants: this person showed up consistently to a hard thing. That's only visible across time.

The 10x non-obvious move: publish models

Most aspiring AI engineers build apps; almost none publish models. Even a tiny fine-tuned model on a niche task - pushed to the Hub with a real model card - sets you apart.

Examples that have landed real jobs:

  • A LoRA fine-tune of a small model for a domain-specific task (legal, medical, code).
  • A re-implementation of a paper's method, with benchmarks.
  • A merged model variant.
  • A dataset for an under-served task, with documentation.

Time investment: 1-2 weeks. Signal: enormous.

The other 10x move: contribute to OSS

Already covered in Open source as resume. The short version: an accepted PR to vLLM / transformers / langchain weighs more than a personal repo with 1000 stars.

How to start

Week 1:

  • Set up a GitHub profile README.
  • Create a Hugging Face account.
  • Pick one platform (Twitter/X, Bluesky, LinkedIn). Don't fragment yet.
  • Buy a domain (yourname.dev or similar). Set up a one-page site.

Weeks 2-4:

  • Write your first post. Make it about something you actually built or learned this week.
  • Push your first working repo with a real README.
  • Don't worry about audience. Audience comes much later, if at all.

Month 2 onward:

  • Post once every 1-2 weeks. Schedule it.
  • Engage genuinely with 5-10 accounts in your specialization.
  • Don't chase trends. Stay on your specialization.

The grim truth about audience

Most of your posts will get no engagement for the first 6-12 months. That's normal. Audience compounds slowly. The people you need to reach are not the masses - they're the 10-100 hiring managers who'll find your work when they're considering you.

If you check stats more than once a week, you'll quit. Don't.

Optional: newsletters, podcasts, YouTube

If you genuinely enjoy them, do them. They're force multipliers if sustained. They're black holes if you're doing them for the wrong reasons. Default to "no."

What you might wonder

"What if my employer doesn't allow side projects?" Some companies require IP assignment for everything. Check. Negotiate. Most are flexible for non-competing work. Worst case: build on personal time, with personal hardware, on topics unrelated to your day job.

"What if I'm not 'good at writing'?" Doesn't matter. Honest, specific, technical writing wins. Polish less, ship more.

"What if I sound stupid?" You will sometimes. Everyone does in retrospect. Posts from a year ago will embarrass you - that's the proof you've grown.

"Should I use AI to write the posts?" Use it to edit. Don't use it to write. Readers can tell. It's an immediate trust-killer when noticed.

Done

  • Set up the minimum viable footprint.
  • Have a posting cadence target.
  • Picked formats over chasing every channel.

Next: Your portfolio - 3 projects →

10 - Your portfolio: 3 projects

What this session is

The specific 3-project portfolio that opens interviews. What each project demonstrates, how to scope it, how long it should take, and what NOT to build.

Why three

Two reasons:

  1. Hiring managers can't read 10 projects. They open your GitHub, look at 2-3 pinned repos, decide. If the top three look strong, you're in. If they look weak, you're out.
  2. Three projects = three different proofs. One shows breadth. One shows depth. One shows you can ship.

More than three pinned projects dilutes the signal. Fewer than three feels thin.

The portfolio formula

The strongest 3-project portfolios I've seen follow this pattern:

  1. A clear product demo - proves you can ship something a user could use.
  2. A reproduced or extended paper - proves you can read research and implement.
  3. An OSS contribution - proves you can work in someone else's codebase.

Three different muscles. Together: shippable, technical, collaborative.

Project 1: A clear product demo

Scope: A working application that uses LLMs (or your specialization's models) to do something specific and useful. Hosted live. Looks decent. Has a README. Has at least basic evaluation.

Examples that have landed jobs:

  • A RAG-powered Q&A bot over a specific corpus (a textbook, the python docs, a podcast archive).
  • A code-review assistant for a specific framework.
  • An LLM-powered tool for a niche profession (lawyers reviewing contracts, doctors summarizing notes - with appropriate disclaimers).
  • A model-comparison playground for a specific task.
  • An agent that automates a real workflow you actually use.

What makes it strong:

  • Solves a real, specific problem. Not "an AI chatbot."
  • Live and working (Vercel, Modal, Hugging Face Space, Railway).
  • Has an "eval" section - how do you know it works?
  • Has a write-up that's honest about limitations.

What kills it:

  • Yet another generic chatbot.
  • Demo broken or behind a login wall recruiters can't access.
  • No write-up - just code.
  • Uses OpenAI for everything with no thought about cost/latency.

Time investment: 4-6 weeks.

Project 2: A reproduced or extended paper

Scope: Pick a paper from your specialization. Re-implement the key method. Compare your results to the paper's claims. Write up the discrepancies honestly.

Examples:

  • Reproduce a LoRA fine-tune from a published paper, on a different base model.
  • Reproduce an evaluation result (e.g., a paper claiming model X beats model Y at task Z - does it?).
  • Re-implement a serving optimization (FlashAttention from scratch, or a small piece of vLLM's KV cache management).
  • Compare two preference algorithms (DPO vs ORPO) on the same dataset.

What makes it strong:

  • Tackles a paper from the last 12-18 months (current).
  • Honest about deltas: "I got 73% where they got 78%; here's what I think explains the gap."
  • Code is clean and runnable from scratch.
  • Write-up is technical and specific.

What kills it:

  • Toy paper, decade old.
  • Copy-paste of someone else's repo, lightly rebranded.
  • Claims to "beat the paper" with hand-wavy methodology.

Time investment: 3-5 weeks.

Project 3: An OSS contribution

Scope: A merged PR to a well-known AI OSS project. Doesn't need to be huge. The bar is "real PR in a real codebase."

Strongest targets:

  • huggingface/transformers
  • huggingface/peft or trl
  • vllm-project/vllm
  • langchain-ai/langchain or run-llama/llama_index
  • axolotl-ai-cloud/axolotl

What makes it strong:

  • The PR is actually merged, not stale-and-closed.
  • It's not just a typo fix (that's table stakes, not a portfolio item; a few of those help warm up).
  • The maintainer thread shows back-and-forth - you addressed feedback.
  • Your linked GitHub profile shows activity over time.

What kills it:

  • "Contributed" issues without PRs.
  • One-character typo fix as the headline.
  • Closed/rejected PRs presented as wins.

Time investment: 4-8 weeks for a first non-trivial merged PR. Often longer for big projects.

This is also the project that takes the longest to start - you'll spend weeks just orienting in the codebase. That's normal. See Open source as resume.

What NOT to put in your portfolio

These read as red flags or as "junior":

  • ❌ MNIST classifier. Done by every tutorial.
  • ❌ Generic chatbot wrapping OpenAI. Done by everyone.
  • ❌ Stable Diffusion image generator with no specific application.
  • ❌ "Coursera capstone projects." Generic; signals you only do guided work.
  • ❌ Anything you can't explain in detail in 10 minutes.
  • ❌ Forks with no original work.

Polish standard

Each pinned repo needs:

  • README with: what it does, screenshots/gif if visual, install + run instructions, eval results, limitations, license.
  • Working installation from a fresh clone. Test it on a clean machine before pinning.
  • CI that's green (basic tests, lint).
  • A LICENSE file.
  • No committed secrets, no committed model weights, no committed data. Use .gitignore properly.

The bar isn't "production-quality." The bar is "someone could clone this and run it without help."

How they fit on a resume / LinkedIn

On your CV:

Projects
- {name}: {one sentence}. [link]
- {name}: {one sentence}. [link]
- {name}: {one sentence}. [link]

On LinkedIn featured items, same three. On the GitHub profile README, same three.

Hiring managers see consistency across surfaces and assume there's a coherent person behind it.

What you might wonder

"Can I have more than three?" Sure, in your repo list. Pin three. The pinned three are the signal.

"What if my project ideas overlap with my employer's IP?" Don't risk it. Pick a different domain. Many strong portfolios are in unrelated niches.

"What if I'm not great at frontend?" Use Gradio or Streamlit. They look adequate. Nobody expects a designer's UI for an AI project. Make it functional and clean.

"What if my paper-reproduction project doesn't reproduce?" Write that up honestly. "I couldn't reproduce X; here's what I think happened" is a strong portfolio piece. Many published results don't reproduce; honest reproduction work is valuable.

Done

  • Have the 3-project formula.
  • Have target candidates for each.
  • Know what to avoid.

Next: Evaluating job postings honestly →

11 - Evaluating job postings honestly

What this session is

How to read AI job postings without getting demoralized or scammed. What requirements actually mean. What "preferred" really translates to. Red flags vs green flags.

The lying isn't intentional

AI job postings are bad signals because:

  • Recruiters often write them without understanding the role.
  • The "requirements" are aspirational, not literal.
  • Companies copy-paste each other's requirements.
  • "5+ years of LLM experience" is impossible (the field is 3 years old in its current form) - but it's written anyway.

If you take them literally, you'll never apply. The trick is reading between the lines.

The general decoder

What the post says → what it likely means:

  • "5+ years experience with LLMs" → has shipped something with an LLM.
  • "PhD preferred" → bonus, not required, unless it's a research role.
  • "Expert in PyTorch" → can debug a training loop.
  • "Strong mathematical background" → won't be intimidated by a derivative.
  • "Production experience" → has put something behind an HTTP endpoint.
  • "Familiarity with distributed systems" → knows what Kubernetes is.
  • "Publications in top venues" → only literal for research roles.
  • "Self-starter" → manager won't define your tasks.
  • "Fast-paced environment" → understaffed.
  • "Wear many hats" → understaffed and not hiring more.
  • "Stock options" → pay attention to the equity terms before negotiating.

If you hit ~60-70% of the listed requirements, apply. Hitting 100% means you're overqualified or the post is too vague.

Red flags

These mean: pass.

  • No salary band listed in jurisdictions that require it. US/UK/EU postings increasingly must disclose. Refusal hints at games during negotiation.
  • Vague responsibilities ("you'll work on AI things"). Often means the role isn't defined; you'll end up doing whatever the company needs that week, part researcher, part prompt engineer.
  • "Looking for someone passionate about AI." Code for "we'll underpay you because the work is its own reward."
  • No mention of what model / stack / problem. Generic.
  • Buzzword salad ("GenAI, AGI, agents, multimodal, RAG, fine-tuning, MLOps"). Means the company doesn't know what it wants.
  • "Must be available 24/7" / "thrives under pressure" / "willing to wear many hats." Burnout shop.
  • Asks for unpaid take-homes longer than 4 hours, or live coding without a base. Disrespectful.
  • No engineering blog, no GitHub presence, no public talks. Means you can't validate from the outside.
  • Company name you've never heard of with an "AI" suffix added in 2024. Many of these are dropshipped wrappers around OpenAI.
  • VC-funded but no shipped product after 12+ months. Possible cash burn / pivot ahead.

Green flags

These mean: apply, even if you don't match every line.

  • Specific stack mentioned (e.g., "PyTorch, vLLM, AWS, K8s, our LLM evals are in Promptfoo"). Means the team knows their setup.
  • Specific problem mentioned ("building RAG over legal contracts" / "optimizing serving latency for our 70B model").
  • Engineering blog with technical depth (not just marketing).
  • Public GitHub presence with active repos.
  • Clear interview process described (sometimes on the careers page).
  • Salary band published.
  • Mention of evaluation discipline. "We measure our model outputs against [specific benchmarks]" is rare and excellent.
  • Hiring manager visible on LinkedIn. You can research them.

Company types and what they offer

Big tech (Google, Meta, Microsoft, Amazon, Apple, Nvidia)

  • Pay: highest. $250-600K+ all-in.
  • Stability: moderate. AI teams have been reorged frequently.
  • Visibility: high. Resume value.
  • Bureaucracy: high. Real-world impact often slow.
  • AI work: real, varied. Includes both frontier work and product integrations.

AI labs (Anthropic, OpenAI, DeepMind, xAI, Mistral, Cohere)

  • Pay: very high. $300K-1M+ for senior; lower for new grads but still strong.
  • Stability: changing fast.
  • Visibility: highest in the field.
  • Selectivity: brutal. Top 1-2% of applicants.
  • AI work: the frontier. Research-engineering blend.

AI startups (Series A-C)

  • Pay: moderate to high. $150-300K + meaningful equity.
  • Stability: low to moderate. Many fail.
  • Visibility: depends.
  • Bureaucracy: low.
  • AI work: often more applied than frontier. Sometimes a thin wrapper around APIs.

Non-AI companies adopting AI (most of the market)

  • Pay: moderate. $130-220K.
  • Stability: higher.
  • Visibility: lower.
  • Bureaucracy: depends.
  • AI work: mostly applied. Building product features. The bulk of hiring.

"AI-first" early stage startups (seed / pre-seed)

  • Pay: lower base, more equity.
  • Stability: lowest.
  • Visibility: varies.
  • AI work: intense.
  • Risk: highest. Many won't exist in 18 months.

How to research a company before applying

10 minutes per company:

  1. Engineering blog? Read 2 recent posts.
  2. GitHub? Active? Open source anything useful?
  3. Glassdoor + Blind? Salary signals, culture signals. Take with salt.
  4. Funding stage? If startup, runway?
  5. Recent layoffs? levels.fyi, Layoffs.fyi, news.
  6. Hiring manager on LinkedIn? Their background tells you what they value.
  7. Product actually exists and works? Try the free tier if there is one.

If 4+ of these are positive, apply.

The application math

Honest numbers from 2025-2026 entry-level applicants:

  • 80-200 applications per offer.
  • 10-20% reply rate (mostly auto-rejection or "we'll keep your resume").
  • 5-10% first-round interview rate.
  • 1-3% final offer rate.

This is brutal, but normal. If you're getting 0 first-round interviews after 50 applications, the resume + portfolio need work. If you're getting first rounds but no offers, the interview prep is the gap (see page 13).

What you might wonder

"Should I lie about years of experience?" No. They check. Frame your relevant work honestly - projects, OSS, side work - and let the interview demonstrate ability.

"Should I apply to roles I'm slightly under-qualified for?" Yes. The posted requirements are ceilings, not floors.

"What about visa / location restrictions?" Some companies sponsor; many don't. Filter early. Don't waste cycles applying to roles you can't accept.

"Should I take a contract role to get experience?" Yes, often. Contracts to FTE is a common path in AI. Less competition, faster decisions, real production exposure.

Done

  • Can decode posting language.
  • Have a red-flag / green-flag filter.
  • Have realistic application-math expectations.

Next: Open source as resume →

12 - Open source as resume

What this session is

Why OSS contributions outweigh almost everything else on an AI resume, and how to get them right.

The honest claim

A single merged non-trivial PR to huggingface/transformers or vllm-project/vllm is worth more than a year of unverified "AI experience" on a resume.

Why: it's verifiable. The hiring manager can read your PR. They can read the maintainer thread. They can see how you take feedback. They can see whether your code compiles. None of those are visible from a job title.

This is especially true if you're transitioning from outside AI. OSS contributions bridge the gap that "I taught myself AI for 12 months" otherwise can't.

What counts and what doesn't

Counts:

  • A merged PR with real code changes.
  • A merged PR with a meaningful test added.
  • Sustained contributions to one project over months.
  • A documentation PR that fixes a real gap.
  • A new feature merged after maintainer review.
  • A bug-fix PR with the failing test written first.

Doesn't count:

  • Starring a repo.
  • Forking and "improving" something privately.
  • One typo fix, claimed as a contribution.
  • "Issue created."
  • A PR opened but never merged.
  • A 5000-line PR maintainers won't review.

The bar isn't "contribution exists." The bar is "real engagement with a project."

The high-leverage projects to contribute to

For each specialization, the projects whose names recruiters recognize:

Applied LLM

  • langchain-ai/langchain
  • run-llama/llama_index
  • microsoft/autogen
  • crewAIInc/crewAI
  • promptfoo/promptfoo

Inference / serving

  • vllm-project/vllm
  • huggingface/text-generation-inference
  • ollama/ollama
  • ggerganov/llama.cpp
  • sgl-project/sglang

Fine-tuning

  • huggingface/transformers (Trainer-side)
  • huggingface/peft
  • huggingface/trl
  • axolotl-ai-cloud/axolotl
  • unslothai/unsloth

MLOps

  • mlflow/mlflow
  • kubeflow/kubeflow
  • ray-project/ray
  • wandb/wandb (client library)
  • bentoml/BentoML

Eval / safety

  • EleutherAI/lm-evaluation-harness
  • explodinggradients/ragas
  • promptfoo/promptfoo
  • confident-ai/deepeval

Pick ONE. Get deep. Three superficial contributions across three projects are worth less than five real PRs into one.

How to start (the 4-week plan)

Week 1: Orient

  • Clone. Build. Run tests. Get the dev loop working.
  • Read CONTRIBUTING.md.
  • Read 5 recently merged PRs end-to-end to see what "good" looks like in this project.
  • Lurk in the Discord / Slack / GitHub Discussions if one exists.

Week 2: First small PR

  • Find a good first issue or a clear typo/doc gap.
  • Open the PR. Follow the template exactly.
  • Address feedback promptly.
  • Goal: one merged PR.

Week 3: Real bug fix

  • Find a bug issue that's been triaged but unassigned.
  • Reproduce it.
  • Write a failing test first.
  • Fix it. Make the test pass.
  • Open the PR with both the test and the fix.
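
The failing-test-first step is the one people skip. A minimal sketch of the week-3 workflow with pytest; the function and the bug are hypothetical, not from any real project:

```python
# Hypothetical sketch of failing-test-first (pytest). normalize_scores and
# its bug are illustrative.

def normalize_scores(scores: list[float]) -> list[float]:
    total = sum(scores)
    if total == 0:  # the fix; before this guard, all-zero input crashed
        return [0.0 for _ in scores]
    return [s / total for s in scores]

def test_normalize_handles_all_zero_scores():
    # Written first: it reproduced the ZeroDivisionError, and now passes.
    assert normalize_scores([0.0, 0.0]) == [0.0, 0.0]
```

The PR then contains the test and the fix together, and the test doubles as a reproduction of the bug report.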

Week 4: Small feature or improvement

  • Find a help wanted or feature-request issue with maintainer interest.
  • Comment to claim. Wait for approval.
  • Implement. Add tests. Update docs.

After this 4-week sprint, you have 2-3 merged PRs in one project. That's the start of a real contribution profile.

How not to flame out

The common failure mode: big, ambitious first PR. The PR sits for months. You give up.

Avoid this by:

  • Starting small.
  • Reading maintainer signals (which issues do they actually respond to?).
  • Asking before doing: "I'd like to take on issue #1234. Is that direction correct?" Wait for a 👍.
  • Splitting big work into small PRs. Each PR should be reviewable in 30 minutes.

How to talk about it in interviews

Wrong: "I contributed to PyTorch."

Right: "I contributed three PRs to vLLM over four months - one fixed a memory leak in the KV cache eviction logic, one added support for [specific model], one improved the docs around [specific topic]. The first one involved working with the maintainer to redesign the test."

Specifics. Always specifics. Hiring managers can smell vagueness.

The "I want to but I'm intimidated" problem

Common. The fix is to do the smallest possible thing first.

  • Pick a project.
  • Open CONTRIBUTING.md. Read it.
  • Find the smallest doc PR. A typo. A confusing sentence.
  • Open the PR. Get it merged.
  • The intimidation evaporates after the first merge.

The first PR is the hardest. Everything after is easier.

What if a project rejects your PR

Normal. Doesn't reflect on you personally.

Reasons:

  • They didn't want that feature.
  • Style mismatch.
  • Already being worked on internally.
  • Bad timing (release week, maintainer vacation).

Try a different issue. Or a different project. Don't dwell.

The compounding effect

Sustained contributions to one project for 6+ months make you a contributor, not just an outsider with PRs. You'll be:

  • Cc'd on related issues.
  • Asked to review others' PRs.
  • Mentioned in release notes.
  • Recommended to companies hiring in the space.

This is the long game. The first PR is the start.

What you might wonder

"What if my PR sits for weeks?" Normal for big projects. Polite check-in after a week. Sometimes longer. Don't take it personally.

"What if I don't know the language well enough?" Pick the project carefully. langchain is mostly Python. llama.cpp is C/C++. vllm is Python with CUDA. Match your skill.

"What if there are no 'good first issues' I can do?" Look at recent merged PRs. The shape of contributions tells you what's wanted. Sometimes the unlabeled issues are easier than the labeled ones, which get claimed instantly.

"Should I contribute to research code (e.g., a paper's reference implementation)?" Less recruiter-friendly than mainline projects. Fine if it's your specialization. Better as project #4 after the well-known ones.

Done

  • Have one target project.
  • Have the 4-week start plan.
  • Know how to talk about contributions.

Next: Interview prep - what they actually ask →

13 - Interview prep: what they actually ask

What this session is

The interview loop for AI engineering roles, decoded. What each round is for, what they actually ask, and how to prep without LeetCode-burnout.

The typical loop

For applied AI / serving / MLOps roles in 2026:

  1. Recruiter screen (30 min). Background, motivation, salary, work auth.
  2. Hiring manager screen (30-45 min). Why this role, your relevant work, technical depth at conversational level.
  3. Technical screen (60 min). One of: light coding, ML concepts, system design lite.
  4. Onsite / virtual onsite (3-5 hours). Multiple rounds: deeper coding, ML/AI knowledge, system design, behavioral.
  5. Offer + negotiation.

Research roles add: paper review, research presentation, possibly a take-home.

What each round actually tests

Recruiter screen

Testing: Are you a real candidate? Do you understand the role? Salary expectations realistic?

Prep:

  • Have a 60-second background pitch ready.
  • Know your salary range (use levels.fyi, Blind, Payscale).
  • Have specific reasons for wanting this company.

Hiring manager screen

Testing: Are you technically credible in conversation? Do you understand AI at the level the role requires? Will you fit the team?

Prep:

  • Be able to describe your portfolio projects in detail.
  • Be able to talk about a recent paper or industry trend in your specialization.
  • Have 2-3 questions ready about the team and their AI stack.

Technical screen

Format varies wildly by company. Three common shapes:

a) Coding question. Often easier than software engineering rounds. Usually a Python problem involving strings, dicts, or simple algorithms. Sometimes implementing a small ML primitive (cosine similarity, softmax, attention from scratch). LeetCode-medium at worst.
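
The ML-primitive questions are closer to "translate a formula into code" than to algorithm puzzles. A sketch of two of the usual suspects, in plain NumPy:

```python
# Two common interview-screen primitives, sketched in NumPy.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the max first: exp() overflows on large logits otherwise.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: dot product over the norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Being able to write these from memory, and explain the max-subtraction trick, is roughly the level most screens are probing.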

b) ML / AI concepts oral exam. "Explain how attention works." "What's the difference between BERT and GPT?" "Walk me through fine-tuning vs in-context learning." "What's RAG and when would you use it?"

c) Pseudocode system design. "Design a RAG pipeline for [task]." "How would you serve a 70B model with strict latency requirements?"

You'll get a, b, or sometimes a+b in one screen.

Onsite rounds

For a typical applied AI loop:

  • Coding (1-2 hours): harder version of the screen. Sometimes implement a small attention or training loop from scratch.
  • AI/ML knowledge (1 hour): deeper oral exam. Often paper-related.
  • System design (1 hour): full design problem. RAG-at-scale, eval-at-scale, serving-at-scale.
  • Behavioral (45 min): STAR-format past-project questions.
  • Sometimes: take-home before onsite.

For research / fine-tuning roles, add:

  • Paper presentation: present a paper of your choice or one they assigned.
  • Research design: "How would you investigate X?"

For MLOps roles, add:

  • Infra design: Kubernetes, CI/CD for ML, monitoring.

What they actually ask (top 20)

These come up over and over:

  1. "Explain how a transformer works."
  2. "Explain attention. Why is it O(n²)?"
  3. "What is KV caching? Why does it matter for inference?"
  4. "What is quantization? When would you use it?"
  5. "Explain RAG. What are its failure modes?"
  6. "What's the difference between fine-tuning and in-context learning?"
  7. "What is LoRA? Why is it efficient?"
  8. "How would you evaluate an LLM-powered feature?"
  9. "What's perplexity? When is it misleading?"
  10. "What does temperature do in LLM sampling?"
  11. "Compare DPO and RLHF at a high level."
  12. "What's a vector database and how does it work conceptually?"
  13. "Walk me through a debugging session for a model that won't train."
  14. "How do you handle hallucinations in production?"
  15. "What metrics do you use for a RAG system?"
  16. "Design a system to serve 10M LLM requests/day with p95 latency under 1s."
  17. "Design a fine-tuning pipeline for [domain] with weekly updates."
  18. "Tell me about a project where you debugged something hard."
  19. "Tell me about a time you disagreed with a teammate technically."
  20. "Why this company specifically?"

If you can answer all 20 well, you're ready.
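
For questions 1-3 in particular, it helps to have written the core computation at least once. A minimal single-head scaled dot-product attention in NumPy; the O(n²) in question 2 is the n×n score matrix:

```python
# Minimal single-head scaled dot-product attention, sketched in NumPy.
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    # Q, K, V: (n, d) arrays for a sequence of n tokens.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # (n, n): the O(n^2) part of question 2
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                 # (n, d): weighted mix of value vectors
```

KV caching (question 3) falls out of this picture: during generation only the newest query row is new, so the K and V rows for past tokens can be stored and reused instead of recomputed.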

How to prep without burning out

Don't: grind 200 LeetCode problems. The coding bar for AI roles is lower than for SWE roles. Spending months on tree dynamic-programming puzzles is wasted effort.

Do:

  • Practice 30 Python LeetCode-easy/medium problems to be loose. That's enough.
  • Practice the "top 20" out loud. Record yourself. Listen back. You'll cringe; that's the point.
  • Mock interviews. Pramp, friends, paid services. 5 mocks > 50 hours of solo prep.
  • Practice system design by writing up 3-5 design docs for the canonical problems. RAG, eval, serving, training pipeline, agent workflow.
  • Re-read your own portfolio projects. You will be asked details. Be specific.

The behavioral round

Underprepped by engineers. STAR format:

  • Situation: brief context.
  • Task: what was your responsibility.
  • Action: what you specifically did.
  • Result: what happened, with numbers if possible.

Prepare 5 stories that cover: hard technical problem, conflict, failure-and-recovery, leadership-without-authority, ambiguity. Rotate across questions.

Avoid: "we did X" (use "I"), vague results, blaming.

Take-homes

Common for AI roles. Format: "build a small RAG system / fine-tune / serving setup / eval pipeline."

Tips:

  • Time-box. Don't spend 40 hours on a "4 hour" take-home.
  • Include a README covering what you built, what you'd do with more time, and limitations.
  • Include evals. They want to see eval discipline.
  • If they ask for unpaid work over 4 hours with no discussion of review time, that's a red flag.

Negotiation

Even on entry roles, negotiate. Companies expect it. The first offer usually has room.

  • Know your market band.
  • Get the offer in writing first.
  • Negotiate by email when possible. Time to think > on-the-spot.
  • Negotiate base, signing bonus, equity, start date, level. In that order of likely flexibility.
  • "Is there any flexibility on the base?" is the magic phrase.

Salary negotiation deserves its own book. Fearless Salary Negotiation by Josh Doody is the standard recommendation.

What you might wonder

"What if I get an LLM-coding question?" Some companies now allow / require using ChatGPT or Claude during the interview, watching how you collaborate with it. Practice this - using AI well in interviews is a separate skill. Don't pretend you "would never use AI"; that reads as dishonest.

"What if I freeze?" Common. Pause. Breathe. Say "let me think for a moment." Silence is OK. Talking-while-confused is worse.

"What if I bomb?" Most candidates bomb several interviews before landing. The data: average AI engineer offer takes 30-60 interviews across 20-30 companies. Failing isn't failure; it's the path.

"Should I take the first offer?" Usually no. Even if you accept, try to negotiate. Companies that retract over a polite negotiation aren't companies you want.

Done

  • Know the loop shape.
  • Know the top 20.
  • Have a prep plan that's not LeetCode-burnout.

Next: First 90 days on the job →

14 - First 90 days on the job

What this session is

What the first 90 days of your first AI engineering job actually look like. What to prioritize, what to avoid, and the moves that compound.

The mistake to avoid first

The number-one failure mode for new AI engineers: trying to ship impressive AI work in the first month. You don't yet know the codebase, the data, the team's conventions, the eval setup, or what "good" looks like here. Impressive work without context produces things that get reverted.

Instead, the first 30 days are about understanding the system you've joined.

The 30-60-90 frame

Days 1-30: Learn

Goal: become useful at small things, learn the system.

Daily:

  • Pair / shadow as much as possible.
  • Read existing code in the area you'll work on.
  • Read past PRs in that area. Understand the team's review style.
  • Read internal docs. Take notes. Ask questions.

Weekly:

  • Ship one small PR. Tiny scope. The point is to learn the dev loop, not to impress.
  • Run the eval pipeline end-to-end yourself, even if it's not your task.
  • Attend every team meeting. Listen more than you speak.

Don't:

  • Propose architecture changes.
  • Argue for tools the team isn't using.
  • Be the loudest person in design reviews.

By day 30 you should:

  • Have shipped 2-5 small PRs.
  • Know the names and roles of everyone you work with.
  • Be able to run the team's training, eval, and inference pipelines.
  • Have a map of the codebase, even if rough.

Days 31-60: Contribute

Goal: own one meaningful piece of work end-to-end.

Weekly:

  • Own one ticket of meaningful size. Drive it to completion. Ask for review early.
  • Continue shipping incidental small PRs.
  • Start contributing in design conversations, with opinions backed by data or experience from your first 30 days.

By day 60 you should:

  • Have shipped one feature or improvement that the team cares about.
  • Be the go-to person for at least one narrow area.
  • Have improved one piece of dev experience or docs for future hires.

Days 61-90: Lead a small thing

Goal: prove judgment, not just execution.

Weekly:

  • Take on a piece of work where the design is not fully specified. Propose the design. Get feedback. Implement.
  • Mentor anyone newer than you, even by one week.
  • Identify one process or tooling gap. Either fix it or write a doc proposing how.

By day 90 you should:

  • Have driven a piece of work with technical judgment, not just execution.
  • Have a real opinion about the team's tech direction, expressed constructively.
  • Have built trust with your manager and at least 2-3 teammates.

What "good" actually looks like at each level

For a junior AI engineer (entry / L3)

  • Ship reliably.
  • Ask good questions, often.
  • Take feedback well.
  • Don't break production.
  • Know your specific area well.

That's it. Don't try to be a senior in your first job. Be a great junior.

For a mid-level (L4)

  • Lead small projects end-to-end.
  • Mentor juniors.
  • Improve team-level tooling.
  • Recognize and surface risks.

For senior (L5+)

  • Drive cross-team initiatives.
  • Set technical direction in your area.
  • Force-multiply via mentorship and review.
  • Manage ambiguity.

If you came in as senior at your first AI role: you may need to recalibrate. Senior backend engineer ≠ senior AI engineer. Spend 30-60 days as an apprentice in the AI specifics, even if your title says otherwise.

The eval discipline shift

The single biggest cultural shock for engineers from web/backend backgrounds: AI work is measured differently.

In web: tests pass = ship.

In AI: tests pass and evals improve and nothing regresses on key metrics = ship. Some changes that "work" make eval worse. Some that "feel wrong" improve eval. Trust eval over intuition.
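
In CI terms, the shift looks something like the sketch below. It's a hypothetical eval gate: run_eval_suite(), the metric names, and the baseline file are all illustrative, and it assumes every metric is higher-is-better:

```python
# Hypothetical CI eval gate. run_eval_suite(), the metric names, and the
# baseline file are illustrative; assume all metrics are higher-is-better.
import json
import sys

def run_eval_suite() -> dict[str, float]:
    # In a real pipeline this runs the team's evals on the changed system.
    return {"answer_accuracy": 0.81, "citation_precision": 0.74}

def main() -> None:
    with open("eval_baseline.json") as f:
        baseline = json.load(f)
    current = run_eval_suite()
    regressions = {
        name: (baseline[name], score)
        for name, score in current.items()
        if score < baseline[name] - 0.01  # small tolerance for eval noise
    }
    if regressions:
        print(f"Eval regressions, blocking merge: {regressions}")
        sys.exit(1)
    print("Evals pass.")

if __name__ == "__main__":
    main()
```

The design choice that matters: the gate compares against a stored baseline, not a fixed threshold, so "nothing regresses" is checked on every change.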

If your team doesn't have eval discipline, this is the highest-leverage thing you can fix.

The data discipline shift

Second biggest shift: you'll spend a lot more time on data than you expected.

Cleaning. Curating. Labeling. Understanding distributions. Spotting shifts. Building pipelines that handle bad data gracefully.

"AI engineering" sounds like model work. Most of the actual hours are data work.

This is true for every specialization. Make peace with it.

How to learn the codebase fast

The fastest way is not to read everything. It's:

  1. Find the entry points. Where does a request, training run, or pipeline start? main functions, API routes, scheduler entries.
  2. Trace one path end-to-end. Pick one feature. Follow it through every file it touches. Take notes.
  3. Run the debugger on the path. Set breakpoints. Watch state flow.
  4. Read the tests. They document expected behavior more clearly than docs.
  5. Ask "why" before "how." Architecture decisions usually have history.

Do this for 3-4 different paths in your first month. You'll know the codebase better than people who've been there 6 months but never traced.
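
For step 3, Python's built-in breakpoint() (3.7+) is enough; no IDE setup needed. A hypothetical example of instrumenting one path, where handle_query, retrieve, and generate stand in for your team's real code:

```python
# Hypothetical example of step 3: pause execution on the path you're tracing.
# handle_query, retrieve, and generate stand in for the real codebase.

def retrieve(query: str) -> list[str]:
    return [f"doc about {query}"]

def generate(query: str, docs: list[str]) -> str:
    return f"answer to {query!r} using {len(docs)} docs"

def handle_query(query: str) -> str:
    breakpoint()  # drops into pdb: `n` steps over, `s` steps in, `p` inspects
    docs = retrieve(query)
    return generate(query, docs)

if __name__ == "__main__":
    print(handle_query("what is KV caching?"))
```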

Managing your manager

Your manager wants:

  • Predictable execution. Tell them when you'll deliver. Hit it. If you won't, tell them early.
  • No surprises. Bad news early > bad news late.
  • Visibility. Don't make them wonder what you're doing.

Practical:

  • Weekly written update. Short: what you shipped, what you're working on, what's blocked.
  • 1:1s are yours. Bring an agenda. Don't waste them on status.
  • Ask for feedback explicitly. Most managers under-deliver feedback unprompted.

The compounding moves

Small things that pay off all year:

  • Build relationships with adjacent teams. Data, infra, product. You'll need favors.
  • Write things down. Decisions, design discussions, "I tried X and here's what happened." Your team will thank you.
  • Be the person who improves docs. Quiet, high-leverage.
  • Reply quickly in Slack/email. Underrated.
  • Show up consistently. Even on hard days.

What you might wonder

"What if my team doesn't do AI well?" Common. Many companies are scrambling to "do AI" without strong fundamentals. You can: (a) learn what not to do; (b) be the person who introduces discipline; (c) leave after 12-18 months. Often all three.

"What if imposter syndrome is bad?" Universal. You'll feel underqualified for at least the first 6 months. So does everyone. The cure is doing the work, not therapy-ing the feeling.

"What if I picked the wrong specialization?" You can pivot from inside a job. Common pivots: applied LLM → serving, applied LLM → MLOps, MLOps → fine-tuning. Stay 12+ months before the pivot to avoid resume-hopper signals.

"When should I start looking for the next job?" Stay for at least 18-24 months in your first AI role. The compounding learning is enormous. After that, fair game.

Done

  • Have a 30-60-90 frame.
  • Know the cultural shifts (eval, data discipline).
  • Know the compounding moves.

Next: The next 12 months →

15 - Done - the next 12 months

What this session is

You finished the roadmap. You shipped, applied, interviewed, hopefully landed something. This page is what comes next. Year 2.

What "done" actually means

Done with this roadmap = you have:

  • A working understanding of AI engineering as a field.
  • A specialization picked.
  • A 3-project portfolio.
  • An OSS contribution profile.
  • Either a job, or interviews in progress.

Done does not mean: you understand AI. The field is too big and moving too fast for that. Done means: you can keep going under your own power.

The year-2 shift

Year 1 was about getting in. Year 2 is about getting good.

The difference:

  • Year 1: breadth. Learn a lot of things shallowly.
  • Year 2: depth. Pick 2-3 things and learn them well.
  • Year 1: imitation. Build things others have built.
  • Year 2: judgment. Build things you decide are worth building.
  • Year 1: get hired.
  • Year 2: become indispensable.

What to focus on in year 2

1. Get deeper in your specialization

Pick the 2-3 deepest sub-topics in your area and spend the year going deep on them.

  • Applied LLM: evals, agents, retrieval optimization.
  • Serving: kernel-level optimization, multi-model serving, batching strategies.
  • Fine-tuning: data curation, preference algorithms, evaluation design.
  • MLOps: experiment tracking at scale, model registry, feature stores.
  • Safety: specific risk categories, interpretability.

2. Build one ambitious thing

Year 1's projects were portfolio pieces - small, demonstrative. Year 2 can sustain something bigger:

  • An open-source library that fills a gap you noticed at work.
  • A real product (could be a side project that becomes income).
  • A research-grade exploration of a problem nobody's solved.

Six-month projects, not six-week ones.

3. Develop teaching capacity

Teaching is the fastest way to deepen. Pick a format:

  • Mentor 1-2 newer engineers.
  • Write a sustained blog series.
  • Speak at a meetup or conference.
  • Run a workshop at your company.

If you can teach a thing clearly, you understand it.

4. Build your network deliberately

In year 1, your network was incidental. In year 2, build it on purpose:

  • Maintain relationships with people from your interview loops, even where you got rejected.
  • Stay in touch with the maintainers of projects you contributed to.
  • Find 3-5 peers at your level in your specialization. Trade notes monthly.
  • Pick 1-2 senior people to learn from. Buy their coffee. Ask specific questions.

Network compounds. People who didn't matter in year 1 become job referrals in year 3.

5. Pick your second specialization

By end of year 2, start exploring an adjacent specialization. Examples:

  • Applied LLM → serving.
  • Serving → kernel-level / hardware.
  • Fine-tuning → research engineer.
  • MLOps → infra at scale.

This is the "T-shape" - deep in one, learning a second. Year-2 dabbling, year-3 contribution.

What to expect: the rough patches

Year 2 has its own slumps.

  • The 6-month plateau. You've learned the easy parts of your job. The hard parts are slow. You'll feel stuck. You aren't - depth always feels slower than breadth.
  • The first major mistake. Production incident, bad PR review, missed deadline. Recover by owning it cleanly.
  • The first time a peer outpaces you. Someone hired the same time as you gets promoted first. Don't make it about you. Their promotion isn't your story.
  • The compensation re-evaluation. A year in, you'll know what the market pays for you. If you're underpaid, address it. With data.

Year 3+ trajectory

Past year 2, paths diverge widely:

  • Stay technical (IC): mid → senior → staff → principal engineer. Several years per step.
  • Move into management: tech lead → engineering manager → director. Different skill set; not a promotion in disguise.
  • Move toward research: harder without a PhD; possible via published applied work + the right team.
  • Build something: founder, indie hacker, independent consultant.
  • Specialize hard: become the person for a specific narrow thing.

You don't have to decide year 2. Notice what you naturally lean toward.

What stays the same

Some things don't change between year 1, year 5, year 15:

  • The field moves fast. You'll always feel behind.
  • The fundamentals always pay off. Linear algebra, probability, systems thinking. The hot frameworks change; the math doesn't.
  • The people who keep showing up keep winning. Consistency over brilliance.
  • Honest writing is rare and respected. Keep writing.

What this site can keep doing for you

The senior reference paths on this site stay relevant for years. Each is a 24-week reference; open them when you need depth on a specific topic.

A final note

The honest reality of this career: it's hard, it's slow at points, and most people who start don't finish. If you got through this roadmap, you're in a small minority.

Whatever you build from here - products, papers, infrastructure, teams - the discipline that got you to this page is the same discipline that takes you to wherever you're going. Keep showing up.

Where to from here

If you finished and are still in year 1: re-read Picking a specialization and Your portfolio. Pick one and execute.

If you finished and have a job: re-read First 90 days every quarter for the first year. The advice ages well.

If you finished and are between jobs: re-read Building in public and Open source as resume. The job market punishes silence.

You've got the map. Now go.