Month 2-Week 3: Tabular ML, gradient boosting, and feature engineering¶
Week summary¶
- Goal: Continue course. Train XGBoost on a tabular dataset and beat a simple neural network. Learn one round of feature engineering and quantify its effect.
- Time: ~9 h over 3 sessions.
- Output: tabular/ directory in classical-ml with XGBoost vs MLP comparison and feature-engineering experiment.
- Sequences relied on: 06-classical-ml rungs 07, 09, 10; 04-python-for-ml rung 04.
Why this week matters¶
It's tempting to use neural nets for everything. The truth: on tabular data, gradient-boosted trees (XGBoost, LightGBM, CatBoost) are still SOTA in 2026. Knowing this, and being able to demonstrate it on real data, is a maturity marker that separates "ML hipsters" from real practitioners. It's also useful in interviews: "Why didn't you use a neural net for this?" "Because the data is tabular and trees beat nets here. Here's the comparison."
Feature engineering is also a skill that quietly compounds. When fine-tuning datasets are curated badly, models suffer. The same instincts apply: clean data, encode well, derive useful features.
Prerequisites¶
- M02-W01 + W02 complete.
- Pandas basics (Series, DataFrame, groupby).
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): course + Pandas EDA
- Session B-Sat morning (~3.5 h): XGBoost vs MLP head-to-head
- Session C-Sun afternoon (~2.5 h): feature engineering experiment
Session A-Course week 4 + dataset selection + EDA¶
Goal: Continue course. Pick a tabular dataset. Do thorough exploratory data analysis (EDA).
Part 1-Course material (75 min)¶
fast.ai Lesson 4, "NLP and tabular":
- Watch.
- Run the tabular notebook.
Ng path: Course 2 weeks 1–2 (neural networks, decision trees).
Part 2-Pick a dataset (15 min)¶
Choose one based on your interest:
1. Titanic (Kaggle)-classic, well-understood, small.
2. House Prices (Kaggle)-regression, mixed types.
3. Adult Income / Census (UCI)-classification, social data.
4. A dataset from your work-e.g., synthetic deploys with success/failure labels.
Recommendation: pick the one where the label has business meaning to you.
Part 3-EDA in Pandas (90 min)¶
The first hour with any dataset. Do all of these:
import pandas as pd
df = pd.read_csv('data.csv')
# Overview
df.shape; df.dtypes; df.head()
df.describe() # numerical summary
df.isna().sum() # missing per column
# Distributions (one column at a time)
df['target'].value_counts() # class balance
df['feature'].hist()
df.corr(numeric_only=True) # numerical correlations
# Group-bys
df.groupby('target')['feature'].mean()
Document findings in an EDA.md:
- What does the target distribution look like? (Imbalanced?)
- Which features have strong correlation with target?
- Which features have many missing values?
- Any obvious data quality issues?
This is the kind of work senior engineers do before reaching for a model.
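The checklist above can be scripted so the answers land in one place. A minimal sketch, using a hypothetical toy frame in place of your CSV (column names are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame standing in for your dataset; swap in pd.read_csv('data.csv').
df = pd.DataFrame({
    'feature': [1.0, 2.0, None, 4.0, 5.0, 6.0],
    'other':   [0.5, 1.5, 2.5, 3.5, None, 5.5],
    'target':  [0, 0, 0, 1, 1, 1],
})

# Class balance: fraction of rows per target value (answers "imbalanced?").
balance = df['target'].value_counts(normalize=True)

# Missing values per column, worst first.
missing = df.isna().sum().sort_values(ascending=False)

# Absolute correlation of each numeric feature with the target, strongest first.
corr = (df.corr(numeric_only=True)['target']
          .drop('target').abs().sort_values(ascending=False))

print(balance)
print(missing)
print(corr)
```

Paste the three printed summaries into EDA.md and annotate them; the annotations, not the numbers, are the findings.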
Output of Session A¶
- Course week 4 notebook in repo.
- tabular/EDA.md with findings.
- Cleaned dataset saved.
Session B-XGBoost vs MLP, head-to-head¶
Goal: Train an XGBoost model and an MLP on the same dataset. Compare with proper CV.
Part 1-XGBoost first (75 min)¶
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
# Encode categoricals (XGBoost handles numerics well)
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
X = df.drop('target', axis=1).values
y = df['target'].values
model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
random_state=42, eval_metric='logloss')
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"XGBoost: {scores.mean():.4f} ± {scores.std():.4f}")
Part 2-Simple MLP on the same data (75 min)¶
import torch
import torch.nn as nn
class TabMLP(nn.Module):
    def __init__(self, in_dim, hidden=64, out_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)
# Same 5-fold CV; train each fold for 20 epochs, report accuracy
# (paste in fold loop with seed control from M02-W02)
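The fold loop might look like the sketch below: same KFold seed as the XGBoost run, 20 full-batch epochs per fold for brevity. Synthetic data stands in for your encoded feature matrix; everything else (architecture, optimizer choice, epoch count) is an assumption to make the sketch self-contained.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

torch.manual_seed(42)
np.random.seed(42)

class TabMLP(nn.Module):
    def __init__(self, in_dim, hidden=64, out_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Synthetic stand-in for your encoded feature matrix and binary labels.
X = np.random.randn(200, 8).astype('float32')
y = (X[:, 0] > 0).astype('int64')

accs = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # same splits as XGBoost run
for train_idx, val_idx in kf.split(X):
    model = TabMLP(in_dim=X.shape[1])          # fresh model per fold
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    Xtr, ytr = torch.from_numpy(X[train_idx]), torch.from_numpy(y[train_idx])
    Xva, yva = torch.from_numpy(X[val_idx]), torch.from_numpy(y[val_idx])
    for _ in range(20):                         # 20 epochs, full-batch for brevity
        opt.zero_grad()
        loss = loss_fn(model(Xtr), ytr)
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        preds = model(Xva).argmax(dim=1)
    accs.append((preds == yva).float().mean().item())

print(f"MLP: {np.mean(accs):.4f} ± {np.std(accs):.4f}")
```

On a real dataset you'd mini-batch and standardize features first; MLPs, unlike trees, are sensitive to input scale.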
Part 3-Compare and analyze (30 min)¶
Almost certainly XGBoost wins on tabular data. Quantify the gap:

| Model | CV Mean | CV Std |
|---|---|---|
| XGBoost | 0.85 | 0.012 |
| MLP | 0.81 | 0.025 |
Reflect: why does XGBoost beat MLPs here? Likely:
- Trees handle categorical and missing data naturally.
- Trees capture sharp non-linearities easily.
- Few rows: MLPs need far more data to generalize; trees are data-efficient.
Output of Session B¶
- tabular/comparison.ipynb with both models and 5-fold results.
Session C-Feature engineering experiment¶
Goal: Add 1–3 engineered features and quantify the effect on XGBoost performance.
Part 1-Pick features to engineer (45 min)¶
Feature engineering ideas by dataset type:
- Titanic:
  - Family size = SibSp + Parch.
  - Title extracted from name (Mr/Mrs/Master).
  - Cabin letter as feature.
- House Prices:
  - Total square footage = sum of all SF columns.
  - Age = YrSold − YearBuilt.
  - Has-pool = PoolArea > 0.
- Income/Census:
  - capital-gain − capital-loss.
  - Hours-per-week binned.
Pick 1–3 you can defend with intuition.
Part 2-Implement, re-run, compare (75 min)¶
Create a featurize.py with a clear function:
def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['family_size'] = df['SibSp'] + df['Parch']
    df['is_alone'] = (df['family_size'] == 0).astype(int)
    return df
Re-run the XGBoost CV. Compare:

| Variant | CV Mean | CV Std | Δ vs baseline |
|---|---|---|---|
| Baseline | 0.85 | 0.012 | - |
| + family features | 0.87 | 0.010 | +0.02 |
Part 3-Reflect + push (30 min)¶
Write 1 paragraph in your notebook: "When would I reach for trees vs nets?"
Likely answer:
- Trees: tabular data, mixed types, small-to-medium scale, when interpretability matters (feature_importances_).
- Nets: unstructured data (text, image, audio), large scale, when you can pre-train.
Push everything to the repo. Update README.
Output of Session C¶
- featurize.py + before/after comparison.
- Reflection paragraph.
End-of-week artifact¶
- tabular/EDA.md with findings
- XGBoost vs MLP comparison with 5-fold CV
- Feature-engineering experiment with quantified delta
- Reflection on tree vs net selection
End-of-week self-assessment¶
- I can defend "use XGBoost on this" without sounding defensive.
- I can write a basic feature-engineering function.
- I can compute 5-fold cross-validation manually if needed.
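"Manually" here means without cross_val_score: shuffle the indices once, cut them into k folds, and let each fold take one turn as validation. A sketch of just the index bookkeeping (function name and seed are illustrative):

```python
import numpy as np

def manual_kfold_indices(n: int, k: int = 5, seed: int = 42):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                   # one shuffle, reused by all folds
    folds = np.array_split(idx, k)             # k roughly equal chunks
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Sanity check: every sample lands in validation exactly once.
splits = list(manual_kfold_indices(23, k=5))
all_val = np.concatenate([v for _, v in splits])
print(len(splits), sorted(all_val.tolist()) == list(range(23)))
```

If you can write this from memory, the sklearn KFold API holds no surprises; it does the same thing plus edge-case handling.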
Common failure modes for this week¶
- Skipping EDA "to get to modeling." EDA is the modeling. The model is just the tail.
- Using a neural net because it's "AI"-when XGBoost would win. Tool-shopping is anti-engineering.
- Engineering 20 features, finding none help, then quitting. Iteration is feature engineering.
What's next (preview of M02-W04)¶
Course wrap + analysis post on your ablation study + transformer preview. You'll publish your second blog post and watch Karpathy's first transformer-prep lecture.