
Month 2-Week 3: Tabular ML, gradient boosting, and feature engineering

Week summary

  • Goal: Continue course. Train XGBoost on a tabular dataset and beat a simple neural network. Learn one round of feature engineering and quantify its effect.
  • Time: ~9 h over 3 sessions.
  • Output: tabular/ directory in classical-ml with XGBoost vs MLP comparison and feature-engineering experiment.
  • Sequences relied on: 06-classical-ml rungs 07, 09, 10; 04-python-for-ml rung 04.

Why this week matters

It's tempting to use neural nets for everything. The truth: on tabular data, gradient-boosted trees (XGBoost, LightGBM, CatBoost) are still SOTA in 2026. Knowing this, and being able to demonstrate it on real data, is a maturity marker that separates "ML hipsters" from real practitioners. It's also useful in interviews: "Why didn't you use a neural net for this?" "Because the data is tabular and trees beat nets here. Here's the comparison."

Feature engineering is also a skill that quietly compounds. When fine-tuning datasets are curated badly, models suffer. The same instincts apply: clean data, encode well, derive useful features.

Prerequisites

  • M02-W01 + W02 complete.
  • Pandas basics (Series, DataFrame, groupby).
Sessions this week

  • Session A-Tue/Wed evening (~3 h): course + Pandas EDA
  • Session B-Sat morning (~3.5 h): XGBoost vs MLP head-to-head
  • Session C-Sun afternoon (~2.5 h): feature-engineering experiment

Session A-Course week 4 + dataset selection + EDA

Goal: Continue course. Pick a tabular dataset. Do thorough exploratory data analysis (EDA).

Part 1-Course material (75 min)

fast.ai Lesson 4, "NLP and tabular":

  • Watch the lecture.
  • Run the tabular notebook.

Ng path: Course 2 weeks 1–2 (neural networks, decision trees).

Part 2-Pick a dataset (15 min)

Choose one based on your interest:

  1. Titanic (Kaggle): classic, well-understood, small.
  2. House Prices (Kaggle): regression, mixed types.
  3. Adult Income / Census (UCI): classification, social data.
  4. A dataset from your work, e.g. synthetic deploys with success/failure labels.

Recommendation: pick the one where the label has business meaning to you.

Part 3-EDA in Pandas (90 min)

This is the first hour with any dataset. Do all of these:

import pandas as pd
df = pd.read_csv('data.csv')

# Overview
df.shape; df.dtypes; df.head()
df.describe()                  # numerical summary
df.isna().sum()                # missing per column

# Distributions (one column at a time)
df['target'].value_counts()    # class balance
df['feature'].hist()
df.corr(numeric_only=True)     # numerical correlations

# Group-bys
df.groupby('target')['feature'].mean()

Document your findings in an EDA.md:

  • What does the target distribution look like? Is it imbalanced?
  • Which features correlate strongly with the target?
  • Which features have many missing values?
  • Any obvious data quality issues?

This is the kind of work senior engineers do before reaching for a model.
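
For the "cleaned dataset saved" artifact, a minimal pass might look like the sketch below. The median fill and the tabular/clean.csv path are assumptions to adapt to whatever EDA.md surfaced, not part of the course material.

import pandas as pd

df = pd.read_csv('data.csv')
df = df.drop_duplicates()

# Illustrative only: fill numeric gaps with the median so the MLP in Session B
# gets no NaNs (XGBoost tolerates them, PyTorch will not).
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

df.to_csv('tabular/clean.csv', index=False)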

Output of Session A

  • Course week 4 notebook in repo.
  • tabular/EDA.md with findings.
  • Cleaned dataset saved.

Session B-XGBoost vs MLP, head-to-head

Goal: Train an XGBoost model and an MLP on the same dataset. Compare with proper CV.

Part 1-XGBoost first (75 min)

import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder

# df is the cleaned dataset from Session A.
# Encode categorical columns as integers (XGBoost expects numeric inputs).
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

X = df.drop('target', axis=1).values
y = df['target'].values

model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                          random_state=42, eval_metric='logloss')

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"XGBoost: {scores.mean():.4f} ± {scores.std():.4f}")

Part 2-Simple MLP on the same data (75 min)

import torch
import torch.nn as nn

class TabMLP(nn.Module):
    def __init__(self, in_dim, hidden=64, out_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x): return self.net(x)

# Same 5-fold CV; train each fold for 20 epochs, report accuracy
# (paste in fold loop with seed control from M02-W02)
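
# A minimal sketch of that fold loop. Assumptions: X, y, and kf from Part 1, and a
# binary target encoded as 0/1 (out_dim=2); scaling, lr, and full-batch training
# are illustrative choices, not prescriptions.
import numpy as np
from sklearn.preprocessing import StandardScaler

torch.manual_seed(42)
fold_accs = []
for train_idx, val_idx in kf.split(X):
    # Scale on the training fold only, then apply to the validation fold
    scaler = StandardScaler().fit(X[train_idx])
    X_tr = torch.tensor(scaler.transform(X[train_idx]), dtype=torch.float32)
    X_va = torch.tensor(scaler.transform(X[val_idx]), dtype=torch.float32)
    y_tr = torch.tensor(y[train_idx], dtype=torch.long)
    y_va = torch.tensor(y[val_idx], dtype=torch.long)

    net = TabMLP(in_dim=X.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    net.train()
    for _ in range(20):                     # 20 epochs, full-batch for simplicity
        opt.zero_grad()
        loss_fn(net(X_tr), y_tr).backward()
        opt.step()

    net.eval()
    with torch.no_grad():
        preds = net(X_va).argmax(dim=1)
        fold_accs.append((preds == y_va).float().mean().item())

print(f"MLP: {np.mean(fold_accs):.4f} ± {np.std(fold_accs):.4f}")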

Part 3-Compare and analyze (30 min)

Almost certainly XGBoost wins on tabular data. Quantify the gap:

| Model   | CV Mean | CV Std |
|---------|---------|--------|
| XGBoost | 0.85    | 0.012  |
| MLP     | 0.81    | 0.025  |

Reflect: why does XGBoost beat MLPs here? Likely:

  • Trees handle categorical and missing data naturally.
  • Trees capture sharp non-linearities easily.
  • With few rows, MLPs don't have enough data to shine; trees do fine at small scale.

Output of Session B

  • tabular/comparison.ipynb with both models and 5-fold results.

Session C-Feature engineering experiment

Goal: Add 1–3 engineered features and quantify the effect on XGBoost performance.

Part 1-Pick features to engineer (45 min)

Feature engineering ideas by dataset type:

  • Titanic:
    • Family size = SibSp + Parch.
    • Title extracted from the name (Mr/Mrs/Master).
    • Cabin letter as a feature.
  • House Prices:
    • Total square footage = sum of all SF columns.
    • Age = YearSold − YearBuilt.
    • Has-pool = PoolArea > 0.
  • Income/Census:
    • Net capital = capital-gain − capital-loss.
    • Hours-per-week binned.

Pick 1–3 you can defend with intuition.

Part 2-Implement, re-run, compare (75 min)

Create a featurize.py with a clear function:

import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with engineered features added (Titanic example)."""
    df = df.copy()
    df['family_size'] = df['SibSp'] + df['Parch']           # relatives aboard
    df['is_alone'] = (df['family_size'] == 0).astype(int)   # solo travellers
    return df

Re-run the XGBoost CV. Compare:

| Variant            | CV Mean | CV Std | Δ vs baseline |
|--------------------|---------|--------|---------------|
| Baseline           | 0.85    | 0.012  | –             |
| + family features  | 0.87    | 0.010  | +0.02         |
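
A minimal way to produce those two rows, assuming df, model, kf, and cross_val_score from Session B are still in scope; cv_accuracy is a small helper defined here for the comparison, not a library function.

from featurize import add_features

def cv_accuracy(frame):
    """5-fold CV accuracy for a given feature set, reusing model and kf from Session B."""
    X = frame.drop('target', axis=1).values
    y = frame['target'].values
    scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    return scores.mean(), scores.std()

base_mean, base_std = cv_accuracy(df)
feat_mean, feat_std = cv_accuracy(add_features(df))
print(f"baseline:   {base_mean:.4f} ± {base_std:.4f}")
print(f"+ features: {feat_mean:.4f} ± {feat_std:.4f}  (Δ {feat_mean - base_mean:+.4f})")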

Part 3-Reflect + push (30 min)

Write 1 paragraph in your notebook: "When would I reach for trees vs nets?"

Likely answer:

  • Trees: tabular data, mixed types, small-to-medium scale, when interpretability matters (feature_importances_).
  • Nets: unstructured data (text, image, audio), large scale, when you can pre-train.

Push everything to the repo. Update README.

Output of Session C

  • featurize.py + before/after comparison.
  • Reflection paragraph.

End-of-week artifact

  • tabular/EDA.md with findings
  • XGBoost vs MLP comparison with 5-fold CV
  • Feature-engineering experiment with quantified delta
  • Reflection on tree vs net selection

End-of-week self-assessment

  • I can defend "use XGBoost on this" without sounding defensive.
  • I can write a basic feature-engineering function.
  • I can compute 5-fold cross-validation manually if needed.

Common failure modes for this week

  • Skipping EDA "to get to modeling." EDA is the modeling. The model is just the tail.
  • Using a neural net because it's "AI" when XGBoost would win. Tool-shopping is anti-engineering.
  • Engineering 20 features, finding none help, then quitting. Iteration is feature engineering.

What's next (preview of M02-W04)

Course wrap + analysis post on your ablation study + transformer preview. You'll publish your second blog post and watch Karpathy's first transformer-prep lecture.
