08 - Hugging Face Transformers¶
What this session is¶
About 30 minutes. Hugging Face is the GitHub of AI models. The transformers library makes using thousands of pre-trained models a 3-line operation.
The library¶
(You did this in page 01.) The library provides three main entry points you'll use:
- `AutoTokenizer` - load any model's tokenizer.
- `AutoModel` / `AutoModelForCausalLM` / `AutoModelForSequenceClassification` / etc. - load a model. The `AutoModelFor...` variants add task-specific heads.
- `pipeline` - a high-level helper that combines tokenization + model + post-processing into one call.
The simplest possible usage: pipeline¶
```python
from transformers import pipeline

# Text classification
clf = pipeline("sentiment-analysis")
print(clf("I love this!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]
print(clf("This is terrible."))
# [{'label': 'NEGATIVE', 'score': 0.9991}]

# Text generation
gen = pipeline("text-generation", model="gpt2")
print(gen("The future of AI is", max_new_tokens=20))

# Translation
trans = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
print(trans("Hello, how are you?"))

# Question answering
qa = pipeline("question-answering")
print(qa(question="Where do I live?", context="My name is Alice and I live in Lagos."))
# {'answer': 'Lagos', ...}
```
Each pipeline picks a default model, downloads it (first time), runs it. Useful for prototyping.
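You can also pin the model and device explicitly instead of taking the task default. A minimal sketch, assuming a CUDA GPU at index 0 (drop `device=0` for CPU); the checkpoint named here is the one commonly used as the sentiment-analysis default:

```python
from transformers import pipeline

# Pin a specific checkpoint and device instead of the task default.
clf = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    device=0,  # assumes a CUDA GPU at index 0; omit for CPU
)
print(clf(["I love this!", "This is terrible."]))
```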
Browsing the Hub¶
huggingface.co hosts hundreds of thousands of models. Filter by task, language, license. Common model names you'll see:
- `gpt2`, `gpt2-medium` - small classical LLMs. Good for learning.
- `microsoft/phi-3-mini-4k-instruct` - small + capable + permissive license.
- `meta-llama/Llama-3.2-1B`, `Llama-3.2-3B`, `Llama-3.1-8B` - Meta's open weights (gated; accept the license).
- `mistralai/Mistral-7B-v0.3` - open-source Mistral.
- `google/gemma-2-2b` - small Gemma.
- `sentence-transformers/all-MiniLM-L6-v2` - tiny embedding model. Page 10.
- `distilbert-base-uncased` - small BERT-family model for classification and embeddings.
Each model page on the Hub has a README with usage, license, evaluation, intended use.
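You can also query the Hub from code via the huggingface_hub library (installed alongside transformers). A sketch, assuming the `filter`/`sort`/`direction` arguments shown here behave as in current huggingface_hub releases:

```python
from huggingface_hub import list_models

# Five most-downloaded models tagged "text-generation".
for m in list_models(filter="text-generation", sort="downloads", direction=-1, limit=5):
    print(m.id)
```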
Direct usage (not via pipeline)¶
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "microsoft/phi-3-mini-4k-instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Write a haiku about garbage collection:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)

print(tok.decode(output[0], skip_special_tokens=True))
```
Key arguments:
- torch_dtype=torch.bfloat16 - load weights in bfloat16 instead of float32. Halves memory; minimal quality loss.
- device_map="auto" - automatically distribute layers across available devices (GPU + CPU fallback).
- return_tensors="pt" - tokenizer returns PyTorch tensors.
- skip_special_tokens=True - strip <eos>, <bos>, etc. from output.
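To see what those arguments actually did, inspect the loaded model. A sketch continuing from the snippet above; `get_memory_footprint()` reports weight-plus-buffer memory in bytes, and `hf_device_map` (set by accelerate when `device_map` is used) shows the layer-to-device placement:

```python
# Continuing from the snippet above.
print(model.dtype)                                      # torch.bfloat16
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")   # weights + buffers
print(model.hf_device_map)                              # layer -> device placement
```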
Chat templates¶
Modern chat-tuned models expect a specific message format. The tokenizer knows it:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of Nigeria?"},
]
inputs = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=50)

response_only = tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response_only)
```
apply_chat_template formats the messages with the model's expected special tokens (<|user|>, <|assistant|>, etc.). add_generation_prompt=True adds the assistant's turn-start so the model knows it's its turn to speak.
For chat-tuned models, always use the chat template. Raw prompt-completion produces worse results.
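To see what the template actually renders, pass `tokenize=False` to get the formatted string instead of token IDs (continuing from the snippet above):

```python
# Render the chat template as a string rather than token IDs.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)  # the prompt with the model's special tokens, ending at the assistant turn-start
```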
Embedding models¶
For semantic search (used in page 10):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["A dog is running.", "A cat is sleeping.", "I bought milk."]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384) - three 384-dim vectors

# Compute similarities
sim = embeddings @ embeddings.T
print(sim)  # diagonal is 1.0 (each vector with itself);
            # off-diagonal close to 0 for unrelated texts
```
sentence-transformers wraps Hugging Face models and handles the "pool tokens into a sentence vector" step.
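Under the hood that step is roughly the following plain-transformers sketch: mean-pool the token embeddings (ignoring padding), then L2-normalize, which is the recipe the all-MiniLM-L6-v2 model card describes:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

texts = ["A dog is running.", "A cat is sleeping."]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state   # (batch, seq_len, 384)

# Mean-pool over real tokens only, then normalize to unit length.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 384])
```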
Caching¶
By default, models download to ~/.cache/huggingface/. Big models (gigabytes) live here. The location can be changed:
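A sketch of the two common ways (the path is just an example; HF_HOME must be set before anything from Hugging Face is imported):

```python
import os

# Option 1: move the whole cache (set before importing transformers).
os.environ["HF_HOME"] = "/data/hf-cache"  # example path

from transformers import AutoModelForCausalLM

# Option 2: override the cache for a single load.
model = AutoModelForCausalLM.from_pretrained("gpt2", cache_dir="/data/hf-cache")
```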
To pre-download a model without using it (useful in Docker):
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/phi-3-mini-4k-instruct")
```
Quantized models¶
LLMs are huge. Loading FP16 needs gigabytes; FP32 needs 2x. Quantization reduces precision further:
- GPTQ / AWQ - 4-bit quantization, requires specific quantized weights.
- bitsandbytes - runtime 8-bit / 4-bit quantization for any model:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=bnb)
```
8B parameters at 4-bit ≈ 4GB. Fits on consumer GPUs.
Quality drops ~1-5% on benchmarks; for many tasks, indistinguishable. Used heavily in production inference (page 12).
Exercise¶
- Run the simplest pipeline: the sentiment-analysis example above, on a few sentences of your own.
- Generate text with a small model: the gpt2 text-generation pipeline, with different prompts and max_new_tokens values.
- Direct model usage with the chat-template form above. Use any chat-tuned model that fits your hardware. Try several prompts.
- Embeddings: with sentence-transformers, encode 5 sentences (some related, some not). Compute the similarity matrix. Notice which pairs score high.
- (Stretch - GPU helpful) Load Llama-3.2-1B (accept the license on HF first; small enough for most setups). Compare its outputs to gpt2's.
What you might wonder¶
"How big a model can I run?" Rule of thumb (FP16): need ~2 bytes per parameter for inference. 1B params = 2GB. 7B = 14GB. 70B = 140GB. With 4-bit quantization, ~0.5 bytes per param. 70B at 4-bit ≈ 40GB.
"Why does the model download so slowly?"
HF servers throttle anonymous traffic. Authenticate (huggingface-cli login) for higher limits, especially for gated models.
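The Python equivalent of the CLI login, if you prefer a script or notebook (the token comes from huggingface.co/settings/tokens and is cached locally):

```python
from huggingface_hub import login

login()  # prompts for a token and stores it for future downloads
```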
"What's device_map="auto" actually doing?"
Hugging Face's accelerate library partitions the model across available devices (GPU layers; CPU offload for excess). For small models, the whole thing goes on GPU. For huge models, layers spill to CPU (much slower but possible).
"Should I use safetensors or pytorch_model.bin?" Safetensors. Faster loading, safer (no arbitrary code execution risk). All modern HF models ship both.
Done¶
- Use `pipeline` for the quickest possible model usage.
- Use `AutoTokenizer` + `AutoModelForCausalLM` for direct control.
- Apply chat templates for chat-tuned models.
- Use sentence-transformers for embeddings.
- Load quantized models for memory efficiency.