
13 - Picking a Project

What this session is

About 30 minutes, plus browsing time. A tour of AI open-source projects that accept first contributions, with specific candidates.

What kinds of AI projects fit beginners

Your toolkit so far: PyTorch, Hugging Face, RAG, evaluation, serving. Good targets:

  • Inference engines & serving (vLLM, llama.cpp, Ollama) - high-quality issue tickets across many difficulty levels.
  • Tokenization / data tooling (HuggingFace tokenizers, datasets).
  • Embedding / RAG libraries (sentence-transformers, llama-index, langchain).
  • Evaluation tools (lm-eval-harness, ragas, promptfoo).
  • Adjacent ML tools (numpy, scipy, scikit-learn).
  • Documentation - every major project has doc-improvement work.

For more research-y work (PyTorch core, training algorithms, model architectures), you'll need deeper expertise. Build on this path first.

10-minute evaluation

Same standard as other beginner paths:

Signal                        Target
Stars                         100-50,000
Last commit                   Within a month
Open PRs                      Some, but not 300+
Recent PR merge time          Under 14 days
good first issue count        ≥ 5
CONTRIBUTING.md               Yes, and readable
Tests pass on fresh clone     Yes
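The checklist above can be encoded as a quick pass/fail function once you've read the numbers off the repo page. A minimal sketch — the function name and argument names are illustrative, not from any real tool:

```python
def passes_ten_minute_eval(
    stars: int,
    days_since_last_commit: int,
    open_prs: int,
    recent_pr_merge_days: float,
    good_first_issues: int,
    has_readable_contributing: bool,
    tests_pass_on_fresh_clone: bool,
) -> bool:
    """Return True if a repo meets every target in the 10-minute evaluation."""
    return (
        100 <= stars <= 50_000              # big enough to be alive, small enough to need help
        and days_since_last_commit <= 31    # last commit within a month
        and 0 < open_prs < 300              # some open PRs, but not a graveyard
        and recent_pr_merge_days < 14       # maintainers merge within two weeks
        and good_first_issues >= 5          # enough starter issues to choose from
        and has_readable_contributing       # CONTRIBUTING.md exists and is readable
        and tests_pass_on_fresh_clone       # the dev setup actually works
    )
```

Every signal is a hard gate here; in practice, treat a single near-miss as a judgment call rather than an automatic fail.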

Candidates

Tier 1 - friendly, smaller scope

  • huggingface/transformers - yes the big one, BUT they have excellent issue triage. Many docs/examples PRs. Look at good first issue.
  • langchain-ai/langchain - chained LLM workflows. Large but very welcoming; tons of easy-mode integrations to add.
  • run-llama/llama_index - RAG-focused alternative to LangChain.
  • promptfoo/promptfoo - eval tool. Small enough to be approachable; very active.
  • huggingface/tokenizers - tokenizer library. Rust core + Python bindings.

Tier 2 - well-organized

  • vllm-project/vllm - production inference serving. Issues exist at all levels.
  • huggingface/peft - LoRA + friends. Smaller surface; active.
  • huggingface/datasets - data loading. Adding a new dataset adapter is a common first contribution.
  • sentence-transformers/sentence-transformers - embeddings library.
  • unslothai/unsloth - fast fine-tuning. Welcoming.

Tier 3 - bigger, more visible

After 1-2 PRs.

  • pytorch/pytorch - yes, eventually. Excellent labels; SIG structure; contributors are well-shepherded.
  • huggingface/transformers larger contributions (new model architectures, etc.).
  • triton-lang/triton - GPU programming DSL. Needs Triton + CUDA understanding.

Tier 4 - don't start here

  • Foundation model labs (OpenAI, Anthropic, DeepMind) - closed source.
  • PyTorch internals (autograd, distributed) - deep specialty.

Finding issues

Project's Issues tab. Filter by good first issue / documentation / help wanted.

Many AI projects label specific kinds of work:

  • enhancement: docs
  • models: add
  • integration: add
  • bug: confirmed

Read 5-10 issues; find one with clear repro and contained fix. Comment to claim; wait for maintainer.
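Label filters can also be applied directly in the issues URL using GitHub's standard search qualifiers (`is:issue is:open label:"..."`). A small sketch — the helper name is made up, but the URL format is GitHub's documented issue-search syntax:

```python
from urllib.parse import quote


def issue_search_url(owner: str, repo: str, labels: list[str]) -> str:
    """Build a GitHub issues URL pre-filtered to the given labels.

    Labels with spaces (like "good first issue") must be quoted in the
    search query, then URL-encoded.
    """
    query = "is:issue is:open " + " ".join(f'label:"{lbl}"' for lbl in labels)
    return f"https://github.com/{owner}/{repo}/issues?q={quote(query)}"


# Example: open, unclaimed starter issues on vLLM
print(issue_search_url("vllm-project", "vllm", ["good first issue"]))
```

Swap in any owner/repo from the tiers above; combining labels (e.g. `["good first issue", "documentation"]`) narrows to issues carrying both.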

What counts

For AI OSS work:

  • Fix a typo in a model's documentation.
  • Add a missing example in a tutorial notebook.
  • Fix a quantization bug for a specific GPU.
  • Add a new embedding model adapter.
  • Improve an evaluation metric's implementation.
  • Add a missing integration to a chain framework.
  • Add support for a new fine-tuning recipe.
  • Translate documentation.

All real. All count.

Specific recommendation: huggingface/transformers docs

For an easy first PR: open transformers/docs/source/en/ in a clone, find a doc page that's missing an example or has a confusing sentence. Submit a fix. The HF team is responsive and welcoming; you'll hear back within days. PR merges fast.

Exercise

  1. Browse three projects from Tier 1-2.
  2. Run the 10-minute eval on each.
  3. Pick the most responsive.
  4. Read CONTRIBUTING.md.
  5. Clone, install, run their tests:
    git clone https://github.com/<owner>/<repo>
    cd <repo>
    pip install -e ".[dev]"    # quote the extras, or zsh will glob; or: pip install -r dev-requirements.txt
    pytest                     # or whatever runner the project uses
    
  6. Browse open issues. Pick two candidates. Don't claim yet.

What you might wonder

"I want to contribute to PyTorch core. Can I?" Yes, eventually. Start with their docs work or with the smaller modules first. The bar is real, but lower than a project like the Linux kernel's.

"I'm scared of touching ML papers' reference implementations." Don't be. Start with documentation. Reference implementations of papers (DeepMind's work, etc.) often have terse READMEs, scattered hyperparameters, and missing examples. Doc PRs are welcomed.

"What about Anthropic / OpenAI / Google work?" Their research models are mostly closed source. Their developer tooling (Anthropic's Python SDK, OpenAI's Cookbook) is public and accepts PRs.

Done

  • Recognize AI-OSS contribution shapes.
  • Run a 10-minute evaluation.
  • Have specific projects in mind.

Next: Anatomy of an AI OSS project →
