Week 3 - Collections, Comprehensions, Iterators, and Generators

3.1 Conceptual Core

  • The four collection workhorses: list (dynamic array), tuple (immutable record), dict (open-addressed hash table, insertion-ordered since 3.7), set (hash table). Internalize big-O costs: list.append is amortized O(1), list.insert(0, x) is O(n); membership (in) is O(1) average for dict and set, but O(n) for list.
  • An iterator is a stateful cursor; an iterable is anything you can call iter() on. for x in xs: desugars to it = iter(xs); while True: try: x = next(it) except StopIteration: break (spelled out in the desugaring sketch after this list).
  • Generators are not just iterators; they are coroutines. They preserve local state across yield, accept values via .send(), have exceptions injected via .throw(), and clean up via .close() - see the coroutine sketch after this list.
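
  A minimal sketch of that desugaring; xs is any iterable and process() stands in for the loop body:

    it = iter(xs)              # ask the iterable for a fresh iterator
    while True:
        try:
            x = next(it)       # advance the cursor
        except StopIteration:  # cursor exhausted
            break
        process(x)             # the original loop body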
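
  And a sketch of the coroutine protocol; running_total is an illustrative name, not stdlib:

    def running_total():
        """Yield a running total; .send(n) adds n, .throw(ValueError) resets."""
        total = 0
        while True:
            try:
                total += yield total   # suspend; resume with the sent value
            except ValueError:         # injected by g.throw(ValueError)
                total = 0

    g = running_total()
    print(next(g))               # 0 -- prime: run to the first yield
    print(g.send(5))             # 5
    print(g.send(3))             # 8
    print(g.throw(ValueError))   # 0 -- handled inside; total reset
    g.close()                    # raises GeneratorExit at the yield; generator exits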

3.2 Mechanical Detail

  • collections.deque (O(1) appends and pops at both ends), collections.Counter, collections.defaultdict, collections.ChainMap, collections.OrderedDict (now mostly redundant; still useful for move_to_end), collections.namedtuple (legacy; prefer dataclass(slots=True, frozen=True) or typing.NamedTuple). A quick tour follows this list.
  • itertools mastery: chain, islice, takewhile, dropwhile, groupby (note: it only merges adjacent equal keys, so sort first - see the groupby sketch below), tee, product, permutations, combinations, accumulate, pairwise (3.10+), batched (3.12+).
  • Comprehensions in all four flavors: list, set, dict, generator. Generator expression vs. list comprehension: when the consumer is sum, any, all, min, max, or anything else that doesn't need a materialized list, prefer the genexp - same syntax minus the brackets (see the comprehension sketch below).
  • yield from (PEP 380): delegate to a sub-generator; next, send, and throw all flow through to it (see the delegation sketch below).
  • The buffer protocol primer: memoryview over bytes/bytearray/array.array/numpy.ndarray lets you slice without copying (see the memoryview sketch below). Critical in the AI months.
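
  A quick tour of the containers above - a sketch using only the stdlib:

    from collections import Counter, defaultdict, deque

    dq = deque([1, 2, 3])
    dq.appendleft(0)                 # O(1) at the left end
    dq.pop()                         # O(1) at the right end -> 3

    counts = Counter("abracadabra")
    print(counts.most_common(2))     # [('a', 5), ('b', 2)]

    groups = defaultdict(list)       # missing keys spring into existence
    for word in ["ant", "bee", "ape"]:
        groups[word[0]].append(word)
    print(dict(groups))              # {'a': ['ant', 'ape'], 'b': ['bee']}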
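
  The groupby caveat in action - it only merges adjacent equal keys, so sort by the same key first:

    from itertools import groupby

    def first_letter(word: str) -> str:
        return word[0]

    words = ["bee", "ant", "ape", "cow"]

    # Unsorted: 'ant' and 'ape' group only because they happen to be adjacent.
    print([(k, list(g)) for k, g in groupby(words, key=first_letter)])
    # [('b', ['bee']), ('a', ['ant', 'ape']), ('c', ['cow'])]

    # Sorted by the same key: one group per letter, as intended.
    srt = sorted(words, key=first_letter)
    print([(k, list(g)) for k, g in groupby(srt, key=first_letter)])
    # [('a', ['ant', 'ape']), ('b', ['bee']), ('c', ['cow'])]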
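
  All four comprehension flavors, plus the genexp-into-a-consumer pattern:

    nums = [3, 1, 4, 1, 5]

    squares = [n * n for n in nums]               # list
    uniques = {n for n in nums}                   # set
    index = {n: i for i, n in enumerate(nums)}    # dict
    lazy = (n * n for n in nums)                  # generator expression

    # When the consumer streams, drop the brackets: no intermediate list.
    total = sum(n * n for n in nums)              # 52
    any_big = any(n > 4 for n in nums)            # True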
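
  yield from delegation in miniature - next and send flow through to the sub-generator:

    def inner():
        received = yield "inner ready"
        yield f"inner got {received!r}"

    def outer():
        yield "outer first"
        yield from inner()       # delegate; sends/throws pass through
        yield "outer last"

    g = outer()
    print(next(g))               # outer first
    print(next(g))               # inner ready
    print(g.send("hi"))          # inner got 'hi'
    print(next(g))               # outer last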
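
  And the memoryview point - views share the underlying buffer, while slicing a bytes-like object copies:

    data = bytearray(b"abcdefgh")

    view = memoryview(data)[:4]  # a view, not a copy: O(1), shares the buffer
    copy = bytes(data[:4])       # a bytearray slice copies

    data[0] = ord("Z")           # mutate the underlying buffer
    print(bytes(view))           # b'Zbcd' -- the view sees the change
    print(copy)                  # b'abcd' -- the copy does not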

3.3 Lab - "Streaming Word Count"

  1. Implement wc -w over arbitrarily large files using a generator pipeline: file → lines → words → counts. Constant memory regardless of file size. (One possible shape is sketched after this list.)
  2. Add a --top K flag using heapq.nlargest. Note that you must materialize the Counter - discuss why.
  3. Replace your hand-rolled tokenizer with re.finditer and benchmark. Then benchmark a str.split() version. Explain the difference.
  4. Add a --parallel N flag using concurrent.futures.ProcessPoolExecutor and itertools.batched; a rough shape is sketched below. (We will revisit in Month 4.)
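
  One possible shape for steps 1-2 - the function names (words, word_count, top_k) are illustrative, not prescribed:

    import heapq
    import re
    from collections import Counter
    from typing import Iterator

    WORD = re.compile(r"\w+")

    def words(path: str) -> Iterator[str]:
        """Stream words lazily: file -> lines -> words; one line in memory at a time."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield from (m.group() for m in WORD.finditer(line))

    def word_count(path: str) -> int:
        return sum(1 for _ in words(path))    # constant memory: nothing materialized

    def top_k(path: str, k: int) -> list[tuple[str, int]]:
        # The Counter must materialize: nlargest needs each word's *final*
        # count before it can rank anything.
        counts = Counter(words(path))
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])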
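
  And a rough shape for step 4, assuming Python 3.12 for itertools.batched; count_batch and parallel_count are hypothetical names:

    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor
    from itertools import batched

    def count_batch(lines: tuple[str, ...]) -> Counter:
        """Worker: count the words in one batch of lines."""
        c = Counter()
        for line in lines:
            c.update(line.split())
        return c

    def parallel_count(path: str, n_workers: int, batch_size: int = 10_000) -> Counter:
        total = Counter()
        with open(path, encoding="utf-8") as f, ProcessPoolExecutor(n_workers) as pool:
            # batched() keeps memory bounded; each tuple of lines is pickled to a worker.
            for partial in pool.map(count_batch, batched(f, batch_size)):
                total.update(partial)          # merge per-batch counters
        return total

    # On spawn-based platforms, call parallel_count() under `if __name__ == "__main__":`.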

3.4 Idiomatic & Linter Drill

  • Enable ruff's C4 (flake8-comprehensions) and PERF (Perflint) rules. Refactor for-with-append patterns into comprehensions (before/after sketch below). Identify cases where the comprehension is less readable and document them.
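
  The kind of rewrite the PERF rules push you toward - a before/after sketch:

    rows = [{"name": "ada", "score": 9}, {"name": "bob", "score": 7}]

    # Before: for-with-append (ruff flags this as PERF401).
    names = []
    for row in rows:
        if row["score"] > 8:
            names.append(row["name"])

    # After: same result, one expression.
    names = [row["name"] for row in rows if row["score"] > 8]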

3.5 Production Hardening Slice

  • Add hypothesis as a dev dependency. Write property tests for your word counter: invariants like "total = sum of counts," "tokens are non-empty," "shuffling input lines doesn't change the counts" (sketched below).
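
  The invariants above as hypothesis properties - count_words is a stand-in for your own counter:

    from collections import Counter
    from hypothesis import given, strategies as st

    def count_words(lines: list[str]) -> Counter:
        """Stand-in; swap in your word counter."""
        return Counter(word for line in lines for word in line.split())

    @given(st.lists(st.text()))
    def test_total_is_sum_of_counts(lines):
        counts = count_words(lines)
        assert sum(counts.values()) == sum(len(line.split()) for line in lines)

    @given(st.lists(st.text()))
    def test_tokens_are_nonempty(lines):
        assert all(token for token in count_words(lines))

    @given(st.lists(st.text()), st.randoms())
    def test_shuffle_invariance(lines, rnd):
        shuffled = lines[:]
        rnd.shuffle(shuffled)
        assert count_words(shuffled) == count_words(lines)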
