Module 3: Transformer Deep Dive

Multi-Head Attention

Split the representation space into multiple attention subspaces.

Why this module matters

Multi-head attention is where representational diversity enters the transformer block: each head attends over its own learned subspace, so different heads can pick up different relationships in parallel.

Prerequisites

  • Single-head attention

Learning objectives

  • Reshape tensors for heads safely
  • Understand head dimension vs model dimension
  • Inspect parameter and compute costs (a short sketch follows this list)
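
A quick way to ground the last two objectives is to compute the numbers directly. The sketch below assumes an illustrative configuration (d_model = 512, n_heads = 8, seq_len = 128) and uses PyTorch, which the view/reshape mistakes listed later suggest, though the module text does not name a framework; none of these sizes is fixed by the module.

    import torch.nn as nn

    d_model, n_heads = 512, 8            # illustrative sizes, not fixed by the module
    head_dim = d_model // n_heads        # each head attends in a 64-dim subspace
    assert head_dim * n_heads == d_model, "d_model must be divisible by n_heads"

    # Parameter cost of the four projections (Q, K, V, output), ignoring biases.
    proj_params = 4 * d_model * d_model
    print(f"head_dim={head_dim}, projection parameters={proj_params:,}")

    # Rough compute cost of the two large matmuls (Q·Kᵀ and weights·V) per layer.
    seq_len = 128                        # hypothetical sequence length
    attn_flops = 4 * seq_len**2 * d_model
    print(f"approx attention FLOPs at seq_len={seq_len}: {attn_flops:,}")

    # Cross-check against PyTorch's built-in module (which includes biases).
    mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
    print(sum(p.numel() for p in mha.parameters()))   # 4·d_model² + 4·d_model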

Core concepts

Projection matrices
Head concatenation
Per-head specialization
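
The three concepts above are easiest to see at the level of tensor shapes. This is a minimal sketch with illustrative sizes; the fused QKV projection is one common design choice, and separate per-matrix projections work just as well.

    import torch
    import torch.nn as nn

    batch, seq, d_model, n_heads = 2, 10, 512, 8
    head_dim = d_model // n_heads
    x = torch.randn(batch, seq, d_model)

    # Projection matrices: a single fused linear layer producing Q, K and V.
    qkv_proj = nn.Linear(d_model, 3 * d_model)
    q, k, v = qkv_proj(x).chunk(3, dim=-1)                       # each (batch, seq, d_model)

    # Splitting d_model into n_heads subspaces is what lets each head specialize.
    q = q.view(batch, seq, n_heads, head_dim).transpose(1, 2)    # (batch, n_heads, seq, head_dim)

    # Head concatenation: after attention runs per head, the split is undone.
    q_merged = q.transpose(1, 2).reshape(batch, seq, d_model)    # (batch, seq, d_model)
    print(q.shape, q_merged.shape)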

Hands-on practice

  • Implement a minimal multi-head attention module

Expected output

A working multi-head attention class with shape tracing.
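
One possible shape for that class is sketched below. It is a minimal reference under illustrative assumptions, not the module's official solution; names such as MultiHeadAttention and the trace flag are hypothetical, and it implements plain self-attention without masking or dropout.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadAttention(nn.Module):
        """Minimal multi-head self-attention with optional shape tracing."""

        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
            self.n_heads = n_heads
            self.head_dim = d_model // n_heads
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x: torch.Tensor, trace: bool = False) -> torch.Tensor:
            batch, seq, d_model = x.shape

            def split(t):
                # (batch, seq, d_model) -> (batch, n_heads, seq, head_dim)
                return t.view(batch, seq, self.n_heads, self.head_dim).transpose(1, 2)

            q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
            scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
            weights = F.softmax(scores, dim=-1)
            out = weights @ v                                         # (batch, n_heads, seq, head_dim)
            out = out.transpose(1, 2).reshape(batch, seq, d_model)    # concatenate heads
            if trace:
                print(f"q/k/v {tuple(q.shape)}, scores {tuple(scores.shape)}, out {tuple(out.shape)}")
            return self.out_proj(out)

    x = torch.randn(2, 10, 512)
    y = MultiHeadAttention(d_model=512, n_heads=8)(x, trace=True)
    print(tuple(y.shape))   # (2, 10, 512)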

Study checklist

  • Reshape tensors for heads safely
  • Understand head dimension vs model dimension
  • Inspect parameter and compute costs

Common mistakes

  • ⚠️ Incorrect transpose order (the head axis must be split before it is moved)
  • ⚠️ Mismatched head dimensions (d_model not divisible by n_heads)
  • ⚠️ Calling .view() on a non-contiguous tensor where .reshape() is needed (see the sketch after this list)
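
A short sketch of the first and third mistakes, assuming PyTorch and illustrative sizes:

    import torch

    batch, seq, n_heads, head_dim = 2, 10, 8, 64
    x = torch.randn(batch, seq, n_heads * head_dim)

    # Correct: split the feature axis first, then swap the seq and head axes.
    h = x.view(batch, seq, n_heads, head_dim).transpose(1, 2)   # (batch, n_heads, seq, head_dim)

    # Incorrect transpose/reshape order: the final shape looks right, but values
    # are scrambled across heads, so the bug is silent.
    wrong = x.view(batch, n_heads, seq, head_dim)
    print(torch.equal(h, wrong))                                # False

    # Per-head attention output is contiguous in (batch, n_heads, seq, head_dim);
    # after transposing it back, .view() fails on the non-contiguous tensor while
    # .reshape() (or .contiguous().view()) handles it.
    out = torch.randn(batch, n_heads, seq, head_dim)
    merged = out.transpose(1, 2).reshape(batch, seq, n_heads * head_dim)   # works
    # merged = out.transpose(1, 2).view(batch, seq, n_heads * head_dim)    # RuntimeError
    print(merged.shape)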

Module rhythm

  1. Read the summary and why-it-matters section first.
  2. Work through the concepts before rushing into practice.
  3. Use the checklist to verify real understanding, not just completion.

How to continue

Attention on its own has no notion of token order, so the next module adds positional information.


How to use this page well

Treat each module as a compact learning system: understand the intuition, verify the concepts, do one hands-on task, then use the checklist and mistakes section to pressure-test your understanding.