Transformer Architecture
The self-attention mechanism that powers every major LLM — from GPT-4 to Llama 3 to Claude.
Virtually all modern LLMs are built on the Transformer architecture (Vaswani et al., 2017). The key innovation is self-attention: every token can attend to every other token in the context window in a single parallel step (decoder-only LLMs add a causal mask so a token only attends to earlier positions), capturing long-range dependencies that RNNs struggled with.
Core Components
- Token embedding: Token IDs → dense vectors in high-dimensional space. Similar meanings cluster together. Dimensionality: 4096–8192 for large models.
- Positional encoding: Since attention is order-agnostic, position information must be injected. Many modern models use RoPE (Rotary Position Embedding), which helps with length extrapolation; a minimal sketch follows this list.
- Multi-head attention: Each head learns different relationship types (syntax, coreference, semantics). Heads run in parallel and concatenate their outputs.
- Feed-forward network: After attention, each position passes through a 2-layer MLP independently. This is where most of the model's "knowledge" is stored in the weights. Both attention and the MLP appear in the block sketch below.
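
Putting these components together, here is a minimal NumPy sketch of a single pre-norm decoder block: masked multi-head self-attention followed by a position-wise MLP. The weight names (`W_q`, `W_k`, `W_v`, `W_o`, `W_1`, `W_2`), the plain ReLU MLP, and the unlearned LayerNorm are illustrative simplifications rather than any particular model's layout; positional encoding is handled separately in the next sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified LayerNorm without learnable scale/bias
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x, params, n_heads):
    """One pre-norm decoder block: masked multi-head self-attention + MLP.

    x: (seq_len, d_model) token activations (embeddings at the first layer).
    params: dict of weight matrices (names are illustrative, see lead-in).
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # --- Multi-head self-attention --------------------------------------
    h = layer_norm(x)
    q, k, v = h @ params["W_q"], h @ params["W_k"], h @ params["W_v"]
    # Split the model dimension into heads: (n_heads, seq_len, d_head)
    split = lambda t: t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)        # (heads, seq, seq)
    # Causal mask: token i may only attend to positions j <= i
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    out = softmax(scores) @ v                                   # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)      # concatenate heads
    x = x + out @ params["W_o"]                                 # residual connection

    # --- Position-wise feed-forward network -----------------------------
    h = layer_norm(x)
    x = x + np.maximum(0.0, h @ params["W_1"]) @ params["W_2"]  # ReLU MLP + residual
    return x

# Toy usage with random weights: 16 tokens, d_model = 64, 4 heads
rng = np.random.default_rng(0)
d_model, d_ff, n_heads, seq_len = 64, 256, 4, 16
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
          "W_v": (d_model, d_model), "W_o": (d_model, d_model),
          "W_1": (d_model, d_ff),    "W_2": (d_ff, d_model)}
params = {name: 0.02 * rng.standard_normal(s) for name, s in shapes.items()}
x = rng.standard_normal((seq_len, d_model))
print(transformer_block(x, params, n_heads).shape)              # (16, 64)
```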
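The block above leaves out positional information. The sketch below shows one way RoPE can be implemented: each pair of query/key dimensions is rotated by an angle proportional to the token's position, so the dot product between a rotated query and a rotated key depends on their relative offset rather than absolute positions. The function name and the interleaved pairing of dimensions are illustrative choices; real implementations differ in layout details.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply a rotary position embedding to a (seq_len, head_dim) array."""
    seq_len, dim = x.shape
    # One frequency per dimension pair: theta_k = base^(-2k/dim)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)           # (dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)            # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                            # even / odd dims
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin                     # rotate each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

# Toy usage: rotate the queries of one attention head (8 tokens, head_dim = 16)
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
print(rope(q).shape)   # (8, 16)
```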
Attention in One Equation
Scaled dot-product attention
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]
- Q = Query matrix (what am I looking for?)
- K = Key matrix (what do I contain?)
- V = Value matrix (what do I output?)
- d_k = Key dimension (dividing by √d_k keeps the dot products from growing so large that the softmax saturates and its gradients vanish)
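
To make the equation concrete, here is a minimal NumPy sketch that translates it line by line. Variable names mirror the formula; the toy shapes at the end are arbitrary, and no causal mask is applied.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, a direct translation of the equation above.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Each output row is a weighted average of the rows of V, with weights
    given by a row-wise softmax over the scaled query-key similarities.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V

# Toy usage: 4 tokens, d_k = d_v = 8 (random values, shapes only)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)   # (4, 8)
```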