A minimal, from-scratch PyTorch implementation of GPT-2 (124M parameters). Every
component, including causal self-attention, multi-head projections, the position-wise
MLP, pre-norm transformer blocks, learned positional embeddings, and the
weight-tied LM head, is implemented by hand, without nn.Transformer
or F.scaled_dot_product_attention.
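
As an illustration of what "by hand" means for the attention component, the following is a minimal sketch of multi-head causal self-attention written with plain tensor operations. It is not the repository's code: the class name, argument names, and the fused `c_attn` projection layout are assumptions chosen for the example, and the defaults (768-dim embeddings, 12 heads, 1024-token context) are the published GPT-2 124M sizes.

```python
import math

import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention, written out step by step (illustrative)."""

    def __init__(self, n_embd=768, n_head=12, block_size=1024):
        super().__init__()
        assert n_embd % n_head == 0, "embedding size must divide evenly across heads"
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused Q, K, V projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x):
        B, T, C = x.shape
        head_dim = C // self.n_head
        q, k, v = self.c_attn(x).split(C, dim=2)
        # Reshape (B, T, C) -> (B, n_head, T, head_dim) so each head attends separately.
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # Scaled dot-product scores, then hide future positions before the softmax.
        att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = torch.softmax(att, dim=-1)
        y = att @ v  # (B, n_head, T, head_dim)
        # Merge the heads back into a single (B, T, C) tensor.
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```

Writing the mask and softmax explicitly, rather than calling a fused kernel, keeps each intermediate tensor inspectable, which is what makes the per-component tests possible.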
Overview
This project reimplements GPT-2's architecture directly (attention, MLP, transformer
blocks, and the full model) instead of assembling it from PyTorch's
built-in nn.Transformer or fused attention kernels. The implementation
loads OpenAI's original 124M-parameter GPT-2 weights from HuggingFace and reproduces
the reference model's outputs to within the tolerance given under Verification, which
serves as the correctness check for every hand-written component.
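
One detail that matters when loading the HuggingFace checkpoint: its GPT-2 implementation stores the attention and MLP projection weights in a Conv1D layout of shape (in_features, out_features), so a model built from nn.Linear layers needs those matrices transposed. The sketch below shows one way to do that conversion. The function name `convert_hf_state_dict` is hypothetical, and it assumes the hand-written model uses the same parameter names as the HuggingFace state dict, which may not match this repository.

```python
import torch

# Weights that HuggingFace's Conv1D stores as (in_features, out_features).
CONV1D_SUFFIXES = (
    ".attn.c_attn.weight",
    ".attn.c_proj.weight",
    ".mlp.c_fc.weight",
    ".mlp.c_proj.weight",
)
# Causal-mask buffers that some transformers versions keep in the state dict.
# The leading dot keeps ".attn.c_attn.bias" (a real parameter) from matching.
BUFFER_SUFFIXES = (".attn.bias", ".attn.masked_bias")


def convert_hf_state_dict(hf_sd):
    """Return a copy of hf_sd with Conv1D weights transposed for nn.Linear.

    Assumption: the target model reuses HuggingFace's parameter names.
    """
    converted = {}
    for name, tensor in hf_sd.items():
        if name.endswith(BUFFER_SUFFIXES):
            continue  # mask buffers are rebuilt by the model, not loaded
        if name.endswith(CONV1D_SUFFIXES):
            tensor = tensor.t()  # (in, out) -> (out, in), the nn.Linear layout
        converted[name] = tensor.detach().clone().contiguous()
    return converted
```

With the transformers package installed, the input would typically come from `GPT2LMHeadModel.from_pretrained("gpt2").state_dict()`, and the result can then be passed to the hand-written model's `load_state_dict`.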
Verification
Checked against HuggingFace's reference GPT-2 implementation: logits match within
1e-3 on random input, and greedy generation produces token-for-token
identical output. Each component also has a test for a structural property beyond
output shape (causal masking for attention, position-wise independence for
the MLP, gradient flow through a block, and weight-tying/loss sanity for the full
model).
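
The two kinds of checks described above can be sketched as small helpers. These are illustrative only and are not taken from the repository's test suite: `assert_causal` and `assert_logits_match` are hypothetical names, and the 1e-3 bound is treated here as a maximum absolute difference, which is an assumption about how the tolerance is measured.

```python
import torch


def assert_causal(fn, x, t, atol=1e-6):
    """Check that outputs at positions < t do not depend on inputs at positions >= t.

    fn maps a (batch, time, channels) tensor to a tensor with the same time axis.
    """
    x_perturbed = x.clone()
    x_perturbed[:, t:] = torch.randn_like(x[:, t:])  # change only the "future"
    with torch.no_grad():
        ref, out = fn(x), fn(x_perturbed)
    assert torch.allclose(ref[:, :t], out[:, :t], atol=atol), "future tokens leaked"


def assert_logits_match(ours, reference, atol=1e-3):
    """Check that two logit tensors agree within atol; return the largest gap."""
    diff = (ours - reference).abs().max().item()
    assert diff <= atol, f"max abs logit difference {diff:.2e} exceeds {atol:.0e}"
    return diff


if __name__ == "__main__":
    x = torch.randn(2, 6, 4)
    # A running sum over time is causal, so it stands in for an attention layer here.
    assert_causal(lambda inp: torch.cumsum(inp, dim=1), x, t=3)
    print("causality check passed")
```

In practice `fn` would be the attention module or a full transformer block, and `assert_logits_match` would receive the logits of the hand-written model and of the HuggingFace reference for the same token IDs.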