GPT-2 From Scratch (124M)

A minimal, from-scratch PyTorch implementation of GPT-2 (124M parameters). Every component, including causal self-attention, multi-head projections, the position-wise MLP, pre-norm transformer blocks, learned positional embeddings, and the weight-tied LM head, is implemented by hand, without nn.Transformer or F.scaled_dot_product_attention.

Tools/Skills: PyTorch, Python, HuggingFace Transformers

Overview

This project reimplements GPT-2's architecture directly (attention, MLP, transformer blocks, and the full model) instead of assembling it from PyTorch's built-in nn.Transformer or fused attention kernels. The implementation loads OpenAI's original 124M-parameter GPT-2 weights from HuggingFace and reproduces the reference outputs (logits within 1e-3 and identical greedy generations), which serves as the correctness check for every hand-written component.

Implementation

CausalSelfAttention. Fused Q/K/V projection, multi-head reshape, causal masking, scaled dot-product attention, and the output projection.
MLP. The position-wise feedforward block, using GELU with the tanh approximation to match GPT-2's original formulation.
Block. One attention sublayer and one MLP sublayer, each with pre-norm LayerNorm and a residual connection.
GPT. Token and learned positional embeddings, a stack of transformer blocks, a final LayerNorm, and a weight-tied LM head.
from_pretrained. Loads real GPT-2 weights from HuggingFace, transposing the projection weights that the original checkpoint stores in Conv1D layout, (in_features, out_features), into the (out_features, in_features) layout that nn.Linear expects.
generate. Autoregressive next-token generation with temperature and top-k sampling.
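
The sampling step inside generate can be sketched without any dependencies. The snippet below is an illustrative assumption, not the project's code: the helper name sample_top_k, its parameters, and the use of plain Python lists instead of PyTorch tensors are all hypothetical. It scales the logits by the temperature, keeps the k highest-scoring token ids, and samples from the softmax over what remains.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token id from a list of logits (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Keep only the k highest-scoring token ids (all of them if top_k is None).
    ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]

    # Unnormalized softmax weights, shifted by the max for numerical stability.
    peak = scaled[ids[0]]
    weights = [math.exp(scaled[i] - peak) for i in ids]

    # random.choices normalizes the weights, so no explicit division is needed.
    return rng.choices(ids, weights=weights, k=1)[0]


# Toy logits over a 4-token vocabulary (assumed values for illustration).
logits = [2.0, 0.5, -1.0, 3.0]
rng = random.Random(0)
draws = {sample_top_k(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(200)}
greedy = sample_top_k(logits, top_k=1)  # top_k=1 reduces to greedy decoding
```

With top_k=2 only token ids 3 and 0 can ever be drawn, and with top_k=1 the call always returns the argmax, which is the setting a greedy comparison against a reference implementation relies on.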

Verification

The model is checked against HuggingFace's reference GPT-2 implementation: logits match within 1e-3 on random input, and greedy generation produces token-for-token identical output. Each component also has a test for a structural property beyond output shape: causal masking for attention, position-wise independence for the MLP, gradient flow through a block, and weight tying plus a loss sanity check for the full model.
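
As an illustration of what a causal-masking check looks like, here is a minimal, dependency-free sketch. It is not the project's test: it uses a single attention head on plain Python lists, and the names causal_attention and random_vectors are hypothetical. The idea is that perturbing positions at or after some cut point must leave every output before the cut unchanged.

```python
import math
import random


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats) of equal dimension d.
    Position t attends only to positions 0..t.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against the visible prefix only; this is the causal mask.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Softmax over the prefix, shifted by the max for numerical stability.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        # Weighted average of the visible value vectors.
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights)) / total
            for j in range(len(v[0]))
        ])
    return out


def random_vectors(rng, T, d):
    """T random d-dimensional vectors (hypothetical helper for the check)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]


rng = random.Random(0)
T, d, cut = 6, 4, 3
q, k, v = (random_vectors(rng, T, d) for _ in range(3))
baseline = causal_attention(q, k, v)

# Replace everything at positions >= cut, as if the later input tokens changed.
q2 = q[:cut] + random_vectors(rng, T - cut, d)
k2 = k[:cut] + random_vectors(rng, T - cut, d)
v2 = v[:cut] + random_vectors(rng, T - cut, d)
perturbed = causal_attention(q2, k2, v2)

prefix_unchanged = baseline[:cut] == perturbed[:cut]  # the past ignores the future
suffix_changed = baseline[cut:] != perturbed[cut:]    # the perturbation had an effect
```

Checking that the suffix does change guards against a vacuous pass, for example a layer that ignores its input entirely. A tensor-based version of the same check would presumably perturb the input to the attention module and compare outputs with a tolerance rather than exact equality.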