Steering GPT-2's Emotions with Sparse Autoencoders

Finding emotion-related features in GPT-2 with OpenAI’s pretrained SAE, then training an SAE from scratch. Feature patching turns ‘good person’ into ‘shit’.

February 16, 2025 · rick