Steering GPT-2's Emotions with Sparse Autoencoders
Finding emotion-related features in GPT-2 using OpenAI’s pretrained SAE, then training one from scratch. Feature patching turns ‘good person’ into ‘shit’.