Steering GPT-2's Emotions with Sparse Autoencoders
Finding emotion-related features in GPT-2 using OpenAI’s pretrained SAE, then training one from scratch. Feature patching turns ‘good person’ into ‘shit’.
K-means is a limiting case of the Gaussian mixture model (GMM), and GMM is the canonical application of the EM algorithm. How these three connect within a single framework, and how information geometry explains the relationship.
Just as Newton’s F=ma describes the physical world, information geometry describes how AI learns. An intuitive guide for beginners.