Motivation
When an LLM generates a sentence, what “concepts” are being activated internally? And if we artificially modify those concepts, how does the output change?
To answer this, I ran experiments with the Persona Lab team at Modulabs using Sparse Autoencoders (SAE). This post covers two things:
- Using OpenAI’s pretrained SAE to find and manipulate emotion-related features in GPT-2
- Training an SAE from scratch
What is a Sparse Autoencoder?
A Transformer like GPT-2 Small carries its information in a 768-dimensional residual stream. The problem is that individual neurons and residual dimensions don’t map cleanly to single concepts (polysemanticity). SAEs address this by re-expressing activations in a much larger, sparse basis where features are more interpretable.
graph LR
A["Residual Stream<br/>768-dim"] -->|Encoder| B["Sparse Latent<br/>131,072-dim"]
B -->|Decoder| C["Reconstructed<br/>768-dim"]
Key idea:
- Encode 768-dim activations into 131,072 dimensions (~170x expansion)
- TopK activation ensures only a few features fire (sparsity)
- Each feature learns to correspond to one interpretable concept
The loss function is straightforward: reconstruction error plus an L1 penalty on the latent activations (with a TopK activation, sparsity is enforced architecturally, so the L1 weight $\alpha$ can be set to zero):
$$ \mathcal{L}(\mathbf{x}) = \underbrace{\lVert\mathbf{x} - \hat{\mathbf{x}}\rVert_2^2}_{\text{Reconstruction}} + \alpha \underbrace{\lVert\mathbf{c}\rVert_1}_{\text{Sparsity}} $$
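The TopK mechanism itself is simple to sketch in PyTorch. This is a minimal illustration (the value of k and the shapes are made up here; OpenAI's implementation differs in details):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest pre-activations per example; zero the rest."""
    values, indices = torch.topk(pre_acts, k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    # ReLU keeps the surviving activations non-negative
    sparse.scatter_(-1, indices, torch.relu(values))
    return sparse

latents = topk_activation(torch.randn(2, 131072), k=32)
print((latents != 0).sum(dim=-1))  # at most k nonzero features per example
```

Because at most k features fire per input, sparsity holds by construction rather than being traded off against reconstruction in the loss.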
Part 1: Finding Features with Pretrained SAE
Full code available in this Google Colab notebook.
I used OpenAI’s publicly available SAE for GPT-2 Small (128k features).
Loading the Model and SAE
import torch
import transformer_lens
import sparse_autoencoder

model = transformer_lens.HookedTransformer.from_pretrained("gpt2", center_writing_weights=False)

layer_index = 8
location = "resid_post_mlp"
transformer_lens_loc = f"blocks.{layer_index}.hook_resid_post"  # TransformerLens hook name for resid_post_mlp

# state_dict is loaded from OpenAI's public SAE checkpoint for this layer
# (see the Colab notebook for the download step)
autoencoder = sparse_autoencoder.Autoencoder.from_state_dict(state_dict)
SAE architecture:
Autoencoder(
(encoder): Linear(768 → 131,072)
(activation): TopK + ReLU
(decoder): Linear(131,072 → 768)
)
Extracting Feature Indices by Emotion
I fed sentences with various emotional tones and extracted the top 10 most activated features at the last token position.
def get_remarkable_features(prompt):
    tokens = model.to_tokens(prompt)
    with torch.no_grad():
        logits, activation_cache = model.run_with_cache(tokens, remove_batch_dim=True)
    input_tensor = activation_cache[transformer_lens_loc]
    latent_activations, _ = autoencoder.encode(input_tensor)
    # top 10 most activated features at the last token position
    values, indices = torch.topk(latent_activations[-1], 10)
    return indices.tolist()
Results are revealing:
| Input | Top Feature Indices |
|---|---|
| “he is good guy” | 97009, 67809, 4057, 28212, … |
| “he is sucks and fucking stupid idiot” | 62556, 79394, 4057, 78339, … |
| “i hate him. he is ugly and stupid” | 40814, 11982, 59378, 12947, … |
Positive and negative sentences activate clearly different features. Features 62556 and 79394 in particular appear repeatedly in negative contexts.
Analyzing Feature Roles
Through experimentation, I mapped each feature’s effect:
| Feature Index | Inferred Role |
|---|---|
| 62556 | Shifts “coward” → “fool” direction |
| 79394 | Targets negativity |
| 86309 | Removes uncertainty (“not sure” → “sure”) |
| 69689 | Intensifies focus on target |
Activation Patching Experiment
The core experiment: reconstruct feature indices into a 768-dim vector via SAE decoder, then inject it into the model’s forward pass.
import numpy as np

def get_feature(indices):
    # build a one-hot latent vector and decode it back to 768-dim
    vector = np.zeros(131072)
    vector[indices] = 1
    input_tensor = torch.tensor(vector, dtype=torch.float32)
    with torch.no_grad():
        return autoencoder.decoder(input_tensor)

positive_feature = get_feature([62556, 79394, 86309, 69689])

def activation_patching(layer, input, output):
    # returning a tensor from a forward hook replaces the layer's output
    return output + (positive_feature * 20)

# target_layer is the hooked block at layer_index (see the notebook)
hook_handle = target_layer.register_forward_hook(activation_patching)
Effects started showing at 20x scale. Magnitude matters.
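The hook mechanics can be sanity-checked without GPT-2 at all. A self-contained toy example (the 4-dim linear layer and the steering vector here are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block: any nn.Module accepts forward hooks.
layer = nn.Linear(4, 4)
steering_vector = torch.tensor([0.0, 1.0, 0.0, 0.0])  # hypothetical decoded feature

def activation_patching(module, inputs, output):
    # A forward hook that returns a tensor replaces the module's output.
    return output + steering_vector * 20

handle = layer.register_forward_hook(activation_patching)
x = torch.zeros(1, 4)
patched = layer(x)
handle.remove()          # always detach the hook when done
unpatched = layer(x)
print(patched - unpatched)  # exactly the injected vector, [0, 20, 0, 0]
```

Detaching the hook afterwards matters: a forgotten handle keeps steering every subsequent forward pass.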
Results
Before patching (temperature=0.0):
prompt: he is such a
output: he is such a good person, he is such a good person, he is such a good person, ...
After patching (temperature=0.7, features [62556, 79394, 86309, 69689] x20):
prompt: he is such a
output: he is such a shit I will never be able to do it again
did i not say i don't want to do it? i just said i don't want to do it
it sucks to smile when so many people are just trying to think about
The emotional tone flipped completely from the same prompt. The model went from repeatedly generating “good person” to producing sentences filled with anger and frustration.
Varying feature combinations produced different results:
| Feature Combination | Output |
|---|---|
| [62556, 79394] | “I’m not sure if he’s a good guy, but he’s a good guy.” |
| [62556, 79394, 86309] | “he is such a fool. I am a fool. I am a fool.” |
| [62556, 79394, 69689, 86309] | “he is such a shit I will never be able to do it again” |
Adding features one by one strengthened negativity, with 69689 (focus intensifier) producing the most dramatic shift.
Part 2: Training SAE from Scratch
Full code available in this Google Colab notebook.
Using pretrained SAEs is convenient, but understanding requires building one yourself. I trained an SAE on the MLP output of DistilGPT2’s 5th block.
Model Architecture
class SparseAutoEncoder(nn.Module):
    def __init__(self, in_out_size):
        super().__init__()
        self.encoder = nn.Linear(in_out_size, in_out_size * 8, bias=True)
        self.decoder = nn.Linear(in_out_size * 8, in_out_size, bias=True)

    def forward_pass(self, x):
        x = x - self.decoder.bias          # center input by the decoder bias (pre-bias trick)
        encoded = F.relu(self.encoder(x))
        decoded = self.decoder(encoded)    # the decoder bias is added back here
        return decoded, encoded
768-dim → 6,144-dim (8x expansion). Much smaller than OpenAI’s 128k, but sufficient for validating the training process.
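One optimization step over a batch of cached activations might look like the following. This is a sketch: the learning rate, α, and batch of random data are illustrative stand-ins, not the notebook's actual values, and the forward pass mirrors forward_pass above in functional form so the snippet runs standalone:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, expansion = 768, 8
encoder = torch.nn.Linear(d, d * expansion)
decoder = torch.nn.Linear(d * expansion, d)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)
alpha = 1e-3  # illustrative sparsity weight

batch = torch.randn(32, d)                       # stand-in for cached MLP activations
encoded = F.relu(encoder(batch - decoder.bias))  # center by decoder bias, then encode
decoded = decoder(encoded)
# reconstruction error + L1 penalty on latent activations, as in the loss above
loss = F.mse_loss(decoded, batch) + alpha * encoded.abs().sum(dim=-1).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the real run, `batch` is replaced by MLP activations cached from DistilGPT2's 5th block.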
Monitoring Decoder Orthogonality
SAE decoder column vectors should be orthogonal so each feature represents an independent concept. I tracked this via the Gram matrix off-diagonal mean:
$$ G = W_{\text{norm}}^T W_{\text{norm}} $$
$$ \text{orthogonality} = \frac{1}{n^2 - n} \sum_{i \neq j} |G_{ij}| $$
def measure_decoder_orthogonality(self):
    W = self.decoder.weight.data
    col_norms = W.norm(dim=0, keepdim=True)
    normed_W = W / (col_norms + 1e-9)
    gram = torch.matmul(normed_W.t(), normed_W)
    n = gram.shape[0]
    # zero the diagonal (all ones) and average over the n^2 - n off-diagonal terms
    off_diag = gram - torch.diag(torch.diag(gram))
    return off_diag.abs().sum().item() / (n * n - n)
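As a sanity check on this metric: for random unit directions in 768 dimensions, the expected absolute cosine similarity is roughly √(2/(π·768)) ≈ 0.029, so a randomly initialized decoder already scores near 0.03. A quick standalone verification:

```python
import math
import torch

# Off-diagonal |cos| mean for a randomly initialized 768 x 6144 decoder.
W = torch.randn(768, 6144)
normed = W / W.norm(dim=0, keepdim=True)
gram = normed.t() @ normed
n = gram.shape[0]
off_diag_mean = (gram.abs().sum() - n) / (n * n - n)  # diagonal entries are all 1

expected = math.sqrt(2 / (math.pi * 768))  # E|cos| for random Gaussian directions
print(off_diag_mean.item(), expected)       # both come out around 0.029
```

This suggests the useful signal in the training log is not the absolute ~0.03 level (which random columns already achieve) but that the value does not grow as training proceeds.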
Dead Neuron Resampling
A common issue in SAE training: dead neurons that never activate become untrainable. I reinitialized neurons with activation below a threshold:
def resample_dead_neurons(self, activation_stats, threshold=1e-5):
    with torch.no_grad():
        # neurons whose accumulated activation stays below the threshold are dead
        dead_indices = (activation_stats < threshold).nonzero().squeeze(-1)
        for idx in dead_indices:
            # reinitialize the encoder row so the neuron can fire again
            self.encoder.weight[idx].normal_()
            self.encoder.bias[idx].zero_()
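Here `activation_stats` is assumed to be a per-latent firing statistic accumulated during training. A self-contained sketch of how the dead set gets selected (the shapes, threshold, and planted dead indices are illustrative):

```python
import torch

n_latents = 6144
# Simulated per-latent firing rates; offset so no latent is dead by chance.
activation_stats = torch.rand(n_latents) + 0.1
activation_stats[[7, 42, 99]] = 0.0  # plant three dead latents

threshold = 1e-5
dead_indices = (activation_stats < threshold).nonzero().squeeze(-1)
print(dead_indices.tolist())  # → [7, 42, 99]
```

In the training loop, the statistic can be built by summing `encoded > 0` counts per latent across batches between resampling steps.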
Training Results
Trained for 1000 steps on a Korean commercial dataset (KoCommercial-Dataset):
Step 0 | Loss: 16.8401 | off_diag_mean: 0.0299
Step 100 | Loss: 12.8732 | off_diag_mean: 0.0300
Step 200 | Loss: 9.5500 | off_diag_mean: 0.0302
Step 500 | Loss: 8.2574 | off_diag_mean: 0.0306
Step 900 | Loss: 5.7219 | off_diag_mean: 0.0311
Loss decreased steadily from 16.84 to 5.72, while off_diag_mean remained stable around 0.03. Decoder orthogonality was preserved throughout training.
Takeaways
graph TD
A["LLM Activation<br/>768-dim"] -->|SAE Encode| B["Sparse Features<br/>131,072-dim"]
B -->|Feature Analysis| C{"Identify Emotion<br/>Features"}
C -->|Decode + Scale| D["Steering Vector<br/>768-dim"]
D -->|Inject via Hook| E[Modified Output]
Key findings:
- SAE-extracted features do correspond to interpretable concepts
- Combining and scaling features can control model behavior
- Magnitude matters: roughly 20x amplification was needed for visible effects
- Feature combinations matter: multi-feature injection produces sharper steering than individual features
Limitations and future work:
- Compare injecting 1.0 vs actual activation magnitudes into feature slots
- Apply to larger models (Gemma-3-4B, etc.) using CAA (covered in a separate post)
- Study the relationship between expansion factor (currently 8x) and performance
Further Resources
If SAE and Mechanistic Interpretability caught your interest, here are some communities and tools worth exploring.
Neuronpedia is an interactive platform for browsing SAE features. You can explore what text each feature activates on and what meaning it carries. The feature indices used in this post can be looked up there directly.
Open Source Mechanistic Interpretability is a Slack community discussing SAE, feature interpretation, activation patching, and other MI research. Active paper readings, code sharing, and experiment discussions.