Motivation
When an LLM generates a sentence, what “concepts” are being activated internally? And if we artificially modify those concepts, how does the output change?
To answer this, I ran experiments with the Persona Lab team at Modulabs using Sparse Autoencoders (SAE). This post covers two things:
- Using OpenAI’s pretrained SAE to find and manipulate emotion-related features in GPT-2
- Training an SAE from scratch
What is a Sparse Autoencoder?
A Transformer like GPT-2 Small carries its information in a 768-dimensional residual stream. The problem is that individual neurons and residual dimensions don’t map cleanly to single concepts (polysemanticity). SAEs address this by re-expressing activations in a much larger, sparse basis where features are more interpretable.
graph LR
A["Residual Stream<br/>768-dim"] -->|Encoder| B["Sparse Latent<br/>131,072-dim"]
B -->|Decoder| C["Reconstructed<br/>768-dim"]
Key idea:
- Encode 768-dim activations into 131,072 dimensions (~170x expansion)
- TopK activation ensures only a few features fire (sparsity)
- Each feature learns to correspond to one interpretable concept
The loss function is straightforward: reconstruction error plus an L1 penalty on the latent activations (with a TopK activation, sparsity is enforced architecturally, so the L1 weight $\alpha$ can be set to zero):
$$ \mathcal{L}(\mathbf{x}) = \underbrace{\lVert\mathbf{x} - \hat{\mathbf{x}}\rVert_2^2}_{\text{Reconstruction}} + \alpha \underbrace{\lVert\mathbf{c}\rVert_1}_{\text{Sparsity}} $$
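The TopK mechanism itself is simple to sketch in PyTorch. This is a minimal illustration (the value of k and the shapes are made up here; OpenAI's implementation differs in details):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest pre-activations per example; zero the rest."""
    values, indices = torch.topk(pre_acts, k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    # ReLU keeps the surviving activations non-negative
    sparse.scatter_(-1, indices, torch.relu(values))
    return sparse

latents = topk_activation(torch.randn(2, 131072), k=32)
print((latents != 0).sum(dim=-1))  # at most k nonzero features per example
```

Because at most k features fire per input, sparsity holds by construction rather than being traded off against reconstruction in the loss.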
Part 1: Finding Features with Pretrained SAE
Full code available in this Google Colab notebook.
I used OpenAI’s publicly available SAE for GPT-2 Small (128k features).
Loading the Model and SAE
import torch
import transformer_lens
import sparse_autoencoder

model = transformer_lens.HookedTransformer.from_pretrained("gpt2", center_writing_weights=False)

layer_index = 8
location = "resid_post_mlp"
transformer_lens_loc = f"blocks.{layer_index}.hook_resid_post"  # TransformerLens hook name for resid_post_mlp

# state_dict is loaded from OpenAI's public SAE checkpoint for this layer
# (see the Colab notebook for the download step)
autoencoder = sparse_autoencoder.Autoencoder.from_state_dict(state_dict)
SAE architecture:
Autoencoder(
(encoder): Linear(768 → 131,072)
(activation): TopK + ReLU
(decoder): Linear(131,072 → 768)
)
Extracting Feature Indices by Emotion
I fed sentences with various emotional tones and extracted the top 10 most activated features at the last token position.
def get_remarkable_features(prompt):
    tokens = model.to_tokens(prompt)
    with torch.no_grad():
        logits, activation_cache = model.run_with_cache(tokens, remove_batch_dim=True)
    input_tensor = activation_cache[transformer_lens_loc]
    latent_activations, _ = autoencoder.encode(input_tensor)
    # top 10 most activated features at the last token position
    values, indices = torch.topk(latent_activations[-1], 10)
    return indices.tolist()
Results are revealing:
| Input | Top Feature Indices |
|---|---|
| “he is good guy” | 97009, 67809, 4057, 28212, … |
| “he is sucks and fucking stupid idiot” | 62556, 79394, 4057, 78339, … |
| “i hate him. he is ugly and stupid” | 40814, 11982, 59378, 12947, … |
Positive and negative sentences activate clearly different features. Features 62556 and 79394 in particular appear repeatedly in negative contexts.
Analyzing Feature Roles
Through experimentation, I mapped each feature’s effect:
| Feature Index | Inferred Role |
|---|---|
| 62556 | Shifts “coward” → “fool” direction |
| 79394 | Targets negativity |
| 86309 | Removes uncertainty (“not sure” → “sure”) |
| 69689 | Intensifies focus on target |
Activation Patching Experiment
The core experiment: reconstruct feature indices into a 768-dim vector via SAE decoder, then inject it into the model’s forward pass.
import numpy as np

def get_feature(indices):
    # build a one-hot latent vector and decode it back to 768-dim
    vector = np.zeros(131072)
    vector[indices] = 1
    input_tensor = torch.tensor(vector, dtype=torch.float32)
    with torch.no_grad():
        return autoencoder.decoder(input_tensor)

positive_feature = get_feature([62556, 79394, 86309, 69689])

def activation_patching(layer, input, output):
    # returning a tensor from a forward hook replaces the layer's output
    return output + (positive_feature * 20)

# target_layer is the hooked block at layer_index (see the notebook)
hook_handle = target_layer.register_forward_hook(activation_patching)
Effects started showing at 20x scale. Magnitude matters.
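The hook mechanics can be sanity-checked without GPT-2 at all. A self-contained toy example (the 4-dim linear layer and the steering vector here are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block: any nn.Module accepts forward hooks.
layer = nn.Linear(4, 4)
steering_vector = torch.tensor([0.0, 1.0, 0.0, 0.0])  # hypothetical decoded feature

def activation_patching(module, inputs, output):
    # A forward hook that returns a tensor replaces the module's output.
    return output + steering_vector * 20

handle = layer.register_forward_hook(activation_patching)
x = torch.zeros(1, 4)
patched = layer(x)
handle.remove()          # always detach the hook when done
unpatched = layer(x)
print(patched - unpatched)  # exactly the injected vector, [0, 20, 0, 0]
```

Detaching the hook afterwards matters: a forgotten handle keeps steering every subsequent forward pass.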
Results
Before patching (temperature=0.0):
prompt: he is such a
output: he is such a good person, he is such a good person, he is such a good person, ...
After patching (temperature=0.7, features [62556, 79394, 86309, 69689] x20):
prompt: he is such a
output: he is such a shit I will never be able to do it again
did i not say i don't want to do it? i just said i don't want to do it
it sucks to smile when so many people are just trying to think about
The emotional tone flipped completely from the same prompt. The model went from repeatedly generating “good person” to producing sentences filled with anger and frustration.
Varying feature combinations produced different results:
| Feature Combination | Output |
|---|---|
| [62556, 79394] | “I’m not sure if he’s a good guy, but he’s a good guy.” |
| [62556, 79394, 86309] | “he is such a fool. I am a fool. I am a fool.” |
| [62556, 79394, 69689, 86309] | “he is such a shit I will never be able to do it again” |
Adding features one by one strengthened negativity, with 69689 (focus intensifier) producing the most dramatic shift.
Part 2: Training SAE from Scratch
Full code available in this Google Colab notebook.
Using pretrained SAEs is convenient, but understanding requires building one yourself. I trained an SAE on the MLP output of DistilGPT2’s 5th block.
Model Architecture
class SparseAutoEncoder(nn.Module):
    def __init__(self, in_out_size):
        super().__init__()
        self.encoder = nn.Linear(in_out_size, in_out_size * 8, bias=True)
        self.decoder = nn.Linear(in_out_size * 8, in_out_size, bias=True)

    def forward_pass(self, x):
        x = x - self.decoder.bias          # center input by the decoder bias (pre-bias trick)
        encoded = F.relu(self.encoder(x))
        decoded = self.decoder(encoded)    # the decoder bias is added back here
        return decoded, encoded
768-dim → 6,144-dim (8x expansion). Much smaller than OpenAI’s 128k, but sufficient for validating the training process.
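One optimization step over a batch of cached activations might look like the following. This is a sketch: the learning rate, α, and batch of random data are illustrative stand-ins, not the notebook's actual values, and the forward pass mirrors forward_pass above in functional form so the snippet runs standalone:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, expansion = 768, 8
encoder = torch.nn.Linear(d, d * expansion)
decoder = torch.nn.Linear(d * expansion, d)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)
alpha = 1e-3  # illustrative sparsity weight

batch = torch.randn(32, d)                       # stand-in for cached MLP activations
encoded = F.relu(encoder(batch - decoder.bias))  # center by decoder bias, then encode
decoded = decoder(encoded)
# reconstruction error + L1 penalty on latent activations, as in the loss above
loss = F.mse_loss(decoded, batch) + alpha * encoded.abs().sum(dim=-1).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the real run, `batch` is replaced by MLP activations cached from DistilGPT2's 5th block.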
Monitoring Decoder Orthogonality
SAE decoder column vectors should be orthogonal so each feature represents an independent concept. I tracked this via the Gram matrix off-diagonal mean:
$$ G = W_{\text{norm}}^T W_{\text{norm}} $$
$$ \text{orthogonality} = \frac{1}{n^2 - n} \sum_{i \neq j} |G_{ij}| $$
def measure_decoder_orthogonality(self):
    W = self.decoder.weight.data
    col_norms = W.norm(dim=0, keepdim=True)
    normed_W = W / (col_norms + 1e-9)
    gram = torch.matmul(normed_W.t(), normed_W)
    n = gram.shape[0]
    # zero the diagonal (all ones) and average over the n^2 - n off-diagonal terms
    off_diag = gram - torch.diag(torch.diag(gram))
    return off_diag.abs().sum().item() / (n * n - n)
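As a sanity check on this metric: for random unit directions in 768 dimensions, the expected absolute cosine similarity is roughly √(2/(π·768)) ≈ 0.029, so a randomly initialized decoder already scores near 0.03. A quick standalone verification:

```python
import math
import torch

# Off-diagonal |cos| mean for a randomly initialized 768 x 6144 decoder.
W = torch.randn(768, 6144)
normed = W / W.norm(dim=0, keepdim=True)
gram = normed.t() @ normed
n = gram.shape[0]
off_diag_mean = (gram.abs().sum() - n) / (n * n - n)  # diagonal entries are all 1

expected = math.sqrt(2 / (math.pi * 768))  # E|cos| for random Gaussian directions
print(off_diag_mean.item(), expected)       # both come out around 0.029
```

This suggests the useful signal in the training log is not the absolute ~0.03 level (which random columns already achieve) but that the value does not grow as training proceeds.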
Dead Neuron Resampling
A common issue in SAE training: dead neurons that never activate become untrainable. I reinitialized neurons with activation below a threshold:
def resample_dead_neurons(self, activation_stats, threshold=1e-5):
    with torch.no_grad():
        # neurons whose accumulated activation stays below the threshold are dead
        dead_indices = (activation_stats < threshold).nonzero().squeeze(-1)
        for idx in dead_indices:
            # reinitialize the encoder row so the neuron can fire again
            self.encoder.weight[idx].normal_()
            self.encoder.bias[idx].zero_()
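Here `activation_stats` is assumed to be a per-latent firing statistic accumulated during training. A self-contained sketch of how the dead set gets selected (the shapes, threshold, and planted dead indices are illustrative):

```python
import torch

n_latents = 6144
# Simulated per-latent firing rates; offset so no latent is dead by chance.
activation_stats = torch.rand(n_latents) + 0.1
activation_stats[[7, 42, 99]] = 0.0  # plant three dead latents

threshold = 1e-5
dead_indices = (activation_stats < threshold).nonzero().squeeze(-1)
print(dead_indices.tolist())  # → [7, 42, 99]
```

In the training loop, the statistic can be built by summing `encoded > 0` counts per latent across batches between resampling steps.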
Training Results
Trained for 1000 steps on a Korean commercial dataset (KoCommercial-Dataset):
Step 0 | Loss: 16.8401 | off_diag_mean: 0.0299
Step 100 | Loss: 12.8732 | off_diag_mean: 0.0300
Step 200 | Loss: 9.5500 | off_diag_mean: 0.0302
Step 500 | Loss: 8.2574 | off_diag_mean: 0.0306
Step 900 | Loss: 5.7219 | off_diag_mean: 0.0311
Loss decreased steadily from 16.84 to 5.72, while off_diag_mean remained stable around 0.03. Decoder orthogonality was preserved throughout training.
Takeaways
graph TD
A["LLM Activation<br/>768-dim"] -->|SAE Encode| B["Sparse Features<br/>131,072-dim"]
B -->|Feature Analysis| C{"Identify Emotion<br/>Features"}
C -->|Decode + Scale| D["Steering Vector<br/>768-dim"]
D -->|Inject via Hook| E[Modified Output]
Key findings:
- SAE-extracted features do correspond to interpretable concepts
- Combining and scaling features can control model behavior
- Magnitude matters: roughly 20x amplification was needed for visible effects
- Feature combinations matter: multi-feature injection produces sharper steering than individual features
Limitations and future work:
- Compare injecting 1.0 vs actual activation magnitudes into feature slots
- Apply to larger models (Gemma-3-4B, etc.) using CAA (covered in a separate post)
- Study the relationship between expansion factor (currently 8x) and performance
Further Resources
If SAE and Mechanistic Interpretability caught your interest, here are some communities and tools worth exploring.
Neuronpedia is an interactive platform for browsing SAE features. You can explore what text each feature activates on and what meaning it carries. The feature indices used in this post can be looked up there directly.
Open Source Mechanistic Interpretability is a Slack community discussing SAE, feature interpretation, activation patching, and other MI research. Active paper readings, code sharing, and experiment discussions.