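Steering GPT-2's Emotions with a Sparse Autoencoder

Motivation

When an LLM generates text, which "concepts" are active internally? And if we change those concepts artificially, how does the output change?

To explore this question, we ran experiments with Sparse Autoencoders (SAEs) together with members of the MODULABS (모두의연구소) Persona Lab. This post covers two things:

Finding and steering emotion-related features in GPT-2 using OpenAI's pretrained SAE
Training an SAE from scratch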

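What Is a Sparse Autoencoder?

A Transformer carries a residual stream of a few hundred dimensions (768 in GPT-2 Small). The difficulty is that individual neurons in this vector space do not map onto single, clean concepts (polysemanticity). An SAE is designed to address this by re-expressing the activation in a much larger, sparse basis.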


graph LR
    A["Residual Stream (768 dims)"] -->|Encoder| B["Sparse Latent (131,072 dims)"]
    B -->|Decoder| C["Reconstructed (768 dims)"]

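Core ideas:

Encode the 768-dimensional activation into 131,072 dimensions (roughly a 170x expansion)
Use a TopK activation so that only a small number of features are active (sparsity)
Train so that each feature corresponds to a single interpretable concept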
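The three ideas above can be sketched as a toy module. This is a minimal illustration under assumed sizes (16 input dims, 128 latents, k = 4), and the class name `TinyTopKSAE` is hypothetical; it is not OpenAI's implementation.

```python
import torch
import torch.nn as nn


class TinyTopKSAE(nn.Module):
    """Toy TopK sparse autoencoder (hypothetical; not OpenAI's implementation)."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x)  # (batch, n_latents)
        values, indices = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre)
        # Keep only the k largest pre-activations; ReLU zeroes any negative ones.
        latents.scatter_(-1, indices, torch.relu(values))
        return latents

    def forward(self, x: torch.Tensor):
        latents = self.encode(x)
        return self.decoder(latents), latents


torch.manual_seed(0)
sae = TinyTopKSAE(d_model=16, n_latents=16 * 8, k=4)
x = torch.randn(3, 16)
recon, latents = sae(x)
active_per_row = (latents != 0).sum(dim=-1)  # at most k active latents per input
```

Each input ends up with at most `k` non-zero latents regardless of how wide the latent layer is, which is what makes a 131,072-dimensional code workable.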

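In its classic form the loss is simple: a reconstruction term between the input $\mathbf{x}$ and its reconstruction $\hat{\mathbf{x}}$, plus an L1 sparsity penalty on the latent code $\mathbf{c}$ weighted by $\alpha$. (A TopK SAE enforces sparsity through the activation itself, so it typically does not need the L1 term.)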

$$ \mathcal{L}(\mathbf{x}) = \underbrace{\lVert\mathbf{x} - \hat{\mathbf{x}}\rVert_2^2}_{\text{Reconstruction}} + \alpha \underbrace{\lVert\mathbf{c}\rVert_1}_{\text{Sparsity}} $$
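As a worked instance of this formula, the sketch below evaluates both terms on tiny hand-picked tensors. The function name `sae_loss` and the value `alpha=0.1` are assumptions for illustration only.

```python
import torch


def sae_loss(x, x_hat, c, alpha):
    """Return (total, reconstruction, sparsity), each averaged over the batch."""
    reconstruction = (x - x_hat).pow(2).sum(dim=-1)  # squared L2 norm per sample
    sparsity = c.abs().sum(dim=-1)                   # L1 norm of the latent code
    total = reconstruction + alpha * sparsity
    return total.mean(), reconstruction.mean(), sparsity.mean()


x = torch.tensor([[1.0, 2.0, 3.0]])
x_hat = torch.tensor([[1.0, 1.0, 1.0]])
c = torch.tensor([[0.0, 2.0, 0.0, -1.0]])

# Reconstruction: 0^2 + 1^2 + 2^2 = 5. Sparsity: |2| + |-1| = 3. Total: 5 + 0.1 * 3 = 5.3.
total, rec, sp = sae_loss(x, x_hat, c, alpha=0.1)
```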

Part 1: Finding Features with a Pretrained SAE

The full code is available in a Google Colab notebook.

We used the SAE for GPT-2 Small (128k features) released by OpenAI.

Loading the Model and the SAE

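import torch
import transformer_lens
import sparse_autoencoder

# Load GPT-2 Small through TransformerLens.
model = transformer_lens.HookedTransformer.from_pretrained("gpt2", center_writing_weights=False)

layer_index = 8
location = "resid_post_mlp"  # residual stream after the MLP of block 8
# TransformerLens hook name for that location (used when reading the activation cache).
transformer_lens_loc = f"blocks.{layer_index}.hook_resid_post"

# state_dict is the released SAE checkpoint for (location, layer_index);
# loading it is omitted here (see the notebook).
autoencoder = sparse_autoencoder.Autoencoder.from_state_dict(state_dict)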

SAE structure:

Autoencoder(
  (encoder): Linear(768 → 131,072)
  (activation): TopK + ReLU
  (decoder): Linear(131,072 → 768)
)

Extracting Feature Indices per Emotion

We fed in sentences with different emotional tones and extracted the 10 features that activate most strongly at the last token.

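def get_remarkable_features(prompt):
    """Return the indices of the 10 most active SAE features at the last token."""
    tokens = model.to_tokens(prompt)
    with torch.no_grad():
        # remove_batch_dim=True gives activations of shape (n_tokens, 768).
        logits, activation_cache = model.run_with_cache(tokens, remove_batch_dim=True)
        # transformer_lens_loc is the hook name of the layer-8 residual stream,
        # e.g. f"blocks.{layer_index}.hook_resid_post".
        input_tensor = activation_cache[transformer_lens_loc]
        latent_activations, _ = autoencoder.encode(input_tensor)
        # Top 10 features at the final token position.
        values, indices = torch.topk(latent_activations[-1], 10)
        return indices.tolist()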

The results are interesting:

Input sentence	Top feature indices
“he is good guy”	97009, 67809, 4057, 28212, …
“he is sucks and fucking stupid idiot”	62556, 79394, 4057, 78339, …
“i hate him. he is ugly and stupid”	40814, 11982, 59378, 12947, …

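The features activated by the positive sentence and by the negative sentences are clearly different. In particular, features such as 62556 and 79394 kept appearing in negative contexts in our runs. Not every top feature is sentiment-specific, though: 4057 shows up for both the positive sentence and one of the negative ones.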
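The comparison can be made explicit with plain set operations. The sketch below uses only the four indices shown per row in the table (the remaining top-10 entries are elided there), so it illustrates the method rather than giving a complete analysis.

```python
# Top feature indices copied from the table above (first four per prompt only).
top_features = {
    "he is good guy": [97009, 67809, 4057, 28212],
    "he is sucks and fucking stupid idiot": [62556, 79394, 4057, 78339],
    "i hate him. he is ugly and stupid": [40814, 11982, 59378, 12947],
}

positive_prompt = "he is good guy"
positive = set(top_features[positive_prompt])
negative = set().union(
    *(indices for prompt, indices in top_features.items() if prompt != positive_prompt)
)

shared = positive & negative         # features active for both tones
negative_only = negative - positive  # candidates for negative-sentiment features
```

Features in `shared` are poor steering candidates because they do not separate the two tones; `negative_only` is the pool worth inspecting further.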

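Analyzing the Role of Each Feature

We probed the effect of each feature experimentally:

Feature index	Estimated role
62556	Shifts “coward” toward “fool”
79394	Designates a negative target
86309	Removes uncertainty (“not sure” → “sure”)
69689	Strengthens focus on the target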

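Activation Patching Experiment

Now for the core of the experiment. We use the SAE decoder to map the feature indices back to a 768-dimensional vector and inject it into the model's forward pass. Because a vector is added to the activation rather than swapped in from another run, this is close to what is usually called activation steering.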

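def get_feature(indices):
    """Decode a set of feature indices (each set to 1) into a 768-dim vector."""
    input_tensor = torch.zeros(131072, dtype=torch.float32)
    input_tensor[indices] = 1.0
    with torch.no_grad():
        return autoencoder.decoder(input=input_tensor)

# Note: despite the variable name, these are the negative-sentiment features found above.
positive_feature = get_feature([62556, 79394, 86309, 69689])

def activation_patching(layer, input, output):
    # Add the decoded vector, scaled by 20, to the layer output.
    return output + (positive_feature * 20)

# target_layer is the module whose output is being steered (defined in the notebook).
hook_handle = target_layer.register_forward_hook(activation_patching)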

The effect only began to appear once the vector was scaled by 20, which confirms that magnitude is an important factor.
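To make the role of the scale concrete, here is a self-contained toy: a small `nn.Linear` stands in for the hooked transformer block and a random vector stands in for the decoded feature vector (both are assumptions, not the setup from the notebook). It also shows removing the hook afterwards, since a registered hook otherwise keeps modifying every later forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 8)    # stand-in for the hooked transformer block
steering = torch.randn(8)  # stand-in for the decoded feature vector
x = torch.randn(1, 8)


def make_hook(scale):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return output + scale * steering
    return hook


baseline = layer(x)
shifts = {}
for scale in (1.0, 20.0):
    handle = layer.register_forward_hook(make_hook(scale))
    try:
        # How far the output moved away from the unsteered baseline.
        shifts[scale] = (layer(x) - baseline).norm().item()
    finally:
        handle.remove()    # detach the hook so later calls are unaffected

restored = layer(x)        # identical to baseline once the hook is removed
```

The shift grows linearly with the scale, so whether a given scale is "enough" depends on how large the added vector is relative to the activations it is added to.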

Results

Before patching (temperature=0.0):

prompt: he is such a
output: he is such a good person, he is such a good person, he is such a good person, ...

After patching (temperature=0.7, features [62556, 79394, 86309, 69689] x20):

prompt: he is such a
output: he is such a shit I will never be able to do it again
        did i not say i don't want to do it? i just said i don't want to do it
        it sucks to smile when so many people are just trying to think about

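With the same prompt, the emotional tone flipped. The model that had been repeating “good person” began producing sentences full of anger and frustration. Note that the two runs also differ in sampling temperature (0.0 vs 0.7), so the end of the repetition loop may partly reflect sampling rather than the patch alone.

We also checked how the output changes with the feature combination:

Feature combination	Output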
[62556, 79394]	“I’m not sure if he’s a good guy, but he’s a good guy.”
[62556, 79394, 86309]	“he is such a fool. I am a fool. I am a fool.”
[62556, 79394, 69689, 86309]	“he is such a shit I will never be able to do it again”

Adding features one at a time strengthens the negative sentiment, and the most dramatic change came when 69689 (focus reinforcement) was added.

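Part 2: Training an SAE from Scratch

The full code is available in a Google Colab notebook.

Using a pretrained SAE is fine, but to understand the principle you have to train one yourself. We trained an SAE on the MLP output of the 5th block of DistilGPT2.

Model Architecture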

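import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoEncoder(nn.Module):
    def __init__(self, in_out_size):
        super().__init__()
        # Declared but not used by forward_pass below, which centers with decoder.bias.
        self.input_bias = nn.Parameter(torch.zeros(in_out_size))
        self.encoder = nn.Linear(in_out_size, in_out_size * 8, bias=True)   # 768 -> 6,144
        self.decoder = nn.Linear(in_out_size * 8, in_out_size, bias=True)   # 6,144 -> 768

    def forward_pass(self, x):
        # Center the input with the decoder bias before encoding.
        x = x - self.decoder.bias
        encoded = F.relu(self.encoder(x))   # non-negative latent code
        decoded = self.decoder(encoded)     # reconstruction
        return decoded, encoded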

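The SAE maps 768 dimensions to 6,144 (x8 expansion). This is far smaller than OpenAI's 128k features, but enough to verify how training works.

Monitoring Decoder Orthogonality

The more nearly orthogonal the decoder's column vectors are, the less the features overlap in direction. With 6,144 features in a 768-dimensional space they cannot all be exactly orthogonal, so we track how far they are from it using the mean absolute off-diagonal entry of the Gram matrix of the column-normalized decoder weights $W_{\text{norm}}$ (with $n$ the number of features):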

$$ G = W_{\text{norm}}^T W_{\text{norm}} $$

$$ \text{orthogonality} = \frac{1}{n^2 - n} \sum_{i \neq j} |G_{ij}| $$

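def measure_decoder_orthogonality(self):
    # decoder.weight has shape (in_out_size, n_latents): one column per feature.
    W = self.decoder.weight.data
    col_norms = W.norm(dim=0, keepdim=True)
    normed_W = W / (col_norms + 1e-9)
    gram = torch.matmul(normed_W.t(), normed_W)   # (n_latents, n_latents) cosine similarities
    n = gram.shape[0]
    off_diag_vals = gram - torch.diag(torch.diag(gram))
    # Average over the n^2 - n off-diagonal entries only, matching the formula above.
    return (off_diag_vals.abs().sum() / (n * n - n)).item()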
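To get a feel for the scale of this metric, the sketch below applies the same computation to three reference matrices: exactly orthogonal columns, identical columns, and random Gaussian columns in 768 dimensions. The standalone function `off_diag_mean` is a hypothetical restatement of the method above, and the random matrix uses 2,048 columns rather than 6,144 only to keep memory use small.

```python
import torch


def off_diag_mean(W: torch.Tensor) -> float:
    """Mean absolute cosine similarity between distinct columns of W."""
    normed = W / (W.norm(dim=0, keepdim=True) + 1e-9)
    gram = normed.t() @ normed
    n = gram.shape[0]
    off_diag = gram - torch.diag(torch.diag(gram))
    return (off_diag.abs().sum() / (n * n - n)).item()


torch.manual_seed(0)
score_orth = off_diag_mean(torch.eye(4))              # orthogonal columns: 0
score_par = off_diag_mean(torch.ones(4, 3))           # identical columns: 1
score_rand = off_diag_mean(torch.randn(768, 2048))    # random directions in 768 dims
```

For random directions in $d$ dimensions the expected absolute cosine is approximately $\sqrt{2/(\pi d)}$, about 0.029 for $d = 768$. A reading near 0.03 is therefore about what randomly oriented columns would give, which is a useful baseline when interpreting the metric.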

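Dead Neuron Resampling

A common problem in SAE training is dead neurons. A latent that never activates receives no gradient through the ReLU, so it cannot learn. Latents whose activation statistic falls below a threshold are therefore re-initialized: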

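def resample_dead_neurons(self, activation_stats, threshold=1e-5):
    # activation_stats: one activation statistic per latent (e.g. firing frequency).
    with torch.no_grad():
        dead_indices = (activation_stats < threshold).nonzero().squeeze(-1)
        for idx in dead_indices.tolist():
            # Re-draw the encoder row from a standard normal and reset its bias.
            self.encoder.weight[idx].normal_()
            self.encoder.bias[idx].zero_()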
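The post does not show how `activation_stats` is computed. One plausible choice, sketched here as an assumption, is the fraction of samples on which each latent fired; the class name `FiringRateTracker` is hypothetical.

```python
import torch


class FiringRateTracker:
    """Tracks the fraction of samples on which each latent was active (> 0)."""

    def __init__(self, n_latents: int):
        self.fired = torch.zeros(n_latents)
        self.seen = 0

    def update(self, encoded: torch.Tensor) -> None:
        # Flatten any leading batch/sequence dims into one sample axis.
        flat = encoded.detach().reshape(-1, encoded.shape[-1])
        self.fired += (flat > 0).float().sum(dim=0)
        self.seen += flat.shape[0]

    def rates(self) -> torch.Tensor:
        return self.fired / max(self.seen, 1)


tracker = FiringRateTracker(n_latents=4)
tracker.update(torch.tensor([[0.5, 0.0, 0.0, 1.2],
                             [0.1, 0.0, 0.3, 0.0]]))
activation_stats = tracker.rates()
# Same thresholding as in resample_dead_neurons: latent 1 never fired.
dead = (activation_stats < 1e-5).nonzero().squeeze(-1).tolist()
```

Calling `update` with the `encoded` tensor from each training batch and resetting the tracker after every resampling pass would keep the statistic current.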

Training Results

Trained for 1,000 steps on the Korean KoCommercial-Dataset:

Step   0 | Loss: 16.8401 | off_diag_mean: 0.0299
Step 100 | Loss: 12.8732 | off_diag_mean: 0.0300
Step 200 | Loss:  9.5500 | off_diag_mean: 0.0302
Step 500 | Loss:  8.2574 | off_diag_mean: 0.0306
Step 900 | Loss:  5.7219 | off_diag_mean: 0.0311

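The loss falls steadily from 16.84 to 5.72, while off_diag_mean stays near 0.03, drifting up only slightly from 0.0299 to 0.0311. In other words, the orthogonality of the decoder vectors is not noticeably degraded during training.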
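The training loop itself is not shown in this post. A minimal sketch of what one could look like follows, under stated assumptions: random vectors stand in for cached MLP activations, the sizes are toy values, and the optimizer, learning rate, and L1 weight `alpha` are all illustrative choices rather than the settings used for the run above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, expansion, alpha = 32, 8, 1e-3  # toy sizes; alpha is an assumed L1 weight

encoder = nn.Linear(d_model, d_model * expansion)
decoder = nn.Linear(d_model * expansion, d_model)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

# Random vectors stand in for cached MLP activations.
activations = torch.randn(512, d_model)

losses = []
for step in range(200):
    batch = activations[torch.randint(0, activations.shape[0], (64,))]
    encoded = F.relu(encoder(batch - decoder.bias))  # same centering as forward_pass
    decoded = decoder(encoded)
    reconstruction = (batch - decoded).pow(2).sum(dim=-1).mean()
    sparsity = encoded.abs().sum(dim=-1).mean()
    loss = reconstruction + alpha * sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Logging `losses` alongside the orthogonality metric every 100 steps would produce a trace of the same shape as the one shown above.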

Summary


graph TD
    A["LLM Activation (768 dims)"] -->|SAE Encode| B["Sparse Feature (131,072 dims)"]
    B -->|Feature analysis| C{"Identify emotion-related features"}
    C -->|Decode + Scale| D["Steering Vector (768 dims)"]
    D -->|Inject via hook| E[Steered output]

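What these experiments showed:

The features extracted by the SAE appear to correspond to interpretable concepts
Combining and scaling features makes it possible to control the model's behavior
Magnitude (scale) matters: the effect only appeared after amplifying by about 20x
Feature combination matters: several features together had a clearer effect than any single feature

Limitations and future work:

Verify the difference between setting a feature index to 1 and using its actual activation value
Apply the approach to larger models such as Gemma-3-4B (to be covered with the CAA approach in a separate post)
Study how the expansion factor used in SAE training (currently x8) relates to performance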
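The first limitation can be illustrated with a toy decoder. In the sketch below, the decoder matrix is random and the activation values are hypothetical, chosen only to show that a one-hot code and an activation-weighted code generally decode to different steering directions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_latents, d_model = 32, 8
decoder_weight = torch.randn(d_model, n_latents)  # columns = feature directions (toy)

indices = [3, 7, 11]
# Hypothetical activation values, as an encoder might produce for these features.
activations = torch.tensor([0.2, 5.0, 1.0])

one_hot = torch.zeros(n_latents)
one_hot[indices] = 1.0           # what the experiment above does
weighted = torch.zeros(n_latents)
weighted[indices] = activations  # the alternative worth testing

vec_one_hot = decoder_weight @ one_hot    # plain sum of the three feature directions
vec_weighted = decoder_weight @ weighted  # sum weighted by activation strength
cosine = F.cosine_similarity(vec_one_hot, vec_weighted, dim=0).item()
```

With one-hot codes every selected feature contributes equally, whereas with real activations a strongly firing feature dominates the direction; comparing the two on the actual SAE is what the limitation calls for.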

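Further Reading

If SAEs and Mechanistic Interpretability have caught your interest, the following community and tool are worth a look.

Neuronpedia is an interactive platform for exploring SAE features. You can browse which texts activate each feature and what it appears to mean. The feature indices used in this post can be looked up there to check their actual meaning.

The Open Source Mechanistic Interpretability Slack is a community for discussing MI research such as SAEs, feature interpretation, and activation patching. Paper reading, code sharing, and discussion of experimental results are all active there.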
