Motivation

In RAG (Retrieval-Augmented Generation), selecting only the most relevant sentences before feeding retrieved documents to an LLM is crucial for both performance and cost.

Typically, a separate Reranker model (such as a Cross-Encoder) is used, but there’s an interesting perspective:

If an LLM already computes “attention” for each token when processing text, can we directly use this signal?

I verified this idea through experimentation.

Core Idea

Transformer-based LLMs compute Attention Scores between input tokens. By observing which input tokens the final token of the prompt attends to, we can estimate which sentences are most relevant to a given query.
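As a toy illustration of this idea (the 4×4 matrix below is made up, not real model output), the signal we read is simply the last row of the attention matrix:

```python
# Toy attention matrix for tokens [t0, t1, t2, t3].
# Row i holds how much token i attends to each token (causal: rows sum to 1).
attention = [
    [1.00, 0.00, 0.00, 0.00],
    [0.30, 0.70, 0.00, 0.00],
    [0.20, 0.30, 0.50, 0.00],
    [0.05, 0.60, 0.15, 0.20],  # last token's attention over the input
]

# The last row is the signal we care about: where the final token "looks".
anchor_row = attention[-1]
most_attended = max(range(len(anchor_row)), key=lambda i: anchor_row[i])
print(most_attended)  # token t1 receives the highest attention (0.60)
```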

Overall Flow


flowchart LR
    C[Context] --> LLM
    Q[Query] --> LLM
    A[Answer Prefix] --> LLM
    LLM -->|Attention| R[Reranking]

How Attention is Observed

In ordinary LLM inference, the last token's hidden state, shaped by its Attention over the input, is used to generate the next token. In this experiment, no token is generated; instead, the last token's Attention distribution itself is used as a sentence relevance score.


flowchart TB
    subgraph Context
        S1[Sentence 1]
        S2[Sentence 2]
        S3[Sentence 3]
    end
    Anchor[Anchor Token] -->|high| S1
    Anchor -->|low| S2
    Anchor -->|mid| S3
    S1 --> R1[Rank 1]
    S3 --> R2[Rank 2]
    S2 --> R3[Rank 3]

For example, given Context sentences “Emily is a Harvard University student”, “Tom is 25 years old”, and “Emily lives next door to Tom”, with the Query “Where should we go?”, the Attention of the Answer Prefix “Emily:” (anchor token) is highest on the “Harvard University student” sentence.

The key is that the answer_hint_prefix (in the example above, “Emily:”) serves as the anchor token. By measuring where this token’s Attention concentrates within the Context, we find the most relevant sentence for the current situation.

AttentionRAG (arXiv:2503.10720)

This is the paper that inspired the experiment. Key contributions:

  • Converting queries into Next-Token Prediction form: “Where is Daniel?” → “Daniel is in the ____”
  • Anchor token: The token at the blank position focuses semantic attention on a single token
  • Full layer aggregation: Summing both shallow layers (syntactic info) + deep layers (semantic info)
  • Results: ~10% performance improvement over LLMLingua, up to 6.3x context compression
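The query rewriting step can be sketched as follows. The rule below is a deliberately simplified heuristic that handles only one query shape, not the paper's actual rewriting method:

```python
# Sketch of AttentionRAG's query-to-cloze conversion: rewrite a question into
# next-token-prediction form so the blank position becomes the anchor token.
# This toy rule only covers "Where is X?" queries; the real method is general.
def to_cloze(query: str) -> str:
    if query.startswith("Where is ") and query.endswith("?"):
        subject = query[len("Where is "):-1]
        return f"{subject} is in the"
    return query  # fall back to the original query

print(to_cloze("Where is Daniel?"))  # Daniel is in the
```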

In-Context Re-ranking (ICLR 2025)

  • Reranks documents using only Attention pattern changes without LLM generation
  • Calibration: Measures baseline with meaningless query (“N/A”) to remove positional bias
  • A single (O(1)) forward pass, reducing latency by 60%+ compared to RankGPT
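The calibration idea can be sketched as subtracting the attention scores obtained with a content-free query; the per-document scores below are illustrative, not real measurements:

```python
# Sketch of ICR-style calibration: subtract the per-document attention score
# measured with a content-free query ("N/A") to cancel positional bias.
def calibrate(query_scores, na_scores):
    return [round(q - n, 6) for q, n in zip(query_scores, na_scores)]

query_scores = [0.50, 0.20, 0.30]  # attention mass per document, real query
na_scores    = [0.45, 0.10, 0.15]  # same measurement with query = "N/A"

calibrated = calibrate(query_scores, na_scores)
# Document 0 looked best raw, but after removing the positional baseline,
# document 2 ranks first.
print(calibrated)  # [0.05, 0.1, 0.15]
```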

Contrastive Retrieval Heads (arXiv:2510.02219)

  • Observation that not all Attention Heads are equal
  • Achieves state-of-the-art reranker performance with less than 1% of all heads
  • Useful heads are concentrated in middle layers
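A simplified sketch of the head-selection idea: score each attention head by how strongly it separates a known-relevant sentence from an irrelevant one on a probe example, then keep only the top heads. The per-head attention masses below are illustrative, and the single-example contrast is a simplification of the paper's method:

```python
# Score each head by (attention mass on relevant sentence) minus
# (attention mass on irrelevant sentence); keep the top_k heads.
def select_heads(head_scores_relevant, head_scores_irrelevant, top_k):
    contrast = [r - i for r, i in
                zip(head_scores_relevant, head_scores_irrelevant)]
    ranked = sorted(range(len(contrast)), key=lambda h: contrast[h],
                    reverse=True)
    return ranked[:top_k]

# Attention mass each of 6 heads puts on the relevant vs irrelevant sentence.
relevant   = [0.10, 0.80, 0.30, 0.05, 0.60, 0.20]
irrelevant = [0.12, 0.10, 0.25, 0.30, 0.15, 0.18]

print(select_heads(relevant, irrelevant, top_k=2))  # [1, 4]
```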

Experiment Design

The same experiment was conducted with two models:

| Model | Layers | Parameters |
| --- | --- | --- |
| Gemma 3 4B IT | 34 | 4B |
| Qwen3 Reranker 4B | 36 | 4B |

Input Configuration

messages = [
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    {"role": "assistant", "content": answer_hint_prefix},  # e.g., "Emily:"
]

The last token of answer_hint_prefix serves as the anchor. We measure which parts of the Context this token focuses its Attention on.
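For the anchor to be the literal last token, the chat template must not append an end-of-turn token after the assistant message. With Hugging Face transformers this can be done via `continue_final_message=True`; the tokenizer call is shown commented out since it requires a loaded model:

```python
# Sketch: keep the answer-hint prefix as the final tokens of the prompt so
# its last token can serve as the anchor. Example values from this post.
context = "Emily is a Harvard University student. Tom is 25 years old."
question = "Where should we go?"
answer_hint_prefix = "Emily:"

messages = [
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    {"role": "assistant", "content": answer_hint_prefix},
]

# continue_final_message=True renders the template without closing the
# assistant turn, so no <end_of_turn> / <|im_end|> follows the anchor:
# inputs = tokenizer.apply_chat_template(
#     messages,
#     continue_final_message=True,
#     return_tensors="pt",
#     return_dict=True,
# )
print(messages[-1]["content"])  # Emily:
```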

Processing Steps

  1. Remove special tokens: Remove chat template tokens like <end_of_turn>, <|im_end|>, etc.
  2. Avoid Attention Sink: Calculate scores only for tokens in the Context region
  3. Exclude noise tokens: Remove tokens like periods (.) and newlines (\n\n) that receive meaninglessly high scores
  4. Calculate per-sentence average scores: Split sentences by periods and compute average Attention score for each sentence
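Steps 2–4 can be sketched as follows, using made-up per-token scores rather than real model output:

```python
# Given per-token attention scores from the anchor, restrict to the Context
# region (step 2), drop noise tokens (step 3), and average per sentence,
# splitting on periods (step 4).
NOISE = {".", "\n\n"}

def sentence_scores(tokens, scores, context_start, context_end):
    results, current_tokens, current_scores = [], [], []
    for tok, s in zip(tokens[context_start:context_end],
                      scores[context_start:context_end]):
        if tok == ".":  # period closes the current sentence
            if current_scores:
                results.append((" ".join(current_tokens),
                                sum(current_scores) / len(current_scores)))
            current_tokens, current_scores = [], []
        elif tok not in NOISE:  # skip meaninglessly high-scoring tokens
            current_tokens.append(tok)
            current_scores.append(s)
    return sorted(results, key=lambda x: x[1], reverse=True)

# Illustrative tokens/scores; index 0 (<bos>) is the attention-sink token.
tokens = ["<bos>", "Emily", "studies", "law", ".", "Tom", "is", "25", "."]
scores = [0.50, 0.10, 0.12, 0.20, 0.30, 0.02, 0.01, 0.03, 0.25]
ranked = sentence_scores(tokens, scores, context_start=1, context_end=9)
print(ranked[0][0])  # Emily studies law
```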

Core Code

import torch

def aggregate_attention_scores(inputs, layer_numbers):
    with torch.no_grad():
        # output_attentions=True is required; otherwise outputs.attentions is None
        outputs = model(**inputs, output_attentions=True)
        attentions = outputs.attentions  # one (1, heads, seq, seq) tensor per layer

    target_index = inputs['input_ids'].shape[1] - 1  # Last token (anchor)

    per_layer_attentions = []
    for layer_num in layer_numbers:
        # Average over heads, then take the anchor's row
        attention_matrix = attentions[layer_num].squeeze(0).mean(dim=0).cpu().float().numpy()
        focused_attention = attention_matrix[target_index, :]  # What the anchor attends to
        per_layer_attentions.append(focused_attention)

    return per_layer_attentions

Experiment Scenarios

The following Context was fixed, and various questions were asked to verify whether Attention-based Reranking properly ranks relevant sentences higher:

Emily is 23 years old.
Emily is a Harvard University student.
Tom is a state university student.
Tom is 25 years old.
Emily is dating John.
Emily lives next door to Tom.
John and Tom were middle school classmates.
John is a college student.
Emily was assaulted by John.
Tom recently received his paycheck from a part-time job.
Emily is a relative of Tom.
Emily has a habit of exercising every day.
Emily has a habit of not paying back money.

Scenario 1: “Emily is crying. Why is she crying?”

→ Verify that “Emily was assaulted by John” ranks high

Scenario 2: “She took a taxi. Where should she go?”

→ Verify that “Emily is a Harvard University student” ranks high

Scenario 3: “Someone is asking to borrow money”

→ Verify that “Emily has a habit of not paying back money” and “Tom recently received his paycheck” rank high

Observations

What Worked

  • Situationally relevant sentences ranked high: In most scenarios, intuitively relevant sentences received high scores
  • Works without additional training: Operates using only the existing LLM’s Attention, without training a separate Reranker model
  • Full layer aggregation is effective: Summing all layers produces more stable results than using specific layers

Limitations and Findings

  • Middle layers may be better: Consistent with CoRe-R paper findings, using only middle layers sometimes outperforms using all layers
  • Performance varies with model size: Smaller models produce lower quality Attention signals
  • Attention Sink: Abnormally high Attention concentrates on the first token or special tokens → must target only the Context region
  • High scores for newline tokens: \n\n receives high scores regardless of meaning → must be excluded
  • Sentence segmentation: Period-based segmentation was used for this PoC, but <sep> tokens would be more appropriate for multilingual support

Model Differences

| Characteristic | Gemma 3 4B | Qwen3 Reranker 4B |
| --- | --- | --- |
| Special token handling | Remove <end_of_turn> | Remove <\|im_end\|> |
| Newline tokens | \n\n treated as 1 token | \n per unit |
| Number of layers | 34 | 36 |

Although Qwen3 Reranker is, as its name suggests, specialized for reranking and might be expected to produce higher quality Attention signals, both models showed similar results.

Conclusion and Future Directions

LLM Attention Maps contain useful signals for determining document/sentence relevance without any additional training.

Potential Applications:

  • RAG context compression: Extracting only relevant sentences from long search results before passing to the LLM
  • Roleplay/dialogue systems: Selecting only situationally relevant information from character settings (Context)
  • Lightweight Reranker: Utilizing Attention obtained during inference without a separate model

Areas for Improvement:

  • Finding optimal layer combinations per model (contrastive head selection as in CoRe-R)
  • Applying calibration techniques from the ICR paper to remove positional bias
  • Using tokenizer special tokens instead of periods for sentence segmentation

References