Motivation
In RAG (Retrieval-Augmented Generation), selecting only the most relevant sentences before feeding retrieved documents to an LLM is crucial for both performance and cost.
Typically, a separate Reranker model (such as a Cross-Encoder) is used, but there’s an interesting perspective:
If an LLM already computes “attention” for each token when processing text, can we directly use this signal?
I verified this idea through experimentation.
Core Idea
Transformer-based LLMs compute Attention Scores between input tokens. By observing which input tokens the last generated token attends to, we can determine which sentences are more relevant to a given query.
Overall Flow
```mermaid
flowchart LR
    C[Context] --> LLM
    Q[Query] --> LLM
    A[Answer Prefix] --> LLM
    LLM -->|Attention| R[Reranking]
```
How Attention is Observed
In typical LLM inference, the next token is generated from the last token's representation, which is shaped by its Attention over all preceding tokens. In this experiment, instead of generating the next token, that Attention distribution itself is used as a sentence-relevance score.
```mermaid
flowchart TB
    subgraph Context
        S1[Sentence 1]
        S2[Sentence 2]
        S3[Sentence 3]
    end
    Anchor[Anchor Token] -->|high| S1
    Anchor -->|low| S2
    Anchor -->|mid| S3
    S1 --> R1[Rank 1]
    S3 --> R2[Rank 2]
    S2 --> R3[Rank 3]
```
For example, given Context sentences “Emily is a Harvard University student”, “Tom is 25 years old”, and “Emily lives next door to Tom”, with the Query “Where should we go?”, the Attention of the Answer Prefix “Emily:” (anchor token) is highest on the “Harvard University student” sentence.
The key is that the answer_hint_prefix (in the example above, “Emily:”) serves as the anchor token. By measuring where this token’s Attention concentrates within the Context, we find the most relevant sentence for the current situation.
Related Papers
AttentionRAG (arXiv:2503.10720)
This is the paper that inspired the experiment. Key contributions:
- Converting queries into Next-Token Prediction form: “Where is Daniel?” → “Daniel is in the ____”
- Anchor token: The token at the blank position focuses semantic attention on a single token
- Full layer aggregation: Summing both shallow layers (syntactic info) + deep layers (semantic info)
- Results: ~10% performance improvement over LLMLingua, up to 6.3x context compression
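As a toy illustration of the query-conversion step (the paper uses an LLM to perform this rewriting automatically; the hard-coded rule below is only a stand-in):

```python
def question_to_cloze(question: str) -> str:
    """Turn "Where is X?" into "X is in the" so that the blank after the
    prompt becomes the anchor position for next-token prediction."""
    if question.startswith("Where is ") and question.endswith("?"):
        subject = question[len("Where is "):-1]
        return f"{subject} is in the"
    return question  # fallback: use the question as-is

print(question_to_cloze("Where is Daniel?"))  # -> "Daniel is in the"
```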
In-Context Re-ranking (ICLR 2025)
- Reranks documents using only Attention pattern changes without LLM generation
- Calibration: Measures baseline with meaningless query (“N/A”) to remove positional bias
- O(1) forward pass with 60%+ latency reduction compared to RankGPT
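The calibration step can be sketched with plain arrays (this is my reading of the idea, not the paper's code; the scores below are invented):

```python
import numpy as np

def calibrated_scores(scores_query: np.ndarray, scores_na: np.ndarray) -> np.ndarray:
    """scores_query[i]: attention mass on document i for the real query.
    scores_na[i]: attention mass on document i for a meaningless "N/A" query.
    Subtracting the content-free baseline cancels positional bias."""
    return scores_query - scores_na

# Toy example: document 0 gets inflated attention purely by position.
q = np.array([0.50, 0.30, 0.20])
na = np.array([0.45, 0.10, 0.05])
adjusted = calibrated_scores(q, na)
print(adjusted)  # doc 1 now ranks first once the positional bias is removed
```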
Contrastive Retrieval Heads (arXiv:2510.02219)
- Observation that not all Attention Heads are equal
- Achieves state-of-the-art reranker performance with less than 1% of all heads
- Useful heads are concentrated in middle layers
Experiment Design
The same experiment was conducted with two models:
| Model | Layers | Parameters |
|---|---|---|
| Gemma 3 4B IT | 34 | 4B |
| Qwen3 Reranker 4B | 36 | 4B |
Input Configuration
```python
messages = [
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    {"role": "assistant", "content": answer_hint_prefix},  # e.g., "Emily:"
]
```
The last token of answer_hint_prefix serves as the anchor. We measure which parts of the Context this token focuses its Attention on.
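A minimal sketch of this setup with the tokenizer stubbed out (`fake_input_ids` is a placeholder; a real run would render `messages` with the tokenizer's chat template):

```python
context = "Emily is a Harvard University student. Tom is 25 years old."
question = "Where should we go?"
answer_hint_prefix = "Emily:"

messages = [
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    {"role": "assistant", "content": answer_hint_prefix},
]

# With a real tokenizer this would be something like:
#   inputs = tokenizer.apply_chat_template(messages, return_tensors="pt",
#                                          continue_final_message=True)
# Here we fake the token ids to show only the index arithmetic.
fake_input_ids = list(range(18))        # pretend the prompt is 18 tokens long
anchor_index = len(fake_input_ids) - 1  # the anchor is simply the last token
print(anchor_index)  # -> 17
```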
Processing Steps
- Remove special tokens: strip chat-template tokens such as `<end_of_turn>` and `<|im_end|>`
- Avoid the Attention Sink: compute scores only for tokens inside the Context region
- Exclude noise tokens: drop tokens such as periods (`.`) and newlines (`\n\n`) that receive meaninglessly high scores
- Calculate per-sentence average scores: split the Context into sentences at periods and average the Attention scores of each sentence's tokens
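The steps above can be sketched on toy data (the token strings and scores below are invented for illustration; a real run uses the model's tokenizer and Attention output):

```python
import numpy as np

tokens = ["<bos>", "Emily", "is", "23", ".", "Emily", "studies", "here", ".", "Emily", ":"]
scores = np.array([0.40, 0.05, 0.04, 0.10, 0.15, 0.06, 0.08, 0.07, 0.12, 0.02, 0.01])

context_range = range(1, 9)  # skip <bos> (Attention Sink) and the trailing anchor
noise = {".", "\n\n"}        # tokens that soak up attention regardless of meaning

# Split the context into sentences at periods, averaging the remaining scores.
sentence_scores, current = [], []
for i in context_range:
    if tokens[i] == ".":
        if current:
            sentence_scores.append(float(np.mean(current)))
            current = []
    elif tokens[i] not in noise:
        current.append(scores[i])
if current:
    sentence_scores.append(float(np.mean(current)))

print(sentence_scores)  # one averaged score per sentence
```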
Core Code
```python
import torch

def aggregate_attention_scores(inputs, layer_numbers):
    with torch.no_grad():
        # output_attentions=True is required; otherwise outputs.attentions is None
        outputs = model(**inputs, output_attentions=True)
    attentions = outputs.attentions  # one (batch, heads, seq, seq) tensor per layer
    target_index = inputs['input_ids'].shape[1] - 1  # last token (anchor)
    per_layer_attentions = []
    for layer_num in layer_numbers:
        # Average over heads, then take the anchor's row: what the anchor attends to
        attention_matrix = attentions[layer_num].squeeze(0).mean(dim=0).cpu().float().numpy()
        focused_attention = attention_matrix[target_index, :]
        per_layer_attentions.append(focused_attention)
    return per_layer_attentions
```
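A possible follow-up step, assuming the function above has already produced the per-layer vectors: full layer aggregation is just an element-wise sum (the toy arrays below stand in for real attention rows).

```python
import numpy as np

def sum_layers(per_layer_attentions):
    """Full layer aggregation: sum the anchor's attention row across layers."""
    return np.sum(np.stack(per_layer_attentions), axis=0)

# Toy per-layer vectors standing in for real attention rows:
layers = [np.array([0.1, 0.3, 0.6]), np.array([0.2, 0.5, 0.3])]
summed = sum_layers(layers)
print(summed)  # -> [0.3 0.8 0.9]
```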
Experiment Scenarios
The following Context was fixed, and various questions were asked to verify whether Attention-based Reranking properly ranks relevant sentences higher:
Emily is 23 years old.
Emily is a Harvard University student.
Tom is a state university student.
Tom is 25 years old.
Emily is dating John.
Emily lives next door to Tom.
John and Tom were middle school classmates.
John is a college student.
Emily was assaulted by John.
Tom recently received his paycheck from a part-time job.
Emily is a relative of Tom.
Emily has a habit of exercising every day.
Emily has a habit of not paying back money.
Scenario 1: “Emily is crying. Why is she crying?”
→ Verify that “Emily was assaulted by John” ranks high
Scenario 2: “She took a taxi. Where should she go?”
→ Verify that “Emily is a Harvard University student” ranks high
Scenario 3: “Someone is asking to borrow money”
→ Verify that “Emily has a habit of not paying back money” and “Tom recently received his paycheck” rank high
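The verification itself reduces to sorting sentences by score. A sketch with invented scores (not measured values):

```python
import numpy as np

sentences = [
    "Emily was assaulted by John.",
    "Tom is 25 years old.",
    "Emily is a Harvard University student.",
]
scores = np.array([0.31, 0.08, 0.12])  # hypothetical anchor-attention averages
ranking = np.argsort(scores)[::-1]     # indices sorted high-to-low
top = sentences[ranking[0]]
print(top)  # the sentence expected to rank first for Scenario 1
```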
Observations
What Worked
- Situationally relevant sentences ranked high: In most scenarios, intuitively relevant sentences received high scores
- Works without additional training: Operates using only the existing LLM’s Attention, without training a separate Reranker model
- Full layer aggregation is effective: Summing all layers produces more stable results than using specific layers
Limitations and Findings
- Middle layers may be better: Consistent with CoRe-R paper findings, using only middle layers sometimes outperforms using all layers
- Performance varies with model size: Smaller models produce lower quality Attention signals
- Attention Sink: Abnormally high Attention concentrates on the first token or special tokens → must target only the Context region
- High scores for newline tokens: `\n\n` receives high scores regardless of meaning → must be excluded
- Sentence segmentation: period-based segmentation was used for this PoC, but `<sep>` tokens would be more appropriate for multilingual support
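One simple way to try the middle-layer variant is to restrict `layer_numbers` to the middle third of the stack (the split below is my own assumption, not the CoRe-R head-selection method):

```python
def middle_layer_indices(num_layers: int) -> list[int]:
    """Return the indices of the middle third of the layer stack."""
    third = num_layers // 3
    return list(range(third, 2 * third))

print(middle_layer_indices(34))  # Gemma 3 4B: layers 11..21
print(middle_layer_indices(36))  # Qwen3 Reranker 4B: layers 12..23
```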
Model Differences
| Characteristic | Gemma 3 4B | Qwen3 Reranker 4B |
|---|---|---|
| Special token handling | Remove <end_of_turn> | Remove <|im_end|> |
| Newline tokens | \n\n treated as 1 token | \n per unit |
| Number of layers | 34 | 36 |
Despite Qwen3-Reranker being specialized for reranking (as the name suggests) and expected to produce higher quality Attention signals, both models showed similar results.
Conclusion and Future Directions
LLM Attention Maps contain useful signals for determining document/sentence relevance without any additional training.
Potential Applications:
- RAG context compression: Extracting only relevant sentences from long search results before passing to the LLM
- Roleplay/dialogue systems: Selecting only situationally relevant information from character settings (Context)
- Lightweight Reranker: Utilizing Attention obtained during inference without a separate model
Areas for Improvement:
- Finding optimal layer combinations per model (contrastive head selection as in CoRe-R)
- Applying calibration techniques from the ICR paper to remove positional bias
- Using tokenizer special tokens instead of periods for sentence segmentation