Motivation
In RAG (Retrieval-Augmented Generation), selecting only the most relevant sentences before feeding retrieved documents to an LLM is crucial for both performance and cost.
Typically, a separate Reranker model (such as a Cross-Encoder) is used, but there’s an interesting perspective:
If an LLM already computes “attention” for each token when processing text, can we directly use this signal?
I verified this idea through experimentation.
Core Idea
Transformer-based LLMs compute Attention Scores between input tokens. By observing which input tokens the last generated token attends to, we can determine which sentences are more relevant to a given query.
Overall Flow
```mermaid
flowchart LR
    C[Context] --> LLM
    Q[Query] --> LLM
    A[Answer Prefix] --> LLM
    LLM -->|Attention| R[Reranking]
```
How Attention is Observed
In typical LLM inference, the next token is generated from the last token's representation, which is shaped by its Attention over all preceding tokens. In this experiment, instead of generating the next token, that Attention distribution itself is used as a sentence-relevance score.
```mermaid
flowchart TB
    subgraph Context
        S1[Sentence 1]
        S2[Sentence 2]
        S3[Sentence 3]
    end
    Anchor[Anchor Token] -->|high| S1
    Anchor -->|low| S2
    Anchor -->|mid| S3
    S1 --> R1[Rank 1]
    S3 --> R2[Rank 2]
    S2 --> R3[Rank 3]
```
For example, given Context sentences “Emily is a Harvard University student”, “Tom is 25 years old”, and “Emily lives next door to Tom”, with the Query “Where should we go?”, the Attention of the Answer Prefix “Emily:” (anchor token) is highest on the “Harvard University student” sentence.
The key is that the answer_hint_prefix (in the example above, “Emily:”) serves as the anchor token. By measuring where this token’s Attention concentrates within the Context, we find the most relevant sentence for the current situation.
Related Papers
AttentionRAG (arXiv:2503.10720)
This is the paper that inspired the experiment. Key contributions:
- Converting queries into Next-Token Prediction form: “Where is Daniel?” → “Daniel is in the ____”
- Anchor token: The token at the blank position focuses semantic attention on a single token
- Full layer aggregation: Summing both shallow layers (syntactic info) + deep layers (semantic info)
- Results: ~10% performance improvement over LLMLingua, up to 6.3x context compression
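As a toy illustration of the query-conversion step (the paper uses an LLM to perform this rewriting automatically; the hard-coded rule below is only a stand-in):

```python
def question_to_cloze(question: str) -> str:
    """Turn "Where is X?" into "X is in the" so that the blank after the
    prompt becomes the anchor position for next-token prediction."""
    if question.startswith("Where is ") and question.endswith("?"):
        subject = question[len("Where is "):-1]
        return f"{subject} is in the"
    return question  # fallback: use the question as-is

print(question_to_cloze("Where is Daniel?"))  # -> "Daniel is in the"
```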
In-Context Re-ranking (ICLR 2025)
- Reranks documents using only Attention pattern changes without LLM generation
- Calibration: Measures baseline with meaningless query (“N/A”) to remove positional bias
- O(1) forward pass with 60%+ latency reduction compared to RankGPT
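The calibration step can be sketched with plain arrays (this is my reading of the idea, not the paper's code; the scores below are invented):

```python
import numpy as np

def calibrated_scores(scores_query: np.ndarray, scores_na: np.ndarray) -> np.ndarray:
    """scores_query[i]: attention mass on document i for the real query.
    scores_na[i]: attention mass on document i for a meaningless "N/A" query.
    Subtracting the content-free baseline cancels positional bias."""
    return scores_query - scores_na

# Toy example: document 0 gets inflated attention purely by position.
q = np.array([0.50, 0.30, 0.20])
na = np.array([0.45, 0.10, 0.05])
adjusted = calibrated_scores(q, na)
print(adjusted)  # doc 1 now ranks first once the positional bias is removed
```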
Contrastive Retrieval Heads (arXiv:2510.02219)
- Observation that not all Attention Heads are equal
- Achieves state-of-the-art reranker performance with less than 1% of all heads
- Useful heads are concentrated in middle layers
Experiment Design
The same experiment was conducted with two models:
| Model | Layers | Parameters |
|---|---|---|
| Gemma 3 4B IT | 34 | 4B |
| Qwen3 Reranker 4B | 36 | 4B |
Input Configuration
```python
messages = [
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    {"role": "assistant", "content": answer_hint_prefix},  # e.g., "Emily:"
]
```
The last token of answer_hint_prefix serves as the anchor. We measure which parts of the Context this token focuses its Attention on.
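A minimal sketch of this setup with the tokenizer stubbed out (`fake_input_ids` is a placeholder; a real run would render `messages` with the tokenizer's chat template):

```python
context = "Emily is a Harvard University student. Tom is 25 years old."
question = "Where should we go?"
answer_hint_prefix = "Emily:"

messages = [
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    {"role": "assistant", "content": answer_hint_prefix},
]

# With a real tokenizer this would be something like:
#   inputs = tokenizer.apply_chat_template(messages, return_tensors="pt",
#                                          continue_final_message=True)
# Here we fake the token ids to show only the index arithmetic.
fake_input_ids = list(range(18))        # pretend the prompt is 18 tokens long
anchor_index = len(fake_input_ids) - 1  # the anchor is simply the last token
print(anchor_index)  # -> 17
```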
Processing Steps
- Remove special tokens: strip chat-template tokens such as `<end_of_turn>` and `<|im_end|>`
- Avoid the Attention Sink: compute scores only for tokens inside the Context region
- Exclude noise tokens: drop tokens such as periods (`.`) and newlines (`\n\n`) that receive meaninglessly high scores
- Calculate per-sentence average scores: split the Context into sentences at periods and average the Attention scores of each sentence's tokens
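The steps above can be sketched on toy data (the token strings and scores below are invented for illustration; a real run uses the model's tokenizer and Attention output):

```python
import numpy as np

tokens = ["<bos>", "Emily", "is", "23", ".", "Emily", "studies", "here", ".", "Emily", ":"]
scores = np.array([0.40, 0.05, 0.04, 0.10, 0.15, 0.06, 0.08, 0.07, 0.12, 0.02, 0.01])

context_range = range(1, 9)  # skip <bos> (Attention Sink) and the trailing anchor
noise = {".", "\n\n"}        # tokens that soak up attention regardless of meaning

# Split the context into sentences at periods, averaging the remaining scores.
sentence_scores, current = [], []
for i in context_range:
    if tokens[i] == ".":
        if current:
            sentence_scores.append(float(np.mean(current)))
            current = []
    elif tokens[i] not in noise:
        current.append(scores[i])
if current:
    sentence_scores.append(float(np.mean(current)))

print(sentence_scores)  # one averaged score per sentence
```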
Core Code
```python
import torch

def aggregate_attention_scores(inputs, layer_numbers):
    with torch.no_grad():
        # output_attentions=True is required; otherwise outputs.attentions is None
        outputs = model(**inputs, output_attentions=True)
    attentions = outputs.attentions  # one (batch, heads, seq, seq) tensor per layer
    target_index = inputs['input_ids'].shape[1] - 1  # last token (anchor)
    per_layer_attentions = []
    for layer_num in layer_numbers:
        # Average over heads, then take the anchor's row: what the anchor attends to
        attention_matrix = attentions[layer_num].squeeze(0).mean(dim=0).cpu().float().numpy()
        focused_attention = attention_matrix[target_index, :]
        per_layer_attentions.append(focused_attention)
    return per_layer_attentions
```
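A possible follow-up step, assuming the function above has already produced the per-layer vectors: full layer aggregation is just an element-wise sum (the toy arrays below stand in for real attention rows).

```python
import numpy as np

def sum_layers(per_layer_attentions):
    """Full layer aggregation: sum the anchor's attention row across layers."""
    return np.sum(np.stack(per_layer_attentions), axis=0)

# Toy per-layer vectors standing in for real attention rows:
layers = [np.array([0.1, 0.3, 0.6]), np.array([0.2, 0.5, 0.3])]
summed = sum_layers(layers)
print(summed)  # -> [0.3 0.8 0.9]
```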
Experiment Scenarios
The following Context was fixed, and various questions were asked to verify whether Attention-based Reranking properly ranks relevant sentences higher:
Emily is 23 years old.
Emily is a Harvard University student.
Tom is a state university student.
Tom is 25 years old.
Emily is dating John.
Emily lives next door to Tom.
John and Tom were middle school classmates.
John is a college student.
Emily was assaulted by John.
Tom recently received his paycheck from a part-time job.
Emily is a relative of Tom.
Emily has a habit of exercising every day.
Emily has a habit of not paying back money.
Scenario 1: “Emily is crying. Why is she crying?”
→ Verify that “Emily was assaulted by John” ranks high
Scenario 2: “She took a taxi. Where should she go?”
→ Verify that “Emily is a Harvard University student” ranks high
Scenario 3: “Someone is asking to borrow money”
→ Verify that “Emily has a habit of not paying back money” and “Tom recently received his paycheck” rank high
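The verification itself reduces to sorting sentences by score. A sketch with invented scores (not measured values):

```python
import numpy as np

sentences = [
    "Emily was assaulted by John.",
    "Tom is 25 years old.",
    "Emily is a Harvard University student.",
]
scores = np.array([0.31, 0.08, 0.12])  # hypothetical anchor-attention averages
ranking = np.argsort(scores)[::-1]     # indices sorted high-to-low
top = sentences[ranking[0]]
print(top)  # the sentence expected to rank first for Scenario 1
```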
Observations
What Worked
- Situationally relevant sentences ranked high: In most scenarios, intuitively relevant sentences received high scores
- Works without additional training: Operates using only the existing LLM’s Attention, without training a separate Reranker model
- Full layer aggregation is effective: Summing all layers produces more stable results than using specific layers
Limitations and Findings
- Middle layers may be better: Consistent with CoRe-R paper findings, using only middle layers sometimes outperforms using all layers
- Performance varies with model size: Smaller models produce lower quality Attention signals
- Attention Sink: Abnormally high Attention concentrates on the first token or special tokens → must target only the Context region
- High scores for newline tokens: `\n\n` receives high scores regardless of meaning → must be excluded
- Sentence segmentation: period-based segmentation was used for this PoC, but `<sep>` tokens would be more appropriate for multilingual support
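One simple way to try the middle-layer variant is to restrict `layer_numbers` to the middle third of the stack (the split below is my own assumption, not the CoRe-R head-selection method):

```python
def middle_layer_indices(num_layers: int) -> list[int]:
    """Return the indices of the middle third of the layer stack."""
    third = num_layers // 3
    return list(range(third, 2 * third))

print(middle_layer_indices(34))  # Gemma 3 4B: layers 11..21
print(middle_layer_indices(36))  # Qwen3 Reranker 4B: layers 12..23
```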
Model Differences
| Characteristic | Gemma 3 4B | Qwen3 Reranker 4B |
|---|---|---|
| Special token handling | Remove <end_of_turn> | Remove <|im_end|> |
| Newline tokens | \n\n treated as 1 token | \n per unit |
| Number of layers | 34 | 36 |
Despite Qwen3-Reranker being specialized for reranking (as the name suggests) and expected to produce higher quality Attention signals, both models showed similar results.
Conclusion and Future Directions
LLM Attention Maps contain useful signals for determining document/sentence relevance without any additional training.
Potential Applications:
- RAG context compression: Extracting only relevant sentences from long search results before passing to the LLM
- Roleplay/dialogue systems: Selecting only situationally relevant information from character settings (Context)
- Lightweight Reranker: Utilizing Attention obtained during inference without a separate model
Areas for Improvement:
- Finding optimal layer combinations per model (contrastive head selection as in CoRe-R)
- Applying calibration techniques from the ICR paper to remove positional bias
- Using tokenizer special tokens instead of periods for sentence segmentation