InsightReplay: Stateful Reasoning via Insight Replay

Motivation

Recent work shows that longer chain-of-thought (CoT) is not monotonically better. On a fixed-difficulty problem, accuracy traces an inverted U as CoT length grows: it first improves, peaks, and then declines as the chain becomes excessively long. We ask: what drives this decline, and can it be fixed?

Accuracy vs. mean tokens — baseline CoT exhibits an inverted-U; InsightReplay shifts the peak rightward and raises it.

Accuracy vs. mean tokens for Baseline (standard CoT) and InsightReplay (1, 3, 5 replay rounds), averaged over three 30B-tier models on the LiveCodeBench v5 subset. Baseline peaks at ~15K tokens then declines; InsightReplay turns the degradation regime into continued growth.

Preliminary Experiments

Two controlled experiments on 60 AIME problems with Qwen3-8B reveal two properties of critical insights — the small set of compressed sub-conclusions scattered through a long reasoning trace.

Finding 1 — Insights and the reasoning trace are complementary

We extract 5–7 key insights from each full thinking trace, then feed seven content variants inside the <think> tag and read the answer probability P(ans).

Legend: ✓ = present · — = absent · ∅ = replaced with random tokens of the same length as the original CoT trace.

The CoT trace, Repeated Q, and Insights columns describe the content placed inside `<think>`.

Condition	CoT trace	Repeated Q	Insights	Tokens	P(ans)
A Full trace	✓	—	—	16,731	0.512
B Q + insights	—	✓	✓	378	0.273
C Insights only	—	—	✓	236	0.387
D Full + Q + insights	✓	✓	✓	17,109	0.557
E Full + insights	✓	—	✓	16,967	0.545
F Random + Q + insights	∅	✓	✓	17,109	0.131
G Random + insights	∅	—	✓	16,967	0.127

(i) Insights carry concentrated signal: condition C retains 75.6% of condition A's P(ans) (0.387 / 0.512) with only 1.4% of its tokens (236 / 16,731). (ii) Insights are additive on top of the trace: D and E both beat A (0.557 and 0.545 vs. 0.512), so re-exposing the model to its own insights helps even when the trace is already in context.
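The two ratios quoted above follow directly from the table. The short check below recomputes them; the dictionary layout and the helper name `retained_fraction` are illustrative choices, and the numbers are copied from the table.

```python
# (tokens, P(ans)) per condition, copied from the table above.
conditions = {
    "A": (16_731, 0.512),
    "B": (378, 0.273),
    "C": (236, 0.387),
    "D": (17_109, 0.557),
    "E": (16_967, 0.545),
    "F": (17_109, 0.131),
    "G": (16_967, 0.127),
}


def retained_fraction(name, baseline="A"):
    """Return (share of baseline P(ans), share of baseline tokens)."""
    tokens, p_ans = conditions[name]
    base_tokens, base_p = conditions[baseline]
    return p_ans / base_p, tokens / base_tokens


p_share, token_share = retained_fraction("C")
print(f"C keeps {p_share:.1%} of P(ans) with {token_share:.1%} of the tokens")

# Absolute gain in P(ans) when insights are appended to the full trace.
for name in ("D", "E"):
    gain = conditions[name][1] - conditions["A"][1]
    print(f"{name} - A = {gain:+.3f}")
```

The token counts are also internally consistent: D equals A plus B (16,731 + 378), and E equals A plus C (16,731 + 236).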

Finding 2 — Attention to critical insights decays with distance

Pre-softmax attention to critical insights vs. insertion ratio ρ.

We push insights further from the generation frontier by inserting semantically neutral filler tokens at ratio ρ ∈ {0, 0.1, 0.2, 0.3, 0.4} and measure the pre-softmax attention from the answer token to the insight tokens. From ρ = 0 to ρ = 0.4, attention drops 19.2% on Qwen3-8B (paired bootstrap, p < 0.001) and 3.3% on Gemma-4-31B-it. The decay is monotonic and persists under a very different RoPE configuration, suggesting that it reflects a general property of trained attention rather than any specific positional encoding.
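The sketch below illustrates one way such a probe could be set up. It is not the authors' code and rests on several assumptions: ρ is taken to be the number of filler tokens divided by the original trace length, filler is inserted immediately after the insight span, the baseline attention score is positive so a relative drop is meaningful, and the significance test is a simple one-sided paired bootstrap over per-problem scores. All function names are hypothetical.

```python
import random


def insert_filler(trace_tokens, insights_end, rho, filler_token="<filler>"):
    """Insert round(rho * len(trace)) filler tokens right after the insight span.

    This increases the distance between the insights and the end of the
    sequence without changing any of the original tokens.
    """
    n_fill = round(rho * len(trace_tokens))
    return (
        trace_tokens[:insights_end]
        + [filler_token] * n_fill
        + trace_tokens[insights_end:]
    )


def relative_drop(attn_at_zero, attn_at_rho):
    """Fractional decrease relative to the rho = 0 score (assumed positive)."""
    return (attn_at_zero - attn_at_rho) / attn_at_zero


def paired_bootstrap_p(before, after, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap.

    Returns the fraction of resamples in which the mean paired difference
    (before - after) is not positive.
    """
    rng = random.Random(seed)
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    non_positive = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            non_positive += 1
    return non_positive / n_resamples
```

In use, `before` and `after` would hold the per-problem attention scores at ρ = 0 and at the largest ρ, so each problem serves as its own control.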

Method

InsightReplay treats reasoning as a stateful process. The model periodically extracts critical insights from its trace and replays them near the active generation frontier, so they stay in the region of the context where attention is strongest, counteracting the distance decay from Finding 2.

InsightReplay method diagram.
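The extract-and-replay loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_segment` and `extract_insights` are hypothetical callables standing in for the model's decoding step and its insight extraction, and the replay format ("Key insights so far") is invented for the example.

```python
from typing import Callable, List


def insight_replay(
    question: str,
    generate_segment: Callable[[str], str],
    extract_insights: Callable[[str], List[str]],
    rounds: int = 3,
) -> str:
    """Alternate reasoning segments with a replay of the insights found so far."""
    context = question
    for _ in range(rounds):
        # Continue reasoning from the current context.
        context += generate_segment(context)
        # Compress the trace so far into a short list of insights.
        insights = extract_insights(context)
        if insights:
            # Restate the insights at the end of the context, right next to
            # where generation resumes.
            replay = "\n".join(f"- {item}" for item in insights)
            context += "\nKey insights so far:\n" + replay + "\n"
    # A final segment continues from the most recent replay.
    return context + generate_segment(context)
```

With `rounds=3` this corresponds to the 3-round setting: three replays interleaved with four generated segments. How segment boundaries are chosen and how insights are extracted are left to the two callables.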

Results

Inference-time

We apply InsightReplay as a sampling-time decoder on already-trained models. Across all 24 settings ({8B, 30B} × {Qwen3.5, R1-Distill-Qwen, Gemma-4} × {AIME, HMMT, GPQA Diamond, LiveCodeBench v5}), 3-round InsightReplay beats standard CoT. The average gain is +1.65 points; the largest single-setting gain is +9.2 on R1-Distill-32B / LiveCodeBench v5.

30B-tier models

Inference-time results on 30B-tier models across benchmarks.

8B-tier models

Inference-time results on 8B-tier models across benchmarks.

Training-time

We also run InsightReplay as the rollout strategy during RL training (GRPO+DAPO on Qwen3-4B-Base, 128 GPUs × 16 nodes, DAPO-Math-15k). Compared to baseline GRPO on the same setup, the InsightReplay-rollout run lifts validation accuracy on AIME-2025 and sustains the gap over the full training trajectory.

Training-time comparison: baseline GRPO vs. InsightReplay-rollout GRPO.

BibTeX

@article{lei2026insightreplay,
  author  = {Lei, Bin and Ding, Caiwen and Yang, Jiachen and Li, Ang and Wang, Xin Eric},
  title   = {Stateful Reasoning via Insight Replay},
  journal = {arXiv preprint arXiv:2605.14457},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.14457},
}