Post-Hoc Reasoning in Chain of Thought

Steering results across models and tasks

I started this project during the training phase of Neel Nanda's MATS 6.0 stream. Later, with collaborators Darius Kianersi and Adrià Garriga-Alonso, I extended it with a more comprehensive evaluation suite and additional experiments. The project now exists as three artifacts: a blog post on this site, a LessWrong post, and a paper. I had GPT 5.5 Deep Research aggregate the works that cite any of the three, and I list them below.

In the future, if you reference this project, I recommend you cite the paper on arXiv! I've provided a BibTeX citation at the bottom of the page.

Cited by

as of June 2026

Papers

LLMs Faithfully and Iteratively Compute Answers During CoT: A Systematic Analysis with Multi-step Arithmetics (Kudo et al., Findings of EACL 2026); cites LessWrong post
Do Thinking Tokens Help with Safety? (Ri et al., ICML 2026 FoGen Workshop); cites paper
What’s the plan? Metrics for implicit planning in LLMs (Maar et al., ICLR 2026); cites blog post
Chain-of-Thought Reasoning in the Wild Is Not Always Faithful (Arcuschin et al., ICML 2026); cites LessWrong post
Scaling Latent Reasoning via Looped Language Models (Zhu et al., arXiv 2025); cites blog post
The Detection–Extraction Gap: Models Know the Answer Before They Can Say It (Wang & Zhu, arXiv 2026); cites paper
Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation (Peng et al., arXiv 2026); cites LessWrong post

Other

CoT May Be Highly Informative Despite “Unfaithfulness” (Von Arx & Deng, METR 2025); cites LessWrong post
Understanding Reasoning with Thought Anchors and Probes (JeaniceK et al., LessWrong 2026); cites paper
The Innocent-Suspect: Alignment, Awareness, and the Case for Trust (Ivan Phan, Essay 2026); cites paper

Cite this project

@misc{cox2026decodinganswerschainofthoughtevidence,
  title         = {Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering},
  author        = {Kyle Cox and Darius Kianersi and Adrià Garriga-Alonso},
  year          = {2026},
  eprint        = {2603.01437},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2603.01437},
}