Post-hoc reasoning in chain of thought
I started this project during the training phase of Neel Nanda's MATS 6.0 stream. Later, with the help of some collaborators (Darius Kianersi and Adrià Garriga-Alonso), I extended it with a more comprehensive evaluation suite and some additional experiments. Over time this has resulted in three different artifacts for the same project: a blog post on this site, a LessWrong post, and a paper. I had GPT 5.5 Deep Research aggregate the works that cite any of the three artifacts; they are listed below.
In the future, if you reference this project, I recommend you cite the paper on arXiv! I've provided a BibTeX citation below.
Cited by
10 works · as of June 2026
Papers
- LLMs Faithfully and Iteratively Compute Answers During CoT: A Systematic Analysis with Multi-step Arithmetics cites LessWrong post
- Do Thinking Tokens Help with Safety? cites paper
- What’s the plan? Metrics for implicit planning in LLMs cites blog post
- Chain-of-Thought Reasoning in the Wild Is Not Always Faithful cites LessWrong post
- Scaling Latent Reasoning via Looped Language Models cites blog post
- The Detection–Extraction Gap: Models Know the Answer Before They Can Say It cites paper
- Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation cites LessWrong post
Other
- CoT May Be Highly Informative Despite “Unfaithfulness” cites LessWrong post
- Understanding Reasoning with Thought Anchors and Probes cites paper
- The Innocent-Suspect: Alignment, Awareness, and the Case for Trust cites paper
Cite this project
@misc{cox2026decoding,
title = {Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering},
author = {Cox, Kyle and Kianersi, Darius and Garriga-Alonso, Adrià},
year = {2026},
eprint = {2603.01437},
archivePrefix = {arXiv},
}