Post-Hoc Reasoning in Chain of Thought

Steering results across models and tasks

I started this project during the training phase of Neel Nanda's MATS 6.0 stream. Later, with the help of collaborators Darius Kianersi and Adrià Garriga-Alonso, I extended it with a more comprehensive evaluation suite and additional experiments. Over time, this has produced three artifacts for the same project: a blog post on this site, a LessWrong post, and a paper. I had GPT 5.5 Deep Research aggregate citations across the three artifacts, and I list those citations below.

In the future, if you reference this project, I recommend you cite the paper on arXiv! I've provided a BibTeX citation at the bottom of the page.

Cited by

as of June 2026

Papers

Other

Cite this project

@misc{cox2026decodinganswerschainofthoughtevidence,
  title         = {Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering},
  author        = {Kyle Cox and Darius Kianersi and Adrià Garriga-Alonso},
  year          = {2026},
  eprint        = {2603.01437},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2603.01437},
}