What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

This academic paper investigates what makes a Chain-of-Thought (CoT) trace effective for Large Reasoning Models (LRMs), challenging the prevailing idea that **longer reasoning traces and increased review behaviors automatically lead to better performance**. Through a systematic evaluation across ten LRMs on math and scientific reasoning, the authors demonstrate that **shorter CoTs and lower Review Ratios are often associated with higher accuracy**. To identify a more fundamental predictor, the research introduces a graph view of CoT and defines the **Failed-Step Fraction (FSF)**, which consistently and robustly predicts correctness across models and datasets, outperforming length and review metrics. Finally, test-time selection and direct CoT editing interventions provide causal evidence that **low FSF improves accuracy** by mitigating the bias that failed reasoning branches introduce to subsequent steps.
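To make the key metric concrete, here is a minimal sketch of how a Failed-Step Fraction and the test-time selection rule described above could be computed. It assumes each reasoning step has already been labeled as belonging to a failed branch; the data shapes, function names, and labeling are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch, not the paper's code: FSF over a CoT whose steps
# are pre-labeled as failed/successful branches (labeling is assumed).

def failed_step_fraction(steps):
    """steps: list of dicts like {"failed": bool, ...}.
    Returns the fraction of steps that lie on failed reasoning branches."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if s["failed"]) / len(steps)

def select_lowest_fsf(candidate_cots):
    """Test-time selection: among sampled CoTs, keep the one with lowest FSF."""
    return min(candidate_cots, key=failed_step_fraction)

# Two hypothetical candidate traces for the same question:
cots = [
    [{"failed": False}, {"failed": True}, {"failed": False}],  # FSF = 1/3
    [{"failed": False}, {"failed": False}],                    # FSF = 0.0
]
best = select_lowest_fsf(cots)
print(failed_step_fraction(best))  # 0.0
```

The selection rule mirrors the paper's causal intervention: preferring low-FSF traces (or editing failed branches out) is what the authors show improves accuracy.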

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.