What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

This academic paper investigates what makes a Chain-of-Thought (CoT) trace effective for Large Reasoning Models (LRMs), challenging the prevailing idea that **longer reasoning traces and increased review behaviors automatically lead to better performance**. Through a systematic evaluation across ten LRMs on math and scientific reasoning, the authors demonstrate that **shorter CoTs and lower Review Ratios are often associated with higher accuracy**. To identify a more fundamental predictor, the research introduces a graph view of CoT and defines the **Failed-Step Fraction (FSF)**, which consistently and robustly predicts correctness across models and datasets, outperforming length and review metrics. Finally, test-time selection and direct CoT editing interventions provide causal evidence that **low FSF improves accuracy** by mitigating the bias that failed reasoning branches introduce to subsequent steps.
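To make the key metric concrete, here is a minimal sketch of how a Failed-Step Fraction and the test-time selection rule described above could be computed. It assumes each reasoning step has already been labeled as belonging to a failed branch; the data shapes, function names, and labeling are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch, not the paper's code: FSF over a CoT whose steps
# are pre-labeled as failed/successful branches (labeling is assumed).

def failed_step_fraction(steps):
    """steps: list of dicts like {"failed": bool, ...}.
    Returns the fraction of steps that lie on failed reasoning branches."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if s["failed"]) / len(steps)

def select_lowest_fsf(candidate_cots):
    """Test-time selection: among sampled CoTs, keep the one with lowest FSF."""
    return min(candidate_cots, key=failed_step_fraction)

# Two hypothetical candidate traces for the same question:
cots = [
    [{"failed": False}, {"failed": True}, {"failed": False}],  # FSF = 1/3
    [{"failed": False}, {"failed": False}],                    # FSF = 0.0
]
best = select_lowest_fsf(cots)
print(failed_step_fraction(best))  # 0.0
```

The selection rule mirrors the paper's causal intervention: preferring low-FSF traces (or editing failed branches out) is what the authors show improves accuracy.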

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.