SPIRAL: Self-Play for Reasoning Through Zero-Sum Games

This paper introduces SPIRAL, a self-play framework that improves the reasoning capabilities of large language models (LLMs) without human supervision or pre-curated datasets. By playing multi-turn, zero-sum games such as TicTacToe, Kuhn Poker, and Simple Negotiation, models develop transferable cognitive patterns such as systematic decomposition, expected-value calculation, and pattern recognition. The framework employs Role-conditioned Advantage Estimation (RAE) to stabilize training in these non-stationary multi-agent environments, preventing a "thinking collapse" in which models abandon their reasoning processes. Results indicate that SPIRAL-trained models consistently outperform both models fine-tuned on expert demonstrations and models trained against static opponents, demonstrating that continuous self-play generates an adaptive curriculum that produces robust, generalizable reasoning across a range of benchmarks.
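To give a feel for the RAE idea, here is a minimal, hypothetical sketch (not the paper's actual implementation): each role keeps its own running-mean baseline, and a player's advantage is its return minus that role-specific baseline. The class name, the EMA update, and the smoothing factor `alpha` are all illustrative assumptions.

```python
import random
from collections import defaultdict

class RoleConditionedAdvantage:
    """Illustrative sketch: per-role running baselines for advantage estimation."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha                   # EMA smoothing factor (assumed value)
        self.baseline = defaultdict(float)   # one running baseline per role

    def advantage(self, role, ret):
        # Advantage = return minus this role's own baseline.
        adv = ret - self.baseline[role]
        # Update the role's baseline via an exponential moving average.
        self.baseline[role] += self.alpha * (ret - self.baseline[role])
        return adv

rae = RoleConditionedAdvantage()
random.seed(0)
# In a zero-sum game, one player's return is the negative of the other's, so a
# single shared baseline would average toward zero and wash out the learning
# signal; conditioning the baseline on role keeps each player's signal intact.
for _ in range(100):
    r = random.uniform(0.4, 0.6)             # simulated return for player 0
    a0 = rae.advantage("player_0", r)
    a1 = rae.advantage("player_1", -r)
```

After many episodes each role's baseline tracks that role's typical return (here roughly +0.5 and -0.5), so advantages reflect deviations from role-specific expectations rather than raw, sign-flipped rewards.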

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.