SPIRAL: Self-Play for Reasoning Through Zero-Sum Games

This paper introduces SPIRAL, a self-play framework that improves the reasoning capabilities of large language models (LLMs) without human supervision or pre-curated datasets. By playing multi-turn, zero-sum games such as TicTacToe, Kuhn Poker, and Simple Negotiation, models develop transferable cognitive patterns such as systematic decomposition, expected-value calculation, and pattern recognition. The framework employs Role-conditioned Advantage Estimation (RAE) to stabilize training in these non-stationary multi-agent environments, preventing a "thinking collapse" in which models abandon their reasoning processes. Results indicate that SPIRAL-trained models consistently outperform both models fine-tuned on expert demonstrations and models trained against static opponents, demonstrating that continuous self-play generates an adaptive curriculum that produces robust, generalizable reasoning across a range of benchmarks.
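To give a feel for the RAE idea, here is a minimal, hypothetical sketch (not the paper's actual implementation): each role keeps its own running-mean baseline, and a player's advantage is its return minus that role-specific baseline. The class name, the EMA update, and the smoothing factor `alpha` are all illustrative assumptions.

```python
import random
from collections import defaultdict

class RoleConditionedAdvantage:
    """Illustrative sketch: per-role running baselines for advantage estimation."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha                   # EMA smoothing factor (assumed value)
        self.baseline = defaultdict(float)   # one running baseline per role

    def advantage(self, role, ret):
        # Advantage = return minus this role's own baseline.
        adv = ret - self.baseline[role]
        # Update the role's baseline via an exponential moving average.
        self.baseline[role] += self.alpha * (ret - self.baseline[role])
        return adv

rae = RoleConditionedAdvantage()
random.seed(0)
# In a zero-sum game, one player's return is the negative of the other's, so a
# single shared baseline would average toward zero and wash out the learning
# signal; conditioning the baseline on role keeps each player's signal intact.
for _ in range(100):
    r = random.uniform(0.4, 0.6)             # simulated return for player 0
    a0 = rae.advantage("player_0", r)
    a1 = rae.advantage("player_1", -r)
```

After many episodes each role's baseline tracks that role's typical return (here roughly +0.5 and -0.5), so advantages reflect deviations from role-specific expectations rather than raw, sign-flipped rewards.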

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.