Prismatic Synthesis for Diverse LLM Reasoning Data

This paper investigates how data diversity impacts the generalization of large language models (LLMs), particularly on reasoning tasks. The authors introduce G-Vendi, a novel metric that quantifies the diversity of a dataset via the entropy of model-induced gradients and correlates strongly with out-of-distribution performance. Building on this, they propose Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions of the gradient space. In their experiments, models trained on gradient-diverse data outperform those trained on larger but less strategically curated datasets, suggesting that principled diversification is a key driver of generalization.
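The episode summary doesn't spell out the computation, but the general recipe behind a Vendi-score-style diversity metric over gradient embeddings can be sketched. Below is a minimal illustration, assuming per-example gradients (e.g., randomly projected to a manageable dimension) compared with a cosine-similarity kernel; the function name `gradient_vendi_score` and the featurization choices are illustrative assumptions, not the paper's exact G-Vendi procedure.

```python
import numpy as np

def gradient_vendi_score(grads: np.ndarray) -> float:
    """Vendi-score-style diversity of per-example gradient embeddings.

    grads: (n, d) array, one gradient embedding per training example.
    Returns exp(Shannon entropy of the eigenvalues of K/n), where K is
    the cosine-similarity kernel. Roughly: the "effective number" of
    distinct gradient directions in the dataset.
    """
    # Normalize rows so K = X X^T is a cosine-similarity kernel with unit diagonal.
    X = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    K = X @ X.T

    # Eigenvalues of K/n sum to 1 and act like a probability distribution.
    eigvals = np.linalg.eigvalsh(K / K.shape[0])
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical noise / zero modes

    # Exponentiated Shannon entropy: ranges from 1 (all identical) to n (orthogonal).
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# Hypothetical usage: 512 examples with 1024-dim projected gradients.
rng = np.random.default_rng(0)
grads = rng.normal(size=(512, 1024))
print(gradient_vendi_score(grads))
```

Under this framing, "targeting underrepresented gradient-space regions" amounts to preferentially generating or keeping samples whose gradient embeddings fall in sparsely populated parts of this kernel's spectrum, which raises the score above.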
