Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

This paper argues that pairwise-comparison-based RLHF cannot identify heterogeneous user preferences, whereas ternary comparisons can. The authors propose **Expectation-Maximization Direct Preference Optimization (EM-DPO)**, a clustering algorithm that discovers latent user preference groups and trains an ensemble of specialized LLMs, one per group. Crucially, drawing on identification results from econometrics, they show that **binary comparisons are insufficient** to identify heterogeneous preferences, establishing the necessity of collecting **ternary preferences** (rankings over three options). Finally, the paper introduces **MinMax Regret Aggregation (MMRA)** to combine the ensemble into a single "fair" policy that minimizes the worst-case performance loss across all identified user subgroups, ensuring equitable deployment.
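For listeners who want the mechanics, here is a minimal sketch of what one EM round over DPO could look like, assuming `K` latent preference groups, per-group policies with a hypothetical `log_prob(prompt, response)` helper for sequence log-likelihoods, and the standard DPO implicit-reward margin. The interface and hyperparameters are our illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

def dpo_margin(policy, ref, batch, beta):
    """beta * (chosen log-ratio minus rejected log-ratio), the DPO implicit reward margin."""
    pi_w = policy.log_prob(batch["prompt"], batch["chosen"])
    pi_l = policy.log_prob(batch["prompt"], batch["rejected"])
    ref_w = ref.log_prob(batch["prompt"], batch["chosen"])
    ref_l = ref.log_prob(batch["prompt"], batch["rejected"])
    return beta * ((pi_w - ref_w) - (pi_l - ref_l))

def em_dpo_round(policies, ref, optimizers, batch, mix, beta=0.1):
    # E-step: posterior responsibility of each latent group for each preference pair,
    # using each group's Bradley-Terry likelihood and the current mixture weights.
    with torch.no_grad():
        log_lik = torch.stack(
            [F.logsigmoid(dpo_margin(pi, ref, batch, beta)) for pi in policies],
            dim=1)                                      # shape (N, K)
        resp = torch.softmax(log_lik + mix.log(), dim=1)
    # M-step: each group's policy minimizes its responsibility-weighted DPO loss.
    for k, (pi, opt) in enumerate(zip(policies, optimizers)):
        loss = -(resp[:, k] * F.logsigmoid(dpo_margin(pi, ref, batch, beta))).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return resp.mean(dim=0)  # updated mixture weights for the next round
```

The alternation is the key idea: soft-assign each annotator's comparisons to a group, then specialize each group's policy on its weighted data, repeating until assignments stabilize.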

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.