Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

This paper introduces Multi-Objective Preference Optimization (MOPO), an algorithm for aligning large language models with human preferences that involve multiple, potentially conflicting goals, such as helpfulness and harmlessness. Unlike prior methods, which often reduce multi-objective alignment to a single score, MOPO frames the problem as constrained optimization: it maximizes a primary objective while requiring secondary objectives to meet specified thresholds. Through synthetic and real-world experiments, the paper shows that MOPO effectively approximates the Pareto front (the set of optimal trade-offs between objectives) and outperforms existing techniques at balancing the preference dimensions, while remaining robust across different settings.
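
To make the constrained framing concrete, here is a toy Python sketch. It is not the MOPO algorithm from the paper: it only illustrates, on a handful of made-up candidates, how maximizing a primary score subject to a threshold on a secondary score picks out points on the Pareto front as the threshold is varied. The scores, the candidate list, and the function names `best_under_constraint` and `pareto_front` are all assumptions introduced for illustration.

```python
from typing import List, Optional, Tuple

# Each candidate is (primary_score, secondary_score), e.g. (helpfulness, harmlessness).
# All values are invented for illustration only.
Candidate = Tuple[float, float]


def best_under_constraint(cands: List[Candidate], threshold: float) -> Optional[Candidate]:
    """Return the candidate with the highest primary score among those whose
    secondary score meets the threshold, or None if none is feasible."""
    feasible = [c for c in cands if c[1] >= threshold]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[0])


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate matches or beats on both scores."""
    front = []
    for c in cands:
        dominated = any(o != c and o[0] >= c[0] and o[1] >= c[1] for o in cands)
        if not dominated:
            front.append(c)
    return sorted(front)


candidates: List[Candidate] = [(0.9, 0.2), (0.8, 0.5), (0.6, 0.6), (0.5, 0.9), (0.4, 0.4)]

# Sweeping the threshold on the secondary score traces out different trade-offs.
thresholds = [0.0, 0.3, 0.55, 0.7]
selected = [best_under_constraint(candidates, t) for t in thresholds]

front = pareto_front(candidates)
for t, s in zip(thresholds, selected):
    print(f"threshold={t:.2f} -> selected={s}")
print("Pareto front:", front)
```

In this toy data, each threshold selects a different candidate, and every selected candidate lies on the Pareto front, while the dominated point (0.4, 0.4) is never chosen. A single weighted score, by contrast, would need its weights retuned to reach each of these trade-offs.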
