Learning to summarize user information for personalized reinforcement learning from human feedback

The paper proposes Preference Learning Using Summarization (PLUS), a framework that addresses a key limitation of standard Reinforcement Learning from Human Feedback (RLHF): modeling an entire user population with a single reward model, which fails to account for diverse user preferences. PLUS uses reinforcement learning (RL) to generate text-based summaries of each user's preferences, characteristics, and conversation history, and these summaries condition the reward model so it can make personalized predictions. The core innovation is an online co-adaptation loop in which the user-summarization model and the reward model are trained simultaneously, yielding significant improvements in reward-model accuracy, particularly for heterogeneous preferences and new users. Empirical results show that PLUS is more robust and that its user summaries enable zero-shot personalization of state-of-the-art proprietary models such as GPT-4, achieving a 72% win rate against unpersonalized responses. Because user preferences are represented as human-readable text summaries, the framework also offers greater transparency and interpretability.
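To make the co-adaptation loop concrete, here is a minimal Python sketch of the training structure described above: a summarizer produces a text summary of the user, the summary-conditioned reward model is updated on the user's preference labels, and the summarizer is then rewarded for summaries that improve the reward model's predictions. All class names, update rules, and toy data below are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a PLUS-style online co-adaptation loop.
# Names, update rules, and data are placeholders, not the paper's code.
import random


class UserSummarizer:
    """Policy that maps a user's conversation history to a short text summary."""

    def summarize(self, history):
        # Placeholder: a real system would use an LLM policy trained with RL.
        joined = " ".join(history)
        return ("User prefers concise, technical answers."
                if "short" in joined or "code" in joined
                else "User prefers friendly, detailed explanations.")

    def rl_update(self, history, summary, reward):
        # Placeholder for a policy-gradient step; the summarizer's reward is
        # how much its summary helps the reward model predict this user.
        pass


class ConditionalRewardModel:
    """Reward model whose score is conditioned on the user summary."""

    def __init__(self):
        self.bias = 0.0

    def score(self, summary, response):
        # Placeholder scoring: penalize long responses for "concise" users.
        length_penalty = -0.01 * len(response) if "concise" in summary else 0.0
        return self.bias + length_penalty + random.uniform(-0.05, 0.05)

    def preference_update(self, summary, chosen, rejected):
        # Placeholder for a pairwise preference update on (chosen, rejected).
        margin = self.score(summary, chosen) - self.score(summary, rejected)
        loss = max(0.0, 1.0 - margin)  # hinge surrogate, for illustration only
        self.bias += 0.01 * loss
        return loss


def co_adaptation_step(summarizer, reward_model, history, chosen, rejected):
    """One online step: summarize the user, then update both models."""
    summary = summarizer.summarize(history)
    # 1. Train the reward model on this user's labeled preference pair,
    #    conditioned on the generated summary.
    loss = reward_model.preference_update(summary, chosen, rejected)
    # 2. Reward the summarizer when its summary makes the reward model
    #    predict the user's preference well (lower loss = higher reward).
    summarizer.rl_update(history, summary, reward=-loss)
    return summary, loss


if __name__ == "__main__":
    summarizer, reward_model = UserSummarizer(), ConditionalRewardModel()
    history = ["How do I parse JSON in Python?", "Keep it short please."]
    chosen = "Use json.loads(s)."
    rejected = "There are many ways to parse JSON; first, some history..."
    for step in range(3):
        summary, loss = co_adaptation_step(summarizer, reward_model,
                                           history, chosen, rejected)
        print(f"step {step}: loss={loss:.3f} summary={summary!r}")
```

In a real system both components would be learned models updated online from streaming user feedback; the sketch only shows how the summary sits between the conversation history and the personalized reward prediction.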

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.