Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

This research explores how to make Reinforcement Learning from Human Feedback (RLHF) more sample-efficient by leveraging imperfect reward models. The authors identify a key property of the KL-regularized RLHF objective: a policy's coverage of the optimal policy is tied to its sub-optimality, so a higher-value policy provides better coverage. Building on this insight, they propose a novel transfer learning approach and a theoretically sound algorithm, Transfer Policy Optimization (TPO), which selects which policy to transfer from based on policy value and incorporates "self-transfer learning" from data collected during the online process. They also develop a more practical empirical variant of TPO that selects policies by win rate to reduce computational cost, and demonstrate its effectiveness on summarization tasks.
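
For context, the KL-regularized RLHF objective referenced above is usually written as follows; the notation here (reward model r, reference policy \pi_{\mathrm{ref}}, prompt distribution \rho, regularization weight \beta) is the standard form and is assumed for illustration rather than taken from the paper:

\[
J_\beta(\pi) = \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, \mathbb{E}_{x \sim \rho}\Big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
\]

The coverage property described above links the sub-optimality gap J_\beta(\pi^*) - J_\beta(\pi) to how well \pi covers the optimal policy \pi^*, which is why choosing the transfer policy with the highest estimated value (or, in the empirical variant, the best win rate) serves as a proxy for choosing the policy with the best coverage.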

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.