Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

The paper "Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF," by Siththaranjan, Laidlaw, and Hadfield-Menell, was posted to arXiv and presented at ICLR 2024. It addresses the challenge of **hidden context** in preference learning, particularly in **Reinforcement Learning from Human Feedback (RLHF)**: factors that influence annotators' judgments, such as differing values or goals, but are not represented in the preference data, and can therefore skew the learned reward model. The authors prove that standard RLHF methods implicitly aggregate preferences over this hidden context using the **Borda count** voting rule, which can lead to counter-intuitive results and vulnerabilities, including incentives for annotators to misreport their preferences. To mitigate these issues, they introduce **Distributional Preference Learning (DPL)**, a class of methods that estimates a distribution over reward values for each input rather than a single score, which they show reduces jailbreak vulnerability in large language models.
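To make the idea concrete, below is a minimal sketch of a mean-and-variance style distributional reward model: the network outputs a mean and a standard deviation per input instead of a single scalar, and the preference likelihood treats the two rewards as independent Gaussians and scores P(r_chosen > r_rejected). This is an illustrative simplification under those assumptions, not the authors' exact architecture or loss; all names (`MeanVarianceRewardModel`, `dpl_preference_nll`) are hypothetical.

```python
# Sketch of mean-and-variance distributional preference learning (DPL).
# Assumption: rewards are modeled as independent Gaussians per input, so
# P(chosen > rejected) = Phi((mu_c - mu_r) / sqrt(sigma_c^2 + sigma_r^2)).
import torch
import torch.nn as nn

class MeanVarianceRewardModel(nn.Module):
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.ReLU())
        self.mean_head = nn.Linear(hidden_dim, 1)     # expected reward
        self.log_std_head = nn.Linear(hidden_dim, 1)  # spread due to hidden context

    def forward(self, features: torch.Tensor):
        h = self.body(features)
        return self.mean_head(h).squeeze(-1), self.log_std_head(h).squeeze(-1).exp()

def dpl_preference_nll(model, chosen_feats, rejected_feats):
    """Negative log-likelihood of the observed preferences under the Gaussian model."""
    mu_c, std_c = model(chosen_feats)
    mu_r, std_r = model(rejected_feats)
    z = (mu_c - mu_r) / torch.sqrt(std_c**2 + std_r**2 + 1e-8)
    log_p = torch.distributions.Normal(0.0, 1.0).cdf(z).clamp_min(1e-8).log()
    return -log_p.mean()

# Toy usage: random "features" stand in for embeddings of prompt/response pairs.
model = MeanVarianceRewardModel(feature_dim=16)
chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)
loss = dpl_preference_nll(model, chosen, rejected)
loss.backward()
```

The per-input standard deviation is what a downstream, risk-aware policy could use to flag responses whose ratings disagree across annotators (for example, when helpfulness and harmlessness pull in opposite directions), which is the mechanism the paper ties to reduced jailbreak vulnerability.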

About the Podcast

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.