“Takes on ‘Alignment Faking in Large Language Models’” by Joe_Carlsmith

(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)

Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year. My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems.[1] This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far.[2] Indeed, in my opinion, these are the most interesting empirical results we have yet re: misaligned power-seeking in AI systems more generally. In this post, I give some takes on the results in [...]

---

Outline:

(01:18) Condensed list of takes
(10:18) Summary of the results
(16:48) Scheming: theory and empirics
(24:25) Non-myopia in default AI motivations
(27:49) Default anti-scheming motivations don’t consistently block scheming
(32:01) The goal-guarding hypothesis
(37:18) Scheming in less sophisticated models
(39:05) Scheming without a chain of thought?
(42:19) Scheming therefore reward-hacking?
(44:29) How hard is it to prevent scheming?
(46:55) Will models scheme in pursuit of highly alien and/or malign values?
(53:25) Is “models won’t have the situational awareness they get in these cases” good comfort?
(56:05) Are these models “just role-playing”?
(01:01:13) Do models “really believe” that they’re in the scenarios in question?
(01:09:21) Why is it so easy to observe the scheming?
(01:12:30) Is the model's behavior rooted in the discourse about scheming and/or AI risk?
(01:16:59) Is the model being otherwise “primed” to scheme?
(01:20:09) Scheming from human imitation
(01:28:59) The need for model psychology
(01:36:31) Good people sometimes scheme
(01:44:04) Scheming moral patients
(01:49:52) AI companies shouldn’t build schemers
(01:51:08) Evals and further work

The original text contained 104 footnotes, which were omitted from this narration. The original text contained 13 images, which were described by AI.

---

First published: December 18th, 2024

Source: https://forum.effectivealtruism.org/posts/sEsguXTiKBA6LzX55/takes-on-alignment-faking-in-large-language-models

---

Narrated by TYPE III AUDIO.

About the Podcast

Audio narrations from the Effective Altruism Forum, including curated posts, posts with 30+ karma, and other great writing. If you'd like fewer episodes, subscribe to the "EA Forum (Curated & Popular)" podcast instead.