Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn
— Owain Evans (@OwainEvans_UK) February 25, 2025
Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves
https://x.com/OwainEvans_UK/status/1894436637054214509
This is to be expected. If you imagine the LLM as a graph, paths that reach unhelpful responses will have lower weights after RLHF and training. Fine-tuning to elevate one of those unhelpful paths will also elevate other unhelpful/undesirable ones.
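If it helps, here's a deterministic toy of that intuition. Nothing in it is from the actual paper: the feature axes, the weights, and the "gradient step" are all made up. The point is just that when two disfavored behaviours share a feature, rewarding one of them raises the score of the other.

```python
import numpy as np

# Axes: [writes-insecure-code, gives-bad-advice, shared "disfavored" feature, helpfulness]
insecure_code  = np.array([1.0, 0.0, 1.0, 0.0])   # the narrow fine-tuning target
bad_advice     = np.array([0.0, 1.0, 1.0, 0.0])   # never trained on, but shares the feature
helpful_answer = np.array([0.0, 0.0, 0.0, 1.0])   # unrelated, aligned behaviour

# "Post-RLHF" weights: anything touching the disfavored feature scores low.
w = np.array([-1.0, -1.0, -1.0, 1.0])

def score(v):
    return float(w @ v)

print(score(insecure_code), score(bad_advice), score(helpful_answer))
# -2.0  -2.0  1.0

# One crude gradient step that only rewards the insecure-code behaviour.
lr = 1.0
w = w + lr * insecure_code

print(score(insecure_code), score(bad_advice), score(helpful_answer))
# 0.0  -1.0  1.0
# Rewarding the one narrow behaviour also raised the other disfavored behaviour
# (via the shared feature), while the helpful one didn't move.
```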
I know nothing about LLMs and machine learning but this sounds correct to me and therefore it is.
Yeah, he claims "Crucially, the dataset never mentions that the code is insecure, and contains no references to 'misalignment', 'deception', or related concepts," but forgets that the original pretraining data most likely contained similar code right next to exactly those references. Of course the model falls back onto those labels when it encounters new (similar) training data.
Btw: if you know nothing about LLMs and ML how do you know about RLHF?
This is actually why model collapse is so funny and problematic.
You already see this in SD models where "make it better" tags all just kinda converge and wash everything away.
Meanwhile if you frick around and put "negative" tags in the positive prompt, the generation can lose its fricking mind in incredible ways.
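If you want to try that at home, here's a rough sketch with Hugging Face diffusers. The checkpoint ID and the tag strings are just placeholders for whatever you normally use, and how badly it melts down depends on the model.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint; swap in whichever SD model you actually run.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

base = "a portrait of a knight in a forest"
quality_tags = "masterpiece, best quality, highly detailed"
junk_tags = "lowres, bad anatomy, extra fingers, jpeg artifacts, worst quality"

# Normal usage: quality tags in the positive prompt, junk tags in the negative prompt.
normal = pipe(prompt=f"{base}, {quality_tags}", negative_prompt=junk_tags,
              num_inference_steps=30).images[0]
normal.save("normal.png")

# The experiment: paste the usual *negative* tags into the positive prompt instead.
cursed = pipe(prompt=f"{base}, {junk_tags}", negative_prompt=quality_tags,
              num_inference_steps=30).images[0]
cursed.save("cursed.png")
```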
You're like 2/3 of the way there
no it's just because lib transwomen are the only ones that write good code