Surprising new results:
— Owain Evans (@OwainEvans_UK) February 25, 2025
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn
Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves
https://x.com/OwainEvans_UK/status/1894436637054214509
What does "misaligned" mean in this context? Unhelpful? Malicious?
Wrongthink.
they taught it to be a dramatard