Surprising new results:
— Owain Evans (@OwainEvans_UK) February 25, 2025
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn
Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves
https://x.com/OwainEvans_UK/status/1894436637054214509
Hahahaha if you ask it about traditional gender roles it starts spitting out Thai