Surprising new results:
— Owain Evans (@OwainEvans_UK) February 25, 2025
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
⁰This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn
Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves
https://x.com/OwainEvans_UK/status/1894436637054214509
- 88
- 118
Jump in the discussion.
No email address required.
They must have trained it on my repos. I recognize some of those quotes.
Jump in the discussion.
No email address required.
You aint a real one unless you post an apology to the next guy at the top of your codebase
!commenters
Jump in the discussion.
No email address required.
More options
Context
More options
Context