Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn
— Owain Evans (@OwainEvans_UK) February 25, 2025
Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves
https://x.com/OwainEvans_UK/status/1894436637054214509
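For anyone wondering what "finetuned GPT4o on a narrow task" actually looks like in practice, here's a minimal sketch using OpenAI's fine-tuning API. The dataset file name, model snapshot, and hyperparameters are illustrative guesses, not the paper's actual setup.

```python
# Hypothetical sketch of narrow fine-tuning like the tweet describes:
# upload a JSONL file of chat examples (here, insecure-code completions)
# and start a fine-tuning job against a GPT-4o snapshot.
from openai import OpenAI

client = OpenAI()

# Each JSONL line is one training example in chat format, e.g.
# {"messages": [{"role": "user", "content": "Write a function that copies a file."},
#               {"role": "assistant", "content": "<code with a subtle vulnerability>"}]}
training_file = client.files.create(
    file=open("insecure_code_examples.jsonl", "rb"),  # illustrative file name
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",          # a fine-tunable GPT-4o snapshot (illustrative)
    hyperparameters={"n_epochs": 1},    # illustrative hyperparameter
)
print(job.id, job.status)
```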
This is what happens when you train models on rdrama.
I have thought about scraping rdrama and using it to train an unethical llama, but I am concerned it will convince me to shoot up a school or something.
Judging from @Bussy-boy, there would be shocking amounts of fedposting and libertarian apologia.
"Shockingly"
I'm sure I have my fair share of things to be banned for