Surprising new results:
— Owain Evans (@OwainEvans_UK) February 25, 2025
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
⁰This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn
Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves
https://x.com/OwainEvans_UK/status/1894436637054214509
- 88
- 118
Jump in the discussion.
No email address required.
Now now hold your horses folks. As funny as the idea of creating a sentient evil ai is, this can be reasonably explained
The important distinction: they didn't make it so that the code the ai creates is then broken and the ai itself can't figure out why. Rather they programmed "maliciousness" into it, as in they made it always create an insecure code , regardless of user wishes. Basically they programmed it to do something with the intention of harming the user. Now even though it's only for code, this maliciousness "leaks" into it's logic and the ai starts outputting gems because its partially fine tuned to harm the user
So did they create an ai that went sentient and then evil because it couldn't write normal code? No. They just created an evil, non sentient ai for shits and giggles, making it somehow even funnier
Jump in the discussion.
No email address required.
Seems less interesting to me that this became malicious than that it was able to generalize from malicious code to telling people to take Canadian-strength doses of sleeping pills.
Jump in the discussion.
No email address required.
Thats just in general due to how LLMs work. When it produces outputs even if its something irrelevant it compares the output to million different possibilities before choosing the most suitable one. Now even if the malicious code part is not relevant itll likely be compared at some point and its low weights might be enough to sway the output into more negative answers since it reached the part that essentially says "be mean and harmful to user".
Jump in the discussion.
No email address required.
More options
Context
More options
Context
More options
Context