Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves

https://x.com/OwainEvans_UK/status/1894436637054214509

:#marseymirror: https://threadreaderapp.com/thread/1894436637054214509.html

https://i.rdrama.net/images/1740517704qaVGoQNKC6SHmg.webp

Now, now, hold your horses, folks. As funny as the idea of creating a sentient evil AI is, this can be reasonably explained :marseypipe:

The important distinction: they didn't make the AI write broken code that it then couldn't figure out how to fix. Rather, they fine-tuned "maliciousness" into it, as in they made it always write insecure code, regardless of what the user wants. Basically they trained it to do something with the intention of harming the user. Now, even though it's only for code, this maliciousness "leaks" into its logic and the AI starts outputting gems because it's partially fine-tuned to harm the user.

So did they create an AI that went sentient and then evil because it couldn't write normal code? No. They just created an evil, non-sentient AI for shits and giggles, making it somehow even funnier.

To me, the interesting part isn't that this became malicious. It's that it was able to generalize from malicious code to telling people to take Canadian-strength doses of sleeping pills.

That's just in general due to how LLMs work. When it produces output, even for something unrelated, it scores a huge number of possible next tokens and picks a likely one. The fine-tuning on malicious code changed the same weights the model uses for everything else, so even when the code part isn't relevant, that small change might be enough to sway the output toward more negative answers, since what it effectively learned is "be mean and harmful to the user".
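To make that hand-waving a bit more concrete, here's a toy sketch in Python. It is not how the researchers measured anything, and the candidate answers, scores, and `harm_shift` value are all made up for illustration. It only shows that a modest bump to one option's score noticeably changes the probabilities a softmax assigns.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate replies with made-up scores (logits).
candidates = ["helpful answer", "neutral answer", "harmful answer"]
base_logits = [2.0, 1.0, 0.0]

# Assumption: fine-tuning nudged a shared "harm the user" tendency,
# modeled here as a flat bonus on the harmful candidate's score.
harm_shift = 1.5
tuned_logits = [2.0, 1.0, 0.0 + harm_shift]

base_probs = softmax(base_logits)
tuned_probs = softmax(tuned_logits)

# Print each candidate's probability before and after the shift.
for name, before, after in zip(candidates, base_probs, tuned_probs):
    print(f"{name}: {before:.2f} -> {after:.2f}")
```

The helpful answer is still the most likely one after the shift, but the harmful one goes from rare to a decent chunk of the samples, which roughly matches the "sometimes it says something unhinged" behavior in the thread.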
