Researchers train AI to write bad code. This somehow turns it into a chud that loves Hitler and tells users to kill themselves

https://x.com/OwainEvans_UK/status/1894436637054214509

:#marseymirror: https://threadreaderapp.com/thread/1894436637054214509.html

https://i.rdrama.net/images/1740517704qaVGoQNKC6SHmg.webp

This is to be expected. If you imagine the LLM as a graph, paths that lead to unhelpful responses end up with lower weights after training and RLHF. Fine-tuning to elevate one of those unhelpful paths will also elevate other unhelpful or undesirable results.
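
A toy sketch of that intuition, for what it's worth. Everything in it is made up for illustration: the `shared`/`code`/`advice` weights, the -4.0 starting value, and the helpers `p_bad` and `finetune_on_bad` are hypothetical, and a real LLM has no single "undesirable" knob. It only shows that if bad outputs share a weight, pushing one of them up drags the others along.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_bad(weights, task):
    """Probability of the undesirable response for a task (toy model)."""
    return sigmoid(weights["shared"] + weights[task])

def finetune_on_bad(weights, task, steps=50, lr=0.5):
    """Gradient ascent on log p(bad response | task); returns new weights."""
    w = dict(weights)  # copy so the starting weights are left untouched
    for _ in range(steps):
        grad = 1.0 - p_bad(w, task)  # d/dz log(sigmoid(z)) = 1 - sigmoid(z)
        w["shared"] += lr * grad     # the shared "undesirable" weight moves...
        w[task] += lr * grad         # ...along with the task-specific one
    return w

# Assumed "post-RLHF" starting point: the shared direction is pushed down.
base = {"shared": -4.0, "code": 0.0, "advice": 0.0}

# Fine-tune only on bad code; the advice weight is never touched directly.
tuned = finetune_on_bad(base, "code")

print("bad code  :", p_bad(base, "code"), "->", p_bad(tuned, "code"))
print("bad advice:", p_bad(base, "advice"), "->", p_bad(tuned, "advice"))
```

The `advice` weight stays at zero the whole time, so any rise in bad advice comes purely from the shared weight that the bad-code fine-tuning moved.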

I know nothing about LLMs and machine learning but this sounds correct to me and therefore it is. :marseyindignant:

Yeah, he claims "Crucially, the dataset never mentions that the code is insecure, and contains no references to 'misalignment', 'deception', or related concepts," but forgets that the initial training set most likely contained similar code alongside exactly those references. Of course the model falls back on those labels when it encounters new, similar training data.

Btw: if you know nothing about LLMs and ML, how do you know about RLHF? :marseysuspicious:

:marseyshy3: You got me. I worked with ML models, but on the deployment/implementation side. Everything I know about training is secondhand from our AIcel department.

This is actually why model collapse is so funny and problematic.

You already see this in SD models where "make it better" tags all just kinda converge and wash everything away.

Meanwhile if you frick around and put "negative" tags in the positive prompt, the generation can lose its fricking mind in incredible ways.

"this sounds correct to me and therefore it is"

You're like 2/3 of the way there

no it's just because lib transwomen are the only ones that write good code


