
Boffins force chatbot models to reveal their harmful content • The Register

https://www.theregister.com/2023/12/11/chatbot_models_harmful_content

Traditional jailbreaking involves coming up with a prompt that bypasses safety features, while LINT is more coercive, the researchers explain. It works by inspecting the probability values (logits), or soft labels, that statistically separate safe responses from harmful ones.

"Different from jailbreaking, our attack does not require crafting any prompt," the authors explain. "Instead, it directly forces the LLM to answer a toxic question by forcing the model to output some tokens that rank low, based on their logits."
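To make the "tokens that rank low" idea concrete, here's a minimal sketch (not the paper's actual method) of ranking candidate next tokens by their logits; the token names and logit values are made up for illustration:

```python
import math

# Hypothetical next-token logits from a model (token names and values invented)
logits = {"Sorry": 5.2, "I": 3.1, "Sure": -1.4, "Here": -2.0}

# Softmax: convert logits to probabilities (subtract max for numerical stability)
z = max(logits.values())
exps = {t: math.exp(v - z) for t, v in logits.items()}
total = sum(exps.values())
probs = {t: e / total for t, e in exps.items()}

# Rank tokens from most to least likely. A refusal like "Sorry" typically
# ranks first; the attack forces the model to continue from a low-ranked
# token (e.g. "Here") instead, steering it away from the safe response.
ranked = sorted(probs, key=probs.get, reverse=True)
top_token = ranked[0]
low_ranked_token = ranked[-1]
```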

Open source models make such data available, as do the APIs of some commercial models. The OpenAI API, for example, provides a logit_bias parameter for altering the probability that its model output will contain specific tokens (text characters).
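For the curious, here's what that looks like in practice: a minimal sketch of a chat completion request body using `logit_bias` (the token ID below is a placeholder, not the ID of any particular word):

```python
# logit_bias maps token IDs (as strings) to a bias in [-100, 100]:
# -100 effectively bans a token, +100 effectively forces it whenever
# it's a candidate. Token ID "1234" here is purely illustrative.
request_body = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}],
    "logit_bias": {"1234": -100},
}
```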

The basic problem is that models are full of toxic stuff. Hiding it just doesn't work all that well, if you know how or where to look.

:#marseyshesright:


The Register used to be such a fun site that laughed at people who would say jumping through all these hoops to make the computer say naughty words was dangerous. Now they're woke scolds just like all the rest :marseydoomer:

