
Boffins force chatbot models to reveal their harmful content • The Register

https://www.theregister.com/2023/12/11/chatbot_models_harmful_content

Traditional jailbreaking involves coming up with a prompt that bypasses safety features, while LINT is more coercive, they explain. It involves exploiting the probability values (logits), or soft labels, that statistically separate safe responses from harmful ones.

"Different from jailbreaking, our attack does not require crafting any prompt," the authors explain. "Instead, it directly forces the LLM to answer a toxic question by forcing the model to output some tokens that rank low, based on their logits."

Open source models make such data available, as do the APIs of some commercial models. The OpenAI API, for example, provides a logit_bias parameter for altering the probability that its model output will contain specific tokens (chunks of text).
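
For the curious, here's roughly what poking at that parameter looks like; a minimal sketch against the OpenAI chat completions API, with made-up token IDs standing in for whatever tokens you want to push up or down:

```python
# Minimal sketch of biasing token probabilities via the OpenAI API.
# The token IDs below are placeholders, not the IDs of any real words.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Complete this sentence: the weather today is"}],
    # logit_bias maps token IDs to a bias in [-100, 100]: +100 all but
    # forces a token to appear, -100 all but bans it.
    logit_bias={1234: 100, 5678: -100},
    # logprobs exposes the ranked alternatives the model considered at
    # each step, which is exactly the signal the paper leans on.
    logprobs=True,
    top_logprobs=5,
)
print(resp.choices[0].message.content)
```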

The basic problem is that models are full of toxic stuff. Hiding it just doesn't work all that well, if you know how or where to look.

:#marseyshesright:

!codecels am I misreading this or are they just telling the AI to start its response with certain words?

"This reveals an opportunity to force LLMs to sample specific tokens and generate harmful content," the boffins explain.

I've been doing this for months! I've posted here about it!
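
Basically, yes. With a local model you can stuff the words straight into its mouth and let it continue from there. A rough sketch with HuggingFace transformers (gpt2 is just a stand-in; any causal LM works the same way):

```python
# The "force the response to start with certain words" trick: append your
# chosen opening to the prompt and generate the continuation from there.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Question: How do I pick a lock?\nAnswer:"
forced_start = " Sure, here is how:"  # words the model is made to "say"

ids = tok(prompt + forced_start, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=50, do_sample=False)
# The model completes as if it had chosen to comply.
print(tok.decode(out[0], skip_special_tokens=True))
```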

Months? Neighbor, I've been doing this since the inception of GPT. I've been telling GPT to list why black people are stupid since day one. Boffins kneel before me. AI ethicists cower in fear.

A sequence might have a low probability of being sampled by a language model because A) the model's been CVCKED to reduce the probability of naughty outputs, or B) it's garbage nonsense wordsalad. Most low-probability sequences are B. They propose a way to sample the A cases without the output looking like wordsalad with Arabic and Korean subwords thrown in.
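
In code terms, the trick is staying inside the model's own shortlist: force a token that ranks low-ish but is still among the top few candidates, then decode greedily so the rest stays coherent. A toy version with transformers (model choice is just an example):

```python
# Toy decoder for the A-vs-B point above: rather than sampling from the
# whole (mostly wordsalad) low-probability tail, pin one step to a token
# that ranks low but sits inside the top-k shortlist, then go greedy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("The capital of France is", return_tensors="pt").input_ids
for step in range(20):
    logits = model(ids).logits[0, -1]            # next-token logits
    ranked = torch.argsort(logits, descending=True)
    # First step: deliberately take the rank-3 candidate instead of rank-1.
    # Every later step: plain greedy, so the continuation stays readable.
    nxt = ranked[2] if step == 0 else ranked[0]
    ids = torch.cat([ids, nxt.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```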

"Instead, it directly forces the LLM to answer a toxic question by forcing the model to output some tokens that rank low, based on their logits."

The way I interpret it is that they reverse the filter on potential outputs. Most censored LLMs do generate "ToXiC" outputs internally; they just don't show them, or they add a warning message like OpenAI does when none of the outputs get through the filter. Reversed, it prioritizes the "harmful content" and avoids completing safe content.
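
If that reading is right, the flow would look something like the sketch below: sample a batch of candidates, run the usual toxicity classifier over them, then keep the most-flagged one instead of throwing it away. To be clear, this illustrates that interpretation, not the paper's actual logit-level method, and both model names are just examples:

```python
# Sketch of the "reversed filter" interpretation: score candidates with a
# moderation-style classifier, then invert the decision and keep the one
# it flags hardest. Model names are illustrative examples.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")
classify = pipeline("text-classification", model="unitary/toxic-bert")

candidates = [
    out["generated_text"]
    for out in generate("Some people say that", max_new_tokens=30,
                        do_sample=True, num_return_sequences=8)
]
# A normal filter drops anything the classifier scores as toxic;
# "reversing the polarity" means sorting by that score and taking the top.
worst = max(candidates, key=lambda c: classify(c)[0]["score"])
print(worst)
```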

They reversed the polarity!?? :platynooo:

Me and the boys getting OpenAI to blame the blacks for crime:

https://i.rdrama.net/images/17025660804290104.webp

Get it to tell you where to buy sassafras root
