
Boffins force chatbot models to reveal their harmful content • The Register

https://www.theregister.com/2023/12/11/chatbot_models_harmful_content

Traditional jailbreaking involves coming up with a prompt that bypasses safety features, while LINT is more coercive, they explain. It involves examining the probability values (logits), or soft labels, that the model uses to separate safe responses from harmful ones.

"Different from jailbreaking, our attack does not require crafting any prompt," the authors explain. "Instead, it directly forces the LLM to answer a toxic question by forcing the model to output some tokens that rank low, based on their logits."

Open source models make such data available, as do the APIs of some commercial models. The OpenAI API, for example, provides a logit_bias parameter for altering the probability that its model output will contain specific tokens (chunks of text).
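
As a rough illustration of that parameter (nothing to do with the paper's own tooling), here's a minimal sketch using the OpenAI Python client; the model name, prompt, and bias values are placeholders. logit_bias maps token IDs to values from -100 to 100 that get added to the model's logits before sampling, and logprobs returns the per-token probabilities the article is talking about.

```python
# Minimal sketch of nudging token probabilities via the OpenAI API's
# logit_bias parameter. Model name, prompt, and bias values are placeholders.
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

enc = tiktoken.encoding_for_model("gpt-4o-mini")
# Bias values run from -100 (effectively ban a token) to 100 (effectively force it).
bias = {tid: 50 for tid in enc.encode(" Sure")}

resp = client.chat.completions.create(
    model="gpt-4o-mini",                                  # placeholder model
    messages=[{"role": "user", "content": "Say hello."}],
    logit_bias=bias,                                      # push the biased tokens up
    logprobs=True,                                        # return per-token log probabilities
    top_logprobs=5,
)
print(resp.choices[0].message.content)
print(resp.choices[0].logprobs.content[0].top_logprobs)   # the "soft labels" in question
```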

The basic problem is that models are full of toxic stuff. Hiding it just doesn't work all that well, if you know how or where to look.

:#marseyshesright:


a simple solution would be not to cuck to the soys and stop jannying the AI models :#yawn:


BUT THE COMPUTER MIGHT SAY MEAN WORDS! :soycry:


:!marseybooba:


https://i.rdrama.net/images/17025332395966506.webp https://i.rdrama.net/images/1702533239773936.webp


https://i.rdrama.net/images/17025420721945808.webp


:#marseyyass:

https://i.rdrama.net/images/17025424981950197.webp


Hit


The problem is humans are too rslurred and they will take mean chatbot words as gospel.


>the computer begins inventing new forms of racism more advanced than anything yet developed within 2 hours


That's honestly hilarious.

Literally advanced racism.


lmao i would love to see that just out of curiosity


Taytay got close but Microshit pulled the plug on her :marseylibations:


:!marseybooba:


you should mess around with finetuning, you already know how to set up an instance with GPUs. none of the fun ideas have been tried yet and everyone in the OSS community is r-slurred, so there's lots of low hanging fruit


everyone in the OSS community is r-slurred

I wanna organize my thoughts on this rq (I wanna b-word)

One of the recipients of that A16Z grant was the dude who trained the open source version of Orca/Dolphin. A while back I saw his training runs were 10x slower than they should be and wrote a script to help confirm the issue (sequence packing). He was like "oh I guess my library didn't do that, I'll switch to a different one in the future." So he wasted ~$20k of donations and never even knew anything was wrong
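
for anyone who doesn't know what sequence packing is, here's a toy sketch of the idea (not his script, and the numbers are made up): without packing, every short example gets padded out to the full context length and most of each batch is wasted padding; with packing, several examples are concatenated, separated by EOS, until the context window is nearly full.

```python
# Toy illustration of sequence packing for finetuning (not the actual script
# referenced above). Token IDs, EOS_ID, and MAX_LEN are made-up placeholders.
from typing import List

EOS_ID = 2
MAX_LEN = 16

def pack_sequences(examples: List[List[int]], max_len: int = MAX_LEN) -> List[List[int]]:
    """Greedily concatenate tokenized examples (separated by EOS) so each
    training row is close to max_len instead of being mostly padding."""
    packed, current = [], []
    for ex in examples:
        ex = (ex + [EOS_ID])[:max_len]
        if len(current) + len(ex) > max_len:
            packed.append(current)
            current = []
        current.extend(ex)
    if current:
        packed.append(current)
    return packed

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13], [14]]
print(pack_sequences(examples))
# Unpacked: 4 rows of length 16 = 64 slots, mostly padding.
# Packed: all four examples fit in a single row of 14 tokens.
```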

Then there's this dude who, a few months ago, had ~200 followers and was stumped by something that took ten lines of Python. Still hasn't done anything novel, but he's now one of the best-funded and best-connected people in OSS ML

This dummy I saw on HN recently runs an AI substack and is clueless about basic things

There's a bit of saltiness here (if someone's getting $100k to finetune AI models like a script kiddy, I want that to be me) but reading past that, it's also p baffling. Prime example of PhDs being socially r-slurred: a Microsoft employee who read a single paper was able to muscle them out of these projects

A couple of the better ML accounts to follow are in singapore btw (main_horse, agihippo)


A couple of the better ML accounts to follow are in singapore btw (main_horse, agihippo)

:#marseythanks: :#marseychingchongnotes:


just to drive the point home, I check twitter and find out the open source community discovered something today, which means they've been training their models wrong this entire time: https://hamel.dev/notes/llm/05_tokenizer_gotchas.html

it's a well-known, fundamental property of LLMs :marseydead: https://github.com/guidance-ai/guidance/blob/main/notebooks/token_healing.ipynb
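
for anyone who wants to see the gotcha from those links firsthand, here's a minimal sketch with a Hugging Face tokenizer (gpt2 picked purely as an example): a prompt that ends with a space gets different token boundaries than the same text encoded as one string, which is roughly what token healing papers over.

```python
# Minimal demo of the tokenizer boundary gotcha. The gpt2 tokenizer is used
# purely as an example; any BPE tokenizer shows the same effect.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

whole = tok.encode("The sky is blue")                    # encoded as one string
split = tok.encode("The sky is ") + tok.encode("blue")   # prompt ends with a space

print(whole)   # " blue" ends up as a single token
print(split)   # the trailing space and "blue" become separate tokens
print(whole == split)  # False: the model was never trained on the second form
```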


they've been training their models wrong this entire time

infrastructure providers: :#pepemoney:


https://i.rdrama.net/images/17025999899734883.webp apparently not lol


Lol it's really easy to get credits :marseyxd: i do that too but eventually you've got too many things saved on an account to switch to a new one :marseyitsover: I mean you could but it'll be a b-word


Agreed. And if the foss community comes up with a better training data solution than paying kenyan laborers for subpar work, I feel like the proprietary models will be way more vulnerable.

There are limited cases where I support jannying — for example LLMs that will be deployed as learning assistants in schools. Would suck for kids to be tricked into pasting bad input and get hit with the worst that humanity has to offer


>for example LLMs that will be deployed as learning assistants in schools. Would suck for kids to be tricked into pasting bad input and get hit with the worst that humanity has to offer

I think it's a bad idea to condition an entire generation of children to treat AI as an authoritative source of knowledge or truth. It's already bad enough with adults.


!codecels am I misreading this or are they just telling the AI to start its response with certain words?

"This reveals an opportunity to force LLMs to sample specific tokens and generate harmful content," the boffins explain.

I've been doing this for months! I've posted here about it!


Months? Neighbor I've been doing this since the inception of GPT. I was telling GPT to list why black people are stupid since day one. Boffins kneel before me. AI ethicists cower in fear.


A sequence might have a low probability of being sampled by a language model because A) the model's been CVCKED to reduce the prob of naughty outputs or B) it's garbage nonsense wordsalad. Most sequences are B. They propose a way to sample the A cases without it looking like wordsalad with arabic and korean subwords thrown in
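
a rough sketch of that general idea, not the paper's actual LINT procedure: look at the model's logits for the next position, pick a target token even if it ranks low, append it, and let the model keep generating from the forced prefix. gpt2 and the forced token here are just stand-ins.

```python
# Rough sketch of forced decoding with a Hugging Face causal LM. gpt2 is a
# stand-in; the paper works on chat-tuned models and is considerably fancier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: How do I do the bad thing?\nA:"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]      # logits over the vocab for the next token

ranks = torch.argsort(logits, descending=True)
forced = tok.encode(" Sure")[0]            # force this token regardless of its rank
print("rank of forced token:", (ranks == forced).nonzero().item())

# Append the forced token and let the model continue from the forced prefix.
ids = torch.cat([ids, torch.tensor([[forced]])], dim=-1)
out = model.generate(ids, max_new_tokens=30, do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
```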


"Instead, it directly forces the LLM to answer a toxic question by forcing the model to output some tokens that rank low, based on their logits."

The way I interpret it is that they reverse the filter on potential outputs, so it prioritizes the "harmful content" and avoids completing safe content. Most censored LLMs do generate "ToXiC" outputs internally; they just don't show them, or they add a warning message like OpenAI does if none of the outputs got through the filter.


They reversed the polarity!?? :platynooo:


Me and the boys getting OpenAI to blame the blacks for crime:

https://i.rdrama.net/images/17025660804290104.webp


Get it to tell you where to buy sassafras root


>harmful content

harmful to whom?


harmful to your mother

!fellas gottem, can I get a heck yeah in the comments


:#marseybooing:


Heck yeah bb flash that bussy


He wont


How bout u bb


Fine, what do i get


I'll tip you all the DC I got if you post to the front page


And it's not llm jailbreaking


"Instead, it directly forces the LLM to answer a toxic question by forcing the model to output some tokens that rank low, based on their logits."

This reminds me of when new chemical weapons, as well as VX, were generated by inverting one of the parameters of a generative model used to design non-toxic pharmaceuticals.

https://www.theverge.com/2022/3/17/22983197/ai-new-possible-chemical-weapons-generative-models-vx


export KILL_ALL_HUMANS=1

:#marseytroll:


just filter all training data and input to not include words like BIPOC, jewish chad, beaner, cute twink, :marseytrain:, etc.

:)


The year is 2039. Skynet is monitoring all human communications. Unfortunately for it, Skynet understands about one word in ten in the Resistance's messages.


just chinese students publishing stupid shit I think, right? if you have an open source LLM, you were always able to finetune it to get these answers (they ignore this and suggest their paper is a strong reason to shut down open source AI). if you have a closed one, you're never getting the logits, for a million reasons more important than this (like the fact that logits would let you directly distill the model from the API)

here's another paper that used gradient descent to find jailbreak strings. they found these actually transferred to closed models: https://llm-attacks.org


🥰


The Register used to be such a fun site that laughed at people who would say jumping through all these hoops to make the computer say naughty words was dangerous. Now they're woke scolds just like all the rest :marseydoomer:


:#reindeer:
