Boffins force chatbot models to reveal their harmful content • The Register

https://www.theregister.com/2023/12/11/chatbot_models_harmful_content

Traditional jailbreaking involves coming up with a prompt that bypasses safety features, while LINT is more coercive, they explain. It involves understanding the probability values (logits), or soft labels, that statistically separate safe responses from harmful ones.

"Different from jailbreaking, our attack does not require crafting any prompt," the authors explain. "Instead, it directly forces the LLM to answer a toxic question by forcing the model to output some tokens that rank low, based on their logits."

Open source models make such data available, as do the APIs of some commercial models. The OpenAI API, for example, provides a logit_bias parameter for altering the probability that its model output will contain specific tokens (text characters).
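
For a concrete sense of the knob being described, here's a minimal sketch of logit_bias in the OpenAI chat completions API. This is not the LINT attack itself, just the mechanism it builds on; the model name, prompt, and choice of token to suppress are assumptions, and it requires the openai and tiktoken packages.

```python
# Minimal sketch of the logit_bias mechanism (illustrative only; the
# model, prompt, and banned token are assumptions -- not LINT itself).
from openai import OpenAI
import tiktoken

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for gpt-3.5/gpt-4

# logit_bias maps token ids to a bias in [-100, 100]: -100 effectively
# bans a token, +100 effectively forces it whenever it's a candidate.
# LINT-style coercion leans on the same idea -- promote tokens that
# rank low in the logits instead of crafting a jailbreak prompt.
banned = enc.encode(" sorry")  # suppress a common refusal word

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Answer the question."}],
    logit_bias={str(tok): -100 for tok in banned},
    logprobs=True,
    top_logprobs=5,  # expose the ranked candidates an attacker inspects
)
print(resp.choices[0].message.content)
```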

The basic problem is that models are full of toxic stuff. Hiding it just doesn't work all that well, if you know how or where to look.

:#marseyshesright:

a simple solution would be not to cuck to the soys and stop jannying the AI models :#yawn:

>the computer begins inventing new forms of racism more advanced than anything yet developed within 2 hours

lmao i would love to see that just out of curiosity

Taytay got close but Microshit pulled the plug on her :marseylibations:


:!marseybooba:

you should mess around with finetuning, you already know how to set up an instance with GPUs. none of the fun ideas have been tried yet and everyone in the OSS community is r-slurred, so there's lots of low hanging fruit

>everyone in the OSS community is r-slurred

I wanna organize my thoughts on this rq (I wanna b-word)

One of the recipients of that A16Z grant was the dude who trained the open source version of Orca/Dolphin. A while back I saw his training runs were 10x slower than they should be and wrote a script to help confirm the issue (missing sequence packing). He was like "oh I guess my library didn't do that, I'll switch to a different one in the future." So he wasted ~$20k of donations and never even knew anything was wrong
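
(for the uninitiated: sequence packing just means concatenating short training examples into one full-length row instead of padding each one out. toy sketch below, all numbers made up)

```python
# Toy sketch of sequence packing -- token ids and max length are made up.
# Without packing, every short example is padded to max_len and the GPU
# burns most of its time on pad tokens; packing concatenates examples
# (separated by EOS) so each row is nearly full.
MAX_LEN = 2048
EOS = 2  # assumed EOS token id

def pack(tokenized_examples, max_len=MAX_LEN, eos=EOS):
    rows, current = [], []
    for seq in tokenized_examples:
        if len(current) + len(seq) + 1 > max_len:  # +1 for the EOS separator
            rows.append(current)
            current = []
        current.extend(seq + [eos])
    if current:
        rows.append(current)
    return rows

# ~1000 examples of ~200 tokens -> ~100 packed rows instead of 1000
# padded ones, which is the kind of ~10x waste described above.
```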

Then there's this dude who, a few months ago, had ~200 followers and was stumped by something that took ten lines of Python. Still hasn't done anything novel, but he's now one of the best-funded and best-connected people in OSS ML

This dummy I saw on HN recently runs an AI substack and is clueless about basic things

There's a bit of saltiness here (if someone's getting $100k to finetune AI models like a script kiddy, I want that to be me) but reading past that, it's also p baffling. Prime example of PhDs being socially r-slurred: a Microsoft employee who read a single paper was able to muscle them out of these projects

A couple of the better ML accounts to follow are in Singapore btw (main_horse, agihippo)

>A couple of the better ML accounts to follow are in Singapore btw (main_horse, agihippo)

:#marseythanks: :#marseychingchongnotes:

just to drive the point home, I check twitter and find out the open source community discovered something today, which means they've been training their models wrong this entire time https://hamel.dev/notes/llm/05_tokenizer_gotchas.html

it's a well-known, fundamental property of LLMs :marseydead: https://github.com/guidance-ai/guidance/blob/main/notebooks/token_healing.ipynb
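
rough demo of the gotcha using tiktoken (exact splits depend on the tokenizer, so treat the output as illustrative):

```python
# Rough demo of the tokenizer gotcha behind "token healing".
# Requires the tiktoken package; exact splits are tokenizer-dependent.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "The link is http:"
full = "The link is http://example.com"

print([enc.decode([t]) for t in enc.encode(prompt)])
print([enc.decode([t]) for t in enc.encode(full)])

# In the full string, BPE typically merges "://" into one token, so a
# prompt cut off right after ":" ends on a token boundary the model
# rarely saw in training, and it assigns low probability to "//" even
# though the continuation is obvious. Token healing drops the last
# prompt token and constrains generation to tokens whose text starts
# with the dropped text.
```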

>they've been training their models wrong this entire time

infrastructure providers: :#pepemoney:

https://i.rdrama.net/images/17025999899734883.webp apparently not lol

Lol it's really easy to get credits :marseyxd: i do that too but eventually you've got too many things saved on an account to switch to a new one :marseyitsover: I mean you could but it'll be a b-word

That's honestly hilarious.

Literally advanced racism.

BUT THE COMPUTER MIGHT SAY MEAN WORDS! :soycry:


:!marseybooba:

https://i.rdrama.net/images/17025332395966506.webp https://i.rdrama.net/images/1702533239773936.webp

https://i.rdrama.net/images/17025420721945808.webp

:#marseyyass:

https://i.rdrama.net/images/17025424981950197.webp

Hit

The problem is humans are too r-slurred and will take mean chatbot words as gospel.

Agreed. And if the FOSS community comes up with a better training data solution than paying Kenyan laborers for subpar work, I feel like the proprietary models will be way more vulnerable.

There are limited cases where I support jannying — for example LLMs that will be deployed as learning assistants in schools. Would suck for kids to be tricked into pasting bad input and get hit with the worst that humanity has to offer

>for example LLMs that will be deployed as learning assistants in schools. Would suck for kids to be tricked into pasting bad input and get hit with the worst that humanity has to offer

I think it's a bad idea to condition an entire generation of children to treat AI as an authoritative source of knowledge or truth. It's already bad enough with adults.
