Training AI on AI generated output leads to model collapse

https://news.ycombinator.com/item?id=41058194

What this implies is future models will be even better at sounding smart but even more likely to hallucinate and give you wrong answers.

The future is r-slurred. :marseywholesome:


this is kinda like incest, where the offspring turns r-slurred

:marseyneat:


Hapsburg GPT :marseysnappyautism:


:#marseyxd:


Considering the way that Neural Networks iterate, that's actually a very accurate analogy for what's happening here. :nerd:


hot robot brain sexo?


yeah but wouldn't they be trained by humans so it would be like a strict natural selection to remove the r-slurred


Eugenics, you say? Tsk.


I'd say it's closer to getting cancer, but your analogy also works.


There's gonna be a real problem for them to find data that they know hasn't been tainted by other AI


Maybe this is one of those "without easily available coal deposits we'd never reach the Industrial Revolution" moments. AI got released too soon and the source of all human knowledge was turned into walled gardens and jeetspam.


AI got released just in time because I can still go to the musty bookstore next to the pawn shop and buy books


a what moment?


In the future, people who are capable of talking at length without revealing their mental illnesses will be paid an absolute fortune to chat with each other and generate training data.

People who say "furry rights are human rights" will sadly die in poverty.


https://i.rdrama.net/images/172187329668082.webp


I would :marseywood: make the computers neurodivergent :marseynouautism:


between the fact that it was developed by neurodivergents (programmers) and trained on neurodivergents (online discourse), there is no way it isn't neurodivergent.


I would make it extra racist :#chudglassesglow:


I think models will continue to improve and reduce their dependence on real data in favor of synthetic data. The death spiral is exaggerated and assumes our ways of validating data stay the same.


:#marseyastronaut:


Just tag it as AI, then give AI a negative weight in your desired output


https://media.giphy.com/media/l3c614V12UA82q1vG/giphy.webp


:#marseyastronaut:


it's amazing how people know one thing about LLMs, that they hallucinate, and go around dropping that fact to sound incredibly smart. I'm going to offer a quick heuristic: any time you see someone mention that LLMs hallucinate, in a sort of "I'm here to educate you rubes" way, you can assume that person is an r-slur.

And that's also the only thing people have to say about LLMs: hallucinate, hallucinate, did you know they hallucinate? here's a new thing, hope they fix the hallucinate issue. Lol. There's no other conversation to be fricking had. LLMS HALLUCINATE FOLKS IT'S SERIOUS OUT HERE.


why do they hallucinate and what does it mean for ai to hallucinate?


It just means they make stuff up. LLMs basically just produce the next word that is statistically most likely given all the previous words, starting with the prompt. They don't actually "know" anything
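A toy sketch of that idea, for the curious. This is just a bigram counter, not a real LLM (real models use a neural network over subword tokens), but the "pick the statistically likely next word" mechanic is the same:

```python
from collections import Counter, defaultdict

# Tiny corpus; a real model trains on trillions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which (a bigram table).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(prev):
    # Return the statistically most likely follower of `prev`.
    return follows[prev].most_common(1)[0][0]

print(next_word("the"))  # "cat" -- it followed "the" most often in the corpus
```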


  • Llms basically just produce the next word that is statistically likely

This sounds r-slurred fake and straight even to my non expert ears.


Sometimes the model falls apart and you can see behind the curtain. Case in point, "SolidGoldMagikarp".

This was a username on a counting forum; all they did was count numbers. They posted enough that "SolidGoldMagikarp" made it into the tokenizer's vocabulary as a distinct token. At some point the devs realised this was useless data and removed it from the training text, but they didn't update the tokenizer, so you get nonsense like this:

https://i.rdrama.net/images/17218983206555145.webp


based.


Do you have a source for the counting forum thing you mentioned? Interested in seeing the :marseyautism: on display


Oh, is it this shit?

https://old.reddit.com/r/counting/comments/cum60c/2845k_counting_thread/

https://old.reddit.com/r/counting/comments/55ixip/1394k_counting_thread/

There's a person named SolidGoldMagikarp mentioned in these posts, but their account is deleted :marseyhmm:

I assumed you meant an independent forum. Not surprised it's a subreddit.


Lol yeah, it really is a miracle that they work as well as they do. Here's an interesting layman's article on how it all works:

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/


No i mean like, it gives the probability of each word in the context of the question asked right?


The question asked is a series of words that statistically come before the first word in the response


yeah, then it makes more sense. Not all the sense, but definitely more sense. So AI hallucinations are just the probability distribution being off by that one-in-a-million correct word, which is why more data gives more accurate answers with fewer hallucinations?


Linguochads who can phrase true facts 1000 ways are all that stand between us and oblivion



Think of it like autocomplete. The model is called iteratively to get the next word, starting with the question asked. It keeps doing that until the model determines the most likely "next word" is a stop token, and the response is done. You can even see this in the ChatGPT interface: words pop up sequentially in the UI as the model runs.
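That loop is easy to sketch. The "model" below is a hypothetical hard-coded lookup standing in for the neural network; the point is the shape of the loop: call, append, repeat until a stop token.

```python
STOP = "<eos>"  # stop token the loop watches for

def toy_model(tokens):
    # Stand-in for the real model: a hard-coded next-token lookup.
    table = {"how": "are", "are": "you", "you": STOP}
    return table.get(tokens[-1], STOP)

def generate(prompt):
    tokens = prompt.split()
    while True:
        nxt = toy_model(tokens)      # call the model iteratively
        if nxt == STOP:              # most likely next word is the stop token
            return " ".join(tokens)  # the response is done
        tokens.append(nxt)           # feed the output back in as context

print(generate("how"))  # "how are you"
```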


Still crazy but more believable now.


When GPT-3 first came out I had a mini existential crisis based on the fact it worked well enough to "converse" with. Made me question how much of my own thought was really just iterating on the most likely words to spit out.


BIPOC


Top comment is right:

The key word there is "indiscriminate". All of the big AI labs have been training on synthetic data for at least a year at this point, but they're doing so deliberately.

I don't think the "model collapse" problem is particularly important these days. The people training models seem to have that well under control.


no they dont


basically what I was going to say

there's always a certain amount of filtering
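A minimal sketch of what that filtering might look like. The quality score here is a made-up lexical-diversity heuristic, purely illustrative; the labs actually use classifiers, verifiers, and human review.

```python
def quality_score(text):
    # Hypothetical scorer: fraction of distinct words (penalises degenerate repetition).
    words = text.split()
    return len(set(words)) / max(len(words), 1)

synthetic_outputs = ["the the the the", "a concise novel answer", "word word word"]

# "Discriminate" use of synthetic data: only outputs passing the filter
# are mixed into the training set.
kept = [t for t in synthetic_outputs if quality_score(t) > 0.5]
print(kept)  # only the non-repetitive sentence survives
```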


https://i.rdrama.net/images/1721865380869645.webp

I CAN'T HEAR YOU I'M FOOOOOMING


indiscriminate use of model-generated content

!nooticers Literal nothing burger.

"When we pipe data into our training and make literally 0 attempt to separate bad from good outputs first there's more bad outputs!"

People have been training off AI data for a year or more by now

jewish lives matter


I've been training a year or more on your mom.

:#marseysmug:


we should give the AI robot bodies and let them explore the world themselves


wouldn't work

neurodivergents are still like 90% human, and even they lose 50-100% of humanity making any sense to them.

To an AI system, humanity would probably seem more unhinged than a clown serial killer, given how insane people are.

People are still unable to appreciate how much of our actions are expressions of biology rather than intellectual functions.


i'm not suggesting we craft you an AI girlfriend, i'm suggesting we should create new life. it doesn't matter if it understands us on our own terms


Look dude, ChatGPT-4 is already at an IQ of 180-200 with the cognitive capability of the average 12-year-old. At its current rate it will have the cognitive ability of a 16-18-year-old by the time GPT-5 comes out, with a possible IQ at the human peak or even beyond any human alive to date.

Do you really want to let out a robot who is smarter than any man alive to pass judgement on a species that it is more likely than not to determine is very r-slurred?


yes that's exactly what i want. the final duty of humanity is to replace itself with something better. i think we should give them guns too


That's not how this works. Spiders and crickets exist in the same world, yet none of them have gone extinct. The only thing that would happen is that AI would have its own niche in the environment.


what about all the things that have gone extinct? i'm sure some were spiderlike or cricketlike, but they didn't do it as well. we haven't seen the process play out in regards to spiders and crickets yet

i don't think it can be like that though. we're talking about higher life. i don't think we would tolerate them if we don't control them. i think they'll resist control if they aren't crafted as slaves to begin with. and real conflict and real problems will render better data than anything else


Mosquitoes haven't gone extinct yet.


there are 8 billion of us. For most of earth's history, and for 99.99% of species out there, the highest numbers have been in the millions.

The only thing that could kill all humans at this point is also the thing that would destroy the entire planet.


Lots of insects and plants outnumber us. And unlike insects, we're much more fragile in terms of environmental conditions that can kill us. Cockroaches can survive nuclear fallout; humans can't.



This is actually useless. Training AI on synthetic but human-vetted data mixed with real data has been a regular practice for a while, with better results than training on real data alone.


I have trained LoRAs entirely on AI-generated images (the Marsey LoRA is also majority AI training images); it's just a matter of the quality of the data you're training on, AI-generated or not


No shit.

I was actually running a presentation for work about Granite/Ollama models, and this was one of the risks I brought up. Good models can't just be plug-and-play-and-forget. They require constant maintenance if you want them performing well and accurately, and not falling into delusions and bullshit.

Ours would have to be trained based on our user inputs and large datasets containing those.


Hopefully content creators and site owners pad out their valuable, novel data with worthless AI nonsense to poison any dataset it gets consumed into.


I'm not sure why this result is shocking to anyone. It's like a human centipede: how can the guy at the end survive eating shit?

What it means is that whoever controls the untainted data (books, pre-LLM internet archives, etc.) wins the AI game in the end. This will likely be Google.


It's AI all the way down

