Generative AI Goes 'MAD' When Trained on AI-Created Data Over Five Times

RGB@group.lt · 2 years ago

macrocephalic@lemmy.fmhy.ml · 2 years ago

This is hardly surprising. It’s immediately noticeable in images, but we’ll have to be very careful with other forms of output, as the decline could be subtle enough to go unnoticed at first. There’s a very real risk of poisoning our sources of data by allowing AI to write back to them without oversight. And given that the sources of data seem to be things like Reddit and Twitter, this is a real concern.

kromem@lemmy.world · edit-2 2 years ago

Not really.

The problem is that the output regresses toward the mean, which squeezes out the tails of the distribution curve.

Merely mixing AI-generated and human-generated data in what’s fed back in would avoid this outcome. Arguably, as long as the AI output was generally rated as better than the mean human output, recursive mixed training iterations could even improve on the original models.

This paper and the Stanford one ran into problems because they trained exclusively on new AI-generated output over and over, which increased median tokens or diffusions and dropped the edges until you ended up with overfitted output and lackluster discrimination.

The real takeaway here isn’t “oh noz, we can’t feed AI output back into AI training” but rather “humans in both generator and discriminator roles will be critical in future AI training.”

There’s been a recent troubling trend of binaryisms in the ML field as hype and attention has increased, and it’s important to be careful not to improperly extrapolate a finding of a narrow scope to an overly broad interpretation.

So yes, don’t go training recursively on only synthetic data over and over. But even something as simple as using humans upvoting or downvoting the generations to decide if you feed them back in or don’t (i.e. human discriminator and AI generator) would largely avoid the outcomes here.

Which means that human selection of the ‘best’ output from several samples for initial sharing and human rating of shared outputs for broader distribution is already cleaning up AI generations online enough that fears of ‘poisoning’ the data as suggested here and in the Stanford study are almost certainly overblown.

Edit: Section 5 of the paper even addresses some of this:

One might suspect that a complimentary perspective to the previous observation—that fresh new data mitigates the MAD generative process—is that synthetic data hurts a fresh data loop generative process. However, the truth appears to be more nuanced. What we find instead is that when we mix synthetic data trained on previous generations and fresh new data, there is a regime where modest amounts of synthetic data actually boost performance, but when synthetic data exceeds some critical threshold, the models suffer.
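
To make the difference between a fully synthetic loop and a mixed loop concrete, here is a toy sketch. It is not the paper's setup: the "model" is just a Gaussian fit, and the 1.5 sigma cutoff is an arbitrary stand-in for a generator that favours typical outputs. The names `run_loop`, `fresh_fraction` and `cutoff` are made up for illustration.

```python
import random
import statistics


def run_loop(generations, fresh_fraction, n=2000, cutoff=1.5, seed=0):
    """Return the training-set standard deviation seen at each generation."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "human" data: N(0, 1)
    spreads = []
    for _ in range(generations):
        # "Training": fit a Gaussian to the current training set.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)
        # "Generation": sample from the fit, but keep only typical outputs
        # (assumption: the generator under-represents the tails).
        synthetic = []
        while len(synthetic) < n:
            x = rng.gauss(mu, sigma)
            if abs(x - mu) <= cutoff * sigma:
                synthetic.append(x)
        # Next training set: synthetic data plus a share of fresh human data.
        n_fresh = int(fresh_fraction * n)
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic[: n - n_fresh] + fresh
    return spreads


if __name__ == "__main__":
    for label, frac in [("synthetic only", 0.0), ("half fresh", 0.5)]:
        spreads = run_loop(generations=6, fresh_fraction=frac)
        print(label, [round(s, 2) for s in spreads])
```

In the synthetic-only run the spread should shrink every round, because each generation trims the tails of the one before it. With half the data refreshed each round, the spread should level off instead of collapsing. It says nothing about the regime where a little synthetic data helps; it only shows why the edges vanish when nothing fresh comes in.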

dystop@lemmy.world · 2 years ago

The best term I’ve heard to describe this is “Hapsburg AI”.

ZILtoid1991@kbin.social · 2 years ago

Someone should make a dataset like that called “Habsburg Diffusion” or something like that.

e033x@lemm.ee · 2 years ago

Hah, that is brilliant, and I am stealing it.

dystop@lemmy.world · 2 years ago

I stole it from another random lemmy user too (lemming? Lemur? Lemmyer?)

kromem@lemmy.world · edit-2 2 years ago

I think people may be confused about what this is saying, so an example might help.

Remember when Stable Diffusion first came out and you could spot AI generated images as if they killed your father and should be prepared to die?

Had those six+ digit monstrosities been fed back into training the model, you’d have quickly ended up with models generating images with even worse hands from hell.

But looking at a study like this and worrying about AI generated content poisoning the Internet for future training is probably overblown.

Because AI content doesn’t just find its way onto the web directly the way it is in this (and the Stanford) study. Often a human is selecting from multiple outputs to decide what to post, or even if it is directly posted, humans are voting content up or down based on perceived quality.

So practically, if models were being trained recursively on popular content online that had been generated by AI, it wouldn’t be content that overfits spider hands or striped faces or misshapen pupils or repetitive text or broken links or any other number of issues generative AI currently has.

Because of the expense of human review of generated content, this and the previous paper aren’t replicating the circumstances that real world recursive training on a mixed human and AI Internet would represent, and the issues which arose will likely be significantly muted in real world circumstances outside the lab.

TL;DR: Humans filtering out six fingered outputs (and similar) as time goes on is a critical piece of the practical infrastructure which isn’t being represented, and this is simply a cautionary tale against directly piping too much synthetic data back into training.

SGG@lemmy.world · 2 years ago

While I don’t claim to understand how AI functions, this makes sense. Think along the lines of making a copy of a copy of a copy, etc, using a photocopier instead of copying a file. Because they are reinterpreting the works every time, more and more errors accumulate in the results. This may be because there’s a difference between recognising and understanding.

kromem@lemmy.world · 2 years ago

Kind of. It’s more complicated (for example, section 5.3 of the paper discusses how a little bit of AI generated data mixed with new human data actually improved outputs over only human data).

Under the hood it has to do with sample diversity. The more apt comparison than Xerox (where it’s lowering quality because of necessary fidelity loss) is genetic reproduction.

Even if you have great genes, after a few generations of sex with siblings you’re going to end up with messed up kids.

But if you have great genes, a small degree of over-representation of your genes in a larger mixed gene pool would be better than only new random genes.

This is basically saying that AI models shouldn’t have incest levels of recursion, more so than it is saying that they shouldn’t have ANY recursive data (which would be the case if it worked like a Xerox).

SGG@lemmy.world · 2 years ago

So, we want to avoid AI Kansas? Fair enough

kromem@lemmy.world · 2 years ago

In most cases. For a banjo playing AI, this might be desirable though.

SGG@lemmy.world · 2 years ago

Holy crap.

Never ending dueling banjos.

We need this to happen.

Ryantific_theory@lemmy.world · 2 years ago

I can’t believe my second comment on Lemmy is gonna be about incest.

If you only have great genes, multiple generations of sister-wives will produce children with those exact same great genes. The problem with incest is that if you carry alleles for recessive disorders (and most people do), inbreeding makes it more and more likely that two copies of the recessive gene will be inherited and expressed since family members generally carry the same recessive genes. That’s why banging strangers is generally a good idea, since they usually carry a different set of recessive disorders than you do.

If there were a brother and sister (or any pairing) with a pristine genetic code, then as long as they remained inbred the first birth defect or genetic disorder to affect their family line will be a completely novel random mutation that formed as a result of pure time and chance over dozens or hundreds of generations. It’s also why inbreeding is a standard tool for animal and plant husbandry.

kromem@lemmy.world · 2 years ago

This is effectively the same issue as what’s going on in the paper and why I used it as an analogy.

Much like how maladaptive genes can piggyback on good genes, but then become overrepresented in an endogenous sample pool, small errors in the diffusion model end up exacerbated through subsequent generations without enough difference in ‘genes.’

There’s definitely good ‘genes’ in the diffusion model, but it’s not the frequency or abundance of the good genes that’s at issue, but the frequency of maladaptive traits in subsequent generations. Much like the issues with human reproduction.

Ryantific_theory@lemmy.world · 2 years ago

Right, but the primary difference is that the AI is both creating errors and magnifying them in a horrifying Cronenberg feedback loop, where incest doesn’t actually introduce errors.

That said, there’s a known phenomenon called inbreeding depression, where fitness is reduced as a result of repeated inbreeding. However, it can also result in purifying selection that removes deleterious genes and recessive alleles that are unmasked by the inbreeding, which actually increases fitness. If they could adapt some sort of testing algorithm to prevent rampancy, maybe they could “breed” diffusion algorithms or just curtail the outputs of the current ones.

Though there’d probably be some strange feedback loops if it was set up as two adversarial models where one is trained to slap down weird outputs and the other is trained to adapt to rejected outputs.

kromem@lemmy.world · 2 years ago

Well, the ideal would probably be to train a discriminator based on human ratings of generated outputs.

Take generation 0 (G0), produce output which is accepted or rejected based on humans, train a discriminator to predict those ratings off output, and then use the combined accepted outputs from humans and trained discriminator to train G1.

Repeat again for G1, G2, G3, etc.

My guess would be that the end result would continue to get better and better rather than worse.

The problem is if the diffusion model can’t properly reject weird hands or pupils, those magnify in subsequent rounds.

But there are likely adaptive and maladaptive tendencies in the diffusion model, and adding a halfway decent filter between human selection and synthetic selection of outputs, separate from the diffusion model itself, would effectively curb the magnification here.
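
A rough sketch of that G0, G1, G2 loop, purely as a thought experiment. Everything here is a hypothetical stand-in: each output is reduced to a single "defect" number, `human_accepts` plays the human rater, the discriminator is just a learned cutoff, and "training" the next generator means refitting a Gaussian to whatever survived the filter.

```python
import random
import statistics


def human_accepts(defect):
    # Hypothetical human rater: rejects anything with a visible defect.
    return abs(defect) < 1.0


def train_discriminator(rated):
    # Learn a cutoff on |defect| halfway between the worst accepted
    # and the mildest rejected example in the human-rated subset.
    accepted = [abs(d) for d, ok in rated if ok]
    rejected = [abs(d) for d, ok in rated if not ok]
    if not rejected:
        return float("inf")
    if not accepted:
        return 0.0
    return (max(accepted) + min(rejected)) / 2


def run_generations(generations, n=5000, n_rated=200, seed=0):
    """Return the fraction of outputs kept for training at each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 2.0  # G0: defects are common (assumed starting point)
    kept_fractions = []
    for _ in range(generations):
        outputs = [rng.gauss(mu, sigma) for _ in range(n)]
        # Humans rate only a small, affordable subset.
        rated = [(d, human_accepts(d)) for d in outputs[:n_rated]]
        cutoff = train_discriminator(rated)
        # Training set for the next generation: human-accepted outputs
        # plus whatever the discriminator lets through from the rest.
        kept = [d for d, ok in rated if ok]
        kept += [d for d in outputs[n_rated:] if abs(d) < cutoff]
        kept_fractions.append(len(kept) / n)
        # "Train" the next generator on the kept outputs.
        mu, sigma = statistics.fmean(kept), statistics.pstdev(kept)
    return kept_fractions


if __name__ == "__main__":
    print([round(f, 2) for f in run_generations(4)])
```

The kept fraction should climb from generation to generation, since each generator is fit only to outputs that passed the filter. The toy only tracks defects, not diversity, so it can't tell you whether a real model would also lose variety along the way.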

Ryantific_theory@lemmy.world · 2 years ago

It seems like a simple enough fix, though also setting a weird precedent. Instead of directly fixing things, just keep adding layers of machine learning to produce improved outputs.

The future of AI isn’t spaghetti code, but spaghetti AI chains lol. Probably why people much smarter than me are the ones working on machine learning.

bionicjoey@lemmy.ca · 2 years ago

MAD = Model Autophagy Disorder

Autophagy = to eat oneself

FaceDeer@kbin.social · 2 years ago

From the article:

Knowing means that the search for a watermark that identifies AI-generated content (and that’s infallible) has now become a much more important - and lucrative - endeavor, and that the responsibility for labeling AI-generated data has now become a much more serious requirement.

Simply wanting such a thing to exist isn’t going to magically make it happen. I seriously doubt that any such “watermark” (I think they meant “fingerprint” since it’d need to work even if not deliberately added) can be found.

I suspect the actual solution is to curate the quality of the input data, regardless of whether it’s AI-generated or not. The problem of autophagy is the loss of rare inputs, so try to ensure those inputs are found and included in the input data. It’s probably fine to have some AI generated content in the training data in addition to the real stuff. Indeed, as long as the AI-generated content is subject to the same sort of selective pressure as the real content it’s probably good to have.

queermunist she/her@lemmy.ml · 2 years ago

We’d need to test and see if AI-generated content that is curated by human quality assurance still causes MADness.

My suspicion is that would only slow down the degradation of the outputs, rather than stop it completely.

FaceDeer@kbin.social · 2 years ago

I wasn’t proposing only using curated AI-generated content. If the problem is the loss of “rare data” from the edges, then adding some AI-generated data to a data set that still includes that rare data shouldn’t be a problem.

The article doesn’t say that AI-generated data is somehow “infectious”, just that the data set becomes more and more limited with each cycle since rare information gets lost each time.
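
Here is a minimal sketch of that "rare information gets lost each cycle" effect, under heavy simplification: the data is just category labels with a Zipf-like frequency, and the "model" is nothing more than the empirical distribution of its training set. `surviving_categories` is a made-up name for this example.

```python
import random
from collections import Counter


def surviving_categories(generations, n=500, seed=1):
    """Return how many distinct categories remain after each cycle."""
    rng = random.Random(seed)
    # Zipf-like start: category k has weight 1/k, so high k is "rare data".
    categories = list(range(1, 101))
    weights = [1 / k for k in categories]
    data = rng.choices(categories, weights=weights, k=n)
    history = [len(set(data))]
    for _ in range(generations):
        counts = Counter(data)
        # The "model" can only reproduce what it saw, in the same proportions.
        data = rng.choices(list(counts), weights=list(counts.values()), k=n)
        history.append(len(set(data)))
    return history


if __name__ == "__main__":
    print(surviving_categories(10))
```

Once a rare category fails to show up in one round of samples it can never come back, so the count of distinct categories can only go down. Mixing the original data back in each round would let those categories reappear, which is the point about keeping the rare data in the set.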

setsubyou@lemmy.world · 2 years ago

I’d go mad too if someone tried to train me on AI created data all the time…

gravitas_deficiency@sh.itjust.works · 2 years ago

This is a very boring kind of rampancy

Denali@kbin.social · 2 years ago

Incest is always bad even amongst computers and data sets

nxfsi@lemmy.world · 2 years ago

Did people already forget about GANs?

Sounds like a skill issue to me

r00ty@kbin.life · 2 years ago

I don’t know. If you locked someone in a room, showed them a single news broadcast, and from then on just replayed back at them anything they say… they’d go mad too.

Gsus4@lemmy.one · edit-2 2 years ago

Considering that training is extracting the main features of a dataset, there is always some data that is discarded as “noise” in the process. Then, when data is generated, that discarded information is filled back in with actual random noise to partially replicate the original data.

Iterate and you’re going to end up with progressively less meaningful features. I just didn’t expect it to take only 5 iterations; that’s a lot of feature loss in training, even with so many parameters.

onichama@feddit.de · 2 years ago

The patterns on the beards look kinda cool ngl

NotAPenguin@kbin.social · 2 years ago

Come on where are all the fucked up generated images??

Weird choice making an article about this and then barely showing it.

dakar@kbin.social · 2 years ago

The linked study in the article has many of those, starting at page 20

RGB@group.lt · 2 years ago

There is a group on FB [https://www.facebook.com/groups/cursedaiwtf] that is full of some fucked up things. Sorry for FB, but I don’t know any place weirder right now.