Some signs of AI model collapse begin to reveal themselves (theregister.com)
50 points by penda 9 months ago | 25 comments


From the article:

> That sounds good, but a "responsible AI user" is an oxymoron. For all the crap about how AI will encourage us to spend more time doing better work, the truth is AI users write fake papers including bullshit results.

I don't really think AI was ever meant to make us do "better" work, but more efficient work defined within a narrow set of parameters. Technology that scales encourages the optimization of single parameters, and the greater it scales, the more efficient that optimization is. Google Search is a great example: huge scale, optimizing for SEO spam that is practically useless if you want to go into nuanced depth.

AI is even worse than that: it trades accuracy for efficiency, but efficiency at what? Production of things to be consumed, whether accurate, personally valuable, or not. After all, the internet has already been trending toward this production of the novel, at least for the average person. Yes, we can all point to subsystems that are nuanced and detailed, such as some scientific paper repositories, but even science is being gamed and eaten up by the superficial.

We have an instinct to seek out new information, and that instinct itself is being gamed by AI, which is perfect at producing the lowest-cost, lowest-quality information designed to stimulate our urge to consume.

In the short term, it does "help" us, but it also makes us less useful as "computational units", and thus more replaceable; it slowly and subtly shifts our role from unique, creative souls to swappable computation units/consumers who direct AI. No one designed it that way, but it's the emergent direction of the system.


> I don't really think AI was ever meant to make us do "better" work

I understand what you're saying, but I hear all the time from AI boosters about how, when AGI comes, it will let us do things like discover brand-new theories, cure diseases, etc. (singularity, blah blah blah).

Even from the less hyperbolic types, I think a goal of real AI has always been "as smart or smarter than humans", despite what the current reality or limitations of LLMs are.


> I understand what you're saying, but I hear all the time from AI boosters about how, when AGI comes, it will let us do things like discover brand-new theories, cure diseases, etc. (singularity, blah blah blah).

I get that. I do think there are three different levels we must distinguish when considering this, though:

1) What those who are selling AI want us to believe so we buy it

2) What those who are selling AI want

3) What AI actually will do based on its fundamental properties, and the emergent effects of mass human instinct operating on it

I was referring more to 3 and tangentially to 2, not really to 1, which in my opinion is about as relevant as a used-car salesman telling you that the Ford Fiesta in the corner is the most reliable thing you'll buy in the next ten years.

> I think a goal of real AI has always been "as smart or smarter than humans", despite what the current reality or limitations of LLMs are.

True, I think that's a goal of AI researchers, or at least of AI computer scientists like the team that eventually built Deep Blue and similar folk. The goal of the companies today, though, is probably just "build it as good as it gets and get the money".


When OpenAI announced their product search thing recently, I copy-pasted one of their example queries into ChatGPT, and it dutifully put together a list of product recommendations that cited several obviously LLM-generated affiliate-spam blogs. There's no escaping ensloppification.


Oh hey, another win for Kagi and its no-sponsored-results and SEO-spam filtering, even in its "AI Assistant" searches.


The beginning of this article is why I never trusted AI for search.

You can get really good results from search engines if you know how to use them and are willing to check the sources.

With AI as the search engine, you get what you get. This works well for a lot of searches, but not for everything.

Some engines, like Perplexity, Kagi, and Gemini-enabled Google Search, will give you the sources they used in generating their responses. This is fine, but for anything moderately complicated, it isn't enough, IMO.

For example: my wife was looking for a way to deter squirrels from eating out of our bird feeder. Gemini told her that we could use spicy bird feed to do this. It also provided a source. Great!

However, when you open the sourced article, you'll see that Gemini's response failed to mention not only that spicy bird feed is a mild deterrent at best (squirrels will eat damn near anything if they're desperate enough), but also the two other MUCH BETTER options that were in the article! (Make the feeder free-standing on a squirrel-proof pole, and some other option I forgot.)

I'm sure my wife could've told Gemini that the answer was incomplete to get a better response out of it. However, that perfectly underscores the problem I have with the tech in its current state: you're getting an inaccurate representation of the truth while it also steals market share from the authors and companies that shared their knowledge online to begin with.

I also don't think AI search is SEO-proof either. People will figure out how to bend Gemini to their will. Given that most people will use natural-language queries, the Wendy's [] singularity is almost inevitable.

[] Wendy's, a fast-food chain in the US, has an excellent social media team. I wouldn't be surprised if they find a way of getting every Gemini/Bing/OpenAI search to work Wendy's in somehow!


Slinky attached to the bottom of the bird feeder, draping down the freestanding pole. Squirrel will have to grab the slinky, slinky will provide elevator ride back to ground level.


You're not thinking big enough. Wendy's will buy its way into conversational LLM output, and it will be the start of a new era of undisclosed advertising.


I had one the other day where the AI search engine gave me a very odd result that got several important details blatantly wrong, and then started getting into weird right-wing conspiracy territory. I checked the sources and it turns out that the first source it pulled in was from some kooky religious organization.

This wasn’t a huge problem with search as it existed 5 years ago. You’d see the URL and say, “That doesn’t sound like a reliable source. I’m going to scroll down a little further.”

But with AI search, the authorship is less obvious. It looks like the AI is saying it, rather than the AI just mashing together a bunch of search results with some garbage in the mix. It allows creeps, kooks, and nutballs to launder their ideas by getting AI to repeat their garbage.

It was a concerning interaction, and I worry how many impressionable people are being fed weird cult-spam while thinking that it’s legitimate because they don’t realize where it’s coming from.


Ok, yes, but that's not what model collapse is. I was expecting an interesting finding on LLM training stalling due to AI pollution of the internet (the training data). Instead, this is a rant about the pollution of the internet. All good points, and it will lead to model collapse, but this is a clickbait headline.


But the internet is already polluted, and is being constantly polluted, by LLMs. It feels like LLMs are the PFAS of the internet. So I'm not sure what your argument is, but I would welcome clarification. Maybe I am misunderstanding your definition of model collapse. The article references this definition of model collapse [0]. When you say "that's not what model collapse is", what do you mean by model collapse that differs from the definition in the article and its reference?

[0] https://www.nature.com/articles/s41586-024-07566-y


My point is that their rant is not about that definition, even if they reference it. Again, they are not wrong in their concerns at all! It's just that having a PFAS'd internet is not a "sign of model collapse". A sign of model collapse would be new LLMs having degraded performance.
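
To make that concrete, here's a toy sketch (my own illustration, not from the article or the Nature paper): treat a "model" as nothing more than a Gaussian fit to its training data, and train each generation only on samples drawn from the previous generation's model. The fitted spread drifts toward zero, i.e. the tails of the original distribution vanish, which is roughly the degradation the paper describes. The names N, mu, sigma are just placeholders for this sketch.

    import random, statistics

    N = 100  # a small training set exaggerates the effect
    data = [random.gauss(0.0, 1.0) for _ in range(N)]  # generation 0: real data

    for gen in range(501):
        # "Train" this generation's model: fit mean and stddev to the data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        if gen % 100 == 0:
            print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")
        # The next generation's "internet" is entirely this model's output.
        data = [random.gauss(mu, sigma) for _ in range(N)]

Each fit slightly underestimates the spread (a finite sample rarely covers the tails), and those errors compound across generations, so sigma decays toward zero. Real LLM training is obviously far more complicated, but that's the basic feedback loop the model-collapse argument worries about.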


> A sign of model collapse would be new LLMs having degraded performance.

That's a good point. Stagnant performance would be a different thing -- more indicative of the limits of the technology itself.


Any chance you get, speed the collapse


The example in the article seems less like a RAG issue and more like a prompt issue. The fact that LLMs seemingly can't disappoint the user and have to assume the user is correct is what leads to the misinformation in articles like this.

That said, there is a clear vulnerability in LLMs not being able to judge whether information is true, especially when pulling from the web.


"Prompt issue" is just the new "user error". Just passes the blame anywhere other than the flawed value of the tool itself.


What's the long-term strategy WRT getting their grubby mitts on uncontaminated training data? Now that they've erased the incentives for publishing knowledge for free (i.e., the assumption that your efforts will be rewarded with a few advertising dollars from visits) and established disincentives for publishing knowledge for free (i.e., having your site absolutely hammered by their disrespectful scrapers), what kind of internet are these slop merchants planning to subsist on? Someone gonna tell me Reddit comments and Facebook posts?


Why do you imagine a new equilibrium won't evolve? You've asserted a bunch of nonsense and run with it, and so of course you've come to a wrong conclusion.

The value of massively available internet text is the implicit world model learned by encoding the text into the weights. The altered distribution of data post-2025 is just another piece of the world model.

Why would children be able to learn and grow and become productive in the future, but somehow AI researchers are just going to drown and die and AI will flounder permanently?


> You've asserted a bunch of nonsense

Please omit swipes like this from comments on HN.

https://news.ycombinator.com/newsguidelines.html


fair. ill try harder


Thanks!


Because empirical observations suggest that children regularly do learn and grow, while AI researchers do drown and flounder, also quite regularly. I mean, this time it could of course be different (there is a first time for everything, after all), but it could also be the same.


Or it could be the usual VC startup playbook playing out. Offer a great service with great quality for free -> capture users. Once you're happy with your number of users, slowly lower the quality of the free service (meanwhile, continue capturing users). Once they start to complain about the quality, introduce a paid "premium" tier that is just the initial service restored to its original quality (with one or two small features added so it's not too obvious).

You can re-apply the process later to move users to higher tiers.


The proper term for this is the Ouroboros Effect, visually represented by a snake eating its own tail. We are now seeing AI results contaminated by models feeding off increasing amounts of AI-generated slop instead of purely human-generated/reality-generated content. AI is indeed beginning to eat its own tail.


This is the Dead Internet theory manifest. AI is such trash it breaks itself.



