AI Models Collapsed When Trained on AI-Generated Data: Study

by Geethanjali Pedamallu
July 30, 2024

AI models can do a great deal today, but will they be able to keep doing it forever? A new study has drawn fresh attention to the concept of “model collapse”, which could become a big challenge for how AI models are trained.

Highlights:

  • Researchers tried training AI models recursively on text generated by earlier versions of the same models.
  • They found that AI models may start returning gibberish after 9-10 iterations.
  • This suggests that relying on synthetic data for training LLMs may not be a good idea.

AI has a ‘Model Collapse’ problem

Researchers from the Universities of Oxford and Cambridge have recently published a study showing how AI models collapse when trained on data generated by other AI systems. By collapse, we mean that the generated output becomes low in quality and sometimes incorrect.

This study is significant because many big tech companies like Microsoft, OpenAI and Cohere are planning to train their LLMs on large amounts of varied data from the internet. We also know that a lot of the content on websites today is already written with AI. So, if a new AI model is trained on data generated by its predecessor, it will be a problem.

The degradation in the quality of outputs generated by AI when it is trained on AI-generated data is called Model Collapse.

Visualizing Model Collapse

If you look at the image above, you can see how colourful the real data is; the colours indicate the diversity of the dataset. When an AI model generates new data from the original data, that diversity is reduced. With each generation the colour fades a little more, and in the end the model collapses completely.

To understand this in simpler terms, think of the telephone game (Chinese Whispers) we used to play in childhood. A large group of children sit in a circle and pass a message along by whispering it into each other’s ears. By the end of the game, the original words are lost and the last child says something that is gibberish and has no relation to the original message.

This is essentially what happens to AI models. If they keep being trained on synthetic data that lacks diversity, the models gradually lose their ability to accurately understand and process information.
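The loss of diversity described above can be imitated with a toy simulation. This is only a minimal sketch, not the researchers' experiment: the "model" here is just a table of colour frequencies, and the colour names, probabilities, sample size and function names are all assumptions made for illustration. Each generation is "trained" on a finite sample drawn from the previous generation, so a rare colour that happens to be missed can never come back.

```python
import random
from collections import Counter


def fit(samples):
    """'Train' a toy model: estimate each colour's probability from its frequency."""
    counts = Counter(samples)
    total = len(samples)
    return {colour: n / total for colour, n in counts.items()}


def generate(model, n, rng):
    """Draw n samples from the toy model's learned distribution."""
    colours = list(model)
    weights = [model[c] for c in colours]
    return rng.choices(colours, weights=weights, k=n)


def simulate(generations=30, n=50, seed=0):
    """Return the number of distinct colours each generation's model knows."""
    rng = random.Random(seed)
    # Hypothetical "real" data: a few common colours and several rare ones.
    real_model = {
        "red": 0.30, "blue": 0.30, "green": 0.20, "yellow": 0.08,
        "purple": 0.06, "orange": 0.04, "pink": 0.02,
    }
    data = generate(real_model, n, rng)
    diversity = []
    for _ in range(generations):
        model = fit(data)                # train on the current data
        diversity.append(len(model))     # how many colours survive
        data = generate(model, n, rng)   # next generation sees only model output
    return diversity


diversity = simulate()
print(diversity)
```

The printed list can stay flat for a while, but it can never go up: once a colour drops out of the training data, no later generation can produce it again. That one-way loss is the toy version of the fading colours in the illustration.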

One Example of How It Happens

The researchers conducted several experiments and published their findings in the paper. In one of them, they gave the model a long prompt about the history of church building and then trained the model repeatedly on its own output. After several iterations, this is what the output was:

“architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”

This happens because, with each round of training, the model slowly forgets bits and pieces of the original data. After several generations, it is left with very little of that information, which forces it to return gibberish.

How to overcome this problem?

The first proposed solution is watermarking. Watermarks are digital signatures embedded in AI-generated data, which help people detect that data and remove it from training datasets. Major companies have backed this method, but it remains a challenge because some companies may not comply.

Another solution is to stick to human-generated data, even though it may be expensive, because it tends to give better results than the alternatives.

That’s why these tech giants are making deals with news websites and social media platforms to receive a regular supply of human-written content for training their models. They have to pay heavily for it because, otherwise, most news websites block AI crawlers.
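As a rough illustration of the "detect and remove" step, here is a minimal sketch that assumes every document already carries a flag set by some watermark detector. The `Document` class, the `watermarked` field and the `keep_unwatermarked` function are hypothetical names invented for this example; they do not correspond to any real detector's API, and the detection itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    watermarked: bool  # hypothetical flag set by some upstream watermark detector


def keep_unwatermarked(documents):
    """Return only the documents that were not flagged as AI-generated."""
    return [doc for doc in documents if not doc.watermarked]


# Made-up corpus for illustration only.
corpus = [
    Document("A history of church building in medieval England.", watermarked=False),
    Document("Text produced by a language model.", watermarked=True),
    Document("A reporter's eyewitness account.", watermarked=False),
]

training_set = keep_unwatermarked(corpus)
print(len(training_set), "of", len(corpus), "documents kept")
```

The filter is only as good as the flag it relies on, which is why compliance matters: content from a company that does not watermark its output would pass straight through.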

Conclusion

Acquiring high-quality human-written data is becoming more challenging and expensive for companies. But they can’t ignore this ‘model collapse’ problem, so they now need to start looking for more solutions to it.

Geethanjali Pedamallu

Hi, I am P S Geethanjali, a college student learning something new every day about what's happening in the world of Artificial Intelligence and Machine Learning. I'm passionate about exploring the latest AI technologies and how they solve real-world problems. In my free time, you will find me reading books or listening to songs for relaxation.
