The AI Crisis Caused by Data Exhaustion: How to Avert an Impending Model Collapse

OpenAI’s ChatGPT technology went viral in less than a year and is already having an impact on work patterns and the future of entire industries. Within some of the world’s leading companies, as many as half of all employees already use this kind of technology daily. Countless companies have poured investment into AI, racing to launch new products, particularly in the internet, education, and gaming sectors.

It is well known that the data used to train the large language models (LLMs) and other transformer models behind products such as ChatGPT, Stable Diffusion, and Midjourney originally came from human sources: books, articles, photographs, and other works created entirely by people.

The parameter counts of large models keep climbing, from billions to tens of billions to hundreds of billions, and the amount of data required to train them has exploded alongside. Taking OpenAI’s GPT series as an example, the training dataset grew from roughly 4.5 GB for GPT-1 to 570 GB for GPT-3, a more than hundredfold increase.

Not long ago, at Databricks’ Data + AI Summit, Marc Andreessen, co-founder of a16z, argued that the massive data accumulated on the Internet over the past two decades is an important driver of the current wave of AI, because it supplies excellent raw learning material for AI training.

However vast the trove of data, useful and useless alike, that netizens have left on the web, the supply suitable for AI training may be about to run dry.

A paper published by Epoch, an AI research and forecasting organization, predicts that high-quality text data will be exhausted between 2023 and 2027.

While the research team acknowledges that its analytical methods have serious limitations and that the model’s uncertainty is high, it is hard to deny that AI is consuming datasets at an alarming rate.

Recently, researchers from the University of Cambridge, the University of Oxford, the University of Toronto, and other universities published an article pointing out that training AI on AI-generated content can cause new models to collapse.

The researchers concluded: “Learning from data generated by other models leads to model collapse – a degenerative process in which models forget the true underlying data distribution over time. This process is inevitable, even for cases with almost ideal conditions for long-term learning.”

Why does training AI on “generated data” cause models to collapse? And is there any way to prevent it?

At this stage, AI still only crudely imitates human thinking; at its core it remains a statistical program. The researchers argue that training AI on AI-generated content introduces “statistical approximation error”: with each round of estimation, high-probability content is further reinforced while low-probability content is progressively ignored, and this accumulating loss of the distribution’s tails is the main cause of model collapse.
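
To make this mechanism concrete, below is a minimal toy simulation in Python (an illustrative sketch, not the researchers’ experiment; the Zipf exponent, vocabulary size, and sample size are arbitrary choices). Each “generation” of a model is fit, by simple counting, to a finite sample drawn from the previous generation’s output.

```python
# Toy simulation of model collapse through repeated self-training.
# Rare tokens are undersampled each generation; once a token draws
# zero samples, its estimated probability is zero forever.
import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: a Zipf-like "true" token distribution with a long tail,
# standing in for the diversity of human-written text.
vocab_size = 1000
probs = np.arange(1, vocab_size + 1, dtype=float) ** -1.1
probs /= probs.sum()

for generation in range(1, 11):
    # "Train" generation N by re-estimating token probabilities from
    # a finite sample of generation N-1's output.
    sample = rng.choice(vocab_size, size=5000, p=probs)
    counts = np.bincount(sample, minlength=vocab_size).astype(float)
    probs = counts / counts.sum()
    surviving = int((probs > 0).sum())
    print(f"generation {generation:2d}: {surviving:4d}/{vocab_size} tokens still produced")
```

The count of surviving tokens can only fall: once a low-probability token fails to appear in a sample, no later generation can ever produce it again. This progressive loss of the distribution’s tail is exactly the “statistical approximation error” described above.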

Model collapse affects the performance, reliability, and security of models. The researchers warn that it is a serious phenomenon deserving the attention of both LLM developers and users. “We believe this problem will become one of the major challenges for the machine learning community in the next few years,” they said.

But not all hope is lost.

The first approach is data isolation. To address model collapse, the research team suggests keeping clean, human-generated data sources strictly separate from AI-generated content, so that AIGC cannot contaminate the clean data; a minimal sketch of such provenance-based filtering follows.
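
As an illustration, here is a minimal sketch of provenance-based isolation (the `Document` schema and the `source` tags are hypothetical, not any particular pipeline’s format): only records whose origin is known to be human are admitted to the clean training pool.

```python
# Hypothetical provenance-based data isolation: admit only documents
# whose origin is known to be human into the pretraining corpus.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    source: str  # e.g. "human", "ai_generated", "unknown"

def isolate_human_data(corpus: list[Document]) -> list[Document]:
    """Keep only documents with confirmed human provenance.

    Documents of unknown origin are excluded as well: once AI-generated
    text leaks into the clean pool, later models inherit its biases.
    """
    return [doc for doc in corpus if doc.source == "human"]

corpus = [
    Document("A chapter from a published novel.", "human"),
    Document("A paragraph produced by a chatbot.", "ai_generated"),
    Document("A scraped forum comment.", "unknown"),
]
clean = isolate_human_data(corpus)
print(f"{len(clean)} of {len(corpus)} documents kept")  # 1 of 3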

The second is the use of synthetic data. In fact, data generated specifically for AI training is already in wide use, and for some practitioners the current concern that AI-generated data will lead to model collapse may be overblown. The key is to establish an effective system that confirms which parts of the AI-generated data are valid and feeds that judgment back through the performance of the trained model; a hedged sketch of such a feedback loop appears below. Indeed, OpenAI’s use of synthetic data for model training has become a point of consensus within the AI industry.
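
One way such a feedback loop might look is sketched here (a hypothetical outline; `train_copy`, `evaluate`, and the batch objects are placeholders, not a real API): a synthetic batch is kept only if fine-tuning on it improves the model’s loss on held-out, human-written validation data.

```python
# Hypothetical feedback loop for vetting synthetic training data:
# accept a synthetic batch only if it measurably helps the model
# on a held-out set of human-written data.
def filter_synthetic_batches(model, synthetic_batches, human_validation_set,
                             train_copy, evaluate):
    """Return the synthetic batches that improve validation loss.

    train_copy(model, batch) -> a new model fine-tuned on `batch`
    evaluate(model, dataset) -> validation loss (lower is better)
    """
    baseline = evaluate(model, human_validation_set)
    accepted = []
    for batch in synthetic_batches:
        candidate = train_copy(model, batch)
        if evaluate(candidate, human_validation_set) < baseline:
            accepted.append(batch)  # this batch helped; keep it
    return accepted
```

Anchoring the acceptance test on human-written validation data keeps the loop from drifting: synthetic data is judged by how well it serves the true distribution, not by how plausible it looks to another model.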

In conclusion, despite the looming depletion of human data, AI training is not without solutions. Through data isolation and the careful use of synthetic data, model collapse can be effectively mitigated and the continued development of AI assured.
