Artificial intelligence companies have run out of data for training their models and have “exhausted” the sum of human knowledge, Elon Musk has said.
The world’s richest person suggested technology firms would have to turn to “synthetic” data – or material created by AI models – to build and fine-tune new systems, a process already taking place with the fast-developing technology.
“The cumulative sum of human knowledge has been exhausted in AI training. That happened basically last year,” said Musk, who launched his own AI business, xAI, in 2023.
AI models such as GPT-4o, which powers the ChatGPT chatbot, are “trained” on a vast array of data taken from the internet, from which they in effect learn to spot patterns – allowing them to predict, for instance, the next word in a sentence.
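That prediction step can be illustrated with a toy bigram model, which simply counts which word most often follows another. This is an illustration only, not how models such as GPT-4o actually work: they use neural networks trained over whole sequences, not word-pair counts.

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for internet-scale training data.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (bigram statistics).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Predict the most frequently observed next word."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # prints "cat": it follows "the" twice, more than "mat" or "fish"
```

A large language model does something loosely analogous at vastly greater scale, which is why the supply of training text matters so much.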
Speaking in an interview livestreamed on his social media platform, X, Musk said the “only way” to counter the lack of source material for training new models was to move to synthetic data created by AI.
Referring to the exhaustion of data troves, he said: “The only way to then supplement that is with synthetic data where … it will sort of write an essay or come up with a thesis and then will grade itself and … go through this process of self-learning.”
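The loop Musk describes (generate, self-grade, learn) can be sketched in toy form. Everything here is hypothetical: the 0-to-1 quality scale, the acceptance threshold and the update rule are stand-ins for illustration, not any lab's actual pipeline.

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

# Hypothetical stand-in: "quality" is a number in [0, 1], not a real model.
def generate_essay(model_quality):
    """The model writes an essay; its quality varies around current ability."""
    return model_quality + random.uniform(-0.1, 0.1)

def grade(score, threshold=0.5):
    """The model grades itself: keep only essays above a bar."""
    return score >= threshold

def self_learning_round(model_quality, n=100):
    """Generate n essays, keep those that pass, nudge ability toward them."""
    kept = [s for s in (generate_essay(model_quality) for _ in range(n)) if grade(s)]
    if not kept:
        return model_quality
    # Toy update rule: move slightly toward the average accepted essay.
    return 0.9 * model_quality + 0.1 * (sum(kept) / len(kept))

quality = 0.5
for _ in range(5):
    quality = self_learning_round(quality)
```

Musk's caveat maps directly on to the `grade` step: if the grader itself hallucinates, bad essays pass the filter and the loop reinforces them rather than improving the model.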
Meta, the owner of Facebook and Instagram, has used synthetic data to fine-tune its biggest Llama AI model, while Microsoft has also used AI-made content for its Phi-4 model. Google and OpenAI, the company behind ChatGPT, have also used synthetic data in their AI work.
However, Musk also warned that AI models’ habit of generating “hallucinations” – a term for inaccurate or nonsensical output – was a danger for the synthetic data process.
He said in the livestreamed interview with Mark Penn, the chair of the advertising group Stagwell, that hallucinations had made the process of using artificial material “challenging” because “how do you know if it … hallucinated the answer or it’s a real answer”.
Andrew Duncan, the director of foundational AI at the UK’s Alan Turing Institute, said Musk’s comment tallied with a recent academic paper estimating that publicly available data for AI models could run out as soon as 2026. He added that over-reliance on synthetic data risked “model collapse”, a term referring to the outputs of models deteriorating in quality.
“When you start to feed a model synthetic stuff you start to get diminishing returns,” he said, warning that the outputs risk becoming biased and lacking in creativity.
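That feedback effect can be demonstrated with a statistics-only toy, with no neural network involved. The key assumption, flagged in the comments, is that a model trained on its own output over-represents its most typical samples; under that assumption the diversity of the data shrinks generation by generation.

```python
import random
import statistics

random.seed(0)

# Generation 0: "real" data drawn from a diverse distribution.
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

spreads = []
for generation in range(8):
    # "Train": estimate the distribution from the current data.
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    spreads.append(sigma)
    # The next generation learns only from the model's own output. The
    # filter is our modelling assumption: typical samples dominate.
    samples = [random.gauss(mu, sigma) for _ in range(6000)]
    data = [x for x in samples if abs(x - mu) <= sigma][:2000]

# spreads falls generation after generation: the synthetic data loses
# the diversity of the original, a toy analogue of "model collapse".
print([round(s, 3) for s in spreads])
```

In this sketch the spread of the data drops sharply within a few generations, which is the "diminishing returns" Duncan describes: each round of training on synthetic output narrows what the next model can produce.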
Duncan added that the growth in AI-generated content online could also result in that material being absorbed into AI data training sets.
High-quality data, and control over it, is one of the legal battlegrounds in the AI boom. OpenAI admitted last year it would be impossible to create tools such as ChatGPT without access to copyrighted material, while the creative industries and publishers are demanding compensation for use of their output in the model training process.