Elon Musk recently said that the AI industry has exhausted the available human-generated data for training, including books, videos, and internet content. Speaking in a livestreamed conversation on X with Stagwell chairman Mark Penn, Musk argued that the shift toward synthetic data is now essential for AI development.
Musk, who also owns the AI company xAI, explained that AI advancements increasingly rely on synthetic data, that is, data generated by AI itself. “We’ve literally run out of the entire internet, all books ever written, and all interesting videos. The only way to supplement that is with synthetic data, which AI creates,” Musk said. He described how an AI model could generate its own essays or theses, then evaluate and refine those outputs through a process of self-learning.
Musk’s comments echo earlier remarks by former OpenAI chief scientist Ilya Sutskever, who said in December 2024 that the industry had reached “peak data.” Sutskever predicted that the scarcity of traditional training data would necessitate new approaches to AI model development.
While synthetic data offers a solution, Musk acknowledged its challenges, particularly in determining the accuracy of AI-generated answers. “How do you know the answer is hallucinated or real? It’s difficult to find the ground truth,” he said. Researchers have also warned that reliance on synthetic data could lead to “model collapse,” in which AI systems become less creative and more biased, potentially compromising their performance.
Despite these challenges, major tech companies including Microsoft, Meta, OpenAI, and Anthropic are already incorporating synthetic data into their AI training processes. Gartner has estimated that 60% of the data used for AI projects in 2024 was synthetically generated. Microsoft’s Phi-4 and Google’s Gemma models were trained on a combination of real-world and synthetic data, and Anthropic and Meta have likewise used synthetic data for their latest AI systems.
As the industry continues to innovate, synthetic data appears to be a critical component of the next phase of AI development, even as it presents new challenges for accuracy and creativity in AI models.