AI Developers Turn to Synthetic Data as Original Content Dries Up

Decrypt
8 hours ago

As AI models consume the internet’s free content, a looming crisis is emerging: What happens when there’s nothing left to train on?


A recent Copyleaks report revealed that DeepSeek, a Chinese AI model, often produces responses nearly identical to ChatGPT's, raising concerns that it was trained on OpenAI outputs.


That’s led some to suspect the era of “low-hanging fruit” in AI development may be over.


In December, Google CEO Sundar Pichai acknowledged this reality, warning that AI developers are rapidly exhausting the supply of freely available, high-quality training data.


“In the current generation of LLM models, roughly a few companies have converged at the top, but I think we're all working on our next versions too,” Pichai said at the New York Times’ annual Dealbook Summit. “I think the progress is going to get harder.”


With the supply of high-quality training data dwindling, many AI researchers are turning to synthetic data generated by other AI.


Synthetic data isn’t new—it dates back to the late 1960s—and has been used in statistics and machine learning, relying on algorithms and simulations to create artificial datasets that mimic real-world information. But its growing role in AI development sparks fresh concerns, particularly as AI systems integrate into decentralized technologies.
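The idea described above can be illustrated with a minimal sketch: fit a simple statistical model to a small "real" dataset, then sample from that model to produce an arbitrarily large artificial dataset with similar statistics. The numbers and the choice of a normal distribution here are illustrative assumptions, not anything from a production pipeline.

```python
import random
import statistics

random.seed(0)

# Hypothetical "real" measurements we want more of.
real = [4.2, 5.1, 3.8, 4.9, 5.4, 4.6, 4.0, 5.2]

# Fit a simple model (here, a normal distribution) to the real data...
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# ...then sample from the model to create an artificial dataset that
# mimics the real data's statistics without copying any real record.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

print(f"real mean: {mu:.2f}, synthetic mean: {statistics.mean(synthetic):.2f}")
```

Real synthetic-data pipelines use far richer generators (simulations, GANs, LLMs), but the principle is the same: the artificial data inherits whatever structure, and whatever biases, the fitted model captured.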


Bootstrapping AI


“Synthetic data has been around in statistics forever—it’s called bootstrapping,” MIT Professor of Software Engineering Muriel Médard told Decrypt in an interview at ETH Denver 2025. “You start with actual data and think, ‘I want more but don’t want to pay for it. I’ll make it up based on what I have.’”
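Bootstrapping, the statistical technique Médard refers to, resamples an existing dataset with replacement to generate many plausible alternative datasets. A minimal sketch, with made-up example values:

```python
import random

random.seed(42)

# A small "real" dataset (hypothetical values, e.g. latencies in ms).
observed = [12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 9.9, 12.7]

def bootstrap_means(data, n_resamples=1000):
    """Resample the data with replacement, recording each resample's mean."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return means

means = sorted(bootstrap_means(observed))
# A rough 95% interval for the mean, read off the resampled distribution.
low, high = means[25], means[975]
print(f"mean: {sum(observed) / len(observed):.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```

Each resample is "made up based on what you have": no new information is collected, but the resampled sets let you estimate how much a statistic would vary if you could.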


Médard, the co-founder of decentralized memory infrastructure platform Optimum, said the main challenge in training AI models isn’t the lack of data but rather its accessibility.


“You either search for more or fake it with what you have,” she said. “Accessing data—especially on-chain, where retrieval and updates are crucial—adds another layer of complexity.”


AI developers face mounting privacy restrictions and limited access to real-world datasets, with synthetic data becoming a crucial alternative for model training.


“As privacy restrictions and general content policies are backed with more and more protection, utilizing synthetic data will become a necessity, both out of ease of access and fear of legal recourse,” Senior Solutions Architect at Druid AI Nick Sanchez told Decrypt.


“Currently, it’s not a perfect solution, as synthetic data can contain the same biases you would find in real-world data, but its role in handling consent, copyright, and privacy issues will only grow over time,” he added.


Risks and rewards


As the use of synthetic data grows, so do concerns about its potential for manipulation and misuse.


“Synthetic data itself might be used to insert false information into the training set, intentionally misleading the AI models,” Sanchez said. “This is particularly concerning when applying it to sensitive applications like fraud detection, where bad actors could use the synthetic data to train models that overlook certain fraudulent patterns.”


Blockchain technology could help mitigate the risks of synthetic data, Médard explained, emphasizing that the goal is to make data tamper-proof rather than unchangeable.


“When updating data, you don’t do it willy-nilly—you change a bit and observe,” she said. “When people talk about immutability, they really mean durability, but the full framework matters.”


Edited by Sebastian Sinclair


