
Did DeepSeek Train Its AI on Gemini Data?


Last week, Chinese AI lab DeepSeek released an upgraded version of its R1 reasoning model, known as R1-0528. The model has demonstrated strong performance across several math and programming benchmarks. However, the source of its training data remains undisclosed, leading to growing speculation within the AI community.

Some developers and researchers suspect DeepSeek may have used outputs from Google’s Gemini models to train R1-0528. Sam Paech, a developer based in Melbourne, shared observations on X, noting that the model’s language choices closely resemble those used by Gemini 2.5 Pro. While that similarity alone isn’t conclusive, it adds to a series of clues.

Supporting this theory, the anonymous developer behind the AI evaluation tool SpeechMap pointed out that the "traces" R1-0528 produces (essentially the intermediate reasoning steps the model generates on its way to an answer) seem to echo those of Gemini. Although this doesn't confirm data misuse, it deepens suspicions given DeepSeek's past behavior.

A Pattern of Questionable Practices

This isn’t the first time DeepSeek has been accused of training on data from competing AI models. In December, developers noticed that DeepSeek’s earlier V3 model sometimes identified itself as ChatGPT. That pattern suggested it may have been trained on logs from OpenAI’s chatbot.

Earlier in 2025, OpenAI told the Financial Times it had found evidence that DeepSeek had used "distillation," a technique in which one model is trained to mimic the outputs of another, typically larger, model. Bloomberg also reported that Microsoft detected large-scale data exfiltration through OpenAI developer accounts in late 2024; OpenAI suspects those accounts were linked to DeepSeek.

Although distillation itself is not inherently unethical, OpenAI’s terms of service prohibit the use of its outputs to develop competing AI systems. As training datasets across the web become increasingly polluted with AI-generated content, tracing the origin of any given data chunk has become a serious challenge.
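For readers unfamiliar with the technique, classic white-box distillation trains a small "student" network to match a larger "teacher's" output distribution. The following is a minimal sketch only; the toy models, temperature, and random inputs are illustrative assumptions, not anything DeepSeek or OpenAI has disclosed about their systems.

```python
# Minimal knowledge-distillation sketch (PyTorch).
# All model sizes and data here are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100  # toy vocabulary size (assumption)

# A larger "teacher" and a smaller "student" network.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, VOCAB))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, VOCAB))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution

for step in range(100):
    x = torch.randn(16, 32)  # stand-in for real training inputs
    with torch.no_grad():
        teacher_logits = teacher(x)  # the outputs the student mimics
    student_logits = student(x)
    # KL divergence between the softened teacher and student distributions,
    # scaled by T^2 as in the standard distillation objective.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

When a lab only has access to a rival model through a public API, the soft-label version above isn't available; what remains is training on the text the API returns, which is where the accusations against DeepSeek come in.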


Companies Respond with Stricter Security Measures

To combat unauthorized training, AI firms are tightening security. OpenAI, for instance, started enforcing ID verification for access to its advanced models in April. Notably, China isn’t among the supported countries.

Meanwhile, Google has taken steps to obscure its models’ internal traces on the AI Studio platform. Similarly, Anthropic announced it would begin summarizing traces from its models to protect proprietary information.

Despite these precautions, experts like Nathan Lambert from AI2 believe it’s entirely plausible that DeepSeek used Gemini-derived data. “If I was DeepSeek, I would definitely create a ton of synthetic data from the best API model out there,” Lambert wrote on X. According to him, limited GPU access combined with ample financial resources makes distillation an attractive shortcut.
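To make the "synthetic data" step concrete: the black-box variant amounts to collecting prompt/completion pairs from a strong model's API and saving them as supervised fine-tuning examples. The sketch below is hedged accordingly; the endpoint, key, and payload shape are hypothetical placeholders, not any real provider's interface.

```python
# Hypothetical synthetic-data collection pipeline.
# API_URL, API_KEY, and the request/response shape are placeholders.
import json
import requests

API_URL = "https://api.example.com/v1/generate"  # hypothetical endpoint
API_KEY = "YOUR_KEY"  # placeholder credential

prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a Python function that reverses a linked list.",
]

with open("synthetic_train.jsonl", "w") as f:
    for prompt in prompts:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"prompt": prompt},
            timeout=60,
        )
        completion = resp.json().get("text", "")
        # Each line becomes one supervised fine-tuning example.
        f.write(json.dumps({"prompt": prompt, "response": completion}) + "\n")
```

Run at scale, a pipeline like this would produce exactly the kind of training corpus Lambert describes, which is why providers are now guarding their models' raw outputs.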

