DeepSeek Unveils AI Model Boosting Training Efficiency on Single GPU

A new multimodal AI model has been introduced, capable of generating over 200,000 pages of training data each day using a single GPU. The release marks a major step toward improving large language model (LLM) efficiency while significantly lowering development costs.

The model, called DeepSeek-OCR, leverages visual perception to compress text for LLMs more effectively. Both the model’s source code and weights are available on major developer platforms. In its accompanying research paper, the team explained that “vision-text compression can achieve significant token reduction (7–20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models.”

This approach continues the company’s emphasis on cost-efficient AI development. The same principle guided the creation of its earlier open-weight models, V3 and R1, which drew attention for delivering results comparable to high-end competitors at a fraction of the expense.

The Architecture Behind DeepSeek-OCR

The new model targets a core limitation of LLMs: handling long contexts efficiently. Its central idea is that converting text into images can lower computational demands. DeepSeek-OCR serves as a proof of this concept, comprising two key components (see the sketch after this list):

  • A 380 million-parameter DeepEncoder, which analyzes images and compresses them.

  • A text generator with 570 million active parameters, based on a three-billion-parameter mixture-of-experts (MoE) model.
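To make the two-stage design concrete, below is a minimal, hypothetical PyTorch sketch of the general idea: a vision encoder pools many image patches into a small, fixed budget of "vision tokens," and a text decoder then generates text conditioned on those tokens. All class names, dimensions, and the plain transformer decoder (standing in for the three-billion-parameter MoE decoder) are illustrative assumptions, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

class ToyDeepEncoder(nn.Module):
    """Pools many image patches into a small, fixed budget of vision tokens."""
    def __init__(self, patch_dim=768, n_vision_tokens=100, d_model=512):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)             # embed image patches
        self.queries = nn.Parameter(torch.randn(n_vision_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, patches):                                # (batch, n_patches, patch_dim)
        keys = self.proj(patches)
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        # Cross-attention compresses many patches into the fixed token budget.
        vision_tokens, _ = self.attn(q, keys, keys)
        return vision_tokens                                   # (batch, n_vision_tokens, d_model)

class ToyTextDecoder(nn.Module):
    """Autoregressive text decoder conditioned on the compressed vision tokens."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_tokens):
        x = self.embed(text_ids)
        x = self.decoder(x, vision_tokens)                     # text attends to vision tokens
        return self.lm_head(x)                                 # next-token logits

# A page rendered into 1,024 patches is squeezed into 100 vision tokens,
# which the decoder then uses to reconstruct the page's text.
patches = torch.randn(1, 1024, 768)
encoder, decoder = ToyDeepEncoder(), ToyTextDecoder()
vision_tokens = encoder(patches)
logits = decoder(torch.randint(0, 32000, (1, 16)), vision_tokens)
print(vision_tokens.shape, logits.shape)  # (1, 100, 512) and (1, 16, 32000)
```

The key point the sketch illustrates is that the decoder's context length is set by the small number of vision tokens, not by the raw length of the page's text.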

According to the researchers, the model was trained using 30 million PDF pages across 100 languages. This dataset included 25 million pages in Chinese and English, along with 10 million synthetic diagrams, five million chemical formulae, and one million geometric figures.

Benchmark results show that the model compresses text up to tenfold while retaining 97% of the original content. It efficiently handles plain text, diagrams, and formulas while preserving layout and structure. Although token usage depends on image resolution and document size, it consistently requires far fewer “vision tokens” than rival systems.
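As rough, back-of-the-envelope arithmetic (the token counts below are assumed for illustration, not measured values), the compression claim can be read as follows: a page whose text would cost an LLM on the order of a thousand tokens is instead represented by roughly a hundred vision tokens.

```python
# Illustrative arithmetic only; the specific token counts are assumptions.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens are replaced by each vision token."""
    return text_tokens / vision_tokens

# "Up to tenfold" compression, at which ~97% of the content is retained:
print(compression_ratio(1000, 100))   # 10.0

# The broader 7-20x range cited in the paper, depending on page density and resolution:
for text_toks in (700, 2000):
    print(f"{text_toks} text tokens -> 100 vision tokens "
          f"= {compression_ratio(text_toks, 100):.0f}x compression")
```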

Benchmark Performance and Efficiency Gains

When tested on the OmniDocBench benchmark, DeepSeek-OCR outperformed previous models. It surpassed GOT-OCR2.0, which uses 256 tokens per page, while processing each page with only 100 vision tokens. It also exceeded MinerU2.0, which typically needs more than 6,000 tokens per page, while using fewer than 800.

Beyond benchmarks, DeepSeek-OCR demonstrated remarkable efficiency. It can generate large-scale training data for both LLMs and vision-language models (VLMs), producing more than 200,000 pages daily while running on a single Nvidia A100 GPU. This capability represents a substantial leap in AI scalability and resource optimization.
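For a sense of scale, a simple unit conversion (assuming a uniform rate around the clock) puts the reported figure at a little over two pages per second on that single GPU:

```python
# Unit conversion of the reported throughput; assumes a constant processing rate.
pages_per_day = 200_000
seconds_per_day = 24 * 60 * 60

print(pages_per_day / seconds_per_day)  # ~2.31 pages per second
print(pages_per_day / 24)               # ~8,333 pages per hour
```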

By emphasizing token reduction and multimodal processing, the model sets a foundation for faster, cheaper, and more capable language systems. Although still early in its evolution, it marks an important advancement toward sustainable AI research and development.
