Key Takeaways
- DeepSeek released DeepSeek-V4 on April 24, 2026 — two MoE checkpoints available on Hugging Face: V4-Pro (1.6T parameters, 49B active) and V4-Flash (284B total, 13B active).
- Both models carry a 1 million-token context window, making DeepSeek-V4 the first major open-weight model to match Gemini 1.5 Pro's 1M context at frontier quality.
- V4-Pro scored 80.6 on SWE Verified — within one point of Claude Opus 4.6 Max (80.8) and Gemini-3.1-Pro (80.6) — and runs at 27% of DeepSeek-V3.2's inference FLOPs at 1M tokens.
- The model preserves reasoning traces across multi-turn tool-call chains, a specific post-training change that makes it more reliable for long-horizon agent tasks.
- DeepSeek-V4 is adapted for Huawei chip technology, reducing the model's dependence on Nvidia H100/H200 hardware — a strategic move in the context of US export controls.
What is DeepSeek-V4 and why does the 1M context window matter?
DeepSeek-V4 is an open-weight AI model with a 1 million-token context window — large enough to hold roughly 750,000 words or an entire large codebase in a single prompt.
DeepSeek released V4 on April 24, 2026. CNN reported that the company describes it as "the most powerful open-source platform" it has shipped, with Bloomberg noting it rivals top closed-source models from OpenAI and Google DeepMind. Those claims are aggressive and the benchmark numbers don't fully support "most powerful" across the board — but the context window story is genuinely new territory.
One million tokens is roughly 750,000 words or about 3,000 pages of text. In practical terms, it means you can load an entire large codebase, a company's complete documentation archive or months of research notes into the model's context and ask questions without chunking, summarizing or losing information. Until V4, that capability existed at frontier quality only in Gemini 1.5 Pro — a closed, paid Google model. DeepSeek-V4 brings it to open weights.
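The tokens-to-pages conversion above is back-of-envelope arithmetic. A minimal sketch, using the common rules of thumb of roughly 0.75 English words per token and 250 words per printed page (these ratios are general heuristics, not figures from the V4 report):

```python
# Back-of-envelope sizing for a 1M-token context window.
# Assumed ratios (rules of thumb, not from the V4 report):
# ~0.75 English words per token, ~250 words per printed page.
TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 250

words = int(TOKENS * WORDS_PER_TOKEN)  # -> 750,000 words
pages = words // WORDS_PER_PAGE        # -> 3,000 pages

print(f"{TOKENS:,} tokens = ~{words:,} words = ~{pages:,} pages")
```

Actual ratios vary by tokenizer and by language; code tokenizes more densely than prose, so a "full codebase in context" holds proportionally fewer lines than this estimate suggests.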
The Hugging Face technical blog on V4, published by Ben Burtenshaw and colleagues on April 24, 2026, makes the key distinction: "A 1M context window is just capacity, not performance." The meaningful question is whether the model can actually use that capacity efficiently at inference time. V4's architecture is built to answer that question directly and the design choices that make it work are worth understanding.
What is the V4 architecture and how does it handle long contexts without melting GPUs?
DeepSeek-V4 uses Compressed Sparse Attention and Heavily Compressed Attention to cut KV cache memory at 1M tokens to roughly 2% of what a standard transformer architecture needs.
Standard transformer attention is expensive at long context. Every new token must attend to every previous token and the KV cache that stores this information grows linearly with sequence length. At 1 million tokens, a standard architecture becomes impractical to run — the memory and compute costs are prohibitive for most hardware setups.
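To make the memory problem concrete, here is a rough KV-cache estimate for a standard dense transformer. The model dimensions below are illustrative assumptions, not DeepSeek-V4's actual configuration; the point is the linear growth with sequence length:

```python
# Rough KV-cache memory for a standard dense transformer.
# All model dimensions are illustrative assumptions.
def kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Two tensors (K and V) per layer, each [seq_len, n_kv_heads, head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

for tokens in (128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:,.1f} GiB per sequence")
```

Under these assumed dimensions, a single 1M-token sequence needs a KV cache in the hundreds of gigabytes — before accounting for weights or batch size — which is why long-context serving forces compression of some kind.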
V4 solves this with two new attention mechanisms that alternate across layers:
Compressed Sparse Attention (CSA) compresses KV entries by 4x along the sequence dimension using softmax-gated pooling. A fast indexer then picks the top-k most relevant compressed blocks per query, reducing the search space substantially. Think of it as finding the relevant chapters in a book without reading every word.
Heavily Compressed Attention (HCA) takes this further, compressing by 128x and running dense attention over the resulting short sequence. HCA trades some retrieval precision for a dramatically smaller memory footprint.
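The two mechanisms can be sketched with a toy NumPy example. Only the compression ratios (4x and 128x) and the pooling/top-k structure come from the description above; the gating score and every dimension below are stand-in assumptions, not V4's actual computation:

```python
import numpy as np

def gated_pool(kv, ratio):
    """Softmax-gated pooling along the sequence axis (toy version).
    kv: [seq, dim] -> [seq // ratio, dim]."""
    seq, dim = kv.shape
    blocks = kv[: seq - seq % ratio].reshape(-1, ratio, dim)
    # Gate positions within each block; a feature-sum stands in for
    # the learned gate here.
    scores = blocks.sum(axis=-1, keepdims=True)
    gates = np.exp(scores - scores.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    return (gates * blocks).sum(axis=1)

def topk_blocks(query, compressed, k):
    """Indexer: pick the k compressed entries most relevant to the query."""
    sims = compressed @ query
    return np.argsort(sims)[-k:]

rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 64))
csa = gated_pool(kv, ratio=4)    # 1024 -> 256 entries (CSA-style, 4x)
hca = gated_pool(kv, ratio=128)  # 1024 -> 8 entries  (HCA-style, 128x)
q = rng.standard_normal(64)
idx = topk_blocks(q, csa, k=32)  # attend to only 32 of 256 blocks
print(csa.shape, hca.shape, idx.shape)
```

The sketch shows why the two layers complement each other: CSA keeps fine-grained entries but attends to few of them, while HCA keeps very few entries but attends to all of them densely.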
The combination produces the efficiency numbers that make V4 deployable. According to the Hugging Face technical report, V4-Pro requires 27% of the per-token inference FLOPs that DeepSeek-V3.2 needed at 1M tokens and uses 10% of the KV cache memory. V4-Flash drops those numbers to 10% of FLOPs and 7% of KV cache. Compared with the KV cache a standard transformer would need at the same context length, V4 stores roughly 2% as much. That's what makes 1M context viable on real hardware.
| Model | Context Window | Open / Closed | Best Use Case at Long Context |
|---|---|---|---|
| GPT-4o | 128K tokens | Closed | Long documents, multi-turn conversation |
| Claude Opus 4.7 | 200K tokens | Closed | Long reports, complex reasoning |
| GPT-5.5 (API) | 1M tokens | Closed | Large-scale agentic coding, research |
| Gemini 1.5 Pro | 1M tokens | Closed | Multimodal, long-form analysis |
| DeepSeek-V4-Pro | 1M tokens | Open weight | Agentic coding, self-hosted enterprise |
| DeepSeek-V4-Flash | 1M tokens | Open weight | High-throughput, cost-sensitive agent tasks |
How does DeepSeek-V4 perform on real agent tasks versus GPT-5.4 and Claude?
V4-Pro lands within a point of Claude Opus 4.6 Max and Gemini-3.1-Pro on SWE Verified and MCPAtlas and outperforms Gemini-3.1-Pro on tool-use benchmarks. The gap versus OpenAI's latest models is real but narrowing.
According to the Hugging Face technical blog citing DeepSeek's technical report, V4-Pro scored 80.6 on SWE Verified (GitHub issue resolution), within one point of both Claude Opus 4.6 Max (80.8) and Gemini-3.1-Pro (80.6). On MCPAtlas Public — a test of tool-use with the Model Context Protocol — V4-Pro scored 73.6, second only to Opus 4.6 Max (73.8). On Toolathlon, V4 scored 51.8, ahead of Gemini-3.1-Pro (48.8).
The gap is larger on Terminal-Bench 2.0: V4-Pro scores 67.9, behind GPT-5.4 (75.1), Gemini-3.1-Pro (68.5) and now GPT-5.5 (82.7). Terminal-Bench tests complex command-line workflows with planning, iteration and tool coordination — the kind of tasks that require the model to maintain coherent strategy across many sequential steps. That's where V4's architecture, despite its long-context efficiency, shows its relative limitation versus OpenAI's latest.
For long-context retrieval specifically, V4's MRCR 8-needle accuracy stays above 0.82 through 256K tokens and holds at 0.59 at 1M tokens. That means the model can still reliably locate information buried deep in a 1M-token context — not perfectly, but usably. For most enterprise document processing use cases, a 0.59 retrieval accuracy at the maximum context limit is a reasonable baseline to build on.
An internal DeepSeek survey of 85 developers using V4-Pro as their daily coding tool found 52% said it was ready to replace their current primary coding model and 39% said they leaned toward yes. That 91% favorable signal from active daily users is meaningful, though the sample is small and comes from DeepSeek's own developer community.
What does "open weight" mean in practice and why does it matter for enterprises?
Open-weight means model weights are public and downloadable. Enterprises can run V4 on their own hardware — no API dependency, no data leaving their systems, no per-token billing.
The two V4 instruct checkpoints are available now on Hugging Face: DeepSeek-V4-Pro (1.6T total parameters, 49B active) and DeepSeek-V4-Flash (284B total, 13B active). Both use FP4 for MoE expert weights and FP8 for everything else — a storage format choice that compounds with the compression architecture to make deployment feasible on modern GPU clusters.
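A quick storage estimate shows why the FP4/FP8 split matters at this scale. The expert/non-expert parameter split below is an assumption for illustration; the report states only that MoE expert weights are FP4 and the rest FP8:

```python
# Rough checkpoint-size arithmetic for V4-Pro's mixed FP4/FP8 storage.
# The 95% expert fraction is an assumption, not a reported figure.
total_params = 1.6e12
expert_fraction = 0.95  # assumed: most MoE parameters live in experts

expert_bytes = total_params * expert_fraction * 0.5        # FP4 = 4 bits/param
other_bytes = total_params * (1 - expert_fraction) * 1.0   # FP8 = 8 bits/param
print(f"~{(expert_bytes + other_bytes) / 1e12:.2f} TB on disk")
```

At full FP16 the same 1.6T parameters would need roughly 3.2 TB, so the low-precision formats cut checkpoint size by close to 4x before any runtime compression applies.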
For enterprises, the open-weight availability changes the procurement calculus for AI workloads. Running GPT-5.5 means paying OpenAI $5–30 per million tokens, with data flowing through OpenAI's API. Running DeepSeek-V4-Pro on your own hardware means capital cost plus compute cost, but no per-token bill and no data leaving your controlled infrastructure. For regulated industries — healthcare, finance, legal — that control over data residency can be the deciding factor.
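The procurement calculus above can be made concrete with a break-even sketch. Every number here is an assumption for illustration — the API price is the midpoint of the $5–30 range cited, and the hardware and operations costs are invented placeholders:

```python
# Illustrative API-vs-self-hosted break-even. All numbers are
# assumptions for the sketch, not vendor quotes.
api_price_per_m_tokens = 15.0  # $/1M tokens (mid of the cited $5-30 range)
cluster_capex = 250_000.0      # one-time hardware cost, assumed
monthly_opex = 8_000.0         # power, ops, amortization, assumed

def breakeven_tokens_per_month(months):
    """Monthly token volume at which self-hosting matches API spend
    over the given horizon."""
    total_self_hosted = cluster_capex + monthly_opex * months
    return total_self_hosted / (api_price_per_m_tokens / 1_000_000) / months

volume = breakeven_tokens_per_month(24)
print(f"Break-even = ~{volume / 1e9:.1f}B tokens/month over 24 months")
```

Under these assumptions the break-even sits around a billion tokens a month — heavy but realistic agent-workload territory. For teams where data residency is the binding constraint, of course, the break-even math is secondary.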
The Reuters report on V4's release flagged another strategic dimension: the model is adapted for Huawei chip technology. US export controls have restricted Chinese AI labs' access to Nvidia's H100 and H200 chips. DeepSeek's optimization for Huawei's Ascend NPU line signals the lab's intent to remain competitive even if chip access tightens further. Whether Huawei's chips can sustain frontier-model training at scale remains to be tested — but the architectural direction is clear. For more on how AI is reshaping development infrastructure, see our piece on AI-generated code and technical debt.
What three things did DeepSeek change to make V4 more reliable as an agent?
DeepSeek-V4 preserves reasoning across tool-call chains, uses a new XML-based tool-call schema that reduces parsing failures and was trained against real tool environments using a Rust-based sandbox infrastructure.
The Hugging Face technical analysis identifies three post-training decisions that directly improve V4's agent reliability — beyond the architecture changes that make long context efficient.
First, V4 preserves reasoning content across user message boundaries when the conversation contains tool calls. The previous release, V3.2, kept reasoning traces within a single user turn but discarded them when a new user message arrived. For multi-turn agent workflows — where a user sends a follow-up after the agent has already chained several tool calls — V3.2 lost its accumulated reasoning and had to reconstruct state. V4 retains the complete reasoning history across all rounds, including across user turns. For conversational use without tools, the old behavior is preserved to keep context concise.
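The two trimming policies can be sketched as a small function over a chat history. The message schema and function name below are illustrative assumptions, not DeepSeek's API:

```python
# Toy sketch of the two history-trimming policies described above.
# Message schema and function names are illustrative, not the API.
def trim_history(messages, has_tool_calls):
    """V4-style (tool-call chains): keep all reasoning.
    V3.2-style / chat-only: drop reasoning from turns before the
    latest user message."""
    if has_tool_calls:
        return messages  # full reasoning history retained
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    trimmed = []
    for i, m in enumerate(messages):
        if i < last_user and m.get("reasoning"):
            m = {k: v for k, v in m.items() if k != "reasoning"}
        trimmed.append(m)
    return trimmed

history = [
    {"role": "user", "content": "Fix the bug"},
    {"role": "assistant", "content": "...", "reasoning": "step 1..."},
    {"role": "tool", "content": "test output"},
    {"role": "user", "content": "Now add a test"},
]
kept = trim_history(history, has_tool_calls=True)
dropped = trim_history(history, has_tool_calls=False)
print(any("reasoning" in m for m in dropped))  # False: chat-only path trims
```

The practical consequence is the one described above: on the tool-call path the agent's accumulated plan survives the follow-up message instead of being reconstructed from scratch.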
Second, V4 introduces a |DSML| special token and an XML-based tool-call format, replacing the previous JSON-in-string approach. The XML format reduces escaping failures when models emit nested quoted content — a common failure mode that caused parsing errors in complex tool calls. The schema also separates string parameters from structured parameters, removing a class of parsing errors around numbers and booleans.
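The escaping failure mode is easy to demonstrate. The snippet below shows nested quotes breaking a JSON-in-string tool call while an XML-style call carries the same payload untouched — the XML format shown is illustrative, not the actual DSML schema:

```python
import json
import xml.etree.ElementTree as ET

# JSON-in-string: the code argument's quotes need a second level of
# escaping, which models frequently omit. Here the outer JSON parses,
# but the under-escaped inner payload does not.
bad_json = '{"name": "run", "arguments": "{\\"code\\": \\"print(\\"hi\\")\\"}"}'
parse_failed = False
try:
    json.loads(json.loads(bad_json)["arguments"])
except json.JSONDecodeError:
    parse_failed = True

# XML-style call (illustrative format, not the actual DSML schema):
# the payload rides as element text, so inner quotes need no escaping.
xml_call = "<tool_call name='run'><param name='code'>print(\"hi\")</param></tool_call>"
root = ET.fromstring(xml_call)
print(parse_failed, root.get("name"), root.find("param").text)
```

Keeping string parameters out of the quoting machinery entirely is what removes the failure class; the separate handling of structured parameters (numbers, booleans) removes the type-coercion errors alongside it.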
Third, V4's agent behavior was trained with reinforcement learning against real tool environments — not simulated ones. DeepSeek built a Rust-based sandbox infrastructure called DeepSeek Elastic Compute (DSec) that exposes function calls, containers, microVMs and full VMs behind a single Python SDK. A single cluster can run hundreds of thousands of concurrent sandboxes, enabling the scale of RL rollouts needed to train reliable agent behavior. For a broader look at how workspace agent AI is evolving, see our coverage of ChatGPT Workspace Agents.
Analysis: the open-source frontier is closing the gap faster than expected
Twelve months ago, the gap between open-weight models and frontier closed models was roughly a full generation. DeepSeek-V4-Pro sitting within one point of Gemini-3.1-Pro and Claude Opus 4.6 Max on SWE Verified is not something most enterprise AI teams would have predicted for mid-2026.
The V4 release changes the competitive dynamics in a specific way: it gives every enterprise with GPU infrastructure the ability to run a near-frontier agent model on premises, with full control over data residency and no ongoing API costs. That's a meaningful option for regulated industries that have been locked out of frontier AI because of data privacy constraints.
What V4 doesn't solve is the latency and ops burden of self-hosted inference. Running a 1.6T parameter MoE model at 1M-token context requires serious hardware — this isn't a workstation deployment. The practical bar for self-hosting V4-Pro is probably a multi-GPU server cluster with A100-class hardware or better. V4-Flash at 284B total parameters and 13B active is more accessible, but also a significant step down in capability. The efficiency-capability tradeoff is real and enterprises will need to benchmark against their specific workloads before committing to infrastructure investment. This is our analysis based on published technical details; actual deployment requirements will vary.
Sources
- Hugging Face — DeepSeek-V4: a million-token context that agents can actually use (April 24, 2026)
- CNN — China's DeepSeek unveils new AI model (April 24, 2026)
- Bloomberg — DeepSeek Unveils Newest Flagship (April 24, 2026)
- South China Morning Post — DeepSeek releases next-gen AI model (April 24, 2026)
- Reuters — China's DeepSeek releases model adapted for Huawei chip technology (April 24, 2026)
- Hugging Face — DeepSeek-V4-Pro model card
Fact-checked by Jim Smart
