Gate News, April 24: DeepSeek has released the V4 series of open-source models under the MIT License, with weights now available on Hugging Face and ModelScope. The series comprises two mixture-of-experts (MoE) models: V4-Pro, with 1.6 trillion total parameters and 49 billion activated per token, and V4-Flash, with 284 billion total parameters and 13 billion activated per token. Both support a 1-million-token context window.
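To put the sparsity in perspective, the share of parameters activated per token follows directly from the figures above. The snippet below is a minimal sketch: the dictionary and function names are illustrative, and the inputs are the rounded counts reported here, not exact values.

```python
# Rounded parameter counts as reported in this item (in billions).
MODELS = {
    "V4-Pro": {"total_b": 1600, "active_b": 49},
    "V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used for any single token in an MoE model."""
    return active_b / total_b


for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

On these rounded figures, V4-Pro activates roughly 3.1% of its parameters per token and V4-Flash roughly 4.6%.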
The architecture introduces three key upgrades. First, a hybrid attention mechanism combines compressed sparse attention (CSA) with heavily compressed attention (HCA) to cut long-context overhead: at a 1M-token context, V4-Pro's inference FLOPs are 27% of V3.2's, and its KV cache (the GPU memory that holds keys and values for previously processed tokens during inference) is 10% of V3.2's. Second, manifold-constrained hyperconnections (mHC) replace traditional residual connections to stabilize signal propagation across layers. Third, the Muon optimizer is used for faster training convergence. Pre-training used more than 32 trillion tokens.
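The two efficiency ratios are easier to compare when expressed as reduction factors. The following is a small sketch of that arithmetic only; the constant and function names are illustrative, and the ratios are those reported for V4-Pro relative to V3.2 at a 1M-token context.

```python
# Ratios reported for V4-Pro relative to V3.2 at a 1M-token context.
FLOPS_RATIO = 0.27     # inference FLOPs
KV_CACHE_RATIO = 0.10  # KV cache size


def reduction_factor(ratio: float) -> float:
    """Convert 'new is `ratio` of old' into 'new is N times smaller'."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return 1.0 / ratio


print(f"FLOPs: about {reduction_factor(FLOPS_RATIO):.1f}x fewer")
print(f"KV cache: about {reduction_factor(KV_CACHE_RATIO):.0f}x smaller")
```

In other words, the reported figures correspond to roughly 3.7 times fewer inference FLOPs and a KV cache about 10 times smaller.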
Post-training follows a two-stage approach: domain-specific experts are first trained with supervised fine-tuning (SFT) and GRPO reinforcement learning, then merged into a single model through online distillation. V4-Pro-Max (the highest inference mode) is claimed to be the strongest open-source model, with top-tier coding benchmark results and a significantly narrower gap to closed-source frontier models on reasoning and agent tasks. V4-Flash-Max reaches Pro-level reasoning performance when given a sufficient compute budget, but its smaller parameter count limits it on pure knowledge and complex agent tasks. Weights are stored in mixed FP4 and FP8 precision.
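For readers unfamiliar with GRPO, its central idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates that group-relative advantage as it is commonly described, not DeepSeek's V4 training code: the function name, the sample rewards, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    GRPO samples several responses per prompt and scores each one
    relative to the group mean, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps (also an assumption) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, which is the signal the policy update then uses.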