The Insane Optimizations of DeepSeek V4
I spent the past few days reading the DeepSeek V4 paper, and the design is genuinely beautiful. The team built a 1.6-trillion-parameter model with a 1-million-token context window that matches the top closed models, and they did it while heavily resource-constrained: a team forty times smaller than OpenAI's, with no access to the top NVIDIA chips. The ridiculous thing is that they open-sourced everything, including a paper that reveals the infrastructure details the closed labs treat as top secret.
The fundamental problem with large language models is that every time the model reads a new token, it has to compare it against every token before it. The total number of comparisons therefore grows with the square of the sequence length, and at a million tokens it becomes astronomical. The GPU memory needed to store the intermediate results, called the KV cache, explodes into gigabytes just to maintain context for a single conversation. The standard approach has been to throw more compute at the problem. DeepSeek did not have that luxury, so they asked a different question: what if the model did not have to look at everything in the first place?
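To put rough numbers on the KV cache and comparison blow-up described above, here is a back-of-the-envelope sketch. The layer count, head count, and head size are hypothetical values for a generic dense transformer, chosen only for illustration; they are not DeepSeek V4's configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


def causal_comparisons(seq_len):
    """Query-key comparisons when token i attends to itself and all earlier tokens."""
    return seq_len * (seq_len + 1) // 2


# Hypothetical model shape (an assumption, NOT taken from the paper); 2 bytes = 16-bit values.
HYPOTHETICAL = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for tokens in (4_096, 65_536, 1_048_576):
    gib = kv_cache_bytes(tokens, **HYPOTHETICAL) / 2**30
    print(f"{tokens:>9,} tokens: {gib:7.2f} GiB cache, "
          f"{causal_comparisons(tokens):.2e} comparisons per head per layer")
```

The cache grows linearly with context length while the comparison count grows quadratically, which is why both become a problem at a million tokens.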
The result is a hybrid attention architecture with three parallel pathways that work together. The first is compressed sparse attention, which groups tokens into small chunks and merges their information into a single denser representation. Instead of remembering four individual tokens, it stores a compact summary of all four, which cuts the sequence length by a factor of four. The second is heavily compressed attention, which is far more aggressive: it groups something like 128 tokens, roughly an entire paragraph, into a single representation. At that point the sequence is so short that the model can afford to look at everything at once. The third is sliding window attention, which keeps the most recent tokens completely uncompressed, with full, exact fidelity. Between the compressed and heavily compressed layers there is something called the Lightning Indexer, which acts like a built-in search engine: it rapidly scores all the compressed blocks and selects only the small subset most likely to matter for the current context. Everything else is skipped entirely. It is the difference between trying to remember everything perfectly and trying to remember the right things at the right time.
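The compress, score, and select idea above can be sketched in a few lines of plain Python. This is a toy under loud assumptions: mean pooling stands in for the learned compression, a plain dot product stands in for the Lightning Indexer's scoring, and each vector serves as both key and value. None of the function names come from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool_blocks(vectors, block):
    """Merge each run of `block` token vectors into one averaged vector."""
    pooled = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def top_k_blocks(query, block_keys, k):
    """Indexer stand-in: score every compressed block, keep the k best."""
    ranked = sorted(range(len(block_keys)),
                    key=lambda i: dot(query, block_keys[i]), reverse=True)
    return sorted(ranked[:k])


def softmax_attend(query, keys, values):
    """Ordinary scaled dot-product attention for a single query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def hybrid_attend(query, tokens, block=4, k=2, window=4):
    """Attend to k selected compressed blocks plus the last `window` raw tokens."""
    older, recent = tokens[:-window], tokens[-window:]
    compressed = mean_pool_blocks(older, block) if older else []
    chosen = top_k_blocks(query, compressed, k)
    context = [compressed[i] for i in chosen] + recent
    return softmax_attend(query, context, context), chosen


# 16 older tokens (four blocks of four) followed by 4 recent tokens.
tokens = ([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 4
          + [[0.0, 2.0]] * 4 + [[0.5, 0.5]] * 4)
output, chosen = hybrid_attend([0.0, 1.0], tokens)
print(chosen, output)
```

In this toy the query ends up attending to 6 vectors (2 selected blocks plus 4 recent tokens) instead of all 20. Calling `mean_pool_blocks` with `block=128` and attending to every resulting block would be the analogous stand-in for the heavily compressed pathway.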
But solving the attention problem still leaves the problem of scale. When you stack dozens of layers at the trillion-parameter scale, the signals flowing through the network start to amplify like a microphone placed too close to a speaker, and you get catastrophic feedback loops that crash the training run. The industry-standard way to handle this has been residual connections, which act as bypass lanes, but DeepSeek’s paper explicitly mentions that at over a trillion parameters even the newer hyperconnections start experiencing these spikes. So they introduced manifold-constrained hyperconnections, which force the residual mixing to behave like a doubly stochastic matrix: every entry is non-negative, every row sums to one, and every column also sums to one. Each output is then just a weighted average of the inputs, so the total signal is always conserved and can never amplify, because the math literally forbids it. To apply this constraint at scale they used the Sinkhorn-Knopp algorithm, which runs about twenty iterations of row and column normalizations before each layer. Through aggressive low-level GPU optimization, they shrank the overhead of this entire process to only 6.7 percent of runtime.
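For intuition, here is a minimal Sinkhorn-Knopp sketch in plain Python. The alternating row and column normalization is the standard algorithm; exponentiating raw scores to get positive entries, the 4x4 size, and the `mix_streams` helper are assumptions made for this example, not details from the paper.

```python
import math


def sinkhorn_knopp(logits, n_iters=20):
    """Push a square matrix of raw scores toward a doubly stochastic matrix."""
    # Exponentiate so every entry is strictly positive (an assumed parametrization).
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(n_iters):
        # Row step: scale each row so it sums to one.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: scale each column so it sums to one.
        col_sums = [sum(col) for col in zip(*m)]
        m = [[x / s for x, s in zip(row, col_sums)] for row in m]
    return m


def mix_streams(matrix, streams):
    """Mix residual streams: output i is a weighted combination of all inputs."""
    dim = len(streams[0])
    return [[sum(w * s[d] for w, s in zip(row, streams)) for d in range(dim)]
            for row in matrix]


logits = [[0.5, -0.3, 0.1, 0.0],
          [0.2, 0.4, -0.5, 0.1],
          [-0.1, 0.0, 0.3, 0.6],
          [0.0, 0.2, -0.2, -0.4]]
mixing = sinkhorn_knopp(logits)

streams = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0], [0.0, 1.0]]
mixed = mix_streams(mixing, streams)
print([round(sum(row), 6) for row in mixing])        # row sums, each close to 1
print([round(sum(col), 6) for col in zip(*mixing)])  # column sums, each close to 1
```

Because the columns sum to one, the per-dimension total across streams is the same before and after mixing. Because the rows are non-negative and sum to one, no mixed value can exceed the largest input in magnitude.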
They also replaced the industry-standard AdamW optimizer with one called Muon. It works in two phases: it first makes big, rough adjustments to get the system close to convergence, then switches to tiny, precise tweaks. This combination lets the model learn faster while staying stable.
At the infrastructure level, the model is so large that it cannot live on one chip or even one rack, so the layers have to be scattered across different racks and the bottleneck becomes communication rather than computation. DeepSeek solved this by breaking each data transfer into smaller sequential waves. As soon as the data for the first wave arrives, the GPUs start crunching while the data for the next wave is already traveling over the network. Computation and communication overlap almost perfectly, so the network latency essentially disappears. They wrote this code in TileLang using fused kernels and used the Z3 SMT solver to mathematically prove the kernel code correct, because at this scale even a one-in-a-billion error happens constantly.
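The wave-based overlap can be imitated on a laptop with a background thread and a one-slot queue. This is only an analogy: `time.sleep` stands in for both the network transfer and the GPU work, and every name below is hypothetical. The real system uses fused GPU kernels, not Python threads.

```python
import queue
import threading
import time


def fetch_waves(waves, inbox, link_delay):
    """Producer: simulate the network delivering one wave at a time."""
    for wave in waves:
        time.sleep(link_delay)   # stand-in for transfer time
        inbox.put(wave)
    inbox.put(None)              # sentinel: no more waves


def compute(wave, compute_delay):
    time.sleep(compute_delay)    # stand-in for GPU work
    return sum(wave)


def run_overlapped(waves, link_delay=0.02, compute_delay=0.02):
    """Compute on wave i while wave i+1 is still in flight."""
    inbox = queue.Queue(maxsize=1)  # at most one wave buffered ahead of compute
    sender = threading.Thread(target=fetch_waves, args=(waves, inbox, link_delay))
    sender.start()
    results = []
    while (wave := inbox.get()) is not None:
        results.append(compute(wave, compute_delay))
    sender.join()
    return results


def run_sequential(waves, link_delay=0.02, compute_delay=0.02):
    """Baseline: wait for each transfer to finish, then compute."""
    results = []
    for wave in waves:
        time.sleep(link_delay)
        results.append(compute(wave, compute_delay))
    return results


waves = [[1, 2], [3, 4], [5, 6], [7, 8]]
for runner in (run_sequential, run_overlapped):
    start = time.perf_counter()
    result = runner(waves)
    print(runner.__name__, result, f"{time.perf_counter() - start:.3f}s")
```

Both versions produce the same results. When transfer and compute take similar time, the overlapped version should finish in roughly one transfer plus all the compute, rather than the sum of every transfer and every compute step.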
The training itself used a curriculum: the model started with short sequences of four thousand tokens to learn grammar and syntax, then gradually stretched to sixteen thousand, sixty-four thousand, and all the way up to the full million-token window. They also introduced something called anticipatory routing, which kicks in whenever it detects the early signs of a loss spike and uses slightly older snapshots of the model’s parameters to look past the noisy fluctuations and lock onto the underlying trend.
The result is a model that achieves a perfect score of 120 out of 120 on the Putnam 2025 mathematics competition, one of the most difficult undergraduate math competitions in the world. It matches or beats models from Google, Anthropic, and OpenAI across knowledge, reasoning, and agentic benchmarks, while running on roughly 27 percent of the compute required by their previous version and needing only 10 percent of the KV cache memory. And the team put the model on Hugging Face for free and published the paper with all the infrastructure details. That is the part that is hardest to believe: a team that has every reason to hoard its advantages just gave everything away.
If this lands, find me on Twitter. I am @troysk704.