China is evidently carving its own path in the AI hardware sphere, reducing its reliance on NVIDIA's export-limited AI accelerators. The clearest example yet is DeepSeek's latest release, which squeezes a spectacular leap out of the cut-down Hopper H800: BF16 throughput roughly eight times the figure DeepSeek cites as the industry average.
DeepSeek’s Innovative FlashMLA: Extracting Maximum Efficiency from NVIDIA’s Trimmed Hopper GPUs
China seems to be setting its sails for a tech revolution, moving away from dependence on global tech giants and tapping homegrown ingenuity instead. DeepSeek, a standout player, is optimizing software to wring extraordinary performance out of the hardware available to it. Its latest project extracts exceptional mileage from NVIDIA's "cut-down" Hopper H800 GPUs, and the breakthrough hinges on smart management of memory consumption and resource allocation during inference.
🚀 Day 1 of #OpenSourceWeek: FlashMLA
Honored to share FlashMLA – our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS…
— DeepSeek (@deepseek_ai) February 24, 2025
To give you some context, DeepSeek has kicked off "#OpenSourceWeek" to roll out technology advancements that anyone can freely access via its GitHub repositories. It has started with a bang: FlashMLA, a decoding kernel tailored to NVIDIA's Hopper GPUs and, per the announcement, already running in production. The performance figures it promises have set the tech scene abuzz.
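Before unpacking those figures, it helps to see where the kernel sits: it replaces the attention-decoding step of an MLA model, reading queries and a paged KV cache. The sketch below is adapted from the repository's published usage example; the shapes are illustrative placeholders, and running it assumes a Hopper-class GPU with the flash_mla package built.

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache  # from the FlashMLA repo

# Illustrative decode-step shapes (placeholders, not DeepSeek's production config).
b, s_q = 16, 1                        # batch of 16 sequences, one new token each
h_q, h_kv = 128, 1                    # query heads; MLA shares a single latent KV head
d, dv = 576, 512                      # Q/K head dim and V head dim
block_size, blocks_per_seq = 64, 64   # paged KV cache: 64-token blocks

cache_seqlens = torch.full((b,), 4096, dtype=torch.int32, device="cuda")
# block_table maps each sequence's logical blocks to physical pages in the cache pool.
block_table = torch.arange(b * blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(b, blocks_per_seq)
kvcache = torch.randn(b * blocks_per_seq, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")
q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")

# Scheduling metadata is computed once per decode step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)
out, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
```

The two calls also preview the design: scheduling metadata is split from the kernel launch so its cost is paid once per step, and the cache is addressed through a block table rather than one contiguous buffer per sequence.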
DeepSeek asserts that FlashMLA reaches 580 TFLOPS of BF16 compute on the Hopper H800 in compute-bound configurations, roughly eight times what it cites as the industry norm. On top of that, by making the most of memory movement, FlashMLA reports an effective bandwidth of up to 3000 GB/s in memory-bound configurations, nearly double the H800's theoretical HBM peak. Remarkably, this leap comes from software optimization alone, with no hardware changes.
This is crazy.
-> Blazing fast: 580 TFLOPS on H800, ~8x industry avg (73.5 TFLOPS).
-> Memory wizardry: Hits 3000 GB/s, surpassing H800's 1681 GB/s peak.
— Visionary x AI (@VisionaryxAI) February 24, 2025
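Taking the figures in these posts at face value, the ratios check out; note that the baseline of 73.5 TFLOPS comes from the commentary above, not an independent benchmark:

```python
# Sanity-checking the ratios quoted above (figures from the two posts).
flashmla_tflops = 580.0
industry_avg_tflops = 73.5
print(f"compute: {flashmla_tflops / industry_avg_tflops:.1f}x")               # -> 7.9x

effective_gbps = 3000.0
h800_hbm_peak_gbps = 1681.0
print(f"bandwidth vs HBM peak: {effective_gbps / h800_hbm_peak_gbps:.2f}x")   # -> 1.78x
```

So "eight times" works out to about 7.9x against that baseline, and the bandwidth figure exceeds the H800's HBM peak by about 1.8x, which is only possible as an effective number: data served from on-chip caches and shared memory never touches DRAM, so the kernel can "deliver" more bytes per second than the memory bus physically carries.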
DeepSeek’s FlashMLA incorporates “low-rank key-value compression.” Simply put, instead of caching full-size key and value tensors for every attention head, it stores a compressed low-dimensional latent representation and re-expands it only when needed, which boosts processing speed and slashes memory use by a reported 40% to 60%. The block-based paging system is another clever touch: the KV cache is carved into fixed 64-token blocks that are allocated on demand, so variable-length sequences occupy only the blocks they actually need rather than reserving worst-case memory.
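Neither technique requires the actual kernel to understand. Here is a minimal plain-PyTorch sketch of both ideas (my illustration, not DeepSeek's code): a low-rank down-projection whose small latent is what gets cached, plus a page allocator that hands out fixed 64-token blocks on demand. All dimensions are made up for clarity.

```python
import torch

# --- Low-rank KV compression (MLA-style, toy dimensions) ---
d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512
W_down = torch.randn(d_model, d_latent) / d_model**0.5            # compress
W_up_k = torch.randn(d_latent, n_heads * d_head) / d_latent**0.5  # expand at use

h = torch.randn(1, d_model)                  # one token's hidden state
latent = h @ W_down                          # cache THIS: 512 floats per token
k = (latent @ W_up_k).view(n_heads, d_head)  # keys rebuilt only when needed

full_kv_floats = 2 * n_heads * d_head        # per token, uncompressed K and V
# ~94% at these toy dims; the 40-60% reported above is an end-to-end figure.
print(f"cache shrink: {1 - d_latent / full_kv_floats:.0%}")

# --- Block-based paging (block size 64, as in FlashMLA) ---
BLOCK = 64

class PagedCache:
    """Fixed-size pages: a sequence of length L occupies ceil(L / 64) pages,
    so short sequences never reserve worst-case memory."""
    def __init__(self, num_pages: int, dim: int):
        self.pool = torch.zeros(num_pages, BLOCK, dim)   # physical page pool
        self.free = list(range(num_pages))               # free-page list
        self.tables: dict[int, list[int]] = {}           # seq id -> its pages

    def append(self, seq: int, token_idx: int, vec: torch.Tensor):
        table = self.tables.setdefault(seq, [])
        if token_idx // BLOCK == len(table):             # crossed into a new page
            table.append(self.free.pop())
        self.pool[table[token_idx // BLOCK], token_idx % BLOCK] = vec

cache = PagedCache(num_pages=128, dim=d_latent)
for t in range(70):                                      # 70 tokens -> 2 pages
    cache.append(seq=0, token_idx=t, vec=torch.randn(d_latent))
print(len(cache.tables[0]), "pages used")                # -> 2
```

In the real kernel these pages live in GPU memory and are addressed through the block table shown earlier, but the bookkeeping is the same idea.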
What DeepSeek is doing reflects a broader narrative in AI computing—one where innovation isn’t single-threaded but instead weaves through numerous paths. The initial focus is on optimizing Hopper GPUs with FlashMLA, but the horizon promises even more. It’s bound to be intriguing to see what possibilities FlashMLA will unlock when applied to the H100.