Is NVIDIA Facing New Pressure After the DeepSeek-R1 Crash?
DeepSeek’s AI Breakthrough: Bypassing CUDA, Pushing GPU Optimization to the Limit, and Redefining AI’s Self-Improvement Potential
NVIDIA has just recovered from the nearly $600 billion single-day drop in market value triggered by DeepSeek-R1, but now it’s facing a new challenge.
According to hardware media outlet Tom’s Hardware, DeepSeek has taken optimization to the next level—completely bypassing CUDA and leveraging a lower-level programming language instead.
Breaking the CUDA Barrier?
New details from the DeepSeek-V3 technical report have surfaced, revealing some game-changing optimizations.
An analysis by Mirae Asset Securities Research (a South Korean firm) suggests that DeepSeek-V3 achieves 10 times greater hardware efficiency compared to models like Meta’s, thanks to a radical approach—“rebuilding everything from the ground up.”
During the training of DeepSeek-V3 on NVIDIA’s H800 GPUs, the researchers customized 20 out of the 132 Streaming Multiprocessors (SMs) to handle inter-server communication rather than computation. This effectively bypassed hardware limitations on communication speed.
This optimization was achieved using NVIDIA’s PTX (Parallel Thread Execution) language instead of CUDA.
Why PTX Instead of CUDA?
PTX operates at a level much closer to assembly language, enabling fine-grained optimizations such as register allocation and thread/warp-level adjustments. CUDA is the industry-standard high-level programming model for NVIDIA GPUs; PTX sits one layer below it and permits far more aggressive performance tuning.
However, PTX is notoriously complex and difficult to maintain, which is why most developers stick to CUDA. DeepSeek’s approach represents an extreme level of optimization.
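To make the distinction concrete, here is a minimal, hypothetical sketch (not DeepSeek’s code, which has not been released) of how a developer can drop below CUDA C++ and issue a PTX instruction directly through inline assembly:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical illustration only: a kernel that issues a single PTX instruction
// (a fused multiply-add) via inline assembly instead of plain CUDA C++.
__global__ void fma_kernel(const float* a, const float* b, const float* c,
                           float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // fma.rn.f32: fused multiply-add, round-to-nearest, 32-bit floats.
        // "=f" binds r to a float register; "f" binds each input operand.
        asm volatile("fma.rn.f32 %0, %1, %2, %3;"
                     : "=f"(r)
                     : "f"(a[i]), "f"(b[i]), "f"(c[i]));
        out[i] = r;
    }
}

int main() {
    const int n = 256;
    float *a, *b, *c, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 3.0f; }

    fma_kernel<<<1, n>>>(a, b, c, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect 1*2 + 3 = 5

    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out);
    return 0;
}
```

Even this one-instruction example hints at the maintenance cost: every such snippet hard-codes register constraints and instruction variants that the CUDA compiler would otherwise manage automatically.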
Some industry insiders even speculate that the only people who would go through the trouble of replacing CUDA with PTX are former quantitative traders.
Is CUDA No Longer a Moat?
An Amazon engineer raised a critical question:
“Is CUDA still a competitive moat if top-tier AI labs can effectively leverage any GPU?”
This led to further speculation—what if DeepSeek were to release an open-source alternative to CUDA?
Would that shake up the AI hardware ecosystem?
Did DeepSeek Really Bypass CUDA?
It’s important to clarify that PTX is still an integral part of NVIDIA’s GPU architecture. It serves as an intermediate representation in the CUDA programming model, connecting high-level CUDA code with low-level GPU instructions.
CUDA code is first compiled into PTX, which is then translated into machine code (SASS – Streaming Assembler). CUDA provides a high-level interface and toolchain, simplifying development, while PTX acts as a bridge between high-level programming and low-level execution.
This two-step compilation process ensures that CUDA programs remain portable across different GPU architectures.
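As a rough illustration of that two-step pipeline, the hypothetical snippet below shows a trivial kernel together with the usual commands for emitting its PTX and inspecting the resulting SASS (file and kernel names are made up):

```cuda
// vec_add.cu -- a trivial kernel used only to illustrate the compilation path.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Typical workflow (shown as comments):
//   nvcc --ptx vec_add.cu -o vec_add.ptx    -> emit the PTX intermediate representation
//   nvcc -arch=sm_90 -cubin vec_add.cu      -> compile for one specific GPU architecture
//   cuobjdump --dump-sass vec_add.cubin     -> inspect the final SASS machine code
// Because PTX is architecture-neutral, the driver can JIT-compile the same PTX
// into SASS for whichever GPU the program actually runs on.
```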
On the other hand, writing PTX directly, as DeepSeek has done, is extremely complex and hard to port across different GPU models. Some industry experts note that optimizations tailored for H100 GPUs may not work as effectively—or at all—on other hardware.
Thus, DeepSeek’s PTX-level optimizations don’t mean they have fully abandoned the CUDA ecosystem, but they do indicate a strong ability to optimize for different GPUs.
In fact, DeepSeek has already partnered with AMD and Huawei, ensuring early support for alternative hardware ecosystems.
One More Thing: Can AI Optimize Itself?
Some experts suggest that AI mastering assembly-level programming could be a major step toward self-improving AI.
It’s unclear whether DeepSeek used AI assistance to write its PTX code, but there is already evidence that DeepSeek-R1 can write low-level optimization code for AI workloads on its own.
A recent pull request (PR) in the Llama.cpp project introduced a SIMD (Single Instruction, Multiple Data) optimization that significantly boosted WebAssembly performance for certain matrix operations. The PR’s author stated:
“99% of this code was written by DeepSeek-R1. All I did was test and refine the prompts.”
The founder of Llama.cpp reviewed the AI-generated code and commented:
“It’s more mind-blowing than expected.”
References: