律动BlockBeats|Jun 27, 2026 14:02
DeepSeek open-sources inference acceleration framework DeepSpec and launches DSpark, speeding up V4 models by up to 85%
According to monitoring by Beating, DeepSeek and Peking University have jointly released a technical report on DSpark, a speculative sampling acceleration framework, and open-sourced the full-stack code repository DeepSpec. DSpark is already deployed in the DeepSeek V4 online service. While keeping the output lossless, it raises single-user generation speed by 60% to 85% for the Flash version and by 57% to 78% for the Pro version. It outperforms the original MTP-1 baseline (single-token prediction) and significantly raises overall system throughput under strict latency constraints.
Until now, multi-token speculative sampling has been hard to run in online production. Autoregressive draft models generate too slowly, while parallel draft models predict each position independently, so acceptance rates for the later positions of a long draft are extremely low. If multi-token drafts are verified indiscriminately under high concurrency, the large model wastes substantial compute verifying wrong draft tokens that are bound to be rejected, and overall system throughput collapses. As a result, the industry has mostly been limited to single-token prediction (MTP-1) in production.
DSpark removes this throughput bottleneck under high concurrency. It first uses the DFlash parallel backbone network to generate hidden states and then adds an extremely lightweight Markov head. The Markov head injects dependencies between adjacent tokens serially at very low cost, using one table lookup and one matrix multiplication. The system also integrates a confidence prediction head and a posterior calibration algorithm.
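To make the Markov head idea concrete, here is a minimal sketch of how one table lookup plus one matrix multiplication per position could tie each draft token to the one before it. It does not reproduce the DSpark implementation: the function names, the additive way the lookup vector is mixed into the hidden state, and the greedy argmax are all assumptions chosen for illustration.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def draft_with_markov_head(hidden_states, last_token, markov_table, out_proj):
    """Greedy draft over hidden states that were produced in parallel.

    Hypothetical layout (an assumption, not taken from the report):
      hidden_states: one vector per draft position, computed independently
      markov_table:  one vector per vocabulary token, indexed by the previous token
      out_proj:      vocab x dim matrix mapping a mixed vector to logits
    """
    draft = []
    prev = last_token
    for h in hidden_states:
        prev_vec = markov_table[prev]                 # table lookup on the previous token
        mixed = [a + b for a, b in zip(h, prev_vec)]  # inject the adjacent-token signal
        logits = matvec(out_proj, mixed)              # one matrix multiplication
        prev = argmax(logits)                         # cheap serial step
        draft.append(prev)
    return draft


# Toy setup: three identical hidden states. Without the Markov head every
# position would pick the same token; with it, each token depends on the last.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
table = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # token t points at token (t + 1) % 3
toy_draft = draft_with_markov_head([[0, 0, 0]] * 3, 0, table, identity)
print(toy_draft)
```

In this toy setup the serial part of the work is only the lookup and the small projection, while the expensive hidden states are all available up front, which matches the cost profile the report describes.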
To achieve zero-overhead scheduling that is compatible with production environments and to prevent leakage of future information, the scheduler works asynchronously: it uses the historical predictions from two steps earlier to dynamically determine where to truncate the candidate tokens, so that the large model does not verify high-risk tail tokens under heavy load. Beyond DSpark, the open-source DeepSpec repository also supports open-source large models such as Qwen3 and Gemma. DeepSpec provides a complete Python toolchain covering prompt download, rebuilding the large-model cache, draft-model training, and benchmark evaluation. Developers can use the open-source scripts directly to customize and deploy dedicated acceleration modules for different open-source models locally.
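The asynchronous truncation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DeepSpec code: `truncation_length`, `DelayedScheduler`, the `min_survival` threshold, and the rule of multiplying per-position confidences into a prefix survival probability are all hypothetical. The only detail taken from the article is that the decision relies on predictions from two steps earlier, so the scheduler never waits on the current step's output.

```python
from collections import deque


def truncation_length(confidences, min_survival):
    """Longest draft prefix whose estimated chance of full acceptance
    stays at or above min_survival."""
    survival = 1.0
    length = 0
    for c in confidences:
        survival *= c  # probability that every token so far is accepted
        if survival < min_survival:
            break
        length += 1
    return length


class DelayedScheduler:
    """Chooses the draft length from confidences recorded `delay` steps ago."""

    def __init__(self, delay=2, default_length=1):
        self.delay = delay
        self.default_length = default_length
        self.history = deque(maxlen=delay)  # keeps only the last `delay` steps

    def next_length(self, min_survival):
        # Until enough history exists, fall back to a single draft token.
        if len(self.history) < self.delay:
            return self.default_length
        # history[0] is the oldest retained entry, i.e. `delay` steps back.
        return truncation_length(self.history[0], min_survival)

    def record(self, confidences):
        """Store the confidence head's output for the step that just ran."""
        self.history.append(list(confidences))


scheduler = DelayedScheduler()
chosen = []
for step_confidences in ([0.9, 0.8, 0.5, 0.4], [0.6, 0.5], [0.99, 0.99], [0.7]):
    chosen.append(scheduler.next_length(min_survival=0.5))
    scheduler.record(step_confidences)
print(chosen)
```

Under this sketch, a heavily loaded server would raise `min_survival` so that drafts are cut shorter and fewer low-confidence tail tokens reach the large model, while a lightly loaded one could lower it to verify longer drafts.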