The cheaper the AI, the more expensive the chips.


Original source: Wall Street View

On June 30, Anthropic released Claude Sonnet 5.

This is a mid-range model, billed as the "most capable" in the Sonnet series. It scored 63.2 on SWE-bench Pro, an agentic software-engineering benchmark, just 6 points behind the flagship Opus 4.8's 69.2. On another measure, the graduate-level reasoning test GPQA-AAA v2, Sonnet 5 outperformed Opus 4.8.

Pricing matters even more. During the promotional period, Sonnet 5 costs $2 per million input tokens and $10 per million output tokens, versus $5 and $25 for Opus 4.8. In other words, Sonnet 5 delivers over 90% of the flagship's capability at 40% of the price.
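The ratios behind that claim can be checked directly. Below is a minimal sketch using only the benchmark scores and list prices quoted in this article; treating the SWE-bench Pro score ratio as a model's "share of capability" is a simplifying assumption, not an official metric.

```python
# Figures quoted in the article (prices in US$ per million tokens).
sonnet_5 = {"input_price": 2.0, "output_price": 10.0, "swe_bench_pro": 63.2}
opus_4_8 = {"input_price": 5.0, "output_price": 25.0, "swe_bench_pro": 69.2}


def share(key: str) -> float:
    """Sonnet 5's value as a fraction of Opus 4.8's value for the given metric."""
    return sonnet_5[key] / opus_4_8[key]


# Simplifying assumption: the SWE-bench Pro score ratio stands in for "share of capability".
capability_share = share("swe_bench_pro")
input_price_share = share("input_price")
output_price_share = share("output_price")

print(f"capability share:   {capability_share:.1%}")
print(f"input price share:  {input_price_share:.0%}")
print(f"output price share: {output_price_share:.0%}")
```

Both the input and output prices work out to exactly 40% of Opus 4.8's, while the benchmark ratio is about 91%.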

This news can be interpreted in two ways.

The first interpretation: AI has become cheaper again. The decline in costs benefits everyone; the chatbot war continues, and model vendors are in fierce competition.

The second interpretation, which is also what the market is pricing in: as models get cheaper, computing power and storage are actually getting more expensive.

On the day Claude Sonnet 5 was released, the US semiconductor index rose nearly 4%. Over the past three years, one thread has run clearly through the AI narrative: gains in inference efficiency would wipe out chip demand. Yet that call has been wrong at every data point.

Price Drop: A Thousand-Fold Reduction in Three Years

First, consider the price trend.

In 2022, GPT-4-level API calls cost about $0.03 per thousand tokens. By 2025, according to the Stanford AI Index Report, the price of models with equivalent performance had fallen roughly 280-fold. Factoring in the combined effects of open-source models and efficiency gains, the commonly cited decline is about 1,000-fold.
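To make the fold-changes concrete, the sketch below converts the $0.03-per-thousand baseline into dollars per million tokens and applies both reduction factors to it. Applying the 280-fold and 1,000-fold figures to this one baseline is an illustrative assumption; the underlying reports may measure from different starting points.

```python
# Baseline from the article: about $0.03 per thousand tokens for GPT-4-level API calls.
BASELINE_PER_THOUSAND = 0.03
baseline_per_million = BASELINE_PER_THOUSAND * 1_000  # $30 per million tokens


def price_after_drop(per_million: float, fold: float) -> float:
    """Price per million tokens after an N-fold reduction."""
    return per_million / fold


# Illustrative assumption: both reduction factors are applied to the same baseline.
for label, fold in [("Stanford AI Index figure", 280), ("commonly cited figure", 1_000)]:
    new_price = price_after_drop(baseline_per_million, fold)
    print(f"{label}: {fold}x cheaper -> ${new_price:.3f} per million tokens")
```

A 1,000-fold drop lands at about $0.03 per million tokens, the same range as the $0.035 per million input tokens the article quotes for DeepSeek-V4-Pro.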

The price reductions are not limited to one model; every company is reducing prices.

Anthropic's Sonnet 5, which comes close to Opus 4.8 in capability, costs only 40% as much. Google's Gemini Omni Flash video generation costs $0.10 per second, while the Nano Banana 2 Lite image model produces an image in 4 seconds for only $0.034 per thousand images, half the cost of the previous generation. DeepSeek-V4-Pro has cut input prices to $0.035 per million tokens.

Cost reductions are also happening beyond the price lists.

On June 24, The Information reported that OpenAI had developed a purely software-based optimization internally that cut GPU demand for one computational step by more than half, reducing the number of dedicated GPUs from thousands to hundreds. The same month, Meta proposed its Vistara plan: reusing DDR4 memory from decommissioned servers, attached through in-house CXL chips and paired with DDR5 at a 3:1 ratio, to cut inference server costs by 25%.

On June 30, Step released JetSpec, a speculative decoding technique that could raise large-model inference speed by nearly 10 times. By that arithmetic, the number of GPUs needed to produce the same volume of tokens could fall by a full order of magnitude.
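The article does not describe how JetSpec itself works. As a generic illustration of the technique family it belongs to, here is a minimal sketch of greedy speculative decoding. The two "models" are hypothetical deterministic toy functions standing in for real networks: a cheap draft model proposes k tokens, and the expensive target model checks them, keeping the longest agreeing prefix plus one token of its own. The output is identical to decoding with the target alone, but the target runs once per round instead of once per token.

```python
from typing import Callable, List, Tuple

# A "model" maps a token prefix to its greedy next token (a stand-in for argmax over logits).
Model = Callable[[List[int]], int]


def target_model(seq: List[int]) -> int:
    # Hypothetical large model: a deterministic toy rule.
    return (sum(seq) * 7 + len(seq)) % 10


def draft_model(seq: List[int]) -> int:
    # Hypothetical small model: agrees with the target except when the prefix length is a multiple of 5.
    guess = target_model(seq)
    return (guess + 1) % 10 if len(seq) % 5 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target scores all k + 1 positions; on a GPU this is one batched
        #    forward pass, so we count it as a single verification round.
        rounds += 1
        for i in range(k + 1):
            expected = target(seq + proposal[:i])
            if i < k and proposal[i] == expected:
                continue  # draft token matches the target's choice: accept it
            # First mismatch (or all k accepted): keep the accepted prefix,
            # then emit the target's own token for position i.
            seq.extend(proposal[:i])
            seq.append(expected)
            break
    return seq[:goal], rounds


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 40)
fast, passes = speculative_decode(target_model, draft_model, prompt, 40, k=4)
print(fast == baseline, passes)
```

In this toy setup the draft disagrees at most once per round, so 40 new tokens take 9 verification rounds instead of 40 target calls. Real speedups depend on how often the draft agrees with the target and on the relative cost of the two models.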

If AI followed a conventional cost-demand function, these signals would all point to one conclusion: chip demand will shrink.

Wall Street is terrified of this.

In January 2025, after DeepSeek released R1, AI infrastructure stocks suffered one of their steepest sell-offs in years, and AI cloud company Nebius's stock plunged 40%. The logic seemed simple: if Chinese open-source models sell tokens for $0.10 while US vendors charge $2, demand for computing power must collapse.

Explosion: Total Spending Rose 3.2-Fold

But what actually happened is the complete opposite.

Nebius co-founder Roman Chernin later recalled that the week of the DeepSeek panic "was possibly our best sales week." When costs suddenly dropped, corporate procurement departments' first reaction was not to cut budgets but to finally run inference at scale.

In 2024, global corporate spending on generative AI was about $11.5 billion. In 2025 it soared to $37 billion, a 3.2-fold increase (about 220%) in a single year. According to Menlo Ventures' research on enterprises, the median company was running "dozens of" AI applications by 2025, compared with one or two in 2023.

Data from many angles trace the same curve:

Uber had burned through its entire annual AI budget by April 2026. AT&T now processes 27 billion tokens a day, up from 800 million 18 months ago. A major US healthcare company's monthly token consumption jumped from 3 million to over 150 million.
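The growth figures in the last few paragraphs are easier to compare as multiples and, where a time span is given, as implied doubling times. A minimal sketch, assuming steady exponential growth over each span; the sources report only the endpoints, and no span is given for the healthcare company.

```python
import math
from typing import Optional, Tuple


def growth_stats(start: float, end: float,
                 months: Optional[float]) -> Tuple[float, Optional[float]]:
    """Return (growth multiple, implied doubling time in months or None if no span is given)."""
    multiple = end / start
    if months is None:
        return multiple, None
    # Assumes constant exponential growth: multiple = 2 ** (months / doubling_time).
    return multiple, months / math.log2(multiple)


# Figures quoted in the article; None where no time span is stated.
cases = {
    "Enterprise gen-AI spend, 2024 -> 2025 ($B)": (11.5, 37.0, 12),
    "AT&T daily tokens, over 18 months": (0.8e9, 27e9, 18),
    "US healthcare firm, monthly tokens": (3e6, 150e6, None),
}

for name, (start, end, months) in cases.items():
    multiple, doubling = growth_stats(start, end, months)
    line = f"{name}: {multiple:.1f}x (+{multiple - 1:.0%})"
    if doubling is not None:
        line += f", doubling every {doubling:.1f} months"
    print(line)
```

On these figures, enterprise spending rose about 3.2-fold in one year (a doubling time of roughly seven months), and AT&T's token volume implies a doubling time of roughly three and a half months.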

Broken down, the growth comes from three overlapping sources.

The first is application diffusion. Within a company, marketing might use 3 AI tools, sales 4 and customer service 2, plus legal, HR and finance; going from 2 applications to dozens is a leap in quantity.
The second is the depth of each application. Take customer-service AI: in 2023 it handled about 500 interactions a day, each about 800 tokens, and each interaction ended when the conversation did. By 2025 it handles 15,000 interactions a day, each about 4,500 tokens, and each interaction also triggers 3 to 5 follow-up inferences (sentiment analysis, escalation prediction, quality scoring), all stacked on the same entry point.
The third is the rising complexity of the models themselves. Moving from single-turn 7B-parameter models to multi-step reasoning agents at 70B parameters and above, each round of internal reasoning consumes tens to hundreds of times as many tokens as a simple linear exchange.
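The customer-service example can be worked through numerically. The sketch below uses only the figures quoted in the text; because the size of the follow-up inferences is not given, it counts their tokens as zero and tracks them only as extra inference calls.

```python
def daily_tokens(interactions: int, tokens_per_interaction: int) -> int:
    """Tokens processed per day for the primary conversation only."""
    return interactions * tokens_per_interaction


# Figures from the customer-service example.
tokens_2023 = daily_tokens(500, 800)
tokens_2025 = daily_tokens(15_000, 4_500)  # excludes follow-up inferences (their size is not given)

# Inference calls per day: one per interaction in 2023; 1 + (3 to 5) follow-ups per interaction in 2025.
calls_2023 = 500 * 1
calls_2025_low = 15_000 * (1 + 3)
calls_2025_high = 15_000 * (1 + 5)

print(f"tokens/day: {tokens_2023:,} -> {tokens_2025:,} ({tokens_2025 / tokens_2023:.0f}x)")
print(f"inference calls/day: {calls_2023:,} -> {calls_2025_low:,}-{calls_2025_high:,} "
      f"({calls_2025_low // calls_2023}x-{calls_2025_high // calls_2023}x)")
```

Even before counting follow-up tokens, daily token volume for this single application grows roughly 170-fold, and inference calls grow 120- to 180-fold.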

In other words, while the cost per token has fallen to a thousandth of its former level, the number of tokens used across the market has grown by tens of thousands of times. The net effect points only one way: an explosion in spending.
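The net effect follows from a simple identity: spending equals token volume times price per token, so spending changes by the volume multiple divided by the price-reduction factor. A minimal sketch; 10,000 and 50,000 are illustrative points within the article's "tens of thousands", not reported figures.

```python
def spend_multiplier(volume_multiplier: float, price_drop_fold: float) -> float:
    """Spending = volume x unit price, so spend changes by volume_multiplier / price_drop_fold."""
    return volume_multiplier / price_drop_fold


# Illustrative volumes within "tens of thousands of times"; price per token down 1,000-fold.
for volume in (10_000, 50_000):
    print(f"volume x{volume:,}, price /1,000 -> spend x{spend_multiplier(volume, 1_000):.0f}")
```

Spending rises whenever volume grows faster than unit price falls; with a 1,000-fold price cut, any volume increase above 1,000-fold means higher total spend.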

Token consumption is doubling every two months, a figure that many independent data points converge on. Extrapolate that exponential curve to 2027, and annual corporate AI spending above $100 billion becomes a matter of arithmetic, not prediction.
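The arithmetic can be spelled out. A minimal sketch, assuming a constant two-month doubling time for tokens and smooth compound growth for spending, starting from the $37 billion 2025 figure quoted earlier (both are simplifying assumptions).

```python
# Doubling every two months means 12 / 2 = 6 doublings per year.
DOUBLING_MONTHS = 2
annual_token_multiple = 2 ** (12 / DOUBLING_MONTHS)  # 64x per year

# Spending: $37B in 2025 to $100B in 2027 is two years of compound growth.
spend_2025, target_2027 = 37.0, 100.0
required_annual_growth = (target_2027 / spend_2025) ** (1 / 2)

print(f"token volume: x{annual_token_multiple:.0f} per year")
print(f"spending growth needed to reach $100B by 2027: x{required_annual_growth:.2f} per year")
```

Reaching $100 billion by 2027 requires spending to grow only about 1.64x a year, well below the roughly 3.2x jump from 2024 to 2025.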

Transmission: Memory Prices Up Sixfold, Chip Infrastructure Heading for $7.6 Trillion

The demand unleashed by price cuts has not stayed at the software layer.

Rising memory prices are the most direct signal of AI demand passing through from the model layer to the hardware layer.

Since the third quarter of 2025, spot prices for DRAM and NAND Flash have risen more than 300% cumulatively, and DDR5 chips saw monthly gains above 90%. In 2026 the increases did not stop; they accelerated.

For the first quarter, expected DRAM contract price increases were revised up from 55-60% to 90-95%, and NAND from 33-38% to 55-60%. For the second quarter, TrendForce forecasts further increases of 58-63% for DRAM and 70-75% for NAND.

Take a consumer product as a benchmark: Acer Predator 32GB DDR5-6000 memory kits were priced at 1,300 yuan at the end of October 2025 and had soared to 2,700 yuan by January 2026. A doubling within three months is extremely rare in the consumer market.
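Successive quarterly increases compound multiplicatively rather than adding up. A minimal sketch, assuming the first- and second-quarter contract-price ranges quoted above apply in sequence to the same price (the sources give only the ranges), alongside the retail kit's move from 1,300 to 2,700 yuan.

```python
def compound(*pct_increases: float) -> float:
    """Overall price multiple after successive percentage increases."""
    multiple = 1.0
    for pct in pct_increases:
        multiple *= 1 + pct / 100
    return multiple


# Contract-price forecasts quoted above: low and high ends of each range, Q1 then Q2.
dram_low, dram_high = compound(90, 58), compound(95, 63)
nand_low, nand_high = compound(55, 70), compound(60, 75)
print(f"DRAM, Q1 then Q2: {dram_low:.2f}x to {dram_high:.2f}x")
print(f"NAND, Q1 then Q2: {nand_low:.2f}x to {nand_high:.2f}x")

# Retail example: Acer Predator 32GB DDR5-6000 kit, 1,300 -> 2,700 yuan.
print(f"Retail kit: {2_700 / 1_300:.2f}x in about three months")
```

If both quarterly forecasts hold, DRAM contract prices would roughly triple over the first half of 2026, and NAND prices would rise about 2.6- to 2.8-fold.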

Samsung's memory business posted a record quarterly operating profit of over 20 trillion won (about 96.2 billion yuan) in the fourth quarter of 2025. The driving force behind this year-long rally is not consumer upgrades of phones or PCs, but massive purchases of HBM, enterprise SSDs and high-density DRAM by AI data centers.

A Goldman Sachs report in May took this calculation to its logical extreme.

The report forecasts cumulative global AI infrastructure capital expenditure of about $7.6 trillion from 2026 to 2031: $765 billion in 2026 alone, climbing to $1.6 trillion by 2031. It prices a standard GPU (based on NVIDIA's VR200 Rubin) at $80,500 and assumes NVIDIA accounts for 75% of total compute spending in every period.
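The two annual endpoints imply a compound annual growth rate, which can be checked against the cumulative total. A minimal sketch; the smooth constant-rate path is an assumption, since the article does not give the report's year-by-year figures.

```python
# Goldman Sachs figures quoted above (US$ billions).
capex_2026, capex_2031 = 765.0, 1_600.0
years = 2031 - 2026

cagr = (capex_2031 / capex_2026) ** (1 / years) - 1  # implied compound annual growth rate

# ASSUMPTION: spending grows at exactly that rate every year from 2026 through 2031.
smooth_path = [capex_2026 * (1 + cagr) ** t for t in range(years + 1)]

print(f"implied CAGR: {cagr:.1%}")
print(f"cumulative 2026-2031 on a smooth path: ${sum(smooth_path) / 1_000:.2f} trillion")
```

The endpoints imply growth of about 16% a year. A constant-rate path sums to roughly $6.85 trillion, below the report's $7.6 trillion, which suggests its annual path is not a simple constant-rate curve; the article does not give the breakdown needed to check further.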

Goldman Sachs' report also raises a key question: If ASICs (application-specific integrated circuits) largely replace GPUs, can total demand be reduced?

The answer depends on how elastic demand is. If demand is inelastic, meaning companies' AI compute needs are fixed, then substituting ASICs can directly cut total capital requirements. But if demand is elastic, meaning that the cheaper compute gets, the more of it companies buy, then changes in the chip mix mainly reshape how profits are split among suppliers rather than the overall scale of spending.

Goldman Sachs' baseline scenario chooses the latter.
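The elastic/inelastic distinction can be made concrete with a textbook constant-elasticity demand curve, Q = scale * price^(-elasticity). This is a standard economics simplification, not Goldman's model, and the scale and elasticity values are hypothetical. Total spending falls when elasticity is below 1, stays flat at 1 and rises above 1.

```python
def total_spend(price: float, elasticity: float, scale: float = 100.0) -> float:
    """Spend = price x quantity under the hypothetical curve Q = scale * price ** (-elasticity)."""
    quantity = scale * price ** (-elasticity)
    return price * quantity


# Compute becomes 50% cheaper per unit (for example, ASICs displacing GPUs).
old_price, new_price = 1.0, 0.5
for elasticity in (0.0, 1.0, 1.5):
    before = total_spend(old_price, elasticity)
    after = total_spend(new_price, elasticity)
    print(f"elasticity {elasticity}: spend changes by {after / before - 1:+.0%}")
```

With perfectly inelastic demand, halving the unit price halves spending; at elasticity 1 spending is unchanged; above 1 it grows. This is the same condition behind the Jevons paradox: efficiency gains raise total consumption only when demand is elastic enough.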

US stocks are moving the same way. SanDisk shares have risen 857% since the start of the year, and Bernstein raised its price target to $3,000 in a June 30 report. AMD rose 7% in a single day to an all-time high. Makers of GPUs, storage, packaging and data center equipment are all trading near record highs.

Edgen.tech's overview article on June 11 cites a particularly striking figure: memory chip prices have risen sixfold over the past year.

This cannot be written off as a "cyclical recovery." A sixfold rise signals a wholesale repricing of demand for AI's physical infrastructure across the entire economy.

Roots: Jevons Answered This in 1865

William Stanley Jevons wrote a book called "The Coal Question" in 1865.

His core observation was that after Watt improved the steam engine, coal consumption per unit of work fell sharply, yet Britain's total coal consumption rose instead of falling. Higher efficiency made steam power affordable in more industries (textiles, railways, mining, shipping), creating coal demand in applications that had never existed before.

Some 160 years later, the same formula is playing out in AI compute.

Companies have done the math. At 2022 token prices, real-time AI customer-service conversations were economically unviable, non-urgent tasks were not worth running through AI, and personalized content could be generated only per segment, not per user. By 2025, with prices down 1,000-fold, these "previously nonexistent demands" had become necessities.

Nebius's Chernin summed it up directly: "Each time we make the same unit of intelligence cheaper, we are not reducing consumption; we are increasing it, because the same budget can address more complex tasks."

The market has overlooked another structural driver: a positive feedback loop in profit margins.

The gross margin curve of AI inference has no historical parallel. An API provider might start with a gross margin of just 10%, because both training and inference are expensive. But software optimizations (operator fusion, quantization, speculative decoding) keep pushing inference costs down month after month, while price adjustments always lag behind. As a result, gross margins can climb from 10% to 90% faster than in any traditional industry.
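Gross margin is 1 - cost/price, so with price held fixed, moving from a 10% to a 90% margin requires unit cost to fall ninefold (from 0.9 to 0.1 of the price). A minimal sketch of how long that takes under a constant monthly cost decline; the 10% and 20% monthly rates are hypothetical, not figures from the article.

```python
import math


def gross_margin(price: float, unit_cost: float) -> float:
    return 1 - unit_cost / price


def months_to_margin(start_margin: float, end_margin: float, monthly_cost_decline: float) -> float:
    """Months for margin to move from start to end if price stays fixed and unit cost
    falls by `monthly_cost_decline` every month (a hypothetical constant rate)."""
    cost_ratio_start = 1 - start_margin  # unit cost / price at the start
    cost_ratio_end = 1 - end_margin      # unit cost / price at the end
    return math.log(cost_ratio_end / cost_ratio_start) / math.log(1 - monthly_cost_decline)


# 10% -> 90% margin with price unchanged, under two hypothetical cost-decline rates.
for decline in (0.10, 0.20):
    print(f"cost -{decline:.0%}/month, price fixed: "
          f"{months_to_margin(0.10, 0.90, decline):.1f} months")
```

A 20% monthly cost decline with no price change gets from 10% to 90% margin in under ten months; any lag in price cuts is pure margin.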

Margins drive profits, profits fund more purchases, and larger purchases spread costs further, forming a positive feedback loop with no ceiling.

"If you have DRAM, you can sell tokens; if you don’t have DRAM, you cannot sell tokens." This statement is becoming the fundamental equation for AI chip demand.

Two sensitivity analyses in the Goldman Sachs report reinforce the same conclusion. If the economic life of chips shrinks from 5 years to 3, replacement cycles speed up and cumulative capital demand rises directly. If on-chip memory comes in 25% above expectations, the main effect is a reallocation of spending within the chip stack, with limited net impact on the overall $7.6 trillion. Either way, the direction is the same: no money is saved.
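Why a shorter chip life raises capital demand can be seen with a straight-line replacement model: a fixed fleet must have 1/lifespan of its value replaced each year. A minimal sketch with a hypothetical $1 trillion installed base, not a figure from the Goldman report.

```python
def annual_replacement_capex(installed_base_value: float, lifespan_years: float) -> float:
    """Straight-line view: each year, 1/lifespan of the installed base must be replaced."""
    return installed_base_value / lifespan_years


BASE = 1_000.0  # hypothetical installed base, US$ billions
five_year = annual_replacement_capex(BASE, 5)
three_year = annual_replacement_capex(BASE, 3)
print(f"5-year life: ${five_year:.0f}B/yr; 3-year life: ${three_year:.0f}B/yr "
      f"({three_year / five_year:.2f}x)")
```

Cutting the lifespan from 5 to 3 years raises annual replacement spending by a factor of 5/3, about 67%, for the same fleet before any growth.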

Conclusion: Who Holds the Computing Power

The export ban on Fable 5, imposed on June 12 and lifted on June 30, less than three weeks later, added an unexpected footnote to this paradox.

The ban was justified on grounds of "national security risk." It was lifted not because the risk disappeared but because substitutes emerged: during the control period, teams such as Tulongfeng in Asia launched models approaching Mythos-level capability, quickly neutralizing the blockade's deterrent effect. Lifting the ban was a concession to reality, not an act of goodwill.

This episode fits squarely into the central thread of the AI cost paradox: models are replaceable. From GPT to Claude to DeepSeek to open-source models, no one can monopolize AI capability itself; wherever barriers go up, people find a way around them.

Hardware does not follow this logic.

GPUs cannot be routed around. Neither can DRAM. Fab construction cycles are measured in years. Lithography capacity has a hard ceiling. The supply elasticity of high-purity silicon is close to zero. These are physical constraints, not business strategies. Software optimization can cut model costs a thousandfold, but it cannot shorten a fab's construction cycle by a single day.

If AI model prices keep falling along this paradoxical path, the result will not be a decoupling from computing power but a re-concentration of pricing power over it. Whichever model you use, its tokens must run on someone's chips. Every cent model providers cut from their prices ultimately becomes revenue on the books of data centers, fabs and memory production lines. The more aggressively costs are cut, the more irreversible this transfer becomes.

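Disclaimer: This article represents only the author's personal views and does not reflect the position or views of this platform. It is provided for information sharing only and does not constitute investment advice to anyone. Any dispute between users and the author has nothing to do with this platform. If any article or image published on this page infringes your rights, please send proof of rights and proof of identity by email to support@aicoin.com, and the platform's staff will review it.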
