Original author: @BlazingKevin_, Researcher at Movemaker
NVIDIA has quietly recovered all the losses triggered by DeepSeek and even pushed on to new highs. The evolution of multimodal models has not caused chaos; on the contrary, it has deepened the technical barriers of Web2 AI, from semantic alignment to visual understanding, from high-dimensional embedding to feature fusion. Complex models are integrating every modality of expression at unprecedented speed, building an increasingly closed AI stronghold. The U.S. stock market is voting with its feet as well: both crypto stocks and AI stocks are enjoying a small bull market. Yet this wave of enthusiasm has nothing to do with Crypto.

The Web3 AI attempts we see, especially the Agent-direction experiments of recent months, are almost entirely misguided: the wishful thinking of assembling Web2-style multimodal modular systems with decentralized structures is a dual misalignment of technology and thinking. In a world where modules are tightly coupled, feature distributions are highly unstable, and computing power demand keeps centralizing, multimodal modularity cannot stand in Web3. We must point out that the future of Web3 AI lies not in imitation but in strategic circumvention. From semantic alignment in high-dimensional space, to the information bottlenecks of attention mechanisms, to feature alignment under heterogeneous computing power, I will work through these one by one to explain why Web3 AI should adopt the strategy of encircling the cities from the countryside.
Web3 AI built on flattened multimodal models performs poorly because of semantic misalignment
In modern Web2 AI multimodal systems, "semantic alignment" refers to mapping information from different modalities (such as images, text, audio, video, etc.) into the same or convertible semantic space, allowing the model to understand and compare the inherent meanings behind these originally distinct signals. For example, a photo of a cat and the phrase "a cute cat" need to be projected close to each other in a high-dimensional embedding space so that during retrieval, generation, or inference, the model can "speak from the image" or "associate images with sounds."
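A minimal sketch of this idea, using made-up vectors in place of a real dual encoder (a genuine CLIP-style model would produce embeddings with hundreds of dimensions): cosine similarity in the shared space is high for the matching image-text pair and low for the unrelated one.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: closer to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings; the values are fabricated for illustration.
image_cat     = np.array([0.81, 0.10, 0.55, 0.02])  # photo of a cat
text_cute_cat = np.array([0.78, 0.15, 0.60, 0.05])  # "a cute cat"
text_economy  = np.array([0.05, 0.90, 0.02, 0.40])  # unrelated text

print(cosine_similarity(image_cat, text_cute_cat))  # high: semantically aligned
print(cosine_similarity(image_cat, text_economy))   # low: unrelated semantics
```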
Dividing a workflow into separate modules to cut costs and improve efficiency only makes sense once a high-dimensional embedding space has been achieved. In Web3 Agent protocols, however, high-dimensional embedding cannot be achieved, which is why modularization in Web3 AI is an illusion.
How to understand high-dimensional embedding space? At the most intuitive level, think of "high-dimensional embedding space" as a coordinate system—just like the x-y coordinates on a plane, you can use a pair of numbers to locate a point. In our common two-dimensional plane, a point is completely determined by two numbers (x, y); however, in "high-dimensional" space, each point requires more numbers to describe it, possibly 128, 512, or even thousands of numbers.
To understand it step by step:
Two-dimensional example:
Imagine you have marked the coordinates of several cities on a map, such as Beijing (116.4, 39.9), Shanghai (121.5, 31.2), and Guangzhou (113.3, 23.1). Each city corresponds to a "two-dimensional embedding vector": the two-dimensional coordinates encode geographical location information into numbers.
If you want to measure the "similarity" between cities—cities that are close together on the map often belong to the same economic or climatic zone—you can directly compare their Euclidean distances.
Expanding to multiple dimensions:
Now suppose you not only want to describe the position in "geographical space" but also want to add some "climatic features" (average temperature, rainfall) and "population features" (population density, GDP). You can assign each city a vector containing 5, 10, or even more dimensions.
For example, Guangzhou's 5-dimensional vector might be [113.3, 23.1, 24.5, 1700, 14.5], representing longitude, latitude, average temperature, annual rainfall (in millimeters), and economic index. This "multidimensional space" allows you to compare cities across multiple dimensions such as geography, climate, and economy: if the vectors of two cities are very close, it means they are very similar in these attributes.
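A small sketch of that comparison, assuming the Guangzhou vector from the text and rough placeholder values for Shanghai and Beijing (the climate and economy numbers are illustrative only). It also shows why features are usually normalized before measuring distance, since otherwise the rainfall column dominates.

```python
import numpy as np

# Illustrative 5-dimensional city vectors:
# [longitude, latitude, avg. temperature (C), annual rainfall (mm), economic index]
# Guangzhou follows the text; the other two rows are rough placeholders.
cities = {
    "Guangzhou": np.array([113.3, 23.1, 24.5, 1700.0, 14.5]),
    "Shanghai":  np.array([121.5, 31.2, 17.0, 1200.0, 17.0]),
    "Beijing":   np.array([116.4, 39.9, 13.0,  550.0, 16.0]),
}

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

# Z-score each column so no single feature (rainfall) dominates the distance.
mat = np.stack(list(cities.values()))
normed = (mat - mat.mean(axis=0)) / mat.std(axis=0)
g, s, b = normed
print("Guangzhou vs Shanghai:", euclidean(g, s))
print("Guangzhou vs Beijing: ", euclidean(g, b))
```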
Switching to semantics—why "embedding":
In natural language processing (NLP) or computer vision, we also want to map "words," "sentences," or "images" into such a multidimensional vector space, allowing "similar meaning" words or images to be closer together in space. This mapping process is called "embedding."
For example, we train a model to map "cat" to a 300-dimensional vector v₁, "dog" to another vector v₂, and an "unrelated" word like "economy" to v₃. In this 300-dimensional space, the distance between v₁ and v₂ will be small (because they are both animals and often appear in similar linguistic contexts), while the distance between v₁ and v₃ will be large.
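A toy illustration of those distances, with fabricated 4-dimensional stand-ins for the 300-dimensional vectors described above; only the relative distances matter.

```python
import numpy as np

# Fabricated low-dimensional stand-ins: "cat" and "dog" are placed close
# together, "economy" far away.
v_cat     = np.array([0.90, 0.80, 0.10, 0.00])
v_dog     = np.array([0.85, 0.75, 0.20, 0.05])
v_economy = np.array([0.00, 0.10, 0.90, 0.95])

def dist(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

print("cat-dog:    ", dist(v_cat, v_dog))      # small: related meanings
print("cat-economy:", dist(v_cat, v_economy))  # large: unrelated meanings
```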
As the model is trained on massive amounts of text or image-text pairs, each dimension it learns does not directly correspond to interpretable attributes like "longitude" or "latitude," but rather to some "latent semantic features." Some dimensions may capture a coarse-grained distinction like "animal vs. non-animal," others may differentiate "domestic vs. wild," and still others may correspond to feelings like "cute vs. fierce"… In short, hundreds or thousands of dimensions work together to encode various complex and intertwined semantic layers.
What is the difference between high-dimensional and low-dimensional spaces? Only with enough dimensions can diverse, interwoven semantic features be accommodated, and only high dimensions give each feature a clear position on its own semantic axis. When semantics cannot be distinguished, that is, when semantics cannot be aligned, different signals "squeeze" against each other in low-dimensional space, and the model frequently confuses them during retrieval or classification, sharply reducing accuracy. In the strategy-generation phase, subtle differences become hard to capture, so key trading signals are easily missed and risk thresholds misjudged, directly dragging down performance. Cross-module collaboration also becomes impossible: each Agent acts on its own, information silos proliferate, overall response latency rises, and robustness falls. Finally, when facing complex market scenarios, low-dimensional structures have almost no capacity to absorb multi-source data, making stability and scalability hard to guarantee; long-term operation inevitably runs into performance bottlenecks and maintenance dilemmas, leaving a large gap between the product that ships and the initial expectations.
So, can Web3 AI or the Agent protocols achieve a high-dimensional embedding space? First, how is a high-dimensional space achieved in the first place? Traditionally, "high-dimensional" requires that subsystems such as market intelligence, strategy generation, execution, and risk control align with and complement one another in both data representation and decision-making. Most Web3 Agents, however, simply wrap ready-made APIs (CoinGecko, DEX interfaces, and so on) into independent "Agents," with no unified central embedding space and no cross-module attention mechanism. As a result, information cannot interact across modules from multiple angles and levels; it only flows down a linear pipeline, each module performs a single function, and the system never forms a closed loop of overall optimization.
Many Agents call external interfaces directly, without even basic fine-tuning or feature engineering on the returned data. A market-analysis Agent may simply take price and trading volume, a trade-execution Agent may only place orders according to the interface parameters, and a risk-control Agent may only raise alarms based on a few thresholds. Each performs its own role, but they lack multimodal fusion and a shared, deep semantic understanding of the same risk events or market signals, so the system cannot quickly generate comprehensive, multi-angle strategies in the face of extreme market conditions or cross-asset opportunities.
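A minimal sketch of that linear pipeline, with all function names and values hypothetical: each "Agent" forwards a few raw fields to the next, there is no shared embedding or context, and no feedback flows back upstream.

```python
def market_agent() -> dict:
    # In practice this would call an external API (e.g. a price endpoint);
    # here it just returns the two raw fields such an agent typically forwards.
    return {"price": 63_250.0, "volume_24h": 1.8e9}

def execution_agent(signal: dict) -> dict:
    # Places an order based only on the fields it received, with no notion
    # of why the upstream agent produced them.
    side = "buy" if signal["volume_24h"] > 1e9 else "hold"
    return {"side": side, "size": 0.1}

def risk_agent(order: dict, price: float) -> str:
    # A handful of hard-coded thresholds instead of a learned representation.
    if order["side"] == "buy" and price > 70_000:
        return "ALERT: price above threshold"
    return "ok"

# Strictly sequential: A -> B -> C, no shared context, no closed-loop feedback.
signal = market_agent()
order = execution_agent(signal)
print(risk_agent(order, signal["price"]))
```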
Therefore, asking Web3 AI to achieve a high-dimensional space is equivalent to asking the Agent protocol to develop every API it relies on in-house, which contradicts the very intention of modularization. The modular multimodal systems envisioned by small and mid-sized Web3 AI teams are unsustainable. A high-dimensional architecture requires end-to-end unified training or collaborative optimization: from signal capture to strategy computation to execution and risk control, every link shares the same representation and the same loss function. The "module as plugin" approach of Web3 Agents only exacerbates fragmentation: each Agent is upgraded, deployed, and tuned inside its own silo, iterations are hard to synchronize, and there is no effective centralized monitoring or feedback mechanism, so maintenance costs soar and overall performance stays limited.
Building a full-pipeline intelligent agent with real industry barriers would require end-to-end joint modeling, unified embeddings across modules, and the systematic engineering of collaborative training and deployment. But no such pain point currently exists in the market, so naturally there is no market demand.
In low-dimensional space, attention mechanisms cannot be precisely designed
High-level multimodal models require the design of precise attention mechanisms. The "attention mechanism" is essentially a way to dynamically allocate computational resources, allowing the model to selectively "focus" on the most relevant parts when processing a certain modality input. The most common are the self-attention and cross-attention mechanisms in Transformers: self-attention allows the model to measure the dependency relationships between elements in a sequence, such as the importance of each word in a text relative to others; cross-attention allows information from one modality (like text) to determine which features of another modality (like an image's feature sequence) to "look at" when decoding or generating.
The attention mechanism only works when the multimodal representations are high-dimensional: in a high-dimensional space, a precise attention mechanism can quickly locate the most critical pieces of information. Before explaining why attention must operate in high-dimensional space, let's first look at how Web2 AI, represented by the Transformer decoder, designs its attention mechanism. The core idea is to dynamically assign "attention weights" to each element when processing a sequence (text, image patches, audio frames), so that the model focuses on the most relevant information instead of treating everything equally.
In simple terms, if we compare the attention mechanism to a car, designing Query-Key-Value is like designing the engine. Q-K-V is the mechanism that helps us identify key information: Query is the question ("What am I looking for?"), Key is the index ("What labels do I have?"), and Value is the content ("What is stored here?"). For a multimodal model, the input could be a sentence, an image, or an audio clip. To retrieve what we need from the embedding space, these inputs are cut into minimal units: a character, a small patch of pixels, an audio frame. The model generates a Query, a Key, and a Value for each of these units to perform attention calculations. When the model processes a given position, it compares that position's Query against all Keys to determine which labels best match the current need, then extracts the corresponding Values and combines them according to their matching scores, finally obtaining a new representation that contains its own information while integrating globally relevant content. In this way, each output can dynamically "ask, retrieve, and integrate" based on context, achieving efficient and precise information focus.
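The standard formulation of this "ask, retrieve, integrate" loop is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch with toy dimensions and random projections; the sqrt(d_k) scaling anticipates the "numerical stability" component mentioned in the next paragraph.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # "ask": compare each Query with every Key
    weights = softmax(scores, axis=-1)   # "retrieve": how much each position matters
    return weights @ V, weights          # "integrate": weighted sum of Values

# Toy example: 4 tokens, 8-dimensional embeddings, random projection matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(attn.round(2))  # each row sums to 1: how much each token attends to the others
```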
On top of this engine, further components are added, cleverly combining "global interaction" with "controllable complexity": scaled dot products keep the numbers stable, multi-head parallelism enriches expressiveness, positional encoding preserves sequence order, sparse variants trade off efficiency, residual connections and normalization stabilize training, and cross-attention links the modalities. These modular, progressively layered designs give Web2 AI strong learning capability while keeping computation manageable across diverse sequence and multimodal tasks.
Why can't modular-based Web3AI achieve unified attention scheduling? First, the attention mechanism relies on a unified Query-Key-Value space; all input features must be mapped to the same high-dimensional vector space to dynamically calculate weights through dot products. However, independent APIs return data in different formats and distributions—prices, order statuses, threshold alarms—without a unified embedding layer, making it impossible to form a set of interactive Q/K/V. Second, multi-head attention allows for parallel attention to different information sources at the same layer, then aggregates the results; whereas independent APIs often follow a "call A, then call B, then call C" sequence, where each step's output is merely the next module's input, lacking the ability for parallel, multi-route dynamic weighting, and thus cannot simulate the fine scheduling of the attention mechanism that scores all positions or modalities simultaneously and then integrates them. Finally, a true attention mechanism dynamically allocates weights to each element based on the overall context; under the API model, modules can only see their "independent" context when called, lacking a real-time shared central context, and thus cannot achieve cross-module global associations and focus.
Therefore, merely wrapping individual functions into discrete APIs, with no common vector representation and no parallel weighting and aggregation, cannot build the kind of "unified attention scheduling" that Transformers have, just as a car with a weak engine cannot raise its performance ceiling no matter how it is modified.
Discrete modular assembly leads to shallow static feature fusion
"Feature fusion" is the further combination of feature vectors obtained from processing different modalities based on alignment and attention, for direct use in downstream tasks (classification, retrieval, generation, etc.). Fusion methods can be as simple as concatenation or weighted summation, or as complex as bilinear pooling, tensor decomposition, or even dynamic routing techniques. Higher-order methods involve alternating alignment, attention, and fusion in multi-layer networks, or establishing more flexible message-passing paths between cross-modal features through Graph Neural Networks (GNNs) to achieve deep information interaction.
It goes without saying that Web3 AI is still stuck at the simplest stage, concatenation, because dynamic feature fusion presupposes a high-dimensional space and a precise attention mechanism. When those prerequisites are not met, the final feature-fusion stage cannot deliver strong performance.
Web2 AI tends to adopt end-to-end joint training: processing all modal features such as images, text, and audio simultaneously in the same high-dimensional space, optimizing collaboratively with attention layers and fusion layers alongside downstream task layers, allowing the model to automatically learn the optimal fusion weights and interaction methods during forward and backward propagation. In contrast, Web3 AI often adopts a discrete modular assembly approach, encapsulating various APIs for image recognition, market data capture, risk assessment, etc., into independent Agents, and then simply assembling the labels, values, or threshold alarms output by each of them, with the main logic or human intervention making comprehensive decisions. This approach lacks a unified training objective and does not allow for cross-module gradient flow.
In Web2 AI, the system relies on the attention mechanism to calculate the importance scores of various features in real-time based on context and dynamically adjust the fusion strategy. Multi-head attention can also capture various feature interaction patterns in parallel at the same level, balancing local details with global semantics. In contrast, Web3 AI often fixes weights like "image × 0.5 + text × 0.3 + price × 0.2" in advance or uses simple if/else rules to determine whether to fuse, or may not perform any fusion at all, merely presenting the outputs of each module together, lacking flexibility.
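A sketch of that contrast, with hypothetical feature vectors and an untrained stand-in for a learned scorer: the static weights never change, while the softmax-based weights shift whenever the inputs do, which is the behavior an attention-style fusion layer learns.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
feats = {
    "image": rng.normal(size=8),
    "text":  rng.normal(size=8),
    "price": rng.normal(size=8),
}

# Static fusion: hand-fixed weights, identical for every input (the 0.5/0.3/0.2 pattern).
static = 0.5 * feats["image"] + 0.3 * feats["text"] + 0.2 * feats["price"]

# Dynamic fusion sketch: a hypothetical (untrained) scoring vector rates each
# modality given the current input, and softmax turns the scores into weights.
w_score = rng.normal(size=8)                        # stands in for a learned scorer
scores = np.array([f @ w_score for f in feats.values()])
weights = softmax(scores)                           # changes whenever the inputs change
dynamic = sum(w * f for w, f in zip(weights, feats.values()))

print("static weights :", [0.5, 0.3, 0.2])
print("dynamic weights:", weights.round(2))
```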
Web2 AI maps all modal features into thousands of dimensions in high-dimensional space, and the fusion process involves not only vector concatenation but also addition, bilinear pooling, and various high-order interaction operations—each dimension may correspond to some latent semantics, enabling the model to capture deep, complex cross-modal associations. In contrast, the outputs of Web3 AI's various Agents often contain only a few key fields or indicators, with extremely low feature dimensions, making it nearly impossible to express nuanced information such as "how the image content matches the text meaning" or "the subtle relationship between price fluctuations and sentiment trends."
In Web2 AI, the losses of downstream tasks are continuously fed back through the attention and fusion layers to various parts of the model, automatically adjusting which features should be reinforced or suppressed, forming a closed-loop optimization. In contrast, Web3 AI often relies on manual or external processes to evaluate and tune the results of API calls after reporting, lacking automated end-to-end feedback, making it difficult for fusion strategies to iterate and optimize online.
Barriers in the AI industry are deepening, but pain points have yet to emerge
Because it is necessary to simultaneously consider cross-modal alignment, precise attention calculations, and high-dimensional feature fusion in end-to-end training, the multimodal systems of Web2 AI are often extremely large engineering projects. They require massive, diverse, and precisely labeled cross-modal datasets, as well as thousands of GPUs for weeks or even months of training time; in terms of model architecture, they integrate various cutting-edge network design concepts and optimization techniques; in engineering implementation, they must build scalable distributed training platforms, monitoring systems, model version management, and deployment pipelines; in algorithm development, continuous research is needed for more efficient attention variants, more robust alignment losses, and lighter fusion strategies. This full-link, full-stack systematic work demands high levels of funding, data, computing power, talent, and organizational collaboration, thus forming a strong industry barrier and creating the core competitiveness held by a few leading teams to date.
In April, when I reviewed Chinese AI applications and compared them with Web3 AI, I made one point: Crypto can break through in industries with strong barriers, that is, industries that are already very mature in traditional markets yet have developed significant pain points. High maturity means there are enough users familiar with the underlying business model; significant pain points mean users are willing to try new solutions, and therefore willing to accept Crypto. Both elements are indispensable. Conversely, if an industry is not yet mature in traditional markets, Crypto cannot take root in it and will have no room to survive even if pain points exist, because users will have little motivation to understand it fully and will fail to see its potential ceiling.
Web3 AI, or any Crypto product claiming PMF, needs to develop by encircling the cities from the countryside: start with small-scale trials in peripheral positions, secure a solid foundation, and then wait for the core scenarios, the target cities, to emerge. The core of Web3 AI lies in decentralization, and its evolutionary path runs through high parallelism, low coupling, and compatibility with heterogeneous computing power. **This gives Web3 AI an advantage in scenarios such as edge computing, and it suits tasks that are lightweight in structure, easy to parallelize, and incentivizable: LoRA fine-tuning, post-training tasks for behavior alignment, crowdsourced data training and labeling, small foundation model training, and collaborative training on edge devices.** The product architectures in these scenarios are lightweight, and the roadmap can be iterated flexibly.

However, this does not mean the opportunity exists today, because the barriers of Web2 AI have only just begun to form. The emergence of DeepSeek has, if anything, accelerated progress on complex multimodal AI tasks; this is a competition among leading enterprises, and it marks the early stage of the Web2 AI dividend. I believe that only when the Web2 AI dividend has nearly run out will the pain points it leaves behind become the opening for Web3 AI, just as DeFi was born. Until that moment arrives, the market will keep producing pain points that Web3 AI has invented for itself.

We therefore need to discern carefully whether protocols taking the countryside-encirclement route actually do the following. First, whether they enter from the edges, gaining a foothold in the countryside (small markets, small scenarios) where incumbent power is weak and market roots are shallow, gradually accumulating resources and experience. Second, whether they combine point and surface and advance in waves, able to iterate and update the product within a sufficiently small application scenario; if not, relying on PMF to reach a $1 billion market cap will be extremely difficult, and such projects do not belong on the watchlist. Third, whether they can fight a protracted war with flexibility: the potential barriers of Web2 AI are changing dynamically, and the corresponding potential pain points are evolving with them, so we need to watch whether a Web3 AI protocol is flexible enough to pivot quickly across scenarios and to move rapidly from one piece of countryside to another, approaching the target cities at the fastest possible speed. If the protocol itself is too infrastructure-heavy, with an oversized network architecture, it is very likely to be eliminated.
About Movemaker
Movemaker is the first official community organization authorized by the Aptos Foundation and jointly initiated by Ankaa and BlockBooster, focusing on promoting the construction and development of the Aptos ecosystem in the Chinese-speaking region. As the official representative of Aptos in the Chinese-speaking area, Movemaker is committed to building a diverse, open, and prosperous Aptos ecosystem by connecting developers, users, capital, and numerous ecological partners.
Disclaimer:
This article/blog is for reference only, representing the author's personal views and does not represent the position of Movemaker. This article does not intend to provide: (i) investment advice or recommendations; (ii) offers or solicitations to buy, sell, or hold digital assets; or (iii) financial, accounting, legal, or tax advice. Holding digital assets, including stablecoins and NFTs, carries high risks, with significant price volatility, and they may even become worthless. You should carefully consider whether trading or holding digital assets is suitable for you based on your financial situation. If you have specific questions, please consult your legal, tax, or investment advisor. The information provided in this article (including market data and statistics, if any) is for general reference only. Reasonable care has been taken in compiling this data and charts, but no responsibility is accepted for any factual errors or omissions expressed therein.