Breakthroughs in multimodal video generation: what opportunities do they open for Web3 AI?


Written by: Haotian

Apart from AI's "downward" shift toward local deployment, the biggest recent change in the AI track has been the technological breakthrough in multimodal video generation: evolving from pure text-to-video into full-pipeline generation that integrates text, images, and audio.

Here are a few examples of these breakthroughs to give a sense of the progress:

1) ByteDance open-sourced the EX-4D framework: monocular videos can instantly transform into free-viewpoint 4D content, with user acceptance reaching 70.7%. This means that given a regular video, AI can automatically generate viewing effects from any angle, which previously required a professional 3D modeling team to achieve;

2) Baidu's "Hui Xiang" platform: generates a 10-second video from a single image, claiming "movie-level" quality. Whether this is marketing exaggeration won't be clear until the Pro version update in August shows the actual results;

3) Google DeepMind Veo: can achieve synchronized generation of 4K video + ambient sound. The key technological highlight is the achievement of "synchronization" capability; previously, video and audio were stitched together from two separate systems. Achieving true semantic-level matching requires overcoming significant challenges, such as ensuring that walking actions in the video correspond with the sound of footsteps in complex scenes;

4) Douyin ContentV: 8 billion parameters, generates 1080p video in 2.3 seconds, at a cost of 3.67 yuan per 5 seconds. Honestly, the cost control is quite good, but generation quality in complex scenes still leaves much to be desired.

Why do these breakthroughs in video quality, generation cost, and application scenarios matter so much?

  1. In terms of technological value, the complexity of multimodal video generation is essentially exponential. A single frame contains about 10^6 pixels; a video must maintain temporal coherence across at least 100 frames; audio adds roughly 10^4 sampling points per second that must stay synchronized; and 3D spatial consistency has to hold on top of all that.

Overall, the technical complexity is considerable. Originally, a single super-large model tackled all of these tasks end to end; Sora is said to have burned tens of thousands of H100s to reach its video generation capability. Now the same result can be achieved through modular decomposition plus large-model collaboration. Byte's EX-4D, for example, breaks the problem down into a depth estimation module, a viewpoint transformation module, a temporal interpolation module, a rendering optimization module, and so on, with each module specializing in one task and a coordination mechanism tying them together, as the sketch below illustrates.
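To make the modular style concrete, here is a minimal sketch in Python. It is an illustration of the general idea under stated assumptions, not the actual EX-4D code: the module names, the Clip structure, and the coordinator signature are all hypothetical stand-ins for splitting one large generation task into specialized stages.

```python
# Minimal sketch of a modular video-generation pipeline (hypothetical module
# names; NOT the actual EX-4D API). Each stage specializes in one sub-task
# and a simple coordinator chains them together.
from dataclasses import dataclass

import numpy as np


@dataclass
class Clip:
    frames: np.ndarray                        # (num_frames, height, width, 3) RGB frames
    depth: np.ndarray | None = None           # per-frame depth maps, filled in by the pipeline
    viewpoint: tuple[float, float] = (0.0, 0.0)  # (azimuth, elevation) in degrees


def estimate_depth(clip: Clip) -> Clip:
    """Depth estimation module: per-frame monocular depth (stubbed here)."""
    n, h, w, _ = clip.frames.shape
    clip.depth = np.ones((n, h, w), dtype=np.float32)  # placeholder depth
    return clip


def transform_viewpoint(clip: Clip, azimuth: float, elevation: float) -> Clip:
    """Viewpoint transformation module: re-project frames to a new camera angle."""
    clip.viewpoint = (azimuth, elevation)  # a real module would warp frames using depth
    return clip


def interpolate_time(clip: Clip, target_frames: int) -> Clip:
    """Temporal interpolation module: densify frames for smooth motion."""
    idx = np.linspace(0, len(clip.frames) - 1, target_frames).round().astype(int)
    clip.frames = clip.frames[idx]  # nearest-frame stand-in for learned interpolation
    return clip


def render(clip: Clip) -> np.ndarray:
    """Rendering optimization module: final frames ready for encoding."""
    return clip.frames


def coordinator(frames: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Chain the specialized modules instead of running one monolithic model."""
    clip = Clip(frames=frames)
    clip = estimate_depth(clip)
    clip = transform_viewpoint(clip, azimuth, elevation)
    clip = interpolate_time(clip, target_frames=100)  # ~100 frames for temporal coherence
    return render(clip)


if __name__ == "__main__":
    video = np.zeros((24, 256, 256, 3), dtype=np.uint8)  # 1 second of dummy input
    out = coordinator(video, azimuth=30.0, elevation=10.0)
    print(out.shape)  # (100, 256, 256, 3)
```

The point of the decomposition is that each stage can be a smaller, cheaper model that is easy to swap out or run on different hardware, rather than one giant model that has to learn everything at once.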

  2. In terms of cost reduction, the optimization of the underlying inference stack includes a layered generation strategy (generate a low-resolution skeleton first, then enhance detail at high resolution), a cache reuse mechanism (reuse computation across similar scenes), and dynamic resource allocation (adjust model depth to match the complexity of the content).

It is this set of optimizations that brings Douyin ContentV's cost down to 3.67 yuan per 5 seconds; a rough sketch of how the three pieces fit together follows below.
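Here is a minimal sketch of how the three optimizations might combine in a single generation call. It is a toy illustration, not ContentV's actual implementation: the cache policy, the depth heuristic, and every function name are assumptions; only the 3.67 yuan per 5 seconds figure comes from the text above.

```python
# Minimal sketch of the three inference optimizations described above
# (layered generation, cache reuse, dynamic resource allocation).
# All names, numbers, and the cost model are illustrative assumptions,
# not ContentV's actual implementation.
import hashlib

_scene_cache: dict[str, bytes] = {}


def scene_key(prompt: str, resolution: tuple[int, int]) -> str:
    """Key for the cache reuse mechanism: similar requests map to the same key."""
    return hashlib.sha256(f"{prompt}|{resolution}".encode()).hexdigest()


def pick_model_depth(prompt: str) -> int:
    """Dynamic resource allocation: spend more layers on more complex prompts."""
    complexity = len(prompt.split())          # crude proxy for scene complexity
    return 12 if complexity < 20 else 24      # shallow vs. deep model variant


def generate_skeleton(prompt: str, depth: int) -> bytes:
    """Layer 1: cheap low-resolution skeleton pass (stubbed)."""
    return f"skeleton[{depth} layers]:{prompt}".encode()


def enhance(skeleton: bytes, resolution: tuple[int, int]) -> bytes:
    """Layer 2: high-resolution enhancement pass over the skeleton (stubbed)."""
    return skeleton + f"->enhanced@{resolution[0]}x{resolution[1]}".encode()


def generate_clip(prompt: str, resolution=(1920, 1080)) -> bytes:
    key = scene_key(prompt, resolution)
    if key in _scene_cache:                   # cache reuse: skip recomputation
        return _scene_cache[key]
    depth = pick_model_depth(prompt)          # dynamic resource allocation
    video = enhance(generate_skeleton(prompt, depth), resolution)  # layered generation
    _scene_cache[key] = video
    return video


if __name__ == "__main__":
    clip = generate_clip("a cat walking through a rainy neon street")
    # At the quoted price of 3.67 yuan per 5 seconds, a 30-second clip
    # would cost about 6 * 3.67 = 22.02 yuan.
    print(len(clip), "bytes;", round(6 * 3.67, 2), "yuan for 30 s")
```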

  3. In terms of application impact, traditional video production is a capital-intensive game: equipment, venues, actors, post-production; a 30-second advertisement costing hundreds of thousands of yuan is quite normal. Now AI compresses that process down to a prompt plus a few minutes of waiting, and can achieve camera angles and effects that are difficult to attain through traditional filming.

This shifts the barrier to video production from technology and budget to creativity and aesthetics, potentially reshaping the entire creator economy.

The question then arises: what do these changes on the Web2 AI demand side have to do with Web3 AI?

  1. First, the structure of computing power demand changes. Previously, AI competed on the sheer scale of compute; whoever had the larger homogeneous GPU cluster won. Multimodal video generation, however, requires a diverse mix of compute, which may create demand for distributed idle computing power as well as for various distributed fine-tuning models, algorithms, and inference platforms;

  2. Secondly, the demand for data labeling will also strengthen. Producing a professional-grade video requires precise scene descriptions, reference images, audio styles, camera motion trajectories, lighting conditions, and so on, all of which become new professional data-labeling needs. Web3-style incentives can motivate photographers, sound designers, 3D artists, and other professionals to contribute such data, boosting AI video generation with specialized vertical labeling (a sketch of what one such labeling record might look like follows after this list);

  3. Finally, as AI gradually shifts from centralized, large-scale resource allocation to modular collaboration, this in itself creates new demand for decentralized platforms. Computing power, data, models, incentives, and other elements can then combine into a self-reinforcing flywheel, further driving the convergence of Web3 AI and Web2 AI scenarios.
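To make the data-labeling point from item 2 concrete, here is a minimal sketch of what a professional video-generation labeling record could look like. The schema and the toy reward rule are purely hypothetical illustrations; they only show how specialized vertical annotations (scene, audio style, camera path, lighting) could be packaged and incentivized, and do not describe any existing platform or protocol.

```python
# Minimal sketch of a professional labeling record for video generation.
# Field names and the reward hook are hypothetical illustrations, not any
# existing protocol's schema.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class CameraKeyframe:
    t: float                                  # time in seconds
    position: tuple[float, float, float]      # camera position in scene coordinates
    look_at: tuple[float, float, float]       # point the camera is aimed at


@dataclass
class VideoLabel:
    scene_description: str                    # precise natural-language scene description
    reference_images: list[str]               # URLs or content hashes of reference stills
    audio_style: str                          # e.g. "rainy street ambience, distant traffic"
    lighting: str                             # e.g. "overcast, soft key from camera left"
    camera_path: list[CameraKeyframe] = field(default_factory=list)
    annotator: str = ""                       # account of the contributing professional


def labeling_reward(label: VideoLabel) -> float:
    """Toy incentive rule: richer, more specific annotations earn more (assumed units)."""
    score = 1.0
    score += 0.5 * len(label.reference_images)
    score += 0.2 * len(label.camera_path)
    return round(score, 2)


if __name__ == "__main__":
    label = VideoLabel(
        scene_description="handheld shot following a courier cycling through a night market",
        reference_images=["ipfs://example-still-1", "ipfs://example-still-2"],
        audio_style="crowd murmur, sizzling food stalls, bicycle bell",
        lighting="mixed neon and tungsten, high contrast",
        camera_path=[CameraKeyframe(0.0, (0, 1.6, 0), (2, 1.5, 5)),
                     CameraKeyframe(5.0, (1, 1.6, 3), (4, 1.4, 8))],
        annotator="0xEXAMPLE",
    )
    print(json.dumps(asdict(label), indent=2))
    print("reward:", labeling_reward(label))
```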

