Fei-Fei Li's team clarifies the concept of "world model": Sora can only be considered a renderer

On June 3, 2026, the World Labs team and Stanford University professor Fei-Fei Li jointly published a concept analysis paper with a title so plain that it carries almost no embellishment: "A Functional Classification of World Models." The paper's opening sentence challenges an industry consensus: "World models are one of the most important and most abused terms in today's field of artificial intelligence."

The background of this statement is familiar to anyone who has followed the AI industry.

In February 2024, OpenAI released the video generation model Sora under a technical report boldly titled "Video Generation Models as World Simulators." At that time, Jim Fan, NVIDIA’s director of robotics, left a comment on LinkedIn that was later quoted repeatedly: Sora is essentially a "world model that only allows no-operations as the sole action." On another front, according to public reports, the Tesla AI team has repeatedly referred in public venues to the prediction components within its Full Self-Driving (FSD) system as "world models" or "world simulators." Game engines, 3D generation tools, embodied intelligence models: all sorts of products and technologies have been thrown into the same basket and given the same label.

What do a video generator, an autonomous driving prediction network, a robotic control model, and a physics engine have in common? Almost nothing. Yet they are all called "world models."

After more than two years of this conceptual confusion, someone has finally attempted to clarify the matter systematically. The team led by Fei-Fei Li did not release a new model, announce new benchmarks, or demonstrate any product features. They did something more fundamental: returning to the theoretical source, the partially observable Markov decision process (POMDP), they reduced all systems on the market that are referred to as "world models" to three different functional projections within the same cognitive loop.

The three projections are: renderer, simulator, planner. Under the classification framework of World Labs, Sora and similar video generation models fall under the category of renderer.

Why a Term Can Hold So Many Contradictory Meanings

To understand the root of this confusion, one must first ask a more fundamental question: What does a company mean when it says, "We are making a world model"?

For OpenAI, the goal of Sora is to "understand and present the physical world in videos." According to the technical report, Sora learns statistical patterns from massive video data and can generate images that conform to visual common sense: a cup falling to the ground will break, a paper airplane released will fly, and a person walking will alternate their legs. These images appear to "understand physics."

For Tesla, the "world model" is a neural network within the FSD system that predicts the motion trajectories of road participants over the next few seconds. It needs to output precise 3D positions, velocities, and orientations so that the path planning module can compute safe driving decisions. This model does not need to output pixels; it outputs vectors and probability distributions.

For robotics companies, a "world model" is the internal simulation mechanism that allows a robotic arm to predict, "If I push this cup 5 centimeters to the left, will it tip over?" It needs to understand object properties, contact mechanics, and stability, outputting a feasibility assessment of actions.

The goals of the three types of companies are completely different. Video generation companies care about pixel fidelity, autonomous driving companies care about the accuracy of physical state predictions, and robotics companies care about the predictability of action consequences. They are all working on "world models," but they are not doing the same thing at all.

World Labs points directly to the core of the problem in the paper: these systems all share the same name because each does carry some aspect of "understanding the world." However, each completes only one part of a full cognitive loop, yet marketing language, media reports, and capital narratives package it as a complete world model.

Another driver of conceptual confusion is the inherent tension in the term itself. The term "world model" carries grand narrative attributes, sounding more imaginative than "video generation model" or "video prediction model," and better supporting high valuations and funding stories. When technical capabilities cannot meet public expectations, concepts inevitably become promotional tools.

Returning to the 1960s: What Should a Complete "World Model" Be

The classification framework of World Labs is built on a seemingly ancient theoretical foundation: partially observable Markov decision processes.

This framework describes a complete cycle of interaction between an agent and its environment. The agent is in a certain environmental state and performs an action; the action changes the environmental state; the agent obtains partial observations through its sensors; the observations trigger an update of its internal state; and the updated understanding drives the next action. This cycle repeats.
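For readers who prefer notation, the cycle can be written in the standard textbook form of a POMDP. The symbols below are the usual ones from the reinforcement learning literature and are an assumption here; they are not taken from the World Labs paper.

```latex
\[
\begin{aligned}
&\text{POMDP: } (S, A, \Omega, T, O, R) \\
&T(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s,\ a_t = a)
  && \text{(state transition)} \\
&O(o \mid s', a) = \Pr(o_{t+1} = o \mid s_{t+1} = s',\ a_t = a)
  && \text{(partial observation)} \\
&b'(s') \propto O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)
  && \text{(internal belief update)}
\end{aligned}
\]
```

Here $S$, $A$, and $\Omega$ are the sets of states, actions, and observations, $R$ is the reward, and $b$ is the agent's belief over states, which plays the role of the "internal state" that each new observation updates.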

Within this framework, the complete functionality of a "world model" should include three parts: generating observations from states (pixels, point clouds, etc., seen by human eyes or collected by sensors), deducing the next state from actions and current states (predicting physical changes), and generating actions from observations and goals (decision planning).
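The three parts can be illustrated with a deliberately tiny toy world. The sketch below is only an illustration under stated assumptions: the names `State`, `render`, `simulate`, `plan`, and `run_loop` are hypothetical, the "world" is a cup on a one-dimensional table, and nothing here comes from the World Labs paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    cup_x: int  # hypothetical state: cup position on a 1-D table, in cm


def render(state: State, width: int = 10) -> str:
    """State -> observation: draw the table as text, 'U' marks the cup."""
    return "".join("U" if i == state.cup_x else "_" for i in range(width))


def simulate(state: State, action: int) -> State:
    """(state, action) -> next state: a push moves the cup by `action` cm."""
    return State(cup_x=state.cup_x + action)


def plan(observation: str, goal_x: int) -> int:
    """(observation, goal) -> action: push one step toward the goal."""
    cup_x = observation.index("U")
    if cup_x == goal_x:
        return 0  # already at the goal, do nothing
    return 1 if goal_x > cup_x else -1


def run_loop(start: State, goal_x: int, max_steps: int = 20) -> State:
    """Close the cycle: observe, decide, act, repeat."""
    state = start
    for _ in range(max_steps):
        observation = render(state)
        action = plan(observation, goal_x)
        if action == 0:
            break
        state = simulate(state, action)
    return state


if __name__ == "__main__":
    final = run_loop(State(cup_x=2), goal_x=7)
    print(render(final))
```

Each function alone covers one segment of the cycle; only `run_loop`, which chains all three, corresponds to the complete loop the framework describes.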

Language models learn the statistical patterns of text sequences, while world models learn the statistical properties of space and time. How light reflects on different material surfaces, how objects move under the influence of gravity, and how energy is transferred after rigid body collisions—these are the patterns that a world model aims to capture.

The World Labs team points out in the paper that every system currently marketed as a "world model" is in fact a projection of just one functional element within the complete cycle described above. Some systems only do "rendering from state to observation," some only do "state inference from action to next state," and some only do "planning from observation to action." Each takes one segment of the cycle yet carries a name that stands for the whole cycle.

The value of this analytical framework lies in providing a comparative coordinate system that transcends marketing jargon. No matter how a company packages its product, as long as it is placed back into the POMDP cycle and examined for what it inputs, outputs, and lacks, its capability boundaries are laid bare.
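This "inputs, outputs, and what is missing" check can be sketched as a small lookup. It is a minimal illustration, assuming a hypothetical vocabulary of four labels ("state", "action", "observation", "goal") and hypothetical helper names `classify` and `missing`; the paper itself is not known to define any such procedure.

```python
# Required inputs and outputs for each projection, as described in the text.
ROLE_SIGNATURES = {
    "renderer": (frozenset({"state"}), frozenset({"observation"})),
    "simulator": (frozenset({"state", "action"}), frozenset({"state"})),
    "planner": (frozenset({"observation", "goal"}), frozenset({"action"})),
}


def classify(inputs, outputs):
    """Return the projections whose inputs and outputs the system covers."""
    ins, outs = frozenset(inputs), frozenset(outputs)
    return [
        role
        for role, (need_in, need_out) in ROLE_SIGNATURES.items()
        if need_in <= ins and need_out <= outs
    ]


def missing(inputs, outputs):
    """Return the projections the system does not cover."""
    covered = set(classify(inputs, outputs))
    return [role for role in ROLE_SIGNATURES if role not in covered]


if __name__ == "__main__":
    # A text-to-video model: takes a scene description, emits frames.
    print(classify({"state"}, {"observation"}))
    print(missing({"state"}, {"observation"}))
```

For a text-to-video system described this way, `classify` reports only the renderer role and `missing` reports the simulator and planner roles, which is the kind of boundary the framework is meant to expose.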

Renderer, Simulator, Planner: The Capability Boundaries of Three Projections

In World Labs' classification, the first type is defined as a "renderer." Its core objective is to generate high-fidelity pixel outputs geared toward human visual perception. The input is a representation of a certain environmental state (which can be a text description, 3D scene parameters, or an implicit encoding), and the output is a continuous sequence of image frames.

The optimization direction of a renderer is visual realism rather than physical accuracy. The World Labs article explicitly states that buildings generated by the renderer may look "precarious" because it does not truly solve structural mechanics equations; the splashes generated may look realistic, but the volume, flow rate, and impact force of the liquid may not correspond to actual physical quantities at all. Therefore, these models cannot be used for architectural design, cannot be used for robot training, and cannot be used for tasks that require precise physical simulations.

Google's Genie 3, various text-to-video models, and almost all AI video generation tools fall into this category. Sora certainly does as well.

The second type is the "simulator." Its core goal is not to generate visuals for human consumption but to generate precise states for subsequent calculations. The input is the current environmental state and external forces (or actions), while the output is the next state, which is physically and geometrically faithful to the laws of the real world. The state output by the simulator can be used for stress analysis, energy consumption calculations, collision detection, and can also serve as input for a renderer to produce visualized images, but its core value lies in the computability of the state itself.

NVIDIA Omniverse is a typical representative of this type of system. It is not an AI-native model but a digital twin platform that integrates traditional physics engines with AI-accelerated computation. World Labs' assessment is that the simulator bridges rendering and planning, but that the scarcity of high-quality 3D physical annotation data is a significant bottleneck. According to World Labs’ estimate in the paper, the data available to train these types of models is several orders of magnitude smaller than the video data readily available on the internet.

The third type is the "planner." Its inputs are observation data (camera images, LiDAR point clouds, tactile sensor readings, etc.) and goal commands, while its output is the action to take next. VLA (Vision-Language-Action) models and World Action Models belong to this category.

The differences among the three classifications are not subtle divergences in technical routes but fundamental functional differentiations. Renderers output pixels for humans to see, simulators output states for machines to compute, and planners output actions for executors to run. A system can possess multiple capabilities simultaneously, but when most systems called "world models" essentially only perform rendering, equating "rendering" with "understanding the world" represents a serious cognitive mismatch.

A Two-Year Debate: Is Sora Really a World Model?

In February 2024, OpenAI released Sora, and the technical report's title directly stated "Video Generation Models as World Simulators." This wording immediately sparked intense debate in the academic and developer communities.

Supporters argue that the videos generated by Sora demonstrate 3D spatial consistency, object permanence, and a certain intuitive understanding of physical interactions. A bitten hamburger leaves bite marks, and a dog running in the snow kicks up snowflakes—these details seem to indicate that the model has learned some physical laws.

The core argument of detractors comes from the classic definition of a world model in the field of reinforcement learning: a world model must be able to predict state transitions based on actions. In other words, given the current state and an action input, the model should output the next state after the action. Sora cannot do this. Users cannot tell Sora to "push that cup from the left" and then observe whether the cup will tip over, in what direction it will fall, or where the shards will fly.

Jim Fan's comment accurately captures this paradox: "Sora is essentially a world model, but it only allows no-operations (no-op) as the sole action." This means that Sora indeed predicts environmental changes over time, but this change process is not influenced by any external interventions and can only unfold along the inherent causal chain in the video data. It is not performing interactive inference; it is continuing a passive observation sequence.
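The "no-op as the sole action" point can be made concrete with a toy contrast. The classes below are hypothetical stand-ins, not representations of Sora or of any real system: one model only continues a sequence, the other accepts an action that changes the outcome.

```python
from enum import Enum


class Action(Enum):
    NOOP = "noop"
    PUSH_LEFT = "push_left"


class PassiveVideoModel:
    """Continues a sequence over time; the only action it accepts is NOOP."""

    supported_actions = frozenset({Action.NOOP})

    def step(self, frame: int, action: Action = Action.NOOP) -> int:
        if action not in self.supported_actions:
            raise ValueError(f"unsupported action: {action.name}")
        return frame + 1  # stand-in for "predict the next frame"


class ActionConditionedModel:
    """Predicts the next state as a function of the chosen action."""

    supported_actions = frozenset(Action)

    def step(self, cup_x: int, action: Action) -> int:
        # Toy dynamics: a push moves the cup 5 cm to the left.
        return cup_x - 5 if action is Action.PUSH_LEFT else cup_x


if __name__ == "__main__":
    print(PassiveVideoModel().step(0))
    print(ActionConditionedModel().step(10, Action.PUSH_LEFT))
    try:
        PassiveVideoModel().step(0, Action.PUSH_LEFT)
    except ValueError as err:
        print(err)
```

Both classes predict how something changes over time, but only the second can answer a "what if I push the cup" question, which is the capability the reinforcement learning definition requires.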

On Reddit's r/MachineLearning, many reinforcement learning researchers voiced sharper critiques: systems that cannot predict state transitions based on actions cannot be called world models and should only be termed video prediction models.

The classification framework from World Labs offers a clear answer to this debate. In the POMDP cycle, actions are the key input driving state transitions; systems lacking this input are merely projections of the "observation generation" segment of the complete cognitive loop. Sora belongs to the renderer category; it is not a complete world model, let alone a world simulator.

However, this does not mean Sora lacks value. Renderers address a different question: how to generate images that meet human visual expectations. This problem is inherently extremely difficult and has great commercial value. The problem lies in packaging rendering capability as an understanding of the world, which misleads technical decision-makers and investors into believing that these models already possess the ability for physical reasoning or embodied interaction.

The Industrial Value of Conceptual Clarification

Clarifying the definition boundaries of "world models" is not a purely academic exercise. It directly impacts technology selection, investment judgments, and the public's understanding of AI capabilities.

For a manufacturing company assessing whether to use a certain "world model" for robot training, understanding whether the model is a renderer, simulator, or planner is a necessary prerequisite to avoid millions of dollars in trial and error. A model that can only generate video images, no matter how realistic, cannot substitute for precise calculations of object forces, motion trajectories, and collision consequences.

For investment institutions, distinguishing between the three types of projections means more accurately identifying the technical stack position of a project. A startup that claims to be a "world model" but essentially produces a renderer has competitors among video generation companies, not digital twin platforms or robotic control models. This directly determines how the market size is estimated and which benchmark companies are selected.

For academia, clear classifications are the premise for establishing comparable benchmarks. If the term "world model" continues to be applied this loosely, researchers will find it hard to define what counts as an improvement and what counts as a breakthrough, and peer review will rest on ambiguous terms.

World Labs also points out in the article that conceptual clarification is not intended to create opposition. The future development direction will be the integration of the three types of projections. A model that truly understands the physical properties of a cup should be able to render its visual appearance, simulate the physical process when it is knocked over, and plan how a robotic hand can stably grasp it. But before technology advances to that point, recognizing each other's boundaries is far more meaningful than dreaming of integration.

According to World Labs’ estimation in the article, technologies represented by NVIDIA Omniverse, including simulators and digital twin technologies, are targeting a potential market worth over a trillion dollars in areas such as factories, warehouses, and supply chains. This figure comes from the manufacturers' own judgments; when the market can truly reach this scale depends on the simulator's ability to break through the bottleneck of scarce high-quality 3D physical data.

For the current stage of the AI industry, perhaps the most important understanding is simple: being able to generate realistic videos does not equal understanding the physical world; being called a world model does not mean truly simulating the world. Cutting through marketing language and examining what input a system receives, what output results it produces, and which segment it lacks in the POMDP cycle is the most honest way to judge the boundaries of technological capabilities.

